Sometimes I am simply amazed by the software and technology that is currently available. Many people are aware of tools like Open Site Explorer and Majestic SEO, but not many people know about the technology they are built upon. Today, I will take you on a tour of search technologies that tech-savvy SEOs can leverage to get a leg up on the competition. Let's get started!
This is as simple as it gets. Wget is a command-line tool available on Unix/Linux platforms for downloading files. Many people with a little Linux knowledge have used wget to download single files, but not many know that wget can crawl an entire website, download images and JavaScript, and even rewrite HTML links to make the site browsable on your local computer. Wget is also available on the Mac, but you'll need to install it through MacPorts. If you want to give it a try, pop open your Unix/Linux/Mac console and try the following.
This command will download all of the files on the domain schema.org (the flags shown are one reasonable combination, not the only one):
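    # Mirror schema.org for local browsing.
    # --mirror turns on recursive downloading with timestamping,
    # --page-requisites grabs images, CSS, and JavaScript, and
    # --convert-links rewrites HTML links so the copy works locally.
    wget --mirror --page-requisites --convert-links http://schema.org/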
Curl is similar to wget but was designed with the coder in mind. Where wget is strictly a command-line tool, curl also has a programming library (libcurl) associated with it. That said, curl isn't a one-stop shop. Getting the contents of a single web page is simple with curl, but if you want it to find links and crawl automatically the way wget does, you'll need to drive the curl library from a programming language; bindings exist for scripting languages like PHP and compiled languages like C++.
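For instance, fetching a single page from the command line is a one-liner (the output filename here is an arbitrary choice):

    # Fetch one page; -L follows redirects, -o names the output file.
    curl -L -o schema.html http://schema.org/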
If you need something more fine-grained, you might as well write your own crawler. The easiest scripting language for this is Python. Using urllib2, you can easily create a custom crawler that does everything you need and more. However, urllib2 only fetches the data; to parse it, you need a library like BeautifulSoup. BeautifulSoup is a fantastic library that will parse even the ugliest HTML into something manageable. This is my go-to for custom crawling needs. Give it a shot.
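Here's a minimal sketch of the idea, assuming Python 2 (urllib2 is Python 2 only) and the BeautifulSoup 3 package; the start URL and depth limit are arbitrary choices, and a real crawler would also respect robots.txt, throttle its requests, and restrict itself to one domain:

    import urllib2
    from urlparse import urljoin
    from BeautifulSoup import BeautifulSoup

    def crawl(url, seen=None, depth=2):
        """Fetch a page, print its title, and recurse into its links."""
        if seen is None:
            seen = set()
        if depth == 0 or url in seen:
            return
        seen.add(url)
        try:
            html = urllib2.urlopen(url).read()
        except (urllib2.URLError, ValueError):
            return  # skip pages that fail to load or non-HTTP links
        soup = BeautifulSoup(html)
        title = soup.title.string if soup.title else '(no title)'
        print url, '-', title
        # Note: this follows off-site links too; in practice you'd
        # filter hrefs by domain before recursing.
        for link in soup.findAll('a', href=True):
            crawl(urljoin(url, link['href']), seen, depth - 1)

    crawl('http://schema.org/')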
Nutch is an enterprise level set of search technologies, available from the Apache Software Foundation. If there's one thing Apache knows, it's web servers. Currently, 63% of the planet's web servers run on Apache. Go figure they would make a tool for searching them.
This is no simple web scraper. Nutch is a suite of tools for prioritized crawling, content parsing, link graphing, and indexation, and it comes with an API-friendly interface to all the information contained in its database(s). It runs on Java and supports clustering, so it will scale as large as you need. As you might expect, it comes with a learning curve. Documentation for crawling and indexing is sparse but available. Outside of that, not much exists on hacking and modding without editing source code. If the community embraced this project a bit more, it could be huge.
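To give you a feel for the workflow, here's roughly what kicking off a crawl looks like with a Nutch 1.x install; the directory names and limits below are arbitrary choices, so check the docs for your version:

    # Seed the crawler with a list of start URLs, one per line.
    mkdir urls
    echo 'http://schema.org/' > urls/seed.txt

    # Crawl three links deep, fetching at most 50 pages per round,
    # writing the crawl database and segments under crawl/.
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50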
If you want to stay on the bleeding edge, you might want to check out Droids. This Apache incubator project's goal is to "...be an intelligent standalone robot framework that allows to create and extend existing droids (robots)." They "released" their first version at the beginning of November, but it still has a way to go before it's ready.
There are tons of great search-based solutions and search technologies out there, and I've only touched on a few. Tell me, faithful followers: what great search tech have you used lately?