Sometimes I am simply amazed by the software and technology that is currently available. Many people are aware of tools like Open Site Explorer and Majestic SEO, but not many people know about the technology they are built upon. Today, I will take you on a tour of search technologies that tech-savvy SEOs can leverage to get a leg up on the competition. Let’s get started!
Wget is a venerable command-line downloader that can recursively mirror an entire site. For example, this command will download all of the files on the domain schema.org:
wget -r http://schema.org/
Curl is similar to wget, but was designed with the coder in mind. Where wget is strictly a command-line tool, curl also ships a programming library, libcurl. That said, curl isn’t a one-stop shop. Getting the contents of a single web page is simple with curl, but if you want it to discover links and crawl automatically the way wget does, you will have to drive the libcurl library from a scripting or programming language. It is best used from scripting languages like PHP and programming languages like C++.
Python, urllib2, and BeautifulSoup
If you need something more fine-grained, you might as well write your own crawler. The easiest scripting language for this is Python. With urllib2, you can easily build a custom crawler that fetches exactly the pages you need. Fetching is only half the job, though: to make sense of the data, you need a parsing library like BeautifulSoup. BeautifulSoup is a fantastic library that will turn even the ugliest HTML into something manageable. This is my go-to for custom crawling needs. Give it a shot.
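Here is a minimal sketch of the fetch-and-parse pattern. A couple of assumptions: it uses the Python 3 names (urllib2 became urllib.request), it expects the beautifulsoup4 package to be installed, and it demonstrates the parsing half on a canned page so the snippet runs without touching the network:

```python
# Minimal fetch-and-parse sketch. urllib2 is the Python 2 name; in
# Python 3 the same functionality lives in urllib.request.
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup  # assumes: pip install beautifulsoup4

def extract_links(base_url, html):
    """Parse HTML and return an absolute URL for every <a href>."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

def fetch(url):
    """Download one page. A real crawler should also honor robots.txt."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Demonstrate the parser on a canned page so the example is self-contained:
page = '<html><body><a href="/docs">Docs</a> <a href="http://example.com/">Ext</a></body></html>'
print(extract_links("http://schema.org/", page))
# → ['http://schema.org/docs', 'http://example.com/']
```

A real crawler just loops: call fetch() on a URL, feed the result to extract_links(), and push any unseen links onto a queue.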
Nutch is an enterprise-level set of search technologies, available from the Apache Software Foundation. If there’s one thing Apache knows, it’s web servers. Currently, 63% of the planet’s web servers run on Apache. Go figure they would make a tool for searching them.
This is no simple web scraper. Nutch is a suite of tools for prioritized crawling, content parsing, link graphing, and indexation, and it comes with an API-friendly interface to all of the information contained in its databases. It runs on Java and supports clustering, so it will scale as large as you need. As you might expect, it comes with a learning curve. Documentation for crawling and indexing is sparse but available; beyond that, not much exists on hacking and modding it without editing the source code. If the community embraced this project a bit more, it could be huge.
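To give a taste of what driving Nutch looks like, a one-shot crawl in the Nutch 1.x era was typically kicked off like this. Treat the paths and flag values as illustrative and check the Nutch documentation for your version, since the command syntax has changed over releases:

```shell
# Illustrative Nutch 1.x-style invocation -- run from the Nutch install dir.
# Seed the crawler with a plain-text list of start URLs:
mkdir urls
echo "http://schema.org/" > urls/seed.txt

# One-shot crawl: inject the seeds, fetch to link depth 3,
# keeping at most the top 50 URLs per fetch round.
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
```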
If you want to stay on the bleeding edge, you might want to check out Droids. This Apache incubator project’s goal is to “…be an intelligent standalone robot framework that allows to create and extend existing (robots).” They “released” their first version at the beginning of November, but it still has a way to go before it’s ready. That being said, it’s an interesting idea. What do you think?
There are tons of great search-based solutions out there, and I’ve only touched on a few. Tell me, faithful followers, what great search tech have you used lately?