Hashtweeps is a simple site where you can search all tweets containing a given hashtag. I decided to give Scrubyt a go and write a little Hashtweeps scraper to get a feel for it. The CEO of the company where I work was at the LeadsCon conference, so I was challenged to gather all the tweets with the leadscon hashtag (#leadscon).
The Code

[gist https://gist.github.com/cawel/78a2d567de66b844bc78]
Screenshot of my output
I added a very simple XSLT stylesheet to turn the XML output into an HTML page.
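As a rough sketch of what such a stylesheet looks like, assuming the extractor emits `<tweet>` elements with a `<text>` child (the element names here are illustrative, not necessarily the real output structure):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Render each scraped tweet as a list item (element names are illustrative) -->
  <xsl:template match="/">
    <html>
      <body>
        <h1>#leadscon tweets</h1>
        <ul>
          <xsl:for-each select="//tweet">
            <li><xsl:value-of select="text"/></li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```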
What I liked:

- All in one: navigator, extractor and output builder. In very few lines of code, you can write a simple scraper that navigates pages, scrapes data and builds a custom XML structure.
What bothered me:

- Lack of a good API reference. I had issues with the official one: how am I supposed to know that ending a method name with “_detail” makes Scrubyt navigate to that page? It seems hard to get beyond the “scrape Google results” type of scenario. Indeed, there are lots of TODOs in the reference. Hopefully it will gain more structure and coverage.
- No debug info. Even though the code above is fairly simple, when it breaks, you have no idea what the error was: Scrubyt just exits.
- Impossible to test. How do you make sure the second page you navigate still exists? How do you make sure the HTML elements are still where you thought they were? How do you make sure your code constructs the XML structure as you want it to be?
- The dependencies are not pinned down. I am a fan of keeping tight control over dependency versions. I wasted time figuring out that Scrubyt could not run with the (latest) version of the Mechanize gem I had installed.
- The unofficial docs are outdated. Tutorials are nice when you want to get a first feel for a new tool. Unfortunately, probably because the Scrubyt source code has changed a lot over the last year, the Scrubyt tutorials out there are no longer accurate. Remedy: your first stop to kick-start a scraper should be the tutorials on Scrubyt’s wiki.
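On the dependency problem: once you know which Mechanize release Scrubyt expects, you can activate it explicitly before requiring Scrubyt. This is just a standard RubyGems version pin; the “0.9.3” below is a placeholder, not the actual compatible version:

```ruby
require 'rubygems'
# Activate a known-good Mechanize before Scrubyt loads whatever is newest.
# '0.9.3' is a placeholder: substitute the version your Scrubyt works with.
gem 'mechanize', '= 0.9.3'
require 'scrubyt'
```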
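On the debugging point: as a workaround, and assuming the failure at least surfaces as a Ruby exception rather than a hard exit, you can wrap each step in a plain-Ruby helper that says which step died before re-raising. Nothing here is Scrubyt-specific, and the helper name is mine:

```ruby
# Labels which scraping step failed before re-raising the error.
# Only helps when the library raises an exception instead of exiting.
def with_diagnostics(step)
  yield
rescue StandardError => e
  warn "#{step} failed: #{e.class}: #{e.message}"
  raise
end

# Usage: wrap each navigation/extraction step.
with_diagnostics('fetch the hashtag page') do
  # scraping code would go here
end
```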
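On testability: what I would want is a design where the parsing is kept separate from the fetching, so you can run it against a saved copy of the page and assert on the structure you built. A sketch in plain Ruby using the stdlib’s REXML; the markup, CSS class and `extract_tweets` helper are made up for illustration, not taken from the real Hashtweeps page:

```ruby
require 'rexml/document'

# Hypothetical parsing step, kept separate from any HTTP fetching so it
# can be exercised against a saved fixture instead of the live site.
def extract_tweets(html)
  doc = REXML::Document.new(html)
  doc.elements.to_a('//li[@class="tweet"]').map { |li| li.text }
end

# A well-formed fixture standing in for a saved copy of the real page.
FIXTURE = <<-HTML
<html><body><ul>
  <li class="tweet">first #leadscon tweet</li>
  <li class="tweet">second #leadscon tweet</li>
</ul></body></html>
HTML
```

With the parsing isolated like this, the questions above become simple assertions against fixtures; only the navigation still depends on the live site.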
If you’re serious about scraping, Scrubyt is not a viable option. As soon as you go beyond a trivial scraper (like the one I wrote), you’re in for a waste of your time.