Hashtweeps Scraper: Mixed Feelings About Scrubyt

Motivation

Hashtweeps is a simple site where you can search for all tweets containing a specific hashtag. I decided to give Scrubyt a go and write a little Hashtweeps scraper to get a feel for it. The CEO of the company where I work was at the LeadsCon conference, so I was challenged to gather all tweets with the leadscon hashtag (#leadscon).

The Code

[gist https://gist.github.com/cawel/78a2d567de66b844bc78]

Screenshot of my output

I added a very simple XSLT to my XML output and here it is:

Scrubyt: Pros

  • All in one: navigator, extractor, output builder. With very few lines of code, you can write a simple scraper that navigates pages, scrapes them, and builds a custom XML structure as output.

Scrubyt: Cons

  • Lack of a good API reference. I had issues with the official one. How am I supposed to know that ending a method name with “_detail” makes Scrubyt navigate to that page? It seems hard to get beyond the “scrape Google results” type of scenario. Indeed, there are lots of TODOs in the reference. Hopefully it will gain more structure and coverage.
  • No debug info. Even though the code above is fairly simple, when it breaks, you have no idea what the error was: Scrubyt just exits.
  • Impossible to test. How do you make sure the second page you navigate still exists? How do you make sure the HTML elements are still where you thought they were? How do you make sure your code constructs the XML structure as you want it to be?
  • The dependencies are not pinned. I am a fan of having tight control over dependency versions. I wasted time figuring out that Scrubyt could not run with the (latest) version of the Mechanize gem I had installed.
  • The unofficial docs are outdated. Tutorials are nice when you want to get a first feel for a new tool. Unfortunately, probably because the Scrubyt source code has changed a lot in the last year, the Scrubyt tutorials out there are no longer accurate. Remedy: make the tutorials on Scrubyt’s wiki your first stop when kick-starting a scraper.
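
The last testing question above (does the output have the XML structure I expect?) can at least be checked outside Scrubyt. Here is a minimal sketch, assuming the scraper's XML output is available as a string. The element names (tweets, tweet, author, message) and the helper structure_problems are hypothetical and chosen for illustration only; the real names depend on how the extractor is defined. It uses Ruby's REXML library rather than anything provided by Scrubyt.

```ruby
require 'rexml/document'

# Hypothetical sample of scraper output. The element names are assumptions,
# not Scrubyt's actual output format.
SAMPLE_XML = <<~XML
  <tweets>
    <tweet>
      <author>alice</author>
      <message>Great session at #leadscon</message>
    </tweet>
    <tweet>
      <author>bob</author>
      <message>Heading to #leadscon</message>
    </tweet>
  </tweets>
XML

# Returns a list of human-readable problems. An empty list means the
# document has the expected shape.
def structure_problems(xml, required_children: %w[author message])
  doc = REXML::Document.new(xml)
  problems = []

  tweets = REXML::XPath.match(doc, '/tweets/tweet')
  problems << 'no <tweet> elements found' if tweets.empty?

  tweets.each_with_index do |tweet, index|
    required_children.each do |name|
      child = tweet.elements[name]
      # Flag children that are absent or contain only whitespace.
      if child.nil? || child.text.to_s.strip.empty?
        problems << "tweet #{index + 1} is missing <#{name}>"
      end
    end
  end

  problems
end

problems = structure_problems(SAMPLE_XML)
puts(problems.empty? ? 'structure looks fine' : problems)
```

Running a check like this against a saved copy of the output would not tell you whether the remote page changed, but it would turn a silent breakage into an explicit list of what is missing.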

Conclusion

If you’re serious about scraping, Scrubyt is not a viable option. As soon as you go beyond a trivial scraper (like the one I wrote), you’re in for some wasted time.

One comment

  1. Very interesting! Do you suggest any other scraper? I'm currently looking for a good one. Thanks
