What Color is that H1

I came across an incredibly poorly worded question on StackOverflow where the person asked How to Manipulate the DOM with Ruby on Rails. After some back and forth, it turned out that he wasn’t asking how to use RJS, or even how to parse a page with Hpricot or Nokogiri. He was asking for a general way to programmatically determine what color a given HTML element, such as an “H1”, on a page might be, so that he could write a spider to run an analysis over a bunch of different sites.

Unless I’m missing something, this turns out to be a fantastically difficult problem. Consider:

  1. You have to take into account both style sheets and in-page markup.
  2. Because the solution needs to work across multiple sites, you can’t “cheat” and pull specific CSS selectors.
  3. Raw HTML is still mostly soup and a bizarre mix of broken tags, bad markup and coded craziness. 
  4. Style is inherited from items further up the DOM, so you can’t even pull just the CSS for a specific tag.
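
To make points 1 and 4 concrete, here is a minimal Python sketch of the naive approach: read only the inline style attribute of each H1. The page, the class name and the function names are made up for illustration, and the standard library's html.parser stands in for whatever parser the spider would really use.

```python
from html.parser import HTMLParser


class InlineColorFinder(HTMLParser):
    """Collects the inline `color` declaration (if any) of every <h1>."""

    def __init__(self):
        super().__init__()
        self.colors = []  # one entry per <h1>; None means "no inline color"

    def handle_starttag(self, tag, attrs):
        if tag != "h1":
            return
        # An attribute without a value comes through as None.
        style = dict(attrs).get("style") or ""
        color = None
        for declaration in style.split(";"):
            name, _, value = declaration.partition(":")
            if name.strip().lower() == "color":
                color = value.strip()
        self.colors.append(color)


def inline_h1_colors(html):
    finder = InlineColorFinder()
    finder.feed(html)
    return finder.colors


# Hypothetical page: the first heading gets its color from an inline style,
# the second from a stylesheet rule, the third by inheritance from the <div>.
PAGE = """
<style>h1.title { color: green; }</style>
<h1 style="font-size: 2em; color: red">Inline</h1>
<h1 class="title">From the stylesheet</h1>
<div style="color: blue"><h1>Inherited</h1></div>
"""

print(inline_h1_colors(PAGE))  # ['red', None, None]
```

Only the first heading is answered correctly. Getting the other two right means matching selectors, resolving the cascade and walking up the tree for inherited values, which is most of a browser's style engine.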

You’re really stuck trying to reverse engineer how a browser renders an entire page of elements. When you consider that it is often a struggle to get even two well-tested and well-supported browsers to render the same markup the same way, the task looks that much more daunting.

So after some thought, my proposed solution was to not even try to reverse engineer a browser, and instead just embed Gecko into an app with RubyGnome’s Gtk::MozEmbed functionality.

As the spider browsed pages, it would pass them to the embedded instance of Gecko, where getComputedStyle would sort out what color an H1 (or whatever) happened to be.
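
For what it's worth, the script the spider hands to the browser engine can be tiny. Below is a rough Python sketch that only builds the JavaScript source for a given CSS selector. The helper name computed_color_script is hypothetical, and how the string actually gets evaluated inside the embedded page depends on the embedding API, which I am deliberately not showing here.

```python
import json


def computed_color_script(selector):
    """Return JavaScript source that evaluates to the computed color of the
    first element matching `selector`, or null if nothing matches."""
    # json.dumps yields a quoted, escaped string literal that JavaScript accepts.
    quoted = json.dumps(selector)
    return (
        "(function () {"
        " var el = document.querySelector(" + quoted + ");"
        " return el ? window.getComputedStyle(el).getPropertyValue('color') : null;"
        " })()"
    )


print(computed_color_script("h1"))
```

Browsers typically report computed colors in rgb(...) form regardless of how the style sheet spelled them, which is handy when comparing results across sites.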

I’m not sure this is the best solution, but it was really interesting to consider and research.
