At Mapado, our job is to collect all the “things to do”, all around the planet.

To gather this huge amount of information, we crawl the web, like Google does, searching for content related to concerts, shows, visits, attractions, and so on. When we find an interesting page, we try to extract the “good” data from it.

One of our major challenges is separating the content we are interested in (title, description, photos, dates, …) from all the crap around it (advertising, navigation bars, footer content, related content, …).

Within that challenge, one task is to group together pieces of content that are visually close to each other. Usually, the elements making up the main content of a web page sit close to each other.

When we began working on this task, we innocently thought that we could rely on the HTML DOM alone. In the DOM, elements are stored as a hierarchy, so elements with the same parent have a good chance of being related.
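To make the "same parent" heuristic concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the code we run: the class name `SiblingGrouper`, the function `group_by_parent` and the `tag#id` keys are hypothetical. It groups each piece of text by the parent of the element that contains it.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SiblingGrouper(HTMLParser):
    """Group text nodes by the parent of their enclosing element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, unique id) of the currently open elements
        self.next_id = 0     # counter used to tell sibling elements apart
        self.groups = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, self.next_id))
        self.next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in [open_tag for open_tag, _ in self.stack]:
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and len(self.stack) >= 2:
            parent_tag, parent_id = self.stack[-2]
            self.groups[f"{parent_tag}#{parent_id}"].append(text)


def group_by_parent(html):
    """Return {parent key: [texts]} for an HTML string."""
    parser = SiblingGrouper()
    parser.feed(html)
    parser.close()
    return dict(parser.groups)


page = ("<body><div><h1>Title</h1><p>Description</p></div>"
        "<ul><li>Home</li><li>Contact</li></ul></body>")
print(group_by_parent(page))
```

On this toy page the title and description end up in one group and the navigation links in another, which is exactly the behaviour we were hoping to get from the DOM.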

dom-properties

A very interesting paper covering page segmentation is “Page Segmentation by Web Content Clustering”.

Using the DOM hierarchy is a good starting point, but in many cases things get a lot more complicated:

  • CSS stylesheets can move elements: elements that are close in the DOM hierarchy can be moved anywhere, including outside the browser window
  • CSS stylesheets can hide or show elements: several pieces of content can share the same visual position, being moved (or removed) by CSS and JavaScript
  • JavaScript code can display things that are not even in the original DOM

So we started considering WebKit as a visual renderer in order to get visual features. There are a number of headless browser packages, such as PhantomJS, Zombie.js or CasperJS. A WebKit-based tool such as PhantomJS can render a web page and give access to the computed properties of each element on the page.

The following features are useful for clustering things visually:

  • position of the element in the page (from the top and from the left)
  • width and height of the element
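As a rough sketch of how these features can be handled on the Python side, assume the headless browser has dumped one JSON object per element with `left`, `top`, `width` and `height` keys (the kind of values the DOM's `getBoundingClientRect()` returns). The `Box` class, the `load_boxes` helper and the JSON layout are assumptions made for this example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Rendered geometry of one element, in page pixels."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self):
        # x coordinate of the rightmost pixel
        return self.left + self.width

    @property
    def bottom(self):
        # y coordinate of the lowest pixel
        return self.top + self.height


def load_boxes(json_text):
    """Parse a JSON list of rectangles, skipping elements with no visible area."""
    boxes = []
    for item in json.loads(json_text):
        box = Box(item["left"], item["top"], item["width"], item["height"])
        # Assumption: a zero-sized box is hidden and carries no visual information.
        if box.width > 0 and box.height > 0:
            boxes.append(box)
    return boxes


dump = """[
  {"left": 10, "top": 20, "width": 300, "height": 100},
  {"left": 0, "top": 0, "width": 0, "height": 0}
]"""
for box in load_boxes(dump):
    print(box, box.right, box.bottom)
```

Dropping zero-sized boxes is a cheap way to ignore part of the hidden content mentioned earlier, although it will not catch elements that are moved off screen.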

Below is a screenshot of the Quai Branly Museum page whose elements we want to cluster.

quai-branly

When building the clustering model, we found that one of the main features is the position of the leftmost and rightmost pixels of each block. Indeed, if you look at web pages, different content blocks are very often separated by a vertical gap.
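To see why the left and right edges carry so much information, here is a small sketch that splits elements into columns wherever a vertical gap appears. It uses plain interval merging, not the clustering model itself, and the function name `split_into_columns` and the `min_gap` parameter are hypothetical.

```python
def split_into_columns(spans, min_gap=0):
    """Group (left, right) spans into columns separated by a vertical gap.

    Returns a list of columns, each a list of indices into `spans`.
    A new column starts when an element begins more than `min_gap`
    pixels to the right of everything seen so far.
    """
    order = sorted(range(len(spans)), key=lambda i: spans[i][0])
    columns = []
    current_right = None  # rightmost pixel of the column being built
    for i in order:
        left, right = spans[i]
        if current_right is None or left - current_right > min_gap:
            columns.append([i])
            current_right = right
        else:
            columns[-1].append(i)
            current_right = max(current_right, right)
    return columns


# Two sidebar elements, then three main-content elements further right.
spans = [(0, 200), (20, 180), (240, 900), (260, 880), (240, 600)]
print(split_into_columns(spans, min_gap=10))
```

This only looks at the horizontal axis, so it cannot separate a header from the content underneath it; that is one reason to add more features.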

Adding the position of the center of each element block and its DOM depth improves the quality of the clustering.

Below is a first implementation of these concepts in Python, using Scikit-Learn to perform the clustering.
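The sketch below shows the general shape of such an implementation. It is an illustration under assumptions rather than our exact code: the element dicts and their keys (`left`, `top`, `width`, `height`, `depth`), the feature weights, the `eps` value and the choice of DBSCAN as the clustering algorithm are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical weights for [left, right, center_x, center_y, depth].
# The vertical center is down-weighted because blocks of the same column
# stack vertically; one level of DOM depth counts as 50 pixels.
FEATURE_WEIGHTS = np.array([1.0, 1.0, 1.0, 0.25, 50.0])


def build_features(elements):
    """Turn rendered elements into a weighted feature matrix."""
    rows = []
    for el in elements:
        left = el["left"]
        right = left + el["width"]
        center_x = left + el["width"] / 2
        center_y = el["top"] + el["height"] / 2
        rows.append([left, right, center_x, center_y, el["depth"]])
    return np.array(rows, dtype=float) * FEATURE_WEIGHTS


def cluster_elements(elements, eps=100.0, min_samples=2):
    """Return one cluster label per element (-1 means noise in DBSCAN)."""
    features = build_features(elements)
    model = DBSCAN(eps=eps, min_samples=min_samples)
    return model.fit_predict(features).tolist()


elements = [
    # a narrow sidebar on the left
    {"left": 0, "top": 100, "width": 200, "height": 40, "depth": 4},
    {"left": 0, "top": 150, "width": 200, "height": 40, "depth": 4},
    {"left": 0, "top": 200, "width": 200, "height": 40, "depth": 4},
    # the main content column
    {"left": 240, "top": 100, "width": 660, "height": 60, "depth": 5},
    {"left": 240, "top": 170, "width": 660, "height": 200, "depth": 5},
    {"left": 240, "top": 380, "width": 660, "height": 120, "depth": 5},
]
print(cluster_elements(elements))
```

DBSCAN is used here because it does not need the number of clusters in advance, which suits pages whose layout is unknown. The weights and `eps` would have to be tuned on real pages.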

This algorithm is far from perfect but is a good starting point when trying to cluster things visually.

Below is the result of the clustering on one page of the Quai Branly Museum website, corresponding to the screenshot above.

branly-clustering