Posted to dev@gora.apache.org by Alparslan Avcı <al...@agmlab.com> on 2014/02/19 13:03:00 UTC

Getting statistics about crawled pages

Hi all,

To learn more about the structure of the pages we crawl, I think we 
need to save the HTML tags, their attributes, and the attribute values. 
Once Nutch provides this info, a data analysis process (with the help of 
Pig, for example) can be run over the collected data. (Google also saves 
this kind of info. You can see the stats at this link: 
https://developers.google.com/webmasters/state-of-the-web/) We can 
develop an HTML parser plug-in to provide this improvement.

In the plug-in, we can traverse the DOM starting from the root element 
and save the tags, attributes, and values into the WebPage object. We 
could create a new field for this, but that would change the data model. 
Instead, we can add the tag info to the metadata map. (We can also add a 
prefix to the map keys to distinguish the tag data from other entries.)
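
To make the idea a bit more concrete, here is a rough sketch of the 
traversal using only the JDK's DOM classes. It is not tied to the actual 
Nutch plug-in API: the class name TagStatsSketch, the "tagstats." key 
prefix, the key layout, and the plain Map<String, String> standing in for 
the WebPage metadata map are all assumptions made for illustration. It 
also assumes the input is well-formed XHTML, since the JDK parser does not 
handle arbitrary HTML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TagStatsSketch {

    // Hypothetical prefix that keeps tag statistics apart from other metadata entries.
    static final String PREFIX = "tagstats.";

    // Parses well-formed XHTML and returns a map of prefixed keys to occurrence counts.
    // The plain map is only a stand-in for the WebPage metadata map (an assumption).
    static Map<String, String> collect(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> metadata = new TreeMap<>();
            walk(doc.getDocumentElement(), metadata);
            return metadata;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    // Depth-first walk: counts every element, and every (tag, attribute, value) triple.
    static void walk(Node node, Map<String, String> metadata) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String tag = node.getNodeName().toLowerCase(Locale.ROOT);
            increment(metadata, PREFIX + "tag." + tag);
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                String name = attr.getNodeName().toLowerCase(Locale.ROOT);
                increment(metadata,
                        PREFIX + "attr." + tag + "." + name + "=" + attr.getNodeValue());
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), metadata);
        }
    }

    // Counts are stored as strings, mimicking a metadata map that holds opaque values.
    static void increment(Map<String, String> metadata, String key) {
        int current = Integer.parseInt(metadata.getOrDefault(key, "0"));
        metadata.put(key, Integer.toString(current + 1));
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"x\">one</a><a href=\"y\">two</a>"
                + "<img src=\"p.png\" alt=\"\"/></body></html>";
        // Print every collected key and its count.
        collect(page).forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```

In a real plug-in the DOM would come from Nutch's own HTML parsing rather 
than from the JDK XML parser, and the values would have to be encoded in 
whatever form the metadata map expects. Keeping one prefixed key per tag 
or attribute makes it easy to filter these entries out later, for example 
in a Pig job.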

What do you think about this? Any comments or suggestions?

Alparslan