Posted to dev@gora.apache.org by Alparslan Avcı <al...@agmlab.com> on 2014/02/19 13:03:00 UTC
Getting statistics about crawled pages
Hi all,
In order to get more info about the structure of the pages we crawl, I
think we need to save the HTML tags, attributes, and their values. Once
Nutch provides this info, a data analysis process (with the help of Pig,
for example) can be run over the collected data. (Google also collects this
kind of info; you can see the stats here:
https://developers.google.com/webmasters/state-of-the-web/) We can
develop an HTML parser plug-in to provide such an improvement.
In the plug-in, we can iterate over the DOM starting from the root element
and save the tags, attributes, and values into the WebPage object. We could
create a new field for this; however, that would change the data model.
Instead, we can add the tag info to the metadata map. (We can also add a
prefix to the map keys to distinguish the tag content data from other
info.)
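To illustrate the idea, here is a minimal, self-contained sketch of the
traversal. It is not tied to the actual Nutch WebPage API: the
Map<String, ByteBuffer> below stands in for the WebPage metadata map, and
the "tag:" prefix and the class/method names are only illustrative. It
also simplifies things by recording tag and attribute occurrence counts
rather than full attribute values.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Sketch only: walks a parsed DOM, counts tag and attribute occurrences,
// and stores them under "tag:"-prefixed keys in a metadata-style map.
// The Map<String, ByteBuffer> stands in for the WebPage metadata map.
public class TagStatsCollector {

  private static final String KEY_PREFIX = "tag:";  // illustrative prefix

  private final Map<String, ByteBuffer> metadata = new HashMap<>();

  public void collect(Node root) {
    Map<String, Integer> counts = new HashMap<>();
    countTags(root, counts);
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      metadata.put(KEY_PREFIX + e.getKey(),
          ByteBuffer.wrap(e.getValue().toString()
              .getBytes(StandardCharsets.UTF_8)));
    }
  }

  // Recursively count element tags and their attributes, e.g. "a" and "a.href".
  private void countTags(Node node, Map<String, Integer> counts) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element el = (Element) node;
      String tag = el.getTagName().toLowerCase();
      counts.merge(tag, 1, Integer::sum);
      NamedNodeMap attrs = el.getAttributes();
      for (int i = 0; i < attrs.getLength(); i++) {
        counts.merge(tag + "." + attrs.item(i).getNodeName(), 1, Integer::sum);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      countTags(children.item(i), counts);
    }
  }

  public Map<String, ByteBuffer> getMetadata() {
    return metadata;
  }
}

In the real plug-in the same kind of entries would be written into the
WebPage metadata map during parsing, so a later Pig job could read them
straight from the datastore.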
What do you think about this? Any comments or suggestions?
Alparslan