Posted to dev@nutch.apache.org by Alparslan Avcı <al...@agmlab.com> on 2014/02/19 14:07:26 UTC

Getting statistics about crawled pages

Hi all,

In order to get more info about the structure of the pages we crawl, I 
think we need to save the HTML tags, attributes, and their values. Once 
Nutch provides this info, a data analysis process (with the help of 
Pig, for example) can be run over the collected data. (Google also 
saves this kind of info. You can see the stats at this link: 
https://developers.google.com/webmasters/state-of-the-web/) We can 
develop an HTML parser plug-in to provide such an improvement.

In the plug-in, we can iterate over the DOM root element and save the 
tags, attributes, and values into the WebPage object. We could create a 
new field for this, but that would change the data model. Instead, we 
can add the tag info into the metadata map. (We can also add a prefix 
to the map key to distinguish the tag data from other info.)
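
To make this concrete, here is a rough sketch of such a parse filter, 
assuming the Nutch 2.x ParseFilter interface. The class name, the 
"tagstat." key prefix, and the way the counts are encoded into the 
metadata map are only illustrative, not existing Nutch code:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseFilter;
import org.apache.nutch.storage.WebPage;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/**
 * Hypothetical parse filter that counts the HTML elements of a page and
 * stores the counts in the page metadata under a "tagstat." key prefix.
 */
public class TagStatsParseFilter implements ParseFilter {

  private Configuration conf;

  @Override
  public Parse filter(String url, WebPage page, Parse parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    countTags(doc, counts);
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      // Prefix the key so the tag stats can be told apart from other metadata.
      // Metadata values are ByteBuffers in the Gora WebPage schema.
      page.getMetadata().put(new Utf8("tagstat." + e.getKey()),
          ByteBuffer.wrap(e.getValue().toString()
              .getBytes(StandardCharsets.UTF_8)));
    }
    return parse;
  }

  /**
   * Recursively walks the DOM and counts element names; the same walk
   * could also record attributes and their values.
   */
  private void countTags(Node node, Map<String, Integer> counts) {
    if (node instanceof Element) {
      String name = node.getNodeName().toLowerCase();
      Integer n = counts.get(name);
      counts.put(name, n == null ? 1 : n + 1);
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      countTags(children.item(i), counts);
    }
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    return Collections.emptyList();
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

The key prefix would make it easy to pick out the tag stats later, for 
example in a Pig script over the stored metadata.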

What do you think about this? Any comments or suggestions?

Alparslan

Re: Getting statistics about crawled pages

Posted by Alparslan Avcı <al...@agmlab.com>.
Hi Sebastian,

Developing a separate job is a good idea. With this approach, we can 
also collect info about non-HTML documents. Moreover, a job-based 
approach will also allow us to collect info about historically crawled 
pages. And, as you said, we do not have to store the info in WebPage.

However, I prefer to start with an HTML parse filter since it will be 
easier to implement and easier to get accepted by the Nutch committers. :) 
Maybe later on somebody else or I will also develop the job version.

Alparslan.

On 19-02-2014 21:07, Sebastian Nagel wrote:
> Hi Alparslan,
>
>> You can see the stats at this link: https://developers.google.com/webmasters/state-of-the-web/) We
>> can develop an HTML parser plug-in to provide such an improvement.
> Nice resource and nice idea.
>
> For me that sounds like a combination of the ParserJob and the classic Hadoop word count example:
> 1. take the ParserJob and modify ParserMapper.map():
>     instead of
>       context.write(key, page);
>     traverse the DOM and do a
>       context.write(new Text("<" + element_name + ">"), new IntWritable(1));
>       context.write(new Text("<meta name=description>"), new IntWritable(1));
>     etc. for all your required statistics.
>     For simplicity all keys are strings (Text). But you could
>     define special objects to hold, e.g. element - attribute pairs.
> 2. instead of the IdentityPageReducer use the WordCountReducer.
> 3. you'll get a list of counts in the output directory
>     which can be processed by scripts or Excel to plot diagrams.
>
> Maybe that's simpler than modifying WebPage and then getting the numbers out.
>
> Sebastian
>
> On 02/19/2014 02:07 PM, Alparslan Avcı wrote:
>> Hi all,
>>
>> In order to get more info about the structure of the pages we crawl, I think we need to save the
>> HTML tags, attributes, and their values. Once Nutch provides this info, a data analysis process
>> (with the help of Pig, for example) can be run over the collected data. (Google also saves this
>> kind of info. You can see the stats at this link: https://developers.google.com/webmasters/state-of-the-web/) We
>> can develop an HTML parser plug-in to provide such an improvement.
>>
>> In the plug-in, we can iterate over the DOM root element and save the tags, attributes, and values
>> into the WebPage object. We could create a new field for this, but that would change the data
>> model. Instead, we can add the tag info into the metadata map. (We can also add a prefix to the
>> map key to distinguish the tag data from other info.)
>>
>> What do you think about this? Any comments or suggestions?
>>
>> Alparslan


Re: Getting statistics about crawled pages

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Alparslan,

> You can see the stats at this link: https://developers.google.com/webmasters/state-of-the-web/) We
> can develop an HTML parser plug-in to provide such an improvement.
Nice resource and nice idea.

For me that sounds like a combination of the ParserJob and the classic Hadoop word count example:
1. take the ParserJob and modify ParserMapper.map():
   instead of
     context.write(key, page);
   traverse the DOM and do a
     context.write(new Text("<" + element_name + ">"), new IntWritable(1));
     context.write(new Text("<meta name=description>"), new IntWritable(1));
   etc. for all your required statistics.
   For simplicity all keys are strings (Text). But you could
   define special objects to hold, e.g. element - attribute pairs.
2. instead of the IdentityPageReducer use the WordCountReducer.
3. you'll get a list of counts in the output directory
   which can be processed by scripts or Excel to plot diagrams.
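
For illustration, a rough sketch of that counting logic and the word-count
reducer, using plain Hadoop map/reduce classes. The TagStats class, the
emitTagCounts helper, and the TagCountReducer name are only illustrative,
not existing Nutch code; in a real version the DOM walk would live inside
the modified ParserMapper.map():

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.TaskInputOutputContext;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/**
 * Sketch of the tag-counting map/reduce logic. emitTagCounts() is what a
 * modified ParserMapper.map() would call instead of writing the page;
 * TagCountReducer is the classic word-count reducer.
 */
public class TagStats {

  private static final IntWritable ONE = new IntWritable(1);

  /** Emit ("<tag>", 1) for every element in the DOM subtree under root. */
  static void emitTagCounts(Node root,
      TaskInputOutputContext<?, ?, Text, IntWritable> context)
      throws IOException, InterruptedException {
    if (root instanceof Element) {
      context.write(new Text("<" + root.getNodeName().toLowerCase() + ">"), ONE);
    }
    NodeList children = root.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      emitTagCounts(children.item(i), context);
    }
  }

  /** Classic word-count reducer: sums the per-tag counts. */
  public static class TagCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}

The reducer output is then just "<tag>\tcount" lines, which can be loaded
into a script or spreadsheet as described in step 3.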

Maybe that's simpler than modifying WebPage and then getting the numbers out.

Sebastian

On 02/19/2014 02:07 PM, Alparslan Avcı wrote:
> Hi all,
> 
> In order to get more info about the structure of the pages we crawl, I think we need to save the
> HTML tags, attributes, and their values. Once Nutch provides this info, a data analysis process
> (with the help of Pig, for example) can be run over the collected data. (Google also saves this
> kind of info. You can see the stats at this link: https://developers.google.com/webmasters/state-of-the-web/) We
> can develop an HTML parser plug-in to provide such an improvement.
> 
> In the plug-in, we can iterate over the DOM root element and save the tags, attributes, and values
> into the WebPage object. We could create a new field for this, but that would change the data
> model. Instead, we can add the tag info into the metadata map. (We can also add a prefix to the
> map key to distinguish the tag data from other info.)
> 
> What do you think about this? Any comments or suggestions?
> 
> Alparslan