Posted to user@nutch.apache.org by Gaurav Agarwal <ga...@yahoo.com> on 2007/03/27 22:11:34 UTC

0.8.x Crawler compared to 0.7.2 Crawler

Hi everyone,

I am part of an academic research project that involves mining web
structure to identify social links between organizations. I have been
evaluating Nutch to fill the web-crawling task in the application stack. I
have a few questions about it, and I would greatly appreciate it if
someone could answer them or point me to the answers.

I read a few tutorials on the net and found that Nutch's (0.7.x)
IWebDBReader provides APIs to get all the crawled pages (by URL/MD5) and
to get the incoming links to and outgoing links from a particular URL.
This is great, as it is precisely the functionality I was looking for to
build a web-link graph and mine information out of it. In addition, the
very simple plugin architecture used in Nutch made it look very
attractive. However, after working for a couple of hours with release
0.8.1, I realized that these APIs are no longer supported by the
WebDBReader (only incoming links are supported in 0.8.1). This has left
me wondering which version I should be using for my project.
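
To make this concrete, the kind of code I was hoping to write against
0.7.x looks roughly like this (a sketch pieced together from the
tutorials; the constructor and the exact method signatures are my
guesses, so please correct me if they are off):

  // Sketch only: walk the 0.7.x webdb and print per-page link counts.
  // Class names are from the org.apache.nutch.db package; the
  // NutchFileSystem usage is my guess from the tutorials.
  import java.io.File;
  import java.util.Enumeration;
  import org.apache.nutch.db.IWebDBReader;
  import org.apache.nutch.db.Link;
  import org.apache.nutch.db.Page;
  import org.apache.nutch.db.WebDBReader;
  import org.apache.nutch.fs.NutchFileSystem;

  public class LinkGraphDump {
    public static void main(String[] args) throws Exception {
      NutchFileSystem nfs = NutchFileSystem.get();
      // args[0] is the webdb directory produced by the crawl
      IWebDBReader reader = new WebDBReader(nfs, new File(args[0]));
      try {
        for (Enumeration e = reader.pages(); e.hasMoreElements();) {
          Page page = (Page) e.nextElement();
          // incoming links are keyed by URL, outgoing links by the MD5
          // hash of the page contents (as I understand the API)
          Link[] in = reader.getLinks(page.getURL());
          Link[] out = reader.getLinks(page.getMD5());
          System.out.println(page.getURL() + " in=" + in.length
              + " out=" + out.length);
        }
      } finally {
        reader.close();
      }
    }
  }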

The advantage of 0.8.x is definitely that it models almost every
operation as MapReduce jobs (which is amazing!) and is therefore much
more scalable; but in the absence of the APIs mentioned above, it does
not help me much in building the web-link graph from the crawler output.

I may be completely wrong here (please correct me if I am), but it looks
like post-0.8.0 the thrust has been to develop Nutch purely as an
indexing library/application, with the crawl module losing its
independence and decoupling. With 0.8.x, the crawl output by itself does
not give much useful information (or at least I failed to locate such
APIs).

I'll rephrase my concerns as concrete questions:

1) Is there a way (APIs) in the 0.8.x/0.9 releases of Nutch to access
information about the crawled data, i.e. get all pages (contents) given a
URL/MD5, get outgoing links from a URL, and get all incoming links to a
URL (this last API is provided; I mention it for completeness)?
Or is there an easy way to improvise these APIs?

2) If the answer to 1 is NO, are there any plans to add this functionality
back in forthcoming releases?

3) If the answer to both 1 and 2 is NO, can someone point me to the
discussions that explain the rationale behind these interface changes,
which (in my opinion) leave the crawler module slightly weakened? (I tried
scanning the forum posts back to the era when 0.7.2 was released but
failed to locate any such discussion.)

As I mentioned earlier, I have only recently started using Nutch, and many
of my thoughts might be irrelevant or even completely wrong; please excuse
me for that.

Thanks in advance!

Regards,
Gaurav


Re: 0.8.x Crawler compared to 0.7.2 Crawler

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gaurav Agarwal wrote:
> Hi Andrzej,
> 
> Thanks a lot for pointing out the features to me. I greatly appreciate the
> help. Things look a lot better now :)
> 
> Just one more thing: can you point me to any document/email/discussion
> (internal or published) that can give me some insight into the
> architecture of Nutch 0.8.x, and maybe information on the kind of data
> that goes into every directory?

If the Wiki doesn't already contain this info (I didn't check), then only
the mailing lists may contain it... Most of the stuff is the same, and the
basic work cycle is still the same, but the data formats differ: the webdb
was split into two parts, outlinks are stored in crawl_parse (and in
parse_data), and there are those funky part-xxxx subdirectories, which are
a side effect of using Hadoop. Other than that, not much changed in the
data layout.
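
For orientation, a crawl directory from a default 0.8.x "bin/nutch crawl"
run looks roughly like this (from memory - segment names are timestamps,
and the exact set of subdirectories depends on which steps you ran):

  crawl/
    crawldb/            one CrawlDatum (fetch status, score) per URL
    linkdb/             inverted links: Inlinks keyed by target URL
    segments/
      20070327221134/
        content         raw fetched content
        crawl_generate  the fetchlist the segment was generated from
        crawl_fetch     per-URL fetch status
        crawl_parse     outlinks, used to update the crawldb
        parse_data      parsed metadata, including outlinks
        parse_text      extracted plain text
    indexes/            Lucene index parts (if you ran the indexer)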

When it comes to the architecture, it was completely rewritten - I don't 
think there's any detailed documentation on this, though...


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: 0.8.x Crawler compared to 0.7.2 Crawler

Posted by Gaurav Agarwal <ga...@yahoo.com>.
Hi Andrzej,

Thanks a lot for pointing out the features to me. I greatly appreciate the
help. Things look a lot better now :)

Just one more thing: can you point me to any document/email/discussion
(internal or published) that can give me some insight into the
architecture of Nutch 0.8.x, and maybe information on the kind of data
that goes into every directory?

Thanks,
Gaurav



Andrzej Bialecki wrote:
> 
> Gaurav Agarwal wrote:
>> Hi everyone,
>> The advantage of 0.8.x is definitely that it models almost every
>> operation as MapReduce jobs (which is amazing!) and is therefore much
>> more scalable; but in the absence of the APIs mentioned above, it does
>> not help me much in building the web-link graph from the crawler output.
> 
> There is a similar API for reading from the DB, which is called 
> CrawlDbReader. It is relatively simple compared to WebDBReader, because 
> most of the support is already provided by Hadoop (i.e. the map-reduce 
> framework).
> 
> In 0.8 and later, the information about pages and the information about
> links are split into two different DBs - crawldb and linkdb - but exactly
> the same information can be obtained from them as before.
> 
> 
>> 
>> I may be completely wrong here (please correct me if I am), but it looks
>> like post-0.8.0 the thrust has been to develop Nutch purely as an
>> indexing library/application, with the crawl module losing its
>> independence and decoupling. With 0.8.x, the crawl output by itself does
>> not give much useful information (or at least I failed to locate such
>> APIs).
> 
> 
> That's not the case - if anything, the amount of useful information you
> can retrieve has increased tremendously. Please see all the tools
> available through the bin/nutch script whose names are prefixed with
> read*, and then look at their implementations for inspiration.
> 
> 
>> 
>> I'll rephrase my concerns as concrete questions:
>> 
>> 1) Is there a way (APIs) in the 0.8.x/0.9 releases of Nutch to access
>> information about the crawled data, i.e. get all pages (contents) given a
> 
> Fetched pages are stored in segments. Please see the SegmentReader tool,
> which allows you to retrieve the segment content.
> 
> 
>> URL/MD5, get outgoing links from a URL, and get all incoming links to a
> 
> SegmentReader as above. For incoming links, use the linkdb and
> LinkDbReader.
> 
>> URL (this last API is provided; I mention it for completeness)?
>> Or is there an easy way to improvise these APIs?
>> 
>> 2) If the answer to 1 is NO, are there any plans to add this
>> functionality back in forthcoming releases?
>> 
>> 3) If the answer to both 1 and 2 is NO, can someone point me to the
>> discussions that explain the rationale behind these interface changes,
>> which (in my opinion) leave the crawler module slightly weakened? (I
>> tried scanning the forum posts back to the era when 0.7.2 was released
>> but failed to locate any such discussion.)
> 
> Please see above. The answer is yes. ;)
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 



Re: 0.8.x Crawler compared to 0.7.2 Crawler

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gaurav Agarwal wrote:
> Hi everyone,
> The advantage of 0.8.x is definitely that it models almost every
> operation as MapReduce jobs (which is amazing!) and is therefore much
> more scalable; but in the absence of the APIs mentioned above, it does
> not help me much in building the web-link graph from the crawler output.

There is a similar API for reading from the DB, which is called 
CrawlDbReader. It is relatively simple compared to WebDBReader, because 
most of the support is already provided by Hadoop (i.e. the map-reduce 
framework).

In 0.8 and later, the information about pages and the information about
links are split into two different DBs - crawldb and linkdb - but exactly
the same information can be obtained from them as before.
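
If you prefer the Java API to the command-line tools: the DBs are plain
Hadoop SequenceFile/MapFile data, so you can also read them directly. A
minimal sketch (the path layout and the key handling below are my
assumptions about the 0.8.x format, so double-check against your
release):

  // Sketch: sequentially dump <url, CrawlDatum> pairs from the crawldb.
  // A real tool would iterate over all part-xxxx directories.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.util.NutchConfiguration;

  public class CrawlDbDump {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      FileSystem fs = FileSystem.get(conf);
      // args[0] is the crawldb directory, e.g. crawl/crawldb
      Path data = new Path(args[0], "current/part-00000/data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      // the key class changed between releases (UTF8 vs. Text), so
      // instantiate whatever the file itself says it contains
      WritableComparable key =
          (WritableComparable) reader.getKeyClass().newInstance();
      Writable value = new CrawlDatum();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
      reader.close();
    }
  }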


> 
> I may be completely wrong here (please correct me if I am), but it looks
> like post-0.8.0 the thrust has been to develop Nutch purely as an
> indexing library/application, with the crawl module losing its
> independence and decoupling. With 0.8.x, the crawl output by itself does
> not give much useful information (or at least I failed to locate such
> APIs).


That's not the case - if anything, the amount of useful information you
can retrieve has increased tremendously. Please see all the tools
available through the bin/nutch script whose names are prefixed with
read*, and then look at their implementations for inspiration.
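
For reference, the relevant commands are roughly the following (from
memory - run bin/nutch with no arguments to see the exact usage for your
release):

  bin/nutch readdb <crawldb> (-stats | -dump <out_dir> | -url <url>)
  bin/nutch readlinkdb <linkdb> (-dump <out_dir> | -url <url>)
  bin/nutch readseg -dump <segment> <out_dir> (plus -list/-get variants)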


> 
> I'll rephrase my concerns as concrete questions:
> 
> 1) Is there a way (APIs) in the 0.8.x/0.9 releases of Nutch to access
> information about the crawled data, i.e. get all pages (contents) given a

Fetched pages are stored in segments. Please see the SegmentReader tool,
which allows you to retrieve the segment content.
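
For random access by URL (your "get page contents given a URL" case) the
per-segment data are MapFiles, so something along these lines should
work - a sketch only, with the 0.8-era classes and the UTF8 key type
written from memory (the searcher's FetchedSegments does essentially
this):

  // Sketch: look up the fetched content for one URL in one segment.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.UTF8;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.MapFileOutputFormat;
  import org.apache.hadoop.mapred.lib.HashPartitioner;
  import org.apache.nutch.protocol.Content;
  import org.apache.nutch.util.NutchConfiguration;

  public class GetContent {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      FileSystem fs = FileSystem.get(conf);
      // args[0] is a segment directory, args[1] the URL to look up
      Path contentDir = new Path(args[0], Content.DIR_NAME);
      MapFile.Reader[] readers =
          MapFileOutputFormat.getReaders(fs, contentDir, conf);
      Content content = new Content();
      Writable hit = MapFileOutputFormat.getEntry(readers,
          new HashPartitioner(), new UTF8(args[1]), content);
      System.out.println(hit == null ? "not found" : content.toString());
    }
  }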


> URL/MD5, get outgoing links from a URL, and get all incoming links to a

SegmentReader as above. For incoming links, use the linkdb and LinkDbReader.
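
For example (again a sketch - I am writing the signatures from memory,
and both the constructor and the key type changed between 0.8 and 0.9,
so check the LinkDbReader javadoc for your release):

  // Sketch: print all pages linking to a given URL, via the linkdb.
  import java.util.Iterator;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.UTF8;
  import org.apache.nutch.crawl.Inlink;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.crawl.LinkDbReader;
  import org.apache.nutch.util.NutchConfiguration;

  public class GetInlinks {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      // args[0] is the linkdb directory, args[1] the target URL
      LinkDbReader linkDb =
          new LinkDbReader(FileSystem.get(conf), new Path(args[0]), conf);
      Inlinks inlinks = linkDb.getInlinks(new UTF8(args[1]));
      if (inlinks == null) return;
      for (Iterator it = inlinks.iterator(); it.hasNext();) {
        Inlink in = (Inlink) it.next();
        System.out.println(in.getFromUrl() + " (" + in.getAnchor() + ")");
      }
    }
  }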

> URL (this last API is provided; I mention it for completeness)?
> Or is there an easy way to improvise these APIs?
> 
> 2) If the answer to 1 is NO, are there any plans to add this functionality
> back in forthcoming releases?
> 
> 3) If the answer to both 1 and 2 is NO, can someone point me to the
> discussions that explain the rationale behind these interface changes,
> which (in my opinion) leave the crawler module slightly weakened? (I tried
> scanning the forum posts back to the era when 0.7.2 was released but
> failed to locate any such discussion.)

Please see above. The answer is yes. ;)

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com