Posted to user@nutch.apache.org by Eric <er...@lakemeadonline.com> on 2009/10/05 21:47:05 UTC

Incremental Whole Web Crawling

My plan is to crawl ~1.6M TLDs to a depth of 2. Is there a way I can
crawl it in increments of 100K? E.g., crawl 100K 16 times for the TLDs,
then crawl the links generated from the TLDs in increments of 100K?

Thanks,

EO

Re: mergecrawls.sh

Posted by Alex Basa <al...@yahoo.com>.
It seems like when the indexer gets a 'Job failed' it doesn't back up one directory, so in the next phase, where it does the dedup, it won't find the newindexes directory since it's looking for it under index.  Does anyone know of a fix to the Indexer for this?  I'm running Nutch 0.9.

As always, thanks in advance

Indexing [http://www.plataformaarquitectura.cl/2009/06/28/summer-show-2009-barlett-school-of-architecture-ucl/100_7191/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@799e11a1 (null)
Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:307)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:329)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:312)

log4j:ERROR Failed to flush writer,
java.io.InterruptedIOException
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
        at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
        at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
        at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
        at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
        at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:57)
        at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:315)
        at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:358)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:159)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:230)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:65)
        at org.apache.log4j.Category.callAppenders(Category.java:203)
        at org.apache.log4j.Category.forcedLog(Category.java:388)
        at org.apache.log4j.Category.log(Category.java:853)
        at org.apache.commons.logging.impl.Log4JLogger.warn(Log4JLogger.java:169)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:166)
De-duplicate indexes
Dedup: starting
Dedup: adding indexes in: /database/Nutch/index/newindexes
org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /database/Nutch/index/newindexes
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:603)
        at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:674)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:658)
DeleteDuplicates: java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /database/Nutch/index.uchi/newindexes
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:653)
        at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:674)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:658)

Merge indexes
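
The dedup failure appears to be fallout from the earlier 'Job failed' during indexing: newindexes is never produced, so the next phase can't find it. A hedged sketch of a guard that could be added to mergecrawls.sh before the dedup step, so a failed indexing job aborts the script instead of cascading (the path is taken from the log above; the surrounding script layout is an assumption, not the stock script):

# abort if the indexer did not produce newindexes (sketch only)
NEWINDEXES=/database/Nutch/index/newindexes   # adjust to your index directory
if [ ! -d "$NEWINDEXES" ]; then
  echo "indexing failed - $NEWINDEXES is missing, aborting before dedup" >&2
  exit 1
fi
bin/nutch dedup "$NEWINDEXES"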


--- On Tue, 1/12/10, Alex Basa <al...@yahoo.com> wrote:

> From: Alex Basa <al...@yahoo.com>
> Subject: mergecrawls.sh
> To: nutch-user@lucene.apache.org
> Date: Tuesday, January 12, 2010, 12:01 PM
> Does anyone know of any bug fixes to
> mergecrawls.sh?  I have two working indexes that I try
> to merge and it seems to work but when it's done, the index
> is corrupt.
> 
> before the merge, both indexes have
> crawldb     index       linkdb      newindexes  segments
> 
> after the merge, the newindexes directory is gone
> crawldb   index     linkdb    segments
> 
> I didn't log the output so I'll re-run it again and look at
> the output.
> 
> Thanks,
> 
> Alex
> 
> 
>       
> 
> 


      


mergecrawls.sh

Posted by Alex Basa <al...@yahoo.com>.
Does anyone know of any bug fixes to mergecrawls.sh?  I have two working indexes that I try to merge; the merge seems to work, but when it's done, the index is corrupt.

before the merge, both indexes have
crawldb     index       linkdb      newindexes  segments

after the merge, the newindexes directory is gone
crawldb   index     linkdb    segments

I didn't log the output so I'll re-run it again and look at the output.

Thanks,

Alex


      


Re: Incremental Whole Web Crawling

Posted by Julien Nioche <li...@gmail.com>.
Hi Jesse,

No problem. Feel free to post your comments / bug fixes / suggestions on the
JIRA issue NUTCH-762.

Thanks
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/11/4 Jesse Hires <jh...@gmail.com>

> My apologies. missed a patch option :-P
> Must need more coffee.
> Jesse
>
> int GetRandomNumber()
> {
>   return 4; // Chosen by fair roll of dice
>                // Guaranteed to be random
> } // xkcd.com
>
>
>
> On Tue, Nov 3, 2009 at 8:08 PM, Jesse Hires <jh...@gmail.com> wrote:
>
> > Julien,
> > I tried to apply your patch because I was curious.
> > $ patch < NUTCH-762-MultiGenerator.patch
> >
> > but this seems to drop the two java files into the root directory instead
> > of
> > src/java/org/apache/nutch/crawl/URLPartitioner.java
> > src/java/org/apache/nutch/crawl/MultiGenerator.java
> >
> > But if I copy the files to those locations, I get compile errors.
> > I'm up to date on the svn trunk.
> > Did I miss a step?
> >
> >
> > Jesse
> >
> > int GetRandomNumber()
> > {
> >    return 4; // Chosen by fair roll of dice
> >                 // Guaranteed to be random
> > } // xkcd.com
> >
> >
> >
> >
> > On Tue, Nov 3, 2009 at 7:09 AM, Julien Nioche <
> > lists.digitalpebble@gmail.com> wrote:
> >
> >> FYI : there is an implementation of such a modified Generator in
> >> http://issues.apache.org/jira/browse/NUTCH-762
> >>
> >> Julien
> >> --
> >> DigitalPebble Ltd
> >> http://www.digitalpebble.com
> >>
> >> 2009/10/5 Andrzej Bialecki <ab...@getopt.org>
> >>
> >> > Eric wrote:
> >> >
> >> >> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
> >> >> crawl it in increments of 100K? e.g. crawl 100K 16 times for the
> TLD's
> >> then
> >> >> crawl the links generated from the TLD's in increments of 100K?
> >> >>
> >> >
> >> > Yes. Make sure that you have the "generate.update.db" property set to
> >> true,
> >> > and then generate 16 segments each having 100k urls. After you finish
> >> > generating them, then you can start fetching.
> >> >
> >> > Similarly, you can do the same for the next level, only you will have
> to
> >> > generate more segments.
> >> >
> >> > This could be done much simpler with a modified Generator that outputs
> >> > multiple segments from one job, but it's not implemented yet.
> >> >
> >> >
> >> > --
> >> > Best regards,
> >> > Andrzej Bialecki     <><
> >> >  ___. ___ ___ ___ _ _   __________________________________
> >> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> > http://www.sigram.com  Contact: info at sigram dot com
> >> >
> >> >
> >>
> >
> >
>

Re: Incremental Whole Web Crawling

Posted by Jesse Hires <jh...@gmail.com>.
My apologies. Missed a patch option :-P
Must need more coffee.
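
For the record, a sketch of how the patch probably needs to be applied, assuming (as is typical for svn diffs attached to JIRA) that its paths are relative to the trunk root, which is what -p0 preserves; the exact option Jesse omitted isn't stated in the thread:

cd nutch-trunk                              # checkout of http://svn.apache.org/repos/asf/lucene/nutch/trunk
patch -p0 < NUTCH-762-MultiGenerator.patch  # keep the src/java/org/apache/nutch/... paths intact
ant                                         # rebuild so the new classes are compiled in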
Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com



On Tue, Nov 3, 2009 at 8:08 PM, Jesse Hires <jh...@gmail.com> wrote:

> Julien,
> I tried to apply your patch because I was curious.
> $ patch < NUTCH-762-MultiGenerator.patch
>
> but this seems to drop the two java files into the root directory instead
> of
> src/java/org/apache/nutch/crawl/URLPartitioner.java
> src/java/org/apache/nutch/crawl/MultiGenerator.java
>
> But if I copy the files to those locations, I get compile errors.
> I'm up to date on the svn trunk.
> Did I miss a step?
>
>
> Jesse
>
> int GetRandomNumber()
> {
>    return 4; // Chosen by fair roll of dice
>                 // Guaranteed to be random
> } // xkcd.com
>
>
>
>
> On Tue, Nov 3, 2009 at 7:09 AM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
>> FYI : there is an implementation of such a modified Generator in
>> http://issues.apache.org/jira/browse/NUTCH-762
>>
>> Julien
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
>>
>> 2009/10/5 Andrzej Bialecki <ab...@getopt.org>
>>
>> > Eric wrote:
>> >
>> >> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
>> >> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's
>> then
>> >> crawl the links generated from the TLD's in increments of 100K?
>> >>
>> >
>> > Yes. Make sure that you have the "generate.update.db" property set to
>> true,
>> > and then generate 16 segments each having 100k urls. After you finish
>> > generating them, then you can start fetching.
>> >
>> > Similarly, you can do the same for the next level, only you will have to
>> > generate more segments.
>> >
>> > This could be done much simpler with a modified Generator that outputs
>> > multiple segments from one job, but it's not implemented yet.
>> >
>> >
>> > --
>> > Best regards,
>> > Andrzej Bialecki     <><
>> >  ___. ___ ___ ___ _ _   __________________________________
>> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> > http://www.sigram.com  Contact: info at sigram dot com
>> >
>> >
>>
>
>

Re: Incremental Whole Web Crawling

Posted by Jesse Hires <jh...@gmail.com>.
Julien,
I tried to apply your patch because I was curious.
$ patch < NUTCH-762-MultiGenerator.patch

but this seems to drop the two java files into the root directory instead of
src/java/org/apache/nutch/crawl/URLPartitioner.java
src/java/org/apache/nutch/crawl/MultiGenerator.java

But if I copy the files to those locations, I get compile errors.
I'm up to date on the svn trunk.
Did I miss a step?


Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com



On Tue, Nov 3, 2009 at 7:09 AM, Julien Nioche <lists.digitalpebble@gmail.com
> wrote:

> FYI : there is an implementation of such a modified Generator in
> http://issues.apache.org/jira/browse/NUTCH-762
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/10/5 Andrzej Bialecki <ab...@getopt.org>
>
> > Eric wrote:
> >
> >> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
> >> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's
> then
> >> crawl the links generated from the TLD's in increments of 100K?
> >>
> >
> > Yes. Make sure that you have the "generate.update.db" property set to
> true,
> > and then generate 16 segments each having 100k urls. After you finish
> > generating them, then you can start fetching.
> >
> > Similarly, you can do the same for the next level, only you will have to
> > generate more segments.
> >
> > This could be done much simpler with a modified Generator that outputs
> > multiple segments from one job, but it's not implemented yet.
> >
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
>

Re: Incremental Whole Web Crawling

Posted by Julien Nioche <li...@gmail.com>.
FYI: there is an implementation of such a modified Generator in
http://issues.apache.org/jira/browse/NUTCH-762

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/10/5 Andrzej Bialecki <ab...@getopt.org>

> Eric wrote:
>
>> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
>> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's then
>> crawl the links generated from the TLD's in increments of 100K?
>>
>
> Yes. Make sure that you have the "generate.update.db" property set to true,
> and then generate 16 segments each having 100k urls. After you finish
> generating them, then you can start fetching.
>
> Similarly, you can do the same for the next level, only you will have to
> generate more segments.
>
> This could be done much simpler with a modified Generator that outputs
> multiple segments from one job, but it's not implemented yet.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Incremental Whole Web Crawling

Posted by Julien Nioche <li...@gmail.com>.
>
>
> This could be done much simpler with a modified Generator that outputs
> multiple segments from one job, but it's not implemented yet.
>

This would also be more efficient, as crawlDB operations such as generate or
update take more time as the crawlDB grows (unlike fetch and parse, which are
proportional to the size of the fetchlist). When the crawlDB runs into
billions of URLs, the fetching / parsing takes relatively little time by comparison.

generate.update.db requires reading and writing the whole crawlDB every time,
but I suppose that would be fine for a small crawlDB.

J.

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Incremental Whole Web Crawling

Posted by Paul Tomblin <pt...@xcski.com>.
Don't change options in nutch-default.xml - copy the option into
nutch-site.xml and change it there.  That way the change will
(hopefully) survive an upgrade.
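
A minimal sketch of doing that from the Nutch install directory, assuming conf/nutch-site.xml does not already contain other overrides (if it does, add just the <property> block by hand instead of overwriting the file):

cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>generate.update.db</name>
    <value>true</value>
    <description>Mark generated URLs in the crawldb so successive generate
    runs produce disjoint fetchlists (see Andrzej's explanation below).</description>
  </property>
</configuration>
EOF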

On Tue, Oct 6, 2009 at 1:01 AM, Gaurang Patel <ga...@gmail.com> wrote:
> Hey,
>
> Never mind. I got *generate.update.db* in *nutch-default.xml* and set it
> true.
>
> Regards,
> Gaurang
>
> 2009/10/5 Gaurang Patel <ga...@gmail.com>
>
>> Hey Andrzej,
>>
>> Can you tell me where to set this property (generate.update.db)? I am
>> trying to run similar kind of crawl scenario that Eric is running.
>>
>> -Gaurang
>>
>> 2009/10/5 Andrzej Bialecki <ab...@getopt.org>
>>
>> Eric wrote:
>>>
>>>> Andrzej,
>>>>
>>>> Just to make sure I have this straight, set the generate.update.db
>>>> property to true then
>>>>
>>>> bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16 times?
>>>>
>>>
>>> Yes. When this property is set to true, then each fetchlist will be
>>> different, because the records for those pages that are already on another
>>> fetchlist will be temporarily locked. Please note that this lock holds only
>>> for 1 week, so you need to fetch all segments within one week from
>>> generating them.
>>>
>>> You can fetch and updatedb in arbitrary order, so once you fetched some
>>> segments you can run the parsing and updatedb just from these segments,
>>> without waiting for all 16 segments to be processed.
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>  ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>>
>>
>



-- 
http://www.linkedin.com/in/paultomblin

Re: Incremental Whole Web Crawling

Posted by Eric Osgood <er...@lakemeadonline.com>.
Oh ok,

You learn something new every day! I didn't know that the trunk was the
most recent build. Good to know! So this current trunk does have a fix
for the generator bug?


On Oct 13, 2009, at 2:05 PM, Andrzej Bialecki wrote:

> Eric Osgood wrote:
>> So the trunk contains the most recent nightly update?
>
> It's the other way around - nightly build is created from a snapshot  
> of the trunk. The trunk is always the most recent.
>
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com


Re: Incremental Whole Web Crawling

Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric Osgood wrote:
> So the trunk contains the most recent nightly update?

It's the other way around - nightly build is created from a snapshot of 
the trunk. The trunk is always the most recent.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Incremental Whole Web Crawling

Posted by Eric Osgood <er...@lakemeadonline.com>.
So the trunk contains the most recent nightly update?
On Oct 13, 2009, at 1:50 PM, Andrzej Bialecki wrote:

> Eric Osgood wrote:
>> Ok, I think I am on the right track now, but just to be sure: the  
>> code I want is the branch section of svn under nutchbase at http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ 
>>  correct?
>
> No, you need the trunk from here:
>
> http://svn.apache.org/repos/asf/lucene/nutch/trunk
>
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com


Re: Incremental Whole Web Crawling

Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric Osgood wrote:
> Ok, I think I am on the right track now, but just to be sure: the code I 
> want is the branch section of svn under nutchbase at 
> http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ correct?

No, you need the trunk from here:

http://svn.apache.org/repos/asf/lucene/nutch/trunk
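
A minimal sketch of checking out and building that trunk with command-line svn and ant (the directory name and the plain "ant" default target are assumptions; see the version_control page linked elsewhere in this thread for the authoritative instructions):

svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch-trunk
cd nutch-trunk
ant            # build the fresh trunk, which includes the recent Generator fix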


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Incremental Whole Web Crawling

Posted by Eric Osgood <er...@lakemeadonline.com>.
Ok, I think I am on the right track now, but just to be sure: the code I
want is the branch section of svn under nutchbase at
http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ - correct?

Thanks,

Eric


On Oct 13, 2009, at 1:38 PM, Andrzej Bialecki wrote:

> Eric Osgood wrote:
>> Andrzej,
>> Where do I get the nightly builds from? I tried to use the eclipse  
>> plugin that supports svn to no avail. Is there a ftp, http server  
>> where I can download the nutch source fresh?
>
> Personally I prefer to use a command-line svn, even though I do  
> development in Eclipse - I'm probably old-fashioned but I always  
> want to be very clear on what's going on when I do an update.
>
> See the instructions here:
>
> http://lucene.apache.org/nutch/version_control.html
>
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com


Re: Incremental Whole Web Crawling

Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric Osgood wrote:
> Andrzej,
> 
> Where do I get the nightly builds from? I tried to use the eclipse 
> plugin that supports svn to no avail. Is there a ftp, http server where 
> I can download the nutch source fresh?

Personally I prefer to use a command-line svn, even though I do 
development in Eclipse - I'm probably old-fashioned but I always want to 
be very clear on what's going on when I do an update.

See the instructions here:

http://lucene.apache.org/nutch/version_control.html


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Incremental Whole Web Crawling

Posted by Eric Osgood <er...@lakemeadonline.com>.
Andrzej,

Where do I get the nightly builds from? I tried to use the Eclipse
plugin that supports svn, to no avail. Is there an ftp or http server
where I can download the Nutch source fresh?

Thanks,

Eric

On Oct 11, 2009, at 12:40 PM, Andrzej Bialecki wrote:

> Eric Osgood wrote:
>> When I set generate.update.db to true and then run generate, it  
>> only runs twice and generates 100K for the 1st gen, 62.5K for the  
>> second gen and 0 for the 3rd gen on a seed list of 1.6M. I don't  
>> understand this, for a topN of 100K it should run 16 times and  
>> create 16 distinct lists if I am not mistaken.
>
> There was a bug in this code that I fixed recently - please get a  
> new nightly build and try it again.
>
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com


Re: Incremental Whole Web Crawling

Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric Osgood wrote:
> When I set generate.update.db to true and then run generate, it only 
> runs twice and generates 100K for the 1st gen, 62.5K for the second gen 
> and 0 for the 3rd gen on a seed list of 1.6M. I don't understand this, 
> for a topN of 100K it should run 16 times and create 16 distinct lists 
> if I am not mistaken.

There was a bug in this code that I fixed recently - please get a new 
nightly build and try it again.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Incremental Whole Web Crawling

Posted by Eric Osgood <er...@lakemeadonline.com>.
When I set generate.update.db to true and then run generate, it only
runs twice and generates 100K for the 1st gen, 62.5K for the second
gen, and 0 for the 3rd gen on a seed list of 1.6M. I don't understand
this; for a topN of 100K it should run 16 times and create 16 distinct
lists, if I am not mistaken.

Eric


On Oct 5, 2009, at 10:01 PM, Gaurang Patel wrote:

> Hey,
>
> Never mind. I got *generate.update.db* in *nutch-default.xml* and  
> set it
> true.
>
> Regards,
> Gaurang
>
> 2009/10/5 Gaurang Patel <ga...@gmail.com>
>
>> Hey Andrzej,
>>
>> Can you tell me where to set this property (generate.update.db)? I am
>> trying to run similar kind of crawl scenario that Eric is running.
>>
>> -Gaurang
>>
>> 2009/10/5 Andrzej Bialecki <ab...@getopt.org>
>>
>> Eric wrote:
>>>
>>>> Andrzej,
>>>>
>>>> Just to make sure I have this straight, set the generate.update.db
>>>> property to true then
>>>>
>>>> bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16  
>>>> times?
>>>>
>>>
>>> Yes. When this property is set to true, then each fetchlist will be
>>> different, because the records for those pages that are already on  
>>> another
>>> fetchlist will be temporarily locked. Please note that this lock  
>>> holds only
>>> for 1 week, so you need to fetch all segments within one week from
>>> generating them.
>>>
>>> You can fetch and updatedb in arbitrary order, so once you fetched  
>>> some
>>> segments you can run the parsing and updatedb just from these  
>>> segments,
>>> without waiting for all 16 segments to be processed.
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>> ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>>
>>

Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com


Re: Incremental Whole Web Crawling

Posted by Gaurang Patel <ga...@gmail.com>.
Hey,

Never mind. I found *generate.update.db* in *nutch-default.xml* and set it
to true.

Regards,
Gaurang

2009/10/5 Gaurang Patel <ga...@gmail.com>

> Hey Andrzej,
>
> Can you tell me where to set this property (generate.update.db)? I am
> trying to run similar kind of crawl scenario that Eric is running.
>
> -Gaurang
>
> 2009/10/5 Andrzej Bialecki <ab...@getopt.org>
>
> Eric wrote:
>>
>>> Andrzej,
>>>
>>> Just to make sure I have this straight, set the generate.update.db
>>> property to true then
>>>
>>> bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16 times?
>>>
>>
>> Yes. When this property is set to true, then each fetchlist will be
>> different, because the records for those pages that are already on another
>> fetchlist will be temporarily locked. Please note that this lock holds only
>> for 1 week, so you need to fetch all segments within one week from
>> generating them.
>>
>> You can fetch and updatedb in arbitrary order, so once you fetched some
>> segments you can run the parsing and updatedb just from these segments,
>> without waiting for all 16 segments to be processed.
>>
>>
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>

Re: Incremental Whole Web Crawling

Posted by Gaurang Patel <ga...@gmail.com>.
Hey Andrzej,

Can you tell me where to set this property (generate.update.db)? I am trying
to run a similar kind of crawl scenario to the one Eric is running.

-Gaurang

2009/10/5 Andrzej Bialecki <ab...@getopt.org>

> Eric wrote:
>
>> Andrzej,
>>
>> Just to make sure I have this straight, set the generate.update.db
>> property to true then
>>
>> bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16 times?
>>
>
> Yes. When this property is set to true, then each fetchlist will be
> different, because the records for those pages that are already on another
> fetchlist will be temporarily locked. Please note that this lock holds only
> for 1 week, so you need to fetch all segments within one week from
> generating them.
>
> You can fetch and updatedb in arbitrary order, so once you fetched some
> segments you can run the parsing and updatedb just from these segments,
> without waiting for all 16 segments to be processed.
>
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Incremental Whole Web Crawling

Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric wrote:
> Andrzej,
> 
> Just to make sure I have this straight, set the generate.update.db 
> property to true then
> 
> bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16 times?

Yes. When this property is set to true, then each fetchlist will be 
different, because the records for those pages that are already on 
another fetchlist will be temporarily locked. Please note that this lock 
holds only for 1 week, so you need to fetch all segments within one week 
of generating them.

You can fetch and updatedb in arbitrary order, so once you have fetched some 
segments you can run the parsing and updatedb just from these segments, 
without waiting for all 16 segments to be processed.
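
Putting the thread's advice together, a minimal sketch of the whole sequence (paths and topN are the ones discussed above; the separate parse step assumes the fetcher is not configured to parse while fetching, and the loop over crawl/segments/* assumes all segments there were generated by this run):

# 1. generate 16 disjoint fetchlists of 100k URLs each
#    (requires generate.update.db=true in conf/nutch-site.xml)
for i in $(seq 1 16); do
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
done

# 2. fetch, parse and update the crawldb segment by segment, in any order,
#    within one week of generating the fetchlists
for segment in crawl/segments/*; do
  bin/nutch fetch $segment
  bin/nutch parse $segment
  bin/nutch updatedb crawl/crawldb $segment
done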


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Incremental Whole Web Crawling

Posted by Eric <er...@lakemeadonline.com>.
Andrzej,

Just to make sure I have this straight: set the generate.update.db  
property to true, then run

bin/nutch generate crawl/crawldb crawl/segments -topN 100000

16 times?

Thanks,

Eric

On Oct 5, 2009, at 1:27 PM, Andrzej Bialecki wrote:

> Eric wrote:
>> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I  
>> can crawl it in increments of 100K? e.g. crawl 100K 16 times for  
>> the TLD's then crawl the links generated from the TLD's in  
>> increments of 100K?
>
> Yes. Make sure that you have the "generate.update.db" property set  
> to true, and then generate 16 segments each having 100k urls. After  
> you finish generating them, then you can start fetching.
>
> Similarly, you can do the same for the next level, only you will  
> have to generate more segments.
>
> This could be done much simpler with a modified Generator that  
> outputs multiple segments from one job, but it's not implemented yet.
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>


Re: Incremental Whole Web Crawling

Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric wrote:
> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can 
> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's 
> then crawl the links generated from the TLD's in increments of 100K?

Yes. Make sure that you have the "generate.update.db" property set to 
true, and then generate 16 segments, each having 100k URLs. After you 
finish generating them, you can start fetching.

Similarly, you can do the same for the next level, only you will have to 
generate more segments.

This could be done much simpler with a modified Generator that outputs 
multiple segments from one job, but it's not implemented yet.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com