Posted to user@nutch.apache.org by Eric <er...@lakemeadonline.com> on 2009/10/05 21:47:05 UTC
Incremental Whole Web Crawling
My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's
then crawl the links generated from the TLD's in increments of 100K?
Thanks,
EO
Re: mergecrawls.sh
Posted by Alex Basa <al...@yahoo.com>.
It seems like when the indexer gets a 'Job failed' it doesn't back up one directory so in the next phase where it does the dedup, it won't find the newindexes directory since it's looking for it under index. Anyone know of a fix to Indexer for this? I'm running Nutch 0.9
As always, thanks in advance
Indexing [http://www.plataformaarquitectura.cl/2009/06/28/summer-show-2009-barlett-school-of-architecture-ucl/100_7191/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@799e11a1 (null)
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:307)
at org.apache.nutch.indexer.Indexer.run(Indexer.java:329)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:312)
log4j:ERROR Failed to flush writer,
java.io.InterruptedIOException
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:260)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:57)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:315)
at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:358)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:159)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:230)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:65)
at org.apache.log4j.Category.callAppenders(Category.java:203)
at org.apache.log4j.Category.forcedLog(Category.java:388)
at org.apache.log4j.Category.log(Category.java:853)
at org.apache.commons.logging.impl.Log4JLogger.warn(Log4JLogger.java:169)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:166)
De-duplicate indexes
Dedup: starting
Dedup: adding indexes in: /database/Nutch/index/newindexes
org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /database/Nutch/index/newindexes
at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:603)
at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:674)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:658)
DeleteDuplicates: java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /database/Nutch/index.uchi/newindexes
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:653)
at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:674)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:658)
Merge indexes
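One possible workaround (an untested sketch; the index root and the bin/nutch dedup invocation are assumptions based on the paths in the log above, not the actual mergecrawls.sh contents) is to guard the dedup phase so a failed indexing job doesn't cascade into the InvalidInputException:

```shell
# Sketch of a guard for the dedup phase of mergecrawls.sh (Nutch 0.9).
# If the indexer's "Job failed" left no newindexes directory, bail out
# with a clear message instead of failing inside DeleteDuplicates.
INDEX_DIR=/database/Nutch/index

if [ -d "$INDEX_DIR/newindexes" ]; then
    bin/nutch dedup "$INDEX_DIR/newindexes"
else
    echo "Skipping dedup: $INDEX_DIR/newindexes missing (indexer likely failed)" >&2
    exit 1
fi
```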
--- On Tue, 1/12/10, Alex Basa <al...@yahoo.com> wrote:
> From: Alex Basa <al...@yahoo.com>
> Subject: mergecrawls.sh
> To: nutch-user@lucene.apache.org
> Date: Tuesday, January 12, 2010, 12:01 PM
> Does anyone know of any bug fixes to
> mergecrawls.sh? I have two working indexes that I try
> to merge and it seems to work but when it's done, the index
> is corrupt.
>
> before the merge, both indexes have
> crawldb index
> linkdb
> newindexes segments
>
> after the merge, the newindexes directory is gone
> crawldb index
> linkdb segments
>
> I didn't log the output so I'll re-run it again and look at
> the output.
>
> Thanks,
>
> Alex
mergecrawls.sh
Posted by Alex Basa <al...@yahoo.com>.
Does anyone know of any bug fixes to mergecrawls.sh? I have two working indexes that I try to merge and it seems to work but when it's done, the index is corrupt.
before the merge, both indexes have
crawldb index linkdb newindexes segments
after the merge, the newindexes directory is gone
crawldb index linkdb segments
I didn't log the output so I'll re-run it again and look at the output.
Thanks,
Alex
Re: Incremental Whole Web Crawling
Posted by Julien Nioche <li...@gmail.com>.
Hi Jesse,
no problem. Feel free to post your comments / bug fixes / suggestions on the
JIRA NUTCH-762
Thanks
--
DigitalPebble Ltd
http://www.digitalpebble.com
2009/11/4 Jesse Hires <jh...@gmail.com>
> My apologies. missed a patch option :-P
> Must need more coffee.
> Jesse
>
> int GetRandomNumber()
> {
> return 4; // Chosen by fair roll of dice
> // Guaranteed to be random
> } // xkcd.com
>
>
>
> On Tue, Nov 3, 2009 at 8:08 PM, Jesse Hires <jh...@gmail.com> wrote:
>
> > Julien,
> > I tried to apply your patch because I was curious.
> > $ patch < NUTCH-762-MultiGenerator.patch
> >
> > but this seems to drop the two java files into the root directory instead
> > of
> > src/java/org/apache/nutch/crawl/URLPartitioner.java
> > src/java/org/apache/nutch/crawl/MultiGenerator.java
> >
> > But if I copy the files to those locations, I get compile errors.
> > I'm up to date on the svn trunk.
> > Did I miss a step?
> >
> >
> > Jesse
> >
> > int GetRandomNumber()
> > {
> > return 4; // Chosen by fair roll of dice
> > // Guaranteed to be random
> > } // xkcd.com
> >
> >
> >
> >
> > On Tue, Nov 3, 2009 at 7:09 AM, Julien Nioche <
> > lists.digitalpebble@gmail.com> wrote:
> >
> >> FYI : there is an implementation of such a modified Generator in
> >> http://issues.apache.org/jira/browse/NUTCH-762
> >>
> >> Julien
> >> --
> >> DigitalPebble Ltd
> >> http://www.digitalpebble.com
> >>
> >> 2009/10/5 Andrzej Bialecki <ab...@getopt.org>
> >>
> >> > Eric wrote:
> >> >
> >> >> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
> >> >> crawl it in increments of 100K? e.g. crawl 100K 16 times for the
> TLD's
> >> then
> >> >> crawl the links generated from the TLD's in increments of 100K?
> >> >>
> >> >
> >> > Yes. Make sure that you have the "generate.update.db" property set to
> >> true,
> >> > and then generate 16 segments each having 100k urls. After you finish
> >> > generating them, then you can start fetching.
> >> >
> >> > Similarly, you can do the same for the next level, only you will have
> to
> >> > generate more segments.
> >> >
> >> > This could be done much simpler with a modified Generator that outputs
> >> > multiple segments from one job, but it's not implemented yet.
> >> >
> >> >
> >> > --
> >> > Best regards,
> >> > Andrzej Bialecki <><
> >> > ___. ___ ___ ___ _ _ __________________________________
> >> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> >> > ___|||__|| \| || | Embedded Unix, System Integration
> >> > http://www.sigram.com Contact: info at sigram dot com
> >> >
> >> >
> >>
> >
> >
>
Re: Incremental Whole Web Crawling
Posted by Jesse Hires <jh...@gmail.com>.
My apologies. missed a patch option :-P
Must need more coffee.
Jesse
int GetRandomNumber()
{
return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com
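The missed option is presumably patch's -p strip level. Applied from the root of the checkout (a sketch; it assumes the patch records paths relative to the project root), something like this keeps the files in their intended locations:

```shell
# Apply the patch from the root of the Nutch checkout; -p0 strips no
# leading path components, so src/java/... paths are preserved.
cd nutch-trunk
patch -p0 < NUTCH-762-MultiGenerator.patch

# Verify the files landed where expected:
ls src/java/org/apache/nutch/crawl/MultiGenerator.java \
   src/java/org/apache/nutch/crawl/URLPartitioner.java
```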
On Tue, Nov 3, 2009 at 8:08 PM, Jesse Hires <jh...@gmail.com> wrote:
> Julien,
> I tried to apply your patch because I was curious.
> $ patch < NUTCH-762-MultiGenerator.patch
>
> but this seems to drop the two java files into the root directory instead
> of
> src/java/org/apache/nutch/crawl/URLPartitioner.java
> src/java/org/apache/nutch/crawl/MultiGenerator.java
>
> But if I copy the files to those locations, I get compile errors.
> I'm up to date on the svn trunk.
> Did I miss a step?
>
>
> Jesse
>
> int GetRandomNumber()
> {
> return 4; // Chosen by fair roll of dice
> // Guaranteed to be random
> } // xkcd.com
>
>
>
>
> On Tue, Nov 3, 2009 at 7:09 AM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
>> FYI : there is an implementation of such a modified Generator in
>> http://issues.apache.org/jira/browse/NUTCH-762
>>
>> Julien
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
>>
>> 2009/10/5 Andrzej Bialecki <ab...@getopt.org>
>>
>> > Eric wrote:
>> >
>> >> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
>> >> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's
>> then
>> >> crawl the links generated from the TLD's in increments of 100K?
>> >>
>> >
>> > Yes. Make sure that you have the "generate.update.db" property set to
>> true,
>> > and then generate 16 segments each having 100k urls. After you finish
>> > generating them, then you can start fetching.
>> >
>> > Similarly, you can do the same for the next level, only you will have to
>> > generate more segments.
>> >
>> > This could be done much simpler with a modified Generator that outputs
>> > multiple segments from one job, but it's not implemented yet.
>> >
>> >
>> > --
>> > Best regards,
>> > Andrzej Bialecki <><
>> > ___. ___ ___ ___ _ _ __________________________________
>> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web
>> > ___|||__|| \| || | Embedded Unix, System Integration
>> > http://www.sigram.com Contact: info at sigram dot com
>> >
>> >
>>
>
>
Re: Incremental Whole Web Crawling
Posted by Jesse Hires <jh...@gmail.com>.
Julien,
I tried to apply your patch because I was curious.
$ patch < NUTCH-762-MultiGenerator.patch
but this seems to drop the two java files into the root directory instead of
src/java/org/apache/nutch/crawl/URLPartitioner.java
src/java/org/apache/nutch/crawl/MultiGenerator.java
But if I copy the files to those locations, I get compile errors.
I'm up to date on the svn trunk.
Did I miss a step?
Jesse
int GetRandomNumber()
{
return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com
On Tue, Nov 3, 2009 at 7:09 AM, Julien Nioche <lists.digitalpebble@gmail.com
> wrote:
> FYI : there is an implementation of such a modified Generator in
> http://issues.apache.org/jira/browse/NUTCH-762
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/10/5 Andrzej Bialecki <ab...@getopt.org>
>
> > Eric wrote:
> >
> >> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
> >> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's
> then
> >> crawl the links generated from the TLD's in increments of 100K?
> >>
> >
> > Yes. Make sure that you have the "generate.update.db" property set to
> true,
> > and then generate 16 segments each having 100k urls. After you finish
> > generating them, then you can start fetching.
> >
> > Similarly, you can do the same for the next level, only you will have to
> > generate more segments.
> >
> > This could be done much simpler with a modified Generator that outputs
> > multiple segments from one job, but it's not implemented yet.
> >
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> > ___. ___ ___ ___ _ _ __________________________________
> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> > ___|||__|| \| || | Embedded Unix, System Integration
> > http://www.sigram.com Contact: info at sigram dot com
> >
> >
>
Re: Incremental Whole Web Crawling
Posted by Julien Nioche <li...@gmail.com>.
FYI : there is an implementation of such a modified Generator in
http://issues.apache.org/jira/browse/NUTCH-762
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com
2009/10/5 Andrzej Bialecki <ab...@getopt.org>
> Eric wrote:
>
>> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
>> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's then
>> crawl the links generated from the TLD's in increments of 100K?
>>
>
> Yes. Make sure that you have the "generate.update.db" property set to true,
> and then generate 16 segments each having 100k urls. After you finish
> generating them, then you can start fetching.
>
> Similarly, you can do the same for the next level, only you will have to
> generate more segments.
>
> This could be done much simpler with a modified Generator that outputs
> multiple segments from one job, but it's not implemented yet.
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>
Re: Incremental Whole Web Crawling
Posted by Julien Nioche <li...@gmail.com>.
>
>
> This could be done much simpler with a modified Generator that outputs
> multiple segments from one job, but it's not implemented yet.
>
This would also be more efficient, as crawlDB operations such as generate or
update take longer as the crawlDB grows (unlike fetch and parse, which are
proportional to the size of the fetchlist). When the crawlDB runs into
billions of URLs, the fetching / parsing takes relatively little time by
comparison. generate.update.db requires reading and writing the whole crawlDB
every time, but I suppose that would be fine for a small crawlDB.
J.
--
DigitalPebble Ltd
http://www.digitalpebble.com
Re: Incremental Whole Web Crawling
Posted by Paul Tomblin <pt...@xcski.com>.
Don't change options in nutch-default.xml - copy the option into
nutch-site.xml and change it there. That way the change will
(hopefully) survive an upgrade.
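Following that advice, the override is a property element placed inside the <configuration> element of conf/nutch-site.xml (a sketch; the description text is illustrative):

```xml
<!-- conf/nutch-site.xml: local override, survives upgrades -->
<property>
  <name>generate.update.db</name>
  <value>true</value>
  <description>Mark generated records in the crawldb so successive
  generate runs produce disjoint fetchlists.</description>
</property>
```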
On Tue, Oct 6, 2009 at 1:01 AM, Gaurang Patel <ga...@gmail.com> wrote:
> Hey,
>
> Never mind. I got *generate.update.db* in *nutch-default.xml* and set it
> true.
>
> Regards,
> Gaurang
>
> 2009/10/5 Gaurang Patel <ga...@gmail.com>
>
>> Hey Andrzej,
>>
>> Can you tell me where to set this property (generate.update.db)? I am
>> trying to run similar kind of crawl scenario that Eric is running.
>>
>> -Gaurang
>>
>> 2009/10/5 Andrzej Bialecki <ab...@getopt.org>
>>
>> Eric wrote:
>>>
>>>> Andrzej,
>>>>
>>>> Just to make sure I have this straight, set the generate.update.db
>>>> property to true then
>>>>
>>>> bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16 times?
>>>>
>>>
>>> Yes. When this property is set to true, then each fetchlist will be
>>> different, because the records for those pages that are already on another
>>> fetchlist will be temporarily locked. Please note that this lock holds only
>>> for 1 week, so you need to fetch all segments within one week from
>>> generating them.
>>>
>>> You can fetch and updatedb in arbitrary order, so once you fetched some
>>> segments you can run the parsing and updatedb just from these segments,
>>> without waiting for all 16 segments to be processed.
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki <><
>>> ___. ___ ___ ___ _ _ __________________________________
>>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
>>> ___|||__|| \| || | Embedded Unix, System Integration
>>> http://www.sigram.com Contact: info at sigram dot com
>>>
>>>
>>
>
--
http://www.linkedin.com/in/paultomblin
Re: Incremental Whole Web Crawling
Posted by Eric Osgood <er...@lakemeadonline.com>.
Oh ok,
You learn something new every day! I didn't know that the trunk was the
most recent build. Good to know! So this current trunk does have a fix
for the generator bug?
On Oct 13, 2009, at 2:05 PM, Andrzej Bialecki wrote:
> Eric Osgood wrote:
>> So the trunk contains the most recent nightly update?
>
> It's the other way around - nightly build is created from a snapshot
> of the trunk. The trunk is always the most recent.
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com
Re: Incremental Whole Web Crawling
Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric Osgood wrote:
> So the trunk contains the most recent nightly update?
It's the other way around - nightly build is created from a snapshot of
the trunk. The trunk is always the most recent.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Incremental Whole Web Crawling
Posted by Eric Osgood <er...@lakemeadonline.com>.
So the trunk contains the most recent nightly update?
On Oct 13, 2009, at 1:50 PM, Andrzej Bialecki wrote:
> Eric Osgood wrote:
>> Ok, I think I am on the right track now, but just to be sure: the
>> code I want is the branch section of svn under nutchbase at http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/
>> correct?
>
> No, you need the trunk from here:
>
> http://svn.apache.org/repos/asf/lucene/nutch/trunk
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com
Re: Incremental Whole Web Crawling
Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric Osgood wrote:
> Ok, I think I am on the right track now, but just to be sure: the code I
> want is the branch section of svn under nutchbase at
> http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ correct?
No, you need the trunk from here:
http://svn.apache.org/repos/asf/lucene/nutch/trunk
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Incremental Whole Web Crawling
Posted by Eric Osgood <er...@lakemeadonline.com>.
Ok, I think I am on the right track now, but just to be sure: the code
I want is the branch section of svn under nutchbase at http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/
correct?
Thanks,
Eric
On Oct 13, 2009, at 1:38 PM, Andrzej Bialecki wrote:
> Eric Osgood wrote:
>> Andrzej,
>> Where do I get the nightly builds from? I tried to use the eclipse
>> plugin that supports svn to no avail. Is there a ftp, http server
>> where I can download the nutch source fresh?
>
> Personally I prefer to use a command-line svn, even though I do
> development in Eclipse - I'm probably old-fashioned but I always
> want to be very clear on what's going on when I do an update.
>
> See the instructions here:
>
> http://lucene.apache.org/nutch/version_control.html
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com
Re: Incremental Whole Web Crawling
Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric Osgood wrote:
> Andrzej,
>
> Where do I get the nightly builds from? I tried to use the eclipse
> plugin that supports svn to no avail. Is there a ftp, http server where
> I can download the nutch source fresh?
Personally I prefer to use a command-line svn, even though I do
development in Eclipse - I'm probably old-fashioned but I always want to
be very clear on what's going on when I do an update.
See the instructions here:
http://lucene.apache.org/nutch/version_control.html
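The command-line equivalent (per the version-control page linked above; the target directory name is arbitrary) is a plain anonymous checkout of trunk:

```shell
# Check out the current Nutch trunk - the nightly builds are snapshots of this
svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch-trunk
cd nutch-trunk
ant    # build with the project's Ant script
```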
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Incremental Whole Web Crawling
Posted by Eric Osgood <er...@lakemeadonline.com>.
Andrzej,
Where do I get the nightly builds from? I tried to use the eclipse
plugin that supports svn to no avail. Is there a ftp, http server
where I can download the nutch source fresh?
Thanks,
Eric
On Oct 11, 2009, at 12:40 PM, Andrzej Bialecki wrote:
> Eric Osgood wrote:
>> When I set generate.update.db to true and then run generate, it
>> only runs twice and generates 100K for the 1st gen, 62.5K for the
>> second gen and 0 for the 3rd gen on a seed list of 1.6M. I don't
>> understand this, for a topN of 100K it should run 16 times and
>> create 16 distinct lists if I am not mistaken.
>
> There was a bug in this code that I fixed recently - please get a
> new nightly build and try it again.
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com
Re: Incremental Whole Web Crawling
Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric Osgood wrote:
> When I set generate.update.db to true and then run generate, it only
> runs twice and generates 100K for the 1st gen, 62.5K for the second gen
> and 0 for the 3rd gen on a seed list of 1.6M. I don't understand this,
> for a topN of 100K it should run 16 times and create 16 distinct lists
> if I am not mistaken.
There was a bug in this code that I fixed recently - please get a new
nightly build and try it again.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Incremental Whole Web Crawling
Posted by Eric Osgood <er...@lakemeadonline.com>.
When I set generate.update.db to true and then run generate, it only
runs twice and generates 100K for the 1st gen, 62.5K for the second
gen and 0 for the 3rd gen on a seed list of 1.6M. I don't understand
this, for a topN of 100K it should run 16 times and create 16 distinct
lists if I am not mistaken.
Eric
On Oct 5, 2009, at 10:01 PM, Gaurang Patel wrote:
> Hey,
>
> Never mind. I got *generate.update.db* in *nutch-default.xml* and
> set it
> true.
>
> Regards,
> Gaurang
>
> 2009/10/5 Gaurang Patel <ga...@gmail.com>
>
>> Hey Andrzej,
>>
>> Can you tell me where to set this property (generate.update.db)? I am
>> trying to run similar kind of crawl scenario that Eric is running.
>>
>> -Gaurang
>>
>> 2009/10/5 Andrzej Bialecki <ab...@getopt.org>
>>
>> Eric wrote:
>>>
>>>> Andrzej,
>>>>
>>>> Just to make sure I have this straight, set the generate.update.db
>>>> property to true then
>>>>
>>>> bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16
>>>> times?
>>>>
>>>
>>> Yes. When this property is set to true, then each fetchlist will be
>>> different, because the records for those pages that are already on
>>> another
>>> fetchlist will be temporarily locked. Please note that this lock
>>> holds only
>>> for 1 week, so you need to fetch all segments within one week from
>>> generating them.
>>>
>>> You can fetch and updatedb in arbitrary order, so once you fetched
>>> some
>>> segments you can run the parsing and updatedb just from these
>>> segments,
>>> without waiting for all 16 segments to be processed.
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki <><
>>> ___. ___ ___ ___ _ _ __________________________________
>>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
>>> ___|||__|| \| || | Embedded Unix, System Integration
>>> http://www.sigram.com Contact: info at sigram dot com
>>>
>>>
>>
Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com
Re: Incremental Whole Web Crawling
Posted by Gaurang Patel <ga...@gmail.com>.
Hey,
Never mind. I got *generate.update.db* in *nutch-default.xml* and set it
true.
Regards,
Gaurang
2009/10/5 Gaurang Patel <ga...@gmail.com>
> Hey Andrzej,
>
> Can you tell me where to set this property (generate.update.db)? I am
> trying to run similar kind of crawl scenario that Eric is running.
>
> -Gaurang
>
> 2009/10/5 Andrzej Bialecki <ab...@getopt.org>
>
> Eric wrote:
>>
>>> Andrzej,
>>>
>>> Just to make sure I have this straight, set the generate.update.db
>>> property to true then
>>>
>>> bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16 times?
>>>
>>
>> Yes. When this property is set to true, then each fetchlist will be
>> different, because the records for those pages that are already on another
>> fetchlist will be temporarily locked. Please note that this lock holds only
>> for 1 week, so you need to fetch all segments within one week from
>> generating them.
>>
>> You can fetch and updatedb in arbitrary order, so once you fetched some
>> segments you can run the parsing and updatedb just from these segments,
>> without waiting for all 16 segments to be processed.
>>
>>
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>> ___. ___ ___ ___ _ _ __________________________________
>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
>> ___|||__|| \| || | Embedded Unix, System Integration
>> http://www.sigram.com Contact: info at sigram dot com
>>
>>
>
Re: Incremental Whole Web Crawling
Posted by Gaurang Patel <ga...@gmail.com>.
Hey Andrzej,
Can you tell me where to set this property (generate.update.db)? I am trying
to run similar kind of crawl scenario that Eric is running.
-Gaurang
2009/10/5 Andrzej Bialecki <ab...@getopt.org>
> Eric wrote:
>
>> Andrzej,
>>
>> Just to make sure I have this straight, set the generate.update.db
>> property to true then
>>
>> bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16 times?
>>
>
> Yes. When this property is set to true, then each fetchlist will be
> different, because the records for those pages that are already on another
> fetchlist will be temporarily locked. Please note that this lock holds only
> for 1 week, so you need to fetch all segments within one week from
> generating them.
>
> You can fetch and updatedb in arbitrary order, so once you fetched some
> segments you can run the parsing and updatedb just from these segments,
> without waiting for all 16 segments to be processed.
>
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>
Re: Incremental Whole Web Crawling
Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric wrote:
> Andrzej,
>
> Just to make sure I have this straight, set the generate.update.db
> property to true then
>
> bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16 times?
Yes. When this property is set to true, then each fetchlist will be
different, because the records for those pages that are already on
another fetchlist will be temporarily locked. Please note that this lock
holds only for 1 week, so you need to fetch all segments within one week
from generating them.
You can fetch and updatedb in arbitrary order, so once you fetched some
segments you can run the parsing and updatedb just from these segments,
without waiting for all 16 segments to be processed.
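Sketched as a per-segment loop (command forms are illustrative of the 0.9/1.0-era bin/nutch driver, not taken verbatim from any script), the cycle described above is:

```shell
# With all 16 segments generated up front, each segment can be fetched,
# parsed, and folded back into the crawldb independently - no need to
# wait for the remaining segments.
for segment in crawl/segments/*; do
  bin/nutch fetch "$segment"
  bin/nutch parse "$segment"      # only needed if the fetcher didn't parse
  bin/nutch updatedb crawl/crawldb "$segment"
done
```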
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Incremental Whole Web Crawling
Posted by Eric <er...@lakemeadonline.com>.
Andrzej,
Just to make sure I have this straight, set the generate.update.db
property to true then
bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16 times?
Thanks,
Eric
On Oct 5, 2009, at 1:27 PM, Andrzej Bialecki wrote:
> Eric wrote:
>> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I
>> can crawl it in increments of 100K? e.g. crawl 100K 16 times for
>> the TLD's then crawl the links generated from the TLD's in
>> increments of 100K?
>
> Yes. Make sure that you have the "generate.update.db" property set
> to true, and then generate 16 segments each having 100k urls. After
> you finish generating them, then you can start fetching.
>
> Similarly, you can do the same for the next level, only you will
> have to generate more segments.
>
> This could be done much simpler with a modified Generator that
> outputs multiple segments from one job, but it's not implemented yet.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
Re: Incremental Whole Web Crawling
Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric wrote:
> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's
> then crawl the links generated from the TLD's in increments of 100K?
Yes. Make sure that you have the "generate.update.db" property set to
true, and then generate 16 segments each having 100k urls. After you
finish generating them, then you can start fetching.
Similarly, you can do the same for the next level, only you will have to
generate more segments.
This could be done much simpler with a modified Generator that outputs
multiple segments from one job, but it's not implemented yet.
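As a sketch of the procedure above (assuming the standard bin/nutch driver and a crawl/ directory layout), generating the 16 disjoint segments is just the same command repeated:

```shell
# With generate.update.db=true, each run marks its URLs in the crawldb,
# so these 16 runs yield 16 disjoint 100k fetchlists.
for i in $(seq 1 16); do
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
done
```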
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com