Posted to user@nutch.apache.org by Bartek <ba...@o2.pl> on 2009/02/19 12:28:11 UTC
How to index while fetcher works
Hello,
I started to crawl a huge number of websites (dmoz with no limits in
crawl-urlfilter.txt) with -depth 10 and -topN 1 mln.
My /tmp/hadoop-root/ is more than 18GB for now (map-reduce jobs).
This fetching will not stop soon :) so I would like to process the segments
that are already done (updatedb, invertlinks, index), but there are parts
missing in them:
[root@server nutch]# bin/nutch invertlinks crawls/linkdb -dir crawls/segments/20090216142840/
LinkDb: adding segment: file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate
...
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data
etc.
When I manually try to parse the segments, it says that they are already parsed.
So my question is: how should I design the whole process of crawling a large
number of websites without limiting the crawl to specific domains (like a
regular search engine, e.g. Google)?
Should I run loops over a small number of links, like -topN 1000, and then
updatedb, invertlinks, index?
For now, I can start crawling, but no data will appear for weeks.
I found that in 1.0 (so it is already done) you are introducing live indexing
in Nutch. Are there any docs that I can use?
Regards,
Bartosz Gadzimski
Re: How to index while fetcher works
Posted by Bartosz Gadzimski <ba...@o2.pl>.
Doğacan Güney wrote:
> If you use the -dir option, you pass the segments directory, not an
> individual segment, e.g.:
>
> bin/nutch invertlinks crawls/linkdb -dir crawls/segments
>
> which will read every directory under segments.
>
> To pass individual segment directories, skip the -dir option:
>
> bin/nutch invertlinks crawls/linkdb crawls/segments/20090216142840
>
Thanks a lot!
It's working, but it's a bit strange:
bin/nutch invertlinks crawls/linkdb -dir crawls/segments
is not working (the same error as in my previous message), while
bin/nutch invertlinks crawls/linkdb crawls/segments/2009*
is working correctly.
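The thread does not say why the -dir form fails while the glob works. Whatever the cause, one defensive option is to pass invertlinks only those directories that actually contain parse_data. The following is a minimal sketch under that assumption; the function name parsed_segments is hypothetical, and the layout (segments/<timestamp>/parse_data) is taken from the paths in the error message.

```shell
#!/bin/sh
# Hypothetical helper: print every directory under $1 that contains parse_data.
# Directories without parse_data (unfinished segments, stray folders) are skipped.
parsed_segments() {
    segments_dir=$1
    for seg in "$segments_dir"/*; do
        if [ -d "$seg/parse_data" ]; then
            printf '%s\n' "$seg"
        fi
    done
}

# Usage (assumes segment paths contain no whitespace):
#   bin/nutch invertlinks crawls/linkdb $(parsed_segments crawls/segments)
```

This keeps the per-segment form that worked above, without relying on every directory name starting with 2009.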
Re: AW: AW: AW: How to index while fetcher works
Posted by Bartosz Gadzimski <ba...@o2.pl>.
Dear Nadine,
Your case is very interesting. Can you tell us more about how you deal
with such a situation? As you said, it looks like you have to rank news
by date; how are you achieving that? Keeping sites up to date
looks like a really cool feature.
Anyway, I am surprised that you are using the Nutch crawler for such a specific
field. I would use something like content scraping (very popular in SEO
and spam when you know your source, just PHP + regexp :). Of course, you
can use this only when you know your source website.
Thank you for the advice about large segments. I must remember this; they
cause a lot of problems (starting with waiting for the fetch job to finish
and, as you said, later problems with merging and indexing).
In my quick tests on a server with an Intel dual-core 2GHz CPU, 2GB RAM, and a
250GB SATA HDD, invertlinks on a 1.5GB segment took 22 minutes, which is a
little bit long.
Regards,
Bartosz
AW: AW: AW: How to index while fetcher works
Posted by Höchstötter Nadine <Ho...@huberverlag.de>.
Hi,
we do news crawling, which is why we have different ranking issues, such as freshness (how up to date a page is) and article recognition.
I have two scripts: one for the generate, fetch, parse cycle, where I also update the crawldb and linkdb, and another to merge segments and build indexes. For me, it is most important to have the newest pages of websites. For you, it will be better to have all pages, but not every page is updated that frequently, so if you fetch regularly, you will have them all after a while. But long crawl cycles produce huge segments, and we had performance problems merging and indexing them quickly.
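The first of the two scripts is not shown in this thread. As a rough sketch of what one round of such a cycle could look like with the standard Nutch commands (generate, fetch, parse, updatedb, invertlinks): the function name crawl_round, the NUTCH variable, and all paths are assumptions for illustration, not Nadine's actual script.

```shell
#!/bin/sh
# One generate/fetch/parse/update round. NUTCH and every path passed in are
# illustrative assumptions, not taken from the scripts discussed in the thread.
NUTCH=${NUTCH:-bin/nutch}

crawl_round() {
    crawldb=$1
    linkdb=$2
    segments_dir=$3
    topn=$4

    # Generate a fetch list limited to the top-scoring $topn URLs.
    $NUTCH generate "$crawldb" "$segments_dir" -topN "$topn" || return 1

    # Segment names are timestamps, so the lexically last one is the newest.
    segment=$(ls -d "$segments_dir"/* | tail -n 1)

    $NUTCH fetch "$segment" || return 1
    # Drop the parse step if your fetcher is configured to parse while fetching.
    $NUTCH parse "$segment" || return 1
    $NUTCH updatedb "$crawldb" "$segment" || return 1
    $NUTCH invertlinks "$linkdb" "$segment" || return 1
}

# Usage: crawl_round crawls/crawldb crawls/linkdb crawls/segments 1000
```

Keeping -topN small per round keeps each segment small, which is the point made above about huge segments being slow to merge and index.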
Re: AW: AW: How to index while fetcher works
Posted by Bartosz Gadzimski <ba...@o2.pl>.
Dear Nadine,
So when you are doing depth 1 or depth 2 crawls, can you crawl a whole
website? I imagine that with depth 2 you will cover the whole
website only when links from other pages appear, and it will take a lot
of time to get it all. Any modern website has a lot of "levels" to go
deep into (I am guessing 4-5 minimum).
About dmoz: it's only for testing. A good place with lots of links :)
Regarding the script: I didn't realize that you are not doing invertlinks. Is
this necessary for proper indexing and searching?
Thanks,
Bartosz
AW: AW: How to index while fetcher works
Posted by Höchstötter Nadine <Ho...@huberverlag.de>.
We also do depth 1 or 2 crawls, so the crawldb is also up to date.
Be careful with dmoz, there is a lot of spam out there.
The loop is also useful for inverting links etc. whenever it is important to work on single segments and not the whole directory.
Re: AW: How to index while fetcher works
Posted by Bartosz Gadzimski <ba...@o2.pl>.
Thanks Nadine, I am a few days ahead thanks to your script :)
Nutch is a really nice piece of software; it just takes time to get to know it better.
Regards,
Bartosz
AW: How to index while fetcher works
Posted by Höchstötter Nadine <Ho...@huberverlag.de>.
Hi. This is my version of an incremental index: I have one working dir for all the new segments coming in, and a routine that runs every four hours to build a new index for a separate webindex folder, which is therefore nearly up to date.
I merge segments into another folder, named with a YYYYMMDDHH pattern, in my working segment dir. With this I can always recognize which segments have already been indexed. Move or copy the merged segment under the YYYYMMDDHH folder to your fresh webindex segment folder, and also everything under $merge_dir (the new index) to your index folder in the webindex dir. This dir has the same structure as your working crawl dir.
It is also good for backups. Call the script below from cron and add cp, mv, rm, or tar commands wherever you like. I zip my crawldb and linkdb with this cron job, too, as a backup.
# CRAWLNAME, webdb_dir (the crawldb) and linkdb_dir must be set before this runs;
# the script itself does not define them.
index_dir=/nutchcrawl/indexes/$CRAWLNAME/index
TIMEH=`date +%Y%m%d%H`
merge_dir=/nutchcrawl/indexes/$CRAWLNAME/indexmerged$TIMEH
# Update segments: collect the raw segments (14-digit timestamp names),
# skipping any that still contain a _temporary directory
segments=""
for segment in `ls -d /nutchcrawl/indexes/$CRAWLNAME/segments/* | grep '[0-9]\{14\}'`
do
    if [ -d "$segment/_temporary" ]; then
        echo "$segment is temporary"
    else
        echo "$segment"
        segments="$segments $segment"
    fi
done
# Merge the collected segments into a folder named after the current hour
mergesegs_dir=/nutchcrawl/indexes/$CRAWLNAME/segments/$TIMEH
bin/nutch mergesegs $mergesegs_dir $segments
# Index the merged segment into a new per-run index directory
indexes=/nutchcrawl/indexes/$CRAWLNAME/indexes$TIMEH
NEW=`ls -d $mergesegs_dir/*`
echo "$NEW"
bin/nutch index $indexes $webdb_dir $linkdb_dir $NEW/
# Merge all per-run indexes into a single index
allindexes=""
for allindex in `ls -d /nutchcrawl/indexes/$CRAWLNAME/indexes*`
do
    allindexes="$allindexes $allindex"
done
bin/nutch merge $merge_dir $allindexes
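The four-hour cron schedule mentioned above might look roughly like the following. This is only a sketch: the helper name cron_line, the Nutch home directory and the script path are all hypothetical, and it assumes the script is started from the Nutch home directory because it calls bin/nutch with a relative path.

```shell
# cron_line NUTCH_HOME CRAWLNAME SCRIPT
# Prints a crontab entry that runs SCRIPT at minute 0 of every fourth hour,
# from inside NUTCH_HOME, with CRAWLNAME set in the environment.
# All three arguments are placeholders chosen by the caller.
cron_line() {
    printf '0 */4 * * * cd %s && CRAWLNAME=%s %s\n' "$1" "$2" "$3"
}

# Example (not executed here): append the entry to the current user's crontab.
# ( crontab -l 2>/dev/null; cron_line /usr/local/nutch mycrawl /usr/local/nutch/incremental_index.sh ) | crontab -
```

Printing the line first, rather than installing it directly, makes it easy to check the entry before it goes into the crontab.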
cheers, Nadine.
-----Original Message-----
From: Doğacan Güney [mailto:dogacan@gmail.com]
Sent: Thursday, February 19, 2009 12:35
To: nutch-user@lucene.apache.org
Subject: Re: How to index while fetcher works
Hi,
On Thu, Feb 19, 2009 at 13:28, Bartek <ba...@o2.pl> wrote:
> Hello,
>
> I started to crawl a huge number of websites (dmoz with no limits in
> crawl-urlfilter.txt) with -depth 10 and -topN 1 million.
>
> My /tmp/hadoop-root/ is already more than 18GB (map-reduce jobs).
>
>
> This fetch will not finish soon :) so I would like to process the segments
> already made (updatedb, invertlinks, index), but there are parts missing in them:
>
> [root@server nutch]# bin/nutch invertlinks crawls/linkdb -dir
> crawls/segments/20090216142840/
>
If you use the -dir option, you must pass the parent segments directory, not an
individual segment, e.g.:
bin/nutch invertlinks crawls/linkdb -dir crawls/segments
which will read every directory under segments.
To pass individual segments, skip the -dir option:
bin/nutch invertlinks crawls/linkdb crawls/segments/20090216142840
>
> LinkDb: adding segment:
> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate
>
> ...
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data
>
> etc.
>
> When manually trying to bin/parse segments it says that they are parsed.
>
>
> So my question is how to design the whole process of crawling a large number of
> websites without limiting them to specific domains (like in a regular search
> engine, e.g. Google)?
>
> Should I make loops over a small number of links? Like -topN 1000 and then
> updatedb, invertlinks, index?
>
>
> For now I can start crawling, but any data will appear only after weeks.
>
> I found that in 1.0 (so made already) you are introducing live indexing in
> nutch. Are there any docs that I can use?
>
> Regards,
> Bartosz Gadzimski
>
>
>
>
--
Doğacan Güney
Re: How to index while fetcher works
Posted by Doğacan Güney <do...@gmail.com>.
Hi,
On Thu, Feb 19, 2009 at 13:28, Bartek <ba...@o2.pl> wrote:
> Hello,
>
> I started to crawl a huge number of websites (dmoz with no limits in
> crawl-urlfilter.txt) with -depth 10 and -topN 1 million.
>
> My /tmp/hadoop-root/ is already more than 18GB (map-reduce jobs).
>
>
> This fetch will not finish soon :) so I would like to process the segments
> already made (updatedb, invertlinks, index), but there are parts missing in them:
>
> [root@server nutch]# bin/nutch invertlinks crawls/linkdb -dir
> crawls/segments/20090216142840/
>
If you use the -dir option, you must pass the parent segments directory, not an
individual segment, e.g.:
bin/nutch invertlinks crawls/linkdb -dir crawls/segments
which will read every directory under segments.
To pass individual segments, skip the -dir option:
bin/nutch invertlinks crawls/linkdb crawls/segments/20090216142840
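While the fetcher is still running, the newest segment is usually incomplete, so the second form (listing segments explicitly) can be combined with a small filter. The following is only a sketch: the helper name complete_segments is made up, and it assumes that a segment is safe to use once it contains a parse_data directory, which is the directory the InvalidInputException in this thread reports as missing.

```shell
# complete_segments SEGMENTS_DIR
# Prints, one per line, every segment directory under SEGMENTS_DIR that
# already has a parse_data subdirectory (assumption: such a segment has
# been fetched and parsed). Segments without parse_data are skipped.
complete_segments() {
    for seg in "$1"/*/; do
        seg=${seg%/}                     # strip the trailing slash from the glob
        if [ -d "$seg/parse_data" ]; then
            printf '%s\n' "$seg"
        fi
    done
    return 0
}

# Example (not executed here): invert links for the finished segments only.
# bin/nutch invertlinks crawls/linkdb $(complete_segments crawls/segments)
```

Note that the command substitution in the example relies on word splitting, so it only works for segment paths without spaces, which holds for the timestamp names Nutch generates.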
>
> LinkDb: adding segment:
> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate
>
> ...
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data
>
> etc.
>
> When manually trying to bin/parse segments it says that they are parsed.
>
>
> So my question is how to design the whole process of crawling a large number of
> websites without limiting them to specific domains (like in a regular search
> engine, e.g. Google)?
>
> Should I make loops over a small number of links? Like -topN 1000 and then
> updatedb, invertlinks, index?
>
>
> For now I can start crawling, but any data will appear only after weeks.
>
> I found that in 1.0 (so made already) you are introducing live indexing in
> nutch. Are there any docs that I can use?
>
> Regards,
> Bartosz Gadzimski
>
>
>
>
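The small-batch loop asked about above could be sketched roughly as follows. It uses the generate, fetch, updatedb and invertlinks commands of bin/nutch and assumes a crawl directory containing crawldb, linkdb and segments. The function name crawl_rounds and the NUTCH variable are made up for illustration, and the sketch assumes the fetcher is configured to parse while fetching (otherwise a bin/nutch parse step would be needed before updatedb). Indexing is left out; Nadine's script in this thread shows one way to index per batch.

```shell
# Path to the nutch launcher; can be overridden, e.g. for a dry run.
NUTCH=${NUTCH:-bin/nutch}

# crawl_rounds CRAWL_DIR ROUNDS TOPN
# Runs ROUNDS small generate/fetch/update cycles, each limited to TOPN URLs,
# so that every finished segment is usable without waiting for one huge fetch.
crawl_rounds() {
    crawl_dir=$1
    rounds=$2
    topn=$3
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Select the next batch of URLs into a new segment.
        $NUTCH generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$topn" || return 1
        # Segment names are timestamps, so the last one in sort order is the newest.
        segment=$(ls -d "$crawl_dir"/segments/2* | tail -1)
        $NUTCH fetch "$segment" || return 1
        # Feed the fetch results back into the crawldb and the linkdb.
        $NUTCH updatedb "$crawl_dir/crawldb" "$segment" || return 1
        $NUTCH invertlinks "$crawl_dir/linkdb" "$segment" || return 1
        i=$((i + 1))
    done
}

# Example (not executed here): ten rounds of 1000 URLs each.
# crawl_rounds crawls 10 1000
```

Keeping each round small means the crawldb and linkdb are updated after every batch, at the cost of more job start-up overhead per fetched page.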
--
Doğacan Güney