Posted to user@nutch.apache.org by ad...@interfree.it on 2005/09/16 12:13:26 UTC
crawl-urlfilter
Hi,
thank you for your hints, but I didn't give you the following information:
I modified the file crawl-urlfilter.txt as follows:
#start crawl-urlfilter
# skip file:, ftp:, & mailto: urls
-^(ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
# accept anything else
+.
#end crawl-urlfilter
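For reference, Nutch's RegexURLFilter applies these rules in order and the first match wins. As an illustrative sanity check (plain Python, not a Nutch tool; the URLs are made up), the rules above can be simulated, and they do accept a servlet-style URL such as MyServlet?menu=1:

```python
import re

# The ordered (sign, pattern) rules from the crawl-urlfilter.txt above.
rules = [
    ("-", r"^(ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls"
          r"|gz|rpm|tgz|mov|MOV|exe)$"),
    ("-", r"[*!@]"),
    ("+", r"."),
]

def accepted(url):
    # First matching rule wins, as in Nutch's RegexURLFilter.
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: reject

print(accepted("http://example.com/MyServlet?menu=1"))  # True
print(accepted("http://example.com/logo.gif"))          # False
print(accepted("mailto:someone@example.com"))           # False
```

Since the filter itself would accept the servlet URL, the problem may lie elsewhere, for example in how the frame's relative link is resolved or parsed.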
I started Nutch with this command line:
bin/nutch crawl urls -dir /home/paul/nutch-searcher.dir -depth 3 -thread 8 >& crawl.log
In the file "urls" there is the url of the following page:
<HTML>
<HEAD>
<TITLE> TitleOfSite </TITLE>
</HEAD>
<FRAMESET ROWS="14%, *">
<FRAME NORESIZE NAME="MENU" SRC="MyServlet?menu=1" SCROLLING="AUTO">
<FRAME NAME="PAGE" SRC="../welcome.html" SCROLLING="AUTO">
</FRAMESET>
</HTML>
Nutch crawls and fetches "welcome.html" but doesn't work with "MyServlet?menu=1".
The servlet page "MyServlet?menu=1" shows some links, but according to the log
Nutch doesn't fetch any of those links.
I hope the question is clear and am looking forward to receiving your answer.
Adriano
Please help me!
Re: indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
Doug,
Should indexer.maxMergeDocs be set to the same value, 2147483647?
Thanks,
Gal
Doug Cutting wrote:
> The default for indexer.maxMergeDocs was mistakenly set to 50, which
> can make indexing really slow. Try putting the following in your
> nutch-site.xml:
>
> <property>
> <name>indexer.maxMergeDocs</name>
> <value>2147483647</value>
> </property>
>
> Does that help?
>
> I just fixed this in trunk. We should fix this in the 0.7 release
> branch.
>
> Doug
>
> .
>
Re: indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
EM wrote:
> Andrzej Bialecki wrote:
>
>>
>> The fix is simple: delete the "index" files in each subdirectory of
>> the 20050916014401 segment. Then run "nutch segread -fix
>> 20050916014401". Then re-run mergesegs - it will now work at full speed.
>>
>> NB. if there are any more segments which give you this warning, do
>> the same before you run mergesegs.
>>
> "segread -fix" doesn't work, "-segslice" solves this though.
>
>
>
> .
>
Yes segslice solved it.
Thanks.
Re: indexing is very very very slow
Posted by EM <em...@cpuedge.com>.
Andrzej Bialecki wrote:
>
> The fix is simple: delete the "index" files in each subdirectory of
> the 20050916014401 segment. Then run "nutch segread -fix
> 20050916014401". Then re-run mergesegs - it will now work at full speed.
>
> NB. if there are any more segments which give you this warning, do the
> same before you run mergesegs.
>
"segread -fix" doesn't work, "-segslice" solves this though.
Re: indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
Hi Andrzej,
Thank you for your reply.
I have tried twice but the segment is not being fixed:
[root@kunzon nutch]# find index/segments/20050919092227/ -name index -print
index/segments/20050919092227/fetcher/index
index/segments/20050919092227/parse_text/index
index/segments/20050919092227/content/index
index/segments/20050919092227/parse_data/index
[root@kunzon nutch]# rm -rf index/segments/20050919092227/fetcher/index
[root@kunzon nutch]# rm -rf index/segments/20050919092227/parse_text/index
[root@kunzon nutch]# rm -rf index/segments/20050919092227/content/index
[root@kunzon nutch]# rm -rf index/segments/20050919092227/parse_data/index
[root@kunzon nutch]# bin/nutch segread index/segments/20050919092227 -fix
050920 031844 parsing file:/nutch/conf/nutch-default.xml
050920 031844 parsing file:/nutch/conf/nutch-site.xml
050920 031845 No FS indicated, using default:local
050920 031849 - fixed fetcher
050920 031932 - fixed content
050920 031952 - fixed parse_data
050920 032006 - fixed parse_text
050920 032006 Finished fixing 20050919092227
050920 032006 - data in segment index/segments/20050919092227 is
corrupt, using only 91212 entries.
Thanks,
Gal
Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Hi,
>>
>> Well I still get a very slow mergesegs:
>>
>> [root@kunzon nutch]# tail -f nutch-mergesegs-kunzon.com.log
>> 050919 171351 Processed 120000 records (1146.5918 rec/s)
>> 050919 171408 Processed 140000 records (1158.2788 rec/s)
>> 050919 171428 Processed 160000 records (1019.8358 rec/s)
>> 050919 171451 Processed 180000 records (879.2368 rec/s)
>> 050919 171510 Processed 200000 records (1054.9636 rec/s)
>> 050919 171528 Processed 220000 records (1069.2328 rec/s)
>> 050919 171547 Processed 240000 records (1099.868 rec/s)
>> 050919 171832 - creating next subindex...
>> 050919 174512 Processed 260000 records (11.328647 rec/s)
>> 050919 200315 Processed 280000 records (2.4145627 rec/s)
>>
>> It is falling to 2.4 records per second ...
>>
>> Can somebody help me please? 400K records is only the beginning; what
>> will happen when it is 4M?
>
>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>> corrupt, using only 128115 entries.
>
> This is the real reason for the slowdown. Technically speaking, a
> partially corrupted MapFile is readable and usable. However, random
> access is orders of magnitude slower...
>
> The fix is simple: delete the "index" files in each subdirectory of
> the 20050916014401 segment. Then run "nutch segread -fix
> 20050916014401". Then re-run mergesegs - it will now work at full speed.
>
> NB. if there are any more segments which give you this warning, do the
> same before you run mergesegs.
>
Re: indexing is very very very slow
Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
> Hi,
>
> Well I still get a very slow mergesegs:
>
> [root@kunzon nutch]# tail -f nutch-mergesegs-kunzon.com.log
> 050919 171351 Processed 120000 records (1146.5918 rec/s)
> 050919 171408 Processed 140000 records (1158.2788 rec/s)
> 050919 171428 Processed 160000 records (1019.8358 rec/s)
> 050919 171451 Processed 180000 records (879.2368 rec/s)
> 050919 171510 Processed 200000 records (1054.9636 rec/s)
> 050919 171528 Processed 220000 records (1069.2328 rec/s)
> 050919 171547 Processed 240000 records (1099.868 rec/s)
> 050919 171832 - creating next subindex...
> 050919 174512 Processed 260000 records (11.328647 rec/s)
> 050919 200315 Processed 280000 records (2.4145627 rec/s)
>
> It is falling to 2.4 records per second ...
>
> Can somebody help me please? 400K records is only the beginning; what
> will happen when it is 4M?
>> >050917 043332 - data in segment index/segments/20050916014401 is
>> corrupt, using only 128115 entries.
This is the real reason for the slowdown. Technically speaking, a
partially corrupted MapFile is readable and usable. However, random
access is orders of magnitude slower...
The fix is simple: delete the "index" files in each subdirectory of the
20050916014401 segment. Then run "nutch segread -fix 20050916014401".
Then re-run mergesegs - it will now work at full speed.
NB. if there are any more segments which give you this warning, do the
same before you run mergesegs.
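To see why a missing index file makes random access so slow, here is a toy model (illustrative Python only, not Hadoop/Nutch's actual MapFile code): a MapFile pairs a sorted data file with a sparse index of keys and offsets, and without the index every lookup degenerates into a scan from the start of the data:

```python
import bisect

# Toy model of a MapFile: sorted (key, value) records plus a sparse index
# of every 128th key and its record position. All names are illustrative.
records = [("key%06d" % i, "value%d" % i) for i in range(100000)]
index_keys = [records[i][0] for i in range(0, len(records), 128)]
index_pos = list(range(0, len(records), 128))

def lookup_with_index(key):
    # Seek to the nearest indexed key, then scan at most 128 records.
    i = bisect.bisect_right(index_keys, key) - 1
    start = index_pos[i] if i >= 0 else 0
    for k, v in records[start:start + 128]:
        if k == key:
            return v

def lookup_without_index(key):
    # No index: scan from the beginning on every lookup.
    for k, v in records:
        if k == key:
            return v

print(lookup_with_index("key099999") == lookup_without_index("key099999"))  # True
```

Both lookups return the same value, but the indexed path touches a bounded slice while the unindexed path walks the whole file, which is why a segment with deleted index files is readable yet orders of magnitude slower until segread -fix rebuilds them.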
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Updated FAQ
Posted by Gal Nitzan <gn...@usa.net>.
Hello,
I have updated the FAQ: http://wiki.apache.org/nutch/FAQ?action=show
I've just started, so please be patient.
Please let me know if I'm headed in the right direction, and let me know
of any additions/changes/deletions required.
Regards,
Gal
Re: JDK 1.5
Posted by Gal Nitzan <gn...@usa.net>.
Jay Pound wrote:
> I'm running JDK 1.5 on Windows 64-bit with no problems; in fact it's very fast/stable.
> ----- Original Message -----
> From: "Piotr Kosiorowski" <pk...@gmail.com>
> To: <nu...@lucene.apache.org>
> Sent: Tuesday, September 20, 2005 5:31 AM
> Subject: Re: JDK 1.5
>
>
> I was running it on JDK 1.5 on AMD64 linux for more than half a year and had
> no such problems.
> Regards
> Piotr
>
> On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
>> Hi,
>>
>> I have tried running Nutch with JDK 1.5 and got very weird results,
>> like the fetcher hanging and the merge hanging.
>>
>> After that I switched to 1.4 and all went well.
>>
>> Is it just a matter of re-build?
>>
>> Regards,
>>
>> Gal
>>
>>
>
>
> Thanks for the replies. Did you have to re-build, or did it work out of the box?
>
Gal
> .
>
>
Re: JDK 1.5
Posted by Jay Pound <we...@poundwebhosting.com>.
I'm running JDK 1.5 on Windows 64-bit with no problems; in fact it's very fast/stable.
----- Original Message -----
From: "Piotr Kosiorowski" <pk...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Tuesday, September 20, 2005 5:31 AM
Subject: Re: JDK 1.5
I was running it on JDK 1.5 on AMD64 linux for more than half a year and had
no such problems.
Regards
Piotr
On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
> Hi,
>
> I have tried running Nutch with JDK 1.5 and got very weird results,
> like the fetcher hanging and the merge hanging.
>
> After that I switched to 1.4 and all went well.
>
> Is it just a matter of re-build?
>
> Regards,
>
> Gal
>
Re: JDK 1.5
Posted by Piotr Kosiorowski <pk...@gmail.com>.
I was running it on JDK 1.5 on AMD64 linux for more than half a year and had
no such problems.
Regards
Piotr
On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
> Hi,
>
> I have tried running Nutch with JDK 1.5 and got very weird results,
> like the fetcher hanging and the merge hanging.
>
> After that I switched to 1.4 and all went well.
>
> Is it just a matter of re-build?
>
> Regards,
>
> Gal
>
JDK 1.5
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
I have tried running Nutch with JDK 1.5 and got very weird results,
like the fetcher hanging and the merge hanging.
After that I switched to 1.4 and all went well.
Is it just a matter of re-build?
Regards,
Gal
Re: Is it possible to change the list of common words without crawling
everything again
Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
> This question was in the FAQ unanswered.
>
> Can someone answer that please. I shall put it in the FAQ
Yes. The list of common words is used only when indexing and searching,
and not during other steps. So, if you change the list of common words,
there is no need to re-fetch the content; you just need to re-create the
segment indexes to reflect the changes.
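A minimal sketch of that behaviour (hypothetical analyzer, not Nutch's actual code): common words are dropped at analysis time, which runs during indexing and query parsing, never during fetching, so re-indexing alone picks up a changed list:

```python
# Hypothetical analyzer: common ("stop") words are removed while tokens
# are written to the index, so changing the list only requires
# re-indexing the already-fetched content, not re-crawling it.
common_words = {"the", "a", "an", "of", "and"}

def analyze(text):
    return [t for t in text.lower().split() if t not in common_words]

doc = "The history of the Nutch crawler"
print(analyze(doc))  # ['history', 'nutch', 'crawler']
```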
--
Best regards,
Andrzej Bialecki <><
Re: Is it possible to change the list of common words without crawling
everything again
Posted by Gal Nitzan <gn...@usa.net>.
Tim Archambault wrote:
> How can I unsubscribe? Sorry to bother.
>
>
> To remove your address from the list, send a message to:
> <nu...@lucene.apache.org>
Re: Is it possible to change the list of common words without crawling everything again
Posted by Tim Archambault <jo...@gmail.com>.
How can I unsubscribe? Sorry to bother.
On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
> This question was in the FAQ unanswered.
>
> Can someone answer that please. I shall put it in the FAQ
>
> Regards,
>
> Gal
>
Is it possible to change the list of common words without crawling
everything again
Posted by Gal Nitzan <gn...@usa.net>.
This question was in the FAQ unanswered.
Can someone answer that please. I shall put it in the FAQ
Regards,
Gal
Re: Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Gal Nitzan <gn...@usa.net>.
Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Andrzej Bialecki wrote:
>>
>>> Hi all,
>>>
>>>> Well I still get a very slow mergesegs:
>>>
>>>
>>>>>
>>>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>>>> corrupt, using only 128115 entries.
>>>
>>>
>>> This is a common and recurring problem. What's worse is that an
>>> unfixed segment like this will destroy the performance of the
>>> search, too, not just the backend pre-processing.
>>>
>>> I propose to modify MapFile.Reader so that it refuses to open such
>>> file, and throws an Exception, unless a force=true flag is given.
>>> Tools that want to ignore this can do so, but all other tools will
>>> be able to make a conscious decision whether to fix it first, or to
>>> use it as such.
>>>
>>> If there are no objections, I will change it in the trunk/ in a
>>> couple of days.
>>>
>> Hi,
>>
>> I think it would be very confusing to old users as well as new users.
>> Throwing an exception when the segment corruption is actually trivial
>> and can be fixed easily (now that I know how to do that :-)...
>
> You missed my point - I proposed that we change the API. On the
> surface, command-line tools would behave like now, with the benefit
> that segment corruption would be fixed automatically by those tools
> that require clean segments - unless _prevented_ by a cmd-line switch.
> So, this is just to improve the default behaviour, and not to complain
> even louder than now.
>
>>
>> Instead I would like to suggest building a FAQ for Nutch.
>>
>> I would like to volunteer to build at least the skeleton for it.
>>
>> As a new user to Nutch I have run into so many problems, and apart from
>> this list there was not much information elsewhere. So, I have all the
>> answers fresh in my mind, and with some help from the rest of the
>> nutch-users it can be done without too much of a hassle.
>>
>> Besides, many people on this list contribute in their free time; I
>> would be happy to contribute to the success of this project.
>
> This is always welcome, and there is already a place where we collect
> such info. Please see the Nutch Wiki, and feel free to enhance or add
> new content there.
>
You are right, I did miss your point. And now that I understand :-) I
think it is a very good idea.
Yes, I found the FAQ hiding in the wiki and I have started working on it.
Gal
Re: Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Matthias Jaekle <ja...@eventax.de>.
> You missed my point - I proposed that we change the API. On the surface,
> command-line tools would behave like now, with the benefit that segment
> corruption would be fixed automatically by those tools that require
> clean segments - unless _prevented_ by a cmd-line switch. So, this is
> just to improve the default behaviour, and not to complain even louder
> than now.
That would be great!
Matthias
Re: Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
> Andrzej Bialecki wrote:
>
>> Hi all,
>>
>>> Well I still get a very slow mergesegs:
>>
>>
>>>>
>>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>>> corrupt, using only 128115 entries.
>>
>>
>> This is a common and recurring problem. What's worse is that an
>> unfixed segment like this will destroy the performance of the search,
>> too, not just the backend pre-processing.
>>
>> I propose to modify MapFile.Reader so that it refuses to open such
>> file, and throws an Exception, unless a force=true flag is given.
>> Tools that want to ignore this can do so, but all other tools will be
>> able to make a conscious decision whether to fix it first, or to use
>> it as such.
>>
>> If there are no objections, I will change it in the trunk/ in a couple
>> of days.
>>
> Hi,
>
> I think it would be very confusing to old users as well as new users.
> Throwing an exception when the segment corruption is actually trivial
> and can be fixed easily (now that I know how to do that :-)...
You missed my point - I proposed that we change the API. On the surface,
command-line tools would behave like now, with the benefit that segment
corruption would be fixed automatically by those tools that require
clean segments - unless _prevented_ by a cmd-line switch. So, this is
just to improve the default behaviour, and not to complain even louder
than now.
>
> Instead I would like to suggest building a FAQ for Nutch.
>
> I would like to volunteer to build at least the skeleton for it.
>
> As a new user to Nutch I have run into so many problems, and apart from this
> list there was not much information elsewhere. So, I have all the answers
> fresh in my mind, and with some help from the rest of the nutch-users it
> can be done without too much of a hassle.
>
> Besides, many people on this list contribute in their free time; I would
> be happy to contribute to the success of this project.
This is always welcome, and there is already a place where we collect
such info. Please see the Nutch Wiki, and feel free to enhance or add
new content there.
--
Best regards,
Andrzej Bialecki <><
regarding gal's faq proposal
Posted by gekkokid <me...@gekkokid.org.uk>.
Is there a place where we can search the mailing list? That could be a
short-term solution.
_gk
----- Original Message -----
From: "Gal Nitzan" <gn...@usa.net>
To: <nu...@lucene.apache.org>
Sent: Monday, September 19, 2005 11:37 PM
Subject: Re: Proposal: refuse to open partially trunc. MapFile, unless
forced (Re: indexing is very very very slow)
> Andrzej Bialecki wrote:
>> Hi all,
>>
>>> Well I still get a very slow mergesegs:
>>
>>>>
>>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>>> corrupt, using only 128115 entries.
>>
>> This is a common and recurring problem. What's worse is that an unfixed
>> segment like this will destroy the performance of the search, too, not
>> just the backend pre-processing.
>>
>> I propose to modify MapFile.Reader so that it refuses to open such file,
>> and throws an Exception, unless a force=true flag is given. Tools that
>> want to ignore this can do so, but all other tools will be able to make a
>> conscious decision whether to fix it first, or to use it as such.
>>
>> If there are no objections, I will change it in the trunk/ in a couple of
>> days.
>>
> Hi,
>
> I think it would be very confusing to old users as well as new users.
> Throwing an exception when the segment corruption is actually trivial
> and can be fixed easily (now that I know how to do that :-)...
>
> Instead I would like to suggest building a FAQ for Nutch.
>
> I would like to volunteer to build at least the skeleton for it.
>
> As a new user to Nutch I have run into so many problems, and apart from this
> list there was not much information elsewhere. So, I have all the answers
> fresh in my mind, and with some help from the rest of the nutch-users it
> can be done without too much of a hassle.
>
> Besides, many people on this list contribute in their free time; I would
> be happy to contribute to the success of this project.
>
> Regards,
>
> Gal
>
>
>
>
>
Re: Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Gal Nitzan <gn...@usa.net>.
Andrzej Bialecki wrote:
> Hi all,
>
>> Well I still get a very slow mergesegs:
>
>>>
>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>> corrupt, using only 128115 entries.
>
> This is a common and recurring problem. What's worse is that an
> unfixed segment like this will destroy the performance of the search,
> too, not just the backend pre-processing.
>
> I propose to modify MapFile.Reader so that it refuses to open such
> file, and throws an Exception, unless a force=true flag is given.
> Tools that want to ignore this can do so, but all other tools will be
> able to make a conscious decision whether to fix it first, or to use
> it as such.
>
> If there are no objections, I will change it in the trunk/ in a couple
> of days.
>
Hi,
I think it would be very confusing to old users as well as new users.
Throwing an exception when the segment corruption is actually trivial
and can be fixed easily (now that I know how to do that :-)...
Instead I would like to suggest building a FAQ for Nutch.
I would like to volunteer to build at least the skeleton for it.
As a new user to Nutch I have run into so many problems, and apart from
this list there was not much information elsewhere. So, I have all the
answers fresh in my mind, and with some help from the rest of the
nutch-users it can be done without too much of a hassle.
Besides, many people on this list contribute in their free time; I would
be happy to contribute to the success of this project.
Regards,
Gal
Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Andrzej Bialecki <ab...@getopt.org>.
Hi all,
> Well I still get a very slow mergesegs:
>>
>> >050917 043332 - data in segment index/segments/20050916014401 is
>> corrupt, using only 128115 entries.
This is a common and recurring problem. What's worse is that an unfixed
segment like this will destroy the performance of the search, too, not
just the backend pre-processing.
I propose to modify MapFile.Reader so that it refuses to open such file,
and throws an Exception, unless a force=true flag is given. Tools that
want to ignore this can do so, but all other tools will be able to make
a conscious decision whether to fix it first, or to use it as such.
If there are no objections, I will change it in the trunk/ in a couple
of days.
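In rough pseudocode terms (a Python sketch with hypothetical names, not the real MapFile.Reader API), the proposed behaviour would look like this:

```python
# Sketch of the proposed API change: refuse to open a partially truncated
# map file unless the caller explicitly forces it. Names are hypothetical.
class CorruptMapFileError(Exception):
    pass

def open_mapfile(entries_expected, entries_readable, force=False):
    """Return the number of usable entries, or raise if the file is truncated."""
    if entries_readable < entries_expected and not force:
        raise CorruptMapFileError(
            "data is corrupt; fix the segment first or open with force=True")
    return entries_readable

# The segment from this thread: 128116 records, only 128115 readable.
print(open_mapfile(128116, 128115, force=True))  # 128115
```

Tools that genuinely want to read a damaged file can pass the force flag; everything else fails fast instead of silently running orders of magnitude slower.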
--
Best regards,
Andrzej Bialecki <><
Re: indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
Well I still get a very slow mergesegs:
[root@kunzon nutch]# tail -f nutch-mergesegs-kunzon.com.log
050919 171351 Processed 120000 records (1146.5918 rec/s)
050919 171408 Processed 140000 records (1158.2788 rec/s)
050919 171428 Processed 160000 records (1019.8358 rec/s)
050919 171451 Processed 180000 records (879.2368 rec/s)
050919 171510 Processed 200000 records (1054.9636 rec/s)
050919 171528 Processed 220000 records (1069.2328 rec/s)
050919 171547 Processed 240000 records (1099.868 rec/s)
050919 171832 - creating next subindex...
050919 174512 Processed 260000 records (11.328647 rec/s)
050919 200315 Processed 280000 records (2.4145627 rec/s)
It is falling to 2.4 records per second ...
Can somebody help me please? 400K records is only the beginning; what
will happen when it is 4M?
Regards,
Gal
EM wrote:
> 'segslice' all partial segments into new ones prior to merging.
>
> Gal Nitzan wrote:
>
>> Hi Doug,
>>
>> Thank you for the prompt reply.
>>
>> Well, things got much, much faster (I guess about 40% faster), but it
>> seems that something got really corrupted. Everything gets stuck
>> after 40K records.
>
> >050917 043332 - data in segment index/segments/20050916014401 is
> corrupt, using only 128115 entries.
>
> .
>
Re: indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
Hi EM,
After sending that email I looked at segslice, and it worked perfectly!
Thanks,
Gal
EM wrote:
> 'segslice' all partial segments into new ones prior to merging.
>
> Gal Nitzan wrote:
>
>> Hi Doug,
>>
>> Thank you for the prompt reply.
>>
>> Well, things got much, much faster (I guess about 40% faster), but it
>> seems that something got really corrupted. Everything gets stuck
>> after 40K records.
>
> >050917 043332 - data in segment index/segments/20050916014401 is
> corrupt, using only 128115 entries.
>
> .
>
Re: indexing is very very very slow
Posted by EM <em...@cpuedge.com>.
'segslice' all partial segments into new ones prior to merging.
Gal Nitzan wrote:
> Hi Doug,
>
> Thank you for the prompt reply.
>
> Well, things got much, much faster (I guess about 40% faster), but it
> seems that something got really corrupted. Everything gets stuck after
> 40K records.
>050917 043332 - data in segment index/segments/20050916014401 is
corrupt, using only 128115 entries.
Re: indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
Hi Doug,
Thank you for the prompt reply.
Well, things got much, much faster (I guess about 40% faster), but it
seems that something got really corrupted. Everything gets stuck after
40K records.
[root@kunzon nutch]# bin/nutch mergesegs -dir index/segments/ -i -ds
050917 043331 parsing file:/nutch/conf/nutch-default.xml
050917 043331 parsing file:/nutch/conf/nutch-site.xml
050917 043331 No FS indicated, using default:local
050917 043331 * Opening 2 segments:
050917 043332 - segment 20050916013342: 42287 records.
050917 043332 - data in segment index/segments/20050916014401 is
corrupt, using only 128115 entries.
050917 043332 - segment 20050916014401: 128116 records.
050917 043332 * TOTAL 170403 input records in 2 segments.
050917 043332 * Creating master index...
050917 043345 Processed 20000 records (1613.5538 rec/s)
050917 043354 Processed 40000 records (2113.9414 rec/s)
And that is it. I notice memory is still being consumed, but there is no
apparent activity.
Since I'm really a newbie to Nutch, could you give me a tip on how to
rescue the already-fetched data and remove the corruption from the
segment? I already tried -fix but it didn't help.
Regards,
Gal
Doug Cutting wrote:
> The default for indexer.maxMergeDocs was mistakenly set to 50, which
> can make indexing really slow. Try putting the following in your
> nutch-site.xml:
>
> <property>
> <name>indexer.maxMergeDocs</name>
> <value>2147483647</value>
> </property>
>
> Does that help?
>
> I just fixed this in trunk. We should fix this in the 0.7 release
> branch.
>
> Doug
>
> .
>
Re: indexing is very very very slow
Posted by Doug Cutting <cu...@nutch.org>.
The default for indexer.maxMergeDocs was mistakenly set to 50, which can
make indexing really slow. Try putting the following in your
nutch-site.xml:
<property>
<name>indexer.maxMergeDocs</name>
<value>2147483647</value>
</property>
Does that help?
I just fixed this in trunk. We should fix this in the 0.7 release branch.
Doug
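A back-of-the-envelope model (not Lucene's actual merge logic) illustrates the impact: merged index segments cannot grow past maxMergeDocs documents, so a tiny cap fragments the index into thousands of pieces that every subsequent merge and search must visit:

```python
import math

# Rough model: with maxMergeDocs = m, merged index segments stop growing
# at m documents, so n docs leave roughly ceil(n / m) segments behind.
def index_segments(n_docs, max_merge_docs):
    return math.ceil(n_docs / max_merge_docs)

print(index_segments(400_000, 50))          # 8000 tiny segments
print(index_segments(400_000, 2147483647))  # 1
```

With the default mistakenly set to 50, even the 400K-record crawl from this thread would fragment into thousands of segments, which matches the drastic slowdown reported above.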
indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
I am crawling the web...
My machine:
CPU: 2 x Xeon 2.8 GHz
RAM: 2 GB
HD: RAID, 2 x 160 GB
After fetching (I stopped the fetcher after it halted, i.e. didn't fetch, for
a few hours) I did the following:
1. s1=`ls -d index/segments/2* | tail -1`
2. bin/nutch updatedb index/db/ $s1
the following is the last few lines from the updatedb
--------------------------------------------------------------------
050916 135308 Processing document 127000
050916 135316 Processing document 128000
050916 135317 Unexpected EOF in: index/segments/20050916014401/fetcher
at entry #128116. Ignoring.
050916 135317 Finishing update
050916 135456 Processing pagesByURL: Sorted 3083939 instructions in
99.536 seconds.
050916 135456 Processing pagesByURL: Sorted 30983.15182446552
instructions/second
050916 135559 Processing pagesByURL: Merged to new DB containing 774610
records in 35.355 seconds
050916 135559 Processing pagesByURL: Merged 21909.489464007922
records/second
050916 135611 Processing pagesByMD5: Sorted 803182 instructions in
11.654 seconds.
050916 135611 Processing pagesByMD5: Sorted 68918.99776900635
instructions/second
050916 135627 Processing pagesByMD5: Merged to new DB containing 774610
records in 14.216 seconds
050916 135627 Processing pagesByMD5: Merged 54488.604389420376
records/second
050916 135633 Processing linksByMD5: Sorted 689997 instructions in 6.038
seconds.
050916 135633 Processing linksByMD5: Sorted 114275.75356078171
instructions/second
050916 135648 Processing linksByMD5: Merged to new DB containing 776849
records in 13.624 seconds
050916 135648 Processing linksByMD5: Merged 57020.62536699941 records/second
050916 135655 Processing linksByURL: Sorted 584963 instructions in 7.056
seconds.
050916 135655 Processing linksByURL: Sorted 82902.91950113379
instructions/second
050916 135711 Processing linksByURL: Merged to new DB containing 776849
records in 14.533 seconds
050916 135711 Processing linksByURL: Merged 53454.13885639579 records/second
050916 135718 Processing linksByMD5: Sorted 671867 instructions in 6.732
seconds.
050916 135718 Processing linksByMD5: Sorted 99801.99049316696
instructions/second
050916 135729 Processing linksByMD5: Merged to new DB containing 776849
records in 9.999 seconds
050916 135729 Processing linksByMD5: Merged 77692.66926692669 records/second
050916 135744 Update finished
--------------------------------------------------------------------
As you can see, updatedb went fine even though it encountered the fetcher's
interruption.
3. bin/nutch mergesegs -dir index/segments/ -i -ds
from here on is the problem
--------------------------------------------------------------------
050916 141720 parsing file:/nutch/conf/nutch-default.xml
050916 141720 parsing file:/nutch/conf/nutch-site.xml
050916 141720 No FS indicated, using default:local
050916 141720 * Opening 2 segments:
050916 141720 - segment 20050916013342: 42287 records.
050916 141721 - data in segment index/segments/20050916014401 is
corrupt, using only 128115 entries.
050916 141722 - segment 20050916014401: 128116 records.
050916 141722 * TOTAL 170403 input records in 2 segments.
050916 141722 * Creating master index...
050916 141737 Processed 20000 records (1311.9916 rec/s)
050916 141751 Processed 40000 records (1394.0197 rec/s)
050916 154424 Processed 60000 records (3.851173 rec/s)
--------------------------------------------------------------------
As you can see in the last line, the indexer is processing 3.8 records per
second, which means indexing will take far too long.
Anybody got a clue or a hint, please?
Regards,
Gal