Posted to user@nutch.apache.org by ad...@interfree.it on 2005/09/16 12:13:26 UTC
crawl-urlfilter
Hi,
thank you for your hints, but I didn't give you the following information:
I modified the file crawl-urlfilter.txt as follows:
#start crawl-urlfilter
# skip file:, ftp:, & mailto: urls
-^(ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
# accept anything else
+.
#end crawl-urlfilter
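For reference, Nutch's RegexURLFilter applies these rules in order and the first match wins. As an illustrative sanity check (plain Python, not a Nutch tool; the URLs are made up), the rules above can be simulated, and they do accept a servlet-style URL such as MyServlet?menu=1:

```python
import re

# The ordered (sign, pattern) rules from the crawl-urlfilter.txt above.
rules = [
    ("-", r"^(ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls"
          r"|gz|rpm|tgz|mov|MOV|exe)$"),
    ("-", r"[*!@]"),
    ("+", r"."),
]

def accepted(url):
    # First matching rule wins, as in Nutch's RegexURLFilter.
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: reject

print(accepted("http://example.com/MyServlet?menu=1"))  # True
print(accepted("http://example.com/logo.gif"))          # False
print(accepted("mailto:someone@example.com"))           # False
```

Since the filter itself would accept the servlet URL, the problem may lie elsewhere, for example in how the frame's relative link is resolved or parsed.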
I started Nutch with this command line:
bin/nutch crawl urls -dir /home/paul/nutch-searcher.dir -depth 3 -thread 8 >& crawl.log
In the file "urls" there is the url of the following page:
<HTML>
<HEAD>
<TITLE> TitleOfSite </TITLE>
</HEAD>
<FRAMESET ROWS="14%, *">
<FRAME NORESIZE NAME="MENU" SRC="MyServlet?menu=1" SCROLLING="AUTO">
<FRAME NAME="PAGE" SRC="../welcome.html" SCROLLING="AUTO">
</FRAMESET>
</HTML>
Nutch crawls and fetches "welcome.html" but doesn't work with "MyServlet?menu=1".
The servlet page "MyServlet?menu=1" shows some links, but according to the log
Nutch doesn't fetch any of those links.
I hope the question is clear and am looking forward to receiving your answer.
Adriano
Please help me!
Re: indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
Doug,
Should indexer.maxMergeDocs be set to the same value, 2147483647?
Thanks,
Gal
Doug Cutting wrote:
> The default for indexer.maxMergeDocs was mistakenly set to 50, which
> can make indexing really slow. Try putting the following in your
> nutch-site.xml:
>
> <property>
> <name>indexer.maxMergeDocs</name>
> <value>2147483647</value>
> </property>
>
> Does that help?
>
> I just fixed this in trunk. We should fix this in the 0.7 release
> branch.
>
> Doug
>
> .
>
Re: indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
EM wrote:
> Andrzej Bialecki wrote:
>
>>
>> The fix is simple: delete the "index" files in each subdirectory of
>> the 20050916014401 segment. Then run "nutch segread -fix
>> 20050916014401". Then re-run mergesegs - it will now work at full speed.
>>
>> NB. if there are any more segments which give you this warning, do
>> the same before you run mergesegs.
>>
> "segread -fix" doesn't work, "-segslice" solves this though.
>
>
>
> .
>
Yes segslice solved it.
Thanks.
Re: indexing is very very very slow
Posted by EM <em...@cpuedge.com>.
Andrzej Bialecki wrote:
>
> The fix is simple: delete the "index" files in each subdirectory of
> the 20050916014401 segment. Then run "nutch segread -fix
> 20050916014401". Then re-run mergesegs - it will now work at full speed.
>
> NB. if there are any more segments which give you this warning, do the
> same before you run mergesegs.
>
"segread -fix" doesn't work, "-segslice" solves this though.
Re: indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
Hi Andrzej,
Thank you for your reply.
I have tried twice but the segment is not being fixed:
[root@kunzon nutch]# find index/segments/20050919092227/ -name index -print
index/segments/20050919092227/fetcher/index
index/segments/20050919092227/parse_text/index
index/segments/20050919092227/content/index
index/segments/20050919092227/parse_data/index
[root@kunzon nutch]# rm -rf index/segments/20050919092227/fetcher/index
[root@kunzon nutch]# rm -rf index/segments/20050919092227/parse_text/index
[root@kunzon nutch]# rm -rf index/segments/20050919092227/content/index
[root@kunzon nutch]# rm -rf index/segments/20050919092227/parse_data/index
[root@kunzon nutch]# bin/nutch segread index/segments/20050919092227 -fix
050920 031844 parsing file:/nutch/conf/nutch-default.xml
050920 031844 parsing file:/nutch/conf/nutch-site.xml
050920 031845 No FS indicated, using default:local
050920 031849 - fixed fetcher
050920 031932 - fixed content
050920 031952 - fixed parse_data
050920 032006 - fixed parse_text
050920 032006 Finished fixing 20050919092227
050920 032006 - data in segment index/segments/20050919092227 is
corrupt, using only 91212 entries.
Thanks,
Gal
Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Hi,
>>
>> Well I still get a very slow mergesegs:
>>
>> [root@kunzon nutch]# tail -f nutch-mergesegs-kunzon.com.log
>> 050919 171351 Processed 120000 records (1146.5918 rec/s)
>> 050919 171408 Processed 140000 records (1158.2788 rec/s)
>> 050919 171428 Processed 160000 records (1019.8358 rec/s)
>> 050919 171451 Processed 180000 records (879.2368 rec/s)
>> 050919 171510 Processed 200000 records (1054.9636 rec/s)
>> 050919 171528 Processed 220000 records (1069.2328 rec/s)
>> 050919 171547 Processed 240000 records (1099.868 rec/s)
>> 050919 171832 - creating next subindex...
>> 050919 174512 Processed 260000 records (11.328647 rec/s)
>> 050919 200315 Processed 280000 records (2.4145627 rec/s)
>>
>> It is falling to 2.4 records per second ...
>>
>> Can somebody help me please? 400K records is only the beginning; what
>> will happen when it is 4M?
>
>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>> corrupt, using only 128115 entries.
>
> This is the real reason for the slowdown. Technically speaking, a
> partially corrupted MapFile is readable and usable. However, random
> access is orders of magnitude slower...
>
> The fix is simple: delete the "index" files in each subdirectory of
> the 20050916014401 segment. Then run "nutch segread -fix
> 20050916014401". Then re-run mergesegs - it will now work at full speed.
>
> NB. if there are any more segments which give you this warning, do the
> same before you run mergesegs.
>
Re: indexing is very very very slow
Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
> Hi,
>
> Well I still get a very slow mergesegs:
>
> [root@kunzon nutch]# tail -f nutch-mergesegs-kunzon.com.log
> 050919 171351 Processed 120000 records (1146.5918 rec/s)
> 050919 171408 Processed 140000 records (1158.2788 rec/s)
> 050919 171428 Processed 160000 records (1019.8358 rec/s)
> 050919 171451 Processed 180000 records (879.2368 rec/s)
> 050919 171510 Processed 200000 records (1054.9636 rec/s)
> 050919 171528 Processed 220000 records (1069.2328 rec/s)
> 050919 171547 Processed 240000 records (1099.868 rec/s)
> 050919 171832 - creating next subindex...
> 050919 174512 Processed 260000 records (11.328647 rec/s)
> 050919 200315 Processed 280000 records (2.4145627 rec/s)
>
> It is falling to 2.4 records per second ...
>
> Can somebody help me please? 400K records is only the beginning; what
> will happen when it is 4M?
>> >050917 043332 - data in segment index/segments/20050916014401 is
>> corrupt, using only 128115 entries.
This is the real reason for the slowdown. Technically speaking, a
partially corrupted MapFile is readable and usable. However, random
access is orders of magnitude slower...
The fix is simple: delete the "index" files in each subdirectory of the
20050916014401 segment. Then run "nutch segread -fix 20050916014401".
Then re-run mergesegs - it will now work at full speed.
NB. if there are any more segments which give you this warning, do the
same before you run mergesegs.
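To see why a missing index file makes random access so slow, here is a toy model (illustrative Python only, not Hadoop/Nutch's actual MapFile code): a MapFile pairs a sorted data file with a sparse index of keys and offsets, and without the index every lookup degenerates into a scan from the start of the data:

```python
import bisect

# Toy model of a MapFile: sorted (key, value) records plus a sparse index
# of every 128th key and its record position. All names are illustrative.
records = [("key%06d" % i, "value%d" % i) for i in range(100000)]
index_keys = [records[i][0] for i in range(0, len(records), 128)]
index_pos = list(range(0, len(records), 128))

def lookup_with_index(key):
    # Seek to the nearest indexed key, then scan at most 128 records.
    i = bisect.bisect_right(index_keys, key) - 1
    start = index_pos[i] if i >= 0 else 0
    for k, v in records[start:start + 128]:
        if k == key:
            return v

def lookup_without_index(key):
    # No index: scan from the beginning on every lookup.
    for k, v in records:
        if k == key:
            return v

print(lookup_with_index("key099999") == lookup_without_index("key099999"))  # True
```

Both lookups return the same value, but the indexed path touches a bounded slice while the unindexed path walks the whole file, which is why a segment with deleted index files is readable yet orders of magnitude slower until segread -fix rebuilds them.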
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Updated FAQ
Posted by Gal Nitzan <gn...@usa.net>.
Hello,
I have updated the FAQ: http://wiki.apache.org/nutch/FAQ?action=show
I've just started, so please be patient.
Please let me know if I'm headed in the right direction, and let me know
of any additions/changes/deletions required.
Regards,
Gal
Re: JDK 1.5
Posted by Gal Nitzan <gn...@usa.net>.
Jay Pound wrote:
> I'm running JDK 1.5 on Windows 64-bit with no problems; in fact it's very fast/stable.
> ----- Original Message -----
> From: "Piotr Kosiorowski" <pk...@gmail.com>
> To: <nu...@lucene.apache.org>
> Sent: Tuesday, September 20, 2005 5:31 AM
> Subject: Re: JDK 1.5
>
>
> I was running it on JDK 1.5 on AMD64 linux for more than half a year and had
> no such problems.
> Regards
> Piotr
>
> On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
>> Hi,
>>
>> I have tried running Nutch with JDK 1.5 and got very weird results,
>> like the fetcher hanging and the merge hanging.
>>
>> After that I switched to 1.4 and all went well.
>>
>> Is it just a matter of re-build?
>>
>> Regards,
>>
>> Gal
>>
>>
>
>
> Thanks for the replies. Did you have to re-build, or did it work out of the box?
>
Gal
> .
>
>
Re: JDK 1.5
Posted by Jay Pound <we...@poundwebhosting.com>.
I'm running JDK 1.5 on Windows 64-bit with no problems; in fact it's very fast/stable.
----- Original Message -----
From: "Piotr Kosiorowski" <pk...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Tuesday, September 20, 2005 5:31 AM
Subject: Re: JDK 1.5
I was running it on JDK 1.5 on AMD64 linux for more than half a year and had
no such problems.
Regards
Piotr
On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
> Hi,
>
> I have tried running Nutch with JDK 1.5 and got very weird results,
> like the fetcher hanging and the merge hanging.
>
> After that I switched to 1.4 and all went well.
>
> Is it just a matter of re-build?
>
> Regards,
>
> Gal
>
Re: JDK 1.5
Posted by Piotr Kosiorowski <pk...@gmail.com>.
I was running it on JDK 1.5 on AMD64 linux for more than half a year and had
no such problems.
Regards
Piotr
On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
> Hi,
>
> I have tried running Nutch with JDK 1.5 and got very weird results,
> like the fetcher hanging and the merge hanging.
>
> After that I switched to 1.4 and all went well.
>
> Is it just a matter of re-build?
>
> Regards,
>
> Gal
>
JDK 1.5
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
I have tried running Nutch with JDK 1.5 and got very weird results,
like the fetcher hanging and the merge hanging.
After that I switched to 1.4 and all went well.
Is it just a matter of re-build?
Regards,
Gal
Re: Is it possible to change the list of common words without crawling
everything again
Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
> This question was in the FAQ unanswered.
>
> Can someone answer that please. I shall put it in the FAQ
Yes. The list of common words is used only when indexing and searching,
and not during other steps. So, if you change the list of common words,
there is no need to re-fetch the content; you just need to re-create the
segment indexes to reflect the changes.
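A minimal sketch of that behaviour (hypothetical analyzer, not Nutch's actual code): common words are dropped at analysis time, which runs during indexing and query parsing, never during fetching, so re-indexing alone picks up a changed list:

```python
# Hypothetical analyzer: common ("stop") words are removed while tokens
# are written to the index, so changing the list only requires
# re-indexing the already-fetched content, not re-crawling it.
common_words = {"the", "a", "an", "of", "and"}

def analyze(text):
    return [t for t in text.lower().split() if t not in common_words]

doc = "The history of the Nutch crawler"
print(analyze(doc))  # ['history', 'nutch', 'crawler']
```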
--
Best regards,
Andrzej Bialecki <><
Re: Is it possible to change the list of common words without crawling
everything again
Posted by Gal Nitzan <gn...@usa.net>.
Tim Archambault wrote:
> How can I unsubscribe? Sorry to bother.
>
>
> To remove your address from the list, send a message to:
> <nu...@lucene.apache.org>
Re: Is it possible to change the list of common words without crawling everything again
Posted by Tim Archambault <jo...@gmail.com>.
How can I unsubscribe? Sorry to bother.
On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
> This question was in the FAQ unanswered.
>
> Can someone answer that please. I shall put it in the FAQ
>
> Regards,
>
> Gal
>
Is it possible to change the list of common words without crawling
everything again
Posted by Gal Nitzan <gn...@usa.net>.
This question was in the FAQ unanswered.
Can someone answer that please. I shall put it in the FAQ
Regards,
Gal
Re: Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Gal Nitzan <gn...@usa.net>.
Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Andrzej Bialecki wrote:
>>
>>> Hi all,
>>>
>>>> Well I still get a very slow mergesegs:
>>>
>>>
>>>>>
>>>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>>>> corrupt, using only 128115 entries.
>>>
>>>
>>> This is a common and recurring problem. What's worse is that an
>>> unfixed segment like this will destroy the performance of the
>>> search, too, not just the backend pre-processing.
>>>
>>> I propose to modify MapFile.Reader so that it refuses to open such
>>> file, and throws an Exception, unless a force=true flag is given.
>>> Tools that want to ignore this can do so, but all other tools will
>>> be able to make a conscious decision whether to fix it first, or to
>>> use it as such.
>>>
>>> If there are no objections, I will change it in the trunk/ in a
>>> couple of days.
>>>
>> Hi,
>>
>> I think it would be very confusing to old users as well as new users.
>> Throwing an exception when the segment corruption is actually trivial
>> and can be fixed easily (now that I know how to do that :-)...
>
> You missed my point - I proposed that we change the API. On the
> surface, command-line tools would behave like now, with the benefit
> that segment corruption would be fixed automatically by those tools
> that require clean segments - unless _prevented_ by a cmd-line switch.
> So, this is just to improve the default behaviour, and not to complain
> even louder than now.
>
>>
>> Instead I would like to suggest building a FAQ for Nutch.
>>
>> I would like to volunteer to build at least the skeleton for it.
>>
>> As a new user to Nutch I have run into so many problems, and apart from
>> this list there was not much information elsewhere. So, I have all the
>> answers fresh in my mind, and with some help from the rest of the
>> nutch-users it can be done without too much of a hassle.
>>
>> Besides, many people on this list contribute in their free time; I
>> would be happy to contribute to the success of this project.
>
> This is always welcome, and there is already a place where we collect
> such info. Please see the Nutch Wiki, and feel free to enhance or add
> new content there.
>
You are right, I did miss your point. And now that I understand :-) I
think it is a very good idea.
Yes, I found the FAQ hiding in the wiki and I have started working on it.
Gal
Re: Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Matthias Jaekle <ja...@eventax.de>.
> You missed my point - I proposed that we change the API. On the surface,
> command-line tools would behave like now, with the benefit that segment
> corruption would be fixed automatically by those tools that require
> clean segments - unless _prevented_ by a cmd-line switch. So, this is
> just to improve the default behaviour, and not to complain even louder
> than now.
That would be great!
Matthias
Re: Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
> Andrzej Bialecki wrote:
>
>> Hi all,
>>
>>> Well I still get a very slow mergesegs:
>>
>>
>>>>
>>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>>> corrupt, using only 128115 entries.
>>
>>
>> This is a common and recurring problem. What's worse is that an
>> unfixed segment like this will destroy the performance of the search,
>> too, not just the backend pre-processing.
>>
>> I propose to modify MapFile.Reader so that it refuses to open such
>> file, and throws an Exception, unless a force=true flag is given.
>> Tools that want to ignore this can do so, but all other tools will be
>> able to make a conscious decision whether to fix it first, or to use
>> it as such.
>>
>> If there are no objections, I will change it in the trunk/ in a couple
>> of days.
>>
> Hi,
>
> I think it would be very confusing to old users as well as new users.
> Throwing an exception when the segment corruption is actually trivial
> and can be fixed easily (now that I know how to do that :-)...
You missed my point - I proposed that we change the API. On the surface,
command-line tools would behave like now, with the benefit that segment
corruption would be fixed automatically by those tools that require
clean segments - unless _prevented_ by a cmd-line switch. So, this is
just to improve the default behaviour, and not to complain even louder
than now.
>
> Instead I would like to suggest building a FAQ for Nutch.
>
> I would like to volunteer to build at least the skeleton for it.
>
> As a new user to Nutch I have run into so many problems, and apart from this
> list there was not much information elsewhere. So, I have all the answers
> fresh in my mind, and with some help from the rest of the nutch-users it
> can be done without too much of a hassle.
>
> Besides, many people on this list contribute in their free time; I would
> be happy to contribute to the success of this project.
This is always welcome, and there is already a place where we collect
such info. Please see the Nutch Wiki, and feel free to enhance or add
new content there.
--
Best regards,
Andrzej Bialecki <><
regarding gal's faq proposal
Posted by gekkokid <me...@gekkokid.org.uk>.
Is there a place where we can search the mailing list? That could be a
short-term solution.
_gk
----- Original Message -----
From: "Gal Nitzan" <gn...@usa.net>
To: <nu...@lucene.apache.org>
Sent: Monday, September 19, 2005 11:37 PM
Subject: Re: Proposal: refuse to open partially trunc. MapFile, unless
forced (Re: indexing is very very very slow)
> Andrzej Bialecki wrote:
>> Hi all,
>>
>>> Well I still get a very slow mergesegs:
>>
>>>>
>>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>>> corrupt, using only 128115 entries.
>>
>> This is a common and recurring problem. What's worse is that an unfixed
>> segment like this will destroy the performance of the search, too, not
>> just the backend pre-processing.
>>
>> I propose to modify MapFile.Reader so that it refuses to open such file,
>> and throws an Exception, unless a force=true flag is given. Tools that
>> want to ignore this can do so, but all other tools will be able to make a
>> conscious decision whether to fix it first, or to use it as such.
>>
>> If there are no objections, I will change it in the trunk/ in a couple of
>> days.
>>
> Hi,
>
> I think it would be very confusing to old users as well as new users.
> Throwing an exception when the segment corruption is actually trivial
> and can be fixed easily (now that I know how to do that :-)...
>
> Instead I would like to suggest building a FAQ for Nutch.
>
> I would like to volunteer to build at least the skeleton for it.
>
> As a new user to Nutch I have run into so many problems, and apart from this
> list there was not much information elsewhere. So, I have all the answers
> fresh in my mind, and with some help from the rest of the nutch-users it
> can be done without too much of a hassle.
>
> Besides, many people on this list contribute in their free time; I would
> be happy to contribute to the success of this project.
>
> Regards,
>
> Gal
>
>
>
>
>
Re: Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Gal Nitzan <gn...@usa.net>.
Andrzej Bialecki wrote:
> Hi all,
>
>> Well I still get a very slow mergesegs:
>
>>>
>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>> corrupt, using only 128115 entries.
>
> This is a common and recurring problem. What's worse is that an
> unfixed segment like this will destroy the performance of the search,
> too, not just the backend pre-processing.
>
> I propose to modify MapFile.Reader so that it refuses to open such
> file, and throws an Exception, unless a force=true flag is given.
> Tools that want to ignore this can do so, but all other tools will be
> able to make a conscious decision whether to fix it first, or to use
> it as such.
>
> If there are no objections, I will change it in the trunk/ in a couple
> of days.
>
Hi,
I think it would be very confusing to old users as well as new users.
Throwing an exception when the segment corruption is actually trivial
and can be fixed easily (now that I know how to do that :-)...
Instead I would like to suggest building a FAQ for Nutch.
I would like to volunteer to build at least the skeleton for it.
As a new user to Nutch I have run into so many problems, and apart from
this list there was not much information elsewhere. So, I have all the
answers fresh in my mind, and with some help from the rest of the
nutch-users it can be done without too much of a hassle.
Besides, many people on this list contribute in their free time; I would
be happy to contribute to the success of this project.
Regards,
Gal
Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Andrzej Bialecki <ab...@getopt.org>.
Hi all,
> Well I still get a very slow mergesegs:
>>
>> >050917 043332 - data in segment index/segments/20050916014401 is
>> corrupt, using only 128115 entries.
This is a common and recurring problem. What's worse is that an unfixed
segment like this will destroy the performance of the search, too, not
just the backend pre-processing.
I propose to modify MapFile.Reader so that it refuses to open such file,
and throws an Exception, unless a force=true flag is given. Tools that
want to ignore this can do so, but all other tools will be able to make
a conscious decision whether to fix it first, or to use it as such.
If there are no objections, I will change it in the trunk/ in a couple
of days.
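In rough pseudocode terms (a Python sketch with hypothetical names, not the real MapFile.Reader API), the proposed behaviour would look like this:

```python
# Sketch of the proposed API change: refuse to open a partially truncated
# map file unless the caller explicitly forces it. Names are hypothetical.
class CorruptMapFileError(Exception):
    pass

def open_mapfile(entries_expected, entries_readable, force=False):
    """Return the number of usable entries, or raise if the file is truncated."""
    if entries_readable < entries_expected and not force:
        raise CorruptMapFileError(
            "data is corrupt; fix the segment first or open with force=True")
    return entries_readable

# The segment from this thread: 128116 records, only 128115 readable.
print(open_mapfile(128116, 128115, force=True))  # 128115
```

Tools that genuinely want to read a damaged file can pass the force flag; everything else fails fast instead of silently running orders of magnitude slower.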
--
Best regards,
Andrzej Bialecki <><
Re: indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
Well I still get a very slow mergesegs:
[root@kunzon nutch]# tail -f nutch-mergesegs-kunzon.com.log
050919 171351 Processed 120000 records (1146.5918 rec/s)
050919 171408 Processed 140000 records (1158.2788 rec/s)
050919 171428 Processed 160000 records (1019.8358 rec/s)
050919 171451 Processed 180000 records (879.2368 rec/s)
050919 171510 Processed 200000 records (1054.9636 rec/s)
050919 171528 Processed 220000 records (1069.2328 rec/s)
050919 171547 Processed 240000 records (1099.868 rec/s)
050919 171832 - creating next subindex...
050919 174512 Processed 260000 records (11.328647 rec/s)
050919 200315 Processed 280000 records (2.4145627 rec/s)
It is falling to 2.4 records per second ...
Can somebody help me please? 400K records is only the beginning; what
will happen when it is 4M?
Regards,
Gal
EM wrote:
> 'segslice' all partial segments into new ones prior to merging.
>
> Gal Nitzan wrote:
>
>> Hi Doug,
>>
>> Thank you for the prompt reply.
>>
>> Well, things got much, much faster (I guess about 40% faster), but it
>> seems that something got really corrupted. Everything gets stuck
>> after 40K records.
>
> >050917 043332 - data in segment index/segments/20050916014401 is
> corrupt, using only 128115 entries.
>
> .
>
Re: indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
Hi EM,
After sending that email I looked at segslice, and it worked perfectly!
Thanks,
Gal
EM wrote:
> 'segslice' all partial segments into new ones prior to merging.
>
> Gal Nitzan wrote:
>
>> Hi Doug,
>>
>> Thank you for the prompt reply.
>>
>> Well, things got much, much faster (I guess about 40% faster), but it
>> seems that something got really corrupted. Everything gets stuck
>> after 40K records.
>
> >050917 043332 - data in segment index/segments/20050916014401 is
> corrupt, using only 128115 entries.
>
> .
>
Re: indexing is very very very slow
Posted by EM <em...@cpuedge.com>.
'segslice' all partial segments into new ones prior to merging.
Gal Nitzan wrote:
> Hi Doug,
>
> Thank you for the prompt reply.
>
> Well, things got much, much faster (I guess about 40% faster), but it
> seems that something got really corrupted. Everything gets stuck after
> 40K records.
>050917 043332 - data in segment index/segments/20050916014401 is
corrupt, using only 128115 entries.
Re: indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
Hi Doug,
Thank you for the prompt reply.
Well, things got much, much faster (I guess about 40% faster), but it
seems that something got really corrupted. Everything gets stuck after
40K records.
[root@kunzon nutch]# bin/nutch mergesegs -dir index/segments/ -i -ds
050917 043331 parsing file:/nutch/conf/nutch-default.xml
050917 043331 parsing file:/nutch/conf/nutch-site.xml
050917 043331 No FS indicated, using default:local
050917 043331 * Opening 2 segments:
050917 043332 - segment 20050916013342: 42287 records.
050917 043332 - data in segment index/segments/20050916014401 is
corrupt, using only 128115 entries.
050917 043332 - segment 20050916014401: 128116 records.
050917 043332 * TOTAL 170403 input records in 2 segments.
050917 043332 * Creating master index...
050917 043345 Processed 20000 records (1613.5538 rec/s)
050917 043354 Processed 40000 records (2113.9414 rec/s)
And that is it. I notice memory is still being consumed, but there is no
apparent activity.
Since I'm really a newbie to Nutch, could you give me a tip on how to
rescue the already-fetched data and remove the corruption from the
segment? I already tried -fix but it didn't help.
Regards,
Gal
Doug Cutting wrote:
> The default for indexer.maxMergeDocs was mistakenly set to 50, which
> can make indexing really slow. Try putting the following in your
> nutch-site.xml:
>
> <property>
> <name>indexer.maxMergeDocs</name>
> <value>2147483647</value>
> </property>
>
> Does that help?
>
> I just fixed this in trunk. We should fix this in the 0.7 release
> branch.
>
> Doug
>
> .
>
Re: indexing is very very very slow
Posted by Doug Cutting <cu...@nutch.org>.
The default for indexer.maxMergeDocs was mistakenly set to 50, which can
make indexing really slow. Try putting the following in your
nutch-site.xml:
<property>
<name>indexer.maxMergeDocs</name>
<value>2147483647</value>
</property>
Does that help?
I just fixed this in trunk. We should fix this in the 0.7 release branch.
Doug
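A back-of-the-envelope model (not Lucene's actual merge logic) illustrates the impact: merged index segments cannot grow past maxMergeDocs documents, so a tiny cap fragments the index into thousands of pieces that every subsequent merge and search must visit:

```python
import math

# Rough model: with maxMergeDocs = m, merged index segments stop growing
# at m documents, so n docs leave roughly ceil(n / m) segments behind.
def index_segments(n_docs, max_merge_docs):
    return math.ceil(n_docs / max_merge_docs)

print(index_segments(400_000, 50))          # 8000 tiny segments
print(index_segments(400_000, 2147483647))  # 1
```

With the default mistakenly set to 50, even the 400K-record crawl from this thread would fragment into thousands of segments, which matches the drastic slowdown reported above.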
indexing is very very very slow
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
I am crawling the web...
My machine:
CPU: 2 x Xeon 2.8 GHz
RAM: 2 GB
HD: RAID, 2 x 160 GB
After fetching (I stopped the fetcher after it halted, i.e. didn't fetch, for
a few hours) I did the following:
1. s1=`ls -d index/segments/2* | tail -1`
2. bin/nutch updatedb index/db/ $s1
the following is the last few lines from the updatedb
--------------------------------------------------------------------
050916 135308 Processing document 127000
050916 135316 Processing document 128000
050916 135317 Unexpected EOF in: index/segments/20050916014401/fetcher
at entry #128116. Ignoring.
050916 135317 Finishing update
050916 135456 Processing pagesByURL: Sorted 3083939 instructions in
99.536 seconds.
050916 135456 Processing pagesByURL: Sorted 30983.15182446552
instructions/second
050916 135559 Processing pagesByURL: Merged to new DB containing 774610
records in 35.355 seconds
050916 135559 Processing pagesByURL: Merged 21909.489464007922
records/second
050916 135611 Processing pagesByMD5: Sorted 803182 instructions in
11.654 seconds.
050916 135611 Processing pagesByMD5: Sorted 68918.99776900635
instructions/second
050916 135627 Processing pagesByMD5: Merged to new DB containing 774610
records in 14.216 seconds
050916 135627 Processing pagesByMD5: Merged 54488.604389420376
records/second
050916 135633 Processing linksByMD5: Sorted 689997 instructions in 6.038
seconds.
050916 135633 Processing linksByMD5: Sorted 114275.75356078171
instructions/second
050916 135648 Processing linksByMD5: Merged to new DB containing 776849
records in 13.624 seconds
050916 135648 Processing linksByMD5: Merged 57020.62536699941 records/second
050916 135655 Processing linksByURL: Sorted 584963 instructions in 7.056
seconds.
050916 135655 Processing linksByURL: Sorted 82902.91950113379
instructions/second
050916 135711 Processing linksByURL: Merged to new DB containing 776849
records in 14.533 seconds
050916 135711 Processing linksByURL: Merged 53454.13885639579 records/second
050916 135718 Processing linksByMD5: Sorted 671867 instructions in 6.732
seconds.
050916 135718 Processing linksByMD5: Sorted 99801.99049316696
instructions/second
050916 135729 Processing linksByMD5: Merged to new DB containing 776849
records in 9.999 seconds
050916 135729 Processing linksByMD5: Merged 77692.66926692669 records/second
050916 135744 Update finished
--------------------------------------------------------------------
As you can see, updatedb went fine even though it encountered the fetcher's
interruption.
3. bin/nutch mergesegs -dir index/segments/ -i -ds
from here on is the problem
--------------------------------------------------------------------
050916 141720 parsing file:/nutch/conf/nutch-default.xml
050916 141720 parsing file:/nutch/conf/nutch-site.xml
050916 141720 No FS indicated, using default:local
050916 141720 * Opening 2 segments:
050916 141720 - segment 20050916013342: 42287 records.
050916 141721 - data in segment index/segments/20050916014401 is
corrupt, using only 128115 entries.
050916 141722 - segment 20050916014401: 128116 records.
050916 141722 * TOTAL 170403 input records in 2 segments.
050916 141722 * Creating master index...
050916 141737 Processed 20000 records (1311.9916 rec/s)
050916 141751 Processed 40000 records (1394.0197 rec/s)
050916 154424 Processed 60000 records (3.851173 rec/s)
--------------------------------------------------------------------
As you can see in the last line, the indexer is processing 3.8 records per
second, which means indexing will take far too long.
Anybody got a clue or a hint, please?
Regards,
Gal