Posted to user@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/09/19 21:10:57 UTC

Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

Hi all,

> Well I still get a very slow mergesegs:

>>
>> >050917 043332  - data in segment index/segments/20050916014401 is 
>> corrupt, using only 128115 entries.

This is a common and recurring problem. What's worse is that an unfixed 
segment like this will destroy the performance of the search, too, not 
just the backend pre-processing.

I propose to modify MapFile.Reader so that it refuses to open such a file
and throws an Exception, unless a force=true flag is given. Tools that
want to ignore the truncation can do so, but all other tools will be able
to make a conscious decision whether to fix the file first or to use it as is.
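
To make the proposal concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions: GuardedMapFileReader, TruncatedMapFileException and the open(...) signature are hypothetical names, not the existing MapFile.Reader API, and the sketch models entry counts only. How a real reader would detect truncation is left out.

```java
import java.io.IOException;

/** Hypothetical exception type for this sketch; not an existing Nutch class. */
class TruncatedMapFileException extends IOException {
    TruncatedMapFileException(String message) {
        super(message);
    }
}

/**
 * Hypothetical stand-in for a MapFile reader. It models only entry counts,
 * not the real on-disk format or the real MapFile.Reader constructor.
 */
class GuardedMapFileReader {
    private final long usableEntries;

    private GuardedMapFileReader(long usableEntries) {
        this.usableEntries = usableEntries;
    }

    /**
     * Opens the (simulated) file.
     *
     * @param name            label used in the error message
     * @param expectedEntries entries the file should contain
     * @param readableEntries entries that can actually be read
     * @param force           if true, open a truncated file anyway
     */
    static GuardedMapFileReader open(String name, long expectedEntries,
                                     long readableEntries, boolean force)
            throws TruncatedMapFileException {
        boolean truncated = readableEntries < expectedEntries;
        if (truncated && !force) {
            // Default: refuse, so the caller has to decide what to do.
            throw new TruncatedMapFileException(
                name + " is truncated: only " + readableEntries
                + " of " + expectedEntries + " entries are readable");
        }
        // Either the file is intact, or the caller explicitly accepted the loss.
        return new GuardedMapFileReader(Math.min(expectedEntries, readableEntries));
    }

    long usableEntries() {
        return usableEntries;
    }
}
```

With force=false a truncated file is rejected with an exception; with force=true the reader opens and exposes only the readable entries, which is roughly what happens silently today.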

If there are no objections, I will change it in the trunk/ in a couple 
of days.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Updated FAQ

Posted by Gal Nitzan <gn...@usa.net>.
Hello,

I have updated the FAQ: http://wiki.apache.org/nutch/FAQ?action=show

I've only just started, so please be patient.

Please let me know if I'm heading in the right direction, and tell me
about any additions, changes or deletions that are needed.

Regards,

Gal

Re: JDK 1.5

Posted by Gal Nitzan <gn...@usa.net>.
Jay Pound wrote:
> I'm running JDK 1.5 on windows 64bit, no problems, in fact very fast/stable
> ----- Original Message ----- 
> From: "Piotr Kosiorowski" <pk...@gmail.com>
> To: <nu...@lucene.apache.org>
> Sent: Tuesday, September 20, 2005 5:31 AM
> Subject: Re: JDK 1.5
>
>
> I was running it on JDK 1.5 on AMD64 linux for more than half a year and had
> no such problems.
> Regards
> Piotr
>
> On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>   
>> Hi,
>>
>> I have tried running Nutch with JDK 1.5 and got very weird results,
>> like fetcher is hanging and merge is hanging.
>>
>> After that I switched to 1.4 and all went well.
>>
>> Is it just a matter of re-build?
>>
>> Regards,
>>
>> Gal
>>
>>     
>
>

Thanks for the replies. Did you have to rebuild, or did it work out of the box?

Gal


Re: JDK 1.5

Posted by Jay Pound <we...@poundwebhosting.com>.
I'm running JDK 1.5 on 64-bit Windows with no problems; in fact it is very fast and stable.
----- Original Message ----- 
From: "Piotr Kosiorowski" <pk...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Tuesday, September 20, 2005 5:31 AM
Subject: Re: JDK 1.5


I was running it on JDK 1.5 on AMD64 linux for more than half a year and had
no such problems.
Regards
Piotr

On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
> Hi,
>
> I have tried running Nutch with JDK 1.5 and got very weird results,
> like fetcher is hanging and merge is hanging.
>
> After that I switched to 1.4 and all went well.
>
> Is it just a matter of re-build?
>
> Regards,
>
> Gal
>



Re: JDK 1.5

Posted by Piotr Kosiorowski <pk...@gmail.com>.
I ran it on JDK 1.5 on AMD64 Linux for more than half a year and had
no such problems.
Regards
Piotr

On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
> 
> Hi,
> 
> I have tried running Nutch with JDK 1.5 and got very weird results,
> like fetcher is hanging and merge is hanging.
> 
> After that I switched to 1.4 and all went well.
> 
> Is it just a matter of re-build?
> 
> Regards,
> 
> Gal
>

JDK 1.5

Posted by Gal Nitzan <gn...@usa.net>.
Hi,

I have tried running Nutch with JDK 1.5 and got very weird results,
such as the fetcher hanging and the merge hanging.

After that I switched to 1.4 and all went well.

Is it just a matter of rebuilding?

Regards,

Gal

Re: Is it possible to change the list of common words without crawling everything again

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
> This question was in the FAQ unanswered.
> 
> Can someone answer that please. I shall put it in the FAQ

Yes. The list of common words is used only when indexing and searching, 
and not during other steps. So, if you change the list of common words, 
there is no need to re-fetch the content, you just need to re-create 
segment indexes to reflect the changes.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Is it possible to change the list of common words without crawling everything again

Posted by Gal Nitzan <gn...@usa.net>.
Tim Archambault wrote:
> How can I unsubscribe? Sorry to bother.
>

To remove your address from the list, send a message to:
  <nu...@lucene.apache.org>


Re: Is it possible to change the list of common words without crawling everything again

Posted by Tim Archambault <jo...@gmail.com>.
How can I unsubscribe? Sorry to bother.

On 9/20/05, Gal Nitzan <gn...@usa.net> wrote: 
> 
> This question was in the FAQ unanswered.
> 
> Can someone answer that please. I shall put it in the FAQ
> 
> Regards,
> 
> Gal
>

Is it possible to change the list of common words without crawling everything again

Posted by Gal Nitzan <gn...@usa.net>.
This question is in the FAQ without an answer.

Can someone answer it, please? I will put the answer in the FAQ.

Regards,

Gal

Re: Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

Posted by Gal Nitzan <gn...@usa.net>.
Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Andrzej Bialecki wrote:
>>
>>> Hi all,
>>>
>>>> Well I still get a very slow mergesegs:
>>>
>>>
>>>>>
>>>>> >050917 043332  - data in segment index/segments/20050916014401 is 
>>>>> corrupt, using only 128115 entries.
>>>
>>>
>>> This is a common and recurring problem. What's worse is that an 
>>> unfixed segment like this will destroy the performance of the 
>>> search, too, not just the backend pre-processing.
>>>
>>> I propose to modify MapFile.Reader so that it refuses to open such 
>>> file, and throws an Exception, unless a force=true flag is given. 
>>> Tools that want to ignore this can do so, but all other tools will 
>>> be able to make a conscious decision whether to fix it first, or to 
>>> use it as such.
>>>
>>> If there are no objections, I will change it in the trunk/ in a 
>>> couple of days.
>>>
>> Hi,
>>
>> I think it would be very confusing to old users as well as new users. 
>> Throwing an exception when actually  a segment corruption is trivial 
>> and can be fixed easily (now that I know how to do that :-)...
>
> You missed my point - I proposed that we change the API. On the 
> surface, command-line tools would behave like now, with the benefit 
> that segment corruption would be fixed automatically by those tools 
> that require clean segments - unless _prevented_ by a cmd-line switch. 
> So, this is just to improve the default behaviour, and not to complain 
> even louder than now.
>
>>
>> Instead I would like to suggest building a FAQ for Nutch.
>>
>> I would like to propose myself  to build at least the skeleton for it.
>>
>> As a new user to Nutch I have run to so many problems and except this 
>> list there was not much information elsewhere. So, I have all the 
>> answers fresh in my mind and with some help from the rest of the 
>> nutch-users it can be done without too much of a hustle.
>>
>> Besides, many people on this list contribute on their free time, I 
>> would be happy to contribute to the success of this  project.
>
> This is always welcome, and there is already a place where we collect 
> such info. Please see the Nutch Wiki, and feel free to enhance or add 
> new content there.
>
You are right, I did miss your point. And now that I understand :-) I
think it is a very good idea.

Yes, I found the FAQ hiding in the wiki and I have started working on it.

Gal

Re: Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

Posted by Matthias Jaekle <ja...@eventax.de>.
> You missed my point - I proposed that we change the API. On the surface, 
> command-line tools would behave like now, with the benefit that segment 
> corruption would be fixed automatically by those tools that require 
> clean segments - unless _prevented_ by a cmd-line switch. So, this is 
> just to improve the default behaviour, and not to complain even louder 
> than now.
That would be great!
Matthias

Re: Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
> Andrzej Bialecki wrote:
> 
>> Hi all,
>>
>>> Well I still get a very slow mergesegs:
>>
>>
>>>>
>>>> >050917 043332  - data in segment index/segments/20050916014401 is 
>>>> corrupt, using only 128115 entries.
>>
>>
>> This is a common and recurring problem. What's worse is that an 
>> unfixed segment like this will destroy the performance of the search, 
>> too, not just the backend pre-processing.
>>
>> I propose to modify MapFile.Reader so that it refuses to open such 
>> file, and throws an Exception, unless a force=true flag is given. 
>> Tools that want to ignore this can do so, but all other tools will be 
>> able to make a conscious decision whether to fix it first, or to use 
>> it as such.
>>
>> If there are no objections, I will change it in the trunk/ in a couple 
>> of days.
>>
> Hi,
> 
> I think it would be very confusing to old users as well as new users. 
> Throwing an exception when actually  a segment corruption is trivial and 
> can be fixed easily (now that I know how to do that :-)...

You missed my point - I proposed that we change the API. On the surface,
command-line tools would behave as they do now, with the benefit that
segment corruption would be fixed automatically by those tools that
require clean segments - unless _prevented_ by a cmd-line switch. So, this
is just to improve the default behaviour, and not to complain even louder
than now.
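
As a rough sketch of that default behaviour, in Java: a tool that needs a clean segment repairs a truncated one before processing it, unless the caller has switched the repair off. Everything here is hypothetical and for illustration only: SegmentModel, CleanSegmentTool and the noFix flag are made-up names, not existing Nutch classes or switches, and the "repair" just adjusts counters instead of touching files.

```java
import java.io.IOException;

/** Hypothetical model of a segment: entry counts only, no real file I/O. */
class SegmentModel {
    final String name;
    private long expectedEntries;
    private final long readableEntries;

    SegmentModel(String name, long expectedEntries, long readableEntries) {
        this.name = name;
        this.expectedEntries = expectedEntries;
        this.readableEntries = readableEntries;
    }

    boolean isTruncated() {
        return readableEntries < expectedEntries;
    }

    /** Stand-in for a real repair step: keep the readable entries, drop the rest. */
    void repair() {
        expectedEntries = Math.min(expectedEntries, readableEntries);
    }

    long entries() {
        return Math.min(expectedEntries, readableEntries);
    }
}

/** Hypothetical tool that requires a clean segment. */
class CleanSegmentTool {
    /**
     * Processes the segment and returns the number of entries handled.
     *
     * @param noFix models a command-line switch that prevents automatic repair
     */
    static long process(SegmentModel segment, boolean noFix) throws IOException {
        if (segment.isTruncated()) {
            if (noFix) {
                // The user asked us not to touch the segment, so we stop here.
                throw new IOException("segment " + segment.name
                    + " is truncated and automatic repair is disabled");
            }
            // Default: fix the segment first, then carry on as usual.
            segment.repair();
        }
        return segment.entries();
    }
}
```

The design point is that the exception stays inside the API. From the command line, the user sees a tool that quietly repairs the segment and continues, and only sees an error after explicitly disabling the repair.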

> 
> Instead I would like to suggest building a FAQ for Nutch.
> 
> I would like to propose myself  to build at least the skeleton for it.
> 
> As a new user to Nutch I have run to so many problems and except this 
> list there was not much information elsewhere. So, I have all the 
> answers fresh in my mind and with some help from the rest of the 
> nutch-users it can be done without too much of a hustle.
> 
> Besides, many people on this list contribute on their free time, I would 
> be happy to contribute to the success of this  project.

This is always welcome, and there is already a place where we collect 
such info. Please see the Nutch Wiki, and feel free to enhance or add 
new content there.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


regarding gal's faq proposal

Posted by gekkokid <me...@gekkokid.org.uk>.
Is there a place where we can search the mailing list? That could be a
short-term solution.

_gk
----- Original Message ----- 
From: "Gal Nitzan" <gn...@usa.net>
To: <nu...@lucene.apache.org>
Sent: Monday, September 19, 2005 11:37 PM
Subject: Re: Proposal: refuse to open partially trunc. MapFile, unless 
forced (Re: indexing is very very very slow)


> Andrzej Bialecki wrote:
>> Hi all,
>>
>>> Well I still get a very slow mergesegs:
>>
>>>>
>>>> >050917 043332  - data in segment index/segments/20050916014401 is
>>>> corrupt, using only 128115 entries.
>>
>> This is a common and recurring problem. What's worse is that an unfixed 
>> segment like this will destroy the performance of the search, too, not 
>> just the backend pre-processing.
>>
>> I propose to modify MapFile.Reader so that it refuses to open such file, 
>> and throws an Exception, unless a force=true flag is given. Tools that 
>> want to ignore this can do so, but all other tools will be able to make a 
>> conscious decision whether to fix it first, or to use it as such.
>>
>> If there are no objections, I will change it in the trunk/ in a couple of 
>> days.
>>
> Hi,
>
> I think it would be very confusing to old users as well as new users. 
> Throwing an exception when actually  a segment corruption is trivial and 
> can be fixed easily (now that I know how to do that :-)...
>
> Instead I would like to suggest building a FAQ for Nutch.
>
> I would like to propose myself  to build at least the skeleton for it.
>
> As a new user to Nutch I have run to so many problems and except this list 
> there was not much information elsewhere. So, I have all the answers fresh 
> in my mind and with some help from the rest of the nutch-users it can be 
> done without too much of a hustle.
>
> Besides, many people on this list contribute on their free time, I would 
> be happy to contribute to the success of this  project.
>
> Regards,
>
> Gal
>
>
>
>
> 



Re: Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

Posted by Gal Nitzan <gn...@usa.net>.
Andrzej Bialecki wrote:
> Hi all,
>
>> Well I still get a very slow mergesegs:
>
>>>
>>> >050917 043332  - data in segment index/segments/20050916014401 is 
>>> corrupt, using only 128115 entries.
>
> This is a common and recurring problem. What's worse is that an 
> unfixed segment like this will destroy the performance of the search, 
> too, not just the backend pre-processing.
>
> I propose to modify MapFile.Reader so that it refuses to open such 
> file, and throws an Exception, unless a force=true flag is given. 
> Tools that want to ignore this can do so, but all other tools will be 
> able to make a conscious decision whether to fix it first, or to use 
> it as such.
>
> If there are no objections, I will change it in the trunk/ in a couple 
> of days.
>
Hi,

I think it would be very confusing to old users as well as new users.
Throwing an exception when a segment corruption is actually trivial and
can be fixed easily (now that I know how to do that :-)...

Instead I would like to suggest building a FAQ for Nutch.

I would like to volunteer to build at least the skeleton for it.

As a new user of Nutch I have run into so many problems, and apart from
this list there was not much information elsewhere. So, I have all the
answers fresh in my mind, and with some help from the rest of the
nutch-users it can be done without too much hassle.

Besides, many people on this list contribute in their free time; I would
be happy to contribute to the success of this project.

Regards,

Gal