Posted to user@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/09/19 21:10:57 UTC
Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Hi all,
> Well I still get a very slow mergesegs:
>>
>> >050917 043332 - data in segment index/segments/20050916014401 is
>> corrupt, using only 128115 entries.
This is a common and recurring problem. What's worse is that an unfixed
segment like this will destroy the performance of the search, too, not
just the backend pre-processing.
I propose to modify MapFile.Reader so that it refuses to open such a file
and throws an Exception, unless a force=true flag is given. Tools that
want to ignore this can do so, but all other tools will be able to make
a conscious decision whether to fix it first or to use it as is.
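
A minimal sketch of the proposed behaviour, in Java. Everything in it is hypothetical: StrictReaderSketch, TruncatedFileException and the entry counts are illustrative stand-ins, not the actual MapFile.Reader API. How truncation is detected is left out; the caller simply passes in the expected and the readable entry counts.

```java
import java.io.IOException;

// Hypothetical sketch only; none of these names exist in Nutch.
class StrictReaderSketch {

    /** Hypothetical exception for a partially truncated file. */
    static class TruncatedFileException extends IOException {
        TruncatedFileException(String message) {
            super(message);
        }
    }

    /** Stand-in for a reader that refuses truncated data unless forced. */
    static class Reader {
        final long usableEntries;

        Reader(String name, long expectedEntries, long readableEntries, boolean force)
                throws TruncatedFileException {
            boolean truncated = readableEntries < expectedEntries;
            if (truncated && !force) {
                throw new TruncatedFileException(name + " is truncated: only "
                        + readableEntries + " of " + expectedEntries + " entries are readable");
            }
            // Forced (or clean) open: expose whatever can actually be read.
            this.usableEntries = Math.min(expectedEntries, readableEntries);
        }
    }

    /** Returns true if the open succeeds, false if the reader refuses. */
    static boolean canOpen(long expectedEntries, long readableEntries, boolean force) {
        try {
            new Reader("segment", expectedEntries, readableEntries, force);
            return true;
        } catch (TruncatedFileException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 200000 is an invented total; 128115 is the count from the log line above.
        System.out.println(canOpen(200000, 128115, false)); // prints false (refused)
        System.out.println(canOpen(200000, 128115, true));  // prints true (forced)
        System.out.println(canOpen(200000, 200000, false)); // prints true (clean file)
    }
}
```

The point of the design is that a caller has to opt in to reading a damaged file, so it cannot end up with a silently shortened one by accident.
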
If there are no objections, I will change it in the trunk/ in a couple
of days.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Updated FAQ
Posted by Gal Nitzan <gn...@usa.net>.
Hello,
I have updated the FAQ: http://wiki.apache.org/nutch/FAQ?action=show
I've just started, so please be patient.
Please let me know if I'm heading in the right direction, and please let me
know of any additions/changes/deletions required.
Regards,
Gal
Re: JDK 1.5
Posted by Gal Nitzan <gn...@usa.net>.
Jay Pound wrote:
> I'm running JDK 1.5 on windows 64bit, no problems, in fact very fast/stable
> ----- Original Message -----
> From: "Piotr Kosiorowski" <pk...@gmail.com>
> To: <nu...@lucene.apache.org>
> Sent: Tuesday, September 20, 2005 5:31 AM
> Subject: Re: JDK 1.5
>
>
> I was running it on JDK 1.5 on AMD64 linux for more than half a year and had
> no such problems.
> Regards
> Piotr
>
> On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
>> Hi,
>>
>> I have tried running Nutch with JDK 1.5 and got very weird results,
>> like fetcher is hanging and merge is hanging.
>>
>> After that I switched to 1.4 and all went well.
>>
>> Is it just a matter of re-build?
>>
>> Regards,
>>
>> Gal
>>
>>
Thanks for the replies. Did you have to rebuild, or did it work out of the box?
Gal
Re: JDK 1.5
Posted by Jay Pound <we...@poundwebhosting.com>.
I'm running JDK 1.5 on 64-bit Windows with no problems; in fact it is very fast and stable.
----- Original Message -----
From: "Piotr Kosiorowski" <pk...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Tuesday, September 20, 2005 5:31 AM
Subject: Re: JDK 1.5
I was running it on JDK 1.5 on AMD64 linux for more than half a year and had
no such problems.
Regards
Piotr
On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
> Hi,
>
> I have tried running Nutch with JDK 1.5 and got very weird results,
> like fetcher is hanging and merge is hanging.
>
> After that I switched to 1.4 and all went well.
>
> Is it just a matter of re-build?
>
> Regards,
>
> Gal
>
Re: JDK 1.5
Posted by Piotr Kosiorowski <pk...@gmail.com>.
I ran it on JDK 1.5 on AMD64 Linux for more than half a year and had
no such problems.
Regards
Piotr
On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
> Hi,
>
> I have tried running Nutch with JDK 1.5 and got very weird results,
> like fetcher is hanging and merge is hanging.
>
> After that I switched to 1.4 and all went well.
>
> Is it just a matter of re-build?
>
> Regards,
>
> Gal
>
JDK 1.5
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
I have tried running Nutch with JDK 1.5 and got very weird results:
the fetcher hangs and the merge hangs.
After that I switched to 1.4 and all went well.
Is it just a matter of rebuilding?
Regards,
Gal
Re: Is it possible to change the list of common words without crawling
everything again
Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
> This question was in the FAQ unanswered.
>
> Can someone answer that please. I shall put it in the FAQ
Yes. The list of common words is used only when indexing and searching,
and not during other steps. So, if you change the list of common words,
there is no need to re-fetch the content, you just need to re-create
segment indexes to reflect the changes.
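
To illustrate why no re-fetch is needed, here is a deliberately simplified Java sketch. It is an assumption-laden toy, not Nutch's actual analysis or indexing code: ReindexSketch and buildIndex are hypothetical names, the "fetched content" is just a list of strings, and the common-word list is applied as a plain filter. The fetched data stays the same and only the index is rebuilt with a different list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch only; this is not how Nutch stores or indexes segments.
class ReindexSketch {

    /** Builds a term -> document ids index from already-fetched text, skipping common words. */
    static Map<String, Set<Integer>> buildIndex(List<String> fetchedDocs, Set<String> commonWords) {
        Map<String, Set<Integer>> index = new TreeMap<String, Set<Integer>>();
        for (int docId = 0; docId < fetchedDocs.size(); docId++) {
            for (String term : fetchedDocs.get(docId).toLowerCase().split("\\s+")) {
                if (term.isEmpty() || commonWords.contains(term)) {
                    continue; // common words never reach the index
                }
                Set<Integer> docs = index.get(term);
                if (docs == null) {
                    docs = new TreeSet<Integer>();
                    index.put(term, docs);
                }
                docs.add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Fetched once; never touched again in this example.
        List<String> fetched = Arrays.asList("the quick fox", "the lazy dog");

        Map<String, Set<Integer>> before =
                buildIndex(fetched, new HashSet<String>(Arrays.asList("the")));
        // The list changed: rebuild the index from the same fetched content.
        Map<String, Set<Integer>> after =
                buildIndex(fetched, new HashSet<String>());

        System.out.println(before.containsKey("the")); // prints false
        System.out.println(after.containsKey("the"));  // prints true
    }
}
```
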
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Is it possible to change the list of common words without crawling
everything again
Posted by Gal Nitzan <gn...@usa.net>.
Tim Archambault wrote:
> How can I unsubscribe? Sorry to bother.
To remove your address from the list, send a message to:
<nu...@lucene.apache.org>
Re: Is it possible to change the list of common words without crawling everything again
Posted by Tim Archambault <jo...@gmail.com>.
How can I unsubscribe? Sorry to bother.
On 9/20/05, Gal Nitzan <gn...@usa.net> wrote:
>
> This question was in the FAQ unanswered.
>
> Can someone answer that please. I shall put it in the FAQ
>
> Regards,
>
> Gal
>
Is it possible to change the list of common words without crawling
everything again
Posted by Gal Nitzan <gn...@usa.net>.
This question is in the FAQ but unanswered.
Can someone answer it, please? I will put the answer in the FAQ.
Regards,
Gal
Re: Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Gal Nitzan <gn...@usa.net>.
Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Andrzej Bialecki wrote:
>>
>>> Hi all,
>>>
>>>> Well I still get a very slow mergesegs:
>>>
>>>
>>>>>
>>>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>>>> corrupt, using only 128115 entries.
>>>
>>>
>>> This is a common and recurring problem. What's worse is that an
>>> unfixed segment like this will destroy the performance of the
>>> search, too, not just the backend pre-processing.
>>>
>>> I propose to modify MapFile.Reader so that it refuses to open such
>>> file, and throws an Exception, unless a force=true flag is given.
>>> Tools that want to ignore this can do so, but all other tools will
>>> be able to make a conscious decision whether to fix it first, or to
>>> use it as such.
>>>
>>> If there are no objections, I will change it in the trunk/ in a
>>> couple of days.
>>>
>> Hi,
>>
>> I think it would be very confusing to old users as well as new users.
>> Throwing an exception when actually a segment corruption is trivial
>> and can be fixed easily (now that I know how to do that :-)...
>
> You missed my point - I proposed that we change the API. On the
> surface, command-line tools would behave like now, with the benefit
> that segment corruption would be fixed automatically by those tools
> that require clean segments - unless _prevented_ by a cmd-line switch.
> So, this is just to improve the default behaviour, and not to complain
> even louder than now.
>
>>
>> Instead I would like to suggest building a FAQ for Nutch.
>>
>> I would like to propose myself to build at least the skeleton for it.
>>
>> As a new user to Nutch I have run to so many problems and except this
>> list there was not much information elsewhere. So, I have all the
>> answers fresh in my mind and with some help from the rest of the
>> nutch-users it can be done without too much of a hustle.
>>
>> Besides, many people on this list contribute on their free time, I
>> would be happy to contribute to the success of this project.
>
> This is always welcome, and there is already a place where we collect
> such info. Please see the Nutch Wiki, and feel free to enhance or add
> new content there.
>
You are right, I did miss your point. And now that I understand :-) I
think it is a very good idea.
Yes, I found the FAQ hiding in the wiki and I have started working on it.
Gal
Re: Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Matthias Jaekle <ja...@eventax.de>.
> You missed my point - I proposed that we change the API. On the surface,
> command-line tools would behave like now, with the benefit that segment
> corruption would be fixed automatically by those tools that require
> clean segments - unless _prevented_ by a cmd-line switch. So, this is
> just to improve the default behaviour, and not to complain even louder
> than now.
That would be great!
Matthias
Re: Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
> Andrzej Bialecki wrote:
>
>> Hi all,
>>
>>> Well I still get a very slow mergesegs:
>>
>>
>>>>
>>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>>> corrupt, using only 128115 entries.
>>
>>
>> This is a common and recurring problem. What's worse is that an
>> unfixed segment like this will destroy the performance of the search,
>> too, not just the backend pre-processing.
>>
>> I propose to modify MapFile.Reader so that it refuses to open such
>> file, and throws an Exception, unless a force=true flag is given.
>> Tools that want to ignore this can do so, but all other tools will be
>> able to make a conscious decision whether to fix it first, or to use
>> it as such.
>>
>> If there are no objections, I will change it in the trunk/ in a couple
>> of days.
>>
> Hi,
>
> I think it would be very confusing to old users as well as new users.
> Throwing an exception when actually a segment corruption is trivial and
> can be fixed easily (now that I know how to do that :-)...
You missed my point - I proposed that we change the API. On the surface,
command-line tools would behave as they do now, with the benefit that segment
corruption would be fixed automatically by those tools that require
clean segments - unless _prevented_ by a cmd-line switch. So, this is
just to improve the default behaviour, not to complain even louder
than now.
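
Roughly, the tool-side behaviour could look like the Java sketch below. This is only an illustration under assumptions: AutoFixToolSketch, Segment, prepare and the "-nofix" flag are invented names, fix() is a placeholder for whatever repair a real tool would run, and using the segment as is when the switch is given is one possible reading of "prevented".

```java
// Hypothetical sketch only; the names and the switch are invented for illustration.
class AutoFixToolSketch {

    enum Outcome { ALREADY_CLEAN, FIXED, LEFT_CORRUPT }

    /** Stand-in for a segment that may be partially truncated. */
    static class Segment {
        boolean truncated;

        Segment(boolean truncated) {
            this.truncated = truncated;
        }

        /** Placeholder for whatever repair the real tool would run. */
        void fix() {
            truncated = false;
        }
    }

    /**
     * Default: repair a corrupt segment before using it.
     * With the (hypothetical) no-fix switch: leave it alone and use it as is.
     */
    static Outcome prepare(Segment segment, boolean noFixSwitch) {
        if (!segment.truncated) {
            return Outcome.ALREADY_CLEAN;
        }
        if (noFixSwitch) {
            return Outcome.LEFT_CORRUPT;
        }
        segment.fix();
        return Outcome.FIXED;
    }

    public static void main(String[] args) {
        // "-nofix" is an invented flag name, not an existing Nutch option.
        boolean noFix = args.length > 0 && "-nofix".equals(args[0]);
        System.out.println(prepare(new Segment(true), noFix));
    }
}
```

Run without arguments, the sketch repairs the corrupt segment; run with the invented -nofix argument, it leaves the segment untouched.
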
>
> Instead I would like to suggest building a FAQ for Nutch.
>
> I would like to propose myself to build at least the skeleton for it.
>
> As a new user to Nutch I have run to so many problems and except this
> list there was not much information elsewhere. So, I have all the
> answers fresh in my mind and with some help from the rest of the
> nutch-users it can be done without too much of a hustle.
>
> Besides, many people on this list contribute on their free time, I would
> be happy to contribute to the success of this project.
This is always welcome, and there is already a place where we collect
such info. Please see the Nutch Wiki, and feel free to enhance or add
new content there.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
regarding gal's faq proposal
Posted by gekkokid <me...@gekkokid.org.uk>.
Is there a place where we can search the mailing list? That could be a
short-term solution.
_gk
----- Original Message -----
From: "Gal Nitzan" <gn...@usa.net>
To: <nu...@lucene.apache.org>
Sent: Monday, September 19, 2005 11:37 PM
Subject: Re: Proposal: refuse to open partially trunc. MapFile, unless
forced (Re: indexing is very very very slow)
> Andrzej Bialecki wrote:
>> Hi all,
>>
>>> Well I still get a very slow mergesegs:
>>
>>>>
>>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>>> corrupt, using only 128115 entries.
>>
>> This is a common and recurring problem. What's worse is that an unfixed
>> segment like this will destroy the performance of the search, too, not
>> just the backend pre-processing.
>>
>> I propose to modify MapFile.Reader so that it refuses to open such file,
>> and throws an Exception, unless a force=true flag is given. Tools that
>> want to ignore this can do so, but all other tools will be able to make a
>> conscious decision whether to fix it first, or to use it as such.
>>
>> If there are no objections, I will change it in the trunk/ in a couple of
>> days.
>>
> Hi,
>
> I think it would be very confusing to old users as well as new users.
> Throwing an exception when actually a segment corruption is trivial and
> can be fixed easily (now that I know how to do that :-)...
>
> Instead I would like to suggest building a FAQ for Nutch.
>
> I would like to propose myself to build at least the skeleton for it.
>
> As a new user to Nutch I have run to so many problems and except this list
> there was not much information elsewhere. So, I have all the answers fresh
> in my mind and with some help from the rest of the nutch-users it can be
> done without too much of a hustle.
>
> Besides, many people on this list contribute on their free time, I would
> be happy to contribute to the success of this project.
>
> Regards,
>
> Gal
Re: Proposal: refuse to open partially trunc. MapFile, unless forced
(Re: indexing is very very very slow)
Posted by Gal Nitzan <gn...@usa.net>.
Andrzej Bialecki wrote:
> Hi all,
>
>> Well I still get a very slow mergesegs:
>
>>>
>>> >050917 043332 - data in segment index/segments/20050916014401 is
>>> corrupt, using only 128115 entries.
>
> This is a common and recurring problem. What's worse is that an
> unfixed segment like this will destroy the performance of the search,
> too, not just the backend pre-processing.
>
> I propose to modify MapFile.Reader so that it refuses to open such
> file, and throws an Exception, unless a force=true flag is given.
> Tools that want to ignore this can do so, but all other tools will be
> able to make a conscious decision whether to fix it first, or to use
> it as such.
>
> If there are no objections, I will change it in the trunk/ in a couple
> of days.
>
Hi,
I think it would be very confusing to old users as well as new users
to throw an exception when segment corruption is actually trivial and
can be fixed easily (now that I know how to do that :-)...
Instead I would like to suggest building a FAQ for Nutch.
I would like to volunteer to build at least the skeleton for it.
As a new user of Nutch I have run into many problems, and apart from this
list there was not much information elsewhere. So, I have all the
answers fresh in my mind, and with some help from the rest of the
nutch-users it can be done without too much of a hassle.
Besides, many people on this list contribute in their free time, and I would
be happy to contribute to the success of this project.
Regards,
Gal