Posted to user@nutch.apache.org by Jesse Hires <jh...@gmail.com> on 2009/12/01 03:48:01 UTC
Re: odd warnings
What are "segments.gen" and "segments_2"?
The warning I am getting happens when I dedup two indexes.
I create index1 and index2 through generate/fetch/index/...etc
index1 is an index of 1/2 the segments. index2 is an index of the other 1/2
The warning is happening on both datanodes.
The command I am running is "bin/nutch dedup crawl/index1 crawl/index2"
If segments.gen and segments_2 are supposed to be directories, then why are
they created as files?
They are created as files from the start
"bin/nutch index crawl/index1 crawl/crawldb /crawl/linkdb crawl/segments/XXX
crawl/segments/YYY"
I don't see any errors or warnings about creating the index.
I'm using nutch 1.0, though it has been a bit since I've updated the sources
from the trunk.
I'm running one name node and two data nodes.
2009-11-30 18:25:23,497 WARN mapred.FileInputFormat - Can't open index at
hdfs://nn1:9000/user/nutch/crawl/index2/segments_2:0+2147483647, skipping.
(hdfs://nn1:9000/user/nutch/crawl/index2/segments_2 not a directory)
2009-11-30 18:33:50,200 WARN mapred.FileInputFormat - Can't open index at
hdfs://nn1:9000/user/nutch/crawl/index2/segments.gen:0+2147483647, skipping.
(hdfs://nn1:9000/user/nutch/crawl/index2/segments.gen not a directory)
Jesse
int GetRandomNumber()
{
return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com
On Mon, Nov 30, 2009 at 9:30 AM, Jesse Hires <jh...@gmail.com> wrote:
> actually searcher.dir is still the default "crawl". The warnings are
> showing up either while indexing segments or merging indexes. I need to
> spend some time figuring out just where it is happening. I will look into
> it later tonight, work doesn't like my hobbies intruding. :)
>
> I may need some more info on "index" vs "indexes" later if you don't mind
> my asking some dumb questions about them, but thus far, things seem to be
> working in the manner I have it set up. With the exception of the warnings
> mentioned of course.
>
> The searching (or searchers) run out of a different directory and I run the
> indexes and segments for them locally on the individual nodes and I am
> getting search results back, which increase with every pass as expected.
>
>
>
> Jesse
>
> On Mon, Nov 30, 2009 at 8:57 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
>
>> Jesse Hires wrote:
>>
>>> I am getting warnings in hadoop.log that segments.gen and segments_2 are not
>>> directories, and as you can see by the listing, they are in fact files not
>>> directories. I'm not sure what stage of the process this is happening in, as
>>> I just now stumbled on them, but it concerns me that it says it is skipping
>>> something. Any ideas before I start digging further?
>>>
>>> 2009-11-30 08:28:56,344 WARN mapred.FileInputFormat - Can't open index at
>>> hdfs://nn1:9000/user/nutch/crawl/index1/segments.gen:0+2147483647, skipping.
>>>
>>
>> The most likely reason for this is that you defined your searcher.dir as
>> hdfs://nn1:9000/user/nutch/crawl/index1 - instead you should set it to
>> hdfs://nn1:9000/user/nutch/crawl . Please also note that names "index" and
>> "indexes" are magic - Lucene indexes must be located under one of these
>> names ("index" for a single merged index, and "indexes" for partial
>> indexes), otherwise they won't be found by the NutchBean (the search
>> component in Nutch). So e.g. your Lucene index in index1/ won't be found.
>>
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>> ___. ___ ___ ___ _ _ __________________________________
>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
>> ___|||__|| \| || | Embedded Unix, System Integration
>> http://www.sigram.com Contact: info at sigram dot com
>>
>>
>
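
Andrzej's point about the magic names quoted above can be illustrated with a small sketch. It only encodes what his reply states ("index" for a single merged index, "indexes" for partial indexes); the function name, the dictionary, and the example layout are hypothetical, and this is not how NutchBean is actually implemented.

```python
# The two names and their meanings come from Andrzej's reply.
# Everything else (function name, example layout) is an assumption.
MAGIC_NAMES = {
    "index": "single merged index",
    "indexes": "partial indexes",
}


def searchable_locations(child_names):
    """Return the children of searcher.dir that use one of the magic names."""
    return {name: MAGIC_NAMES[name] for name in child_names if name in MAGIC_NAMES}


# Hypothetical contents of the "crawl" directory described in this thread.
layout = ["crawldb", "linkdb", "segments", "index1", "index2"]
print(searchable_locations(layout))  # prints {}
```

With this layout nothing matches, which corresponds to the remark that a Lucene index in index1/ won't be found.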
Re: odd warnings
Posted by Jesse Hires <jh...@gmail.com>.
Thanks! Fixing how I was merging the indexes took care of the warning.
Jesse
On Tue, Dec 1, 2009 at 4:49 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
> Jesse Hires wrote:
>
>> What are "segments.gen" and "segments_2"?
>> The warning I am getting happens when I dedup two indexes.
>>
>> I create index1 and index2 through generate/fetch/index/...etc
>> index1 is an index of 1/2 the segments. index2 is an index of the other 1/2
>>
>> The warning is happening on both datanodes.
>>
>> The command I am running is "bin/nutch dedup crawl/index1 crawl/index2"
>>
>> If segments.gen and segments_2 are supposed to be directories, then why are
>> they created as files?
>>
>> They are created as files from the start
>> "bin/nutch index crawl/index1 crawl/crawldb /crawl/linkdb crawl/segments/XXX
>> crawl/segments/YYY"
>>
>> I don't see any errors or warnings about creating the index.
>>
>
> The command that you quote above produces multiple partial indexes, located
> in crawl/index1/part-NNNNN, and the Lucene indexes can be found only in these
> subdirectories. However, the deduplication process doesn't accept
> partial indexes, so you need to specify each part-NNNNN dir as an input to
> dedup.
>
Re: odd warnings
Posted by Andrzej Bialecki <ab...@getopt.org>.
Jesse Hires wrote:
> What are "segments.gen" and "segments_2"?
> The warning I am getting happens when I dedup two indexes.
>
> I create index1 and index2 through generate/fetch/index/...etc
> index1 is an index of 1/2 the segments. index2 is an index of the other 1/2
>
> The warning is happening on both datanodes.
>
> The command I am running is "bin/nutch dedup crawl/index1 crawl/index2"
>
> If segments.gen and segments_2 are supposed to be directories, then why are
> they created as files?
>
> They are created as files from the start
> "bin/nutch index crawl/index1 crawl/crawldb /crawl/linkdb crawl/segments/XXX
> crawl/segments/YYY"
>
> I don't see any errors or warnings about creating the index.
The command that you quote above produces multiple partial indexes,
located in crawl/index1/part-NNNNN, and the Lucene indexes can be found
only in these subdirectories. However, the deduplication process doesn't
accept partial indexes, so you need to specify each part-NNNNN dir as an
input to dedup.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
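
As a rough illustration of the advice above, the sketch below builds a dedup command line that names every part-NNNNN directory explicitly. The helper name and the example directory listing are hypothetical, and it assumes the child names of each index directory were collected beforehand (for example from the output of "bin/hadoop fs -ls"). It only assembles the argument list and does not run anything.

```python
import re

# Matches the part-NNNNN directory names mentioned in the reply above.
PART_DIR = re.compile(r"^part-\d+$")


def dedup_command(index_dirs):
    """Build a 'bin/nutch dedup' argument list naming each part directory.

    index_dirs maps an index directory to the names found directly inside it.
    """
    args = ["bin/nutch", "dedup"]
    for index_dir, children in index_dirs.items():
        for name in sorted(children):
            if PART_DIR.match(name):  # skip anything that is not a partial index
                args.append(f"{index_dir}/{name}")
    return args


# Hypothetical listing; the real part names depend on the indexing job.
listing = {
    "crawl/index1": ["part-00000", "part-00001"],
    "crawl/index2": ["part-00000"],
}
print(" ".join(dedup_command(listing)))
# prints: bin/nutch dedup crawl/index1/part-00000 crawl/index1/part-00001 crawl/index2/part-00000
```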