Posted to user@nutch.apache.org by Jesse Hires <jh...@gmail.com> on 2009/12/01 03:48:01 UTC

Re: odd warnings

What are "segments.gen" and "segments_2"?
The warning I am getting happens when I dedup two indexes.

I create index1 and index2 through generate/fetch/index/...etc
index1 is an index of 1/2 the segments. index2 is an index of the other 1/2

The warning is happening on both datanodes.

The command I am running is "bin/nutch dedup crawl/index1 crawl/index2"

If segments.gen and segments_2 are supposed to be directories, then why are
they created as files?

They are created as files from the start
"bin/nutch index crawl/index1 crawl/crawldb /crawl/linkdb crawl/segments/XXX
crawl/segments/YYY"

I don't see any errors or warnings about creating the index.

I'm using nutch 1.0, though it has been a bit since I've updated the sources
from the trunk.
I'm running one name node and two data nodes.

2009-11-30 18:25:23,497 WARN  mapred.FileInputFormat - Can't open index at
hdfs://nn1:9000/user/nutch/crawl/index2/segments_2:0+2147483647, skipping.
(hdfs://nn1:9000/user/nutch/crawl/index2/segments_2 not a directory)
2009-11-30 18:33:50,200 WARN  mapred.FileInputFormat - Can't open index at
hdfs://nn1:9000/user/nutch/crawl/index2/segments.gen:0+2147483647, skipping.
(hdfs://nn1:9000/user/nutch/crawl/index2/segments.gen not a directory)
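One way to pin down which job step emits these warnings is to grep the Hadoop log around the warning text and look at which job logged just before it. This is only a sketch; the log path is an assumption and will vary with your Hadoop/Nutch setup.

```shell
# Pull the FileInputFormat warnings, with two lines of leading context,
# out of the Hadoop log to see which job ran just before each one.
# logs/hadoop.log is an assumed path -- adjust to your log directory.
grep -n -B2 "Can't open index" logs/hadoop.log || true
```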


Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com



On Mon, Nov 30, 2009 at 9:30 AM, Jesse Hires <jh...@gmail.com> wrote:

> actually searcher.dir is still the default "crawl". The warnings are
> showing up either while indexing segments or merging indexes. I need to
> spend some time figuring out just where it is happening. I will look into
> it later tonight, work doesn't like my hobbies intruding. :)
>
> I may need some more info on "index" vs "indexes" later if you don't mind
> my asking some dumb questions about them, but thus far, things seem to be
> working in the manner I have it set up. With the exception of the warnings
> mentioned of course.
>
> The searching (or searchers) run out of a different directory and I run the
> indexes and segments for them locally on the individual nodes and I am
> getting search results back, which increase with every pass as expected.
>
>
>
> Jesse
>
> int GetRandomNumber()
> {
>    return 4; // Chosen by fair roll of dice
>                 // Guaranteed to be random
> } // xkcd.com
>
>
>
> On Mon, Nov 30, 2009 at 8:57 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
>
>> Jesse Hires wrote:
>>
>>> I am getting warnings in hadoop.log that segments.gen and segments_2 are
>>> not
>>> directories, and as you can see by the listing, they are in fact files
>>> not
>>> directories. I'm not sure what stage of the process this is happening in,
>>> as
>>> I just now stumbled on them, but it concerns me that it says it is
>>> skipping
>>> something. Any ideas before I start digging further?
>>>
>>>
>>>
>>>
>>> 2009-11-30 08:28:56,344 WARN  mapred.FileInputFormat - Can't open index
>>> at
>>> hdfs://nn1:9000/user/nutch/crawl/index1/segments.gen:0+2147483647,
>>> skipping.
>>>
>>
>> The most likely reason for this is that you defined your searcher.dir as
>> hdfs://nn1:9000/user/nutch/crawl/index1 - instead you should set it to
>> hdfs://nn1:9000/user/nutch/crawl . Please also note that the names "index"
>> and "indexes" are magic - Lucene indexes must be located under one of these
>> names ("index" for a single merged index, and "indexes" for partial
>> indexes), otherwise they won't be found by the NutchBean (the search
>> component in Nutch). So e.g. your Lucene index in index1/ won't be found.
>>
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>
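Putting the advice above together, the layout that searcher.dir is expected to point at would look roughly like this. This is only a sketch of the naming convention; the segment names are placeholders from the commands quoted in this thread.

```
crawl/               <- searcher.dir should point here, not at crawl/index1
  index/             <- single merged Lucene index ("magic" name), or
  indexes/           <- partial indexes: part-00000, part-00001, ...
  crawldb/
  linkdb/
  segments/
    XXX/
    YYY/
```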

Re: odd warnings

Posted by Jesse Hires <jh...@gmail.com>.
Thanks! Fixing how I was merging the indexes took care of the warning.
Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com



On Tue, Dec 1, 2009 at 4:49 AM, Andrzej Bialecki <ab...@getopt.org> wrote:

> Jesse Hires wrote:
>
>> What are "segments.gen" and "segments_2"?
>> The warning I am getting happens when I dedup two indexes.
>>
>> I create index1 and index2 through generate/fetch/index/...etc
>> index1 is an index of 1/2 the segments. index2 is an index of the other
>> 1/2
>>
>> The warning is happening on both datanodes.
>>
>> The command I am running is "bin/nutch dedup crawl/index1 crawl/index2"
>>
>> If segments.gen and segments_2 are supposed to be directories, then why
>> are
>> they created as files?
>>
>> They are created as files from the start
>> "bin/nutch index crawl/index1 crawl/crawldb /crawl/linkdb
>> crawl/segments/XXX
>> crawl/segments/YYY"
>>
>> I don't see any errors or warnings about creating the index.
>>
>
> The command that you quote above produces multiple partial indexes, located
> in crawl/index1/part-NNNNN, and the Lucene indexes can be found only in those
> subdirectories. However, the deduplication process doesn't discover partial
> indexes on its own, so you need to specify each part-NNNNN dir as an input to
> dedup.
>
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: odd warnings

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jesse Hires wrote:
> What are "segments.gen" and "segments_2"?
> The warning I am getting happens when I dedup two indexes.
> 
> I create index1 and index2 through generate/fetch/index/...etc
> index1 is an index of 1/2 the segments. index2 is an index of the other 1/2
> 
> The warning is happening on both datanodes.
> 
> The command I am running is "bin/nutch dedup crawl/index1 crawl/index2"
> 
> If segments.gen and segments_2 are supposed to be directories, then why are
> they created as files?
> 
> They are created as files from the start
> "bin/nutch index crawl/index1 crawl/crawldb /crawl/linkdb crawl/segments/XXX
> crawl/segments/YYY"
> 
> I don't see any errors or warnings about creating the index.

The command that you quote above produces multiple partial indexes, 
located in crawl/index1/part-NNNNN, and the Lucene indexes can be found 
only in those subdirectories. However, the deduplication process doesn't 
discover partial indexes on its own, so you need to specify each 
part-NNNNN dir as an input to dedup.
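A minimal sketch of that, using the paths from this thread (the actual part-NNNNN names are whatever the indexing job produced):

```shell
# Gather every partial index directory explicitly; dedup does not
# recurse into crawl/index1 itself, which is why it trips over the
# Lucene index metadata files (segments.gen, segments_N) at the top.
parts=$(ls -d crawl/index1/part-* crawl/index2/part-* 2>/dev/null)
# Only invoke dedup if the nutch launcher is actually present here.
if [ -x bin/nutch ]; then
  bin/nutch dedup $parts
fi
```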


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com