Posted to user@nutch.apache.org by MilleBii <mi...@gmail.com> on 2009/12/12 10:47:26 UTC

Distributed Search problem

I'm trying to search directly from the index in HDFS, i.e. in distributed mode.

What am I doing wrong?

created nutch/conf/search-servers.txt with:
 localhost 8100

pointed searcher.dir in nutch-site.xml to nutch/conf

tried to start the search server with either:
 + nutch server 8100 crawl
 + nutch server 8100 hdfs://localhost:9000/user/nutch/crawl
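
For reference, searcher.dir is set in nutch-site.xml using the standard
Hadoop-style property block; a minimal sketch, assuming nutch/conf
expands to an absolute path such as /home/nutch/nutch/conf (an assumed
location, adjust to your install):

 <property>
   <name>searcher.dir</name>
   <value>/home/nutch/nutch/conf</value>
   <!-- directory holding search-servers.txt; the value above is an
        assumed path -->
 </property>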

The nutch server command doesn't return to the prompt.
Is this normal? Should I wait?

And of course, if I try a search, it doesn't work.

-- 
-MilleBii-

Re: Distributed Search problem

Posted by Dennis Kubes <ku...@apache.org>.
I wouldn't.  If you want to reparse or analyze that content later, you
are going to need the segments.  True, it saves space, but the content
is the most important part for further analysis.  If you know you are
not going to do any further analysis on it, then yes, it can be
deleted.
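
If you do decide to delete them, the removal is just a recursive HDFS
delete; a sketch, assuming a crawl directory named crawl and the old
hadoop dfs shell of that era (the segment name below is illustrative):

 bin/hadoop dfs -ls crawl/segments                  # list segment dirs
 bin/hadoop dfs -rmr crawl/segments/20091212104726  # remove one segment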

Dennis

MilleBii wrote:
> OK thx.  I can also remove the segments in HDFS, since I don't think
> they are used for further crawls or even during a merge of indexed
> segments?  That way I could save a lot of space by keeping only one
> copy of the segments data.

Re: Distributed Search problem

Posted by MilleBii <mi...@gmail.com>.
OK thx.  I can also remove the segments in HDFS, since I don't think
they are used for further crawls or even during a merge of indexed
segments?  That way I could save a lot of space by keeping only one
copy of the segments data.


2009/12/14 Dennis Kubes <ku...@apache.org>

> Index and segments are the minimum, yes.  You only need the segments
> for the indexes that you are serving on the local box.
>
> Dennis


-- 
-MilleBii-

Re: Distributed Search problem

Posted by Dennis Kubes <ku...@apache.org>.
Index and segments are the minimum, yes.  You only need the segments
for the indexes that you are serving on the local box.
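
Concretely, the parent directory handed to the search server only needs
those two pieces; a sketch of one possible local layout (names are
illustrative, following the usual Nutch crawl-dir conventions):

 /data/local-crawl/
   index/             merged Lucene index (or indexes/ with part-NNNNN)
   segments/
     20091212104726/  parse data the server reads for summaries and
                      cached content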

Dennis

MilleBii wrote:
> Ok, I don't per se need distributed search.
> I was trying to avoid a copy to the local file system, to save
> resources by working off HDFS.
>
> What is the minimum to copy over: index and segments?  Not crawldb?
> All data in segments?

Re: Distributed Search problem

Posted by MilleBii <mi...@gmail.com>.
Ok, I don't per se need distributed search.
I was trying to avoid a copy to the local file system, to save
resources by working off HDFS.

What is the minimum to copy over: index and segments?  Not crawldb?
All data in segments?

2009/12/13, Dennis Kubes <ku...@apache.org>:
> The assumption is wrong.  Distributed search is done from indexes on
> local file systems, not HDFS.
>
> It doesn't return because Lucene is trying to search across the
> indexes in HDFS in real time, which doesn't work because of network
> overhead.  Depending on the size of the indexes it may actually return
> after some time, but I have seen it time out even for small indexes.
>
> The short of it is: move the indexes and segments to a local file
> system, then point the distributed search server at their parent
> directory.  Something like this:
>
> bin/nutch server 8100 /full/path/to/parent/of/local/indexes
>
> It technically doesn't have to be a full path.  Then point
> searcher.dir to a directory containing search-servers.txt, as you have
> done.  The entries in search-servers.txt stay as you have them.
>
> Dennis


-- 
-MilleBii-

Re: Distributed Search problem

Posted by Dennis Kubes <ku...@apache.org>.
The assumption is wrong.  Distributed search is done from indexes on
local file systems, not HDFS.

It doesn't return because Lucene is trying to search across the indexes
in HDFS in real time, which doesn't work because of network overhead.
Depending on the size of the indexes it may actually return after some
time, but I have seen it time out even for small indexes.

The short of it is: move the indexes and segments to a local file
system, then point the distributed search server at their parent
directory.  Something like this:

bin/nutch server 8100 /full/path/to/parent/of/local/indexes

It technically doesn't have to be a full path.  Then point searcher.dir
to a directory containing search-servers.txt, as you have done.  The
entries in search-servers.txt stay as you have them.
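
To make that concrete: a sketch of the copy-then-serve sequence,
assuming a crawl dir named crawl in HDFS and a local target of
/data/local-crawl (both paths are illustrative):

 mkdir -p /data/local-crawl
 bin/hadoop dfs -copyToLocal crawl/index /data/local-crawl/index
 bin/hadoop dfs -copyToLocal crawl/segments /data/local-crawl/segments
 bin/nutch server 8100 /data/local-crawl

Then keep searcher.dir pointing at a directory whose search-servers.txt
lists localhost 8100, as you already have.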

Dennis

MilleBii wrote:
> I'm trying to search directly from the index in HDFS, i.e. in
> distributed mode.
>
> What am I doing wrong?
>
> created nutch/conf/search-servers.txt with:
>  localhost 8100
>
> pointed searcher.dir in nutch-site.xml to nutch/conf
>
> tried to start the search server with either:
>  + nutch server 8100 crawl
>  + nutch server 8100 hdfs://localhost:9000/user/nutch/crawl
>
> The nutch server command doesn't return to the prompt.
> Is this normal? Should I wait?
>
> And of course, if I try a search, it doesn't work.
>