Posted to user@nutch.apache.org by MilleBii <mi...@gmail.com> on 2009/12/12 10:47:26 UTC
Distributed Search problem
I'm trying to search directly from the index in HDFS, i.e. in distributed mode.
What am I doing wrong?
I created nutch/conf/search-servers.txt with:
localhost 8100
I pointed search.dir in nutch-site.xml to nutch/conf.
I tried to start the search server with either:
+ nutch server 8100 crawl
+ nutch server 8100 hdfs://localhost:9000/user/nutch/crawl
The nutch server command doesn't return to the prompt.
Is this normal? Should I wait?
And of course, if I try a search, it doesn't work.
--
-MilleBii-
Re: Distributed Search problem
Posted by Dennis Kubes <ku...@apache.org>.
I wouldn't. If you want to reparse or analyze that content later, you
are going to need the segments. True, it saves space, but the content is
the most important part for further analysis. If you know you are not
going to do any further analysis on it, then yes, it can be deleted.
Dennis
MilleBii wrote:
> OK thx, can I also remove the segments in HDFS, since I don't think they
> are used for further crawls or even during the merge of indexed segments?
> That way I could save a lot of space by keeping only one copy of the
> segments data.
Re: Distributed Search problem
Posted by MilleBii <mi...@gmail.com>.
OK thx, can I also remove the segments in HDFS, since I don't think they
are used for further crawls or even during the merge of indexed segments?
That way I could save a lot of space by keeping only one copy of the
segments data.
2009/12/14 Dennis Kubes <ku...@apache.org>
> Index and segments are the minimum, yes. You only need the segments for the
> indexes that you are serving on the local box.
>
> Dennis
--
-MilleBii-
Re: Distributed Search problem
Posted by Dennis Kubes <ku...@apache.org>.
Index and segments are the minimum, yes. You only need the segments for
the indexes that you are serving on the local box.
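
A minimal sketch of that copy step, assuming the crawl directory in HDFS is called crawl and holds indexes and segments subdirectories. The local target /data/search/crawl and the variable names are hypothetical. It is a dry run by default and only prints the commands:

```shell
#!/bin/sh
# Dry run by default: each command is printed, not executed.
# Run with RUN= (empty) to execute the commands for real.
RUN="${RUN-echo}"

# Assumed locations, adjust to your setup.
HDFS_CRAWL="${HDFS_CRAWL:-crawl}"                 # crawl directory in HDFS
LOCAL_CRAWL="${LOCAL_CRAWL:-/data/search/crawl}"  # hypothetical local target

copy_search_data() {
  $RUN mkdir -p "$LOCAL_CRAWL"
  # Only the indexes and the segments are copied; crawldb stays in HDFS.
  for part in indexes segments; do
    $RUN bin/hadoop fs -copyToLocal "$HDFS_CRAWL/$part" "$LOCAL_CRAWL/$part"
  done
}

copy_search_data
```

The directory you would then hand to the search server is the local parent, here /data/search/crawl.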
Dennis
MilleBii wrote:
> OK, I don't per se need distributed search.
> I was trying to avoid a copy to the local file system, to save
> resources by working off HDFS.
>
> What is the minimum to copy over: index and segments? Not crawldb?
> All data in segments?
Re: Distributed Search problem
Posted by MilleBii <mi...@gmail.com>.
OK, I don't per se need distributed search.
I was trying to avoid a copy to the local file system, to save
resources by working off HDFS.
What is the minimum to copy over: index and segments? Not crawldb?
All data in segments?
2009/12/13, Dennis Kubes <ku...@apache.org>:
> The assumption is wrong. Distributed search is done from indexes on
> local file systems, not HDFS.
>
> It doesn't return because Lucene is trying to search across the indexes
> in HDFS in real time, which doesn't work because of network overhead.
> Depending on the size of the indexes it may actually return after some
> time, but I have seen it time out even for small indexes.
>
> The short of it is: move the indexes and segments to a local file system,
> then point the distributed search server at their parent directory.
> Something like this:
>
> bin/nutch server 8100 /full/path/to/parent/of/local/indexes
>
> It technically doesn't have to be a full path. Then point
> searcher.dir to a directory containing search-servers.txt, as you have done.
> Your search-servers.txt is fine as you have it.
>
> Dennis
--
-MilleBii-
Re: Distributed Search problem
Posted by Dennis Kubes <ku...@apache.org>.
The assumption is wrong. Distributed search is done from indexes on
local file systems, not HDFS.
It doesn't return because Lucene is trying to search across the indexes
in HDFS in real time, which doesn't work because of network overhead.
Depending on the size of the indexes it may actually return after some
time, but I have seen it time out even for small indexes.
The short of it is: move the indexes and segments to a local file system,
then point the distributed search server at their parent directory.
Something like this:
bin/nutch server 8100 /full/path/to/parent/of/local/indexes
It technically doesn't have to be a full path. Then point
searcher.dir to a directory containing search-servers.txt, as you have done.
Your search-servers.txt is fine as you have it.
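
A rough sketch of the configuration side, assuming a single search server on localhost port 8100 as in the question. The directory ./search-conf, the local data path, and the file name searcher-dir.xml are hypothetical. The script writes search-servers.txt and a searcher.dir property block to merge into nutch-site.xml by hand, then prints the server command instead of running it:

```shell
#!/bin/sh
set -e

# Hypothetical locations, adjust to your setup.
CONF_DIR="${CONF_DIR:-./search-conf}"             # directory searcher.dir will point at
LOCAL_CRAWL="${LOCAL_CRAWL:-/data/search/crawl}"  # local parent of the indexes and segments

mkdir -p "$CONF_DIR"
CONF_ABS="$(cd "$CONF_DIR" && pwd)"               # absolute form of CONF_DIR

# search-servers.txt: one "host port" pair per line.
printf '%s %s\n' localhost 8100 > "$CONF_ABS/search-servers.txt"

# Property block to merge into nutch-site.xml by hand.
cat > "$CONF_ABS/searcher-dir.xml" <<EOF
<property>
  <name>searcher.dir</name>
  <value>$CONF_ABS</value>
</property>
EOF

# Printed rather than executed; start it yourself once the local data is in place.
echo "bin/nutch server 8100 $LOCAL_CRAWL"
```

With more than one search server, each would get its own line in search-servers.txt.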
Dennis
MilleBii wrote:
> I'm trying to search directly from the index in HDFS, i.e. in distributed mode.
>
> What am I doing wrong?
>
> I created nutch/conf/search-servers.txt with:
> localhost 8100
>
> I pointed search.dir in nutch-site.xml to nutch/conf.
>
> I tried to start the search server with either:
> + nutch server 8100 crawl
> + nutch server 8100 hdfs://localhost:9000/user/nutch/crawl
>
> The nutch server command doesn't return to the prompt.
> Is this normal? Should I wait?
>
> And of course, if I try a search, it doesn't work.