Posted to common-user@hadoop.apache.org by Mark Kerzner <ma...@gmail.com> on 2009/03/13 05:38:17 UTC

Creating Lucene index in Hadoop

Hi,

How do I allow multiple nodes to write to the same index file in HDFS?

Thank you,
Mark

Re: Creating Lucene index in Hadoop

Posted by Ning Li <ni...@gmail.com>.
> Lucene on a local disk benefits significantly from the local filesystem's
> RAM cache (aka the kernel's buffer cache).  HDFS has no such local RAM cache
> outside of the stream's buffer.  The cache would need to be no larger than
> the kernel's buffer cache to get an equivalent hit ratio.  And if you're

If the two cache sizes are the same, then yes. It's just that the local
FS cache size is adjusted (more?) dynamically.


Cheers,
Ning

Re: Creating Lucene index in Hadoop

Posted by Doug Cutting <cu...@apache.org>.
Ning Li wrote:
> 1 is good. But for 2:
>   - Won't it have a security concern as well? Or is this not a general
> local cache?

A client-side RAM cache would be filled through the same security 
mechanisms as all other filesystem accesses.

>   - You are referring to caching in RAM, not caching in local FS,
> right? In general, a Lucene index size could be quite large. We may
> have to cache a lot of data to reach a reasonable hit ratio...

Lucene on a local disk benefits significantly from the local 
filesystem's RAM cache (aka the kernel's buffer cache).  HDFS has no 
such local RAM cache outside of the stream's buffer.  The cache would 
need to be no larger than the kernel's buffer cache to get an equivalent 
hit ratio.  And if you're accessing a remote index then you shouldn't 
also need a large buffer cache.

Doug

Re: Creating Lucene index in Hadoop

Posted by Ning Li <ni...@gmail.com>.
1 is good. But for 2:
  - Won't it raise a security concern as well? Or is this not a general
local cache?
  - You are referring to caching in RAM, not caching in the local FS,
right? In general, a Lucene index could be quite large. We may
have to cache a lot of data to reach a reasonable hit ratio...

Cheers,
Ning


On Mon, Mar 16, 2009 at 5:36 PM, Doug Cutting <cu...@apache.org> wrote:
> Ning Li wrote:
>>
>> With
>> http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
>> become feasible to search on HDFS directly.
>
> I don't think HADOOP-4801 is required.  It would help, certainly, but it's
> so fraught with security and other issues that I doubt it will be committed
> anytime soon.
>
> What would probably help HDFS random access performance for Lucene
> significantly would be:
>  1. A cache of connections to datanodes, so that each seek() does not
> require an open().  If we move HDFS data transfer to be RPC-based (see,
> e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will come
> for free, since RPC already caches connections.  We hope to do this for
> Hadoop 1.0, so that we use a single transport for all Hadoop's core
> operations, to simplify security.
>  2. A local cache of read-only HDFS data, equivalent to the kernel's buffer
> cache.  This might be implemented as a Lucene Directory that keeps an LRU
> cache of buffers from a wrapped filesystem, perhaps a subclass of
> RAMDirectory.
>
> With these, performance would still be slower than a local drive, but
> perhaps not so dramatically.
>
> Doug
>

Re: Creating Lucene index in Hadoop

Posted by Doug Cutting <cu...@apache.org>.
Ning Li wrote:
> With
> http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
> become feasible to search on HDFS directly.

I don't think HADOOP-4801 is required.  It would help, certainly, but 
it's so fraught with security and other issues that I doubt it will be 
committed anytime soon.

What would probably help HDFS random access performance for Lucene 
significantly would be:
  1. A cache of connections to datanodes, so that each seek() does not 
require an open().  If we move HDFS data transfer to be RPC-based (see, 
e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will 
come for free, since RPC already caches connections.  We hope to do this 
for Hadoop 1.0, so that we use a single transport for all Hadoop's core 
operations, to simplify security.
  2. A local cache of read-only HDFS data, equivalent to the kernel's buffer 
cache.  This might be implemented as a Lucene Directory that keeps an 
LRU cache of buffers from a wrapped filesystem, perhaps a subclass of 
RAMDirectory.

With these, performance would still be slower than a local drive, but 
perhaps not so dramatically.
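
As a rough illustration of point 2, here is a minimal sketch of the LRU
eviction mechanics using a plain LinkedHashMap keyed by file name and block
offset. The Lucene Directory and HDFS wiring are omitted, and the class and
method names are hypothetical rather than any existing API:

import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache of fixed-size blocks of read-only HDFS data. */
public class LruBlockCache {
    private final Map<String, byte[]> blocks;

    public LruBlockCache(final int maxBlocks) {
        // accessOrder=true makes LinkedHashMap track least-recently-used order.
        this.blocks = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxBlocks;  // evict the LRU block when full
            }
        };
    }

    // A cached block is identified by file name plus block-aligned offset.
    private static String key(String file, long blockStart) {
        return file + "@" + blockStart;
    }

    public synchronized byte[] get(String file, long blockStart) {
        return blocks.get(key(file, blockStart));
    }

    public synchronized void put(String file, long blockStart, byte[] data) {
        blocks.put(key(file, blockStart), data);
    }
}

A Directory wrapper along these lines would consult the cache in its read
path and only fall through to HDFS on a miss.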

Doug

Re: Creating Lucene index in Hadoop

Posted by Jason Venner <ja...@gmail.com>.
Check out Katta, as it can pull indexes from HDFS and deploy them into your
search cluster.
Katta also handles index directories that have been packed into a zip file.
Katta can pull indexes from any file system that Hadoop supports: hdfs, s3,
hftp, file, etc.

We have been doing this with our Solr (SOLR-1301) indexes and getting an 80%
reduction in size, which is a big gain for us.

I need to feed a two-line change back into SOLR-1301, as the close method can
currently fail to heartbeat while the optimize is happening in some
situations.
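
For what it's worth, packing an index directory into a single zip needs
nothing beyond java.util.zip; a minimal sketch, with illustrative paths and
no error handling:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipIndex {
    public static void main(String[] args) throws Exception {
        File indexDir = new File("/data/index-shard-0");  // illustrative path
        try (ZipOutputStream zip =
                 new ZipOutputStream(new FileOutputStream("/data/shard-0.zip"))) {
            // A Lucene index directory is flat, so one entry per file suffices.
            for (File f : indexDir.listFiles()) {
                zip.putNextEntry(new ZipEntry(f.getName()));
                try (FileInputStream in = new FileInputStream(f)) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) > 0) {
                        zip.write(buf, 0, n);
                    }
                }
                zip.closeEntry();
            }
        }
    }
}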


On Tue, Oct 6, 2009 at 9:30 PM, ctam <ct...@gmail.com> wrote:

>
> Hi Ning, I am also looking at different approaches to indexing with
> Hadoop. I could build indexes in HDFS using the Hadoop contrib package,
> but since HDFS is not designed for random access, what are the
> recommended ways to move the indexes to the local file system?
>
> Also, what would be the best approach to begin with? Should we look into
> Katta or Solr integrations?
>
> Thanks in advance.
>
>
> Ning Li-5 wrote:
> >
> >> I'm missing why you would ever want the Lucene index in HDFS for
> >> reading.
> >
> > The Lucene indexes are written to HDFS, but that does not mean you
> > conduct search on the indexes stored in HDFS directly. HDFS is not
> > designed for random access. Usually the indexes are copied to the
> > nodes where search will be served. With
> > http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
> > become feasible to search on HDFS directly.
> >
> > Cheers,
> > Ning
> >
> >
> > On Mon, Mar 16, 2009 at 4:52 PM, Ian Soboroff <ia...@nist.gov>
> > wrote:
> >>
> >> Does anyone have stats on how multiple readers on an optimized Lucene
> >> index in HDFS compare with a ParallelMultiReader (or whatever it's
> >> called) over RPC on a local filesystem?
> >>
> >> I'm missing why you would ever want the Lucene index in HDFS for
> >> reading.
> >>
> >> Ian
> >>
> >> Ning Li <ni...@gmail.com> writes:
> >>
> >>> I should have pointed out that the Nutch index build and contrib/index
> >>> target different applications. The latter is for applications that
> >>> simply want to build a Lucene index from a set of documents - e.g.,
> >>> with no link analysis.
> >>>
> >>> As to writing Lucene indexes, both work the same way: write the final
> >>> results to the local file system and then copy them to HDFS. In
> >>> contrib/index, the intermediate results are in memory and not written
> >>> to HDFS.
> >>>
> >>> Hope that clarifies things.
> >>>
> >>> Cheers,
> >>> Ning
> >>>
> >>>
> >>> On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff <ia...@nist.gov>
> >>> wrote:
> >>>>
> >>>> I understand why you would index in the reduce phase, because the
> >>>> anchor
> >>>> text gets shuffled to be next to the document.  However, when you
> index
> >>>> in the map phase, don't you just have to reindex later?
> >>>>
> >>>> The main point to the OP is that HDFS is a bad FS for writing Lucene
> >>>> indexes because of how Lucene works.  The simple approach is to write
> >>>> your index outside of HDFS in the reduce phase, and then merge the
> >>>> indexes from each reducer manually.
> >>>>
> >>>> Ian
> >>>>
> >>>> Ning Li <ni...@gmail.com> writes:
> >>>>
> >>>>> Or you can check out the index contrib. The difference between the
> >>>>> two is that:
> >>>>>   - In Nutch's indexing map/reduce job, indexes are built in the
> >>>>> reduce phase. Afterwards, they are merged into a smaller number of
> >>>>> shards if necessary. The last time I checked, the merge process did
> >>>>> not use map/reduce.
> >>>>>   - In contrib/index, small indexes are built in the map phase. They
> >>>>> are merged into the desired number of shards in the reduce phase. In
> >>>>> addition, they can be merged into existing shards.
> >>>>>
> >>>>> Cheers,
> >>>>> Ning
> >>>>>
> >>>>>
> >>>>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <im...@126.com> wrote:
> >>>>>> You can look at the Nutch code.
> >>>>>>
> >>>>>> 2009/3/13 Mark Kerzner <ma...@gmail.com>
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> How do I allow multiple nodes to write to the same index file in
> >>>>>>> HDFS?
> >>>>>>>
> >>>>>>> Thank you,
> >>>>>>> Mark
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
> >
> >
>
>
>


-- 
Pro Hadoop, a book to guide you from beginner to Hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals

Re: Creating Lucene index in Hadoop

Posted by ctam <ct...@gmail.com>.
Hi Ning, I am also looking at different approaches to indexing with Hadoop.
I could build indexes in HDFS using the Hadoop contrib package, but since
HDFS is not designed for random access, what are the recommended ways to
move the indexes to the local file system?

Also, what would be the best approach to begin with? Should we look into
Katta or Solr integrations?

Thanks in advance.


Ning Li-5 wrote:
> 
>> I'm missing why you would ever want the Lucene index in HDFS for
>> reading.
> 
> The Lucene indexes are written to HDFS, but that does not mean you
> conduct search on the indexes stored in HDFS directly. HDFS is not
> designed for random access. Usually the indexes are copied to the
> nodes where search will be served. With
> http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
> become feasible to search on HDFS directly.
> 
> Cheers,
> Ning
> 
> 
> On Mon, Mar 16, 2009 at 4:52 PM, Ian Soboroff <ia...@nist.gov>
> wrote:
>>
>> Does anyone have stats on how multiple readers on an optimized Lucene
>> index in HDFS compare with a ParallelMultiReader (or whatever it's
>> called) over RPC on a local filesystem?
>>
>> I'm missing why you would ever want the Lucene index in HDFS for
>> reading.
>>
>> Ian
>>
>> Ning Li <ni...@gmail.com> writes:
>>
>>> I should have pointed out that the Nutch index build and contrib/index
>>> target different applications. The latter is for applications that
>>> simply want to build a Lucene index from a set of documents - e.g.,
>>> with no link analysis.
>>>
>>> As to writing Lucene indexes, both work the same way: write the final
>>> results to the local file system and then copy them to HDFS. In
>>> contrib/index, the intermediate results are in memory and not written
>>> to HDFS.
>>>
>>> Hope that clarifies things.
>>>
>>> Cheers,
>>> Ning
>>>
>>>
>>> On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff <ia...@nist.gov>
>>> wrote:
>>>>
>>>> I understand why you would index in the reduce phase, because the
>>>> anchor
>>>> text gets shuffled to be next to the document.  However, when you index
>>>> in the map phase, don't you just have to reindex later?
>>>>
>>>> The main point to the OP is that HDFS is a bad FS for writing Lucene
>>>> indexes because of how Lucene works.  The simple approach is to write
>>>> your index outside of HDFS in the reduce phase, and then merge the
>>>> indexes from each reducer manually.
>>>>
>>>> Ian
>>>>
>>>> Ning Li <ni...@gmail.com> writes:
>>>>
>>>>> Or you can check out the index contrib. The difference between the
>>>>> two is that:
>>>>>   - In Nutch's indexing map/reduce job, indexes are built in the
>>>>> reduce phase. Afterwards, they are merged into a smaller number of
>>>>> shards if necessary. The last time I checked, the merge process did
>>>>> not use map/reduce.
>>>>>   - In contrib/index, small indexes are built in the map phase. They
>>>>> are merged into the desired number of shards in the reduce phase. In
>>>>> addition, they can be merged into existing shards.
>>>>>
>>>>> Cheers,
>>>>> Ning
>>>>>
>>>>>
>>>>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <im...@126.com> wrote:
>>>>>> You can look at the Nutch code.
>>>>>>
>>>>>> 2009/3/13 Mark Kerzner <ma...@gmail.com>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> How do I allow multiple nodes to write to the same index file in
>>>>>>> HDFS?
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Mark
>>>>>>>
>>>>>>
>>>>
>>>>
>>
>>
> 
> 



Re: Creating Lucene index in Hadoop

Posted by Ning Li <ni...@gmail.com>.
> I'm missing why you would ever want the Lucene index in HDFS for
> reading.

The Lucene indexes are written to HDFS, but that does not mean you
search the indexes stored in HDFS directly. HDFS is not designed for
random access. Usually the indexes are copied to the nodes where search
will be served (see the sketch below). With
http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
become feasible to search on HDFS directly.
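
A minimal sketch of that copy step using the Hadoop FileSystem API (the
paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FetchShard {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Pull a finished index shard out of HDFS onto the search node's
        // local disk; Lucene then opens it with an ordinary FSDirectory.
        fs.copyToLocalFile(new Path("/indexes/shard-0"),
                           new Path("/data/search/shard-0"));
        fs.close();
    }
}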

Cheers,
Ning


On Mon, Mar 16, 2009 at 4:52 PM, Ian Soboroff <ia...@nist.gov> wrote:
>
> Does anyone have stats on how multiple readers on an optimized Lucene
> index in HDFS compare with a ParallelMultiReader (or whatever it's
> called) over RPC on a local filesystem?
>
> I'm missing why you would ever want the Lucene index in HDFS for
> reading.
>
> Ian
>
> Ning Li <ni...@gmail.com> writes:
>
>> I should have pointed out that the Nutch index build and contrib/index
>> target different applications. The latter is for applications that
>> simply want to build a Lucene index from a set of documents - e.g.,
>> with no link analysis.
>>
>> As to writing Lucene indexes, both work the same way: write the final
>> results to the local file system and then copy them to HDFS. In
>> contrib/index, the intermediate results are in memory and not written
>> to HDFS.
>>
>> Hope that clarifies things.
>>
>> Cheers,
>> Ning
>>
>>
>> On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff <ia...@nist.gov> wrote:
>>>
>>> I understand why you would index in the reduce phase, because the anchor
>>> text gets shuffled to be next to the document.  However, when you index
>>> in the map phase, don't you just have to reindex later?
>>>
>>> The main point to the OP is that HDFS is a bad FS for writing Lucene
>>> indexes because of how Lucene works.  The simple approach is to write
>>> your index outside of HDFS in the reduce phase, and then merge the
>>> indexes from each reducer manually.
>>>
>>> Ian
>>>
>>> Ning Li <ni...@gmail.com> writes:
>>>
>>>> Or you can check out the index contrib. The difference between the two
>>>> is that:
>>>>   - In Nutch's indexing map/reduce job, indexes are built in the
>>>> reduce phase. Afterwards, they are merged into a smaller number of
>>>> shards if necessary. The last time I checked, the merge process did
>>>> not use map/reduce.
>>>>   - In contrib/index, small indexes are built in the map phase. They
>>>> are merged into the desired number of shards in the reduce phase. In
>>>> addition, they can be merged into existing shards.
>>>>
>>>> Cheers,
>>>> Ning
>>>>
>>>>
>>>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <im...@126.com> wrote:
>>>>> You can look at the Nutch code.
>>>>>
>>>>> 2009/3/13 Mark Kerzner <ma...@gmail.com>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>>>>
>>>>>> Thank you,
>>>>>> Mark
>>>>>>
>>>>>
>>>
>>>
>
>

Re: Creating Lucene index in Hadoop

Posted by Ian Soboroff <ia...@nist.gov>.
Does anyone have stats on how multiple readers on an optimized Lucene
index in HDFS compare with a ParallelMultiReader (or whatever it's
called) over RPC on a local filesystem?

I'm missing why you would ever want the Lucene index in HDFS for
reading.

Ian

Ning Li <ni...@gmail.com> writes:

> I should have pointed out that the Nutch index build and contrib/index
> target different applications. The latter is for applications that
> simply want to build a Lucene index from a set of documents - e.g.,
> with no link analysis.
>
> As to writing Lucene indexes, both work the same way: write the final
> results to the local file system and then copy them to HDFS. In
> contrib/index, the intermediate results are in memory and not written
> to HDFS.
>
> Hope that clarifies things.
>
> Cheers,
> Ning
>
>
> On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff <ia...@nist.gov> wrote:
>>
>> I understand why you would index in the reduce phase, because the anchor
>> text gets shuffled to be next to the document.  However, when you index
>> in the map phase, don't you just have to reindex later?
>>
>> The main point to the OP is that HDFS is a bad FS for writing Lucene
>> indexes because of how Lucene works.  The simple approach is to write
>> your index outside of HDFS in the reduce phase, and then merge the
>> indexes from each reducer manually.
>>
>> Ian
>>
>> Ning Li <ni...@gmail.com> writes:
>>
>>> Or you can check out the index contrib. The difference between the two
>>> is that:
>>>   - In Nutch's indexing map/reduce job, indexes are built in the
>>> reduce phase. Afterwards, they are merged into a smaller number of
>>> shards if necessary. The last time I checked, the merge process did
>>> not use map/reduce.
>>>   - In contrib/index, small indexes are built in the map phase. They
>>> are merged into the desired number of shards in the reduce phase. In
>>> addition, they can be merged into existing shards.
>>>
>>> Cheers,
>>> Ning
>>>
>>>
>>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <im...@126.com> wrote:
>>>> You can look at the Nutch code.
>>>>
>>>> 2009/3/13 Mark Kerzner <ma...@gmail.com>
>>>>
>>>>> Hi,
>>>>>
>>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>>>
>>>>> Thank you,
>>>>> Mark
>>>>>
>>>>
>>
>>


Re: Creating Lucene index in Hadoop

Posted by Ning Li <ni...@gmail.com>.
I should have pointed out that the Nutch index build and contrib/index
target different applications. The latter is for applications that simply
want to build a Lucene index from a set of documents - e.g., with no
link analysis.

As to writing Lucene indexes, both work the same way: write the final
results to the local file system and then copy them to HDFS. In
contrib/index, the intermediate results are kept in memory and not
written to HDFS.

Hope that clarifies things.

Cheers,
Ning


On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff <ia...@nist.gov> wrote:
>
> I understand why you would index in the reduce phase, because the anchor
> text gets shuffled to be next to the document.  However, when you index
> in the map phase, don't you just have to reindex later?
>
> The main point to the OP is that HDFS is a bad FS for writing Lucene
> indexes because of how Lucene works.  The simple approach is to write
> your index outside of HDFS in the reduce phase, and then merge the
> indexes from each reducer manually.
>
> Ian
>
> Ning Li <ni...@gmail.com> writes:
>
>> Or you can check out the index contrib. The difference between the two
>> is that:
>>   - In Nutch's indexing map/reduce job, indexes are built in the
>> reduce phase. Afterwards, they are merged into a smaller number of
>> shards if necessary. The last time I checked, the merge process did
>> not use map/reduce.
>>   - In contrib/index, small indexes are built in the map phase. They
>> are merged into the desired number of shards in the reduce phase. In
>> addition, they can be merged into existing shards.
>>
>> Cheers,
>> Ning
>>
>>
>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <im...@126.com> wrote:
>>> You can look at the Nutch code.
>>>
>>> 2009/3/13 Mark Kerzner <ma...@gmail.com>
>>>
>>>> Hi,
>>>>
>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>
>
>

Re: Creating Lucene index in Hadoop

Posted by Ian Soboroff <ia...@nist.gov>.
I understand why you would index in the reduce phase, because the anchor
text gets shuffled to be next to the document.  However, when you index
in the map phase, don't you just have to reindex later?

The main point to the OP is that HDFS is a bad FS for writing Lucene
indexes because of how Lucene works.  The simple approach is to write
your index outside of HDFS in the reduce phase, and then merge the
indexes from each reducer manually.
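
A minimal sketch of that manual merge using IndexWriter.addIndexes(); the
paths are illustrative, and IndexWriter construction varies across Lucene
versions (this uses the IndexWriterConfig style of recent releases):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeShards {
    // args: one local directory per reducer output
    public static void main(String[] args) throws Exception {
        Directory merged = FSDirectory.open(Paths.get("/data/index-merged"));
        IndexWriter writer =
            new IndexWriter(merged, new IndexWriterConfig(new StandardAnalyzer()));
        Directory[] parts = new Directory[args.length];
        for (int i = 0; i < args.length; i++) {
            parts[i] = FSDirectory.open(Paths.get(args[i]));
        }
        writer.addIndexes(parts);  // fold every per-reducer index into one
        writer.close();
    }
}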

Ian

Ning Li <ni...@gmail.com> writes:

> Or you can check out the index contrib. The difference between the two
> is that:
>   - In Nutch's indexing map/reduce job, indexes are built in the
> reduce phase. Afterwards, they are merged into a smaller number of
> shards if necessary. The last time I checked, the merge process did
> not use map/reduce.
>   - In contrib/index, small indexes are built in the map phase. They
> are merged into the desired number of shards in the reduce phase. In
> addition, they can be merged into existing shards.
>
> Cheers,
> Ning
>
>
> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <im...@126.com> wrote:
>> You can look at the Nutch code.
>>
>> 2009/3/13 Mark Kerzner <ma...@gmail.com>
>>
>>> Hi,
>>>
>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>
>>> Thank you,
>>> Mark
>>>
>>


Re: Creating Lucene index in Hadoop

Posted by Ning Li <ni...@gmail.com>.
Or you can check out the index contrib. The difference between the two is that:
  - In Nutch's indexing map/reduce job, indexes are built in the
reduce phase. Afterwards, they are merged into a smaller number of
shards if necessary. The last time I checked, the merge process did
not use map/reduce.
  - In contrib/index, small indexes are built in the map phase (see the
sketch below). They are merged into the desired number of shards in the
reduce phase. In addition, they can be merged into existing shards.
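
A rough sketch of the map-phase idea mentioned above; this is not the
actual contrib/index code (the Hadoop plumbing is omitted, the field name
is illustrative, and it uses Lucene 4-8 style APIs):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;

public class MapSideIndex {
    // Each map task builds a small index in memory; a reducer later
    // merges these intermediate indexes into the desired shards.
    public static RAMDirectory buildSmallIndex(Iterable<String> docs) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer =
            new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        for (String text : docs) {
            Document d = new Document();
            d.add(new TextField("body", text, Field.Store.NO));
            writer.addDocument(d);
        }
        writer.close();
        return dir;
    }
}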

Cheers,
Ning


On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <im...@126.com> wrote:
> You can look at the Nutch code.
>
> 2009/3/13 Mark Kerzner <ma...@gmail.com>
>
>> Hi,
>>
>> How do I allow multiple nodes to write to the same index file in HDFS?
>>
>> Thank you,
>> Mark
>>
>

Re: Creating Lucene index in Hadoop

Posted by 王红宝 <im...@126.com>.
You can look at the Nutch code.

2009/3/13 Mark Kerzner <ma...@gmail.com>

> Hi,
>
> How do I allow multiple nodes to write to the same index file in HDFS?
>
> Thank you,
> Mark
>