Posted to java-user@lucene.apache.org by Igor Bolotin <ib...@gmail.com> on 2006/03/26 09:53:10 UTC

Lucene indexing on Hadoop distributed file system

In my current project we needed a way to create very large Lucene indexes
on the Hadoop distributed file system. When we tried to do it directly on
DFS using the Nutch FsDirectory class, we immediately found that indexing
fails because the DfsIndexOutput.seek() method throws
UnsupportedOperationException. The reason for this behavior is clear: DFS
does not support random updates, so the seek() method can't be supported
(at least not easily).

Well, if we can't support random updates, the question is: do we really
need them? A search of the Lucene code revealed two places that call the
IndexOutput.seek() method: one in TermInfosWriter and the other in
CompoundFileWriter. As we weren't planning to use CompoundFileWriter, the
only place that concerned us was TermInfosWriter.

TermInfosWriter uses IndexOutput.seek() in its close() method to write the
total number of terms back into the beginning of the file. It was very
simple to change the file format slightly and write the number of terms
into the last 8 bytes of the file instead of the beginning. The only other
place that needs to change for this to work is the SegmentTermEnum
constructor, which must read this value at position = file length - 8.
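
For illustration, a minimal sketch of the idea; the helper class and method
names below are invented for this example and are not the actual
TermInfosWriter/SegmentTermEnum patch:

import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// Illustrative helper, not the actual patch: the term count is appended as
// the trailing 8 bytes instead of being seek()-ed back into the header.
class TrailingTermCount {

  // Write side (roughly what a modified TermInfosWriter.close() would do):
  // no IndexOutput.seek() is needed, the count is simply the last long.
  static void write(IndexOutput out, long termCount) throws IOException {
    out.writeLong(termCount);
    out.close();
  }

  // Read side (roughly what a modified SegmentTermEnum constructor would
  // do): seeking is supported on IndexInput, so read at length() - 8.
  static long read(IndexInput in) throws IOException {
    long pos = in.getFilePointer();   // remember current position
    in.seek(in.length() - 8);
    long termCount = in.readLong();
    in.seek(pos);                     // restore position for normal reading
    return termCount;
  }
}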

With this format hack we were able to use FsDirectory to write an index
directly to DFS without any problems. We still don't index directly to DFS
for performance reasons, but at least we can build small local indexes and
merge them into the main index on DFS without copying the big main index
back and forth.
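
As a rough sketch of that merge step (how the DFS-backed FsDirectory is
obtained is left out here, and the IndexWriter calls assume the Lucene
1.9/2.0 API):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch: fold small local indexes into the main index living on DFS,
// instead of copying the big index back and forth.
public class MergeLocalIndexesIntoDfs {

  public static void merge(Directory dfsDir, String[] localIndexPaths)
      throws IOException {
    // dfsDir would be a Nutch FsDirectory pointing at the main index on
    // DFS; its construction is left out of this sketch.
    IndexWriter writer = new IndexWriter(dfsDir, new StandardAnalyzer(), false);
    // Avoid CompoundFileWriter, which also relies on IndexOutput.seek().
    writer.setUseCompoundFile(false);

    Directory[] locals = new Directory[localIndexPaths.length];
    for (int i = 0; i < localIndexPaths.length; i++) {
      locals[i] = FSDirectory.getDirectory(localIndexPaths[i], false);
    }

    // addIndexes only writes new segment files on DFS, so apart from the
    // TermInfosWriter change above no random updates are required.
    writer.addIndexes(locals);
    writer.close();
  }
}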

If somebody is interested - I can post our changes in TermInfosWriter and
SegmentTermEnum code, although they are pretty trivial.

Best regards!
Igor

Re: Lucene indexing on Hadoop distributed file system

Posted by Raghavendra Prabhu <rr...@gmail.com>.
I would like to see Lucene operate with Hadoop.

As you rightly pointed out, writing to DFS using FsDirectory would be a
performance issue.

I am interested in the idea, but I don't know how much time I can
contribute, given the little spare time I have.

If anyone else is interested, please join in; we can work on this together.

Rgds
Prabhu



Re: Lucene indexing on Hadoop distributed file system

Posted by Doug Cutting <cu...@apache.org>.
Igor Bolotin wrote:
> Does it make sense to change TermInfosWriter.FORMAT in the patch?

Yes.  This should be updated for any change to the format of the file, 
and this certainly constitutes a format change.  This discussion should 
move to java-dev@lucene.apache.org...
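
For illustration only, a sketch of what the version bump buys at write and
read time; the constant name and value are placeholders, not the real
TermInfosWriter values:

import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// Placeholder constant and checks, for illustration only.
class TisFormatCheck {
  // Hypothetical value for the "count at end of file" layout; Lucene format
  // versions are negative integers that decrease with each format change.
  static final int FORMAT_TRAILING_COUNT = -3;

  // Write side: stamp the new version so old readers fail fast instead of
  // silently misreading the file.
  static void writeHeader(IndexOutput out) throws IOException {
    out.writeInt(FORMAT_TRAILING_COUNT);
  }

  // Read side: reject format versions this code does not understand.
  static int readHeader(IndexInput in) throws IOException {
    int format = in.readInt();
    if (format < FORMAT_TRAILING_COUNT) {
      throw new IOException("Unknown .tis format version: " + format);
    }
    return format;
  }
}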

Doug



Re: Lucene indexing on Hadoop distributed file system

Posted by Igor Bolotin <ib...@gmail.com>.
Does it make sense to change TermInfosWriter.FORMAT in the patch?

Igor



Re: Lucene indexing on Hadoop distributed file system

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Note that any change to the file format must be back-compatible.

This could be solved by putting a marker value (-1L) in the first 8 bytes,
indicating that the real term count is stored at the end of the file. That
way the new implementation would still be able to read old indexes.
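
A minimal sketch of that reading logic (class and method names are invented
for this example):

import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Sketch of the back-compatible read: -1L in the first 8 bytes marks the
// new layout with the real term count at the end; any other value is the
// count itself, written at the beginning as in old indexes.
class BackCompatTermCount {
  static long read(IndexInput in) throws IOException {
    long first = in.readLong();
    if (first != -1L) {
      return first;                  // old format: count stored up front
    }
    long pos = in.getFilePointer();
    in.seek(in.length() - 8);
    long count = in.readLong();      // new format: count stored at the end
    in.seek(pos);
    return count;
  }
}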

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Lucene indexing on Hadoop distributed file system

Posted by Doug Cutting <cu...@apache.org>.
Igor Bolotin wrote:
> If somebody is interested - I can post our changes in TermInfosWriter and
> SegmentTermEnum code, although they are pretty trivial.

Please submit this as a patch attached to a bug report.

I contemplated making this change to Lucene myself, when writing Nutch's 
FsDirectory, but thought that no one else would ever be interested in 
using it.  Now that's been proven wrong!

Note that any change to the file format must be back-compatible.

Doug
