Posted to java-user@lucene.apache.org by Michael Sokolov <ms...@safaribooksonline.com> on 2013/10/11 19:03:03 UTC

external file stored field codec

I've been running some tests comparing storing large fields (documents, 
say 100K .. 10M) as files vs. storing them in Lucene as stored fields.  
Initial results seem to indicate storing them externally is a win (at 
least for binary docs which don't compress, and presumably we can 
compress the external files if we want, too), which seems to make 
sense.  There will be some issues with huge directories, but that might 
be worth solving.
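For concreteness, here's roughly the kind of thing I'm comparing -- just a
quick sketch, with made-up field names, sizes and paths (classes from
org.apache.lucene.document and java.nio.file):

// store the blob inside Lucene as a binary stored field
Document doc = new Document();
doc.add(new StringField("id", "book-123", Field.Store.YES));
doc.add(new StoredField("content", pdfBytes));          // 100K .. 10M byte[]
writer.addDocument(doc);

// vs. write the blob to an external file and store only a reference to it
Path blob = Paths.get("blobs", "book-123.pdf");
Files.write(blob, pdfBytes);
Document doc2 = new Document();
doc2.add(new StringField("id", "book-123", Field.Store.YES));
doc2.add(new StoredField("contentRef", blob.toString()));
writer.addDocument(doc2);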

So I'm wondering if there is a codec that does that?  I haven't seen one 
talked about anywhere.

-Mike



Re: external file stored field codec

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 10/18/2013 1:08 AM, Shai Erera wrote:
>> The codec intercepts merges in order to clean up files that are no longer
>> referenced
>>
> What happens if a document is deleted while there's a reader open on the
> index, and the segments are merged? Maybe I misunderstand what you meant by
> this statement, but if the external file is deleted, since the document is
> "pruned" from the index, how will the reader be able to read the stored
> fields from it? How do you track references to the external files?
Right now you get a FileNotFoundException, or a missing field value, 
depending on how you configure the codec.  I believe the tests probably 
pass only because they don't test for the missing field value.  
Certainly I have a test (like the one you wrote, but that checks the 
field value explicitly) that exposes this problem.  My reasoning was 
that this is similar to the situation with NFS: the user has to be aware 
of the situation and deal with it by having an IndexDeletionPolicy that 
maintains old commits.  I don't see what else can be done without some 
(possibly heavyweight) additional tracking/garbage collection mechanism.
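For what it's worth, the kind of policy I mean is along these lines -- a
minimal sketch that keeps the last N commits alive so readers opened on
older commits can still resolve their external files (written against the
IndexDeletionPolicy API as I understand it, not the actual code):

import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

class KeepLastNCommitsPolicy extends IndexDeletionPolicy {
  private final int keep;
  KeepLastNCommitsPolicy(int keep) { this.keep = keep; }

  @Override
  public void onInit(List<? extends IndexCommit> commits) { onCommit(commits); }

  @Override
  public void onCommit(List<? extends IndexCommit> commits) {
    // commits are ordered oldest first; drop everything but the newest 'keep'
    for (int i = 0; i < commits.size() - keep; i++) {
      commits.get(i).delete();
    }
  }
}

// usage: indexWriterConfig.setIndexDeletionPolicy(new KeepLastNCommitsPolicy(2));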

In our case (document archive), this behavior may be acceptable, but 
it's certainly one of the main areas that concerns me.  It would be nice 
if it were possible to receive an event when all outstanding readers for 
a commit were closed: that way we could clean up then instead of at the 
time of the commit, but I don't think this is how Lucene works?  At 
least I couldn't see how to do that, and given the discussion in 
IndexDeletionPolicy about NFS, I assumed that wasn't possible.

Another unsolved problem is how to clean up empty segments. Normally a 
merge disposes of them by simply not copying them, but in our case we 
have to actively delete the external files.  I haven't looked at this 
carefully yet, but I have a couple of ideas: one is to use the Lucene 
docids as part of the filename, the idea being that as those are 
re-assigned, we would rename the files, unlinking the old ones with the 
same docid in the process.  But I'm not totally clear on how the docid 
renumbering works, so I'm not sure whether that would be feasible.  
Another idea is to use filesystem hard linking in some way as a 
reference counting mechanism, but that would restrict this to Java 7. 
Finally, I suppose it's possible to build some data structure that 
actively manages the file references.
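To illustrate the hard-linking idea, here's a rough sketch of what a merge
could do instead of copying a surviving file (Java 7 NIO; the layout and
class name are made up):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class BlobLinker {
  // link the surviving blob into the new segment's folder: same inode, no
  // bytes copied; the filesystem reclaims the data only when the last link
  // to it is removed
  static void moveByLink(Path oldFile, Path newFile) throws IOException {
    Files.createDirectories(newFile.getParent());
    Files.createLink(newFile, oldFile);
    Files.deleteIfExists(oldFile);
  }
}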

I guess my initial concern was with testing performance to see if it was 
even worth trying to solve these problems.  Now I think it is, but they 
are not necessarily easy to solve.

-Mike



Re: external file stored field codec

Posted by Shai Erera <se...@gmail.com>.
>
> The codec intercepts merges in order to clean up files that are no longer
> referenced
>

What happens if a document is deleted while there's a reader open on the
index, and the segments are merged? Maybe I misunderstand what you meant by
this statement, but if the external file is deleted, since the document is
"pruned" from the index, how will the reader be able to read the stored
fields from it? How do you track references to the external files?

Since you write that all tests in the o.a.l.index package pass, I assume
you handle this, but here's a simple testcase I have in mind:

IndexWriter writer = new IndexWriter(dir, configWithNewCodec());
writer.addDocument(docWithStoredFields("doc1"));
writer.addDocument(docWithStoredFields("doc2"));
writer.commit();
writer.addDocument(docWithStoredFields("doc3"));
writer.addDocument(docWithStoredFields("doc4"));
DirectoryReader reader = DirectoryReader.open(writer, true);
writer.deleteDocuments(new Term("id", "doc1"));
writer.deleteDocuments(new Term("id", "doc4"));
writer.forceMerge(1);
writer.close();
// the reader was opened before the deletes and the merge, so it should still
// see doc1 and doc4; docIdOf() is shorthand for looking up the internal doc id
// of the document whose "id" field matches the given value
System.out.println(reader.document(docIdOf(reader, "doc1")));
System.out.println(reader.document(docIdOf(reader, "doc4")));

Does this test pass?

Shai


On Fri, Oct 18, 2013 at 7:14 AM, Michael Sokolov <
msokolov@safaribooksonline.com> wrote:

> On 10/13/13 8:09 PM, Michael Sokolov wrote:
>
>> On 10/13/2013 1:52 PM, Adrien Grand wrote:
>>
>>> Hi Michael,
>>>
>>> I'm not aware enough of operating system internals to know what
>>> exactly happens when a file is open, but it sounds to me like having
>>> separate files per document or field adds levels of indirection when
>>> loading stored fields, so I would be surprised if it actually proved
>>> to be more efficient than storing everything in a single file.
>>>
>> That's true, Adrien, there's definitely a cost to using files. There
>> are some gnarly challenges in here (mostly to do with the large number of
>> files, as you say, and with cleaning up after deletes - deletion is always
>> hard).  I'm not sure it's going to be possible to both clean up and
>> maintain files for stale commits; this will become problematic in the way
>> that having index files on NFS mounts is problematic.
>>
>> I think the hope is that there will be countervailing savings during
>> writes and merges (mostly) because we may be able to cleverly avoid copying
>> the contents of stored fields being merged.  There may also be savings when
>> querying due to reduced RAM requirements since the large stored fields
>> won't be paged in while performing queries.  As I said, some simple tests
>> do show improvements under at least some circumstances, so I'm pursuing
>> this a bit further.  I have a preliminary implementation as a codec now,
>> and I'm learning a bit about Lucene's index internals. BTW SimpleTextCodec
>> is a great tool for learning and debugging.
>>
>> The background for this is a document store with large files (think PDFs,
>> but lots of formats) that have to be tracked, and have associated metadata.
>>  We've been storing these externally, but it would be beneficial to have a
>> single data management layer: i.e. to push this down into Lucene, for a
>> variety of reasons.  For one, we could rely on Solr to do our replication
>> for us.
>>
>> I'll post back when I have some measurements.
>>
>> -Mike
>>
> This idea actually does seem to be working out pretty nicely.  I compared
> time to write and then to read documents that included a couple of small
> indexed fields and a binary stored field that varied in size.  Writing to
> external files, via the FSFieldCodec, was 3-20 times faster than writing to
> the index in the normal way (using MMapDirectory).  Reading was sometimes
> faster and sometimes slower. I also measured time for a forceMerge(1) at
> the end of each test: this was almost always nearly zero when binaries were
> external, and grew larger with more data in the normal case.  I believe the
> improvements we're seeing here result largely from removing the bulk of the
> data from the merge I/O path.
>
> As with any performance measurements, a lot of factors can affect the
> measurements, but this effect seems pretty robust across the conditions I
> measured (different file sizes, numbers of files, and frequency of commits,
> with lots of repetition).  One oddity is a large difference between Mac SSD
> filesystem (15-20x writing, reading 0.6x)  via FSFieldCodec) and Linux ext4
> HD filesystem (3-4x writing, 1.5x reading).
>
> The codec works as a wrapper around another codec (like the compressing
> codecs), intercepting binary and string stored fields larger than a
> configurable threshold, and storing a file number as a reference in the
> main index which then functions kind of like a symlink.  The codec
> intercepts merges in order to clean up files that are no longer referenced,
> taking special care to preserve the ability of the other codecs to perform
> bulk merges.  The codec passes all the Lucene unit tests in the o.a.l.index
> package.
>
> The implementation is still very experimental; there are lots of details
> to be worked out: for example, I haven't yet measured the performance
> impact of deletions, which could be pretty significant. It would be really
> great if someone with intimate knowledge of Lucene's indexing internals
> were able to review it: I'd be happy to share the code and my list of
> TODO's and questions if there's any interest, but at least I thought it
> would be interesting to know that the approach does seem to be worth
> pursuing.
>
> -Mike
>

Re: external file stored field codec

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 10/13/13 8:09 PM, Michael Sokolov wrote:
> On 10/13/2013 1:52 PM, Adrien Grand wrote:
>> Hi Michael,
>>
>> I'm not aware enough of operating system internals to know what
>> exactly happens when a file is open, but it sounds to me like having
>> separate files per document or field adds levels of indirection when
>> loading stored fields, so I would be surprised if it actually proved
>> to be more efficient than storing everything in a single file.
>>
> That's true, Adrien, there's definitely a cost to using files. There 
> are some gnarly challenges in here (mostly to do with the large number 
> of files, as you say, and with cleaning up after deletes - deletion is 
> always hard).  I'm not sure it's going to be possible to both clean up 
> and maintain files for stale commits; this will become problematic in 
> the way that having index files on NFS mounts is problematic.
>
> I think the hope is that there will be countervailing savings during 
> writes and merges (mostly) because we may be able to cleverly avoid 
> copying the contents of stored fields being merged.  There may also be 
> savings when querying due to reduced RAM requirements since the large 
> stored fields won't be paged in while performing queries.  As I said, 
> some simple tests do show improvements under at least some 
> circumstances, so I'm pursuing this a bit further.  I have a 
> preliminary implementation as a codec now, and I'm learning a bit 
> about Lucene's index internals. BTW SimpleTextCodec is a great tool 
> for learning and debugging.
>
> The background for this is a document store with large files (think 
> PDFs, but lots of formats) that have to be tracked, and have 
> associated metadata.  We've been storing these externally, but it 
> would be beneficial to have a single data management layer: i.e. to 
> push this down into Lucene, for a variety of reasons.  For one, we 
> could rely on Solr to do our replication for us.
>
> I'll post back when I have some measurements.
>
> -Mike
This idea actually does seem to be working out pretty nicely.  I 
compared time to write and then to read documents that included a couple 
of small indexed fields and a binary stored field that varied in size.  
Writing to external files, via the FSFieldCodec, was 3-20 times faster 
than writing to the index in the normal way (using MMapDirectory).  
Reading was sometimes faster and sometimes slower. I also measured time 
for a forceMerge(1) at the end of each test: this was almost always 
nearly zero when binaries were external, and grew larger with more data 
in the normal case.  I believe the improvements we're seeing here result 
largely from removing the bulk of the data from the merge I/O path.

As with any performance measurements, a lot of factors can affect the 
measurements, but this effect seems pretty robust across the conditions 
I measured (different file sizes, numbers of files, and frequency of 
commits, with lots of repetition).  One oddity is a large difference 
between Mac SSD filesystem (15-20x writing, 0.6x reading via 
FSFieldCodec) and Linux ext4 HD filesystem (3-4x writing, 1.5x reading).

The codec works as a wrapper around another codec (like the compressing 
codecs), intercepting binary and string stored fields larger than a 
configurable threshold, and storing a file number as a reference in the 
main index which then functions kind of like a symlink.  The codec 
intercepts merges in order to clean up files that are no longer 
referenced, taking special care to preserve the ability of the other 
codecs to perform bulk merges.  The codec passes all the Lucene unit 
tests in the o.a.l.index package.
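To give a flavor of the interception, here's a very rough sketch of the
write-side logic; the interfaces below are hypothetical stand-ins for the
wrapped codec's stored-fields writer and the external file store, not the
real Lucene codec APIs and not the actual FSFieldCodec code:

import java.io.IOException;
import org.apache.lucene.util.BytesRef;

interface ExternalFileStore {           // hypothetical: writes a blob to its own file
  long write(BytesRef value) throws IOException;   // returns the file number
}

interface InlineFieldWriter {           // hypothetical stand-in for the wrapped codec
  void writeBinary(String name, BytesRef value) throws IOException;
  void writeLong(String name, long value) throws IOException;
}

class ThresholdFieldWriter {
  private final int threshold;            // e.g. 16 * 1024 bytes
  private final ExternalFileStore files;
  private final InlineFieldWriter delegate;

  ThresholdFieldWriter(int threshold, ExternalFileStore files, InlineFieldWriter delegate) {
    this.threshold = threshold; this.files = files; this.delegate = delegate;
  }

  void writeField(String name, BytesRef value) throws IOException {
    if (value.length >= threshold) {
      // large value: the blob goes to an external file and the index keeps
      // only the file number, which acts like a symlink
      delegate.writeLong(name, files.write(value));
    } else {
      // small value: stored inline by the wrapped codec as usual
      delegate.writeBinary(name, value);
    }
  }
}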

The implementation is still very experimental; there are lots of details 
to be worked out: for example, I haven't yet measured the performance 
impact of deletions, which could be pretty significant. It would be 
really great if someone with intimate knowledge of Lucene's indexing 
internals were able to review it: I'd be happy to share the code and my 
list of TODO's and questions if there's any interest, but at least I 
thought it would be interesting to know that the approach does seem to 
be worth pursuing.

-Mike



Re: external file stored field codec

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 10/13/2013 1:52 PM, Adrien Grand wrote:
> Hi Michael,
>
> I'm not aware enough of operating system internals to know what
> exactly happens when a file is open, but it sounds to me like having
> separate files per document or field adds levels of indirection when
> loading stored fields, so I would be surprised if it actually proved
> to be more efficient than storing everything in a single file.
>
That's true, Adrien, there's definitely a cost to using files. There are 
some gnarly challenges in here (mostly to do with the large number of 
files, as you say, and with cleaning up after deletes - deletion is 
always hard).  I'm not sure it's going to be possible to both clean up 
and maintain files for stale commits; this will become problematic in 
the way that having index files on NFS mounts is problematic.

I think the hope is that there will be countervailing savings during 
writes and merges (mostly) because we may be able to cleverly avoid 
copying the contents of stored fields being merged.  There may also be 
savings when querying due to reduced RAM requirements since the large 
stored fields won't be paged in while performing queries.  As I said, 
some simple tests do show improvements under at least some 
circumstances, so I'm pursuing this a bit further.  I have a preliminary 
implementation as a codec now, and I'm learning a bit about Lucene's 
index internals.  BTW SimpleTextCodec is a great tool for learning and 
debugging.

The background for this is a document store with large files (think 
PDFs, but lots of formats) that have to be tracked, and have associated 
metadata.  We've been storing these externally, but it would be 
beneficial to have a single data management layer: i.e. to push this 
down into Lucene, for a variety of reasons.  For one, we could rely on 
Solr to do our replication for us.

I'll post back when I have some measurements.

-Mike



Re: external file stored field codec

Posted by Adrien Grand <jp...@gmail.com>.
Hi Michael,

I'm not aware enough of operating system internals to know what
exactly happens when a file is open, but it sounds to me like having
separate files per document or field adds levels of indirection when
loading stored fields, so I would be surprised if it actually proved
to be more efficient than storing everything in a single file.

-- 
Adrien



Re: external file stored field codec

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 10/11/2013 03:19 PM, Michael Sokolov wrote:
> On 10/11/2013 03:04 PM, Adrien Grand wrote:
>> On Fri, Oct 11, 2013 at 7:03 PM, Michael Sokolov
>> <ms...@safaribooksonline.com> wrote:
>>> I've been running some tests comparing storing large fields 
>>> (documents, say
>>> 100K .. 10M) as files vs. storing them in Lucene as stored fields.  
>>> Initial
>>> results seem to indicate storing them externally is a win (at least for
>>> binary docs which don't compress, and presumably we can compress the
>>> external files if we want, too), which seems to make sense. There 
>>> will be
>>> some issues with huge directories, but that might be worth solving.
>>>
>>> So I'm wondering if there is a codec that does that?  I haven't seen 
>>> one
>>> talked about anywhere.
>> I don't know about any codec that works this way but such a codec
>> would quickly exceed the number of available file descriptors.
>>
> I'm not sure I understand.  I was thinking that the stored fields 
> would be accessed infrequently (only when writing or reading the 
> particular stored field value), and the file descriptor would only be 
> in use during the read/write operation - they wouldn't be held open.  
> So for example during query scoring one wouldn't need to visit these 
> fields I think? But I may have a fundamental misunderstanding about 
> how Lucene uses its codecs: this is new to me.
>
> -Mike
My thought was to keep a folder hierarchy (per-segment, I think) to 
avoid too many files in a folder -- maybe that's the problem you were 
referring to, Adrien?  But there is a real problem in that there isn't 
sufficient information available when merging to avoid copying files, it 
seems. It would be nice to hard link a file in order to move it to a new 
segment.  I think without that the gains will be much less attractive.
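Something like this is the kind of layout I have in mind -- just a sketch;
the names and the fan-out scheme are made up, not a settled design:

import java.nio.file.Path;

class ExternalFieldPaths {
  // e.g. index/fsfields/_3/2a/42.bin: one folder per segment plus a fan-out
  // level derived from the file number, so no single directory grows huge
  static Path pathFor(Path indexRoot, String segmentName, long fileNumber) {
    String bucket = String.format("%02x", fileNumber & 0xff);
    return indexRoot.resolve("fsfields")
                    .resolve(segmentName)
                    .resolve(bucket)
                    .resolve(fileNumber + ".bin");
  }
}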

-Mike



Re: external file stored field codec

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 10/11/2013 03:04 PM, Adrien Grand wrote:
> On Fri, Oct 11, 2013 at 7:03 PM, Michael Sokolov
> <ms...@safaribooksonline.com> wrote:
>> I've been running some tests comparing storing large fields (documents, say
>> 100K .. 10M) as files vs. storing them in Lucene as stored fields.  Initial
>> results seem to indicate storing them externally is a win (at least for
>> binary docs which don't compress, and presumably we can compress the
>> external files if we want, too), which seems to make sense.  There will be
>> some issues with huge directories, but that might be worth solving.
>>
>> So I'm wondering if there is a codec that does that?  I haven't seen one
>> talked about anywhere.
> I don't know about any codec that works this way but such a codec
> would quickly exceed the number of available file descriptors.
>
I'm not sure I understand.  I was thinking that the stored fields would 
be accessed infrequently (only when writing or reading the particular 
stored field value), and the file descriptor would only be in use during 
the read/write operation - they wouldn't be held open.  So for example 
during query scoring one wouldn't need to visit these fields I think? 
But I may have a fundamental misunderstanding about how Lucene uses its 
codecs: this is new to me.
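In other words, something like this, where the descriptor only lives for the
duration of the call (a sketch, not codec code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class ExternalFieldReader {
  // the file is opened, read, and closed inside this call;
  // nothing is held open per document or per field
  static byte[] readFieldValue(Path file) throws IOException {
    return Files.readAllBytes(file);
  }
}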

-Mike



Re: external file stored field codec

Posted by Adrien Grand <jp...@gmail.com>.
On Fri, Oct 11, 2013 at 7:03 PM, Michael Sokolov
<ms...@safaribooksonline.com> wrote:
> I've been running some tests comparing storing large fields (documents, say
> 100K .. 10M) as files vs. storing them in Lucene as stored fields.  Initial
> results seem to indicate storing them externally is a win (at least for
> binary docs which don't compress, and presumably we can compress the
> external files if we want, too), which seems to make sense.  There will be
> some issues with huge directories, but that might be worth solving.
>
> So I'm wondering if there is a codec that does that?  I haven't seen one
> talked about anywhere.

I don't know about any codec that works this way but such a codec
would quickly exceed the number of available file descriptors.

-- 
Adrien
