
Posted to dev@lucene.apache.org by Andi Vajda <an...@osafoundation.org> on 2004/09/25 21:13:23 UTC

DbDirectory and compound files

The compound file storage implementation that became the default with Lucene 
1.4 does not seem to work well with the Berkeley DB-based implementation of 
org.apache.lucene.store.Directory that I submitted to the sandbox about 9 months 
ago. Even though this implementation, org.apache.lucene.store.db.DbDirectory, 
and its related classes still conform to the interface defined by 
Directory and its related classes, there seem to be 
some apparently random failures that I think are related to implied flush 
semantics.

I haven't looked into this any further yet as I'm wondering what sense using 
this compound file storage feature makes when using Berkeley DB storage since,
in a way, Berkeley DB files are compound files themselves.

The obvious workaround is to call IndexWriter.setUseCompoundFile(false), but 
this workaround needs to be documented to be effective.

So, my question: why is the compound file storage implemented in such an 
orthogonal to Directory way instead of just being another Directory 
implementation called FSCompoundFileDirectory ?

Andi..

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: DbDirectory and compound files

Posted by Doug Cutting <cu...@apache.org>.

Andi Vajda wrote:
> This point about buffering brings up another point. Currently, there is 
> no public way to tell the open IndexWriter to flush its Directory. This 
> makes it difficult to use several transactions during the lifetime of 
> the IndexWriter.

The only operation in the Directory API that is required to be atomic, and 
that is used to implement commit, is renameFile().  It is called once 
all IndexOutputs have been closed.  Immediately after the rename 
completes is the appropriate place to commit a transaction.  Perhaps 
this should be more explicit in the API...
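
A minimal sketch of what committing right after the rename could look like. 
Everything here is a hypothetical stand-in rather than the real Lucene or 
Berkeley DB API: SketchTxn and CommitOnRenameDirectory are invented names, and 
files simply live in an in-memory map. The only point is the ordering: outputs 
are closed first, the rename happens, then the transaction is committed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a database transaction; not a Berkeley DB class.
interface SketchTxn {
    void commit();
}

// Hypothetical, heavily reduced directory: only what the sketch needs.
class CommitOnRenameDirectory {
    private final Map<String, byte[]> files = new HashMap<>();
    private final List<String> openOutputs = new ArrayList<>();
    private final SketchTxn txn;

    CommitOnRenameDirectory(SketchTxn txn) {
        this.txn = txn;
    }

    // Models opening an output for a new file.
    void createOutput(String name) {
        openOutputs.add(name);
        files.put(name, new byte[0]);
    }

    // Models flushing and closing an output: the data becomes visible here.
    void closeOutput(String name, byte[] data) {
        files.put(name, data);
        openOutputs.remove(name);
    }

    // The rename is the commit point, so the transaction is committed
    // immediately after it. Assumes all outputs were closed beforehand.
    void renameFile(String from, String to) {
        if (!openOutputs.isEmpty()) {
            throw new IllegalStateException("outputs still open: " + openOutputs);
        }
        byte[] data = files.remove(from);
        if (data == null) {
            throw new IllegalArgumentException("no such file: " + from);
        }
        files.put(to, data);
        txn.commit();
    }

    boolean fileExists(String name) {
        return files.containsKey(name);
    }
}
```

With this shape, a caller never has to ask for a flush explicitly: closing the 
outputs pushes the data down, and the rename that follows triggers the commit.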

Doug

Re: DbDirectory and compound files

Posted by Andi Vajda <an...@osafoundation.org>.

> The purpose of the compound file implementation is to minimize the number of 
> open files that an IndexReader must keep open.  Instead of 7 + the number of 
> indexed fields files per segment, only a single file must be kept open per 
> segment.  This helps applications which keep lots of unoptimized indexes 
> open.  (It also, and this is more common, helps folks who open a new 
> IndexReader for each query and don't close it.  In this case, opening fewer 
> files gives the garbage collector time to close files before the process runs 
> into its file descriptor limit, inducing a flurry of bug reports about "too 
> many open files".)
>
> Does that make any more sense?

Yes, thanks for the explanation. This confirms that the compound file 
implementation is not that useful in conjunction with the 
DbDirectory implementation, since the only open OS files are the ones opened 
by Berkeley DB, i.e., the two db files plus some log files if transactions 
are used. The number of open OS files is more or less constant, is controlled 
by the Berkeley DB environment, and is independent of the number of 
IndexWriter instances open.
This thinking would also apply to RAMDirectory. No files are open at all in 
that case, right?

> These changes are back-compatible: the old classes and methods are still 
> there and interoperate with the new but are deprecated.  You might wait until 
> there is a Lucene release with the new API in it before you update 
> DbDirectory.  To move to the new API, all that should be required is changing 
> your subclass of InputStream to instead subclass BufferedIndexInput, and 
> changing your subclass of OutputStream to instead subclass BufferedIndexOutput. 
> You'll also need to add a length() method to your BufferedIndexInput 
> subclass, instead of setting a protected length field in the constructor. 
> That's it.

Cool, that should be easy enough.

> The revision of the API was primarily to make buffering optional.  We could 
> have left the buffered implementation names the same, but then the classes 
> would be named poorly and it also seemed like an opportunity to remove the 
> name clash with java.io.

This point about buffering brings up another point. Currently, there is no 
public way to tell the open IndexWriter to flush its Directory. This makes it 
difficult to use several transactions during the lifetime of the IndexWriter.
For example, it would be good if, after each indexing operation, the 
Berkeley DB transaction could be committed. For that to work though, the 
DbDirectory buffers have to be flushed first. There is no public API available 
at the moment to tell the IndexWriter to make this happen.
It seems that you're saying that this situation is improved with the new 
index IO classes since buffering was made optional ?

Andi..

Re: DbDirectory and compound files

Posted by Doug Cutting <cu...@apache.org>.

Andi Vajda wrote:
> You ask if this makes sense. No, not really. I don't know the details of 
> the purpose of the compound file implementation so this may be my problem.

The purpose of the compound file implementation is to minimize the 
number of open files that an IndexReader must keep open.  Instead of 7 + 
the number of indexed fields files per segment, only a single file must 
be kept open per segment.  This helps applications which keep lots of 
unoptimized indexes open.  (It also, and this is more common, helps 
folks who open a new IndexReader for each query and don't close it.  In 
this case, opening fewer files gives the garbage collector time to close 
files before the process runs into its file descriptor limit, inducing a 
flurry of bug reports about "too many open files".)
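
To put rough numbers on this, here is a small sketch that uses only the counts 
given above (7 plus one file per indexed field, versus one compound file per 
segment). The class and method names are made up for illustration and are not 
part of Lucene.

```java
// Hypothetical helper; only encodes the per-segment counts stated above.
class OpenFileEstimate {
    // Files an IndexReader keeps open for one segment.
    static int perSegment(int indexedFields, boolean compound) {
        return compound ? 1 : 7 + indexedFields;
    }

    // Files kept open across all segments of an index.
    static int total(int segments, int indexedFields, boolean compound) {
        return segments * perSegment(indexedFields, compound);
    }
}
```

For an unoptimized index with 10 segments and 5 indexed fields, 
OpenFileEstimate.total(10, 5, false) gives 120 open files, while 
OpenFileEstimate.total(10, 5, true) gives 10.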

Does that make any more sense?

> However, from earlier posts of yours, it seems that the Directory 
> implementation classes such as OutputStream et al are being deprecated 
> and replaced by others, so it may very well be that DbDirectory needs to 
> be rewritten when these changes are finalized.

These changes are back-compatible: the old classes and methods are still 
there and interoperate with the new but are deprecated.  You might wait 
until there is a Lucene release with the new API in it before you update 
DbDirectory.  To move to the new API, all that should be required is 
changing your subclass of InputStream to instead subclass 
BufferedIndexInput, and changing your subclass of OutputStream to 
instead subclass BufferedIndexOutput.  You'll also need to add a 
length() method to your BufferedIndexInput subclass, instead of setting 
a protected length field in the constructor.  That's it.

The revision of the API was primarily to make buffering optional.  We 
could have left the buffered implementation names the same, but then the 
classes would be named poorly and it also seemed like an opportunity to 
remove the name clash with java.io.

Doug

Re: DbDirectory and compound files

Posted by Andi Vajda <an...@osafoundation.org>.

You ask if this makes sense. No, not really. I don't know the details of the 
purpose of the compound file implementation so this may be my problem.

I understand the notion of a compound file as a file that contains other 
files, a file as a directory in a sense. In that sense, DbDirectory is 
implemented with two files (two dbs): one containing the file names and one 
containing the files' data. In that context, compound files don't seem to 
add much.

But your reply seems to allude to other purposes, like 'combining the files 
of a segment' that I don't understand since I don't know anything about the 
low-level implementation of Lucene indexes. I think I should learn more about 
that before continuing this conversation.

However, from earlier posts of yours, it seems that the Directory 
implementation classes such as OutputStream et al are being deprecated and 
replaced by others, so it may very well be that DbDirectory needs to be 
rewritten when these changes are finalized. Until then, there is probably little 
point in spending more time on the compound-file-with-DbDirectory issue, which 
can be worked around by not using compound files for the time being.

Andi..

On Wed, 29 Sep 2004, Doug Cutting wrote:

> Andi Vajda wrote:
>> So, my question: why is the compound file storage implemented in such an 
>> orthogonal to Directory way instead of just being another Directory 
>> implementation called FSCompoundFileDirectory ?
>
> To combine the files of a segment we need to know when the segment was 
> complete.  So a method would need to be added to Directory to instruct it 
> when to combine files.  And then the Directory would need to be able to 
> locate files within the combined file in order to open them.
>
> It would be a shame to re-invent this logic for each Directory 
> implementation, so the indexing code has a generic implementation layered on 
> top of Directory.  Does that make sense?
>
> Doug

Re: DbDirectory and compound files

Posted by Doug Cutting <cu...@apache.org>.

Andi Vajda wrote:
> So, my question: why is the compound file storage implemented in such an 
> orthogonal to Directory way instead of just being another Directory 
> implementation called FSCompoundFileDirectory ?

To combine the files of a segment we need to know when the segment was 
complete.  So a method would need to be added to Directory to instruct 
it when to combine files.  And then the Directory would need to be able 
to locate files within the combined file in order to open them.

It would be a shame to re-invent this logic for each Directory 
implementation, so the indexing code has a generic implementation 
layered on top of Directory.  Does that make sense?
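
As a rough illustration of that layering, here is a sketch of a combiner that 
only talks to a minimal storage interface, so the same logic works over any 
implementation of it. SketchStore, MapStore and CompoundSketch are hypothetical 
names, and the layout (an entry count, then name/length entries, then the data) 
is invented for the example; it is not Lucene's actual compound file format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal storage interface standing in for Directory.
interface SketchStore {
    void write(String name, byte[] data);
    byte[] read(String name);
    void delete(String name);
}

// Simplest possible implementation: files held in a map.
class MapStore implements SketchStore {
    private final Map<String, byte[]> files = new HashMap<>();
    public void write(String name, byte[] data) { files.put(name, data); }
    public byte[] read(String name) {
        byte[] data = files.get(name);
        if (data == null) throw new IllegalArgumentException("no such file: " + name);
        return data;
    }
    public void delete(String name) { files.remove(name); }
    int fileCount() { return files.size(); }
}

class CompoundSketch {
    // Called once the segment is complete: pack the named files into a
    // single file and remove the originals.
    static void combine(SketchStore store, String compoundName, List<String> names) {
        List<byte[]> nameBytes = new ArrayList<>();
        List<byte[]> contents = new ArrayList<>();
        int size = 4; // room for the entry count
        for (String name : names) {
            byte[] nb = name.getBytes(StandardCharsets.UTF_8);
            byte[] data = store.read(name);
            nameBytes.add(nb);
            contents.add(data);
            size += 4 + nb.length + 4 + data.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(names.size());
        for (int i = 0; i < names.size(); i++) {
            buf.putInt(nameBytes.get(i).length);
            buf.put(nameBytes.get(i));
            buf.putInt(contents.get(i).length);
        }
        for (byte[] data : contents) buf.put(data);
        store.write(compoundName, buf.array());
        for (String name : names) store.delete(name);
    }

    // Locate one file inside the combined file and return its bytes.
    static byte[] open(SketchStore store, String compoundName, String name) {
        byte[] all = store.read(compoundName);
        ByteBuffer buf = ByteBuffer.wrap(all);
        int count = buf.getInt();
        String[] names = new String[count];
        int[] lengths = new int[count];
        for (int i = 0; i < count; i++) {
            byte[] nb = new byte[buf.getInt()];
            buf.get(nb);
            names[i] = new String(nb, StandardCharsets.UTF_8);
            lengths[i] = buf.getInt();
        }
        int offset = buf.position(); // data starts right after the entries
        for (int i = 0; i < count; i++) {
            if (names[i].equals(name)) {
                return Arrays.copyOfRange(all, offset, offset + lengths[i]);
            }
            offset += lengths[i];
        }
        throw new IllegalArgumentException("not in compound file: " + name);
    }
}
```

Because CompoundSketch only uses write, read and delete, the same code would 
run unchanged over a file-system, RAM or database-backed store.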

Doug
