Posted to dev@lucene.apache.org by Andi Vajda <an...@osafoundation.org> on 2004/09/25 21:13:23 UTC
DbDirectory and compound files
The compound file storage implementation that became the default with Lucene
1.4 seems to not work so well with the Berkeley DB-based implementation of
org.apache.lucene.store.Directory I submitted to the sandbox about 9 months
ago. Even though this implementation, org.apache.lucene.store.db.DbDirectory,
and its related classes still conform to the interface defined by Directory
and its related classes, there seem to be some apparently random failures
that I think are related to implied flush semantics.
I haven't looked into this any further yet as I'm wondering what sense using
this compound file storage feature makes when using Berkeley DB storage since,
in a way, Berkeley DB files are compound files themselves.
The obvious workaround is to call IndexWriter.setUseCompoundFile(false), but
this workaround needs to be documented to be effective.
So, my question: why is the compound file storage implemented in such an
orthogonal to Directory way instead of just being another Directory
implementation called FSCompoundFileDirectory ?
Andi..
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: DbDirectory and compound files
Posted by Doug Cutting <cu...@apache.org>.
Andi Vajda wrote:
> This point about buffering brings up another point. Currently, there is
> no public way to tell the open IndexWriter to flush its Directory. This
> makes it difficult to use several transactions during the lifetime of
> the IndexWriter.
The only thing in the Directory API that is required to be atomic, and
that is used to implement commit, is renameFile(). This is called when
all IndexOutputs have been closed. Immediately after the rename is
completed is the appropriate place to commit a transaction. Perhaps
this should be more explicit in the API...
Doug
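A minimal sketch of the commit point described above, assuming a store that wraps its own transaction. TxnDirectory and Txn are hypothetical stand-ins written for this illustration; they are not Lucene's Directory or any Berkeley DB class, and the file names are only illustrative. The sketch shows one thing: the rename is performed first, and the transaction is committed immediately afterwards.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RenameCommitSketch {

    /** Hypothetical stand-in for a store transaction (for example a database txn). */
    interface Txn {
        void commit();
    }

    /** Hypothetical Directory-like store; not Lucene's Directory class. */
    static final class TxnDirectory {
        private final Map<String, byte[]> files = new HashMap<>();
        private final Txn txn;

        TxnDirectory(Txn txn) {
            this.txn = txn;
        }

        /** Stores a complete file; stands in for writing and closing an output. */
        synchronized void writeFile(String name, byte[] data) {
            files.put(name, data.clone());
        }

        /** Renames under a lock, then commits right after the rename completes. */
        synchronized void renameFile(String from, String to) {
            byte[] data = files.remove(from);
            if (data == null) {
                throw new IllegalArgumentException("no such file: " + from);
            }
            files.put(to, data);
            txn.commit();
        }

        synchronized boolean fileExists(String name) {
            return files.containsKey(name);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        TxnDirectory dir = new TxnDirectory(() -> log.add("commit"));
        dir.writeFile("segments.new", new byte[] {1, 2, 3});
        dir.renameFile("segments.new", "segments");
        System.out.println(log + " " + dir.fileExists("segments"));
    }
}
```

In this sketch nothing is committed while files are merely being written; the commit happens once per rename, which matches the idea of one transaction per index commit.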
Re: DbDirectory and compound files
Posted by Andi Vajda <an...@osafoundation.org>.
> The purpose of the compound file implementation is to minimize the number of
> open files that an IndexReader must keep open. Instead of 7 + the number of
> indexed fields files per segment, only a single file must be kept open per
> segment. This helps applications which keep lots of unoptimized indexes
> open. (It also, and this is more common, helps folks who open a new
> IndexReader for each query and don't close it. In this case, opening fewer
> files gives the garbage collector time to close files before the process runs
> into its file descriptor limit, inducing a flurry of bug reports about "too
> many open files".)
>
> Does that make any more sense?
Yes, thanks for the explanation. This confirms that the compound file
implementation is not that useful when used in conjunction with the
DbDirectory implementation since the only open OS files are the ones opened
by Berkeley DB, i.e., the two db files plus some log files if transactions
are used. The number of OS files open is more or less constant, is controlled
by the Berkeley DB environment and is independent of the number of
IndexWriter instances open.
This thinking would also apply to RAMDirectory. No files are open at all in
that case, right ?
> These changes are back-compatible: the old classes and methods are still
> there and interoperate with the new but are deprecated. You might wait until
> there is a Lucene release with the new API in it before you update
> DbDirectory. To move to the new API, all that should be required is changing
> your subclass of InputStream to instead subclass BufferedIndexInput, and also
> changing your subclass of OutputStream to instead subclass BufferedIndexOutput.
> You'll also need to add a length() method to your BufferedIndexInput
> subclass, instead of setting a protected length field in the constructor.
> That's it.
Cool, that should be easy enough.
> The revision of the API was primarily to make buffering optional. We could
> have left the buffered implementation names the same, but then the classes
> would be named poorly and it also seemed like an opportunity to remove the
> name clash with java.io.
This point about buffering brings up another point. Currently, there is no
public way to tell the open IndexWriter to flush its Directory. This makes it
difficult to use several transactions during the lifetime of the IndexWriter.
For example, it would be good if after each indexing operation, the
Berkeley DB transaction could be committed. For that to work though, the
DbDirectory buffers have to be flushed first. There is no public API available
at the moment to tell the IndexWriter to make this happen.
It seems that you're saying that this situation is improved with the new
index IO classes since buffering was made optional ?
Andi..
Re: DbDirectory and compound files
Posted by Doug Cutting <cu...@apache.org>.
Andi Vajda wrote:
> You ask if this makes sense. No, not really. I don't know the details of
> the purpose of the compound file implementation so this may be my problem.
The purpose of the compound file implementation is to minimize the
number of open files that an IndexReader must keep open. Instead of 7 +
the number of indexed fields files per segment, only a single file must
be kept open per segment. This helps applications which keep lots of
unoptimized indexes open. (It also, and this is more common, helps
folks who open a new IndexReader for each query and don't close it. In
this case, opening fewer files gives the garbage collector time to close
files before the process runs into its file descriptor limit, inducing a
flurry of bug reports about "too many open files".)
Does that make any more sense?
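As a rough illustration of the arithmetic above, the following sketch compares the two per-segment counts. It only models the figures quoted in this message (7 files plus one per indexed field, versus a single compound file); the class and method names are made up for the example, and index-wide files are ignored.

```java
final class OpenFileEstimate {

    /** Multi-file format: 7 files plus one per indexed field, for each segment. */
    static int multiFile(int segments, int indexedFields) {
        return segments * (7 + indexedFields);
    }

    /** Compound format: a single file per segment. */
    static int compound(int segments) {
        return segments;
    }

    public static void main(String[] args) {
        int segments = 10;
        int indexedFields = 5;
        System.out.println("multi-file: " + multiFile(segments, indexedFields)); // 120
        System.out.println("compound:   " + compound(segments));                 // 10
    }
}
```

With 10 segments and 5 indexed fields, one reader holds 120 files open in the multi-file case and 10 in the compound case, so unclosed readers reach a file descriptor limit much sooner without compound files.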
> However, from earlier posts of yours, it seems that the Directory
> implementation classes such as OutputStream et al are being deprecated
> and replaced by others, so it may very well be that DbDirectory needs to
> be rewritten when these changes are finalized.
These changes are back-compatible: the old classes and methods are still
there and interoperate with the new but are deprecated. You might wait
until there is a Lucene release with the new API in it before you update
DbDirectory. To move to the new API, all that should be required is
changing your subclass of InputStream to instead subclass
BufferedIndexInput, and also changing your subclass of OutputStream to
instead subclass BufferedIndexOutput. You'll also need to add a
length() method to your BufferedIndexInput subclass, instead of setting
a protected length field in the constructor. That's it.
The revision of the API was primarily to make buffering optional. We
could have left the buffered implementation names the same, but then the
classes would be named poorly and it also seemed like an opportunity to
remove the name clash with java.io.
Doug
Re: DbDirectory and compound files
Posted by Andi Vajda <an...@osafoundation.org>.
You ask if this makes sense. No, not really. I don't know the details of the
purpose of the compound file implementation so this may be my problem.
I understand the notion of a compound file as a file that contains other
files, a file as a directory in a sense. In that sense, the DbDirectory
implementation is implemented with two files - two dbs - the file containing
the file names and the file containing the data files and it seems that in
that context, compound files don't add much.
But your reply seems to allude to other purposes, like 'combining the files
of a segment' that I don't understand since I don't know anything about the
low-level implementation of Lucene indexes. I think I should learn more about
that before continuing this conversation.
However, from earlier posts of yours, it seems that the Directory
implementation classes such as OutputStream et al are being deprecated and
replaced by others, so it may very well be that DbDirectory needs to be
rewritten when these changes are finalized. Until then, it is probably moot to
spend more time on the Compound-File-with-DbDirectory issue which can be
worked around by not using them for the time being.
Andi..
On Wed, 29 Sep 2004, Doug Cutting wrote:
> Andi Vajda wrote:
>> So, my question: why is the compound file storage implemented in such an
>> orthogonal to Directory way instead of just being another Directory
>> implementation called FSCompoundFileDirectory ?
>
> To combine the files of a segment we need to know when the segment was
> complete. So a method would need to be added to Directory to instruct it
> when to combine files. And then the Directory would need to be able to
> locate files within the combined file in order to open them.
>
> It would be a shame to re-invent this logic for each Directory
> implementation, so the indexing code has a generic implementation layered on
> top of Directory. Does that make sense?
>
> Doug
Re: DbDirectory and compound files
Posted by Doug Cutting <cu...@apache.org>.
Andi Vajda wrote:
> So, my question: why is the compound file storage implemented in such an
> orthogonal to Directory way instead of just being another Directory
> implementation called FSCompoundFileDirectory ?
To combine the files of a segment we need to know when the segment was
complete. So a method would need to be added to Directory to instruct
it when to combine files. And then the Directory would need to be able
to locate files within the combined file in order to open them.
It would be a shame to re-invent this logic for each Directory
implementation, so the indexing code has a generic implementation
layered on top of Directory. Does that make sense?
Doug
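The layering argument above can be sketched as follows. This is not Lucene's compound file code: Store is a hypothetical minimal storage interface standing in for any Directory implementation, the file names are only illustrative, and the offset table is kept in memory here, whereas a real format would have to persist it inside the combined file. The point of the sketch is that combine() and open() are written once against the interface and work with any backing store.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CompoundLayerSketch {

    /** Hypothetical minimal storage interface; stands in for any Directory. */
    interface Store {
        void write(String name, byte[] data);
        byte[] read(String name);
        void delete(String name);
    }

    /** In-memory Store used only for this demonstration. */
    static final class MemoryStore implements Store {
        private final Map<String, byte[]> files = new LinkedHashMap<>();

        public void write(String name, byte[] data) { files.put(name, data.clone()); }
        public byte[] read(String name) { return files.get(name).clone(); }
        public void delete(String name) { files.remove(name); }
        int fileCount() { return files.size(); }
    }

    /**
     * Called once the segment is complete: concatenates the parts into one
     * file and returns a table mapping each part name to {offset, length}.
     */
    static Map<String, int[]> combine(Store store, String target, List<String> parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Map<String, int[]> table = new LinkedHashMap<>();
        for (String part : parts) {
            byte[] data = store.read(part);
            table.put(part, new int[] {out.size(), data.length});
            out.write(data, 0, data.length);
        }
        store.write(target, out.toByteArray());
        for (String part : parts) {
            store.delete(part); // remove the originals only after the target exists
        }
        return table;
    }

    /** Locates a part inside the combined file and returns its bytes. */
    static byte[] open(Store store, String target, Map<String, int[]> table, String part) {
        int[] entry = table.get(part);
        byte[] all = store.read(target);
        return Arrays.copyOfRange(all, entry[0], entry[0] + entry[1]);
    }

    public static void main(String[] args) {
        MemoryStore store = new MemoryStore();
        store.write("_1.fnm", new byte[] {1, 2});
        store.write("_1.frq", new byte[] {3, 4, 5});
        Map<String, int[]> table =
            combine(store, "_1.cfs", Arrays.asList("_1.fnm", "_1.frq"));
        System.out.println(store.fileCount() + " file(s) left");
        System.out.println(Arrays.toString(open(store, "_1.cfs", table, "_1.frq")));
    }
}
```

Because the combining logic only calls write, read and delete, swapping MemoryStore for a file-system or database-backed store would not require changing it.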