Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2007/01/09 12:41:27 UTC
[jira] Created: (LUCENE-767) maxDoc should be explicitly stored in
the index, not derived from file length
maxDoc should be explicitly stored in the index, not derived from file length
-----------------------------------------------------------------------------
Key: LUCENE-767
URL: https://issues.apache.org/jira/browse/LUCENE-767
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 2.0.0, 1.9, 2.0.1, 2.1
Reporter: Michael McCandless
Assigned To: Michael McCandless
Priority: Minor
This is a spinoff of LUCENE-140
In general we should rely on "as little as possible" from the file system. Right now, maxDoc is derived by checking the file length of the FieldsReader index file (.fdx) which makes me nervous. I think we should explicitly store it instead.
Note that there are no known cases where this is actually causing a problem. There was some speculation in the discussion of LUCENE-140 that it could be one of the possible causes, but in digging / discussion no specifically relevant JVM bugs were found (yet!). So this would be a defensive fix at this point.
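For illustration only, here is a minimal sketch of what "derived from file length" amounts to. It assumes the .fdx file holds one fixed-width 8-byte pointer per document; the class name, the constant and the stale-length scenario are hypothetical and are not taken from the Lucene sources.

```java
// Sketch only: derives a document count from the length of a fixed-width
// index file. The 8-byte entry width is an assumption for this example.
final class FdxLengthSketch {
    static final int BYTES_PER_ENTRY = 8; // assumed: one long pointer per document

    /** Derives the document count purely from the reported file length. */
    static int maxDocFromLength(long fdxLengthInBytes) {
        return (int) (fdxLengthInBytes / BYTES_PER_ENTRY);
    }

    public static void main(String[] args) {
        long trueLength = 100L * BYTES_PER_ENTRY;  // 100 documents were written
        long staleLength = 90L * BYTES_PER_ENTRY;  // hypothetical stale length report
        System.out.println(maxDocFromLength(trueLength));
        System.out.println(maxDocFromLength(staleLength));
    }
}
```

The point of the sketch is that the result is only as trustworthy as the length the file system reports: a wrong length silently yields a wrong document count.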
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: [jira] Created: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length
Posted by Grant Ingersoll <gs...@apache.org>.
Hi Michael,
Can you explain in more detail on this bug why this makes you nervous?
Thanks,
Grant
On Jan 9, 2007, at 6:41 AM, Michael McCandless (JIRA) wrote:
> maxDoc should be explicitly stored in the index, not derived from
> file length
> ----------------------------------------------------------------------
> -------
>
> Key: LUCENE-767
> URL: https://issues.apache.org/jira/browse/LUCENE-767
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 2.0.0, 1.9, 2.0.1, 2.1
> Reporter: Michael McCandless
> Assigned To: Michael McCandless
> Priority: Minor
>
>
> This is a spinoff of LUCENE-140
>
> In general we should rely on "as little as possible" from the file
> system. Right now, maxDoc is derived by checking the file length
> of the FieldsReader index file (.fdx) which makes me nervous. I
> think we should explicitly store it instead.
>
> Note that there are no known cases where this is actually causing a
> problem. There was some speculation in the discussion of LUCENE-140
> that it could be one of the possible, but in digging / discussion
> there were no specifically relevant JVM bugs found (yet!). So this
> would be a defensive fix at this point.
>
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
[jira] Resolved: (LUCENE-767) maxDoc should be explicitly stored in
the index, not derived from file length
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless resolved LUCENE-767.
---------------------------------------
Resolution: Fixed
Fix Version/s: 2.1
[jira] Commented: (LUCENE-767) maxDoc should be explicitly stored
in the index, not derived from file length
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463335 ]
Michael McCandless commented on LUCENE-767:
-------------------------------------------
Ooh that's great! I think your logic is correct.
But I do see one unit test failing when I make that change locally (testIndexAndMerge in src/test/org/apache/lucene/index/TestDoc.java). Actually, this unit test only fails with my last commit (yesterday) for LUCENE-140, because I made the check for "docs out of order" stricter (it now catches a previously missed boundary case), and this test seems to hit that boundary case.
However, that test is buggy because it manually creates SegmentInfos with an incorrect docCount. So I will fix the test, and commit your solution above. Thanks!
[jira] Commented: (LUCENE-767) maxDoc should be explicitly stored
in the index, not derived from file length
Posted by "Chuck Williams (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463322 ]
Chuck Williams commented on LUCENE-767:
---------------------------------------
Isn't maxDoc always the same as the docCount of the segment, which is stored? I.e., couldn't SegmentReader.maxDoc() be equivalently defined as:
    public int maxDoc() {
        return si.docCount;
    }
Since maxDoc==numDocs==docCount for a newly merged segment, and deletion with a reader never changes numDocs or maxDoc, it seems to me these values should always be the same.
All Lucene tests pass with this definition. I have code that relies on this equivalence and so would appreciate knowledge of any case where this equivalence might not hold.
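A toy model may make the proposed equivalence easier to see. This is only a sketch under stated assumptions: ToySegment, its BitSet of deletions and the liveDocs() helper are hypothetical names for this example, not Lucene classes. It shows a stored docCount backing maxDoc(), with deletions tracked separately so that they never change maxDoc().

```java
import java.util.BitSet;

// Sketch only: a toy segment whose maxDoc() is the stored document count.
final class ToySegment {
    private final int docCount;   // written once, when the segment is created
    private final BitSet deleted; // marks deleted document ids

    ToySegment(int docCount) {
        this.docCount = docCount;
        this.deleted = new BitSet(docCount);
    }

    /** Returns the stored count; no file length is consulted. */
    int maxDoc() {
        return docCount;
    }

    /** Marks a document as deleted without touching docCount. */
    void delete(int docId) {
        if (docId < 0 || docId >= docCount) {
            throw new IllegalArgumentException("docId out of range: " + docId);
        }
        deleted.set(docId);
    }

    /** Documents not marked deleted (a helper for this sketch only). */
    int liveDocs() {
        return docCount - deleted.cardinality();
    }
}
```

In this model, deleting a document lowers liveDocs() but leaves maxDoc() equal to docCount for the lifetime of the segment.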
Re: [jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length
Posted by robert engels <re...@ix.netcom.com>.
I think this is the relevant section:
A8. What is close-to-open cache consistency?
A. Perfect cache coherency among disparate NFS clients is very
expensive to achieve, so NFS settles for something weaker that
satisfies the requirements of most everyday types of file sharing.
Everyday file sharing is most often completely sequential: first
client A opens a file, writes something to it, then closes it; then
client B opens the same file, and reads the changes.
So, when an application opens a file stored in NFS, the NFS
client checks that it still exists on the server, and is permitted to
the opener, by sending a GETATTR or ACCESS operation. When the
application closes the file, the NFS client writes back any pending
changes to the file so that the next opener can view the changes.
This also gives the NFS client an opportunity to report any server
write errors to the application via the return code from close().
This behavior is referred to as close-to-open cache consistency.
Linux implements close-to-open cache consistency by comparing
the results of a GETATTR operation done just after the file is closed
to the results of a GETATTR operation done when the file is next
opened. If the results are the same, the client will assume its data
cache is still valid; otherwise, the cache is purged.
Close-to-open cache consistency was introduced to the Linux NFS
client in 2.4.20. If for some reason you have applications that
depend on the old behavior, you can disable close-to-open support by
using the "nocto" mount option.
There are still opportunities for a client's data cache to
contain stale data. The NFS version 3 protocol introduced "weak cache
consistency" (also known as WCC) which provides a way of checking a
file's attributes before and after an operation to allow a client to
identify changes that could have been made by other clients.
Unfortunately when a client is using many concurrent operations that
update the same file at the same time, it is impossible to tell
whether it was that client's updates or some other client's updates
that changed the file.
For this reason, some versions of the Linux 2.6 NFS client
abandon WCC checking entirely, and simply trust their own data cache.
On these versions, the client can maintain a cache full of stale file
data if a file is opened for write. In this case, using file locking
is the best way to ensure that all clients see the latest version of
a file's data.
A system administrator can try using the "noac" mount option to
achieve attribute cache coherency among multiple clients. Almost
every client operation checks file attribute information. Usually the
client keeps this information cached for a period of time to reduce
network and server load. When "noac" is in effect, a client's file
attribute cache is disabled, so each operation that needs to check a
file's attributes is forced to go back to the server. This permits a
client to see changes to a file very quickly, at the cost of many
extra network operations.
Be careful not to confuse "noac" with "no data caching." The
"noac" mount option will keep file attributes up-to-date with the
server, but there are still races that may result in data incoherency
between client and server. If you need absolute cache coherency among
clients, applications can use file locking, where a client purges
file data when a file is locked, and flushes changes back to the
server before unlocking a file; or applications can open their files
with the O_DIRECT flag to disable data caching entirely.
For a better understanding of the compromises faced in the
design of NFS caching, see Callaghan's "NFS Illustrated."
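The FAQ text above recommends file locking when clients need to see current data. As a rough sketch of what that could look like from Java, the following reads a file's size while holding an exclusive lock obtained through java.nio. The class and method names are made up for this example, and whether taking the lock actually causes an NFS client to purge its caches depends on the operating system and mount options; that part is an assumption, not something this code can guarantee.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch only: query a file's size while holding an exclusive lock on it.
final class LockedRead {
    /** Returns the channel's size, read while the lock is held. */
    static long sizeUnderLock(Path file) throws IOException {
        // An exclusive lock requires a channel that is open for writing.
        try (FileChannel ch = FileChannel.open(file,
                 StandardOpenOption.READ, StandardOpenOption.WRITE);
             FileLock lock = ch.lock()) {
            return ch.size();
        }
    }

    /** Small self-contained demo on a temporary 16-byte file. */
    static long demo() {
        try {
            Path p = Files.createTempFile("locked-read", ".bin");
            try {
                Files.write(p, new byte[16]);
                return sizeUnderLock(p);
            } finally {
                Files.deleteIfExists(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On a local file system the lock changes nothing about the answer; the sketch only shows the shape of the call sequence.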
On Jan 9, 2007, at 12:25 PM, Michael McCandless (JIRA) wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463358 ]
>
> Michael McCandless commented on LUCENE-767:
> -------------------------------------------
>
>
> Carrying over from the java-dev list:
>
>
> Grant Ingersoll wrote:
>
>> Can you explain in more detail on this bug why this makes you
>> nervous?
>
> Well ... the only specific example I have is NFS (always my favorite
> example!).
>
> As I understand it, the NFS client typically uses a separate cache to
> hold the "attributes" of the file, including file length. This cache
> often has weaker or maybe just "different" guarantees than the "data
> cache" that holds the file contents. So basically you can ask what
> the file length is and get a wrong (stale) answer. EG see
> http://nfs.sourceforge.net, which describes Linux's NFS client
> approach. The NFS client on Apple's OS X seems to be even worse!
>
> I think very likely Lucene may not trip up on this specifically since
> a reader would only ask for this file's length for the first time once
> the file is done being written (ie the commit of segments_N has
> occurred) and so hopefully it's not in the attribute cache yet?
>
> I think there may very well be cases of other filesystems where
> "checking file length" is risky (that we all just don't know about
> (yet!)), which is why I favor using explicit values instead of relying
> on file system semantics, whenever possible.
>
> Maybe I'm just too paranoid :)
>
> But for all the places / devices Lucene has gone and will go, relying
> on the bare minimum set of IO operations I think will maximize our
> overall portability. Every filesystem has its quirks.
Re: [jira] Commented: (LUCENE-767) maxDoc should be explicitly stored
in the index, not derived from file length
Posted by Michael McCandless <lu...@mikemccandless.com>.
robert engels wrote:
> It would appear that NFS Version 2 is not suitable for Lucene. NFS
> Version 3 looks like it should work. See
> http://nfs.sourceforge.net/#section_a
>
> I will take this opportunity to state again what I've always been told,
> and it seems to hold up: using NFS for shared, interactively updated
> files is always going to be troublesome. They have patched it over the
> years to help, but it just wasn't designed for this from the beginning.
>
> Unix systems never even had file system locks. It was assumed that
> shared access to shared data would be accomplished via a shared server -
> not by sharing access to the data directly. It is far more efficient and
> robust to do things this way.
>
> Modifying a shared Lucene directory via NFS directly is always going to
> be error prone.
>
> Why not just implement a server/parallel index solution?
Actually I think now (with lockless commits) Lucene works fine over
NFS, except for the [yes, rather big] remaining issue: LUCENE-710.
But that issue, while clearly scary when you first see it, can be
easily worked around (just refresh your searchers once they hit "Stale
NFS handle").
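As a rough sketch of that workaround: wrap the search call, and when it fails with an IOException that looks like a stale handle, refresh and try again. Everything named here (IoCall, callWithRefresh, looksStale) is hypothetical, and matching on the exception message is an assumption, since the exact text is platform-dependent.

```java
import java.io.IOException;

// Sketch only: retry an I/O call after refreshing when a stale handle is seen.
final class StaleHandleRetry {
    /** A step that may fail with an IOException (hypothetical interface). */
    interface IoCall<T> {
        T run() throws IOException;
    }

    /** Heuristic check; the message text is an assumption and varies by platform. */
    static boolean looksStale(IOException e) {
        String msg = e.getMessage();
        return msg != null && msg.contains("Stale NFS");
    }

    /** Runs the call, refreshing and retrying up to maxRetries times on stale errors. */
    static <T> T callWithRefresh(IoCall<T> call, Runnable refresh, int maxRetries)
            throws IOException {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.run();
            } catch (IOException e) {
                if (!looksStale(e) || attempt >= maxRetries) {
                    throw e; // unrelated failure, or out of retries
                }
                refresh.run(); // e.g. reopen the searcher on the current commit
            }
        }
    }

    /** Simulates one stale failure followed by success; returns the refresh count. */
    static int demo() {
        int[] refreshes = {0};
        try {
            String result = callWithRefresh(
                () -> {
                    if (refreshes[0] == 0) {
                        throw new IOException("Stale NFS file handle");
                    }
                    return "ok";
                },
                () -> refreshes[0]++,
                3);
            return "ok".equals(result) ? refreshes[0] : -1;
        } catch (IOException e) {
            return -1;
        }
    }
}
```

Bounding the retries keeps a persistently broken mount from looping forever, and errors that do not look stale are rethrown unchanged.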
Even once we resolve that and Lucene works over NFS, I do think the
performance will typically not be "stellar". At least in my
experience the performance of NFS is surprisingly poor. So I do think
for users that require high performance a replicated (like Solr)
and/or distributed index solution is probably the way to go.
Anyway, I didn't mean to turn this back into an NFS discussion. I
just wanted to use NFS as an example of where relying on file length
for something important (maxDoc() in a segment) is possibly
dangerous.
Mike
Re: [jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length
Posted by robert engels <re...@ix.netcom.com>.
It would appear that NFS Version 2 is not suitable for Lucene. NFS
Version 3 looks like it should work. See
http://nfs.sourceforge.net/#section_a
I will take this opportunity to state again what I've always been
told, and it seems to hold up: using NFS for shared, interactively
updated files is always going to be troublesome. They have patched it
over the years to help, but it just wasn't designed for this from the
beginning.
Unix systems never even had file system locks. It was assumed that
shared access to shared data would be accomplished via a shared
server - not by sharing access to the data directly. It is far more
efficient and robust to do things this way.
Modifying a shared Lucene directory via NFS directly is always going
to be error prone.
Why not just implement a server/parallel index solution?
On Jan 9, 2007, at 12:25 PM, Michael McCandless (JIRA) wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463358 ]
>
> Michael McCandless commented on LUCENE-767:
> -------------------------------------------
>
>
> Carrying over from the java-dev list:
>
>
> Grant Ingersoll wrote:
>
>> Can you explain in more detail on this bug why this makes you
>> nervous?
>
> Well ... the only specific example I have is NFS (always my favorite
> example!).
>
> As I understand it, the NFS client typically uses a separate cache to
> hold the "attributes" of the file, including file length. This cache
> often has weaker or maybe just "different" guarantees than the "data
> cache" that holds the file contents. So basically you can ask what
> the file length is and get a wrong (stale) answer. EG see
> http://nfs.sourceforge.net, which describes Linux's NFS client
> approach. The NFS client on Apple's OS X seems to be even worse!
>
> I think very likely Lucene may not trip up on this specifically since
> a reader would only ask for this file's length for the first time once
> the file is done being written (ie the commit of segments_N has
> occurred) and so hopefully it's not in the attribute cache yet?
>
> I think there may very well be cases of other filesystems where
> "checking file length" is risky (that we all just don't know about
> (yet!)), which is why I favor using explicit values instead of relying
> on file system semantics, whenever possible.
>
> Maybe I'm just too paranoid :)
>
> But for all the places / devices Lucene has gone and will go, relying
> on the bare minimum set of IO operations I think will maximize our
> overall portability. Every filesystem has its quirks.
[jira] Commented: (LUCENE-767) maxDoc should be explicitly stored
in the index, not derived from file length
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463358 ]
Michael McCandless commented on LUCENE-767:
-------------------------------------------
Carrying over from the java-dev list:
Grant Ingersoll wrote:
> Can you explain in more detail on this bug why this makes you nervous?
Well ... the only specific example I have is NFS (always my favorite
example!).
As I understand it, the NFS client typically uses a separate cache to
hold the "attributes" of the file, including file length. This cache
often has weaker or maybe just "different" guarantees than the "data
cache" that holds the file contents. So basically you can ask what
the file length is and get a wrong (stale) answer. EG see
http://nfs.sourceforge.net, which describes Linux's NFS client
approach. The NFS client on Apple's OS X seems to be even worse!
I think Lucene very likely does not trip up on this specifically, since
a reader would only ask for this file's length for the first time once
the file is done being written (i.e., the commit of segments_N has
occurred), so hopefully it's not in the attribute cache yet?
I think there may very well be cases of other filesystems where
"checking file length" is risky (that we all just don't know about
(yet!)), which is why I favor using explicit values instead of relying
on file system semantics, whenever possible.
Maybe I'm just too paranoid :)
But for all the places / devices Lucene has gone and will go, relying
on the bare minimum set of IO operations I think will maximize our
overall portability. Every filesystem has its quirks.
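To make "explicit values instead of file system semantics" concrete, here is a small sketch in which the writer records the entry count in a header and the reader takes the count from the data itself. The layout (a 4-byte count followed by 8-byte entries) and the class name are hypothetical and chosen only for this example; this is not Lucene's on-disk format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch only: store the entry count explicitly rather than deriving it from length.
final class ExplicitCountSketch {
    /** Writes a 4-byte count header followed by one 8-byte entry per document. */
    static byte[] write(long[] pointers) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(pointers.length); // the explicit count
            for (long p : pointers) {
                out.writeLong(p);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the count from the header; the data length is never consulted. */
    static int readCount(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this kind of layout the reader depends only on reading bytes it was told to expect, which is the "bare minimum set of IO operations" idea in miniature.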