Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2009/08/13 14:50:29 UTC

SMB2 cache

Hi

Has anyone experienced any problems w/ Lucene indexes on a shared SMB2
network drive?

We've hit a scenario where it seems the FS cache refuses to check for
existence of files on the shared network drive. Specifically, we hit the
following exception:

java.io.FileNotFoundException: Z:\index\segments_p8 (The system cannot find the file specified.)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor.<init>(FSDirectory.java:552)
at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:582)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:488)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:482)
at org.apache.lucene.index.SegmentInfos$2.doBody(SegmentInfos.java:369)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:653)
at org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:366)
at org.apache.lucene.index.DirectoryIndexReader.isCurrent(DirectoryIndexReader.java:188)
at org.apache.lucene.index.MultiReader.isCurrent(MultiReader.java:352)

The environment:
* 3 Windows Server 2008 machines
** Machine A - hosts the index
** Machine B - indexes and search
** Machine C - just search
* Machines B and C map Machine A's index share as drive Z.
* The exception happens on Machine C only, i.e. on the machine that does
just 'search'.

According to my understanding, FindSegmentsFile attempts to read the latest
generation from segments.gen and from a directory listing, and if there is a
problem, it does a gen readahead until it succeeds or
defaultGenLookaheadCount is exhausted.
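To make sure I read the code right, here is roughly how I understand the
readahead (a simplified, self-contained sketch, not the actual Lucene source;
the lookahead constant, findSegments, and the pretend "filesystem" set are
all mine, for illustration only):

```java
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Set;

// Simplified illustration of the gen readahead in FindSegmentsFile:
// start from the best-known generation and probe forward until a
// segments_N "file" opens, or the lookahead budget is exhausted.
public class ReadaheadSketch {
    // Stands in for defaultGenLookaheadCount (value is illustrative).
    static final int GEN_LOOKAHEAD_COUNT = 10;

    // existingGens is a pretend filesystem: the set of generations whose
    // segments_N file actually exists. A stale cache would misreport it.
    static long findSegments(long startGen, Set<Long> existingGens)
            throws FileNotFoundException {
        for (int i = 0; i < GEN_LOOKAHEAD_COUNT; i++) {
            long gen = startGen + i;
            if (existingGens.contains(gen)) {
                return gen; // successfully "opened" segments_<gen>
            }
        }
        throw new FileNotFoundException("no segments_N found from gen " + startGen);
    }

    public static void main(String[] args) throws Exception {
        Set<Long> gens = new HashSet<>();
        gens.add(8L); // only segments_8 exists; 5..7 were deleted
        System.out.println(findSegments(5L, gens)); // prints 8
    }
}
```

In the real code each probe is an actual file open, which is exactly the
step the stale SMB2 cache seems to answer incorrectly.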

Hitting this exception, we came up with the following explanation: the FS
cache 'decides' the file does not exist, due to a stale directory cache, and
never checks whether the file actually exists on the remote machine.

Does that sound reasonable?

Some more information:
* We use Lucene 2.4.0
* Other runs are currently executing on those machines, so it will take
about a week until we can run the same scenario again. I thought perhaps we
could discuss this in the meantime.
* Unfortunately we weren't able to get infoStream output before the
machines started another run, so we hope to get it next time. In any case,
it's not easily reproduced.
* There isn't any other process touching this directory that might remove
index files.

We know the same code runs well on NFS (v4). We haven't checked yet whether
SMB 1.0 works OK. Some pointers we've found:

A known issue on MS, w/ some C++ fixes:
http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?dg=microsoft.public.win32.programmer.networks&tid=69e63e38-7d91-4306-ab6e-a615e1c6afaa&cat=en_US_bc89adf4-f184-4d3d-aaee-122567385744&lang=en&cr=US&sloc=&p=1

Info on how to disable SMB 2.0 on Windows:
http://www.petri.co.il/how-to-disable-smb-2-on-windows-vista-or-server-2008.htm

Currently, we are thinking of bypassing the problem by wrapping the calls to
isCurrent and reopen in a try-catch for FileNotFoundException and continuing
to use the reader we have at hand. Later, we will attempt isCurrent again.
Since SMB caching seems to be time-controlled, we expect the cache to be
refreshed after several seconds, and those calls will then succeed.
I wonder though whether this could leave us hitting the exception 'forever'.
E.g., imagine a system which indexes at a very high rate. Isn't it possible
that we'll hit this exception every time we call isCurrent?
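For reference, the wrapper we have in mind looks roughly like this (Reader is
a stand-in interface for IndexReader so the sketch is self-contained;
refreshIfPossible is our own name, not a Lucene API):

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Sketch of the planned workaround: if isCurrent()/reopen() hit a
// FileNotFoundException caused by a stale SMB2 directory cache, keep
// serving from the reader we already have and retry on the next cycle.
public class RefreshSketch {
    // Stand-in for the Lucene 2.4 IndexReader methods we'd actually call.
    interface Reader {
        boolean isCurrent() throws IOException;
        Reader reopen() throws IOException;
    }

    static Reader refreshIfPossible(Reader current) {
        try {
            if (!current.isCurrent()) {
                return current.reopen();
            }
        } catch (FileNotFoundException e) {
            // Stale cache: segments_N "missing". Fall through and keep the
            // existing reader; a later attempt should succeed once the
            // SMB2 cache times out.
        } catch (IOException e) {
            // Other I/O problems are real errors; rethrow in real code.
            throw new RuntimeException(e);
        }
        return current;
    }

    public static void main(String[] args) {
        Reader flaky = new Reader() {
            public boolean isCurrent() throws IOException {
                throw new FileNotFoundException("Z:\\index\\segments_p8");
            }
            public Reader reopen() { return this; }
        };
        // The FNFE is swallowed and the same reader is kept.
        System.out.println(refreshIfPossible(flaky) == flaky); // prints true
    }
}
```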

I'm not sure there is anything we can do in Lucene, besides sleeping in
FindSegmentsFile for several seconds, which is not reasonable.
Maybe a way out would be, I think, having FindSegmentsFile try to read ahead
and then backwards. At some point, we ought to find a segments file that's
readable, even if an old one, no?

Any help will be appreciated.

Shai

Re: SMB2 cache

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Aug 14, 2009 at 6:56 AM, Shai Erera<se...@gmail.com> wrote:
> Thanks Mike. If we only try to reopen after a commit happens, then it makes
> sense that the cache will expire between commits, and therefore the call
> will succeed.

Yeah, at least this is the theory :)  I'd love to hear confirmation
that it's actually true!

> How can we update the Wiki page? Is it done through issues? I don't believe
> I have access to the source, but if I can, I don't mind preparing a patch.

Just login (make yourself an account if you haven't already) and edit
it.  Anyone can edit the wiki.

I think we should remove that old FAQ (or, maybe "deprecate" it,
saying it only applies to Lucene <= 2.0).

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: SMB2 cache

Posted by Shai Erera <se...@gmail.com>.
Thanks Mike. If we only try to reopen after a commit happens, then it makes
sense that the cache will expire between commits, and therefore the call
will succeed.

How can we update the Wiki page? Is it done through issues? I don't believe
I have access to the source, but if I can, I don't mind preparing a patch.

Shai

Re: SMB2 cache

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Aug 13, 2009 at 6:03 PM, Shai Erera<se...@gmail.com> wrote:
> Also Mike - even if the writer has committed, and then I notify the other
> nodes they should refresh, it's still possible for them to hit this
> exception, right?

I'm hoping you won't hit the exception, as long as the updates are
less frequent than the cache timeout.

Ie, because the cache expiration is time based, if you haven't listed
the directory in quite a while, when you do attempt to list it, it
shouldn't be stale.

Mike



Re: SMB2 cache

Posted by Shai Erera <se...@gmail.com>.
Also Mike - even if the writer has committed, and then I notify the other
nodes they should refresh, it's still possible for them to hit this
exception, right?


Re: SMB2 cache

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Aug 13, 2009 at 6:02 PM, Shai Erera<se...@gmail.com> wrote:
> How can the writer delete all previous segments? If I have a reader open,
> doesn't it prevent those files from being deleted? That's why I count on at
> least one of those files existing. Perhaps I'm wrong though.

The segments_N file is not held open... it's opened only briefly and
then closed.  So the writer is able to delete it.

> I think we can come up w/ some notification mechanism, through MQ or
> something.

Yeah... if you do that (only attempt to open a new reader when you
know a writer has committed), please report back if that stops the
exception!

> Do you think it's worth documenting on the Wiki? The entry about FNFE
> during searches mentions NFS or SMB, but does not mention
> SimpleFSLockFactory (which solves a different problem). Maybe we can add
> that info there?

Actually that FAQ (I'm assuming you're talking about
http://wiki.apache.org/lucene-java/LuceneFAQ#head-24283600713a6643a4a643cef86af5acfc83aa96)
is talking about the old commit lock, which we no longer use (as of
lockless commits).  I think that one should be removed.

Mike



Re: SMB2 cache

Posted by Shai Erera <se...@gmail.com>.
How can the writer delete all previous segments? If I have a reader open,
doesn't it prevent those files from being deleted? That's why I count on at
least one of those files existing. Perhaps I'm wrong though.

I think we can come up w/ some notification mechanism, through MQ or
something.

Do you think it's worth documenting on the Wiki? The entry about FNFE
during searches mentions NFS or SMB, but does not mention
SimpleFSLockFactory (which solves a different problem). Maybe we can add
that info there?

Shai


Re: SMB2 cache

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Aug 13, 2009 at 5:33 PM, Shai Erera<se...@gmail.com> wrote:

> So if afterwards we read until segments_17 and exhaust read-ahead, and we
> determine that there's a problem - we throw the exception. If instead we'll
> try to read backwards, I'm sure one of the segments will be read
> successfully, because that reader must already see any segment, right?

I don't think you're guaranteed to read successfully, on reading backwards.

Ie, say writer has committed segments_8, and therefore just removed segments_7.

When the reader (on a different machine, w/ stale cache) tries to
open, its cache claims segments_7 still exists, so we try to open
that but fail.  We advance to segments_8 and try to open that, but
fail (presumably because local SMB2 cache doesn't consult the server,
unlike many NFS clients, I think).  We then try up through segments_17
and nothing works.  But going backwards can't work either because
those segments files have all been deleted.  (Assuming
KeepOnlyLastCommitDeletionPolicy... things do get more interesting if
you're using a different deletion policy...).

Sadly, the most common approach to refreshing readers, eg checking
every N seconds if it's time to reopen, leads directly to this "cache
is holding onto stale data".  My guess is if an app only attempted to
reopen the reader after the writer on another machine had committed,
then this exception wouldn't happen.  But that'd require some
notification mechanism outside of Lucene.
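Such a mechanism could be as simple as a channel the indexing machine
signals after each commit; a rough sketch (the in-process queue here just
stands in for whatever actually carries the signal between machines, eg MQ;
all names are illustrative):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of "reopen only on commit notification": instead of polling
// isCurrent() every N seconds (which keeps hitting the stale SMB2 cache),
// the search node blocks on a channel that the indexing node signals
// after each IndexWriter.commit().
public class CommitNotification {
    private final BlockingQueue<Long> commits = new LinkedBlockingQueue<>();

    // Called (indirectly) by the indexing machine after a commit.
    public void notifyCommit(long generation) {
        commits.offer(generation);
    }

    // Called by the search machine; returns the committed generation to
    // reopen against, or -1 if no commit arrived within the timeout.
    public long awaitCommit(long timeout, TimeUnit unit) throws InterruptedException {
        Long gen = commits.poll(timeout, unit);
        return gen == null ? -1 : gen;
    }

    public static void main(String[] args) throws Exception {
        CommitNotification n = new CommitNotification();
        n.notifyCommit(8L);
        System.out.println(n.awaitCommit(1, TimeUnit.SECONDS)); // prints 8
    }
}
```

The point is only that the reader waits for a signal before attempting a
reopen, so by the time it lists the directory the cache has had a full
commit interval to expire.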

Mike



Re: SMB2 cache

Posted by Shai Erera <se...@gmail.com>.
Well ... I was just thinking that, because of stale caches, we might read
segments_5 from either segments.gen or the directory listing, attempt to
read it, and succeed. The reader will then see the index only 'up to segment 5'.

So when I analyze the exception, I think:
* The directory listing couldn't have returned segments_8 as a candidate, as
otherwise we should have succeeded in reading it?
* It might have been returned from segments.gen, but since the directory
listing cache is stale, the client thinks the file does not exist.
** At which point we'd attempt to read segments_7, but might fail again.

So if afterwards we read until segments_17 and exhaust the read-ahead, and we
determine that there's a problem - we throw the exception. If instead we try
to read backwards, I'm sure one of the segments files will be read
successfully, because that reader must already be able to see some segment, right?

Or am I completely off here?

Shai


Re: SMB2 cache

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Aug 13, 2009 at 5:02 PM, Shai Erera<se...@gmail.com> wrote:
> What about doing a read-backwards as well? I.e., if read ahead fails, try to
> read backwards --> we must be able to read any segment, no?

How would that help?

Ie, IndexWriter only writes "forwards".  So if a filesystem cache is
stale, it must be the case that the "truth" lies forwards?  Ie a cache
would never lie, saying the segments_N file was bigger than it really
is?

Mike



Re: SMB2 cache

Posted by Shai Erera <se...@gmail.com>.
What about doing a read-backwards as well? I.e., if read ahead fails, try to
read backwards --> we must be able to read any segment, no?

Shai


Re: SMB2 cache

Posted by Michael McCandless <lu...@mikemccandless.com>.
This is spooky -- it looks like SMB2 (which was introduced with Windows
Vista & Windows Server 2008) now does "aggressive" client-side
caching, such that the cache can be wrong about the current state of
the directory.

At least it sort of sounds like Microsoft considers it a real issue:

> Yes, this is a known product issue of SMB2.
>
> SMB2 does implicit attribute and directory metadata caching at all
> times, whereas SMB1 was much stricter about when it would do so. The
> caches are consistent when changes are made by the client, but if
> changes are made from another client they may not be reflected until
> the cache times out.

This will definitely cause problems (like the exception you're
hitting) for Lucene.  It's exactly the same problems we had with NFS,
but the readahead in SegmentInfos.FindSegmentsFile worked around that.
It sounds like for SMB2 that readahead is not working, presumably
because (unlike NFS) the client does not check back w/ the server if
it believes (based on its stale cache) that the file does not exist.  Sigh.

SMB1 did not have this problem, in my experience.

I wonder if, from javaland, we have some way to force the cache to
become coherent.

One simple workaround at the app level is to simply retry on hitting
an errant "segments_N file not found" exception.
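A rough sketch of that retry (the helper, its names, and the retry budget
are illustrative, not part of Lucene):

```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;

// Sketch of the app-level retry: on an errant "segments_N not found",
// wait for the SMB2 cache to time out and try again a bounded number
// of times before giving up.
public class RetryOnStaleCache {
    static <T> T withRetry(Callable<T> action, int maxAttempts, long sleepMillis)
            throws Exception {
        FileNotFoundException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (FileNotFoundException e) {
                last = e;                  // likely a stale directory cache
                Thread.sleep(sleepMillis); // give the cache time to expire
            }
        }
        throw last; // still failing after the cache should have refreshed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice (stale cache), then succeeds on the third attempt.
        String result = withRetry(() -> {
            if (calls[0]++ < 2) throw new FileNotFoundException("segments_9");
            return "opened segments_9";
        }, 5, 10);
        System.out.println(result); // prints "opened segments_9"
    }
}
```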

Mike

