Posted to user@lucy.apache.org by goran kent <go...@gmail.com> on 2011/11/16 08:28:25 UTC

[lucy-user] Couldn't completely remove 'seg_N'

Top of the morning to all and sundry,

I've noticed this occasional error/failure during index merging:

  Couldn't completely remove 'seg_N'

Which comes from core/Lucy/Index/SegWriter.c, where the comment in
SegWriter_prep_seg_dir() says:

  // Clear stale segment files from crashed indexing sessions.

Is that stale segment folder from a previous unrelated index/merge
session, or is it from the current session which has crashed/failed
and this is part of the cleanup procedure?  It seems to be the former,
am I right?  The "_prep_" in SegWriter_prep_seg_dir() seems to imply
this is a brand new session trying to create the seg_N folder, which
throws an exception since the folder already exists.

I'll start some debugging sometime today to try and track down where
the hell that crash is happening, but I just wanted to clarify my
understanding of the code.

btw, if seg_N is empty, why is Folder_Delete_Tree() failing to trash
it?  Maybe because the stale write.lock is still soiling the
situation?  (grep -rl '^Folder_Delete_Tree' * failed to find anything,
so I couldn't have a quick look to confirm that idea)

Thanks!

Re: [lucy-user] Couldn't completely remove 'seg_N'

Posted by goran kent <go...@gmail.com>.
On Fri, Nov 18, 2011 at 5:18 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
>> The lockfile contains:
>> {
>>   "host": "host6",
>>   "name": "write",
>>   "pid": "24342"
>> }
>
> OK, all that looks correct.  Also, since the lockfile is still there and
> definitely corresponds to the process that crashed, we can assume that no
> other process has messed with the index directory since.
>
> Question: is there a seg_2 folder in the index dir?  If so, is there anything
> inside it?

Yes, seg_2 was there - it had the same timestamp as the lockfile,
implying it had been created by the same process that created the
lockfile.
Secondly, no, it was empty - there was nothing inside it.

> It could be NFS cache consistency: a deletion operation succeeds, and the item
> is really gone from the NFS drive, but the local cache of the NFS client
> doesn't get updated in time and a subsequent check on whether the item exists
> returns an incorrect result.
>
>    http://nfs.sourceforge.net/#faq_a8
>
>    Perfect cache coherency among disparate NFS clients is very expensive to
>    achieve, so NFS settles for something weaker that satisfies the
>    requirements of most everyday types of file sharing.
>
> A tremendous amount of energy has gone into making NFS mimic local file system
> behaviors as closely as possible, both by the NFS devs and by us (see
> <http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/FileLocking.html>) but
> it's a very hard problem and compromises are impossible to avoid.

Some food for thought, thanks.  I'll start looking into my index store
servers and their NFS exports.

> Best practice would be to avoid writing to Lucy indexes on NFS drives if
> possible.  Read performance is going to be lousy anyway unless you make the
> NFS mount read-only.

There's just too much data (and we need redundancy), and the load needs
to be spread across as many storage nodes as possible (we have separate
source-store and index-store servers).  When all the indexing machines
are grinding away in unison, sucking from the source-store servers
(via NFS) and writing to the index-store servers (also via NFS), the
load can get quite high, so some NFS tweaking has been done on the
source/target servers.

Merging nodes read indexes from the index-store servers (NFS) and
write them onto the search-server nodes themselves (also via NFS).
Searching is then always local.

The load at the moment is negligible, so I know load isn't what's
causing a problem with NFS - however, with NFS you never know, so I'll
be focusing on that next.

Thanks for your comments!

Re: [lucy-user] Couldn't completely remove 'seg_N'

Posted by goran kent <go...@gmail.com>.
I'm chasing up all possibilities, since so far I cannot find any
evidence that NFS is a factor here (at least, I cannot find any
NFS-related kernel messages in the logs on the machines in question).

What's the best strategy to handle this scenario:

1. open index for merging.
2. loop
3.     add_index ($subidx).
4.     if error, then what?  # I currently return without committing or anything else
5. end loop.
6. commit.
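
In rough Perl terms, the loop looks like this (the path and the
@sub_indexes list here are placeholders, not my actual code):

    use Lucy::Index::Indexer;

    sub merge_subindexes {
        my @sub_indexes = @_;
        my $indexer = Lucy::Index::Indexer->new(
            index => '/path/to/target_index',
        );
        for my $subidx (@sub_indexes) {
            eval { $indexer->add_index($subidx) };
            if ($@) {
                warn "add_index failed for $subidx: $@";
                return;    # bail out without ever calling commit()
            }
        }
        $indexer->commit();    # only reached if every add_index succeeded
    }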

I just want to confirm that what I'm doing above is ok:  presumably, if
a single subindex fails to merge successfully and I abort the entire
loop, nothing will be committed and the original index will still be
ok.
My logs lead me to believe that the above is not related to the issue
I'm experiencing.

thanks

Re: [lucy-user] Couldn't completely remove 'seg_N'

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Nov 18, 2011 at 12:22:09PM +0200, goran kent wrote:
> On Thu, Nov 17, 2011 at 9:08 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> > If you don't supply a hostname, machines will zap each other's lockfiles.
> 
> Another one of these has popped up this morning:
> 
> "Lucy::Index::Indexer->new failed (Couldn't completely remove 'seg_2'"
> 
> ...even though I'm using IndexManager:
> 
> my $manager = Lucy::Index::IndexManager->new(
>         host => $host,
>     );
> $index = Lucy::Index::Indexer->new(
>             schema   => $schema,
>             index    => $target,
>             manager  => $manager,
>             create   => 1,
>             truncate => 0,
>         );
> 
> The lockfile contains:
> {
>   "host": "host6",
>   "name": "write",
>   "pid": "24342"
> }
 
> The hostname and PID correspond to the current host and the PID
> corresponds to the script trying to update the index at the time of
> the Lucy::Index::Indexer->new above.

OK, all that looks correct.  Also, since the lockfile is still there and
definitely corresponds to the process that crashed, we can assume that no
other process has messed with the index directory since.

Question: is there a seg_2 folder in the index dir?  If so, is there anything
inside it?

The other question is *why* seg_2 existed in that index, because even if it's
gone now, it was there before.  Either an Indexer crashed, or an Indexer was
created but commit() was never called.
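
For illustration, here's a hypothetical sketch of that second scenario
(the path and the document fields are invented):

    use Lucy::Index::Indexer;

    my $indexer = Lucy::Index::Indexer->new( index => '/path/to/index' );
    $indexer->add_doc( { title => 'example' } );
    # ... the process dies or the calling code returns here, without ever
    # reaching $indexer->commit(), and the freshly prepped seg_N directory
    # gets left behind in the index ...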

> Is my code sample above correct in its usage of IndexManager()?  eg,
> do I need to specify anything else to ensure write exclusivity?  Is
> there something else going on here?

It could be NFS cache consistency: a deletion operation succeeds, and the item
is really gone from the NFS drive, but the local cache of the NFS client
doesn't get updated in time and a subsequent check on whether the item exists
returns an incorrect result.  

    http://nfs.sourceforge.net/#faq_a8

    Perfect cache coherency among disparate NFS clients is very expensive to
    achieve, so NFS settles for something weaker that satisfies the
    requirements of most everyday types of file sharing. 

A tremendous amount of energy has gone into making NFS mimic local file system
behaviors as closely as possible, both by the NFS devs and by us (see
<http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/FileLocking.html>) but
it's a very hard problem and compromises are impossible to avoid.

Best practice would be to avoid writing to Lucy indexes on NFS drives if
possible.  Read performance is going to be lousy anyway unless you make the
NFS mount read-only.

Marvin Humphrey


Re: [lucy-user] Couldn't completely remove 'seg_N'

Posted by goran kent <go...@gmail.com>.
On Thu, Nov 17, 2011 at 9:08 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> If you don't supply a hostname, machines will zap each other's lockfiles.

Another one of these has popped up this morning:

"Lucy::Index::Indexer->new failed (Couldn't completely remove 'seg_2'"

...even though I'm using IndexManager:

my $manager = Lucy::Index::IndexManager->new(
    host => $host,
);
$index = Lucy::Index::Indexer->new(
    schema   => $schema,
    index    => $target,
    manager  => $manager,
    create   => 1,
    truncate => 0,
);

The lockfile contains:
{
  "host": "host6",
  "name": "write",
  "pid": "24342"
}

The hostname corresponds to the current host, and the PID corresponds
to the script that was trying to update the index at the time of the
Lucy::Index::Indexer->new call above.  The seg_2 timestamp matches the
lockfile's, so it seems improbable that some other writer created
seg_2... especially since all the writers (ie, all the indexer cluster
nodes) are now using IndexManager, which provides improved locking.

Is my code sample above correct in its usage of IndexManager()?  eg,
do I need to specify anything else to ensure write exclusivity?  Is
there something else going on here?

thanks

Re: [lucy-user] Couldn't completely remove 'seg_N'

Posted by goran kent <go...@gmail.com>.
On Thu, Nov 17, 2011 at 9:08 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> If you don't supply a hostname, machines will zap each other's lockfiles.

understood, thanks.

Re: [lucy-user] Couldn't completely remove 'seg_N'

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Nov 17, 2011 at 08:42:05AM +0200, goran kent wrote:
> Thanks, I've stuck this in, so let's see how it goes.  The docs for
> Lucy::Index::IndexManager don't say much about what exactly the above
> will do.  Presumably it's an extra lock/concurrency check or something
> that will croak if a conflict is detected?

    http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/FileLocking.html

    Both read and write applications accessing an index on a shared volume
    need to identify themselves with a unique host id, e.g. hostname or ip
    address.  Knowing the host id makes it possible to tell which lockfiles
    belong to other machines and therefore must not be removed when the
    lockfile's pid number appears not to correspond to an active process.

    http://incubator.apache.org/lucy/docs/perl/Lucy/Store/Lock.html#clear_stale-

    Release all locks that meet the following three conditions: the lock name
    matches, the host id matches, and the process id that the lock was created
    under no longer identifies an active process.

> I see the hostname then goes into write.lock:host.  Is that merely to
> provide a fingerprint of the last machine to touch the index?  Also, I
> vaguely recall something to the effect that the lockfile will be
> overwritten anyway; does IndexManager prevent that, or maybe I'm
> misunderstanding the context here?

If you don't supply a hostname, machines will zap each other's lockfiles.

Marvin Humphrey
     

Re: [lucy-user] Couldn't completely remove 'seg_N'

Posted by goran kent <go...@gmail.com>.
On 11/17/11, Marvin Humphrey <ma...@rectangular.com> wrote:
> The fact that you are encountering repeated errors there is troubling;
> the same symptom can arise when two indexers from different machines
> collide when trying to write to the same index on a shared drive.  I
> recall you saying that you had your own locking mechanism, but as a
> precaution,

Yes, something is not working as expected with my locking, I'm sure.
I have a feeling things become brittle under heavy load.

> I suggest that you use this code to
> provide a defense in depth against locking problems:
>
>     use Sys::Hostname qw( hostname );
>     my $hostname = hostname() or die "Can't get unique hostname";
>     my $manager = Lucy::Index::IndexManager->new(
>         host => $hostname,
>     );
>     my $indexer = Lucy::Index::Indexer->new(
>         index => '/path/to/index',
>         manager => $manager,
>     );


Thanks, I've stuck this in, so let's see how it goes.  The docs for
Lucy::Index::IndexManager don't say much about what exactly the above
will do.  Presumably it's an extra lock/concurrency check or something
that will croak if a conflict is detected?

I see the hostname then goes into write.lock:host.  Is that merely to
provide a fingerprint of the last machine to touch the index?  Also, I
vaguely recall something to the effect that the lockfile will be
overwritten anyway; does IndexManager prevent that, or maybe I'm
misunderstanding the context here?

Re: [lucy-user] Couldn't completely remove 'seg_N'

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 16, 2011 at 09:28:25AM +0200, goran kent wrote:
> I've noticed this occasional error/failure during index merging:
> 
>   Couldn't completely remove 'seg_N'
> 
> Which comes from core/Lucy/Index/SegWriter.c, where the comment in
> SegWriter_prep_seg_dir() says:
> 
>   // Clear stale segment files from crashed indexing sessions.
> 
> Is that stale segment folder from a previous unrelated index/merge
> session, or is it from the current session which has crashed/failed
> and this is part of the cleanup procedure?  

> It seems to be the former, am I right?

That code is indeed supposed to clear files from discrete indexing sessions
which crashed earlier.

It generally works without calling attention to itself.  The fact that you are
encountering repeated errors there is troubling; the same symptom can arise
when two indexers from different machines collide when trying to write to the
same index on a shared drive.  I recall you saying that you had your own
locking mechanism, but as a precaution, I suggest that you use this code to
provide a defense in depth against locking problems:

    use Sys::Hostname qw( hostname );
    my $hostname = hostname() or die "Can't get unique hostname";
    my $manager = Lucy::Index::IndexManager->new( 
        host => $hostname,
    );
    my $indexer = Lucy::Index::Indexer->new(
        index => '/path/to/index',
        manager => $manager,
    );

> (grep -rl '^Folder_Delete_Tree' * failed to find anything,
> so I couldn't have a quick look to confirm that idea)

Folder_Delete_Tree is a method call.  The implementing function which gets
called in this case is Folder_delete_tree, in core/Lucy/Store/Folder.c.  

Marvin Humphrey


[lucy-user] Re: Couldn't completely remove 'seg_N'

Posted by goran kent <go...@gmail.com>.
On Wed, Nov 16, 2011 at 9:28 AM, goran kent <go...@gmail.com> wrote:
> Is that stale segment folder from a previous unrelated index/merge
> session, or is it from the current session which has crashed/failed
> and this is part of the cleanup procedure?  It seems to be the former,
> am I right?  The "_prep_" in SegWriter_prep_seg_dir() seems to imply
> this is a brand new session trying to create the seg_N folder, which
> throws an exception since the folder already exists.
>
> I'll start some debugging sometime today to try and track down where
> the hell that crash is happening, but I just wanted to clarify my
> understanding of the code.
>
> btw, if seg_N is empty, why is Folder_Delete_Tree() failing to trash
> it?  Maybe because the stale write.lock is still soiling the
> situation?  (grep -rl '^Folder_Delete_Tree' * failed to find anything,
> so I couldn't have a quick look to confirm that idea)

Looks like the lock file is for the current session (the PID therein
and the timestamp all match up), and not for a previous unrelated
crashed session.

So, it locks the index successfully, does something, then tries to remove seg_4:

drwxr-xr-x 2 root root 4.0K Nov 16 01:30 seg_4
drwxr-xr-x 2 root root 4.0K Nov 16 01:30 locks
-rw-r--r-- 1 root root  119 Nov  7 10:07 snapshot_3.json
-rw-r--r-- 1 root root  13K Nov  7 10:07 schema_3.json
drwxr-xr-x 2 root root 4.0K Nov  7 10:07 seg_3
drwxr-xr-x 2 root root 4.0K Nov  7 09:48 seg_2
drwxr-xr-x 2 root root 4.0K Nov  6 22:42 seg_1

-rw-r--r-- 1 root root 54 Nov 16 01:30 write.lock
cat write.lock:
{
  "host": "",
  "name": "write",
  "pid": "26271"
}

This happened during an automated run.  When I simulated the run today
manually, it succeeded (ie, seg_4 was ignored, seg_5 was created, the
lockfile was purged, etc).

I'm trying to get my head around what could be going wrong here so I
can automate self-healing or better handle this scenario.