Posted to user@lucy.apache.org by Gerald Richter <ri...@ecos.de> on 2015/10/17 20:11:08 UTC

[lucy-user] Lucy and Coro/AnyEvent

Hi,

As far as I can see, all calls to Lucy are synchronous.

Is there a way to use it together with AnyEvent and/or Coro without
blocking the whole system for the duration of the Lucy calls?

Thanks & Regards

Gerald


Re: [lucy-user] Strange results when documents get deleted while iterating

Posted by Marvin Humphrey <ma...@rectangular.com>.
Thanks for closing the loop, and glad that things seem to be working OK!

Marvin Humphrey

On Wed, Nov 25, 2015 at 9:38 PM, Gerald Richter <ri...@ecos.de> wrote:
> Thanks for the detailed explanation. Yes, I am using Coro, but in this
> special test case only one Coro thread was running.
>
> After restarting all processes the issue has gone away. I still did not
> really understand what was going on, but since the restart (a few days ago)
> everything works as expected.
>
> Regards
>
> Gerald

Re: [lucy-user] Strange results when documents get deleted while iterating

Posted by Gerald Richter <ri...@ecos.de>.
Thanks for the detailed explanation. Yes, I am using Coro, but in this 
special test case only one Coro thread was running.

After restarting all processes the issue has gone away. I still did not 
really understand what was going on, but since the restart (a few days 
ago) everything works as expected.

Regards

Gerald


On 19.11.2015 at 16:03, Marvin Humphrey wrote:
> On Thu, Nov 19, 2015 at 4:39 AM, Gerald Richter - ECOS Technology
> <Ge...@ecos.de> wrote:
>> Hi,
>>
>> It's a local IndexSearcher.
>>
>> I have done a lot of tests and it's really happening.
>>
>> Let me give you a little more details, maybe this helps:
>>
>> - I call a function that creates a new IndexSearcher and calls $hits = $searcher->hits.
>> - I iterate over the first few entries and return the entries and the $hits object
>> - The documents that were found are deleted from a database, which in turn deletes the documents from the Lucy index.
>> - Now I iterate over the next few entries and delete them and so on
>>
>> I have made a small test where only two entries are fetched per iteration. The result looks like this:
>>
>>        id  => "8b8bce64e69b52ed244671009c11ee0e",
>>        id  => "8b8bce64e69b52ed244671009c4857e7",
>>        id  => "4a3dcd6c2e9e3074d2d52b8e72584b68",
>>        id  => "8b8bce64e69b52ed244671009c730dc9",
>>        id  => "4a3dcd6c2e9e3074d2d52b8e72584d19",
>>        id  => "8b8bce64e69b52ed244671009c7e3974",
>>        id  => "4a3dcd6c2e9e3074d2d52b8e72585475",
>>        id  => "8b8bce64e69b52ed244671009c7e4788",
>>        id  => "4a3dcd6c2e9e3074d2d52b8e72585dc2",
>>        id  => "8b8bce64e69b52ed244671009c7e2fa6",
>>
>> id is some value I store in the document. The result should only contain ids starting with 8.
>>
>> So you see the first two are correct; after these two are deleted (always in a different process), the next time the first one I get is wrong and the second one is correct...
>>
>> If I do not delete anything I only get the right entries (I just commented out one line; the rest is still the same).
>>
>> Any clue?
> When documents in an old segment are marked as deleted, that information is
> written to a bitmap deletions file which is written to a new segment.  Old
> readers are not supposed to know about new segments.  So for something to go
> wrong, either 1) information in an old segment would have to be corrupted, 2)
> a reader would have to somehow find out about information in a new segment, or
> 3) something else unrelated.
>
> Indexers write index data (including new deletions data referencing documents
> in old segments) to temp files in a new segment, which are then consolidated
> into a single per-segment "compound file" named "cf.dat".  When a reader
> opens, it mmaps cf.dat for each segment in the snapshot.  Once the reader
> successfully opens all the files it needs, it never goes looking for new
> files.
>
> It's hard to imagine a mechanism that would either cause an existing "cf.dat"
> file to be modified, or persuade a reader to go look at a new "cf.dat"
> file.  So unless my reasoning is wrong, the cause is #3 -- something else
> unrelated.  I really have no idea what that could be, though since you've
> previously asked some questions about Coro/AnyEvent and other concurrency
> stuff, the most likely prospect would seem to be something unique to your
> setup.
>
> The next step is probably to take the behavior you've been able to reproduce
> and isolate it in a test case that others can run and analyze.
>
> Marvin Humphrey
>


Re: [lucy-user] Strange results when documents get deleted while iterating

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Nov 19, 2015 at 4:39 AM, Gerald Richter - ECOS Technology
<Ge...@ecos.de> wrote:
> Hi,
>
> It's a local IndexSearcher.
>
> I have done a lot of tests and it's really happening.
>
> Let me give you a little more details, maybe this helps:
>
> - I call a function that creates a new IndexSearcher and calls $hits = $searcher->hits.
> - I iterate over the first few entries and return the entries and the $hits object
> - The documents that were found are deleted from a database, which in turn deletes the documents from the Lucy index.
> - Now I iterate over the next few entries and delete them and so on
>
> I have made a small test where only two entries are fetched per iteration. The result looks like this:
>
>       id  => "8b8bce64e69b52ed244671009c11ee0e",
>       id  => "8b8bce64e69b52ed244671009c4857e7",
>       id  => "4a3dcd6c2e9e3074d2d52b8e72584b68",
>       id  => "8b8bce64e69b52ed244671009c730dc9",
>       id  => "4a3dcd6c2e9e3074d2d52b8e72584d19",
>       id  => "8b8bce64e69b52ed244671009c7e3974",
>       id  => "4a3dcd6c2e9e3074d2d52b8e72585475",
>       id  => "8b8bce64e69b52ed244671009c7e4788",
>       id  => "4a3dcd6c2e9e3074d2d52b8e72585dc2",
>       id  => "8b8bce64e69b52ed244671009c7e2fa6",
>
> id is some value I store in the document. The result should only contain ids starting with 8.
>
> So you see the first two are correct; after these two are deleted (always in a different process), the next time the first one I get is wrong and the second one is correct...
>
> If I do not delete anything I only get the right entries (I just commented out one line; the rest is still the same).
>
> Any clue?

When documents in an old segment are marked as deleted, that information is
written to a bitmap deletions file which is written to a new segment.  Old
readers are not supposed to know about new segments.  So for something to go
wrong, either 1) information in an old segment would have to be corrupted, 2)
a reader would have to somehow find out about information in a new segment, or
3) something else unrelated.

Indexers write index data (including new deletions data referencing documents
in old segments) to temp files in a new segment, which are then consolidated
into a single per-segment "compound file" named "cf.dat".  When a reader
opens, it mmaps cf.dat for each segment in the snapshot.  Once the reader
successfully opens all the files it needs, it never goes looking for new
files.

It's hard to imagine a mechanism that would either cause an existing "cf.dat"
file to be modified, or persuade a reader to go look at a new "cf.dat"
file.  So unless my reasoning is wrong, the cause is #3 -- something else
unrelated.  I really have no idea what that could be, though since you've
previously asked some questions about Coro/AnyEvent and other concurrency
stuff, the most likely prospect would seem to be something unique to your
setup.

The next step is probably to take the behavior you've been able to reproduce
and isolate it in a test case that others can run and analyze.

Marvin Humphrey
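One way to sidestep the race while a proper test case is being built: drain the hits iterator completely before any deletions happen, so no reader is consulted after the index starts changing underneath it. Below is a minimal sketch of that pattern; FakeHits is a hypothetical stand-in for a Lucy::Search::Hits object, since Lucy itself is not assumed to be installed here.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Stand-in for Lucy::Search::Hits, just enough to show the pattern
# (an assumption for this sketch; not a Lucy API).
package FakeHits;
sub new  { my ($class, @docs) = @_; bless { docs => [@docs], i => 0 }, $class }
sub next { my ($self) = @_; $self->{docs}[ $self->{i}++ ] }

package main;

my $hits = FakeHits->new(
    { id => "doc1" },
    { id => "doc2" },
    { id => "doc3" },
);

# Step 1: drain the iterator completely before touching the index.
my @ids;
while (my $hit = $hits->next) {
    push @ids, $hit->{id};
}

# Step 2: only now perform the deletions (DB plus Lucy index in real
# code), so the searcher is never read after the index mutates.
print "deleting $_\n" for @ids;
```

The key design point is that step 1 and step 2 never interleave, which is exactly the interleaving the original loop performed.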

Re: [lucy-user] Strange results when documents get deleted while iterating

Posted by Gerald Richter - ECOS Technology <Ge...@ecos.de>.
Hi,

It's a local IndexSearcher.

I have done a lot of tests and it's really happening.

Let me give you a little more details, maybe this helps:

- I call a function that creates a new IndexSearcher and calls $hits = $searcher->hits.
- I iterate over the first few entries and return the entries and the $hits object
- The documents that were found are deleted from a database, which in turn deletes the documents from the Lucy index.
- Now I iterate over the next few entries and delete them and so on

I have made a small test where only two entries are fetched per iteration. The result looks like this:

      id  => "8b8bce64e69b52ed244671009c11ee0e",
      id  => "8b8bce64e69b52ed244671009c4857e7",
      id  => "4a3dcd6c2e9e3074d2d52b8e72584b68",
      id  => "8b8bce64e69b52ed244671009c730dc9",
      id  => "4a3dcd6c2e9e3074d2d52b8e72584d19",
      id  => "8b8bce64e69b52ed244671009c7e3974",
      id  => "4a3dcd6c2e9e3074d2d52b8e72585475",
      id  => "8b8bce64e69b52ed244671009c7e4788",
      id  => "4a3dcd6c2e9e3074d2d52b8e72585dc2",
      id  => "8b8bce64e69b52ed244671009c7e2fa6",

id is some value I store in the document. The result should only contain ids starting with 8.

So you see the first two are correct; after these two are deleted (always in a different process), the next time the first one I get is wrong and the second one is correct...

If I do not delete anything I only get the right entries (I just commented out one line; the rest is still the same).

Any clue?

Thanks & Regards

Gerald




-----Original Message-----
From: Marvin Humphrey [mailto:marvin@rectangular.com]
Sent: Thursday, 19 November 2015 12:19
To: user@lucy.apache.org
Subject: Re: [lucy-user] Strange results when documents get deleted while iterating

On Wed, Nov 18, 2015 at 10:22 PM, Gerald Richter <ri...@ecos.de> wrote:

> I have a simple query that consists of a TermQuery and a RangeQuery, and I am
> iterating over it like this:
>
>         while ($cnt-- >= 0 && ($hit = $hits -> next))
>             {
>             $data = $hit->get_fields() ;
>             ....
>             }
>
> While this loop runs, documents are deleted from the index by another
> process. Without this other process everything is fine. When this deletion
> is happening, it seems that half of the documents returned by $hits->next
> are wrong, which means I get a totally different document, one that should
> not be part of the result set.
>
> I thought that a searcher operates on a snapshot, so changes that happen at
> the same time do not influence the query. Is this wrong? If yes, how could
> I make sure my result set is not corrupted?

What kind of a Searcher is this?  If it's an IndexSearcher operating
on a local index, I don't see how it could happen.  But if it's a
ClusterSearcher, then it would be possible if the remotes are being
refreshed.

Marvin Humphrey



Re: [lucy-user] Strange results when documents get deleted while iterating

Posted by Gerald Richter <ri...@ecos.de>.
Hi,

It's a local IndexSearcher.

I have done a lot of tests and it's really happening.

Let me give you a little more details, maybe this helps:

- I call a function that creates a new IndexSearcher and calls
$hits = $searcher->hits.

- I iterate over the first few entries and return the entries and the
$hits object.

- The documents that were found are deleted from a database, which in
turn deletes the documents from the Lucy index.

- Now I iterate over the next few entries, delete them, and so on.

I have made a small test where only two entries are fetched per 
iteration. The result looks like this:

id => "8b8bce64e69b52ed244671009c11ee0e",
id => "8b8bce64e69b52ed244671009c4857e7",
id => "4a3dcd6c2e9e3074d2d52b8e72584b68",
id => "8b8bce64e69b52ed244671009c730dc9",
id => "4a3dcd6c2e9e3074d2d52b8e72584d19",
id => "8b8bce64e69b52ed244671009c7e3974",
id => "4a3dcd6c2e9e3074d2d52b8e72585475",
id => "8b8bce64e69b52ed244671009c7e4788",
id => "4a3dcd6c2e9e3074d2d52b8e72585dc2",
id => "8b8bce64e69b52ed244671009c7e2fa6",

id is some value I store in the document. The result should only contain 
ids starting with 8.

So you see the first two are correct; after these two are deleted (always 
in a different process), the next time the first one I get is wrong and 
the second one is correct...

If I do not delete anything I only get the right entries (I just 
commented out one line; the rest is still the same).

Any clue?

Thanks & Regards

Gerald



On 19.11.2015 at 12:18, Marvin Humphrey wrote:
> On Wed, Nov 18, 2015 at 10:22 PM, Gerald Richter <ri...@ecos.de> wrote:
>
>> I have a simple query that consists of a TermQuery and a RangeQuery, and I am
>> iterating over it like this:
>>
>>          while ($cnt-- >= 0 && ($hit = $hits -> next))
>>              {
>>              $data = $hit->get_fields() ;
>>              ....
>>              }
>>
>> While this loop runs, documents are deleted from the index by another
>> process. Without this other process everything is fine. When this deletion
>> is happening, it seems that half of the documents returned by $hits->next
>> are wrong, which means I get a totally different document, one that should
>> not be part of the result set.
>>
>> I thought that a searcher operates on a snapshot, so changes that happen at
>> the same time do not influence the query. Is this wrong? If yes, how could
>> I make sure my result set is not corrupted?
> What kind of a Searcher is this?  If it's an IndexSearcher operating
> on a local index, I don't see how it could happen.  But if it's a
> ClusterSearcher, then it would be possible if the remotes are being
> refreshed.
>
> Marvin Humphrey
>


Re: [lucy-user] Strange results when documents get deleted while iterating

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 18, 2015 at 10:22 PM, Gerald Richter <ri...@ecos.de> wrote:

> I have a simple query that consists of a TermQuery and a RangeQuery, and I am
> iterating over it like this:
>
>         while ($cnt-- >= 0 && ($hit = $hits -> next))
>             {
>             $data = $hit->get_fields() ;
>             ....
>             }
>
> While this loop runs, documents are deleted from the index by another
> process. Without this other process everything is fine. When this deletion
> is happening, it seems that half of the documents returned by $hits->next
> are wrong, which means I get a totally different document, one that should
> not be part of the result set.
>
> I thought that a searcher operates on a snapshot, so changes that happen at
> the same time do not influence the query. Is this wrong? If yes, how could
> I make sure my result set is not corrupted?

What kind of a Searcher is this?  If it's an IndexSearcher operating
on a local index, I don't see how it could happen.  But if it's a
ClusterSearcher, then it would be possible if the remotes are being
refreshed.

Marvin Humphrey

[lucy-user] Strange results when documents get deleted while iterating

Posted by Gerald Richter <ri...@ecos.de>.
Hi,

I have a simple query that consists of a TermQuery and a RangeQuery, and 
I am iterating over it like this:

         while ($cnt-- >= 0 && ($hit = $hits -> next))
             {
             $data = $hit->get_fields() ;
             ....
             }

While this loop runs, documents are deleted from the index by another 
process. Without this other process everything is fine. When this 
deletion is happening, it seems that half of the documents returned by 
$hits->next are wrong, which means I get a totally different document, 
one that should not be part of the result set.

I thought that a searcher operates on a snapshot, so changes that 
happen at the same time do not influence the query. Is this wrong? If 
yes, how could I make sure my result set is not corrupted?

Thanks & Regards

Gerald





Re: [lucy-user] how to get distinct values of a field

Posted by Gerald Richter <ri...@ecos.de>.
That works great!

Thanks

Gerald


On 02.11.2015 at 14:14, Nick Wellnhofer wrote:
>     my $index = Lucy::Index::IndexReader->open(index => $path_to_index);
>     my $lex_reader = $index->obtain('Lucy::Index::LexiconReader');
>     my $lexicon = $lex_reader->lexicon(field => $field_name);
>     my @terms;
>
>     while ($lexicon->next) {
>         push(@terms, $lexicon->get_term);
>     }
>
> Depending on the size of your index and the number of segments, it 
> might be more efficient to merge the terms from multiple segments 
> manually:
>
>     my $index = Lucy::Index::IndexReader->open(index => $path_to_index);
>     my $seg_readers = $index->seg_readers;
>     my %term_hash;
>
>     for my $seg_reader (@$seg_readers) {
>         my $lex_reader = $seg_reader->obtain('Lucy::Index::LexiconReader');
>         my $lexicon = $lex_reader->lexicon(field => $field_name);
>
>         while ($lexicon->next) {
>             my $term = $lexicon->get_term;
>             $term_hash{$term} = undef;
>         }
>     }
>
>     my @terms = keys(%term_hash); 


Re: [lucy-user] how to get distinct values of a field

Posted by Nick Wellnhofer <we...@aevum.de>.
On 02/11/2015 08:54, Gerald Richter wrote:
> I would like to get all distinct values of a field, something which in SQL
> would look like this:
>
> select distinct fieldname from table
>
> where fieldname is a StringType.
>
> Is this possible with Lucy?

The easiest way (using a PolyLexiconReader under the hood):

     my $index = Lucy::Index::IndexReader->open(index => $path_to_index);
     my $lex_reader = $index->obtain('Lucy::Index::LexiconReader');
     my $lexicon = $lex_reader->lexicon(field => $field_name);
     my @terms;

     while ($lexicon->next) {
         push(@terms, $lexicon->get_term);
     }

Depending on the size of your index and the number of segments, it might be 
more efficient to merge the terms from multiple segments manually:

     my $index = Lucy::Index::IndexReader->open(index => $path_to_index);
     my $seg_readers = $index->seg_readers;
     my %term_hash;

     for my $seg_reader (@$seg_readers) {
         my $lex_reader = $seg_reader->obtain('Lucy::Index::LexiconReader');
         my $lexicon = $lex_reader->lexicon(field => $field_name);

         while ($lexicon->next) {
             my $term = $lexicon->get_term;
             $term_hash{$term} = undef;
         }
     }

     my @terms = keys(%term_hash);

Note that these examples also work with full text fields.

Nick


[lucy-user] how to get distinct values of a field

Posted by Gerald Richter <ri...@ecos.de>.
Hi,

I would like to get all distinct values of a field, something which in 
SQL would look like this:

select distinct fieldname from table

where fieldname is a StringType.

Is this possible with Lucy?

Thanks & Regards

Gerald





Re: [lucy-user] Lucy and Coro/AnyEvent

Posted by Nick Wellnhofer <we...@aevum.de>.
On 20/10/2015 09:43, Gerald Richter wrote:
> Is the searcher thread safe?

In a strict sense, no. But the code is reentrant, so it's possible to use 
multiple searchers from separate threads as long as each searcher is only used 
by a single thread.

A user-supplied locking mechanism should work, too. But the Perl bindings use 
the CLONE_SKIP facility, so Lucy objects can't be shared across Perl threads 
anyway.

> Is there any documentation about the C interface of Lucy?

The C interface will be fully documented in the upcoming 0.5 release. You can 
find a preview here:

 
     https://rawgit.com/nwellnhof/lucy/generated_docs/c/autogen/share/doc/clownfish/lucy.html

Here is some sample code:

     https://github.com/apache/lucy/tree/master/c/sample

Nick
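The rule Nick states, one private searcher per thread of control, carries over naturally to processes, which is the usual route in Perl given that CLONE_SKIP keeps Lucy objects from crossing ithread boundaries. A hedged sketch follows; new_searcher() is a hypothetical stand-in for opening a Lucy::Search::IndexSearcher inside each worker, and only core Perl (fork, pipe) is used.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for Lucy::Search::IndexSearcher->new(...):
# each worker must construct its own, never receive one from the parent.
sub new_searcher { my ($n) = @_; return { worker => $n } }

my @readers;
for my $n (1, 2) {
    pipe(my $r, my $w) or die "pipe: $!";
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {                 # child: build a PRIVATE searcher
        close $r;
        my $searcher = new_searcher($n);
        print {$w} "worker $searcher->{worker} has its own searcher\n";
        close $w;
        exit 0;
    }
    close $w;                        # parent keeps only the read end
    push @readers, $r;
}

for my $r (@readers) {               # read results in creation order
    print scalar <$r>;
    close $r;
}
wait() for 1 .. 2;                   # reap both children
```

Because no searcher handle ever crosses a process boundary, the reentrancy caveat above is satisfied by construction.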


Re: [lucy-user] Lucy and Coro/AnyEvent

Posted by Gerald Richter <ri...@ecos.de>.
Hi Marvin,

thanks for your feedback.

Using threads, as IO::AIO/AnyEvent::AIO does, would be my preferred way.

Is the searcher thread safe?

Is there any documentation about the C interface of Lucy?

Thanks & Regards

Gerald


On 19.10.2015 at 21:54, Marvin Humphrey wrote:
> On Sat, Oct 17, 2015 at 11:11 AM, Gerald Richter <ri...@ecos.de> wrote:
>> Hi,
>>
>> As far as I can see, all calls to Lucy are synchronous.
>>
>> Is there a way to use it together with AnyEvent and/or Coro without
>> blocking the whole system for the duration of the Lucy calls?
> Hi Gerald,
>
> The only way I think it could work would be to launch a concurrent
> independent process/thread on which Lucy does work. A call to interact
> with the Lucy thread would then fire off work to be done on the
> separate thread and register a callback signaling the main thread when
> the work is done. That's effectively what we do in
> LucyX::Remote::ClusterSearcher, though that's using a select loop.
>
> Marvin Humphrey
>

Re: [lucy-user] Lucy and Coro/AnyEvent

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sat, Oct 17, 2015 at 11:11 AM, Gerald Richter <ri...@ecos.de> wrote:
> Hi,
>
> As far as I can see, all calls to Lucy are synchronous.
>
> Is there a way to use it together with AnyEvent and/or Coro without
> blocking the whole system for the duration of the Lucy calls?

Hi Gerald,

The only way I think it could work would be to launch a concurrent
independent process/thread on which Lucy does work. A call to interact
with the Lucy thread would then fire off work to be done on the
separate thread and register a callback signaling the main thread when
the work is done. That's effectively what we do in
LucyX::Remote::ClusterSearcher, though that's using a select loop.

Marvin Humphrey
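The approach Marvin describes (fire the blocking Lucy call off on a separate process or thread, then signal the main loop when it completes) can be sketched with core Perl fork and a pipe. In real AnyEvent code the parent would register an io() watcher on the read end instead of blocking on it; everything Lucy-specific is faked here, and run_search() is an assumption standing in for a blocking call such as $searcher->hits(...).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Stand-in for a blocking Lucy call; an assumption, not a Lucy API.
sub run_search {
    my ($query) = @_;
    return "results for '$query'";
}

pipe(my $reader, my $writer) or die "pipe failed: $!";
my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {                 # child: perform the blocking work
    close $reader;
    print {$writer} run_search("foo"), "\n";
    close $writer;
    exit 0;
}

close $writer;                   # parent stays free to run its event loop
my $result = <$reader>;          # AnyEvent would use an io() watcher here
close $reader;
waitpid($pid, 0);

print $result;
```

The callback Marvin mentions corresponds to whatever the watcher on $reader runs once the line arrives; the parent never blocks inside Lucy itself.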