Posted to user@lucy.apache.org by Gerald Richter <ri...@ecos.de> on 2015/10/17 20:11:08 UTC
[lucy-user] Lucy and Coro/AnyEvent
Hi,
as far as I can see, all calls to Lucy are synchronous.
Is there a way to use it together with AnyEvent and/or Coro without
blocking the whole system for the duration of the Lucy calls?
Thanks & Regards
Gerald
Re: [lucy-user] Strange results when documents get deleted while iterating
Posted by Marvin Humphrey <ma...@rectangular.com>.
Thanks for closing the loop, and glad that things seem to be working OK!
Marvin Humphrey
On Wed, Nov 25, 2015 at 9:38 PM, Gerald Richter <ri...@ecos.de> wrote:
> Thanks for the detailed explanation. Yes, I am using Coro, but in this
> special test case only one Coro thread was running.
>
> After restarting all processes the issue has gone away. I still did not
> really understand what was going on, but since the restart (a few days ago)
> everything has worked as expected.
>
> Regards
>
> Gerald
Re: [lucy-user] Strange results when documents get deleted while iterating
Posted by Gerald Richter <ri...@ecos.de>.
Thanks for the detailed explanation. Yes, I am using Coro, but in this
special test case only one Coro thread was running.
After restarting all processes the issue has gone away. I still did not
really understand what was going on, but since the restart (a few days
ago) everything has worked as expected.
Regards
Gerald
On 19.11.2015 at 16:03, Marvin Humphrey wrote:
> On Thu, Nov 19, 2015 at 4:39 AM, Gerald Richter - ECOS Technology
> <Ge...@ecos.de> wrote:
>> Hi,
>>
>> It's a local IndexSearcher.
>>
>> I have done a lot of tests and it's really happening.
>>
>> Let me give you a few more details; maybe this helps:
>>
>> - I call a function that creates a new IndexSearcher and calls $hits = $searcher->hits.
>> - I iterate over the first few entries and return the entries and the $hits object.
>> - The documents that were found are deleted from a database, which in turn deletes the documents from the Lucy index.
>> - Now I iterate over the next few entries, delete them, and so on.
>>
>> I have made a small test where per iteration only two entries are fetched. The result looks like this:
>>
>> id => "8b8bce64e69b52ed244671009c11ee0e",
>> id => "8b8bce64e69b52ed244671009c4857e7",
>> id => "4a3dcd6c2e9e3074d2d52b8e72584b68",
>> id => "8b8bce64e69b52ed244671009c730dc9",
>> id => "4a3dcd6c2e9e3074d2d52b8e72584d19",
>> id => "8b8bce64e69b52ed244671009c7e3974",
>> id => "4a3dcd6c2e9e3074d2d52b8e72585475",
>> id => "8b8bce64e69b52ed244671009c7e4788",
>> id => "4a3dcd6c2e9e3074d2d52b8e72585dc2",
>> id => "8b8bce64e69b52ed244671009c7e2fa6",
>>
>> id is some value I store in the document. The result should only contain ids starting with 8.
>>
>> So you see the first two are correct; after deletion of these two (always in a different process), the next time the first one I get is wrong and the second one is correct...
>>
>> If I do not delete anything I only get the right entries (just one line commented out; the rest is still the same).
>>
>> Any clue?
> When documents in an old segment are marked as deleted, that information is
> written to a bitmap deletions file which is written to a new segment. Old
> readers are not supposed to know about new segments. So for something to go
> wrong, either 1) information in an old segment would have to be corrupted, 2)
> a reader would have to somehow find out about information in a new segment, or
> 3) something else unrelated.
>
> Indexers write index data (including new deletions data referencing documents
> in old segments) to temp files in a new segment, which are then consolidated
> into a single per-segment "compound file" named "cf.dat". When a reader
> opens, it mmaps cf.dat for each segment in the snapshot. Once the reader
> successfully opens all the files it needs, it never goes looking for new
> files.
>
> It's hard to imagine a mechanism that would either cause an existing "cf.dat"
> file to be modified, or persuade a reader to go look at a new "cf.dat"
> file. So unless my reasoning is wrong, the cause is #3 -- something else
> unrelated. I really have no idea what that could be, though since you've
> previously asked some questions about Coro/AnyEvent and other concurrency
> stuff the most likely prospect would seem to be something unique to your
> setup.
>
> The next step is probably to take the behavior you've been able to reproduce
> and isolate it in a test case that others can run and analyze.
>
> Marvin Humphrey
>
Re: [lucy-user] Strange results when documents get deleted while iterating
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Nov 19, 2015 at 4:39 AM, Gerald Richter - ECOS Technology
<Ge...@ecos.de> wrote:
> Hi,
>
> It's a local IndexSearcher.
>
> I have done a lot of tests and it's really happening.
>
> Let me give you a few more details; maybe this helps:
>
> - I call a function that creates a new IndexSearcher and calls $hits = $searcher->hits.
> - I iterate over the first few entries and return the entries and the $hits object.
> - The documents that were found are deleted from a database, which in turn deletes the documents from the Lucy index.
> - Now I iterate over the next few entries, delete them, and so on.
>
> I have made a small test where per iteration only two entries are fetched. The result looks like this:
>
> id => "8b8bce64e69b52ed244671009c11ee0e",
> id => "8b8bce64e69b52ed244671009c4857e7",
> id => "4a3dcd6c2e9e3074d2d52b8e72584b68",
> id => "8b8bce64e69b52ed244671009c730dc9",
> id => "4a3dcd6c2e9e3074d2d52b8e72584d19",
> id => "8b8bce64e69b52ed244671009c7e3974",
> id => "4a3dcd6c2e9e3074d2d52b8e72585475",
> id => "8b8bce64e69b52ed244671009c7e4788",
> id => "4a3dcd6c2e9e3074d2d52b8e72585dc2",
> id => "8b8bce64e69b52ed244671009c7e2fa6",
>
> id is some value I store in the document. The result should only contain ids starting with 8.
>
> So you see the first two are correct; after deletion of these two (always in a different process), the next time the first one I get is wrong and the second one is correct...
>
> If I do not delete anything I only get the right entries (just one line commented out; the rest is still the same).
>
> Any clue?
When documents in an old segment are marked as deleted, that information is
written to a bitmap deletions file which is written to a new segment. Old
readers are not supposed to know about new segments. So for something to go
wrong, either 1) information in an old segment would have to be corrupted, 2)
a reader would have to somehow find out about information in a new segment, or
3) something else unrelated.
Indexers write index data (including new deletions data referencing documents
in old segments) to temp files in a new segment, which are then consolidated
into a single per-segment "compound file" named "cf.dat". When a reader
opens, it mmaps cf.dat for each segment in the snapshot. Once the reader
successfully opens all the files it needs, it never goes looking for new
files.
It's hard to imagine a mechanism that would either cause an existing "cf.dat"
file to be modified, or persuade a reader to go look at a new "cf.dat"
file. So unless my reasoning is wrong, the cause is #3 -- something else
unrelated. I really have no idea what that could be, though since you've
previously asked some questions about Coro/AnyEvent and other concurrency
stuff the most likely prospect would seem to be something unique to your
setup.
The next step is probably to take the behavior you've been able to reproduce
and isolate it in a test case that others can run and analyze.
Marvin Humphrey
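The snapshot behavior Marvin describes can be sketched as follows (a minimal sketch, not from the original thread; the index path, query, and field name are hypothetical). A searcher opened before a deletion commits keeps serving its original snapshot, because it mmaps the cf.dat of each segment listed in its snapshot and never goes looking for new files:

```perl
use strict;
use warnings;
use Lucy::Search::IndexSearcher;
use Lucy::Index::Indexer;

my $path = '/path/to/index';    # hypothetical index location

# Open a searcher; it mmaps the cf.dat of every segment in the
# current snapshot and will not see later commits.
my $searcher = Lucy::Search::IndexSearcher->new(index => $path);
my $before   = $searcher->hits(query => 'foo')->total_hits;

# Delete matching documents in a separate Indexer session (this could
# just as well happen in another process).
my $indexer = Lucy::Index::Indexer->new(index => $path);
$indexer->delete_by_term(field => 'id', term => 'some_id');
$indexer->commit;

# The old searcher still reports the pre-deletion counts; only a
# freshly opened IndexSearcher sees the new snapshot.
my $after = $searcher->hits(query => 'foo')->total_hits;
# Expected: $before == $after for the already-open searcher.
```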
AW: [lucy-user] Strange results when documents get deleted while iterating
Posted by Gerald Richter - ECOS Technology <Ge...@ecos.de>.
Hi,
It's a local IndexSearcher.
I have done a lot of tests and it's really happening.
Let me give you a few more details; maybe this helps:
- I call a function that creates a new IndexSearcher and calls $hits = $searcher->hits.
- I iterate over the first few entries and return the entries and the $hits object.
- The documents that were found are deleted from a database, which in turn deletes the documents from the Lucy index.
- Now I iterate over the next few entries, delete them, and so on.
I have made a small test where per iteration only two entries are fetched. The result looks like this:
id => "8b8bce64e69b52ed244671009c11ee0e",
id => "8b8bce64e69b52ed244671009c4857e7",
id => "4a3dcd6c2e9e3074d2d52b8e72584b68",
id => "8b8bce64e69b52ed244671009c730dc9",
id => "4a3dcd6c2e9e3074d2d52b8e72584d19",
id => "8b8bce64e69b52ed244671009c7e3974",
id => "4a3dcd6c2e9e3074d2d52b8e72585475",
id => "8b8bce64e69b52ed244671009c7e4788",
id => "4a3dcd6c2e9e3074d2d52b8e72585dc2",
id => "8b8bce64e69b52ed244671009c7e2fa6",
id is some value I store in the document. The result should only contain ids starting with 8.
So you see the first two are correct; after deletion of these two (always in a different process), the next time the first one I get is wrong and the second one is correct...
If I do not delete anything I only get the right entries (just one line commented out; the rest is still the same).
Any clue?
Thanks & Regards
Gerald
-----Original Message-----
From: Marvin Humphrey [mailto:marvin@rectangular.com]
Sent: Thursday, 19 November 2015 12:19
To: user@lucy.apache.org
Subject: Re: [lucy-user] Strange results when documents get deleted while iterating
On Wed, Nov 18, 2015 at 10:22 PM, Gerald Richter <ri...@ecos.de> wrote:
> I have a simple query that consists of a TermQuery and a RangeQuery, and I
> am iterating over it like this:
>
> while ($cnt-- >= 0 && ($hit = $hits->next)) {
>     $data = $hit->get_fields();
>     ...
> }
>
> While this loop runs, documents are deleted from the index by another
> process. Without this other process everything is fine. When this deletion
> is happening, it seems that half of the documents returned by
> $hits->next are wrong, which means I get a totally different document that
> should not be part of the result set.
>
> I thought that a searcher operates on a snapshot, so changes that happen at
> the same time do not influence the query. Is this wrong? If yes, how could
> I make sure my result set is not corrupted?
What kind of a Searcher is this? If it's an IndexSearcher operating
on a local index, I don't see how it could happen. But if it's a
ClusterSearcher, then it would be possible if the remotes are being
refreshed.
Marvin Humphrey
Re: [lucy-user] Strange results when documents get deleted while iterating
Posted by Gerald Richter <ri...@ecos.de>.
Hi,
It's a local IndexSearcher.
I have done a lot of tests and it's really happening.
Let me give you a few more details; maybe this helps:
- I call a function that creates a new IndexSearcher and calls $hits = $searcher->hits.
- I iterate over the first few entries and return the entries and the $hits object.
- The documents that were found are deleted from a database, which in turn deletes the documents from the Lucy index.
- Now I iterate over the next few entries, delete them, and so on.
I have made a small test where per iteration only two entries are fetched. The result looks like this:
id => "8b8bce64e69b52ed244671009c11ee0e",
id => "8b8bce64e69b52ed244671009c4857e7",
id => "4a3dcd6c2e9e3074d2d52b8e72584b68",
id => "8b8bce64e69b52ed244671009c730dc9",
id => "4a3dcd6c2e9e3074d2d52b8e72584d19",
id => "8b8bce64e69b52ed244671009c7e3974",
id => "4a3dcd6c2e9e3074d2d52b8e72585475",
id => "8b8bce64e69b52ed244671009c7e4788",
id => "4a3dcd6c2e9e3074d2d52b8e72585dc2",
id => "8b8bce64e69b52ed244671009c7e2fa6",
id is some value I store in the document. The result should only contain
ids starting with 8.
So you see the first two are correct; after deletion of these two (always
in a different process), the next time the first one I get is wrong and
the second one is correct...
If I do not delete anything I only get the right entries (just one line
commented out; the rest is still the same).
Any clue?
Thanks & Regards
Gerald
On 19.11.2015 at 12:18, Marvin Humphrey wrote:
> On Wed, Nov 18, 2015 at 10:22 PM, Gerald Richter <ri...@ecos.de> wrote:
>
>> I have a simple query that consists of a TermQuery and a RangeQuery, and I
>> am iterating over it like this:
>>
>> while ($cnt-- >= 0 && ($hit = $hits->next)) {
>>     $data = $hit->get_fields();
>>     ...
>> }
>>
>> While this loop runs, documents are deleted from the index by another
>> process. Without this other process everything is fine. When this deletion
>> is happening, it seems that half of the documents returned by
>> $hits->next are wrong, which means I get a totally different document that
>> should not be part of the result set.
>>
>> I thought that a searcher operates on a snapshot, so changes that happen at
>> the same time do not influence the query. Is this wrong? If yes, how could
>> I make sure my result set is not corrupted?
> What kind of a Searcher is this? If it's an IndexSearcher operating
> on a local index, I don't see how it could happen. But if it's a
> ClusterSearcher, then it would be possible if the remotes are being
> refreshed.
>
> Marvin Humphrey
>
Re: [lucy-user] Strange results when documents get deleted while iterating
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 18, 2015 at 10:22 PM, Gerald Richter <ri...@ecos.de> wrote:
> I have a simple query that consists of a TermQuery and a RangeQuery, and I
> am iterating over it like this:
>
> while ($cnt-- >= 0 && ($hit = $hits->next)) {
>     $data = $hit->get_fields();
>     ...
> }
>
> While this loop runs, documents are deleted from the index by another
> process. Without this other process everything is fine. When this deletion
> is happening, it seems that half of the documents returned by
> $hits->next are wrong, which means I get a totally different document that
> should not be part of the result set.
>
> I thought that a searcher operates on a snapshot, so changes that happen at
> the same time do not influence the query. Is this wrong? If yes, how could
> I make sure my result set is not corrupted?
What kind of a Searcher is this? If it's an IndexSearcher operating
on a local index, I don't see how it could happen. But if it's a
ClusterSearcher, then it would be possible if the remotes are being
refreshed.
Marvin Humphrey
[lucy-user] Strange results when documents get deleted while iterating
Posted by Gerald Richter <ri...@ecos.de>.
Hi,
I have a simple query that consists of a TermQuery and a RangeQuery, and I
am iterating over it like this:
while ($cnt-- >= 0 && ($hit = $hits->next)) {
    $data = $hit->get_fields();
    ...
}
While this loop runs, documents are deleted from the index by another
process. Without this other process everything is fine. When this
deletion is happening, it seems that half of the documents returned by
$hits->next are wrong, which means I get a totally different document
that should not be part of the result set.
I thought that a searcher operates on a snapshot, so changes that happen
at the same time do not influence the query. Is this wrong? If yes, how
could I make sure my result set is not corrupted?
Thanks & Regards
Gerald
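For context, a query of the shape described above (a TermQuery combined with a RangeQuery) could be built like this. This is a sketch only; the index path, field names, and terms are hypothetical, since the original post does not name them:

```perl
use strict;
use warnings;
use Lucy::Search::IndexSearcher;
use Lucy::Search::TermQuery;
use Lucy::Search::RangeQuery;
use Lucy::Search::ANDQuery;

# Hypothetical field names and values.
my $searcher   = Lucy::Search::IndexSearcher->new(index => '/path/to/index');
my $term_query = Lucy::Search::TermQuery->new(
    field => 'category',
    term  => 'foo',
);
my $range_query = Lucy::Search::RangeQuery->new(
    field         => 'date',
    lower_term    => '2015-01-01',
    upper_term    => '2015-12-31',
    include_lower => 1,
    include_upper => 1,
);
my $query = Lucy::Search::ANDQuery->new(
    children => [ $term_query, $range_query ],
);

# Iterate over the hits, as in the loop from the post above.
my $hits = $searcher->hits(query => $query);
my $cnt  = 10;
while ($cnt-- >= 0 && (my $hit = $hits->next)) {
    my $data = $hit->get_fields();
    # ... process $data ...
}
```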
Re: [lucy-user] how to get distinct values of a field
Posted by Gerald Richter <ri...@ecos.de>.
That works great!
Thanks
Gerald
On 02.11.2015 at 14:14, Nick Wellnhofer wrote:
> my $index = Lucy::Index::IndexReader->open(index => $path_to_index);
> my $lex_reader = $index->obtain('Lucy::Index::LexiconReader');
> my $lexicon = $lex_reader->lexicon(field => $field_name);
> my @terms;
>
> while ($lexicon->next) {
>     push(@terms, $lexicon->get_term);
> }
>
> Depending on the size of your index and the number of segments, it
> might be more efficient to merge the terms from multiple segments
> manually:
>
> my $index = Lucy::Index::IndexReader->open(index => $path_to_index);
> my $seg_readers = $index->seg_readers;
> my %term_hash;
>
> for my $seg_reader (@$seg_readers) {
>     my $lex_reader = $seg_reader->obtain('Lucy::Index::LexiconReader');
>     my $lexicon = $lex_reader->lexicon(field => $field_name);
>
>     while ($lexicon->next) {
>         my $term = $lexicon->get_term;
>         $term_hash{$term} = undef;
>     }
> }
>
> my @terms = keys(%term_hash);
Re: [lucy-user] how to get distinct values of a field
Posted by Nick Wellnhofer <we...@aevum.de>.
On 02/11/2015 08:54, Gerald Richter wrote:
> I'd like to get all distinct values from a field, something which in SQL
> would look like this:
>
> select distinct fieldname from table
>
> where fieldname is a StringType.
>
> Is this possible with Lucy?
The easiest way (using a PolyLexiconReader under the hood):
my $index = Lucy::Index::IndexReader->open(index => $path_to_index);
my $lex_reader = $index->obtain('Lucy::Index::LexiconReader');
my $lexicon = $lex_reader->lexicon(field => $field_name);
my @terms;

while ($lexicon->next) {
    push(@terms, $lexicon->get_term);
}

Depending on the size of your index and the number of segments, it might be
more efficient to merge the terms from multiple segments manually:

my $index = Lucy::Index::IndexReader->open(index => $path_to_index);
my $seg_readers = $index->seg_readers;
my %term_hash;

for my $seg_reader (@$seg_readers) {
    my $lex_reader = $seg_reader->obtain('Lucy::Index::LexiconReader');
    my $lexicon = $lex_reader->lexicon(field => $field_name);

    while ($lexicon->next) {
        my $term = $lexicon->get_term;
        $term_hash{$term} = undef;
    }
}

my @terms = keys(%term_hash);
Note that these examples also work with full text fields.
Nick
[lucy-user] how to get distinct values of a field
Posted by Gerald Richter <ri...@ecos.de>.
Hi,
I'd like to get all distinct values from a field, something which in SQL
would look like this:
select distinct fieldname from table
where fieldname is a StringType.
Is this possible with Lucy?
Thanks & Regards
Gerald
Re: [lucy-user] Lucy and Coro/AnyEvent
Posted by Nick Wellnhofer <we...@aevum.de>.
On 20/10/2015 09:43, Gerald Richter wrote:
> Is the searcher thread safe?
In a strict sense, no. But the code is reentrant, so it's possible to use
multiple searchers from separate threads as long as each searcher is only used
by a single thread.
A user-supplied locking mechanism should work, too. But the Perl bindings use
the CLONE_SKIP facility, so Lucy objects can't be shared across Perl threads
anyway.
> Is there any documentation about the C interface of Lucy?
The C interface will be fully documented in the upcoming 0.5 release. You can
find a preview here:
https://rawgit.com/nwellnhof/lucy/generated_docs/c/autogen/share/doc/clownfish/lucy.html
Here is some sample code:
https://github.com/apache/lucy/tree/master/c/sample
Nick
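Nick's one-searcher-per-thread advice might look like this in practice (a sketch with a hypothetical index path and query). Each thread opens its own searcher, which also sidesteps the CLONE_SKIP restriction on sharing Lucy objects across Perl threads:

```perl
use strict;
use warnings;
use threads;
use Lucy::Search::IndexSearcher;

my $index_path = '/path/to/index';    # hypothetical

# Each thread opens its own searcher; no Lucy object ever crosses a
# thread boundary, so the reentrant code paths are safe to use.
my @workers = map {
    threads->create(sub {
        my $searcher = Lucy::Search::IndexSearcher->new(index => $index_path);
        my $hits     = $searcher->hits(query => 'foo');
        return $hits->total_hits;
    });
} 1 .. 4;

# Collect one hit count per worker thread.
my @counts = map { $_->join } @workers;
```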
Re: [lucy-user] Lucy and Coro/AnyEvent
Posted by Gerald Richter <ri...@ecos.de>.
Hi Marvin,
thanks for your feedback.
Using threads, like IO::AIO/AnyEvent::AIO does, would be my preferred way.
Is the searcher thread safe?
Is there any documentation about the C interface of Lucy?
Thanks & Regards
Gerald
On 19.10.2015 at 21:54, Marvin Humphrey wrote:
> On Sat, Oct 17, 2015 at 11:11 AM, Gerald Richter <ri...@ecos.de> wrote:
>> Hi,
>>
>> as far as I can see, all calls to Lucy are synchronous.
>>
>> Is there a way to use it together with AnyEvent and/or Coro without
>> blocking the whole system for the duration of the Lucy calls?
> Hi Gerald,
>
> The only way I think it could work would be to launch a concurrent
> independent process/thread on which Lucy does work. A call to interact
> with the Lucy thread would then fire off work to be done on the
> separate thread and register a callback signaling the main thread when
> the work is done. That's effectively what we do in
> LucyX::Remote::ClusterSearcher, though that's using a select loop.
>
> Marvin Humphrey
>
Re: [lucy-user] Lucy and Coro/AnyEvent
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sat, Oct 17, 2015 at 11:11 AM, Gerald Richter <ri...@ecos.de> wrote:
> Hi,
>
> as far as I can see, all calls to Lucy are synchronous.
>
> Is there a way to use it together with AnyEvent and/or Coro without
> blocking the whole system for the duration of the Lucy calls?
Hi Gerald,
The only way I think it could work would be to launch a concurrent
independent process/thread on which Lucy does work. A call to interact
with the Lucy thread would then fire off work to be done on the
separate thread and register a callback signaling the main thread when
the work is done. That's effectively what we do in
LucyX::Remote::ClusterSearcher, though that's using a select loop.
Marvin Humphrey
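One way to sketch the pattern Marvin describes on top of AnyEvent is AnyEvent::Util::fork_call, which runs a blocking job in a forked child and fires a callback on the main event loop when it completes. This is a sketch only; the index path, query, and result handling are hypothetical, and fork_call serializes the return values back to the parent:

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::Util qw(fork_call);
use Lucy::Search::IndexSearcher;

my $done = AE::cv;

# The blocking Lucy search runs in a forked child, so the main event
# loop stays responsive while the search is in flight.
fork_call {
    my ($path, $query) = @_;
    my $searcher = Lucy::Search::IndexSearcher->new(index => $path);
    my $hits     = $searcher->hits(query => $query, num_wanted => 10);
    my @docs;
    while (my $hit = $hits->next) {
        push @docs, $hit->get_fields;
    }
    return @docs;             # serialized back to the parent process
} '/path/to/index', 'foo', sub {
    my @docs = @_;            # callback runs on the main event loop
    $done->send(\@docs);
};

my $docs = $done->recv;
```

Coro coroutines cooperate with AnyEvent condvars, so the same approach should keep other Coro threads runnable while the child works.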