Posted to user@lucy.apache.org by Kieron Taylor <kt...@ebi.ac.uk> on 2013/09/30 15:15:03 UTC
[lucy-user] Documents gone AWOL
I'm experimenting with Lucy and a bunch of bioinformatics data.
Generally the results are pleasing, but my last effort has failed with
certain queries returning no results. I hope someone can help me figure
out what I'm doing wrong.
I am querying for identifiers that I know are in the source data, and I
have every reason to believe they would have been added to the index,
but somehow I cannot find them, even though other very similar queries
succeed.
Key details:
I'm using Search::Query::Dialect::Lucy to generate queries
Index consists of 20 segments, totalling 19 GB.
Queries are simplistic ID lookups
Queries look like:
accessions:UPI000224F266
accessions:UPI000006DAC9 <-- this one returns a result
So my questions are: do I need to do anything special to query multiple
segments? Is Search::Query getting in the way?
Many thanks,
Kieron
--
Kieron Taylor PhD.
Ensembl project
EMBL-EBI
Re: [lucy-user] Documents gone AWOL
Posted by Kieron Taylor <kt...@ebi.ac.uk>.
On 30/09/2013 17:41, Peter Karman wrote:
> On 9/30/13 11:22 AM, Kieron Taylor wrote:
>
>> %%% Indexing %%%
>>
>> $lucy_indexer = Lucy::Index::Indexer->new(
>>     schema => $schema,
>>     index  => $path,
>>     create => 1,
>> );
>>
>> while ($record = shift) {
>>     %flattened_record = %{$record};
>>     $flattened_record{accessions} = join ' ', @accessions;
>>     # Array of values turned into a whitespace-separated list.
>>     $lucy_indexer->add_doc( \%flattened_record );
>> }
>>
>> # Commit is called every ~100k records, before spinning up another indexer
>> $lucy_indexer->commit;
>
>
> I assume you are not passing the 'create => 1' param for each
> $lucy_indexer.
Your comment suggests this would be terminal. I'll make certain I've not
made a blunder.
>>
>> %%% Querying %%%
>>
>> $query = 'accessions:UPI01';
>>
>> $searcher = Lucy::Search::IndexSearcher->new( index => $path );
>>
>> $parser = Search::Query->parser(
>>     dialect => 'Lucy',
>>     fields  => $lucy_indexer->get_schema()->all_fields,
>> );
>>
>> $search = $parser->parse($query)->as_lucy_query;
>
>
> I would probably insert a debugging statement here to verify that the
> parser is doing what you think it is:
>
> $parsed_query = $parser->parse($query);
> printf("parsed_query:%s\n", $parsed_query);
> $lucy_query = $parsed_query->as_lucy_query;
> printf("lucy_query:%s\n", $lucy_query->dump);
Suggestion welcomed and assimilated.
> Instead of grep'ing the segment files, you might try seeing what Lucy
> reports via the API:
>
> https://metacpan.org/source/KARMAN/SWISH-Prog-Lucy-0.17/bin/lucyx-dump-terms
Ok. To be honest, I was only grasping at straws with grep, so I'm glad
there's a more appropriate alternative.
Thanks very much,
Kieron
Re: [lucy-user] Documents gone AWOL
Posted by Peter Karman <pe...@peknet.com>.
On 9/30/13 11:22 AM, Kieron Taylor wrote:
> %%% Indexing %%%
>
> $lucy_indexer = Lucy::Index::Indexer->new(
>     schema => $schema,
>     index  => $path,
>     create => 1,
> );
>
> while ($record = shift) {
>     %flattened_record = %{$record};
>     $flattened_record{accessions} = join ' ', @accessions;
>     # Array of values turned into a whitespace-separated list.
>     $lucy_indexer->add_doc( \%flattened_record );
> }
>
> # Commit is called every ~100k records, before spinning up another indexer
> $lucy_indexer->commit;
I assume you are not passing the 'create => 1' param for each $lucy_indexer.
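One way to act on that suggestion is to let only the very first batch
create the index, so that subsequent Indexer instances simply open the
existing index and append. A minimal sketch (assuming Lucy is installed;
the batching helper and variable names are illustrative, not from the
thread):

```perl
use strict;
use warnings;
use Lucy;

# Sketch: a fresh Indexer per batch against the same index path.
# Only the first batch passes create => 1; later batches open the
# index that already exists and append to it.
sub index_batch {
    my ($schema, $path, $records, $is_first_batch) = @_;
    my $indexer = Lucy::Index::Indexer->new(
        schema => $schema,
        index  => $path,
        ($is_first_batch ? (create => 1) : ()),
    );
    $indexer->add_doc($_) for @$records;
    $indexer->commit;    # one commit per batch of ~100k records
}
```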
>
> %%% Querying %%%
>
> $query = 'accessions:UPI01';
>
> $searcher = Lucy::Search::IndexSearcher->new( index => $path );
>
> $parser = Search::Query->parser(
>     dialect => 'Lucy',
>     fields  => $lucy_indexer->get_schema()->all_fields,
> );
>
> $search = $parser->parse($query)->as_lucy_query;
I would probably insert a debugging statement here to verify that the
parser is doing what you think it is:
$parsed_query = $parser->parse($query);
printf("parsed_query:%s\n", $parsed_query);
$lucy_query = $parsed_query->as_lucy_query;
printf("lucy_query:%s\n", $lucy_query->dump);
>
> $result = $searcher->hits(
>     query      => $search,
>     num_wanted => 1000,
> );
>
> while (my $hit = $result->next) {
>     say $hit->{accessions};
> }
>
>
> I've omitted the result-paging code and some blob-storage handling that
> doesn't affect this issue: the blobs are for later, and the problem is
> that I get no hits for strings I can nevertheless grep out of the .dat
> file in a segment.
>
Instead of grep'ing the segment files, you might try seeing what Lucy
reports via the API:
https://metacpan.org/source/KARMAN/SWISH-Prog-Lucy-0.17/bin/lucyx-dump-terms
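The same kind of term dump can be sketched directly against Lucy's reader
API, modelled on that script (assuming Lucy is installed; the index path
and the 'accessions' field name follow the thread):

```perl
use strict;
use warnings;
use Lucy;

# Sketch: enumerate the indexed terms for a field via Lucy's API
# instead of grep'ing raw segment files. Each segment's lexicon is
# walked in turn, so terms from all 20 segments are reported.
my $reader = Lucy::Index::IndexReader->open( index => '/path/to/index' );
for my $seg_reader ( @{ $reader->seg_readers } ) {
    my $lex_reader = $seg_reader->obtain('Lucy::Index::LexiconReader');
    my $lexicon    = $lex_reader->lexicon( field => 'accessions' )
        or next;    # segment has no terms for this field
    while ( $lexicon->next ) {
        print $lexicon->get_term, "\n";
    }
}
```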
--
Peter Karman . http://peknet.com/ . peter@peknet.com
Re: [lucy-user] Documents gone AWOL
Posted by Kieron Taylor <kt...@ebi.ac.uk>.
On 30/09/2013 17:42, Nick Wellnhofer wrote:
>
> If the 'accessions' field contains multiple terms, you should use a
> FullTextType with an appropriate Analyzer, not a StringType.
>
> Nick
Good to know. I suppose I was being optimistic there, after finding that
StringType didn't support multiple values. Time to implement some
changes.
Thanks,
Kieron
Re: [lucy-user] Documents gone AWOL
Posted by Nick Wellnhofer <we...@aevum.de>.
On 30/09/2013 18:22, Kieron Taylor wrote:
> On 30/09/2013 16:45, Peter Karman wrote:
>> show your code, please.
>
> Doing my best to extract something close enough to the original code.
> Lots of Moose and fluff culled.
>
> %%% Schema generation %%%
>
> $schema   = Lucy::Plan::Schema->new;
> $lucy_str = Lucy::Plan::StringType->new;
>
> # schema generated from Moose MOP
> for my $attribute ($record_meta->get_all_attributes) {
>     $schema->spec_field( name => $attribute->name, type => $lucy_str );
> }
>
> %%% Indexing %%%
>
> $lucy_indexer = Lucy::Index::Indexer->new(
>     schema => $schema,
>     index  => $path,
>     create => 1,
> );
>
> while ($record = shift) {
>     %flattened_record = %{$record};
>     $flattened_record{accessions} = join ' ', @accessions;
>     # Array of values turned into a whitespace-separated list.
>     $lucy_indexer->add_doc( \%flattened_record );
> }
If the 'accessions' field contains multiple terms, you should use a
FullTextType with an appropriate Analyzer, not a StringType.
Nick
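Nick's suggestion might look like the following sketch (assuming Lucy is
installed; the whitespace tokenizer pattern and the single-value 'id'
field are assumptions for illustration, not from the thread):

```perl
use strict;
use warnings;
use Lucy;

# Sketch: declare the multi-term 'accessions' field as FullTextType so
# each accession becomes its own searchable token. A RegexTokenizer
# splitting on non-whitespace keeps identifiers like 'UPI000224F266'
# intact as single terms.
my $schema    = Lucy::Plan::Schema->new;
my $tokenizer = Lucy::Analysis::RegexTokenizer->new( pattern => '\S+' );
my $full_text = Lucy::Plan::FullTextType->new( analyzer => $tokenizer );

$schema->spec_field( name => 'accessions', type => $full_text );

# Genuinely single-valued identifier fields can stay as StringType:
$schema->spec_field( name => 'id', type => Lucy::Plan::StringType->new );
```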
Re: [lucy-user] Documents gone AWOL
Posted by Kieron Taylor <kt...@ebi.ac.uk>.
Thanks for responding, Peter.
On 30/09/2013 16:45, Peter Karman wrote:
> fwiw, segments are an implementation detail that most code shouldn't
> need to know anything about.
This is good to know.
> show your code, please.
Doing my best to extract something close enough to the original code.
Lots of Moose and fluff culled.
%%% Schema generation %%%
$schema   = Lucy::Plan::Schema->new;
$lucy_str = Lucy::Plan::StringType->new;

# schema generated from Moose MOP
for my $attribute ($record_meta->get_all_attributes) {
    $schema->spec_field( name => $attribute->name, type => $lucy_str );
}
%%% Indexing %%%
$lucy_indexer = Lucy::Index::Indexer->new(
    schema => $schema,
    index  => $path,
    create => 1,
);

while ($record = shift) {
    %flattened_record = %{$record};
    $flattened_record{accessions} = join ' ', @accessions;
    # Array of values turned into a whitespace-separated list.
    $lucy_indexer->add_doc( \%flattened_record );
}

# Commit is called every ~100k records, before spinning up another indexer
$lucy_indexer->commit;
%%% Querying %%%
$query = 'accessions:UPI01';

$searcher = Lucy::Search::IndexSearcher->new( index => $path );

$parser = Search::Query->parser(
    dialect => 'Lucy',
    fields  => $lucy_indexer->get_schema()->all_fields,
);

$search = $parser->parse($query)->as_lucy_query;

$result = $searcher->hits(
    query      => $search,
    num_wanted => 1000,
);

while (my $hit = $result->next) {
    say $hit->{accessions};
}
I've omitted the result-paging code and some blob-storage handling that
doesn't affect this issue: the blobs are for later, and the problem is
that I get no hits for strings I can nevertheless grep out of the .dat
file in a segment.
If needs be, I can tar the lot up onto Google Drive.
Cheers,
Kieron
Re: [lucy-user] Documents gone AWOL
Posted by Peter Karman <pe...@peknet.com>.
On 9/30/13 8:15 AM, Kieron Taylor wrote:
> I am querying for identifiers that I know are in the source data, and I
> have every reason to believe they would have been added to the index,
> but somehow I cannot find them, even though other very similar queries
> succeed.
>
> Key details:
>
> I'm using Search::Query::Dialect::Lucy to generate queries
> Index consists of 20 segments, totalling 19 GB.
fwiw, segments are an implementation detail that most code shouldn't
need to know anything about.
> Queries are simplistic ID lookups
>
>
> Queries look like:
> accessions:UPI000224F266
> accessions:UPI000006DAC9 <-- this one returns a result
>
> So my questions are, do I need to do anything special to query multiple
> segments? Is Search::Query getting in the way?
>
show your code, please.
--
Peter Karman . http://peknet.com/ . peter@peknet.com