Posted to user@lucy.apache.org by Kieron Taylor <kt...@ebi.ac.uk> on 2013/09/30 15:15:03 UTC

[lucy-user] Documents gone AWOL

I'm experimenting with Lucy and a bunch of bioinformatics data. 
Generally the results are pleasing, but my last effort has failed with 
certain queries returning no results. I hope someone can help me figure 
out what I'm doing wrong.

I am querying for identifiers that I know are in the source data, and I 
have every reason to believe they would have been added to the index, 
but somehow I cannot find them, even though other very similar queries 
succeed.

Key details:

I'm using Search::Query::Dialect::Lucy to generate queries
Index consists of 20 segments, totalling 19 GB.
Queries are simplistic ID lookups


Queries look like:
accessions:UPI000224F266
accessions:UPI000006DAC9   <-- this one returns a result

So my questions are: do I need to do anything special to query multiple 
segments? Is Search::Query getting in the way?


Many thanks,

Kieron

-- 
Kieron Taylor PhD.
Ensembl project
EMBL-EBI

Re: [lucy-user] Documents gone AWOL

Posted by Kieron Taylor <kt...@ebi.ac.uk>.
On 30/09/2013 17:41, Peter Karman wrote:
> On 9/30/13 11:22 AM, Kieron Taylor wrote:
>
>> %%% Indexing %%%
>>
>> $lucy_indexer = Lucy::Index::Indexer->new(
>>              schema => $schema,
>>              index => $path,
>>              create => 1,
>> );
>>
>> #
>> while ($record = shift) {
>>
>>    %flattened_record = %{$record};
>>    $flattened_record{accessions} = join ' ',@accessions;
>>    # Array of values turned into whitespaced list.
>>    $lucy_indexer->add_doc(
>>            \%flattened_record
>>    );
>>
>> }
>>
>> # Commit is called every ~100k records, before spinning up another indexer
>> $lucy_indexer->commit;
>
>
> I assume you are not passing the 'create => 1' param for each
> $lucy_indexer.

Your comment suggests this would be terminal. I'll make certain I've not 
made a blunder.
>>
>> %%% Querying %%%
>>
>> $query = 'accessions:UPI01';
>>
>> $searcher = Lucy::Search::IndexSearcher->new(
>>      index => $path,
>> );
>> $parser = Search::Query->parser(
>>    dialect => 'Lucy',
>>    fields  => $lucy_indexer->get_schema()->all_fields,
>> );
>>
>> $search = $parser->parse($query)->as_lucy_query;
>
>
> I would probably insert a debugging statement here to verify that the
> parser is doing what you think it is:
>
> $parsed_query = $parser->parse($query);
> printf("parsed_query:%s\n", $parsed_query);
> $lucy_query = $parsed_query->as_lucy_query;
> printf("lucy_query:%s\n", $lucy_query->dump);

Suggestion welcomed and assimilated.

> Instead of grep'ing the segment files, you might try seeing what Lucy
> reports via the API:
>
> https://metacpan.org/source/KARMAN/SWISH-Prog-Lucy-0.17/bin/lucyx-dump-terms

Ok. To be honest, I was only grasping at straws with grep, so I'm glad 
there's a more appropriate alternative.

Thanks very much,

Kieron

Re: [lucy-user] Documents gone AWOL

Posted by Peter Karman <pe...@peknet.com>.
On 9/30/13 11:22 AM, Kieron Taylor wrote:

> %%% Indexing %%%
>
> $lucy_indexer = Lucy::Index::Indexer->new(
>              schema => $schema,
>              index => $path,
>              create => 1,
> );
>
> #
> while ($record = shift) {
>
>    %flattened_record = %{$record};
>    $flattened_record{accessions} = join ' ',@accessions;
>    # Array of values turned into whitespaced list.
>    $lucy_indexer->add_doc(
>            \%flattened_record
>    );
>
> }
>
> # Commit is called every ~100k records, before spinning up another indexer
> $lucy_indexer->commit;


I assume you are not passing the 'create => 1' param for each $lucy_indexer.
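
If that is the problem, one sketch of the batch loop is to pass 'create' 
only for the very first indexer and open the existing index thereafter 
(assuming the $schema and $path variables from the quoted code):

    # First batch only: create the index.
    my $indexer = Lucy::Index::Indexer->new(
        schema => $schema,
        index  => $path,
        create => 1,
    );
    # ... add_doc() calls for the first ~100k records ...
    $indexer->commit;

    # Subsequent batches: open the existing index, no 'create' flag.
    $indexer = Lucy::Index::Indexer->new(
        schema => $schema,
        index  => $path,
    );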


>
> %%% Querying %%%
>
> $query = 'accessions:UPI01';
>
> $searcher = Lucy::Search::IndexSearcher->new(
>      index => $path,
> );
> $parser = Search::Query->parser(
>    dialect => 'Lucy',
>    fields  => $lucy_indexer->get_schema()->all_fields,
> );
>
> $search = $parser->parse($query)->as_lucy_query;


I would probably insert a debugging statement here to verify that the 
parser is doing what you think it is:

$parsed_query = $parser->parse($query);
printf("parsed_query:%s\n", $parsed_query);
$lucy_query = $parsed_query->as_lucy_query;
printf("lucy_query:%s\n", $lucy_query->dump);


>
> $result = $searcher->hits(
>    query => $search,
>    num_wanted => 1000,
> );
>
> while (my $hit = $result->next) {
>    say $hit->{accessions};
>
> }
>
>
> I've not shown the result-paging code, or some blob-data handling that
> doesn't affect this issue; blobs are for later. The point is that I get
> no hits for some strings, even though I can grep them from the .dat file
> in a segment.
>

Instead of grep'ing the segment files, you might try seeing what Lucy 
reports via the API:

https://metacpan.org/source/KARMAN/SWISH-Prog-Lucy-0.17/bin/lucyx-dump-terms
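
In the same spirit, a rough sketch of walking the indexed terms directly 
through Lucy's reader classes (the field name is taken from the code 
quoted above; check the return values, since lexicon() returns undef for 
an unknown field):

    use Lucy::Index::IndexReader;

    my $reader     = Lucy::Index::IndexReader->open( index => $path );
    my $lex_reader = $reader->obtain('Lucy::Index::LexiconReader');
    my $lexicon    = $lex_reader->lexicon( field => 'accessions' );
    while ( $lexicon->next ) {
        print $lexicon->get_term, "\n";
    }

If the missing accessions never show up in that term dump, the problem is 
on the indexing side rather than in query parsing.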


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-user] Documents gone AWOL

Posted by Kieron Taylor <kt...@ebi.ac.uk>.
On 30/09/2013 17:42, Nick Wellnhofer wrote:
>
> If the 'accessions' field contains multiple terms, you should use a
> FullTextType with an appropriate Analyzer, not a StringType.
>
> Nick

Good to know. I suppose I was using optimistic thinking there, after I 
found that it didn't support multiple values. Time to implement some 
changes.

Thanks,

Kieron

Re: [lucy-user] Documents gone AWOL

Posted by Nick Wellnhofer <we...@aevum.de>.
On 30/09/2013 18:22, Kieron Taylor wrote:
> On 30/09/2013 16:45, Peter Karman wrote:
>> show your code, please.
>
> Doing my best to extract something close enough to the original code.
> Lots of Moose and fluff culled.
>
> %%% Schema generation %%%
>
> $schema = Lucy::Plan::Schema->new;
> $lucy_str = Lucy::Plan::StringType->new;
>
> # schema generated from Moose MOP
> for my $attribute ($record_meta->get_all_attributes) {
>      $schema->spec_field( name => $attribute->name, type => $lucy_str);
> }
>
> %%% Indexing %%%
>
> $lucy_indexer = Lucy::Index::Indexer->new(
>              schema => $schema,
>              index => $path,
>              create => 1,
> );
>
> #
> while ($record = shift) {
>
>    %flattened_record = %{$record};
>    $flattened_record{accessions} = join ' ',@accessions;
>    # Array of values turned into whitespaced list.
>    $lucy_indexer->add_doc(
>            \%flattened_record
>    );
>
> }

If the 'accessions' field contains multiple terms, you should use a 
FullTextType with an appropriate Analyzer, not a StringType.
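
For illustration, a minimal sketch of that change, assuming Lucy's stock 
RegexTokenizer and the field name from the code above (splitting the 
space-joined accession list back into individual searchable terms):

    use Lucy::Plan::FullTextType;
    use Lucy::Analysis::RegexTokenizer;

    # Tokenize on runs of non-whitespace, so each accession is one term.
    my $tokenizer = Lucy::Analysis::RegexTokenizer->new( pattern => '\S+' );
    my $accession_type = Lucy::Plan::FullTextType->new(
        analyzer => $tokenizer,
    );
    $schema->spec_field( name => 'accessions', type => $accession_type );

With a StringType, each document's whole joined string is indexed as a 
single term, so a query for one accession only matches documents where 
that accession happens to be the entire field value.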

Nick

Re: [lucy-user] Documents gone AWOL

Posted by Kieron Taylor <kt...@ebi.ac.uk>.
Thanks for responding, Peter.

On 30/09/2013 16:45, Peter Karman wrote:
> fwiw, segments are an implementation detail that most code shouldn't
> need to know anything about.

This is good to know.

> show your code, please.

Doing my best to extract something close enough to the original code. 
Lots of Moose and fluff culled.

%%% Schema generation %%%

$schema = Lucy::Plan::Schema->new;
$lucy_str = Lucy::Plan::StringType->new;

# schema generated from Moose MOP
for my $attribute ($record_meta->get_all_attributes) {
     $schema->spec_field( name => $attribute->name, type => $lucy_str);
}

%%% Indexing %%%

$lucy_indexer = Lucy::Index::Indexer->new(
             schema => $schema,
             index => $path,
             create => 1,
);

#
while ($record = shift) {

   %flattened_record = %{$record};
   $flattened_record{accessions} = join ' ',@accessions;
   # Array of values turned into whitespaced list.
   $lucy_indexer->add_doc(
           \%flattened_record
   );

}

# Commit is called every ~100k records, before spinning up another indexer
$lucy_indexer->commit;

%%% Querying %%%

$query = 'accessions:UPI01';

$searcher = Lucy::Search::IndexSearcher->new(
     index => $path,
);
$parser = Search::Query->parser(
   dialect => 'Lucy',
   fields  => $lucy_indexer->get_schema()->all_fields,
);

$search = $parser->parse($query)->as_lucy_query;

$result = $searcher->hits(
   query => $search,
   num_wanted => 1000,
);

while (my $hit = $result->next) {
   say $hit->{accessions};

}


I've not shown the result-paging code, or some blob-data handling that 
doesn't affect this issue; blobs are for later. The point is that I get 
no hits for some strings, even though I can grep them from the .dat file 
in a segment.

If need be, I can tar the lot up onto Google Drive.


Cheers,

Kieron

Re: [lucy-user] Documents gone AWOL

Posted by Peter Karman <pe...@peknet.com>.
On 9/30/13 8:15 AM, Kieron Taylor wrote:

> I am querying for identifiers that I know are in the source data, and I
> have every reason to believe they would have been added to the index,
> but somehow I cannot find them, even though other very similar queries
> succeed.
>
> Key details:
>
> I'm using Search::Query::Dialect::Lucy to generate queries
> Index consists of 20 segments, totalling 19 GB.

fwiw, segments are an implementation detail that most code shouldn't 
need to know anything about.

> Queries are simplistic ID lookups
>
>
> Queries look like:
> accessions:UPI000224F266
> accessions:UPI000006DAC9   <-- this one returns a result
>
> So my questions are, do I need to do anything special to query multiple
> segments? Is Search::Query getting in the way?
>

show your code, please.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com