Posted to user@lucy.apache.org by Saurabh Vasekar <sv...@listenlogic.com> on 2012/06/19 21:05:26 UTC

[lucy-user] Unable to retrieve records using Proximity query

Hello,

I am using LucyX::Search::ProximityQuery to search through the indexed
documents.

My code is below:

use strict;
use warnings;

my $path_to_index = '{path_to_index}'; # specified correctly

use List::Util qw ( max min );
use POSIX qw ( ceil );
use Encode qw ( decode );

use Lucy::Search::IndexSearcher;

use Lucy::Search::QueryParser;
use Lucy::Search::TermQuery;
use LucyX::Search::ProximityQuery;

binmode STDOUT, ":encoding(UTF-8)";

my $proximity_query = LucyX::Search::ProximityQuery->new(
field => 'content',
terms => [ qw ( jakarta apache ) ],
within => 4,
);

my $offset = "0";

my $page_size = 10000;

my $searcher = Lucy::Search::IndexSearcher->new(
index => $path_to_index,
);

my $qparser = Lucy::Search::QueryParser->new(
schema => $searcher->get_schema,

);

my $hits = $searcher->hits(
query => $proximity_query,
offset => $offset,
num_wanted => $page_size,
);


my $hit_count = $hits->total_hits;

print("Hit Count :$hit_count\n");

#End of code

Although my documents contain the text 'jakarta' and 'apache', I am not
getting any results. Interestingly, if I specify the following terms in my
proximity query, the search returns appropriate results.

my $proximity_query = LucyX::Search::ProximityQuery->new(
field => 'content',
terms => [ qw ( in the ) ],
within => 4,
);

Is my implementation correct?

Thank you.

Re: [lucy-user] Unable to retrieve records using Proximity query

Posted by Peter Karman <pe...@peknet.com>.
Saurabh Vasekar wrote on 6/21/12 3:01 PM:

> 
> For queries like "content:jakarta AND content:apache" or
> "+content:apache AND -content:retrieval",
> I compared the search results with other indexing libraries, viz. Ferret,
> Lucene, etc., and they gave the same results.
>
> But for the query "content:\"jakarta apache\"~4", the results shown by Lucene
> and Ferret are accurate, while I am not getting any records with Lucy.
> 

Thanks for the full code examples. They were helpful.

It took me a few hours of playing with it to figure out why it wasn't working as
you (and I) expected. Your indexing code is fine. The searching code assumes (as
did I at first) that the terms in a ProximityQuery object would be analyzed
(stemmed). They aren't. Only the QueryParser does the analyzing. When you
construct a Query object manually, you have to do the analysis yourself.
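
As a rough illustration of doing that analysis by hand, here is a minimal
sketch. It assumes the index lives in 'lucy_store', that the field is named
'content', and that the analyzer's split() method (which returns an array ref
of token texts) is available in your Lucy version. The idea is simply to push
each raw term through the same analyzer the field was indexed with before
handing the terms to ProximityQuery.

```perl
use strict;
use warnings;

use Lucy::Search::IndexSearcher;
use LucyX::Search::ProximityQuery;

# Assumption: the index path and field name match the rest of this thread.
my $path_to_index = 'lucy_store';
my $searcher = Lucy::Search::IndexSearcher->new( index => $path_to_index );

# Fetch the analyzer the schema associates with the field, so the query
# terms go through the same analysis chain as the indexed text.
my $analyzer = $searcher->get_schema->fetch_analyzer('content');

# split() returns an array ref of token texts. One input word may yield
# zero or more tokens, so flatten the results into a single list.
my @terms = map { @{ $analyzer->split($_) } } qw( jakarta apache );

my $proximity_query = LucyX::Search::ProximityQuery->new(
    field  => 'content',
    terms  => \@terms,
    within => 4,
);

my $hits = $searcher->hits( query => $proximity_query );
printf( "Hit Count: %d for query %s\n",
    $hits->total_hits, $proximity_query->to_string );
```

Printing @terms before building the query is a quick way to see whether the
analyzer changed anything. Terms that pass through analysis unchanged would
match either way, which may be why a query on 'in' and 'the' returned hits.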

Unfortunately, the core Lucy::Search::QueryParser class doesn't handle the
proximity syntax, since ProximityQuery is an extension to the core.

Fortunately, Search::Query::Parser handles more advanced query syntax than does
the core class. (This is no knock against the Lucy parser -- as Marvin and I
have discussed in the past, it is a thankless task to try and create a parser
that is all things to all people.)

I've included example searcher code below. It shows both approaches: using a
query parser, and constructing the query objects manually.

use strict;
use warnings;

my $path_to_index = 'lucy_store';

use Lucy::Search::QueryParser;
use Lucy::Search::IndexSearcher;
use LucyX::Search::ProximityQuery;
use Search::Query;

my $searcher = Lucy::Search::IndexSearcher->new( index => $path_to_index, );

TERM: {
    my $term_query = Lucy::Search::TermQuery->new(
        field => 'content',
        term  => 'apache',
    );
    my $hits = $searcher->hits( query => $term_query, );

    my $hit_count = $hits->total_hits;

    while ( my $hit = $hits->next ) {
        my $content = $hit->{content};

        print("Content : $content\n");

        print("\n");
    }

    printf( "TERM Hit Count :$hit_count for query %s\n",
        $term_query->to_string );

}

TERMPARSED: {
    my $qp = Lucy::Search::QueryParser->new(
        schema => $searcher->get_schema,
        fields => [qw( content )],
    );
    my $term_query = $qp->parse('apache');
    my $hits = $searcher->hits( query => $term_query, );

    my $hit_count = $hits->total_hits;

    while ( my $hit = $hits->next ) {
        my $content = $hit->{content};

        print("Content : $content\n");

        print("\n");
    }

    printf( "TERMPARSED Hit Count :$hit_count for query %s\n",
        $term_query->to_string );

}

PROX: {
    my $proximity_query = LucyX::Search::ProximityQuery->new(
        field  => 'content',
        terms  => [qw( apache jakarta )],
        within => 4,
    );
    my $hits = $searcher->hits( query => $proximity_query );

    my $hit_count = $hits->total_hits;

    while ( my $hit = $hits->next ) {
        my $content = $hit->{content};

        print("Content : $content\n");

        print("\n");
    }

    printf( "PROX Hit Count :$hit_count for query %s\n",
        $proximity_query->to_string );

}

PROXSQP: {
    my $schema      = $searcher->get_schema();
    my $field_names = $schema->all_fields;
    my %fieldtypes;
    for my $name (@$field_names) {
        $fieldtypes{$name} = {
            type     => $schema->fetch_type($name),
            analyzer => $schema->fetch_analyzer($name)
        };
    }

    my $qp = Search::Query::Parser->new(
        dialect      => 'Lucy',
        fields       => \%fieldtypes,
        dialect_opts => { default_field => 'content' },  # just for example
    );

    my $proximity_query
        = $qp->parse('content:"apache jakarta"~4')->as_lucy_query;

    my $hits = $searcher->hits( query => $proximity_query );

    my $hit_count = $hits->total_hits;

    while ( my $hit = $hits->next ) {
        my $content = $hit->{content};

        print("Content : $content\n");

        print("\n");
    }

    printf( "PROXSQP Hit Count :$hit_count for query %s\n",
        $proximity_query->to_string );

}



-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-user] Unable to retrieve records using Proximity query

Posted by Peter Karman <pe...@peknet.com>.
Saurabh Vasekar wrote on 6/21/12 3:01 PM:
> Hi Peter,
> 
> Thanks a lot for your help.
> 
> Actually I am reading the fields to be indexed from a database and then creating
> an index on these fields. I am able to perform all the other searches but this one. 
>

One other note: in testing all this, I used the 'swish3' tool that comes with
SWISH::Prog::Lucy. It is a CLI that makes indexing and searching very simple. No
code to write. Just a config file.

E.g.:

[karpet@pekmac:~/tmp/prox]$ cat doc.xml
<doc>
 <content>jakarta is near apache</content>
</doc>

[karpet@pekmac:~/tmp/prox]$ cat conf
MetaNames content
PropertyNames content

[karpet@pekmac:~/tmp/prox]$ swish3 -c conf -F lucy -i doc.xml
1 documents in 00:00:01

[karpet@pekmac:~/tmp/prox]$ swish3 -q 'content:"jakarta apache"~4'
# swish3 version 3.0.12
# Format: Lucy
# Query: content:"jakarta apache"~4
# Hits: 1
# Search time: 0.0253
 306 doc.xml ""
.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-user] Unable to retrieve records using Proximity query

Posted by Saurabh Vasekar <sv...@listenlogic.com>.
Hi Peter,

Thanks a lot for your help.

Actually I am reading the fields to be indexed from a database and then
creating an index on these fields. I am able to perform all the other
searches but this one.

For queries like "content:jakarta AND content:apache" or
"+content:apache AND -content:retrieval",
I compared the search results with other indexing libraries, viz.
Ferret, Lucene, etc., and they gave the same results.

But for the query "content:\"jakarta apache\"~4", the results shown by Lucene
and Ferret are accurate, while I am not getting any records with Lucy.

My code for indexing is -

###############################################################
use strict;
use warnings;
use Redis;
use JSON;

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Index::Indexer;


use lib "/root/apache-lucy-0.3.1/perl/lib";

no warnings 'uninitialized';

my $path_to_index = '/lucy_store';

my $schema = Lucy::Plan::Schema->new;

my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
language => 'en',
);

my $content = Lucy::Plan::FullTextType->new(
analyzer => $polyanalyzer,
);

my $order = Lucy::Plan::FullTextType->new(
analyzer => $polyanalyzer,
sortable => 1,
);

$schema->spec_field( name => 'content', type => $content);
$schema->spec_field( name => 'order', type => $order);

my $indexer = Lucy::Index::Indexer->new(
index => $path_to_index,
schema => $schema,
create => 1,
truncate => 1,
);

my $r = Redis->new;

my @records;

my $noOfRecords = 10000;

binmode(STDOUT, ":utf8");

print("Extracting records\n");

retrieve_data($noOfRecords);  # Retrieving data from the database

print("Finished Extracting\n");

print("Indexing started\n");

for(my $count = 0; $count < $noOfRecords; $count++)
{
    my $doc = parse_data();
    $indexer->add_doc($doc);
}

$indexer->commit;

print("Finished Indexing\n");

sub retrieve_data
{
    for (my $count = 0; $count < $_[0]; $count++)
    {
        my $value = $r->get ("retriveFROMDATA:$count");

        my $decoded_hash = decode_json $value;

        push(@records, $decoded_hash);
    }
}


sub parse_data
{
    my $decoded_hash = shift(@records);
    my %hash_packet;

    while( my ($key, $value) = each %$decoded_hash)
    {
        if($key eq "packet")
        {
            while(my ($key1, $value1) = each %$value)
            {
                if($key1 eq "content" || $key1 eq "order")
                {
                    if($value1)
                    {
                        if($key1 eq "order")
                        {
                            $hash_packet{$key1} = int($value1);
                        }
                        else
                        {
                            $hash_packet{$key1} = $value1;
                        }
                    }
                    else
                    {
                        if($key1 eq "order")
                        {
                            $hash_packet{$key1} = -1;
                        }
                        else
                        {
                            $hash_packet{$key1} = "";
                        }
                    }
                }
            }
        }
    }
    return
    {
        content => $hash_packet{"content"},
        published_at => $hash_packet{"order"},
    };
}
$r->quit;
###############################################################



My code for searching through the indexed documents


###############################################################
use strict;
use warnings;

my $path_to_index = '/lucy_store';

use List::Util qw ( max min );
use POSIX qw ( ceil );
use Encode qw ( decode );

use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;
use Lucy::Search::TermQuery;
use LucyX::Search::ProximityQuery;


binmode STDOUT, ":encoding(UTF-8)";

my $proximity_query = LucyX::Search::ProximityQuery->new(
field => 'content',
terms => [ qw ( jakarta apache ) ],
within => 4,
);

my $by_order = Lucy::Search::SortRule->new(
field => 'order',
reverse => 1,
);

my $sort_spec = Lucy::Search::SortSpec->new(
rules => [
$by_order,
 ],
);

my $offset = "0";

my $page_size = 10000;

my $searcher = Lucy::Search::IndexSearcher->new(
index => $path_to_index,
);

my $hits = $searcher->hits(
query => $proximity_query,
offset => $offset,
num_wanted => $page_size,
sort_spec => $sort_spec,      # when I remove this statement I am not getting any segmentation fault
);


my $hit_count = $hits->total_hits;

while(my $hit = $hits->next)
{
my $content = $hit->{content};
my $order = $hit->{order};

print("Content : $content\n");
print("Order :$order\n");
print("\n");
}

print("Hit Count :$hit_count\n");
print("Program executing till here\n"); # Program is getting executed till here
###############################################################

Also, when I execute my search code I get a segmentation fault. The
statement "Program executing till here" is printed before it happens. When I
remove the sort specification, I do not get the segmentation fault. I am
sorting on an integer field.

Thank you.



On Wed, Jun 20, 2012 at 7:17 AM, Peter Karman <pe...@peknet.com> wrote:

> On 6/19/12 2:05 PM, Saurabh Vasekar wrote:
>
> [ snipped searching code ]
>
>
>> Although my documents contain the text 'jakarta' and 'apache', I am not
>> getting any results. Interestingly, if I specify the following terms in
>> my proximity query, the search returns appropriate results.
>>
>> my $proximity_query = LucyX::Search::ProximityQuery->new(
>> field =>  'content',
>> terms =>  [ qw ( in the ) ],
>> within =>  4,
>> );
>>
>> Is my implementation correct?
>>
>>
> your search code looks reasonable. I would suggest a fully self-contained
> example, including example docs and indexing code, to really demonstrate
> the problem. Since we can't see what's in your index, it's difficult to
> help determine if this is a problem in your code or in Lucy.
>
>
> --
> Peter Karman  .  http://peknet.com/  .  peter@peknet.com
>

Re: [lucy-user] Unable to retrieve records using Proximity query

Posted by Peter Karman <pe...@peknet.com>.
On 6/19/12 2:05 PM, Saurabh Vasekar wrote:

[ snipped searching code ]

> Although my documents contain the text 'jakarta' and 'apache', I am not
> getting any results. Interestingly, if I specify the following terms in my
> proximity query, the search returns appropriate results.
>
> my $proximity_query = LucyX::Search::ProximityQuery->new(
> field =>  'content',
> terms =>  [ qw ( in the ) ],
> within =>  4,
> );
>
> Is my implementation correct?
>

your search code looks reasonable. I would suggest a fully 
self-contained example, including example docs and indexing code, to 
really demonstrate the problem. Since we can't see what's in your index, 
it's difficult to help determine if this is a problem in your code or in 
Lucy.
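
To give an idea of the shape I mean, here is a minimal sketch of such a
self-contained example. The in-memory index (Lucy::Store::RAMFolder), the
single sample document and the one-field schema are illustrative assumptions
on my part, not taken from your code. Swap in your own schema and a couple of
documents that you expect to match. It also prints the tokens the analyzer
produces for the sample text, assuming the analyzer's split() method is
available in your Lucy version, which gives a rough view of what ends up in
the index.

```perl
use strict;
use warnings;

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Index::Indexer;
use Lucy::Store::RAMFolder;
use Lucy::Search::IndexSearcher;
use LucyX::Search::ProximityQuery;

# Assumption: a one-field schema using the English PolyAnalyzer.
my $analyzer = Lucy::Analysis::PolyAnalyzer->new( language => 'en' );
my $schema   = Lucy::Plan::Schema->new;
$schema->spec_field(
    name => 'content',
    type => Lucy::Plan::FullTextType->new( analyzer => $analyzer ),
);

# Build the index in memory so the script needs no files on disk.
my $folder  = Lucy::Store::RAMFolder->new;
my $indexer = Lucy::Index::Indexer->new(
    index  => $folder,
    schema => $schema,
    create => 1,
);

# Hypothetical sample document that the query is expected to match.
my $text = 'jakarta is near apache';
$indexer->add_doc( { content => $text } );
$indexer->commit;

# Show the tokens the analyzer produces for the sample text.
print "Analyzed tokens: @{ $analyzer->split($text) }\n";

# Run the proximity query from the original post against the index.
my $searcher = Lucy::Search::IndexSearcher->new( index => $folder );
my $query    = LucyX::Search::ProximityQuery->new(
    field  => 'content',
    terms  => [qw( jakarta apache )],
    within => 4,
);
my $hits = $searcher->hits( query => $query );
printf( "Hit Count: %d\n", $hits->total_hits );
```

A script like this can be run by anyone on the list without access to your
database, which makes it much easier to tell a usage problem from a bug.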


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com