Posted to user@lucy.apache.org by Saurabh Vasekar <sv...@listenlogic.com> on 2012/06/19 21:05:26 UTC
[lucy-user] Unable to retrieve records using Proximity query
Hello,
I am using LucyX::Search::ProximityQuery to search through the indexed
documents.
My code is shown below:
use strict;
use warnings;
my $path_to_index = '{path_to_index}'; # specified correctly
use List::Util qw ( max min );
use POSIX qw ( ceil );
use Encode qw ( decode );
use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;
use Lucy::Search::TermQuery;
use LucyX::Search::ProximityQuery;
binmode STDOUT, ":encoding(UTF-8)";
my $proximity_query = LucyX::Search::ProximityQuery->new(
field => 'content',
terms => [ qw ( jakarta apache ) ],
within => 4,
);
my $offset = "0";
my $page_size = 10000;
my $searcher = Lucy::Search::IndexSearcher->new(
index => $path_to_index,
);
my $qparser = Lucy::Search::QueryParser->new(
schema => $searcher->get_schema,
);
my $hits = $searcher->hits(
query => $proximity_query,
offset => $offset,
num_wanted => $page_size,
);
my $hit_count = $hits->total_hits;
print("Hit Count :$hit_count\n");
#End of code
Although my documents contain the text 'jakarta' and 'apache', I am not
getting any results. The interesting thing is that if I specify the
following terms in my proximity query, the search returns appropriate
results.
my $proximity_query = LucyX::Search::ProximityQuery->new(
field => 'content',
terms => [ qw ( in the ) ],
within => 4,
);
Is my implementation correct?
Thank you.
Re: [lucy-user] Unable to retrieve records using Proximity query
Posted by Peter Karman <pe...@peknet.com>.
Saurabh Vasekar wrote on 6/21/12 3:01 PM:
>
> For queries such as "content:jakarta AND content:apache" or
> "+content:apache AND -content:retrieval",
> I compared the search results with other indexing libraries (Ferret,
> Lucene, etc.) and they gave the same results.
>
> But for the query "content:\"jakarta apache\"~4", the results shown by Lucene
> and Ferret are accurate, but I am not getting any records with Lucy.
>
Thanks for the full code examples. They were helpful.
It took me a few hours of playing with it to figure out why it wasn't working as
you (and I) expected. Your indexing code is fine. The searching code assumes (as
did I at first) that the terms in a ProximityQuery object would be analyzed
(stemmed). They aren't. Only the QueryParser does the analyzing. When you
construct a Query object manually, you have to do the analysis yourself.
Unfortunately, the core Lucy::Search::QueryParser class doesn't handle the
proximity syntax, since ProximityQuery is an extension to the core.
Fortunately, Search::Query::Parser handles more advanced query syntax than does
the core class. (This is no knock against the Lucy parser -- as Marvin and I
have discussed in the past, it is a thankless task to try and create a parser
that is all things to all people.)
I've included example searcher code below. I've included examples of using a
query parser vs just constructing the query objects manually.
use strict;
use warnings;
my $path_to_index = 'lucy_store';
use Lucy::Search::QueryParser;
use Lucy::Search::IndexSearcher;
use LucyX::Search::ProximityQuery;
use Search::Query;
my $searcher = Lucy::Search::IndexSearcher->new( index => $path_to_index, );
TERM: {
my $term_query = Lucy::Search::TermQuery->new(
field => 'content',
term => 'apache',
);
my $hits = $searcher->hits( query => $term_query, );
my $hit_count = $hits->total_hits;
while ( my $hit = $hits->next ) {
my $content = $hit->{content};
print("Content : $content\n");
print("\n");
}
printf( "TERM Hit Count :$hit_count for query %s\n",
$term_query->to_string );
}
TERMPARSED: {
my $qp = Lucy::Search::QueryParser->new(
schema => $searcher->get_schema,
fields => [qw( content )],
);
my $term_query = $qp->parse('apache');
my $hits = $searcher->hits( query => $term_query, );
my $hit_count = $hits->total_hits;
while ( my $hit = $hits->next ) {
my $content = $hit->{content};
print("Content : $content\n");
print("\n");
}
printf( "TERMPARSED Hit Count :$hit_count for query %s\n",
$term_query->to_string );
}
PROX: {
my $proximity_query = LucyX::Search::ProximityQuery->new(
field => 'content',
terms => [qw( apache jakarta )],
within => 4,
);
my $hits = $searcher->hits( query => $proximity_query );
my $hit_count = $hits->total_hits;
while ( my $hit = $hits->next ) {
my $content = $hit->{content};
print("Content : $content\n");
print("\n");
}
printf( "PROX Hit Count :$hit_count for query %s\n",
$proximity_query->to_string );
}
PROXSQP: {
my $schema = $searcher->get_schema();
my $field_names = $schema->all_fields;
my %fieldtypes;
for my $name (@$field_names) {
$fieldtypes{$name} = {
type => $schema->fetch_type($name),
analyzer => $schema->fetch_analyzer($name)
};
}
my $qp = Search::Query::Parser->new(
dialect => 'Lucy',
fields => \%fieldtypes,
dialect_opts => { default_field => 'content' }, # just for example
);
my $proximity_query
= $qp->parse('content:"apache jakarta"~4')->as_lucy_query;
my $hits = $searcher->hits( query => $proximity_query );
my $hit_count = $hits->total_hits;
while ( my $hit = $hits->next ) {
my $content = $hit->{content};
print("Content : $content\n");
print("\n");
}
printf( "PROXSQP Hit Count :$hit_count for query %s\n",
$proximity_query->to_string );
}
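The PROX block above passes the raw terms straight to ProximityQuery, so they are compared as-is against the analyzed (stemmed) terms stored in the index. A minimal sketch of doing the analysis by hand follows. It assumes that the field's analyzer (fetched with fetch_analyzer, as in the PROXSQP block) offers a split() method returning an array reference of token texts. analyzed_terms and analyzed_proximity_query are hypothetical helper names for illustration, not part of Lucy.

```perl
use strict;
use warnings;
use Carp qw( croak );

# Run each raw term through the analyzer and flatten the resulting tokens.
# Assumption: $analyzer responds to split($text) and returns an array ref.
sub analyzed_terms {
    my ( $analyzer, @raw_terms ) = @_;
    return [ map { @{ $analyzer->split($_) } } @raw_terms ];
}

# Hypothetical helper: build a ProximityQuery whose terms have been passed
# through the same analyzer that was used for the field at index time.
sub analyzed_proximity_query {
    my ( $searcher, $field, $within, @raw_terms ) = @_;
    require LucyX::Search::ProximityQuery;
    my $analyzer = $searcher->get_schema->fetch_analyzer($field)
        or croak "no analyzer found for field '$field'";
    return LucyX::Search::ProximityQuery->new(
        field  => $field,
        terms  => analyzed_terms( $analyzer, @raw_terms ),
        within => $within,
    );
}
```

With the $searcher from the example above, analyzed_proximity_query( $searcher, 'content', 4, qw( apache jakarta ) ) would stand in for the hand-built query in the PROX block. This would also be consistent with the original observation that 'in the' matched while 'jakarta apache' did not: short words such as 'in' and 'the' typically pass through an English stemmer unchanged, whereas 'apache' is likely reduced to a shorter stem.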
--
Peter Karman . http://peknet.com/ . peter@peknet.com
Re: [lucy-user] Unable to retrieve records using Proximity query
Posted by Peter Karman <pe...@peknet.com>.
Saurabh Vasekar wrote on 6/21/12 3:01 PM:
> Hi Peter,
>
> Thanks a lot for your help.
>
> Actually I am reading the fields to be indexed from a database and then creating
> an index on these fields. I am able to perform all the other searches but this one.
>
one other note: in testing all this, I used the 'swish3' tool that comes with
SWISH::Prog::Lucy. It is a cli that makes indexing and searching very simple. No
code to write. Just a config file.
E.g.:
[karpet@pekmac:~/tmp/prox]$ cat doc.xml
<doc>
<content>jakarta is near apache</content>
</doc>
[karpet@pekmac:~/tmp/prox]$ cat conf
MetaNames content
PropertyNames content
[karpet@pekmac:~/tmp/prox]$ swish3 -c conf -F lucy -i doc.xml
1 documents in 00:00:01
[karpet@pekmac:~/tmp/prox]$ swish3 -q 'content:"jakarta apache"~4'
# swish3 version 3.0.12
# Format: Lucy
# Query: content:"jakarta apache"~4
# Hits: 1
# Search time: 0.0253
306 doc.xml ""
.
--
Peter Karman . http://peknet.com/ . peter@peknet.com
Re: [lucy-user] Unable to retrieve records using Proximity query
Posted by Saurabh Vasekar <sv...@listenlogic.com>.
Hi Peter,
Thanks a lot for your help.
Actually I am reading the fields to be indexed from a database and then
creating an index on these fields. I am able to perform all the other
searches but this one.
For queries such as "content:jakarta AND content:apache" or
"+content:apache AND -content:retrieval",
I compared the search results with other indexing libraries (Ferret,
Lucene, etc.) and they gave the same results.
But for the query "content:\"jakarta apache\"~4", the results shown by Lucene
and Ferret are accurate, but I am not getting any records with Lucy.
My code for indexing is -
###############################################################
use strict;
use warnings;
use Redis;
use JSON;
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Index::Indexer;
use lib "/root/apache-lucy-0.3.1/perl/lib";
no warnings 'uninitialized';
my $path_to_index = '/lucy_store';
my $schema = Lucy::Plan::Schema->new;
my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
language => 'en',
);
my $content = Lucy::Plan::FullTextType->new(
analyzer => $polyanalyzer,
);
my $order = Lucy::Plan::FullTextType->new(
analyzer => $polyanalyzer,
sortable => 1,
);
$schema->spec_field( name => 'content', type => $content);
$schema->spec_field( name => 'order', type => $order);
my $indexer = Lucy::Index::Indexer->new(
index => $path_to_index,
schema => $schema,
create => 1,
truncate => 1,
);
my $r = Redis->new;
my @records;
my $noOfRecords = 10000;
binmode(STDOUT, ":utf8");
print("Extracting records\n");
retrieve_data($noOfRecords); # Retrieving data from the database
print("Finished Extracting\n");
print("Indexing started\n");
for(my $count = 0; $count < $noOfRecords; $count++)
{
my $doc = parse_data();
$indexer->add_doc($doc);
}
$indexer->commit;
print("Finished Indexing\n");
sub retrieve_data
{
for (my $count = 0; $count < $_[0]; $count++)
{
my $value = $r->get ("retriveFROMDATA:$count");
my $decoded_hash = decode_json $value;
push(@records, $decoded_hash);
}
}
sub parse_data
{
my $decoded_hash = shift(@records);
my %hash_packet;
while( my ($key, $value) = each%$decoded_hash)
{
if($key eq "packet")
{
while(my ($key1, $value1) = each%$value)
{
if($key1 eq "content" || $key1 eq "order")
{
if($value1)
{
if($key1 eq "order")
{
$hash_packet{$key1} = int($value1);
}
else
{
$hash_packet{$key1} = $value1;
}
}
else
{
if($key1 eq "order")
{
$hash_packet{$key1} = -1;
}
else
{
$hash_packet{$key1} = "";
}
}
}
}
}
}
return
{
content => $hash_packet{"content"},
published_at => $hash_packet{"order"},
};
}
$r->quit;
###############################################################
My code for searching through the indexed documents
###############################################################
use strict;
use warnings;
my $path_to_index = '/lucy_store';
use List::Util qw ( max min );
use POSIX qw ( ceil );
use Encode qw ( decode );
use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;
use Lucy::Search::TermQuery;
use LucyX::Search::ProximityQuery;
binmode STDOUT, ":encoding(UTF-8)";
my $proximity_query = LucyX::Search::ProximityQuery->new(
field => 'content',
terms => [ qw ( jakarta apache ) ],
within => 4,
);
my $by_order = Lucy::Search::SortRule->new(
field => 'order',
reverse => 1,
);
my $sort_spec = Lucy::Search::SortSpec->new(
rules => [
$by_order,
],
);
my $offset = "0";
my $page_size = 10000;
my $searcher = Lucy::Search::IndexSearcher->new(
index => $path_to_index,
);
my $hits = $searcher->hits(
query => $proximity_query,
offset => $offset,
num_wanted => $page_size,
sort_spec => $sort_spec, # when I remove this statement I am not getting any segmentation fault
);
my $hit_count = $hits->total_hits;
while(my $hit = $hits->next)
{
my $content = $hit->{content};
my $order = $hit->{order};
print("Content : $content\n");
print("Order :$order\n");
print("\n");
}
print("Hit Count :$hit_count\n");
print("Program executing till here\n"); # Program is getting executed till here
###############################################################
Also, when I execute my searching code I get a segmentation fault. The
statement "Program executing till here" is printed before the fault occurs.
When I remove the sort specification, I do not get any segmentation fault.
I am sorting based on an integer field.
Thank you.
On Wed, Jun 20, 2012 at 7:17 AM, Peter Karman <pe...@peknet.com> wrote:
> On 6/19/12 2:05 PM, Saurabh Vasekar wrote:
>
> [ snipped searching code ]
>
>
>> Although my documents contain contents which have text 'jakarta' and
>> 'apache' I am not getting any results. The interesting thing is that if I
>> specify the following in my proximity query the search returns appropriate
>> results.
>>
>> my $proximity_query = LucyX::Search::ProximityQuery->new(
>> field => 'content',
>> terms => [ qw ( in the ) ],
>> within => 4,
>> );
>>
>> Is my implementation correct?
>>
>>
> your search code looks reasonable. I would suggest a fully self-contained
> example, including example docs and indexing code, to really demonstrate
> the problem. Since we can't see what's in your index, it's difficult to
> help determine if this is a problem in your code or in Lucy.
>
>
> --
> Peter Karman . http://peknet.com/ . peter@peknet.com
>
Re: [lucy-user] Unable to retrieve records using Proximity query
Posted by Peter Karman <pe...@peknet.com>.
On 6/19/12 2:05 PM, Saurabh Vasekar wrote:
[ snipped searching code ]
> Although my documents contain contents which have text 'jakarta' and
> 'apache' I am not getting any results. The interesting thing is that if I
> specify the following in my proximity query the search returns appropriate
> results.
>
> my $proximity_query = LucyX::Search::ProximityQuery->new(
> field => 'content',
> terms => [ qw ( in the ) ],
> within => 4,
> );
>
> Is my implementation correct?
>
your search code looks reasonable. I would suggest a fully
self-contained example, including example docs and indexing code, to
really demonstrate the problem. Since we can't see what's in your index,
it's difficult to help determine if this is a problem in your code or in
Lucy.
--
Peter Karman . http://peknet.com/ . peter@peknet.com