Posted to user@lucy.apache.org by arjan <ar...@unitedknowledge.nl> on 2011/07/06 22:45:30 UTC

[lucy-user] index and search words separated by hyphens

Dear all,

Does anyone know how to retrieve a document by words in the document 
that are separated by hyphens?

The reason for my question is this: if I index a single document of a single
line that contains words separated by hyphens, I can retrieve that
document by any word, but not by the words separated by hyphens, nor by
the whole phrase including the hyphens.

For example, I index a single document containing only this sentence:

"please subscribe to this mailing-list"

I can retrieve this document by searching for "please" or "subscribe" or 
"please subscribe", but not by searching for "mailing-list" or "mailing" 
or "list".

It seems that the words "mailing" and "list" are treated as separate 
words, since both "mailing" and "list" can be found in the lexicon. 
However, it seems that they are somehow not connected to the document.

Any help would be appreciated, or is this a bug?

I am using Lucy-0.1.0 from CPAN.

Kind regards,
Arjan Widlak


Re: [lucy-user] index and search words separated by hyphens

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Jul 08, 2011 at 11:08:50AM +0200, arjan wrote:
> I used an external script to check whether the new message had been
> indexed in Kino correctly. In the external script, the object was freshly
> instantiated. In the main program, however, the searcher was not, and
> because - as the documentation says - "IndexSearchers operate against a
> single point-in-time view or Snapshot of the index", I searched the index
> as it was before my changes. I have now added a call to
> $env->clear_searcher to reinstantiate the searcher, and it works.

Thank you for making the effort to write up an accurate post-mortem.

The point-in-time view of the index is a sane and appropriate interface for
Searcher.  If Searcher updated itself, then users would have to wrap
transaction guards around every non-atomic sequence of method calls.  Given
Lucy's highly modularized design (where Searchers are often held as member
variables within other components), that's impractical.

Nevertheless, you aren't the first to be tripped up by that behavior and you
won't be the last.

At some point we should create some "troubleshooting" documentation, and
verifying that the Searcher has been properly refreshed will definitely be
on the checklist.

Best,

Marvin Humphrey
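
The point-in-time behaviour described above can be reproduced with a short script. This is only a sketch under stated assumptions: the temporary index directory, the 'title' field, the sample documents and the add_message helper are all made up for illustration, and the script assumes Lucy is installed. It opens one searcher, commits a second document, and then compares that searcher with a freshly constructed one.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Index::Indexer;
use Lucy::Search::IndexSearcher;

# Hypothetical throwaway index location, removed when the script exits.
my $index_path = tempdir( CLEANUP => 1 );

# Hypothetical single-field schema, for illustration only.
my $schema = Lucy::Plan::Schema->new;
my $type   = Lucy::Plan::FullTextType->new(
    analyzer => Lucy::Analysis::PolyAnalyzer->new( language => 'en' ),
);
$schema->spec_field( name => 'title', type => $type );

# Hypothetical helper: add one document and commit it.
sub add_message {
    my ($title) = @_;
    my $indexer = Lucy::Index::Indexer->new(
        index  => $index_path,
        schema => $schema,
        create => 1,
    );
    $indexer->add_doc( { title => $title } );
    $indexer->commit;
}

add_message('please subscribe to this mailing-list');

# This searcher is tied to the snapshot that exists at this moment.
my $old_searcher = Lucy::Search::IndexSearcher->new( index => $index_path );

add_message('another post to the mailing-list');

# A newly constructed searcher reads the latest snapshot.
my $new_searcher = Lucy::Search::IndexSearcher->new( index => $index_path );

my $old_total = $old_searcher->hits( query => 'mailing' )->total_hits;
my $new_total = $new_searcher->hits( query => 'mailing' )->total_hits;
print "old searcher: $old_total hit(s), new searcher: $new_total hit(s)\n";
```

If the two totals differ, the long-lived searcher is still looking at the older snapshot, which is the first thing to check when a newly indexed document does not show up.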


Re: [lucy-user] index and search words separated by hyphens

Posted by arjan <ar...@unitedknowledge.nl>.
Hi Marvin,

I found my bug, and it had nothing to do with what I wrote earlier. I 
can verify that indeed
"please subscribe"
"please-subscribe"
please-subscribe
produce the same results, both in Lucy-0.1.0 and in KinoSearch 0.313.

The source of my bug was something completely different. My environment
object has an attribute, msg_searcher, in which I store my searcher, like
so (Moose code):

has 'msg_searcher' => (
     is          => 'ro',
     isa         => 'KinoSearch::Search::IndexSearcher',
     required    => 1,
     lazy        => 1,
     builder     => '_build_searcher',
     clearer     => 'clear_searcher',
);
sub _build_searcher {
     my $self = shift;
     return KinoSearch::Search::IndexSearcher->new(
         index => $self->message_storage,
     );
}

I used an external script to check whether the new message had been
indexed in Kino correctly. In the external script, the object was freshly
instantiated. In the main program, however, the searcher was not, and
because - as the documentation says - "IndexSearchers operate against a
single point-in-time view or Snapshot of the index", I searched the index
as it was before my changes. I have now added a call to
$env->clear_searcher to reinstantiate the searcher, and it works.

Kind regards,
Arjan Widlak

United Knowledge
http://www.unitedknowledge.nl/




-- 
Recent: http://www.lomcongres.nl/
Conference and newsletter portal with a relationship management system for the Landelijk Overleg Milieuhandhaving

Setting Standards, a Delft University of Technology and United Knowledge simulation exercise on strategy and cooperation in standardization, http://www.setting-standards.com

United Knowledge, internet for the public sector
Keizersgracht 74
1015 CT Amsterdam
T +31 (0)20 52 18 300
F +31 (0)20 52 18 301
bureau@unitedknowledge.nl
http://www.unitedknowledge.nl

M +31 (0)6 2427 1444
E arjan@unitedknowledge.nl

Visit our site at:
http://www.unitedknowledge.nl

Or have a look at one of our projects:
http://www.handhavingsportaal.nl/
http://www.setting-standards.com/
http://www.lomcongres.nl/
http://www.clubvanmaarssen.org/



Re: [lucy-user] index and search words separated by hyphens

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Jul 07, 2011 at 01:35:30AM +0200, arjan wrote:
> You asked how I build the query, and this made me have another look at how
> I instantiate the QueryParser object. I selected the wrong fields.

Ah, interesting.  :)

> Thanks for asking the right questions and telling me how it should work.

Happy to help, and I'm glad you didn't find a Lucy bug!  :)

For what it's worth, it's useful to know what kinds of mistakes our users
make, as it can help us to refine our API designs so that they shunt people
in the right direction.  (See <http://wiki.apache.org/lucy/BrainLog>.)

In this case, I think we've already made QueryParser just about as safe as it
can be by having it take a Schema rather than a hard-coded list of fields by
default.  I can't think of a way we could improve the design to spare future
users from the problems you faced.

>> They're in the lexicon?  Do you mean that you've gone all the way down into
>> Lucy::Index::Lexicon, or something else?
> Yes, like so:
> my $polyreader = Lucy::Index::IndexReader->open(
>     index => $env->message_storage,
> );
> my $seg_readers = $polyreader->seg_readers;
>
> foreach my $seg_reader ( @$seg_readers ) {
>     say "segment: $seg_reader";
>     my $lex_reader = $seg_reader->obtain( "Lucy::Index::LexiconReader" );
>     my $lexicon    = $lex_reader->lexicon( field => 'title' );
>
>     while ( $lexicon->next ) {
>         say encode( 'utf8', $lexicon->get_term );
>     }
> }

Nicely done!  You obviously did your homework before coming to the list for
help...

>>> Any help would be appreciated, or is this a bug?
>> How are you building/executing the query?
> Ohhhhh....

LOL!  Been there...

Good luck,

Marvin Humphrey
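
The "wrong fields" symptom discussed above (terms present in the lexicon, yet zero hits) can be sketched as follows. Everything specific here is an assumption made for illustration: the 'title' and 'body' fields, the sample document and the temporary index directory are hypothetical, and the script assumes Lucy is installed. It compares a QueryParser built from the Schema alone with one restricted to a field that does not contain the text.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Index::Indexer;
use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;

# Hypothetical throwaway index with two full-text fields.
my $index_path = tempdir( CLEANUP => 1 );
my $schema     = Lucy::Plan::Schema->new;
my $type       = Lucy::Plan::FullTextType->new(
    analyzer => Lucy::Analysis::PolyAnalyzer->new( language => 'en' ),
);
$schema->spec_field( name => $_, type => $type ) for qw( title body );

my $indexer = Lucy::Index::Indexer->new(
    index  => $index_path,
    schema => $schema,
    create => 1,
);
$indexer->add_doc(
    {   title => 'please subscribe to this mailing-list',
        body  => 'hello world',
    }
);
$indexer->commit;

my $searcher = Lucy::Search::IndexSearcher->new( index => $index_path );

# Parser that takes its fields from the Schema (the default behaviour).
my $all_fields = Lucy::Search::QueryParser->new(
    schema => $searcher->get_schema,
);

# Parser restricted to a field that does not hold the hyphenated text.
my $body_only = Lucy::Search::QueryParser->new(
    schema => $searcher->get_schema,
    fields => ['body'],
);

my %total;
for my $pair ( [ 'all fields', $all_fields ], [ 'body only', $body_only ] ) {
    my ( $label, $parser ) = @$pair;
    my $query = $parser->parse('mailing-list');
    $total{$label} = $searcher->hits( query => $query )->total_hits;
    print "$label: $total{$label} hit(s)\n";
}
```

A parser limited to the wrong field list never looks at the field that holds the terms, so the lexicon can contain "mailing" and "list" while the search still comes back empty.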


Re: [lucy-user] index and search words separated by hyphens

Posted by arjan <ar...@unitedknowledge.nl>.
Dear Marvin,

I am sorry, this is my mistake. While answering your questions below, I found
that I had made a mistake in the test code I wrote to pin down my problem.
You asked how I build the query, and this made me have another look at how I
instantiate the QueryParser object. I selected the wrong fields. I may
have done something similar in the real code. I will check this
tomorrow. Sorry to have bothered you.

Thanks for asking the right questions and telling me how it should work.

Kind regards,
Arjan.

>> The reason for my question is this: if I index a single document of a single
>> line that contains words separated by hyphens, I can retrieve that
>> document by any word, but not by the words separated by hyphens, nor by
>> the whole phrase including the hyphens.
>>
>> For example, I index a single document containing only this sentence:
>>
>> "please subscribe to this mailing-list"
>>
>> I can retrieve this document by searching for "please" or "subscribe" or
>> "please subscribe", but not by searching for "mailing-list" or "mailing"
>> or "list".
> I'm confused -- there seems to be a contradiction between your ability to
> retrieve the document "by any word", and your inability to retrieve the
> document by searching for "mailing" or "list".
>
> Can you please clarify what you get when you search for "mailing"?

I can retrieve the document by "please", "subscribe", "to" and "this",
but not by "mailing", "list" or "mailing-list". So if I search for
mailing, I get zero hits.

>> It seems that the words "mailing" and "list" are treated as separate
>> words, since both "mailing" and "list" can be found in the lexicon.
> They're in the lexicon?  Do you mean that you've gone all the way down into
> Lucy::Index::Lexicon, or something else?

Yes, like so:

use feature 'say';
use Encode qw( encode );
use Lucy::Index::IndexReader;

# Open a reader on the index ($env is my environment object).
my $polyreader = Lucy::Index::IndexReader->open(
    index => $env->message_storage,
);
my $seg_readers = $polyreader->seg_readers;

# Walk every segment and print each term in the lexicon of the 'title' field.
foreach my $seg_reader ( @$seg_readers ) {
    say "segment: $seg_reader";
    my $lex_reader = $seg_reader->obtain( "Lucy::Index::LexiconReader" );
    my $lexicon    = $lex_reader->lexicon( field => 'title' );

    while ( $lexicon->next ) {
        say encode( 'utf8', $lexicon->get_term );
    }
}

>> Any help would be appreciated, or is this a bug?
> How are you building/executing the query?

Ohhhhh....

> What does the FieldType assigned to the field in question look like?
>
> For common Analyzer configurations, Lucy's QueryParser is supposed to parse
> hyphenated constructs as phrases -- so these should all produce the same
> results:
>
>      "mailing list"
>      "mailing-list"
>      mailing-list
>
> Similarly, these should all produce the same results:
>
>      "please subscribe"
>      "please-subscribe"
>      please-subscribe
>
> It might be interesting to know whether those work as expected.
>
> Best,
>
> Marvin Humphrey
>






Re: [lucy-user] index and search words separated by hyphens

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Jul 06, 2011 at 10:45:30PM +0200, arjan wrote:
> Does anyone know how to retrieve a document by words in the document  
> that are separated by hyphens?

This shouldn't be a problem. :)

> The reason for my question is this: if I index a single document of a single
> line that contains words separated by hyphens, I can retrieve that
> document by any word, but not by the words separated by hyphens, nor by
> the whole phrase including the hyphens.
>
> For example, I index a single document containing only this sentence:
>
> "please subscribe to this mailing-list"
>
> I can retrieve this document by searching for "please" or "subscribe" or  
> "please subscribe", but not by searching for "mailing-list" or "mailing"  
> or "list".

I'm confused -- there seems to be a contradiction between your ability to
retrieve the document "by any word", and your inability to retrieve the
document by searching for "mailing" or "list".

Can you please clarify what you get when you search for "mailing"?

> It seems that the words "mailing" and "list" are treated as separate  
> words, since both "mailing" and "list" can be found in the lexicon.  

They're in the lexicon?  Do you mean that you've gone all the way down into
Lucy::Index::Lexicon, or something else?

> Any help would be appreciated, or is this a bug?

How are you building/executing the query?

What does the FieldType assigned to the field in question look like?

For common Analyzer configurations, Lucy's QueryParser is supposed to parse
hyphenated constructs as phrases -- so these should all produce the same
results:

    "mailing list"
    "mailing-list"
    mailing-list

Similarly, these should all produce the same results:

    "please subscribe"
    "please-subscribe"
    please-subscribe

It might be interesting to know whether those work as expected.

Best,

Marvin Humphrey
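
To illustrate why the query forms listed above are expected to behave alike, here is a toy model in plain Perl. It is not Lucy's implementation: the tokenize and phrase_matches helpers and the \w+ pattern are assumptions made for this sketch, and a real Analyzer may tokenize differently. The idea is that a tokenizer which keeps only runs of word characters turns "mailing-list" into the two tokens "mailing" and "list", at index time and at query time alike, so the hyphenated form reduces to a two-word phrase.

```perl
use strict;
use warnings;

# Toy tokenizer (assumption): lowercase runs of word characters, so a
# hyphen acts purely as a separator.
sub tokenize {
    my ($text) = @_;
    return map { lc $_ } $text =~ /(\w+)/g;
}

# Toy phrase match: true if the query tokens occur consecutively in the
# document tokens.
sub phrase_matches {
    my ( $doc_tokens, $query_tokens ) = @_;
    my $last_start = @$doc_tokens - @$query_tokens;
    START: for my $start ( 0 .. $last_start ) {
        for my $offset ( 0 .. $#$query_tokens ) {
            next START
                if $doc_tokens->[ $start + $offset ] ne $query_tokens->[$offset];
        }
        return 1;
    }
    return 0;
}

my @doc_tokens = tokenize('please subscribe to this mailing-list');

for my $query ( 'mailing list', 'mailing-list', 'please-subscribe' ) {
    my @query_tokens = tokenize($query);
    my $matched = phrase_matches( \@doc_tokens, \@query_tokens ) ? 'yes' : 'no';
    print "$query => [@query_tokens] matches: $matched\n";
}
```

In this model the hyphen never reaches the index, which is consistent with finding "mailing" and "list" as separate lexicon entries.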