Posted to user@lucy.apache.org by "Desilets, Alain" <Al...@nrc-cnrc.gc.ca> on 2012/01/31 19:19:38 UTC

[lucy-user] Can lucy do substring search?

I have a Lucy index with one field called URL.

I would like to do substring searches on this field, for example, to find all the records whose URL includes http://www.somewhere.com/abc/ (i.e. all the URLs under the abc directory on that site).

Is there a way to do this?

I guess I could always treat the field as a tokenized string:


---
# Split the URL into word tokens (anything matching \w+); other characters are dropped.
my $string_tokenizer = Lucy::Analysis::RegexTokenizer->new( pattern => '\w+' );
# A single-element analysis chain: no case folding or stemming, just the tokenizer.
my $analyzer = Lucy::Analysis::PolyAnalyzer->new( analyzers => [$string_tokenizer] );
---
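
For reference, here is roughly how I would plug that analyzer into the schema and indexer (field and path names are just placeholders):

---
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;

# Declare the URL field as full text so it is run through the $analyzer defined above.
my $schema   = Lucy::Plan::Schema->new;
my $url_type = Lucy::Plan::FullTextType->new( analyzer => $analyzer );
$schema->spec_field( name => 'url', type => $url_type );

my $indexer = Lucy::Index::Indexer->new(
    schema => $schema,
    index  => '/path/to/index',
    create => 1,
);
$indexer->add_doc({ url => 'http://www.somewhere.com/abc/page.html' });
$indexer->commit;
---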

But then I would probably have to do some post-search processing to make sure that the URLs of the retrieved records actually DO fit the pattern, and that there are no differences in the non-word characters that were stripped out by the indexer.
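
Concretely, I picture the two passes going something like this (untested sketch; the query and substring are just examples):

---
use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;

my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );
my $parser   = Lucy::Search::QueryParser->new( schema => $searcher->get_schema );

# First pass: the tokenized query retrieves a superset of candidate docs.
my $query = $parser->parse('"http www somewhere com abc"');
my $hits  = $searcher->hits( query => $query, num_wanted => 1000 );

# Second pass: keep only hits whose stored URL really contains the literal substring.
my $substring = 'http://www.somewhere.com/abc/';
my @matches;
while ( my $hit = $hits->next ) {
    push @matches, $hit if index( $hit->{url}, $substring ) >= 0;
}
---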

I was wondering if there was a way to tokenize the string into individual characters instead, and whether that is advisable from a performance point of view.

Thx.

Alain Désilets
Agent de recherche | Research Officer 
Institut de technologie de l'information | Institute for Information Technology
Conseil national de recherches du Canada | National Research Council of Canada


RE: [lucy-user] Can lucy do substring search?

Posted by "Desilets, Alain" <Al...@nrc-cnrc.gc.ca>.
Even if I did that, I would still need to search the domain as a non-exact value. For example, I might want to search on *.gc.ca to cover all Government of Canada web sites, on *.nrc-cnrc.gc.ca to cover only the site of the National Research Council of Canada, or on *.ca to limit the search to any web site in Canada.
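
With the host split into its own untokenized field, I imagine the query would look roughly like this (untested, using the WildcardQuery module mentioned earlier):

---
use LucyX::Search::WildcardQuery;

# Assumes a separate, untokenized 'host' field and an existing $searcher.
my $query = LucyX::Search::WildcardQuery->new(
    field => 'host',
    term  => '*.gc.ca',    # any Government of Canada host
);
my $hits = $searcher->hits( query => $query );
---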

Alain

-----Original Message-----
From: Peter Karman [mailto:peter@peknet.com] 
Sent: Thursday, February 02, 2012 9:27 AM
To: 'lucy-user@incubator.apache.org'
Subject: Re: [lucy-user] Can lucy do substring search?

On 2/2/12 7:40 AM, Desilets, Alain wrote:
> Thx Peter. In my case, the fields on which I need to do wildcard searches are fields that specify the URL of a document. I want to be able to use this to limit the search to documents that are on specific web sites.
>
> It seems the best balance in that case, between accuracy and speed, would be to tokenize on non-word characters. Then I could retrieve a superset of docs on, say, www.somewhere.org by searching for "www.somewhere.org" (with a QueryParser). This might accidentally retrieve docs whose URLs contain www/somewhere/org (for example), but I would do a second pass to filter out the docs whose URL does not match the actual expression www.somewhere.org. I would need to do this second pass anyway, even if I were using a wildcard search, because I might accidentally match a URL that has www.somewhere.org somewhere other than in the host name (e.g. http://www.aplace.com/www.somewhere.org.html).
>

why not pull the hostname out at indexing time into its own field? then 
your particular use case should get no false positives?


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-user] Can lucy do substring search?

Posted by Peter Karman <pe...@peknet.com>.
On 2/2/12 7:40 AM, Desilets, Alain wrote:
> Thx Peter. In my case, the fields on which I need to do wildcard searches are fields that specify the URL of a document. I want to be able to use this to limit the search to documents that are on specific web sites.
>
> It seems the best balance in that case, between accuracy and speed, would be to tokenize on non-word characters. Then I could retrieve a superset of docs on, say, www.somewhere.org by searching for "www.somewhere.org" (with a QueryParser). This might accidentally retrieve docs whose URLs contain www/somewhere/org (for example), but I would do a second pass to filter out the docs whose URL does not match the actual expression www.somewhere.org. I would need to do this second pass anyway, even if I were using a wildcard search, because I might accidentally match a URL that has www.somewhere.org somewhere other than in the host name (e.g. http://www.aplace.com/www.somewhere.org.html).
>

why not pull the hostname out at indexing time into its own field? then 
your particular use case should get no false positives?
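
e.g., something like this at indexing time (untested; assumes a 'host' field in your 
schema and that $indexer is your Lucy::Index::Indexer):

---
use URI;

# Derive a separate 'host' field from the full URL when building each doc.
my $url  = 'http://www.somewhere.com/abc/page.html';
my $host = URI->new($url)->host;    # "www.somewhere.com"

$indexer->add_doc({
    url  => $url,
    host => $host,
});
---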


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

RE: [lucy-user] Can lucy do substring search?

Posted by "Desilets, Alain" <Al...@nrc-cnrc.gc.ca>.
Thx Peter. In my case, the fields on which I need to do wildcard searches are fields that specify the URL of a document. I want to be able to use this to limit the search to documents that are on specific web sites.

It seems the best balance in that case, between accuracy and speed, would be to tokenize on non-word characters. Then I could retrieve a superset of docs on, say, www.somewhere.org by searching for "www.somewhere.org" (with a QueryParser). This might accidentally retrieve docs whose URLs contain www/somewhere/org (for example), but I would do a second pass to filter out the docs whose URL does not match the actual expression www.somewhere.org. I would need to do this second pass anyway, even if I were using a wildcard search, because I might accidentally match a URL that has www.somewhere.org somewhere other than in the host name (e.g. http://www.aplace.com/www.somewhere.org.html).

Alain

-----Original Message-----
From: Peter Karman [mailto:peter@peknet.com] 
Sent: Wednesday, February 01, 2012 9:23 PM
Cc: 'lucy-user@incubator.apache.org'
Subject: Re: [lucy-user] Can lucy do substring search?

Desilets, Alain wrote on 2/1/12 10:15 AM:
> Thx Peter. Would this incur the same performance problem as tokenizing the string on a character-by-character basis?

WildcardQuery is slower than a TermQuery. It's all at search time though,
whereas tokenizing the string on a character basis happens at index time and
search time.

Your use case will incur a performance hit no matter what. In my apps, I
tokenize substrings for only particular fields at index time, and do some term
expansion instead of wildcards using a custom lexicon at search time. IME, it's
about finding a balance in your architecture to best fit your actual use cases.
Accuracy vs. speed is one balance to find. The use case you described (finding
all docs with a field matching a particular hostname) could be accomplished with
no change in indexing or tokenizing, if you used the WildcardQuery; whether that
proves too slow depends on your requirements. Try it and see.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-user] Can lucy do substring search?

Posted by Peter Karman <pe...@peknet.com>.
Desilets, Alain wrote on 2/1/12 10:15 AM:
> Thx Peter. Would this incur the same performance problem as tokenizing the string on a character-by-character basis?

WildcardQuery is slower than a TermQuery. It's all at search time though,
whereas tokenizing the string on a character basis happens at index time and
search time.

Your use case will incur a performance hit no matter what. In my apps, I
tokenize substrings for only particular fields at index time, and do some term
expansion instead of wildcards using a custom lexicon at search time. IME, it's
about finding a balance in your architecture to best fit your actual use cases.
Accuracy vs. speed is one balance to find. The use case you described (finding
all docs with a field matching a particular hostname) could be accomplished with
no change in indexing or tokenizing, if you used the WildcardQuery; whether that
proves too slow depends on your requirements. Try it and see.
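
a toy illustration of what I mean by term expansion (made-up host list; assumes a
separate 'host' field and an existing $searcher):

---
use Lucy::Search::ORQuery;
use Lucy::Search::TermQuery;

# Instead of a wildcard like *.gc.ca, expand against a list of hosts you
# maintain yourself (your "custom lexicon") and OR the exact terms together.
my @known_hosts = qw( www.nrc-cnrc.gc.ca www.tpsgc-pwgsc.gc.ca );
my @expansions  = grep { /\.gc\.ca$/ } @known_hosts;

my $query = Lucy::Search::ORQuery->new(
    children => [
        map { Lucy::Search::TermQuery->new( field => 'host', term => $_ ) }
            @expansions
    ],
);
my $hits = $searcher->hits( query => $query );
---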

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

RE: [lucy-user] Can lucy do substring search?

Posted by "Desilets, Alain" <Al...@nrc-cnrc.gc.ca>.
Thx Peter. Would this incur the same performance problem as tokenizing the string on a character-by-character basis?

-----Original Message-----
From: Peter Karman [mailto:peter@peknet.com] 
Sent: Wednesday, February 01, 2012 9:37 AM
To: lucy-user@incubator.apache.org
Subject: Re: [lucy-user] Can lucy do substring search?

On 1/31/12 12:19 PM, Desilets, Alain wrote:
> I have a Lucy index with one field called URL.
>
> I would like to do substring searches on this field, for example, to find all the records whose URL includes http://www.somewhere.com/abc/ (i.e. all the URLs under the abc directory on that site).
>
> Is there a way to do this?
>

http://search.cpan.org/~karman/LucyX-Search-WildcardQuery-0.03/


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-user] Can lucy do substring search?

Posted by Peter Karman <pe...@peknet.com>.
On 1/31/12 12:19 PM, Desilets, Alain wrote:
> I have a Lucy index with one field called URL.
>
> I would like to do substring searches on this field, for example, to find all the records whose URL includes http://www.somewhere.com/abc/ (i.e. all the URLs under the abc directory on that site).
>
> Is there a way to do this?
>

http://search.cpan.org/~karman/LucyX-Search-WildcardQuery-0.03/
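
from the synopsis, usage is roughly this (field and term adapted to your url field; untested):

---
use Lucy::Search::IndexSearcher;
use LucyX::Search::WildcardQuery;

my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );

# Wildcard match against terms indexed in the 'url' field.
my $query = LucyX::Search::WildcardQuery->new(
    field => 'url',
    term  => 'somewhere*',
);
my $hits = $searcher->hits( query => $query );
---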


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-user] Can lucy do substring search?

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jan 31, 2012 at 01:19:38PM -0500, Desilets, Alain wrote:
> I was wondering if there was a way to tokenize the string into individual
> characters instead, and whether that is advisable from a performance point
> of view.

You can experiment with changing the 'pattern' argument to RegexTokenizer#new
to be '.' or '\\S'.  It will definitely be worse from a performance
standpoint, as matching a URL will now require a PhraseQuery with one term for
each letter rather than one term for each component matching \w+ in the URL,
and these terms will exist in virtually every document.
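
i.e. something like:

---
use Lucy::Analysis::RegexTokenizer;

# One token per character (punctuation included), so any substring of the URL
# becomes a findable phrase, at the cost of very short, very common terms.
my $char_tokenizer = Lucy::Analysis::RegexTokenizer->new( pattern => '.' );
---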

Marvin Humphrey