Posted to java-user@lucene.apache.org by Ryan Detzel <ry...@gmail.com> on 2008/07/23 21:30:32 UTC

Using lucene to search a bunch of keywords?

Everything I've read and seen about Lucene is about searching for keywords in
documents; I want to do the reverse. I have a huge list of
keywords ("big boy", "red ball", "computer") and I have phrases that I
want to check for those keywords. For example, using the small
keyword list above (stored as documents in Lucene), what's the best
approach to pass in a query "the girl likes red balls" and have it
match the keyword "red ball"?

Thanks.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Using lucene to search a bunch of keywords?

Posted by Robert Stewart <Ro...@INFONGEN.COM>.
You need to invert the process. Lucene may not be the best option. You need to treat your document as a key into an index of keywords. I've done the same thing, but not with Lucene. You pass through the document and, for each word (token), look it up in some index (a hashtable) to find the keywords or phrases that start with that word, see which ones match, and then continue through the document. You may be able to use Lucene to do this, but I am not sure it is the right tool.

What I did was create a hashtable keyed by the first word of each keyword I want to match. The value for each hashtable entry is a list of all keywords starting with that word, sorted by length in descending order. When I see a word in the document that has one or more keywords starting with it, I enumerate the sorted list, attempting to match the longer ones first. If there is a match, I continue processing the document from the end of that match. I only have about 1 million keywords in total, so I just load the entire thing into an in-memory hashtable and use my own document tokenizer. I don't use Lucene at all for that.

For example, I might have keywords:

Key:"apple"
List: (sorted by length in descending order):
"apple computer company"
"apple computer inc"
"apple computer"
"apple pie"
"apple"

Then if I have a document:

"I love apple pie, but not apple computer"

It finds "apple" twice: the first occurrence matches "apple pie", and the second matches "apple computer".
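
The hashtable approach described above could be sketched roughly as follows. This is only an illustration, not the code referred to in this message: the class name FirstWordKeywordIndex, the naive tokenizer (lowercase, split on non-alphanumerics), and sorting by token count rather than by character length are all assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: index keywords by their first token, try longest first.
class FirstWordKeywordIndex {
    // first token of a keyword -> all keywords starting with it, longest first
    private final Map<String, List<String[]>> byFirstWord = new HashMap<>();

    FirstWordKeywordIndex(List<String> keywords) {
        for (String keyword : keywords) {
            String[] tokens = tokenize(keyword);
            if (tokens.length == 0) continue;
            byFirstWord.computeIfAbsent(tokens[0], k -> new ArrayList<>()).add(tokens);
        }
        // Assumption: "length" is measured in tokens, so longer phrases win.
        Comparator<String[]> longestFirst =
                Comparator.comparingInt((String[] t) -> t.length).reversed();
        for (List<String[]> bucket : byFirstWord.values()) {
            bucket.sort(longestFirst);
        }
    }

    // Naive tokenizer (an assumption): lowercase, split on non-alphanumerics.
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase().replaceAll("[^a-z0-9]+", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    // Walks the document once; after a match, resumes at the end of the match.
    List<String> findKeywords(String document) {
        String[] doc = tokenize(document);
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < doc.length) {
            String[] hit = null;
            for (String[] candidate : byFirstWord.getOrDefault(doc[i], List.of())) {
                if (matchesAt(doc, i, candidate)) {
                    hit = candidate;
                    break; // the bucket is sorted, so this is the longest match
                }
            }
            if (hit != null) {
                matches.add(String.join(" ", hit));
                i += hit.length;
            } else {
                i++;
            }
        }
        return matches;
    }

    private static boolean matchesAt(String[] doc, int start, String[] candidate) {
        if (start + candidate.length > doc.length) return false;
        for (int j = 0; j < candidate.length; j++) {
            if (!doc[start + j].equals(candidate[j])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        FirstWordKeywordIndex index = new FirstWordKeywordIndex(List.of(
                "apple", "apple pie", "apple computer",
                "apple computer inc", "apple computer company"));
        // prints [apple pie, apple computer]
        System.out.println(index.findKeywords("I love apple pie, but not apple computer"));
    }
}
```

Each document token costs one hashtable lookup plus a scan of the (usually short) bucket for that first word, so the work grows with the document length rather than with the total number of keywords.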


If you need to use Lucene, you could try parsing your document into a query, and then issue that query (as a big Boolean OR query) to a Lucene index containing your keywords, and then enumerate the matches. But unless you have a lot of keywords to index, it probably doesn't make sense to use Lucene for that.



RE: Using lucene to search a bunch of keywords?

Posted by Steven A Rowe <sa...@syr.edu>.
On 07/23/2008 at 5:09 PM, Steven A Rowe wrote:
> Karl Wettin's recently committed ShingleMatrixAnalyzer

Oops, "ShingleMatrixAnalyzer" -> "ShingleMatrixFilter". 

Steve



RE: Using lucene to search a bunch of keywords?

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Ryan,

Well, at 100 million+ keywords, Lucene might be the right tool.

One thing that you might check out for the query side is Karl Wettin's recently committed ShingleMatrixAnalyzer (not in any Lucene release yet - only on the trunk).

The JUnit test class TestShingleMatrixFilter has an example of splitting an input string into "shingles" (a.k.a. "token n-grams") - in this example the input string "please divide this sentence into shingles" is converted into the following terms by requesting a minimum shingle size of one token and a maximum of two tokens, and using the space character to join the tokens together:

"please", "please divide", "divide", "divide this", "this", "this sentence", "sentence",  "sentence into", "into", "into shingles", "shingles"

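To make the shingle idea concrete, here is a small plain-Java sketch that produces the same terms as the list above. It does not use the Lucene filter itself; the class and method names are made up for illustration, and it assumes whitespace tokenization with a single space as the joiner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a shingle filter: emits token n-grams.
class Shingles {
    // Returns every run of minSize..maxSize consecutive tokens, joined by a space.
    // At each start position the shorter shingles are emitted first.
    static List<String> shingles(String input, int minSize, int maxSize) {
        List<String> result = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty()) return result;
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start++) {
            for (int size = minSize; size <= maxSize && start + size <= tokens.length; size++) {
                result.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + size)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [please, please divide, divide, divide this, this, this sentence,
        //         sentence, sentence into, into, into shingles, shingles] on one line
        System.out.println(shingles("please divide this sentence into shingles", 1, 2));
    }
}
```
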
You could index your keywords list as-is with no tokenization; break up your queries using a WhitespaceTokenizer connected to a ShingleMatrixFilter, with the minimum shingle size set to one and the maximum set to the number of tokens in the keyword with the most tokens; and then build a BooleanQuery with one clause per shingle, each set to BooleanClause.Occur.SHOULD.
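
The logic of that suggestion can be sketched without Lucene. In the sketch below, a HashSet stands in for the untokenized keyword index, and one exact lookup per shingle stands in for the BooleanQuery of SHOULD clauses. The class name and the lowercasing are assumptions, and a real index would be needed at the 100 million+ scale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in: keywords stored whole, queries broken into shingles.
class ShingleKeywordLookup {
    private final Set<String> keywords = new HashSet<>();
    private int maxTokens = 1; // token count of the longest keyword

    ShingleKeywordLookup(Collection<String> keywordList) {
        for (String keyword : keywordList) {
            String[] tokens = keyword.trim().toLowerCase().split("\\s+");
            String normalized = String.join(" ", tokens);
            if (normalized.isEmpty()) continue;
            keywords.add(normalized);
            maxTokens = Math.max(maxTokens, tokens.length);
        }
    }

    // Returns the keywords that occur in the query as exact token runs.
    List<String> match(String query) {
        String[] tokens = query.trim().toLowerCase().split("\\s+");
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            for (int size = 1; size <= maxTokens && start + size <= tokens.length; size++) {
                String shingle = String.join(" ", Arrays.copyOfRange(tokens, start, start + size));
                if (keywords.contains(shingle)) {
                    hits.add(shingle); // the equivalent of one SHOULD clause matching
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ShingleKeywordLookup lookup =
                new ShingleKeywordLookup(List.of("big boy", "red ball", "computer"));
        System.out.println(lookup.match("my computer is a red ball")); // prints [computer, red ball]
        System.out.println(lookup.match("the girl likes red balls"));  // prints []
    }
}
```

Note that exact shingle lookup does not match "red balls" against "red ball"; handling that case would need some normalization (for example stemming) applied to both the keywords and the query.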

Steve



Re: Using lucene to search a bunch of keywords?

Posted by Ryan D <ry...@gmail.com>.
Heh, actually I'm using Perl, but I've always associated text search
with Lucene; I'm not sure if it's the best solution or not. On the
small side there are 1.6 million keywords; on the large side there are
well over 100 million, but I might find another way to break down the
searches into smaller searches (send A-G to server1, H-R to server2, etc.).

Is there another search tool that might be better suited for
this? The only thing I can relate this to is how AdWords works: a
user enters a query in the Google search box, and they search their
database for people who've purchased those keywords to find the
appropriate ads. What I'm doing is similar but without the payday. :-{

Currently I'm using a (huge) hash table and regular expressions
($query =~ /$keyword/), going down the list from largest to smallest,
but I know this is not a long-term solution, especially if I have to
load the large 100 million+ list in.

Thanks.




RE: Using lucene to search a bunch of keywords?

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Ryan,

I'm not sure Lucene's the right tool for this job.

I have used regular expressions and ternary search trees in the past to do similar things.

Is the set of keywords too large for an in-memory solution like these?  If not, consider using a tool like the Perl package Regex::PreSuf <http://search.cpan.org/dist/Regex-PreSuf/> - it can convert a list of strings into a compact set of alternations, which you can then import into a Java program.  (I'm not aware of any similar Java tools.)
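
For a small keyword list, the regular-expression route can be sketched directly in Java. The sketch below builds a plain, uncompacted alternation; it does not reproduce the prefix/suffix compaction that Regex::PreSuf performs. The class and method names are made up, and case-insensitive whole-word matching is an assumption.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: one big alternation over all keywords.
class KeywordAlternation {
    static Pattern compile(Collection<String> keywords) {
        if (keywords.isEmpty()) {
            throw new IllegalArgumentException("at least one keyword is required");
        }
        String alternation = keywords.stream()
                // Longest first, so a longer keyword wins over one that is its prefix.
                .sorted(Comparator.comparingInt((String s) -> s.length()).reversed())
                .map(Pattern::quote) // treat each keyword as a literal
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    // Returns the matched text for every non-overlapping keyword occurrence.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern pattern = compile(List.of("big boy", "red ball", "computer"));
        // prints [Red Ball, computer]
        System.out.println(findAll(pattern, "The girl likes a Red Ball and a computer"));
    }
}
```

An alternation like this is only practical while the whole keyword set fits comfortably in memory.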

Steve
