You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Ghinwa Choueiter <gh...@csail.mit.edu> on 2008/02/19 01:36:21 UTC

How to index word-pairs and phrases

Hi,

I am new to Lucene and have been reading the documentation. I would like to use Lucene to query a song database by lyrics. The query could potentially contain typos, or even wrong words, word contractions (can't versus cannot), etc..

I would like to create an inverted list by word pairs and possibly phrases and not just by isolated words. For example:
<w1,w2>   < d1, d10, d27>
<w2,w3>   <d2, d13>
...

OR even
<phrase 1> <d1, d3,...>
<phrase 2> <...>
...

It seems to me that, by default, the index in Lucene stores statistics for isolated words. The Lucene documentation refers to the word "Term" all the time and seems to imply that "Term" can be a word or a phrase, but I can't see how IndexWriter can read a document and index it by word pairs. 

thank you in advance for the answers and my apologies if I did not get the terminology quite right.

-Ghinwa

RE: How to index word-pairs and phrases

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Ghinwa,

o.a.l.analysis.ngram.NgramTokenizer is a *character* level n-gram filter - from text "word1 word2 word3" you get tokens "wo", "or", "rd", "d1", etc.

The ShingleFilter gives you "word1 word2" and "word2 word3".

Steve

On 02/19/2008 at 11:28 AM, Ghinwa Choueiter wrote:
> What about
> \contrib\analyzers\src\java\org\apache\lucene\analysis\ngram  ??
> 
> Does this tokenizer do what I need?
> 
> thank you,
> -Ghinwa
> 
> On Tue, 19 Feb 2008, Steven A Rowe wrote:
> 
> > Mark,
> > 
> > The ShingleFilter contrib has not been committed yet - it's still here:
> > 
> >   https://issues.apache.org/jira/browse/LUCENE-400
> > 
> > Steve
> > 
> > On 02/19/2008 at 2:33 AM, markharw00d wrote:
> > > Further to Grant's useful background - there is an analyzer
> > > specifically for multi-word terms in "contrib". See
> > > 
> > > Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle
> > > 
> > > Cheers
> > > Mark
> > > > Hi Ghinwa,
> > > > 
> > > > A Term is simply a unit of tokenization that has been indexed for a
> > > > Field, produced by a TokenStream.   In the demo, on the main site,
> > > > this can be seen in the file called IndexFiles.java on line 56:
> > > > IndexWriter writer = new IndexWriter(INDEX_DIR, new
> > > > StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
> > > > 
> > > > The key being that the StandardAnalyzer is used to get a TokenStream.
> > > > The TokenStream.next() method returns a Token.  All a Term is, is a
> > > > Token that has been indexed for a given Field.  In a nutshell,
> > > > though, a Term is whatever you want it to be, based on your Analyzer.
> > > >  It could be a whole document as a single string or it could be a
> > > > string containing a single character.  In reality, it usually
> > > > corresponds to a single word.
> > > > 
> > > > Have a look at the wiki, also, as there are many talks and articles
> > > > explaining all of this.  Lucene In Action, while outdated, is also an
> > > > excellent book for explaining this stuff (with the exception that
> > > > some of the code examples no longer work on the latest Lucene version)
> > > > 
> > > > Getting beyond that, you will need to look into adding spell checking
> > > > (in Lucene's "contrib" area) and other features like stemming and
> > > > synonyms to handle the various issues you bring up.
> > > > 
> > > > Hope that helps,
> > > > Grant
> > > > 
> > > > On Feb 18, 2008, at 7:36 PM, Ghinwa Choueiter wrote:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > I am new to Lucene and have been reading the documentation. I would
> > > > > like to use Lucene to query a song database by lyrics. The query
> > > > > could potentially contain typos, or even wrong words, word
> > > > > contractions (can't versus cannot), etc..
> > > > > 
> > > > > I would like to create an inverted list by word pairs and possibly
> > > > > phrases and not just by isolated words. For example: <w1,w2>   < d1,
> > > > > d10, d27> <w2,w3>   <d2, d13> ...
> > > > > 
> > > > > OR even
> > > > > <phrase 1> <d1, d3,...>
> > > > > <phrase 2> <...>
> > > > > ...
> > > > > 
> > > > > It seems to me that, by default, the index in Lucene stores
> > > > > statistics for isolated words. The Lucene documentation refers to
> > > > > the word "Term" all the time and seems to imply that "Term" can be a
> > > > > word or a phrase, but I can't see how IndexWriter can read a
> > > > > document and index it by word pairs.
> > > > > 
> > > > > thank you in advance for the answers and my apologies if I did not
> > > > > get the terminology quite right.
> > > > > 
> > > > > -Ghinwa
> > > > 
> > > > --------------------------
> > > > Grant Ingersoll
> > > > http://lucene.grantingersoll.com
> > > > http://www.lucenebootcamp.com
> > > > 
> > > > Lucene Helpful Hints:
> > > > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > > > http://wiki.apache.org/lucene-java/LuceneFAQ


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: How to index word-pairs and phrases

Posted by Ghinwa Choueiter <gh...@csail.mit.edu>.
What about
\contrib\analyzers\src\java\org\apache\lucene\analysis\ngram  ??

Does this tokenizer do what I need?

thank you,
-Ghinwa

On Tue, 19 Feb 2008, Steven A Rowe wrote:

> Mark,
>
> The ShingleFilter contrib has not been committed yet - it's still here:
>
>   https://issues.apache.org/jira/browse/LUCENE-400
>
> Steve
>
> On 02/19/2008 at 2:33 AM, markharw00d wrote:
>> Further to Grant's useful background - there is an analyzer specifically
>> for multi-word terms in "contrib". See
>> Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle
>>
>> Cheers
>> Mark
>>> Hi Ghinwa,
>>>
>>> A Term is simply a unit of tokenization that has been indexed for a
>>> Field, produced by a TokenStream.   In the demo, on the main site,
>>> this can be seen in the file called IndexFiles.java on line 56:
>>> IndexWriter writer = new IndexWriter(INDEX_DIR, new
>>> StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
>>>
>>> The key being that the StandardAnalyzer is used to get a TokenStream.
>>> The TokenStream.next() method returns a Token.  All a Term is, is a
>>> Token that has been indexed for a given Field.  In a nutshell, though,
>>> a Term is whatever you want it to be, based on your Analyzer.  It could
>>> be a whole document as a single string or it could be a string
>>> containing a single character.  In reality, it usually corresponds to a
>>> single word.
>>>
>>> Have a look at the wiki, also, as there are many talks and articles
>>> explaining all of this.  Lucene In Action, while outdated, is also an
>>> excellent book for explaining this stuff (with the exception that some
>>> of the code examples no longer work on the latest Lucene version)
>>>
>>> Getting beyond that, you will need to look into adding spell checking
>>> (in Lucene's "contrib" area) and other features like stemming and
>>> synonyms to handle the various issues you bring up.
>>>
>>> Hope that helps,
>>> Grant
>>>
>>> On Feb 18, 2008, at 7:36 PM, Ghinwa Choueiter wrote:
>>>
>>>> Hi,
>>>>
>>>> I am new to Lucene and have been reading the documentation. I would
>>>> like to use Lucene to query a song database by lyrics. The query could
>>>> potentially contain typos, or even wrong words, word contractions
>>>> (can't versus cannot), etc..
>>>>
>>>> I would like to create an inverted list by word pairs and possibly
>>>> phrases and not just by isolated words. For example:
>>>> <w1,w2>   < d1, d10, d27>
>>>> <w2,w3>   <d2, d13>
>>>> ...
>>>>
>>>> OR even
>>>> <phrase 1> <d1, d3,...>
>>>> <phrase 2> <...>
>>>> ...
>>>>
>>>> It seems to me that, by default, the index in Lucene stores statistics
>>>> for isolated words. The Lucene documentation refers to the word "Term"
>>>> all the time and seems to imply that "Term" can be a word or a phrase,
>>>> but I can't see how IndexWriter can read a document and index it by
>>>> word pairs.
>>>>
>>>> thank you in advance for the answers and my apologies if I did not
>>>> get the terminology quite right.
>>>>
>>>> -Ghinwa
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://lucene.grantingersoll.com
>>> http://www.lucenebootcamp.com
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
>>> additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>
>>
>>
>> --------------------------------------------------------------------- To
>> unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
>> additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: How to index word-pairs and phrases

Posted by Steven A Rowe <sa...@syr.edu>.
Mark,

The ShingleFilter contrib has not been committed yet - it's still here:

   https://issues.apache.org/jira/browse/LUCENE-400

Steve

On 02/19/2008 at 2:33 AM, markharw00d wrote:
> Further to Grant's useful background - there is an analyzer specifically
> for multi-word terms in "contrib". See
> Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle
> 
> Cheers
> Mark
> > Hi Ghinwa,
> > 
> > A Term is simply a unit of tokenization that has been indexed for a
> > Field, produced by a TokenStream.   In the demo, on the main site,
> > this can be seen in the file called IndexFiles.java on line 56:
> > IndexWriter writer = new IndexWriter(INDEX_DIR, new
> > StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
> > 
> > The key being that the StandardAnalyzer is used to get a TokenStream.
> > The TokenStream.next() method returns a Token.  All a Term is, is a
> > Token that has been indexed for a given Field.  In a nutshell, though,
> > a Term is whatever you want it to be, based on your Analyzer.  It could
> > be a whole document as a single string or it could be a string
> > containing a single character.  In reality, it usually corresponds to a
> > single word.
> > 
> > Have a look at the wiki, also, as there are many talks and articles
> > explaining all of this.  Lucene In Action, while outdated, is also an
> > excellent book for explaining this stuff (with the exception that some
> > of the code examples no longer work on the latest Lucene version)
> > 
> > Getting beyond that, you will need to look into adding spell checking
> > (in Lucene's "contrib" area) and other features like stemming and
> > synonyms to handle the various issues you bring up.
> > 
> > Hope that helps,
> > Grant
> > 
> > On Feb 18, 2008, at 7:36 PM, Ghinwa Choueiter wrote:
> > 
> > > Hi,
> > > 
> > > I am new to Lucene and have been reading the documentation. I would
> > > like to use Lucene to query a song database by lyrics. The query could
> > > potentially contain typos, or even wrong words, word contractions
> > > (can't versus cannot), etc..
> > > 
> > > I would like to create an inverted list by word pairs and possibly
> > > phrases and not just by isolated words. For example:
> > > <w1,w2>   < d1, d10, d27>
> > > <w2,w3>   <d2, d13>
> > > ...
> > > 
> > > OR even
> > > <phrase 1> <d1, d3,...>
> > > <phrase 2> <...>
> > > ...
> > > 
> > > It seems to me that, by default, the index in Lucene stores statistics
> > > for isolated words. The Lucene documentation refers to the word "Term"
> > > all the time and seems to imply that "Term" can be a word or a phrase,
> > > but I can't see how IndexWriter can read a document and index it by
> > > word pairs.
> > > 
> > > thank you in advance for the answers and my apologies if I did not
> > > get the terminology quite right.
> > > 
> > > -Ghinwa
> > 
> > --------------------------
> > Grant Ingersoll
> > http://lucene.grantingersoll.com
> > http://www.lucenebootcamp.com
> > 
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ
> > 
> > 
> > 
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
> > additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> > 
> > 
> 
> 
> 
> --------------------------------------------------------------------- To
> unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
> additional commands, e-mail: java-user-help@lucene.apache.org
> 
>

 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to index word-pairs and phrases

Posted by markharw00d <ma...@yahoo.co.uk>.
Further to Grant's useful background - there is an analyzer specifically 
for multi-word terms in "contrib".
See Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle

Cheers
Mark
> Hi Ghinwa,
>
> A Term is simply a unit of tokenization that has been indexed for a 
> Field, produced by a TokenStream.   In the demo, on the main site, 
> this can be seen in the file called IndexFiles.java on line 56:
> IndexWriter writer = new IndexWriter(INDEX_DIR, new 
> StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
>
> The key being that the StandardAnalyzer is used to get a TokenStream.  
> The TokenStream.next() method returns a Token.  All a Term is, is a 
> Token that has been indexed for a given Field.  In a nutshell, though, 
> a Term is whatever you want it to be, based on your Analyzer.  It 
> could be a whole document as a single string or it could be a string 
> containing a single character.  In reality, it usually corresponds to 
> a single word.
>
> Have a look at the wiki, also, as there are many talks and articles 
> explaining all of this.  Lucene In Action, while outdated, is also an 
> excellent book for explaining this stuff (with the exception that some 
> of the code examples no longer work on the latest Lucene version)
>
> Getting beyond that, you will need to look into adding spell checking 
> (in Lucene's "contrib" area) and other features like stemming and 
> synonyms to handle the various issues you bring up.
>
> Hope that helps,
> Grant
>
> On Feb 18, 2008, at 7:36 PM, Ghinwa Choueiter wrote:
>
>> Hi,
>>
>> I am new to Lucene and have been reading the documentation. I would 
>> like to use Lucene to query a song database by lyrics. The query 
>> could potentially contain typos, or even wrong words, word 
>> contractions (can't versus cannot), etc..
>>
>> I would like to create an inverted list by word pairs and possibly 
>> phrases and not just by isolated words. For example:
>> <w1,w2>   < d1, d10, d27>
>> <w2,w3>   <d2, d13>
>> ...
>>
>> OR even
>> <phrase 1> <d1, d3,...>
>> <phrase 2> <...>
>> ...
>>
>> It seems to me that, by default, the index in Lucene stores 
>> statistics for isolated words. The Lucene documentation refers to the 
>> word "Term" all the time and seems to imply that "Term" can be a word 
>> or a phrase, but I can't see how IndexWriter can read a document and 
>> index it by word pairs.
>>
>> thank you in advance for the answers and my apologies if I did not 
>> get the terminology quite right.
>>
>> -Ghinwa
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
> http://www.lucenebootcamp.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to index word-pairs and phrases

Posted by Grant Ingersoll <gs...@apache.org>.
Hi Ghinwa,

A Term is simply a unit of tokenization that has been indexed for a  
Field, produced by a TokenStream.   In the demo, on the main site,  
this can be seen in the file called IndexFiles.java on line 56:
IndexWriter writer = new IndexWriter(INDEX_DIR, new  
StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);

The key being that the StandardAnalyzer is used to get a TokenStream.   
The TokenStream.next() method returns a Token.  All a Term is, is a  
Token that has been indexed for a given Field.  In a nutshell, though,  
a Term is whatever you want it to be, based on your Analyzer.  It  
could be a whole document as a single string or it could be a string  
containing a single character.  In reality, it usually corresponds to  
a single word.

Have a look at the wiki, also, as there are many talks and articles  
explaining all of this.  Lucene In Action, while outdated, is also an  
excellent book for explaining this stuff (with the exception that some  
of the code examples no longer work on the latest Lucene version)

Getting beyond that, you will need to look into adding spell checking  
(in Lucene's "contrib" area) and other features like stemming and  
synonyms to handle the various issues you bring up.

Hope that helps,
Grant

On Feb 18, 2008, at 7:36 PM, Ghinwa Choueiter wrote:

> Hi,
>
> I am new to Lucene and have been reading the documentation. I would  
> like to use Lucene to query a song database by lyrics. The query  
> could potentially contain typos, or even wrong words, word  
> contractions (can't versus cannot), etc..
>
> I would like to create an inverted list by word pairs and possibly  
> phrases and not just by isolated words. For example:
> <w1,w2>   < d1, d10, d27>
> <w2,w3>   <d2, d13>
> ...
>
> OR even
> <phrase 1> <d1, d3,...>
> <phrase 2> <...>
> ...
>
> It seems to me that, by default, the index in Lucene stores  
> statistics for isolated words. The Lucene documentation refers to  
> the word "Term" all the time and seems to imply that "Term" can be a  
> word or a phrase, but I can't see how IndexWriter can read a  
> document and index it by word pairs.
>
> thank you in advance for the answers and my apologies if I did not  
> get the terminology quite right.
>
> -Ghinwa

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org