Posted to general@lucene.apache.org by poeta simbolista <po...@gmail.com> on 2009/10/26 15:29:34 UTC

Solution for unwanted ngrams

Hi,

Imagine you have a text:
"Apartment not for sale".
and another:
"Sale! Apartment for rent"
Search query: "Apartment for sale".
This query will score both of the texts above highly. I would
like to know how I could tackle this issue better with Lucene. My
ideas:
 - Recognise certain phrases such as "not for sale" as different from
"for sale". That is, invalidate "for sale" if it is preceded by "not".
How could I do this?
 - Recognise "sale" only if preceded by "for", since the second meaning
(bargain vs. something for sale) is tricky.
 - Rewrite "sale" as "for sale", grouped in the query (produce "-sale
+(for sale)"). Wouldn't that query exclude documents that contain the
"sale" term? How else could this be achieved with Lucene?

Should this be tackled only by preprocessing the data before it reaches
the index? Ideally I would like to preserve the original text in the index.

Thanks a lot in advance
 Diego

Re: Solution for unwanted ngrams

Posted by Ted Dunning <te...@gmail.com>.

This sort of fine distinction probably requires user feedback.  If the
idioms are highly distinctive, then a learning system that is highly
resistant to over-fitting could be used to learn a query that includes
phrasal components like "not for sale" and such.
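
As a rough illustration of what a phrasal component with an exclusion
means, here is a toy sketch in plain Python. It is not Lucene code: the
function names (tokenize, phrase_positions, matches_unless_preceded) are
made up for this example, and it only mimics positional matching on token
lists. In Lucene itself, span queries (SpanNearQuery, SpanNotQuery) are
probably the closest built-in analogue, but that mapping is an assumption
and not something established in this thread.

```python
import re

def tokenize(text):
    """Lowercase and split on non-letters, keeping token order (positions)."""
    return re.findall(r"[a-z]+", text.lower())

def phrase_positions(tokens, phrase):
    """Start positions where `phrase` (a list of tokens) occurs contiguously."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def matches_unless_preceded(text, phrase, blocker):
    """True if `phrase` occurs at least once without `blocker` right before it."""
    tokens = tokenize(text)
    return any(i == 0 or tokens[i - 1] != blocker
               for i in phrase_positions(tokens, phrase))

docs = ["Apartment not for sale", "Sale! Apartment for rent", "Apartment for sale"]
hits = [d for d in docs if matches_unless_preceded(d, ["for", "sale"], "not")]
```

Only the third document survives: the first is blocked by "not", and the
second never contains "for sale" as a phrase.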

If you have to find more flexible phrases that are based on synonymic
substitutions, then you should look at techniques like random indexing or
LSA or LDA so that you can express the phrases you extract from training
documents in terms of more general semantic components.  Sparse random
indexing is probably the easiest to apply to a term-based retrieval system
such as Lucene.
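
For readers who have not met random indexing, a minimal sketch follows,
again in plain Python and independent of Lucene. Each term gets a fixed
sparse vector of a few +1/-1 entries, and a term's context vector is the
sum of the index vectors of its neighbours. DIM, NONZERO, the window size
and all function names are arbitrary choices made for this example, not
recommended settings.

```python
import random
from collections import defaultdict

DIM = 512      # dimensionality of the reduced space (arbitrary choice)
NONZERO = 8    # non-zero entries per index vector (arbitrary choice)

def index_vector(term):
    """Sparse ternary vector for a term: {position: +1 or -1}, seeded by the term."""
    rng = random.Random(term)
    positions = rng.sample(range(DIM), NONZERO)
    return {p: rng.choice((-1, 1)) for p in positions}

def context_vectors(docs, window=2):
    """Sum the index vectors of neighbours within `window` tokens of each term."""
    ctx = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        tokens = doc.lower().split()
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    for p, v in index_vector(tokens[j]).items():
                        ctx[term][p] += v
    return ctx

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(p, 0) for p, v in a.items())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0
```

Terms that appear in similar contexts (say "flat" and "apartment", both
followed by "for sale") end up with similar context vectors, which is what
lets a phrase be matched through a synonym.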

Here is one effective learning system:
http://www.aclweb.org/anthology/P/P08/P08-2059.pdf
http://www.cs.jhu.edu/~mdredze/publications/icml_variance.pdf

To summarize, what I would recommend is something like this:

step 0: create a Lucene index with positional and, optionally, semantic
information such as from sparse random indexing
step 1: take user input to retrieve a sample set of documents
step 2: let the user judge some of these documents as relevant or not
step 3: extract possible features such as terms, phrases, semantic phrases
and so on from the sample documents
step 4: run the learning algorithm on the judged documents
step 5: repeat starting at step 1, but now with an augmented query that
includes a post-scoring phase
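
Steps 3 to 5 can be sketched roughly as below. This is a toy in plain
Python, with retrieval and the user interface left out. The learner is a
plain perceptron, swapped in only because it is short; it is not the
method from the papers linked above. The function names (features, train,
rescore) and the n-gram feature set are assumptions made for the example.

```python
import re
from collections import defaultdict

def features(text, max_n=3):
    """Step 3: terms and phrases (word n-grams up to max_n) as a feature set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

def train(judged, epochs=10):
    """Step 4: plain perceptron over (text, relevant) pairs judged by the user."""
    w = defaultdict(float)
    for _ in range(epochs):
        for text, relevant in judged:
            y = 1 if relevant else -1
            feats = features(text)
            if y * sum(w[f] for f in feats) <= 0:   # misclassified: update
                for f in feats:
                    w[f] += y
    return w

def rescore(candidates, w):
    """Step 5: post-score retrieved documents with the learned weights."""
    def score(doc):
        return sum(w.get(f, 0.0) for f in features(doc))
    return sorted(candidates, key=score, reverse=True)

# Step 2 stand-in: judgments a user might give for the example in this thread.
judged = [("Apartment for sale", True),
          ("Apartment not for sale", False),
          ("Sale! Apartment for rent", False)]
weights = train(judged)
ranked = rescore(["Apartment not for sale", "Sale! Apartment for rent",
                  "Apartment for sale"], weights)
```

With these three judgments the phrase feature "not for sale" ends up with
a negative weight, so the post-scoring pass pushes "Apartment for sale" to
the top even though all three documents share most of their terms.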





-- 
Ted Dunning, CTO
DeepDyve