Posted to java-user@lucene.apache.org by Manuel Le Normand <ma...@gmail.com> on 2014/03/12 10:27:10 UTC

Indexing useful N-grams (phrases & entities) and adding payloads

Hi,
I posted this question on the Solr mailing list but it has more to do with
Lucene.

I have a performance and scoring problem for phrase queries

   1. Performance - phrase queries involving frequent terms are very slow
   due to reading large position postings lists.
   2. Scoring - I want to control the boost of phrase and entity matches
   (entities coming from gazetteers).

Indexing all terms as both bi-grams and unigrams is not possible in my use
case, so I plan to index only the useful bi-grams. Part of this will be
achieved by the CommonGramsFilter, configured with the frequent words.
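
For example, a minimal sketch of that part of the chain (recent Lucene API;
the common-words list below is only an illustration, the real one would come
from a frequency analysis of the corpus):

import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

class CommonGramsChain {
  static TokenStream build() {
    // Illustrative common words only; in practice loaded from a frequency list.
    CharArraySet commonWords = new CharArraySet(Arrays.asList("the", "of", "new"), true);

    Tokenizer tokenizer = new WhitespaceTokenizer();
    // Emits the original unigrams plus a bigram for every pair that contains a
    // common word, e.g. "the" followed by "house" also produces "the_house".
    return new CommonGramsFilter(tokenizer, commonWords);
  }
}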

I am thinking of going a step further and indexing phrase queries (extracted
from my query log) and entities (from gazetteers). In order to control the
boost on these N-gram matches I plan to add payloads to these terms.

I'm thinking of two different implementations:

   1. Using MappingCharFilter - the mapping.txt would be

#phrase-query

term1 term2 term3 => term1_term2_term3|1

#entity

firstName lastName => firstName_lastName|2


Very simple to implement, but a possible issue is that I have 100k-1M such
phrases/entities (depending on the frequency threshold). I saw that
MappingCharFilter's map is implemented as an FST, so I'm not concerned about
the memory footprint, but I am concerned that iterating over the character
buffer for long documents might cause problems.
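
To make option 1 concrete, a rough sketch against a recent Lucene API (the
entries are placeholders, and the trailing |1 / |2 would only matter if a
DelimitedPayloadTokenFilter were added further down the chain):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class PhraseMappingAnalyzer extends Analyzer {
  private final NormalizeCharMap charMap;

  public PhraseMappingAnalyzer() {
    // Built once; the map is FST-backed and immutable, so the memory footprint
    // stays modest even with hundreds of thousands of entries.
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("term1 term2 term3", "term1_term2_term3|1");   // phrase from the query log
    builder.add("firstName lastName", "firstName_lastName|2"); // entity from a gazetteer
    this.charMap = builder.build();
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Rewrites the raw character stream before tokenization.
    return new MappingCharFilter(charMap, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // A DelimitedPayloadTokenFilter could be appended here to turn the trailing
    // "|1" / "|2" into a payload, if payloads are still wanted.
    return new TokenStreamComponents(source);
  }
}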

2. Using the ShingleFilter - customizing it to compare its output against my
gazetteers. This would require an FST lookup inside the TokenFilter.
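
A rough sketch of the kind of filter I mean (gazetteer lookup as a plain Set
here instead of an FST to keep it short, and assuming the upstream
ShingleFilter sets the position length of the shingles it emits):

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

/** Keeps unigrams, and only those shingles that appear in the gazetteer. */
public final class GazetteerShingleFilter extends TokenFilter {
  private final Set<String> gazetteer;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);

  public GazetteerShingleFilter(TokenStream input, Set<String> gazetteer) {
    super(input); // input is expected to be a ShingleFilter emitting unigrams + shingles
    this.gazetteer = gazetteer;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      // Unigrams (position length 1) pass through untouched; shingles survive
      // only if the gazetteer contains them.
      if (posLenAtt.getPositionLength() == 1 || gazetteer.contains(termAtt.toString())) {
        return true;
      }
    }
    return false;
  }
}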


Will I get a quick win with option 1? How hard would implementing option 2 be?

General question: is this N-gram + payload approach a common practice?

Thanks in advance,
Manuel

Re: Indexing useful N-grams (phrases & entities) and adding payloads

Posted by Manuel Le Normand <ma...@gmail.com>.
SynonymFilter makes sense.

The planned payloads are indeed not needed. I guess a better solution would
be to turn the boost into a query-time attribute that the query parser
consumes in order to boost these n-gram terms.
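
Something along these lines, as a sketch only (recent Lucene API; the field
name and the source of the boost value are made up here):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class NGramQueryBuilder {
  // Called by the query parser when it recognizes an indexed n-gram: instead of
  // a slow PhraseQuery it emits a boosted TermQuery on the joined term.
  static Query build(String field, String joinedNGram, float boost) {
    Query ngram = new BoostQuery(new TermQuery(new Term(field, joinedNGram)), boost);
    return new BooleanQuery.Builder()
        .add(ngram, BooleanClause.Occur.SHOULD)
        .build();
  }
}

// e.g. build("body", "term1_term2_term3", 2.0f)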

Thanks for the hints.
Manuel


Re: Indexing useful N-grams (phrases & entities) and adding payloads

Posted by Michael McCandless <lu...@mikemccandless.com>.
You could also use SynonymFilter?
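
E.g. something like this would compile the same rules into an FST-backed
SynonymMap (untested sketch against a recent Lucene; the rules would really
come from your query log / gazetteer files):

import java.io.IOException;
import java.io.StringReader;
import java.text.ParseException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.synonym.SolrSynonymParser;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;

class PhraseSynonyms {
  static SynonymMap build() throws IOException, ParseException {
    // Same rules as the mapping.txt, expressed in the Solr synonyms format.
    SolrSynonymParser parser = new SolrSynonymParser(true, true, new WhitespaceAnalyzer());
    parser.parse(new StringReader(
        "term1 term2 term3 => term1_term2_term3\n"
        + "firstName lastName => firstName_lastName\n"));
    return parser.build();
  }

  static TokenStream wrap(TokenStream input, SynonymMap synonyms) {
    // Applied after the tokenizer; the map itself is FST-backed, so a large
    // rule set is not a memory problem.
    return new SynonymFilter(input, synonyms, true);
  }
}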

Why does the boost need to be encoded in the index (in a payload) vs
at query time when you create the TermQuery for that term?  Does the
boost vary depending on the surrounding context / document?

Mike McCandless

http://blog.mikemccandless.com

