You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Enis Soztutar <en...@gmail.com> on 2007/05/25 16:43:54 UTC

multiple tokens at the same position

Hi,

In nutch we have a use case in which we need to store tokens with their 
original text plus their stemmed form plus their canonical form(through 
some asciifization). From my understanding of lucene, it makes sense to 
write a tokenstream which generates several tokens for each "word", but 
place all the tokens for the "word" at the same position with 
Token#setPositionIncrement(0).
This way we could be able to search over this field using any 
form(stemmed, canonical, original) of the "word". Actually i have two 
questions here. First is that is there any way to avoid matching stemmed 
or canonical forms to a phrase query. Moreover it seems that adding 
multiple forms of the "word"s alters statistical calculations for 
scoring, especially for tf and idf, because the frequency of the root 
form of the word is incremented at each word with that root form. Is 
there any way that we could avoid it?



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: multiple tokens at the same position

Posted by Mark Miller <ma...@gmail.com>.

Another (obvious) option is to use two indexes and direct the query to the
appropriate index depending on the search specification. Of course you
double your space requirements, but your basically going to do that anyway
if you use two fields. I chose this for the slight benefit of fewer fields
on the index of interest (I need my norms and have *plenty* of fields).

My first approach was to just put the stemmed form at the same position as
the unstemmed form, but I did not like the options involved in choosing to
search just stemmed or just unstemmed. In the end I decided the space
savings where just not worth the hassle.

- Mark

On 5/25/07, Enis Soztutar <en...@gmail.com> wrote:
>
> On 5/25/07, Chris Hostetter <ho...@fucit.org> wrote:
> >
> >
> > : Yes, indeed we could but it brings other problems, for example
> > increasing
> > : the index size, and extending the query to search for multiple fields,
> > etc.
> >
> > 1) if you index both teh raw and stemmed forms your index is going to
> grow
> > to roughly the same size regardless of wether the stem and the arw are
> in
> > teh same field or differnet fields.
>
>
> Well using a single field i think we do not need to store stemmed form of
> the word if it is indeed the same with the original, so that the index
> will
> not grow two times the size. However then one can say that using more than
> one field, it is also possible to not index the root forms twice (but i am
> not sure how to query them).
>
>
> 2) For a particular user action, if you only want to search for one form
> > (either raw or stemmed) then your query doesn't have to search over
> > multiple fields -- it just has to search on which ever field you care
> > about for the particular user action.  if you want to search for *both*
> > the stemmed and raw forms n a single query, then the compelxity of hte
> > query is the same regardless of wether the two clauses are on the same
> > field or differnet fields.
>
>
> If we also stem the query from the user, then infact we need to only
> search
> for the stemmed form. But it is then necessary to determine the language
> of
> the query(if possible) in order to stem it.
>
> : > > or canonical forms to a phrase query. Moreover it seems that adding
> > : > > multiple forms of the "word"s alters statistical calculations for
> > : > > scoring, especially for tf and idf, because the frequency of the
> > root
> > : > > form of the word is incremented at each word with that root form.
> Is
> >
> > it's a matter of opinion wether this is "right" or not ... if you are
> > storing both the raw and stemmed forms then in theory your tf/idf
> numbers
> > now both represent "twice" what they normally would and it balances
> out..
> > "dog" is not only a raw word, but also it's own stem, so a
> tf(docA,dog)=2
> > for one real instance of "dog" is just as correct as a doc that contains
> > "dogs" and gets tf(docB,dog)=1 and tf(docB,dogs)=1.
>
>
> You are right about in that calculation, but i was wondering about the tf
> and idf in cases where we do not store the root forms twice. In that case
> i
> think tf of the root forms increases, however the derived forms
> decreases(since total number of terms nearly doubles).
>
> If you disagree with this line of thinking, asimple way to fix the problem
> > is to use a TokenFilter that removes any tokens at the same position
> which
> > contain the same text...
> >
> >
> >
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/RemoveDuplicatesTokenFilter.html
>
>
> thanks for the pointer
>
> -Hoss
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> I think Erick's way of indexing solves both problems 1 and 2. However it
> is
> effectively not different than storing the forms in seperate fields. How
> would the query performance differ the case that we index all the forms in
> one field(w/o storing root forms twice), and query only this field, from
> the
> case which we index the forms in seperate fields and run the query in all
> of
> them?
>

Re: multiple tokens at the same position

Posted by Enis Soztutar <en...@gmail.com>.

On 5/25/07, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> : Yes, indeed we could but it brings other problems, for example
> increasing
> : the index size, and extending the query to search for multiple fields,
> etc.
>
> 1) if you index both teh raw and stemmed forms your index is going to grow
> to roughly the same size regardless of wether the stem and the arw are in
> teh same field or differnet fields.


Well using a single field i think we do not need to store stemmed form of
the word if it is indeed the same with the original, so that the index will
not grow two times the size. However then one can say that using more than
one field, it is also possible to not index the root forms twice (but i am
not sure how to query them).


2) For a particular user action, if you only want to search for one form
> (either raw or stemmed) then your query doesn't have to search over
> multiple fields -- it just has to search on which ever field you care
> about for the particular user action.  if you want to search for *both*
> the stemmed and raw forms n a single query, then the compelxity of hte
> query is the same regardless of wether the two clauses are on the same
> field or differnet fields.


If we also stem the query from the user, then infact we need to only search
for the stemmed form. But it is then necessary to determine the language of
the query(if possible) in order to stem it.

: > > or canonical forms to a phrase query. Moreover it seems that adding
> : > > multiple forms of the "word"s alters statistical calculations for
> : > > scoring, especially for tf and idf, because the frequency of the
> root
> : > > form of the word is incremented at each word with that root form. Is
>
> it's a matter of opinion wether this is "right" or not ... if you are
> storing both the raw and stemmed forms then in theory your tf/idf numbers
> now both represent "twice" what they normally would and it balances out..
> "dog" is not only a raw word, but also it's own stem, so a tf(docA,dog)=2
> for one real instance of "dog" is just as correct as a doc that contains
> "dogs" and gets tf(docB,dog)=1 and tf(docB,dogs)=1.


You are right about in that calculation, but i was wondering about the tf
and idf in cases where we do not store the root forms twice. In that case i
think tf of the root forms increases, however the derived forms
decreases(since total number of terms nearly doubles).

If you disagree with this line of thinking, asimple way to fix the problem
> is to use a TokenFilter that removes any tokens at the same position which
> contain the same text...
>
>
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/RemoveDuplicatesTokenFilter.html


thanks for the pointer

-Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
I think Erick's way of indexing solves both problems 1 and 2. However it is
effectively not different than storing the forms in seperate fields. How
would the query performance differ the case that we index all the forms in
one field(w/o storing root forms twice), and query only this field, from the
case which we index the forms in seperate fields and run the query in all of
them?

Re: multiple tokens at the same position

Posted by Chris Hostetter <ho...@fucit.org>.

: Yes, indeed we could but it brings other problems, for example increasing
: the index size, and extending the query to search for multiple fields, etc.

1) if you index both teh raw and stemmed forms your index is going to grow
to roughly the same size regardless of wether the stem and the arw are in
teh same field or differnet fields.

2) For a particular user action, if you only want to search for one form
(either raw or stemmed) then your query doesn't have to search over
multiple fields -- it just has to search on which ever field you care
about for the particular user action. if you want to search for *both*
the stemmed and raw forms n a single query, then the compelxity of hte
query is the same regardless of wether the two clauses are on the same
field or differnet fields.

: > > or canonical forms to a phrase query. Moreover it seems that adding
: > > multiple forms of the "word"s alters statistical calculations for
: > > scoring, especially for tf and idf, because the frequency of the root
: > > form of the word is incremented at each word with that root form. Is

it's a matter of opinion wether this is "right" or not ... if you are
storing both the raw and stemmed forms then in theory your tf/idf numbers
now both represent "twice" what they normally would and it balances out..
"dog" is not only a raw word, but also it's own stem, so a tf(docA,dog)=2
for one real instance of "dog" is just as correct as a doc that contains
"dogs" and gets tf(docB,dog)=1 and tf(docB,dogs)=1.

If you disagree with this line of thinking, asimple way to fix the problem
is to use a TokenFilter that removes any tokens at the same position which
contain the same text...

http://lucene.apache.org/solr/api/org/apache/solr/analysis/RemoveDuplicatesTokenFilter.html

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: multiple tokens at the same position

Posted by Enis Soztutar <en...@gmail.com>.

Yes, indeed we could but it brings other problems, for example increasing
the index size, and extending the query to search for multiple fields, etc.

On 5/25/07, Steven Rowe <sa...@syr.edu> wrote:
>
> Hi Enis,
>
> Enis Soztutar wrote:
> > In nutch we have a use case in which we need to store tokens with their
> > original text plus their stemmed form plus their canonical form(through
> > some asciifization). From my understanding of lucene, it makes sense to
> > write a tokenstream which generates several tokens for each "word", but
> > place all the tokens for the "word" at the same position with
> > Token#setPositionIncrement(0).
> > This way we could be able to search over this field using any
> > form(stemmed, canonical, original) of the "word". Actually i have two
> > questions here. First is that is there any way to avoid matching stemmed
> > or canonical forms to a phrase query. Moreover it seems that adding
> > multiple forms of the "word"s alters statistical calculations for
> > scoring, especially for tf and idf, because the frequency of the root
> > form of the word is incremented at each word with that root form. Is
> > there any way that we could avoid it?
>
> Answering both questions: Couldn't you just use a different field for
> each form?
>
> --
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: multiple tokens at the same position

Posted by Steven Rowe <sa...@syr.edu>.

Hi Enis,

Enis Soztutar wrote:
> In nutch we have a use case in which we need to store tokens with their
> original text plus their stemmed form plus their canonical form(through
> some asciifization). From my understanding of lucene, it makes sense to
> write a tokenstream which generates several tokens for each "word", but
> place all the tokens for the "word" at the same position with
> Token#setPositionIncrement(0).
> This way we could be able to search over this field using any
> form(stemmed, canonical, original) of the "word". Actually i have two
> questions here. First is that is there any way to avoid matching stemmed
> or canonical forms to a phrase query. Moreover it seems that adding
> multiple forms of the "word"s alters statistical calculations for
> scoring, especially for tf and idf, because the frequency of the root
> form of the word is incremented at each word with that root form. Is
> there any way that we could avoid it?

Answering both questions: Couldn't you just use a different field for
each form?

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: multiple tokens at the same position

Posted by Erick Erickson <er...@gmail.com>.

I can only speak to the " avoid matching stemmed
or canonical forms" part...

Yes, but you've got to do some fancy dancing when you index,
something like adding a special signifier to, say, the original token.
I'll ignore the canonical part of your question for the sake of
brevity.

Consider indexing "running"
You'd index "run" and "running$".

Now, whenever you care about the original token, you append
the '$' to the term and search for that.

This has one other advantage. Say you index the term "run" with
the above. If you don't do something like adding the $ to the
original, you can't distinguish between getting a hit on the
stem or not. That is, you can't distinguish between getting a hit
where the original word was "run" and one where the original
was "running". This may be important for "exact match".

Best
Erick

On 5/25/07, Enis Soztutar <en...@gmail.com> wrote:
>
> Hi,
>
> In nutch we have a use case in which we need to store tokens with their
> original text plus their stemmed form plus their canonical form(through
> some asciifization). From my understanding of lucene, it makes sense to
> write a tokenstream which generates several tokens for each "word", but
> place all the tokens for the "word" at the same position with
> Token#setPositionIncrement(0).
> This way we could be able to search over this field using any
> form(stemmed, canonical, original) of the "word". Actually i have two
> questions here. First is that is there any way to avoid matching stemmed
> or canonical forms to a phrase query. Moreover it seems that adding
> multiple forms of the "word"s alters statistical calculations for
> scoring, especially for tf and idf, because the frequency of the root
> form of the word is incremented at each word with that root form. Is
> there any way that we could avoid it?
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>