Posted to java-user@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2003/10/21 03:15:56 UTC

positional token info

Is anyone doing anything interesting with the 
Token.setPositionIncrement during analysis?

Just for fun, I've written a simple stop filter that bumps the position 
increments to account for the stop words removed:

   public final Token next() throws IOException {
     // Number of stop words skipped since the last token was returned.
     int increment = 0;
     for (Token token = input.next(); token != null; token = input.next()) {

       if (table.get(token.termText()) == null) {
         // Not a stop word: widen its increment to cover the skipped words.
         token.setPositionIncrement(token.getPositionIncrement() + increment);
         return token;
       }

       increment++;
     }

     return null;
   }
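
A minimal, Lucene-free sketch of the same arithmetic may help here. It assumes every incoming word starts with an increment of 1, and the names PositionedTerm and GapStopFilterSketch are hypothetical stand-ins, not part of any Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a Lucene Token: a term plus its position increment.
final class PositionedTerm {
    final String term;
    final int increment;

    PositionedTerm(String term, int increment) {
        this.term = term;
        this.increment = increment;
    }

    @Override
    public String toString() {
        return term + "(+" + increment + ")";
    }
}

final class GapStopFilterSketch {
    // Drops stop words and adds one to the next kept term's increment
    // for every stop word dropped in front of it.
    static List<PositionedTerm> filter(List<String> words, Set<String> stopWords) {
        List<PositionedTerm> kept = new ArrayList<>();
        int skipped = 0;
        for (String word : words) {
            if (stopWords.contains(word)) {
                skipped++;
            } else {
                kept.add(new PositionedTerm(word, 1 + skipped));
                skipped = 0;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "of"));
        // prints [top(+1), world(+3)]
        System.out.println(filter(Arrays.asList("top", "of", "the", "world"), stops));
    }
}
```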


But it's practically impossible to formulate a Query that can take 
advantage of this.  Because Terms don't carry positional info (only the 
transient tokens do), a PhraseQuery only works with a slop factor, which 
doesn't guarantee the exact match I'm after.  A PhrasePrefixQuery 
won't work any better, as there is no way to add in a "blank" term to 
indicate a missing position.

I certainly see the benefit of putting tokens into zero-increment 
positions, but are increments of 2 or more at all useful?  If so, how 
are folks using it?

Thanks,
	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: positional token info

Posted by Tatu Saloranta <ta...@hypermall.net>.

On Tuesday 21 October 2003 17:31, Otis Gospodnetic wrote:
> > It does seem handy to avoid exact phrase matches on "phone boy" when a
> > stop word is removed though, so patching StopFilter to put in the
> > missing positions seems reasonable to me currently.  Any objections
> > to that?
>
> So "phone boy" would match documents containing "phone the boy"?  That

Hmmh. WWGD (What Would Google Do)? :-)

> doesn't sound right to me, as it assumes what the user is trying to do.
>  Wouldn't it be better to allow the user to decide what he wants?
> (i.e. "phone boy" returns documents with that _exact_ phrase.  "phone
> boy"~2 also returns documents containing "phone the boy").

As long as phrase queries work appropriately with proximity modifiers, one
alternative (from an app standpoint) would be to:

(a) Tokenize stopwords out, adding a skip value: either one per stop word,
  or one per non-empty sequence of stop words ("top of the world" might
  make sense to tokenize as "top - world", with "-" signifying a 'hole').
(b) With phrase queries, first do an exact match.
(c) If the number of matches is "too low" (whatever the definition of low is),
  use a phrase query match with a slop of 2 instead.

The tricky part would be to do the same for combination queries, where it's
not easy to check matches for individual query components.

Perhaps it'd be possible to create Yet Another Query object that would,
given a threshold, do one or two searches (as described above), to allow
for self-adjusting behaviour?
Or perhaps there should be a container query that could execute an ordered
sequence of sub-queries until one returns a "good enough" set of matches, then
return that set (or the last result(s), if there are no good matches), and the
above-mentioned "sloppy if need be" phrase query would just be a special case?

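The container idea can be sketched without any Lucene types. This is only an illustration under stated assumptions: each search is modelled as a Supplier returning a list of hits, "good enough" is taken to mean at least minHits results, and FallbackSearchSketch and firstGoodEnough are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

final class FallbackSearchSketch {
    // Runs each search in order and returns the first result with at least
    // minHits entries. If none qualifies, returns the result of the last
    // search that was run (or an empty list if there were no searches).
    static <T> List<T> firstGoodEnough(List<Supplier<List<T>>> searches, int minHits) {
        List<T> last = Collections.emptyList();
        for (Supplier<List<T>> search : searches) {
            last = search.get();
            if (last.size() >= minHits) {
                return last;
            }
        }
        return last;
    }

    public static void main(String[] args) {
        // Canned results stand in for an exact phrase search and a slop-2 search.
        Supplier<List<String>> exact = () -> Collections.<String>emptyList();
        Supplier<List<String>> sloppy = () -> Arrays.asList("doc7", "doc12");
        // prints [doc7, doc12]
        System.out.println(firstGoodEnough(Arrays.asList(exact, sloppy), 1));
    }
}
```

Later searches run only when earlier ones fall short, so the extra cost is paid only for queries with too few exact matches.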
-+ Tatu +-


Re: positional token info

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Then we agree, and it is StopFilter that needs to be patched to take
into account the number of removed terms, and add appropriate
positional info to each term.

Otis

--- Erik Hatcher <er...@ehatchersolutions.com> wrote:
> On Tuesday, October 21, 2003, at 07:31  PM, Otis Gospodnetic wrote:
> > So "phone boy" would match documents containing "phone the boy"?  That
> > doesn't sound right to me, as it assumes what the user is trying to do.
>
> That is correct: currently a match would be found.  Here's a little
> test case I'm working with:
>
>      Directory directory = new RAMDirectory();
>      IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
>      Document doc = new Document();
>      doc.add(Field.Text("contents", "The quick brown fox jumped over the lazy dogs"));
>      writer.addDocument(doc);
>      writer.close();
>
>      IndexSearcher searcher = new IndexSearcher(directory);
>      QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
>      Query query = parser.parse("\"over lazy\"");
>
>      Hits hits = searcher.search(query);
>      assertEquals(1, hits.length());
>
> which currently passes, although I don't think it should.
>
> >  Wouldn't it be better to allow the user to decide what he wants?
> > (i.e. "phone boy" returns documents with that _exact_ phrase.  "phone
> > boy"~2 also returns documents containing "phone the boy").
>
> I concur.  StopFilter just removes terms, but does not adjust the
> following acceptable term with the offset to account for the missing
> stop words.



Re: positional token info

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Tuesday, October 21, 2003, at 07:31  PM, Otis Gospodnetic wrote:
> So "phone boy" would match documents containing "phone the boy"?  That
> doesn't sound right to me, as it assumes what the user is trying to do.

That is correct: currently a match would be found.  Here's a little 
test case I'm working with:

     // Index one document; StandardAnalyzer drops the stop word "the".
     Directory directory = new RAMDirectory();
     IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
     Document doc = new Document();
     doc.add(Field.Text("contents", "The quick brown fox jumped over the lazy dogs"));
     writer.addDocument(doc);
     writer.close();

     // "over" and "lazy" are not adjacent in the source text.
     IndexSearcher searcher = new IndexSearcher(directory);
     QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
     Query query = parser.parse("\"over lazy\"");

     Hits hits = searcher.search(query);
     assertEquals(1, hits.length());

which currently passes, although I don't think it should.

>  Wouldn't it be better to allow the user to decide what he wants?
> (i.e. "phone boy" returns documents with that _exact_ phrase.  "phone
> boy"~2 also returns documents containing "phone the boy").

I concur.  StopFilter just removes terms, but does not adjust the 
following acceptable term with the offset to account for the missing 
stop words.
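
The difference can be illustrated with a small self-contained sketch that does not use Lucene at all. It assumes a tiny hypothetical stop list and models the index as a list of slots, where null marks the hole left by a removed stop word; PhraseGapSketch and its methods are illustrative names only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PhraseGapSketch {
    // Hypothetical stop list, much smaller than a real analyzer's.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    // Lays the kept terms out by position. With keepHoles=true a null slot
    // marks each removed stop word; with keepHoles=false the kept terms
    // close ranks, which is the behaviour the test case above exposes.
    static List<String> index(String text, boolean keepHoles) {
        List<String> slots = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!STOP_WORDS.contains(word)) {
                slots.add(word);
            } else if (keepHoles) {
                slots.add(null);
            }
        }
        return slots;
    }

    // True when the two terms occupy consecutive positions somewhere.
    static boolean exactPhrase(List<String> slots, String first, String second) {
        for (int i = 0; i + 1 < slots.size(); i++) {
            if (first.equals(slots.get(i)) && second.equals(slots.get(i + 1))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dogs";
        System.out.println(exactPhrase(index(text, false), "over", "lazy")); // true
        System.out.println(exactPhrase(index(text, true), "over", "lazy"));  // false
    }
}
```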

	Erik



Re: positional token info

Posted by Otis Gospodnetic <ot...@yahoo.com>.

> It does seem handy to avoid exact phrase matches on "phone boy" when a
> stop word is removed though, so patching StopFilter to put in the
> missing positions seems reasonable to me currently.  Any objections
> to that?

So "phone boy" would match documents containing "phone the boy"?  That
doesn't sound right to me, as it assumes what the user is trying to do.
 Wouldn't it be better to allow the user to decide what he wants?
(i.e. "phone boy" returns documents with that _exact_ phrase.  "phone
boy"~2 also returns documents containing "phone the boy").

Sorry if I'm misunderstanding something, long day, plus 1:30 AM.

Otis



Re: positional token info

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Tuesday, October 21, 2003, at 12:53  PM, Doug Cutting wrote:
> If however you want "phone the boy" to match "phone X boy" where X is 
> any word, then PhraseQuery would have to be extended.  It's actually a 
> pretty simple extension.  Each term in a PhraseQuery corresponds to a 
> PhrasePositions object.  The 'offset' field within this is the 
> position of the term in the phrase.  If you construct the phrase 
> positions for a two-term phrase so that the first has offset=0 and the 
> second offset=2, then you'll get this sort of matching.  So all that's 
> needed is a new method PhraseQuery.add(Term term, int offset), and for 
> these offsets to be stored so that they can be used when building 
> PhrasePositions.  Would this be a useful feature?

My questions were really academic in nature: I wanted to understand 
position increments and how they relate to searching.  I definitely 
agree (and who could argue?) with Nutch and Google!  Removing stop 
words is not a good thing, but smart handling of pervasive terms is 
important, as you have done in Nutch both when not doing phrase 
queries and in how the bi-gram stuff works.

It does seem handy to avoid exact phrase matches on "phone boy" when a 
stop word is removed though, so patching StopFilter to put in the 
missing positions seems reasonable to me currently.  Any objections to 
that?

	Erik


Re: positional token info

Posted by Otis Gospodnetic <ot...@yahoo.com>.

I think "phone the boy" query should match exactly that, and not "phone
X boy", nor "phone boy".  To me, entering a query as a phrase query
means that the user wants to find documents with _exactly_ that
sequence of terms.

If you know that your users will be entering phrases with stop words,
then stop words should not be thrown out before indexing.

If users are really interested in terms "phone" and "boy", they should
use +phone +boy.

If they are okay with finding documents that contain the term "phone"
followed by the term "boy", even if "boy" is not the very next term
after "phone", they can use the slop factor options.

If I understand http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23730
correctly, the included patch ensures that "phone boy" does not match
"phone the boy", but I am not sure about the other way around.

Otis



--- Doug Cutting <cu...@lucene.com> wrote:
> Erik Hatcher wrote:
> > Just for fun, I've written a simple stop filter that bumps the position
> > increments to account for the stop words removed:
> >
> > But it's practically impossible to formulate a Query that can take
> > advantage of this.  A PhraseQuery, because Terms don't have positional
> > info (only the transient tokens), only works using a slop factor which
> > doesn't guarantee an exact match like I'm after.  A PhrasePrefixQuery
> > won't work any better as there is no way to add in a "blank" term to
> > indicate a missing position.
>
> The PhraseQuery code predates the setPositionIncrement feature.
>
> You can use your filter to find phrases that don't contain stop words,
> e.g., when your filter is used, a query for the phrase "phone boy" won't
> match "phone the boy", as it would with the normal stop filter, but a
> query for "phone the boy" would also only match "phone boy".
>
> One workaround is to simply not use a stop list.  Then "phone boy" will
> only match "phone boy", and "phone the boy" will only match "phone the
> boy", and not "phone a boy" too.  One can write a query parser which
> removes stop words unless they're in phrases.  This is what Nutch and
> Google do.
>
> If however you want "phone the boy" to match "phone X boy" where X is
> any word, then PhraseQuery would have to be extended.  It's actually a
> pretty simple extension.  Each term in a PhraseQuery corresponds to a
> PhrasePositions object.  The 'offset' field within this is the position
> of the term in the phrase.  If you construct the phrase positions for a
> two-term phrase so that the first has offset=0 and the second offset=2,
> then you'll get this sort of matching.  So all that's needed is a new
> method PhraseQuery.add(Term term, int offset), and for these offsets to
> be stored so that they can be used when building PhrasePositions.  Would
> this be a useful feature?
>
> Doug

Re: positional token info

Posted by Doug Cutting <cu...@lucene.com>.

Erik Hatcher wrote:
> Just for fun, I've written a simple stop filter that bumps the position 
> increments to account for the stop words removed:
> 
> But it's practically impossible to formulate a Query that can take 
> advantage of this.  A PhraseQuery, because Terms don't have positional 
> info (only the transient tokens), only works using a slop factor which 
> doesn't guarantee an exact match like I'm after.  A PhrasePrefixQuery 
> won't work any better as there is no way to add in a "blank" term to 
> indicate a missing position.

The PhraseQuery code predates the setPositionIncrement feature.

You can use your filter to find phrases that don't contain stop words, 
e.g., when your filter is used, a query for the phrase "phone boy" won't 
match "phone the boy", as it would with the normal stop filter, but a 
query for "phone the boy" would also only match "phone boy".

One workaround is to simply not use a stop list.  Then "phone boy" will 
only match "phone boy", and "phone the boy" will only match "phone the 
boy", and not "phone a boy" too.  One can write a query parser which 
removes stop words unless they're in phrases.  This is what Nutch and 
Google do.
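
As a rough illustration of query-side stop word removal that leaves phrases alone, here is a small sketch in plain Java. It assumes phrases are delimited by balanced double quotes and ignores the rest of the query syntax; QueryStopWordSketch and stripOutsidePhrases are hypothetical names, and this is not how Nutch or Google actually implement it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class QueryStopWordSketch {
    // Removes stop words from the unquoted parts of a query and keeps
    // every quoted phrase verbatim. Assumes balanced double quotes.
    static String stripOutsidePhrases(String query, Set<String> stopWords) {
        // Splitting on the quote character puts phrase text at odd indices.
        String[] parts = query.split("\"", -1);
        List<String> pieces = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            if (i % 2 == 1) {
                pieces.add("\"" + parts[i] + "\"");
            } else {
                for (String word : parts[i].trim().split("\\s+")) {
                    if (!word.isEmpty() && !stopWords.contains(word.toLowerCase())) {
                        pieces.add(word);
                    }
                }
            }
        }
        return String.join(" ", pieces);
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "a"));
        // prints boy "phone the boy"
        System.out.println(stripOutsidePhrases("the boy \"phone the boy\"", stops));
    }
}
```

For this to be useful the index itself must keep stop words, as in the workaround described above.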

If however you want "phone the boy" to match "phone X boy" where X is 
any word, then PhraseQuery would have to be extended.  It's actually a 
pretty simple extension.  Each term in a PhraseQuery corresponds to a 
PhrasePositions object.  The 'offset' field within this is the position 
of the term in the phrase.  If you construct the phrase positions for a 
two-term phrase so that the first has offset=0 and the second offset=2, 
then you'll get this sort of matching.  So all that's needed is a new 
method PhraseQuery.add(Term term, int offset), and for these offsets to 
be stored so that they can be used when building PhrasePositions.  Would 
this be a useful feature?
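
The offset-based matching described here can be sketched independently of Lucene's internals. The sketch below assumes each phrase term comes with the list of document positions where it occurs; OffsetPhraseSketch and matches are hypothetical names and do not reflect the actual PhrasePositions code.

```java
import java.util.Arrays;
import java.util.List;

final class OffsetPhraseSketch {
    // positions.get(i) holds the document positions of the i-th phrase term;
    // offsets[i] is that term's position within the phrase.
    // The phrase matches at start p when every term i occurs at p + offsets[i].
    static boolean matches(List<List<Integer>> positions, int[] offsets) {
        for (int candidate : positions.get(0)) {
            int start = candidate - offsets[0];
            boolean all = true;
            for (int i = 1; i < offsets.length && all; i++) {
                all = positions.get(i).contains(start + offsets[i]);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "phone the boy" indexed with a hole for "the": phone at 0, boy at 2.
        List<List<Integer>> withHole = Arrays.asList(Arrays.asList(0), Arrays.asList(2));
        System.out.println(matches(withHole, new int[] {0, 2})); // true
        System.out.println(matches(withHole, new int[] {0, 1})); // false
    }
}
```

With offsets 0 and 2 the query behaves like "phone X boy"; with offsets 0 and 1 it is an ordinary exact phrase.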

Doug


Re: positional token info

Posted by Steve Rowe <sa...@gwmail.syr.edu>.

Erik,

I've submitted a patch (BUG# 23730) very similar to yours, in response 
to a request to fix phrases matching where they should not:

<URL:http://mail-archive.com/lucene-user@jakarta.apache.org/msg04349.html>

Bug #23730:
<URL:http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23730>

 > But, how would you actually *use* an index that was indexed with the
 > holes noted by > 1 position increments?

As the lucene-user email linked above notes, setting the position 
increment prevents false phrase matches.

Steve Rowe

Erik Hatcher wrote:
> On Tuesday, October 21, 2003, at 03:36  AM, Pierrick Brihaye wrote:
> 
>> The basic idea is to have several tokens at the same position (i.e. 
>> setPositionIncrement(0)) which are different possible stems for the 
>> same word.
> 
> 
> Right.  Like I said, I recognize the benefits of using a position 
> increment of 0.
> 
>>> I certainly see the benefit of putting tokens into zero-increment 
>>> positions, but are increments of 2 or more at all useful?
>>
>>
>> Who knows? It may be interesting to keep track of the *presence* of 
>> "empty words", e.g. "[the] sky [is] blue", "[the] sky [is] [really] 
>> blue", "[the] sky [is] [that] [really] blue". The traditional 
>> reduction to "sky blue" is maybe over-simplistic for some cases...
> 
> 
> But, how would you actually *use* an index that was indexed with the 
> holes noted by > 1 position increments?
> 
>     Erik



Re: positional token info

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Tuesday, October 21, 2003, at 03:36  AM, Pierrick Brihaye wrote:
> The basic idea is to have several tokens at the same position (i.e. 
> setPositionIncrement(0)) which are different possible stems for the 
> same word.

Right.  Like I said, I recognize the benefits of using a position 
increment of 0.

>> I certainly see the benefit of putting tokens into zero-increment 
>> positions, but are increments of 2 or more at all useful?
>
> Who knows? It may be interesting to keep track of the *presence* of 
> "empty words", e.g. "[the] sky [is] blue", "[the] sky [is] [really] 
> blue", "[the] sky [is] [that] [really] blue". The traditional 
> reduction to "sky blue" is maybe over-simplistic for some cases...

But, how would you actually *use* an index that was indexed with the 
holes noted by > 1 position increments?

	Erik



Re: positional token info

Posted by Pierrick Brihaye <pi...@culture.gouv.fr>.

Hi,

Erik Hatcher wrote:

> Is anyone doing anything interesting with the Token.setPositionIncrement 
> during analysis?

I think so :-) Well... my Arabic analyzer is based on this functionality.

The basic idea is to have several tokens at the same position (i.e. 
setPositionIncrement(0)) which are different possible stems for the same 
word.
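
To make the stacking concrete, here is a small sketch that turns a stream of (term, increment) pairs into absolute positions. It assumes positions start at 0 for a first token with increment 1; the English words stand in for real stems, and StackedTokensSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class StackedTokensSketch {
    // Groups terms by absolute position. An increment of 0 places a term
    // at the same position as the previous one; 1 moves to the next slot.
    static Map<Integer, List<String>> byPosition(String[] terms, int[] increments) {
        Map<Integer, List<String>> result = new TreeMap<>();
        int position = -1;
        for (int i = 0; i < terms.length; i++) {
            position += increments[i];
            result.computeIfAbsent(position, k -> new ArrayList<>()).add(terms[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // "see" and "saw" are two candidate stems for the same surface word.
        String[] terms = {"see", "saw", "blade"};
        int[] increments = {1, 0, 1};
        // prints {0=[see, saw], 1=[blade]}
        System.out.println(byPosition(terms, increments));
    }
}
```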

> But it's practically impossible to formulate a Query that can take 
> advantage of this.  A PhraseQuery, because Terms don't have positional 
> info (only the transient tokens)

Correct!

I've made a dirty patch for the QueryParser which is able to handle 
tokens with positionIncrement equal to 0 or 1 (see bug #23307). It still 
needs some work, but it fits my needs :-)

> I certainly see the benefit of putting tokens into zero-increment 
> positions, but are increments of 2 or more at all useful?

Who knows? It may be interesting to keep track of the *presence* of 
"empty words", e.g. "[the] sky [is] blue", "[the] sky [is] [really] 
blue", "[the] sky [is] [that] [really] blue". The traditional reduction 
to "sky blue" is maybe over-simplistic for some cases...

Well, just an idea.

Cheers,

-- 
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:pierrick.brihaye@culture.gouv.fr

