Posted to dev@lucene.apache.org by Pablo <pa...@gmail.com> on 2010/05/03 20:52:53 UTC

Which way would you recommend for successive-words similarity and scoring?

Hello,

Lucene core doesn't seem to use relative word positioning (?) for scoring.

For example, after indexing the phrase "a b c d e f g h i j k l m n o p q r
s t u v w x y z", these queries give the same score (0.19308087):
 - 1 : phrase:'e f g'
 - 2 : phrase:'o k z'

I'm somewhat familiar with Lucene and the Snowball stemmers, but I never really
needed this feature before and haven't browsed the Lucene contribs.

Maybe I'm misunderstanding something, but what can I do to make
query 1 get a better score than query 2?

Should I implement a Scorer and/or a Similarity, or would an analyzer
and a specific stemmer be sufficient?


Thanks,

Pablo

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Which way would you recommend for successive-words similarity and scoring?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Pablo,

This question comes up every once in a while.  You'll find some previous discussions and answers here: http://search-lucene.com/?q=terms+closer+together+score

 Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Pablo <pa...@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Mon, May 3, 2010 3:20:10 PM
> Subject: Which way would you recommend for successive-words similarity and scoring?
>
> Hello,
>
> Lucene core doesn't seem to use relative word positioning (?) for scoring.
>
> For example, after indexing the phrase "a b c d e f g h i j k l m n o p q r
> s t u v w x y z", these queries give the same score (0.19308087):
>  - 1 : phrase:'e f g'
>  - 2 : phrase:'o k z'
>
> I'm somewhat familiar with Lucene and the Snowball stemmers, but I never really
> needed this feature before and haven't browsed the Lucene contribs.
>
> Maybe I'm misunderstanding something, but what can I do to make
> query 1 get a better score than query 2?
>
> Should I implement a Scorer and/or a Similarity, or would an analyzer
> and a specific stemmer be sufficient?
>
> Thanks, [I first wrote to dev; that wasn't the right place.]
>
> Pablo
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Tokenization / Analyzer question

Posted by "Beard, Brian" <Br...@mybir.com>.
So I've been thinking about this more, and what seems most plausible is
to just use Store.NO for the delimiters, so they will have payloads
encoded correctly without affecting the stored data, and to store
separate instances of the subId information which can be retrieved
along the boundaries. There should be about three subIds on average, so
I don't think extracting the separate field instances would be too big
a performance hit.
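
A minimal sketch of that plan, assuming Lucene 2.9's Field API; the field
names and the textWithDelimiters/subIdValues variables are illustrative,
not the actual code:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Index the full text (delimiter tokens included) without storing it;
    // the payloads are attached by the analyzer chain. Each subId's display
    // string goes into its own stored-only "subId" field instance.
    Document doc = new Document();
    doc.add(new Field("content", textWithDelimiters,
            Field.Store.NO, Field.Index.ANALYZED));
    for (String subIdValue : subIdValues) {
        doc.add(new Field("subId", subIdValue,
                Field.Store.YES, Field.Index.NO));
    }

    // At display time the stored instances come back in index order:
    // String[] values = searcher.doc(docId).getValues("subId");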

What I find myself wanting is the ability to pass metadata along with
a Field when it is added to the document, and then retrieve that
metadata from inside the TokenFilter. I guess this would be similar to
column-stride fields, but with multiple ones at different positions in
the document.

-----Original Message-----
From: Beard, Brian [mailto:Brian.Beard@mybir.com] 
Sent: Thursday, August 19, 2010 2:02 PM
To: java-user@lucene.apache.org
Subject: Tokenization / Analyzer question

I'm using lucene 2.9.1.

I'm indexing documents which correspond to an ID.
Each field in the ID document is made up of data from all subIds.
(It's a requirement that searches must work across all subIds within an
ID.)

They will be indexed and stored in some format similar to:
subId0Value0 subId0Value1 DELIMITER subId1Value0 subId1Value1 DELIMITER
....
This is so the stored value strings can be matched up to the correct
subId for display purposes.
In addition, there may be more than one type of delimiter to denote the
type of data within a subId's data.

Moreover, for some of the fields we need to be able to tell which
subId the search terms correspond to.

What I'd like to do is encode the delimiter type into the payload but
use the same text string for all delimiter types. The delimiter tokens
shouldn't affect the search, because they will be filtered out by the
analyzer. This is so that, during a post-processing step on the
returned results, all payloads can be extracted at once via a single
term, instead of having to loop through them term by term (I don't see
a way to get them on a per-document basis; otherwise I could add a
payload to the first term when the subId changes during indexing).
Then the search-term positions can be checked against the delimiter
positions to determine the record and the type of data.

I've got an example performing the tokenization and payload encoding
correctly, but the stored values are not the ones I want.

What I've done is write a BoundaryTokenFilter class which is always
instantiated *after* the StandardAnalyzer, WhitespaceAnalyzer, etc.:
new BoundaryTokenFilter(analyzer.tokenStream(fieldName, reader)).

In order to get the tokens passed down to BoundaryTokenFilter, and
passed through the other tokenizers/TokenFilters higher up the chain,
I'm using character-based delimiters (a different one for each
delimiter type, each also carrying some additional encoded information
for the payload). Inside BoundaryTokenFilter, the character-based
delimiter is decoded, the token is changed over to the single
delimiter value, and the payload is added.

This part works, but the problem I'm having is that I would like the
delimiter values stored in the index to be the same as the single
tokenized delimiter values.

Is there any way to do this - transform the stored values to be the
tokenized values, but only for select tokens?

The other alternative I can think of, if that isn't possible, is to
modify StandardAnalyzer to have indexing and search modes, so that in
indexing mode it would let the delimiter tokens pass through and flag
them with a TypeAttribute, while in search mode it wouldn't. For this,
though, I would most likely end up having to use different delimiter
values.

Any help is appreciated,

Brian Beard



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Tokenization / Analyzer question

Posted by "Beard, Brian" <Br...@mybir.com>.
I'm using lucene 2.9.1.

I'm indexing documents which correspond to an ID.
Each field in the ID document is made up of data from all subIds.
(It's a requirement that searches must work across all subIds within an
ID.)

They will be indexed and stored in some format similar to:
subId0Value0 subId0Value1 DELIMITER subId1Value0 subId1Value1 DELIMITER
....
This is so the stored value strings can be matched up to the correct
subId for display purposes.
In addition, there may be more than one type of delimiter to denote the
type of data within a subId's data.

Moreover, for some of the fields we need to be able to tell which
subId the search terms correspond to.

What I'd like to do is encode the delimiter type into the payload but
use the same text string for all delimiter types. The delimiter tokens
shouldn't affect the search, because they will be filtered out by the
analyzer. This is so that, during a post-processing step on the
returned results, all payloads can be extracted at once via a single
term, instead of having to loop through them term by term (I don't see
a way to get them on a per-document basis; otherwise I could add a
payload to the first term when the subId changes during indexing).
Then the search-term positions can be checked against the delimiter
positions to determine the record and the type of data.
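
For that extraction step, a minimal sketch of pulling all delimiter
payloads for one document in a single pass with Lucene 2.9's
TermPositions; the field name "content", the shared delimiter term
"DELIM", and the reader/targetDoc variables are placeholders:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositions;

    // Seek the single delimiter term to the target document, then walk its
    // positions and collect the payload attached at each boundary.
    TermPositions tp = reader.termPositions(new Term("content", "DELIM"));
    try {
        if (tp.skipTo(targetDoc) && tp.doc() == targetDoc) {
            for (int i = 0; i < tp.freq(); i++) {
                int boundaryPos = tp.nextPosition(); // delimiter position in the field
                if (tp.isPayloadAvailable()) {
                    byte[] type = tp.getPayload(new byte[tp.getPayloadLength()], 0);
                    // compare search-term positions against boundaryPos; 'type'
                    // says which kind of delimiter marked this boundary
                }
            }
        }
    } finally {
        tp.close();
    }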

I've got an example performing the tokenization and payload encoding
correctly, but the stored values are not the ones I want.

What I've done is write a BoundaryTokenFilter class which is always
instantiated *after* the StandardAnalyzer, WhitespaceAnalyzer, etc.:
new BoundaryTokenFilter(analyzer.tokenStream(fieldName, reader)).

In order to get the tokens passed down to BoundaryTokenFilter, and
passed through the other tokenizers/TokenFilters higher up the chain,
I'm using character-based delimiters (a different one for each
delimiter type, each also carrying some additional encoded information
for the payload). Inside BoundaryTokenFilter, the character-based
delimiter is decoded, the token is changed over to the single
delimiter value, and the payload is added.
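
A minimal sketch of the shape such a filter could take against Lucene
2.9's attribute API; the collapsed delimiter value "DELIM", the
"zzdelim" prefix, and the two decoding helpers are placeholders for
whatever character encoding is actually in use:

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.index.Payload;

    public final class BoundaryTokenFilter extends TokenFilter {
        private final TermAttribute termAtt;
        private final PayloadAttribute payloadAtt;

        public BoundaryTokenFilter(TokenStream input) {
            super(input);
            termAtt = addAttribute(TermAttribute.class);
            payloadAtt = addAttribute(PayloadAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            String term = termAtt.term();
            if (isEncodedDelimiter(term)) {             // placeholder predicate
                byte delimiterType = decodeType(term);  // placeholder decoder
                termAtt.setTermBuffer("DELIM");         // collapse to the single value
                payloadAtt.setPayload(new Payload(new byte[] { delimiterType }));
            }
            return true;
        }

        // Placeholders: real code would recognize/decode the character-based
        // delimiters that survived the upstream tokenizer.
        private boolean isEncodedDelimiter(String term) {
            return term.length() > 7 && term.startsWith("zzdelim");
        }
        private byte decodeType(String term) {
            return (byte) (term.charAt(7) - '0');
        }
    }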

This part works, but the problem I'm having is that I would like the
delimiter values stored in the index to be the same as the single
tokenized delimiter values.

Is there any way to do this - transform the stored values to be the
tokenized values, but only for select tokens?

The other alternative I can think of, if that isn't possible, is to
modify StandardAnalyzer to have indexing and search modes, so that in
indexing mode it would let the delimiter tokens pass through and flag
them with a TypeAttribute, while in search mode it wouldn't. For this,
though, I would most likely end up having to use different delimiter
values.
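
And a minimal sketch of that two-mode alternative; the constructor flag
and the search-mode strip filter are hypothetical, and it reuses the
BoundaryTokenFilter and "zzdelim" placeholders from the sketch above:

    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public final class ModalAnalyzer extends Analyzer {
        private final boolean indexMode;

        public ModalAnalyzer(boolean indexMode) {
            this.indexMode = indexMode;
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream ts = new StandardTokenizer(Version.LUCENE_29, reader);
            ts = new StandardFilter(ts);
            ts = new LowerCaseFilter(ts);
            // Indexing keeps the delimiter tokens (collapsed and given
            // payloads); searching drops them so they never affect matching.
            return indexMode ? new BoundaryTokenFilter(ts) : new StripFilter(ts);
        }

        // Hypothetical search-mode filter: silently drops delimiter tokens.
        private static final class StripFilter extends TokenFilter {
            private final TermAttribute termAtt;

            StripFilter(TokenStream input) {
                super(input);
                termAtt = addAttribute(TermAttribute.class);
            }

            @Override
            public boolean incrementToken() throws IOException {
                while (input.incrementToken()) {
                    if (!termAtt.term().startsWith("zzdelim")) {
                        return true;
                    }
                }
                return false;
            }
        }
    }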

Any help is appreciated,

Brian Beard



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Which way would you recommend for successive-words similarity and scoring?

Posted by Ian Lea <ia...@gmail.com>.
Hi


Take a look at PhraseQuery or SpanNearQuery, particularly the
setSlop() methods and the inOrder flag on the latter.
http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/ is well
worth reading.
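
For example, a minimal sketch against Pablo's example; the field name
"phrase" is assumed:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // Sloppy PhraseQuery: matches within the slop, and sloppy scoring ranks
    // tighter matches higher, so "e f g" outscores a scattered "o k z".
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("phrase", "e"));
    pq.add(new Term("phrase", "f"));
    pq.add(new Term("phrase", "g"));
    pq.setSlop(20);

    // SpanNearQuery: same idea, with an explicit in-order flag.
    SpanQuery[] clauses = {
        new SpanTermQuery(new Term("phrase", "e")),
        new SpanTermQuery(new Term("phrase", "f")),
        new SpanTermQuery(new Term("phrase", "g")),
    };
    SpanNearQuery snq = new SpanNearQuery(clauses, 20, true); // slop, inOrder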


--
Ian.


On Mon, May 3, 2010 at 8:20 PM, Pablo <pa...@gmail.com> wrote:
> Hello,
>
> Lucene core doesn't seem to use relative word positioning (?) for scoring.
>
> For example, after indexing the phrase "a b c d e f g h i j k l m n o p q r
> s t u v w x y z", these queries give the same score (0.19308087):
>  - 1 : phrase:'e f g'
>  - 2 : phrase:'o k z'
>
> I'm somewhat familiar with Lucene and the Snowball stemmers, but I never really
> needed this feature before and haven't browsed the Lucene contribs.
>
> Maybe I'm misunderstanding something, but what can I do to make
> query 1 get a better score than query 2?
>
> Should I implement a Scorer and/or a Similarity, or would an analyzer
> and a specific stemmer be sufficient?
>
>
> Thanks, [I first wrote to dev; that wasn't the right place.]
>
> Pablo
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Which way would you recommend for successive-words similarity and scoring?

Posted by Pablo <pa...@gmail.com>.
Hello,

Lucene core doesn't seem to use relative word positioning (?) for scoring.

For example, after indexing the phrase "a b c d e f g h i j k l m n o p q r
s t u v w x y z", these queries give the same score (0.19308087):
 - 1 : phrase:'e f g'
 - 2 : phrase:'o k z'

I'm somewhat familiar with Lucene and the Snowball stemmers, but I never really
needed this feature before and haven't browsed the Lucene contribs.

Maybe I'm misunderstanding something, but what can I do to make
query 1 get a better score than query 2?

Should I implement a Scorer and/or a Similarity, or would an analyzer
and a specific stemmer be sufficient?


Thanks, [I first wrote to dev; that wasn't the right place.]

Pablo

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org