Posted to java-user@lucene.apache.org by Jacek Grzebyta <gr...@gmail.com> on 2017/06/08 11:56:30 UTC

Penalize the fact that the searched term is within a word

Hi,

Apologies for repeating a question from the IRC room, but I am not sure
whether that channel is active.

I have no real idea of how Lucene works, but I need to modify a part of the
rdf4j project, which depends on it.

I need to use Lucene to create a mapping file based on text searching, and I
found the following problem. Take a term 'abcd' which gets mapped to node
'abcd-2' even though a node 'abcd' exists. The issue is that Lucene searches
for the term, finds it in both nodes 'abcd' and 'abcd-2', and gives them the
same score. My question is: how can I modify the scoring to penalise a hit
where the searched term is part of a longer word, and give a higher score
when it is a whole word by itself?

Visually it looks like this:

node 'abcd':
  - name: abcd

total score = LS (Lucene score) * 2.0 (name weight)



node 'abcd-2':
   - name: abcd-2
   - alias1: abcd-h
   - alias2: abcd-k9

total score = LS * 2.0 + LS * 0.5 (alias1 score) + LS * 0.1 (alias2 score)

I gave different weights to the properties: "name" has the highest weight,
but each "alias" has some small weight as well. In total, the score for a
node is the sum of all partial scores times their weights. As a result,
'abcd-2' ends up with a higher total score than 'abcd'.
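
For illustration, here is a rough sketch of how I imagine combining the
weighted fields with Lucene's query API (the field names and weights are
just my example above, not the actual rdf4j code):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.BoostQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Each field contributes (its Lucene score * boost); the BooleanQuery
    // sums the contributions of all matching SHOULD clauses.
    static Query weightedQuery(String term) {
        return new BooleanQuery.Builder()
            .add(new BoostQuery(new TermQuery(new Term("name", term)), 2.0f), Occur.SHOULD)
            .add(new BoostQuery(new TermQuery(new Term("alias1", term)), 0.5f), Occur.SHOULD)
            .add(new BoostQuery(new TermQuery(new Term("alias2", term)), 0.1f), Occur.SHOULD)
            .build();
    }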

thanks,
Jacek

Re: Penalize the fact that the searched term is within a word

Posted by Jacek Grzebyta <gr...@gmail.com>.
Unfortunately, WhitespaceTokenizer does not work properly for the real data.
I also tried KeywordAnalyzer, because the data I need to process are just
IDs, but with that there are no results at all.
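
To see what is going on, it helps to print the tokens an analyzer actually
emits for a sample value; a minimal sketch (the field name is arbitrary):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("name", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                  // required before the first incrementToken()
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }

KeywordAnalyzer emits the whole value as a single token, e.g.
printTokens(new KeywordAnalyzer(), "abcd-2") prints just "abcd-2", which
would explain getting no hits when the query term is not byte-for-byte equal
to the full ID.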


On 9 June 2017 at 14:09, Uwe Schindler <uw...@thetaphi.de> wrote:


RE: Penalize the fact that the searched term is within a word

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

The tokens are matched as-is: it is only a match if the tokens are exactly the same bytes. Substring matches are never done, just a simple comparison of bytes.
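
A minimal self-contained demo of that exact-token matching (in-memory index
just to keep the demo runnable; the field name and values mirror the example
earlier in this thread):

    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("name", "abcd-2", Store.YES)); // indexed as the single token "abcd-2"
            w.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            System.out.println(searcher.count(new TermQuery(new Term("name", "abcd"))));   // 0: "abcd" != "abcd-2"
            System.out.println(searcher.count(new TermQuery(new Term("name", "abcd-2")))); // 1: identical bytes
        }
    }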

To get fuzzier matches, you have to do the text analysis right. This includes splitting the text into tokens (Tokenizer), but also term "normalization" (TokenFilters). One example is lowercasing (to allow case-insensitive matching), but stemming might also be done, or conversion to phonetic codes (to allow phonetic matches). The output tokens do not necessarily need to be "human readable" anymore.

How does this work with matching, since the user won't enter phonetic codes? Tokenization and normalization are done on both the indexing side and the query side. If both sides produce the same tokens, it's a match; very simple. With that information you should be able to think about good ways to analyze the text for your use case.

If you use Solr, the schema.xml is your friend. In Lucene, look at the analysis module, which has examples for common languages. If you want to do your own, use CustomAnalyzer to create your own combination of tokenization and normalization (filtering of tokens).
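
For example, a minimal CustomAnalyzer built from named factories (whitespace
tokenization plus lowercasing; substitute whatever tokenizer and filters fit
your data):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;

    // Use the same analyzer at index time and at query time, so that
    // both sides produce identical tokens.
    static Analyzer buildAnalyzer() throws IOException {
        return CustomAnalyzer.builder()
            .withTokenizer("whitespace")    // split on whitespace only
            .addTokenFilter("lowercase")    // normalize case
            .build();
    }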

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



Re: Penalize the fact that the searched term is within a word

Posted by Jacek Grzebyta <gr...@gmail.com>.
Hi Ahmed,

That works! Still, I do not understand how that stuff works. I just know
that the analyser cuts the indexed text into tokens, but I do not know how
the matching is done.

Can you recommend a good book to read? I would prefer something with less
maths and more examples. The only one I found is the free "An Introduction
to Information Retrieval", but it has a lot of maths I do not understand.

Best regards,
Jacek



On 8 June 2017 at 19:36, Ahmet Arslan <io...@yahoo.com.invalid> wrote:


Re: Penalize the fact that the searched term is within a word

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi,
You can completely ban within-a-word matching by simply using WhitespaceTokenizer, for example. By the way, it is all about how you tokenize/analyze your text. Once you have decided, you can create two versions of a single field using different analysers. This allows you to assign different weights to those fields at query time.
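
A sketch of that two-field idea (the field names here are illustrative):
index the same value into two fields analyzed differently via
PerFieldAnalyzerWrapper, then boost the exact-match field at query time.

    import java.util.Collections;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.BoostQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // "name" is tokenized on whitespace; "name_exact" keeps the value as one token.
    Analyzer indexAnalyzer = new PerFieldAnalyzerWrapper(
        new WhitespaceAnalyzer(),
        Collections.singletonMap("name_exact", new KeywordAnalyzer()));

    // At query time, weight the exact (whole-value) field higher, so a node
    // whose name is exactly 'abcd' outscores one where 'abcd' is only one
    // token of a longer value.
    Query q = new BooleanQuery.Builder()
        .add(new BoostQuery(new TermQuery(new Term("name_exact", "abcd")), 2.0f), Occur.SHOULD)
        .add(new TermQuery(new Term("name", "abcd")), Occur.SHOULD)
        .build();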
Ahmet


On Thursday, June 8, 2017, 2:56:37 PM GMT+3, Jacek Grzebyta <gr...@gmail.com> wrote:

