Posted to dev@lucene.apache.org by Mirko Mancin <mi...@t-frutta.it> on 2015/04/01 14:37:27 UTC

Problem with NGram

Hi,

I have a problem with n-grams. I am trying to find the word "PRINTER".

I have these fields:


<field name="bestExternalDescriptionStandard" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
<field name="bestExternalDescriptionGram" type="text_ngram" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  </analyzer>
</fieldType>

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  </analyzer>
</fieldType>



A search for "PRINTER" correctly finds:

"BROTHER PRINTER", "SAMSUNG PRINTER", etc.

But if I search for "PRIN3R" (with an error within the string), Solr does not return anything!

How can I do this? How should I set up my schema.xml to find documents with a certain similarity?
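
To make the minGramSize="2" / maxGramSize="4" setting concrete, here is a minimal Python sketch of the character n-grams for the two strings in the question. It is plain Python, not Solr's tokenizer: the helper name char_ngrams is made up for this example, and Solr's token order and positions may differ. It only shows which grams the misspelled query still has in common with the indexed word.

```python
def char_ngrams(text, min_size=2, max_size=4):
    """Return the set of lowercase character n-grams of text (illustrative only)."""
    s = text.lower()
    return {
        s[i:i + n]
        for n in range(min_size, max_size + 1)
        for i in range(len(s) - n + 1)
    }


indexed = char_ngrams("PRINTER")
query = char_ngrams("PRIN3R")

shared = sorted(indexed & query)      # grams the typo still has in common
only_query = sorted(query - indexed)  # grams produced only by the typo

print("shared:", shared)
print("only in query:", only_query)
```

Whether the shared grams are enough for a match depends on how the query parser combines the query grams (for example, whether all of them are required), which this sketch does not model.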

Thanks


Mirko Mancin

Software Developer

Ubiq srl
stradello Conrad Marca-Relli, 9
43122 Parma (PR)
t. +39 0521 781601
cell. +39 346 4137577
Follow us on LinkedIn: https://www.linkedin.com/company/ubiq-srl

This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.

Re: Problem with NGram

Posted by Mostafa Gomaa <mo...@gmail.com>.
As far as I know, fuzzy search works only on single terms, and Solr doesn't
support fuzzy matching inside phrases out of the box. You may want to look
into using the ComplexPhraseQueryParser plugin.
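
As a rough sketch of what such a request could look like: Solr's reference guide documents a `complexphrase` query parser with an `inOrder` local parameter, but whether it is available depends on the Solr version, which this thread does not state. The helper name `complex_phrase_params` is made up for illustration, and the field name is taken from the original post. The snippet only builds request parameters in Python; it does not contact a server.

```python
from urllib.parse import urlencode, parse_qs


def complex_phrase_params(field, phrase, in_order=True):
    """Build Solr request parameters for a complexphrase query (sketch only)."""
    local = "{!complexphrase inOrder=%s}" % ("true" if in_order else "false")
    return {"q": '%s%s:"%s"' % (local, field, phrase)}


# The trailing ~ marks the single misspelled term inside the phrase as fuzzy.
params = complex_phrase_params(
    "bestExternalDescriptionStandard", "jakartd~ apache lucene"
)
query_string = urlencode(params)  # URL-encoded form, ready to append to /select?

print(params["q"])
print(query_string)
```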

On Wed, Apr 1, 2015 at 5:07 PM, Mirko Mancin <mi...@t-frutta.it>
wrote:

> It doesn't work with two words! :-(
>
> If I search for "jakartd apache lucene"~10, "jakarta apache lucene" is not found.
>
> But if I search for "jakarte apache lucene"~10, "jakarta apache lucene" IS found.
>
> WHY?!?!?!

Re: Problem with NGram

Posted by Mirko Mancin <mi...@t-frutta.it>.
It doesn't work with two words! :-(

If I search for "jakartd apache lucene"~10, "jakarta apache lucene" is not found.

But if I search for "jakarte apache lucene"~10, "jakarta apache lucene" IS found.

WHY?!?!?!

Mirko Mancin


Re: Problem with NGram

Posted by Mostafa Gomaa <mo...@gmail.com>.
Hello Mirko,

Try using fuzzy queries. You can do that by adding a tilde at the end of
the term you're searching for, like PRIN3ER~. It uses the edit distance
algorithm to find similar words. You can also set the maximum number of
edits by adding a number after the tilde; for example, PRIN3ER~2 matches
words that are within two edits of the term. Hope this helps.
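
For intuition about what "edits" means here, this is a small Python sketch of the classic Levenshtein edit distance (insertions, deletions and substitutions). It is an illustration only, not Lucene's implementation, and `edit_distance` is a name chosen for this example. The inputs are lowercased by hand, on the assumption that the indexed terms went through the LowerCaseFilterFactory shown in the schema.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


for typo in ("prin3er", "prin3r"):
    print(typo, "->", "printer", edit_distance(typo, "printer"))
```

Under this measure "prin3er" is one edit away from "printer" and "prin3r" is two, so the second spelling needs the ~2 form to be within range.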

Regards,

Mostafa Gomaa.
