Posted to dev@lucene.apache.org by "Yeongsu Kim (JIRA)" <ji...@apache.org> on 2019/02/25 07:55:00 UTC

[jira] [Updated] (LUCENE-8706) Nori with DISCARD mode misunderstands compound words, when synonym expansion

     [ https://issues.apache.org/jira/browse/LUCENE-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yeongsu Kim updated LUCENE-8706:
--------------------------------
    Description: 
I found a bug in the Nori tokenizer.

Let me describe what the problem is, using a concrete example.

Let's assume we have the dictionaries below.

< userdict_ko.txt >

     [ “lg”, “lgtv lg tv”, “tv”, “엘지티비”, “엘지”, “텔레비전”, “티비”, “하이” ]

     (“lgtv” is a compound word)

< synonyms.txt >

     [ “lgtv,엘지티비”, “lg,엘지”, “tv,텔레비전,티비” ]
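For context, each line of the user dictionary follows Nori's format: the first field is the surface form, and any remaining fields are its decompounded segments, so “lgtv lg tv” declares the compound “lgtv” = “lg” + “tv”. A minimal sketch of that parsing rule (illustrative only, not the actual Lucene dictionary parser):

```python
# Illustrative parser for Nori-style user dictionary lines: the first
# field is the surface form; remaining fields are its segments.
def parse_userdict_line(line):
    fields = line.split()
    surface, segments = fields[0], fields[1:]
    # A plain word has no extra fields; a compound lists its parts.
    return surface, (segments if segments else [surface])

entries = dict(parse_userdict_line(l) for l in [
    "lg", "lgtv lg tv", "tv", "엘지티비", "엘지", "텔레비전", "티비", "하이",
])
```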

 

Let's look at the results for the queries below.

   * Query1 : lgtv

   * Query2 : lg하이tv 

   * Query3 : lg              tv

 

We will also try each decompound mode: “NONE”, “DISCARD”, and “MIXED”.

Here are the test cases.
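For readers unfamiliar with the modes, here is an illustrative sketch (not the real KoreanTokenizer code) of what each mode emits for a dictionary token whose segments are known:

```python
# Illustrative simulation of Nori's three decompound modes applied to
# one dictionary token with known segments.
def decompound(surface, segments, mode):
    if len(segments) <= 1 or mode == "NONE":
        return [surface]             # keep the compound as one token
    if mode == "DISCARD":
        return segments              # emit only the parts
    if mode == "MIXED":
        return [surface] + segments  # emit the compound and its parts
    raise ValueError("unknown mode: " + mode)
```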

   * Test 1 (Query 1 + “MIXED”) - the analysis result is [“엘지티비”, “lgtv”, “lg”, “tv”]

   * Test 2 (Query 1 + “NONE”) - the analysis result is [“엘지티비”, “lgtv”]

   * Test 3 (Query 1 + “DISCARD”) - the analysis result is [“엘지티비”, “lg”, “tv”]

 

   * Test 4 (Query 2 + “MIXED”) - the analysis result is [“엘지”, “lg”, “하이”, “텔레비전”, “티비”, “tv”]

   * Test 5 (Query 2 + “NONE”) - the analysis result is [“엘지”, “lg”, “하이”, “텔레비전”, “티비”, “tv”]

   * Test 6 (Query 2 + “DISCARD”) - the analysis result is [“엘지”, “lg”, “하이”, “텔레비전”, “티비”, “tv”]

 

   * Test 7 (Query 3 + “MIXED”) - the analysis result is [“엘지”, “lg”, “텔레비전”, “티비”, “tv”]

   * Test 8 (Query 3 + “NONE”) - the analysis result is [“엘지”, “lg”, “텔레비전”, “티비”, “tv”]

   * Test 9 (Query 3 + “DISCARD”) - the analysis result is [“엘지티비”, “lg”, “tv”]   => (Here is the problem!!!)

 

I don’t understand why Test 9 produces that result. The result should be [“엘지”, “lg”, “텔레비전”, “티비”, “tv”], because Query 3 contains spaces between “lg” and “tv”.

 

The only difference between “DISCARD” and the other modes is that “DISCARD” does not store the compound token (e.g. “lgtv”) in the pending list. Because “DISCARD” keeps no compound token, the synonym filter may interpret the consecutive tokens “lg”, “tv” as the compound “lgtv”. However, many inputs produce the pair “lg”, “tv”: for example “lg tv”, “lg * tv”, “lg /// tv”, etc. (spaces and punctuation are removed during tokenization). The analysis chain should differentiate “lg tv” from “lgtv”.
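The ambiguity can be sketched with a toy multi-token synonym matcher. Once spaces and punctuation are gone and DISCARD has dropped the compound surface form, the token stream for “lg tv” is identical to the decompounded stream for “lgtv”, so the matcher fires in both cases (hypothetical simulation, not the real SynonymGraphFilter):

```python
# Toy simulation of a multi-token synonym rule: the consecutive pair
# ("lg", "tv") maps to the synonym "엘지티비".
PAIR_SYNONYM = {("lg", "tv"): "엘지티비"}

def expand(tokens):
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PAIR_SYNONYM:
            out.append(PAIR_SYNONYM[pair])  # pair matched as a compound
            out.extend(pair)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# DISCARD turns "lgtv" into ["lg", "tv"]; whitespace tokenization of
# "lg tv" yields the very same list, so both inputs expand identically.
```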

 

I suspect the fix involves how the Nori tokenizer communicates token positions to the generic synonym filter.

Thanks.

 

P.S.

The existing Nori code also fails when synonyms are combined with “MIXED” mode. For this test, I temporarily removed `compoundToken.setPositionIncrement(0);` from `KoreanTokenizer.java`, because `SynonymMap.java` throws an IllegalArgumentException when the position increment is not 1.
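To illustrate the position-increment constraint: in Lucene's convention a token's position is the previous position plus its increment, and stacked tokens (like the compound token in “MIXED” mode) use increment 0 to share a position with the token before them. A toy model assuming only that convention:

```python
# Toy model of Lucene token positions: each token advances the position
# by its positionIncrement; increment 0 stacks a token on the previous one.
def positions(increments):
    pos, out = -1, []
    for inc in increments:
        if inc not in (0, 1):
            raise ValueError("unexpected increment")
        pos += inc
        out.append(pos)
    return out

# MIXED stacks a decompounded part on the compound: increments [1, 0, 1]
# place two tokens at the same position, with the next one a step later.
# A filter that insists on increment 1 everywhere rejects the 0.
```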

 

> Nori with DISCARD mode misunderstands compound words, when synonym expansion 
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-8706
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8706
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Yeongsu Kim
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org