Posted to java-user@lucene.apache.org by Michael Sokolov <ms...@gmail.com> on 2019/05/25 19:03:03 UTC

JapaneseAnalyzer's system vs user dict

I'm trying to understand the relationship between the system and user
dictionaries that JapaneseAnalyzer uses. The API allows a user to
provide a user dictionary; the system one is built in. Are they
otherwise the same kind of thing? If I provide entries in the user
dictionary is it just as if I had included them in the system
dictionary? If the same entry occurs in both, do the user dictionary
weights supersede those in the system dictionary? Is there some way to
suppress entries in the system dict?  I hunted for documentation, but
didn't find answers to these questions, and the code is pretty
involved, so any pointers would be greatly appreciated.

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
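
As a rough illustration of how the two kinds of entries differ on disk, here is a minimal Python sketch (not Lucene code). It assumes a MeCab-style system CSV row whose columns 2 to 4 are the left context id, right context id, and cost, and a Kuromoji-style user dictionary row of the form surface, segmentation, readings, part-of-speech. The class and function names are hypothetical, and the numeric values in the sample system row are made up rather than copied from IPADIC.

```python
import csv
from dataclasses import dataclass
from typing import List


@dataclass
class SystemEntry:
    surface: str
    left_id: int         # column 2: left context id (produced by training)
    right_id: int        # column 3: right context id (produced by training)
    cost: int            # column 4: word cost (produced by training)
    features: List[str]  # remaining columns: part-of-speech, reading, etc.


@dataclass
class UserEntry:
    surface: str
    segments: List[str]  # how the surface form should be split
    readings: List[str]  # one reading per segment
    pos: str             # part-of-speech label chosen by the user


def parse_system_row(line: str) -> SystemEntry:
    """Parse one MeCab-style system dictionary CSV row."""
    row = next(csv.reader([line]))
    return SystemEntry(row[0], int(row[1]), int(row[2]), int(row[3]), row[4:])


def parse_user_row(line: str) -> UserEntry:
    """Parse one user dictionary row: surface,segmentation,readings,pos."""
    surface, segmentation, readings, pos = next(csv.reader([line]))
    return UserEntry(surface, segmentation.split(), readings.split(), pos)


if __name__ == "__main__":
    # Illustrative values only; the ids and cost are NOT real IPADIC numbers.
    print(parse_system_row("東京,1,2,300,名詞,固有名詞"))
    print(parse_user_row("関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞"))
```

Note that the user row carries no context ids or cost column of its own. How the analyzer weighs such entries against system entries is exactly what the question above asks, and this sketch does not answer it.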

Re: JapaneseAnalyzer's system vs user dict

Posted by Namgyu Kim <kn...@gmail.com>.

Hi Tomoko :D

Thank you for your reply and for listening to my thoughts.
I didn't know this was an old question.
Of course, I want to participate in the LUCENE-8816 issue.

I think this issue will take some time.
I'll look into it.

Warm regards,
Namgyu Kim



Re: JapaneseAnalyzer's system vs user dict

Posted by Tomoko Uchida <to...@gmail.com>.

Hi guys,

I just created an issue related to this thread.

Decouple Kuromoji's morphological analyser and its dictionary
https://issues.apache.org/jira/browse/LUCENE-8816

The problem discussed here is essentially rooted in the current
architecture of Kuromoji (and Nori): the system dictionary is bundled in the jar.
So the most natural solution is to decouple the Viterbi logic from the
encoded dictionary (just as traditional Japanese morphological
analysis engines do).
This is actually an old question with respect to Kuromoji, but I feel
it's a good time to rethink it.
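
To make the decoupling idea concrete, here is a toy Python sketch. It is not Kuromoji's actual code: the `Dictionary` protocol, `InMemoryDictionary`, and `segment` are hypothetical names, all costs are made up, and unknown-word handling and the end-of-sentence connection cost are left out. The point is only that the Viterbi search talks to the dictionary through a small interface, so a different dictionary can be swapped in without touching the search.

```python
from typing import Dict, List, Protocol, Tuple

Entry = Tuple[int, int, int]  # (left_id, right_id, word_cost)


class Dictionary(Protocol):
    """Everything the search needs to know about a dictionary."""

    def lookup(self, surface: str) -> List[Entry]: ...

    def connection_cost(self, prev_right_id: int, next_left_id: int) -> int: ...


class InMemoryDictionary:
    """Toy dictionary: a word table plus a (right_id, left_id) cost matrix."""

    def __init__(self, words: Dict[str, List[Entry]],
                 matrix: Dict[Tuple[int, int], int]):
        self.words = words
        self.matrix = matrix

    def lookup(self, surface: str) -> List[Entry]:
        return self.words.get(surface, [])

    def connection_cost(self, prev_right_id: int, next_left_id: int) -> int:
        # Missing pairs default to 0 in this toy model.
        return self.matrix.get((prev_right_id, next_left_id), 0)


def segment(text: str, dictionary: Dictionary) -> List[str]:
    """Return the lowest-cost segmentation of text (Viterbi search)."""
    n = len(text)
    # best[pos][right_id] = (total_cost, prev_pos, prev_right_id, surface)
    best: List[Dict[int, Tuple[int, int, int, str]]] = [dict() for _ in range(n + 1)]
    best[0][0] = (0, -1, -1, "")  # beginning-of-sentence node, context id 0
    for start in range(n):
        if not best[start]:
            continue  # no path reaches this position
        for end in range(start + 1, n + 1):
            surface = text[start:end]
            for left_id, right_id, word_cost in dictionary.lookup(surface):
                for prev_right, (prev_cost, _, _, _) in best[start].items():
                    cost = (prev_cost
                            + dictionary.connection_cost(prev_right, left_id)
                            + word_cost)
                    current = best[end].get(right_id)
                    if current is None or cost < current[0]:
                        best[end][right_id] = (cost, start, prev_right, surface)
    if not best[n]:
        raise ValueError("no path covers the input with this dictionary")
    # Backtrack from the cheapest state at the end of the input.
    right_id = min(best[n], key=lambda r: best[n][r][0])
    pos, tokens = n, []
    while pos > 0:
        _, prev_pos, prev_right, surface = best[pos][right_id]
        tokens.append(surface)
        pos, right_id = prev_pos, prev_right
    return tokens[::-1]
```

With this shape, switching dictionaries is just a matter of passing a different object to `segment`; the search itself does not change.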

It will take time (and to be honest I'm not sure the patch will be
accepted), but I think it's much better than applying monkey-fixes to
the current build script.
If you are seriously interested in this work, please feel free to get involved.

Tomoko



Re: JapaneseAnalyzer's system vs user dict

Posted by Tomoko Uchida <to...@gmail.com>.

Hi Namgyu,

> There is a team that uses a well-ported system dictionary.
> The Lucene version is up. (like 8.1 -> 8.2)
> Suppose there was no modification to kuromoji in 8.2.
> But the user has to port again.
> The same goes for 8.2 to 8.3.

I'm not sure about the situation in Korea, but we also have some
frequently updated, well-maintained (by NLP professionals) system
dictionaries:
1. neologd (a MeCab IPADIC extension) and 2. Sudachi (a UniDic extension
partially including neologd), which I mentioned in my previous mail.
I agree that it's laborious to rebuild the tokenizer every time
you upgrade.

In both cases, some outstanding contributors build and distribute
plugins including an up-to-date dictionary at a constant pace, and other
users just use them. This seems to work well, at least in Japan, for
now.
Maybe we can start outside of the Lucene project in the same way? If
the workflow works well and it's really needed, developers can propose
the change (a patch for the build script, and possibly a system
dictionary operation or update policy would also be needed) in Jira
at any time.

I know that the current JapaneseAnalyzer's system dictionary (MeCab
IPADIC) has not been maintained for ten years and developers/users
often complain about it.
For now I just see the effort of the developer community (including
me) to try to find good solutions for that.

Thanks,
Tomoko

On Tue, May 28, 2019 at 2:42, Namgyu Kim <kn...@gmail.com> wrote:
>
> Thank you for your reply, Tomoko :D
>
> To be honest, I have not experienced it directly (I mean, in a commercial setting), so I
> can't speak to the exact situation of Japanese MeCab.
> I respect your opinion, and it is true that customization is a difficult
> task.
>
> But I can talk a little about Korean MeCab. (The basic logic is the
> same.)
> In the case of Hangul MeCab, system dictionary changes are very frequent.
> Developers do not design the engine from scratch, so they tend to do a
> lot of tuning at some level (custom model, score matrix, custom
> dictionary).
> Especially in commercial use, developers do a lot of tuning to make
> the dictionary as suitable as possible for their purpose.
> (Of course, the big tech companies use their own analyzers :D)
>
> MeCab is especially popular in Korea, so there are many such attempts.
> Developers often port it to Elasticsearch and use it heavily, but they have to
> do a lot of boring work every time.
> (It is not the Korean MeCab case, but I think Mike and Trejkaz were talking in that
> sense.)
>
> There is another bad case.
>
> Suppose a team uses a well-ported system dictionary.
> The Lucene version goes up (e.g. 8.1 -> 8.2).
> Suppose there was no modification to kuromoji in 8.2.
> The user still has to port again.
> The same goes for 8.2 to 8.3.
> Even if kuromoji has a fix that is not related to the dictionary, the user
> has to port each time.
>
> At least if we allow them to read custom dat files, these problems can
> disappear.
>
> Warm regards,
> Namgyu Kim
>
> On Mon, May 27, 2019 at 8:21 AM Tomoko Uchida <to...@gmail.com>
> wrote:
>
> > > Anyway, in my personal opinion, Lucene does not need to consider whether
> > > the system dictionary status is good or not.
> >
> > Please don't get me wrong, but I don't agree.
> > Creating a customized or re-trained system dictionary still requires deep
> > knowledge of the language and of machine learning. Even among us
> > native Japanese speakers, very few people can do so.
> > The system dictionary is a key component of tokenization, so a badly
> > customized system dictionary directly affects search quality,
> > and I think we should prevent that. Instead of messing up the system
> > dictionary without sufficient knowledge, please use the user
> > dictionary. That is the reason it exists.
> >
> > Anyway, to build the system dictionary (MeCab IPADIC extensions), you
> > do not need to read or fix the DictionaryBuilder class.
> > Just modify analysis/kuromoji/build.xml to use the
> > customized/re-trained dictionary (tarball).
> >
> > Tomoko
> >
> > On Mon, May 27, 2019 at 1:48, Namgyu Kim <kn...@gmail.com> wrote:
> > >
> > > Oh, I think my explanation was not enough. Sorry...
> > >
> > > I mentioned the following steps.
> > > =============================
> > > 1. Modify your dictionary file and rebuild.
> > >   1-1) Install MeCab
> > >   1-2) Install MeCab Dictionary
> > >   1-3) Modify your dictionary file
> > >   1-4) Package it as tar.gz
> > > =============================
> > > Step "1-3)" does not mean that the user modifies the csv files and compresses them
> > > back to tar.gz.
> > > It means re-training; of course the user has to be careful and have knowledge
> > > of Natural Language Processing.
> > > Columns 2, 3 and 4 of the csv are values produced by training.
> > > (2: left context id, 3: right context id, 4: cost)
> > > These values depend on the model and the matrix.def values (when using
> > > mecab-dict-index).
> > >
> > > That's why I mentioned steps "1-1)" and "1-2)" first.
> > >
> > > Anyway, in my personal opinion, Lucene does not need to consider whether
> > > the system dictionary status is good or not.
> > > I just think that when a user wants to use a custom system dictionary, it
> > > is not user-friendly to have to modify the ant file or spend a long
> > > time digging through code to run the DictionaryBuilder.
> > > I think there should be at least a guide.
> > >
> > > Warm regards,
> > > Namgyu Kim
> > >
> > > P.S. Although not as good as Tomoko's list, there is a list of
> > > dictionaries supported by kuromoji.
> > > (https://github.com/atilika/kuromoji#supported-dictionaries)
> > >
> > >
> > > On Mon, May 27, 2019 at 12:12 AM, Tomoko Uchida <tomoko.uchida.1111@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > The system dictionary is not a mere "word collection"; it includes a
> > > > machine-learned language model which is carefully trained by
> > > > researchers. If you want to replace the system dictionary, you have to
> > > > start by re-training the model. This needs expert knowledge, so I do
> > > > not recommend just modifying the CSVs and rebuilding it (unless you
> > > > have an expert on hand).
> > > >
> > > > As for "modern words" that are not included in the current
> > > > system dictionary, there are already a few options.
> > > >
> > > > 1. Use neologd dictionary (it's an extension of MeCab IPADIC,
> > > > Kuromoji's default dictionary)
> > > >
> > > > For Solr:
> > > > https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
> > > > (The branch is mine. It's a little old, but you can cherry-pick the
> > > > changes in kuromoji's build.xml.)
> > > >
> > > > For Elasticsearch:
> > > > https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd
> > > >
> > > > 2. Use Sudachi dictionary
> > > >
> > > > For Elasticsearch:
> > > > https://github.com/WorksApplications/elasticsearch-sudachi
> > > > This includes a Lucene jar, so I think you can extract the jar for Solr
> > > > (I've never tried using it with Solr).
> > > >
> > > > Both are actively maintained by linguistics & NLP researchers/engineers.
> > > > Please be careful, those are rather huge jars...
> > > >
> > > > Hope that helps.
> > > >
> > > > Tomoko
> > > >
> > > > On Sun, May 26, 2019 at 23:11, Trejkaz <tr...@trypticon.org> wrote:
> > > > >
> > > > > On Sun, 26 May 2019 at 23:49, Namgyu Kim <kn...@gmail.com> wrote:
> > > > >
> > > > > > I think so about that approach.
> > > > > > It's not user-friendly and it is not good for the user.
> > > > > >
> > > > > > I think it's better to get the parameters in
> > > > > > JapaneseTokenizer.
> > > > > >
> > > > > > What do you think about this?
> > > > >
> > > > > A way to override the system dictionary would be useful for us as
> > > > > well. We often get people complaining that the current dictionary is missing
> > > > > a lot of common modern words, and there are alternate mecab dictionaries
> > > > > sitting around already which solve this problem.
> > > > >
> > > > > TX


Re: JapaneseAnalyzer's system vs user dict

Posted by Namgyu Kim <kn...@gmail.com>.

Thank you for your reply, Tomoko :D

To be honest, I have not worked with it directly (that is, in a commercial
product), so I can't speak to the exact situation with Japanese MeCab.
I respect your opinion, and it is true that customization is a difficult
task.

But I can talk a little about Korean MeCab. (The basic logic is the same.)
With Korean MeCab, changes to the system dictionary are very frequent.
Developers do not design the engine from scratch, so they tend to do a lot
of tuning at some level (custom model, score matrix, custom dictionary).
In commercial products especially, developers tune heavily to produce the
dictionary best suited to their purpose.
(Of course, the big tech companies use their own analyzers :D)

MeCab is especially popular in Korea, so there are many such attempts.
Developers often port it to Elasticsearch and use it heavily, but they have
to repeat a lot of tedious work every time.
(That is not specifically the Korean MeCab case, but I think Mike and
Trejkaz were talking about the same thing.)

There is another bad case.

Suppose a team uses a well-ported system dictionary, and the Lucene version
goes up (say 8.1 -> 8.2).
Even if kuromoji was not modified at all in 8.2, the team has to port the
dictionary again.
The same goes for 8.2 to 8.3.
Even if kuromoji only has a fix unrelated to the dictionary, the team has
to port it each time.

If we at least allow users to read custom dat files, these problems would
disappear.
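
As a rough illustration of what "reading custom dat files" could mean, here
is a minimal sketch in plain Java. It is only an assumption about the shape
of such a feature: DictionaryResourceLoader, its open() method and the file
name used below are all hypothetical and are not part of Lucene. The idea is
that a caller may point to an external directory, and the bundled classpath
resource is used only when no directory is given.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration only; this is not a Lucene class. */
final class DictionaryResourceLoader {
  private DictionaryResourceLoader() {}

  /**
   * Opens one dictionary data file. When externalDir is non-null the file is
   * read from that directory; otherwise it is looked up on the classpath,
   * relative to this class.
   */
  static InputStream open(Path externalDir, String fileName) throws IOException {
    if (externalDir != null) {
      // Custom dictionary: read the file the user supplied.
      return Files.newInputStream(externalDir.resolve(fileName));
    }
    // Default: fall back to the resource bundled with the jar.
    InputStream in = DictionaryResourceLoader.class.getResourceAsStream(fileName);
    if (in == null) {
      throw new IOException("dictionary resource not found on classpath: " + fileName);
    }
    return in;
  }
}
```

With something like this, a ported dictionary would live outside the jar, so
upgrading the analyzer jar would not by itself force a re-port (as long as
the binary format stays compatible, which this sketch does not check).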

Warm regards,
Namgyu Kim

On Mon, May 27, 2019 at 8:21 AM Tomoko Uchida <to...@gmail.com>
wrote:

> > Anyway, in my personal opinion, Lucene does not need to consider whether
> > the system dictionary status is good or not.
>
> Please don't get me wrong, but I don't agree.
> Creating a customized or re-trained system dictionary still requires deep
> knowledge of the language and of machine learning. Even among us native
> Japanese speakers, very few people can do so.
> The system dictionary is a key component of tokenization, so a badly
> customized system dictionary directly affects search quality,
> and I think we should prevent that. Instead of messing up the system
> dictionary without sufficient knowledge, please use the user
> dictionary. That is the reason it exists.
>
> Anyway, to build the system dictionary (MeCab IPADIC extensions), you
> do not need to read or fix the DictionaryBuilder class.
> Just modify analysis/kuromoji/build.xml to use the
> customized/re-trained dictionary (tarball).
>
> Tomoko

Re: JapaneseAnalyzer's system vs user dict

Posted by Tomoko Uchida <to...@gmail.com>.

> Anyway, in my personal opinion, Lucene does not need to consider whether
> the system dictionary status is good or not.

Please don't get me wrong, but I don't agree.
Creating a customized or re-trained system dictionary still requires deep
knowledge of the language and of machine learning. Even among us native
Japanese speakers, very few people can do so.
The system dictionary is a key component of tokenization, so a badly
customized system dictionary directly affects search quality,
and I think we should prevent that. Instead of messing up the system
dictionary without sufficient knowledge, please use the user
dictionary. That is the reason it exists.
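
For readers who have not seen one: as far as I understand the format, a
kuromoji user dictionary entry is a CSV line of the form
surface,segmentation,readings,part-of-speech, where the segmentation and
the readings are space-separated and must have the same number of parts.
The sketch below is plain Java that only parses such a line. The class name
UserDictEntry is made up for illustration, the exactly-four-columns check is
a simplification, and this is not Lucene's own UserDictionary parser.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: one parsed user-dictionary line. Not a Lucene class. */
final class UserDictEntry {
  final String surface;        // text to match in the input
  final List<String> segments; // how the surface form should be split
  final List<String> readings; // one reading per segment
  final String partOfSpeech;   // tag assigned to the segments

  private UserDictEntry(
      String surface, List<String> segments, List<String> readings, String partOfSpeech) {
    this.surface = surface;
    this.segments = segments;
    this.readings = readings;
    this.partOfSpeech = partOfSpeech;
  }

  /** Parses "surface,segmentation,readings,pos"; quoted commas are not handled. */
  static UserDictEntry parse(String line) {
    String[] cols = line.split(",", -1);
    if (cols.length != 4) {
      throw new IllegalArgumentException("expected 4 columns but got " + cols.length);
    }
    List<String> segments = Arrays.asList(cols[1].trim().split(" +"));
    List<String> readings = Arrays.asList(cols[2].trim().split(" +"));
    if (segments.size() != readings.size()) {
      throw new IllegalArgumentException("segmentation and readings differ in length");
    }
    return new UserDictEntry(cols[0], segments, readings, cols[3]);
  }
}
```

An entry such as
"東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
would parse into two segments with two readings. Note that there is no cost
column, which is one way the user dictionary differs from the system
dictionary's CSV files.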

Anyway, to build the system dictionary (MeCab IPADIC extensions), you
do not need to read or fix the DictionaryBuilder class.
Just modify analysis/kuromoji/build.xml to use the
customized/re-trained dictionary (tarball).

Tomoko

On Mon, May 27, 2019 at 1:48, Namgyu Kim <kn...@gmail.com> wrote:
>
> Oh, I think my explanation was not clear enough. Sorry...
>
> I wrote the following steps:
> =============================
> 1. Modify your dictionary file and rebuild.
>   1-1) Install MeCab
>   1-2) Install MeCab Dictionary
>   1-3) Modify your dictionary file
>   1-4) Package it as tar.gz
> =============================
> Step "1-3)" does not mean that the user edits the csv files and compresses
> them back into a tar.gz.
> It means re-training, and of course the user has to be careful and have
> knowledge of Natural Language Processing.
> Columns 2, 3 and 4 of the csv are values produced by training.
> (2: left context id, 3: right context id, 4: cost)
> These values depend on the model and the matrix.def values (when using
> mecab-dict-index).
>
> That's why I listed steps "1-1)" and "1-2)" first.
>
> Anyway, in my personal opinion, Lucene does not need to consider whether
> the system dictionary status is good or not.
> I just think that when a user wants to use a custom system dictionary, it
> is not user-friendly to have to modify the ant file or dig through the code
> for a long time to run the DictionaryBuilder.
> I think there should at least be a guide.
>
> Warm regards,
> Namgyu Kim
>
> P.S. Although it is not as good as Tomoko's list, there is a list of
> dictionaries supported by kuromoji.
> (https://github.com/atilika/kuromoji#supported-dictionaries)


Re: JapaneseAnalyzer's system vs user dict

Posted by Namgyu Kim <kn...@gmail.com>.

Oh, I think my explanation was not clear enough. Sorry...

I wrote the following steps:
=============================
1. Modify your dictionary file and rebuild.
  1-1) Install MeCab
  1-2) Install MeCab Dictionary
  1-3) Modify your dictionary file
  1-4) Package it as tar.gz
=============================
Step "1-3)" does not mean that the user edits the csv files and compresses
them back into a tar.gz.
It means re-training, and of course the user has to be careful and have
knowledge of Natural Language Processing.
Columns 2, 3 and 4 of the csv are values produced by training.
(2: left context id, 3: right context id, 4: cost)
These values depend on the model and the matrix.def values (when using
mecab-dict-index).
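
To make the column layout concrete, here is a minimal sketch in plain Java
that reads the first four columns of one MeCab-style CSV line. Column 1 being
the surface form is standard MeCab layout rather than something stated above,
and the class name MecabCsvEntry is made up for illustration. The naive
comma split does not handle quoted fields, so treat this as a sketch, not as
a real dictionary reader.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative holder for one MeCab-style CSV entry. Not a Lucene or MeCab class. */
final class MecabCsvEntry {
  final String surface;        // column 1: surface form
  final int leftContextId;     // column 2: left context id (from training)
  final int rightContextId;    // column 3: right context id (from training)
  final int cost;              // column 4: word cost (from training)
  final List<String> features; // remaining columns, kept as plain strings

  private MecabCsvEntry(
      String surface, int leftContextId, int rightContextId, int cost, List<String> features) {
    this.surface = surface;
    this.leftContextId = leftContextId;
    this.rightContextId = rightContextId;
    this.cost = cost;
    this.features = features;
  }

  /** Parses one line; assumes no field contains a quoted comma. */
  static MecabCsvEntry parse(String line) {
    String[] cols = line.split(",", -1);
    if (cols.length < 4) {
      throw new IllegalArgumentException("expected at least 4 columns: " + line);
    }
    return new MecabCsvEntry(
        cols[0],
        Integer.parseInt(cols[1].trim()),
        Integer.parseInt(cols[2].trim()),
        Integer.parseInt(cols[3].trim()),
        Collections.unmodifiableList(Arrays.asList(cols).subList(4, cols.length)));
  }
}
```

The two context ids index into the connection-cost matrix (matrix.def), which
is why an entry edited by hand without re-training can end up with ids and a
cost that no longer fit the model.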

That's why I listed steps "1-1)" and "1-2)" first.

Anyway, in my personal opinion, Lucene does not need to consider whether
the system dictionary status is good or not.
I just think that when a user wants to use a custom system dictionary, it is
not user-friendly to have to modify the ant file or dig through the code for
a long time to run the DictionaryBuilder.
I think there should at least be a guide.

Warm regards,
Namgyu Kim

P.S. Although it is not as good as Tomoko's list, there is a list of
dictionaries supported by kuromoji.
(https://github.com/atilika/kuromoji#supported-dictionaries)

On Mon, May 27, 2019 at 12:12 AM, Tomoko Uchida <to...@gmail.com> wrote:

> Hi,
>
> The system dictionary is not a mere "word collection"; it includes a
> machine-learned language model that was carefully trained by
> researchers. If you want to replace the system dictionary, you have to
> start by re-training the model. This requires expert knowledge, so I do
> not recommend just modifying the CSVs and rebuilding (unless you
> have an expert on hand).
>
> As for "modern words" that are not included in the current
> system dictionary, there are already a few options.
>
> 1. Use the neologd dictionary (an extension of MeCab IPADIC,
> Kuromoji's default dictionary)
>
> For Solr:
> https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
> (The branch is mine. It is a little old, but you can cherry-pick the
> changes to kuromoji's build.xml.)
>
> For Elasticsearch:
> https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd
>
> 2. Use the Sudachi dictionary
>
> For Elasticsearch:
> https://github.com/WorksApplications/elasticsearch-sudachi
> This includes a Lucene jar, so I think you can extract the jar for Solr
> (I've never tried using it with Solr).
>
> Both are actively maintained by linguistics & NLP researchers/engineers.
> Please be careful, those are rather large jars...
>
> Hope that helps.
>
> Tomoko

Re: JapaneseAnalyzer's system vs user dict

Posted by Tomoko Uchida <to...@gmail.com>.

Hi,

The system dictionary is not a mere "word collection"; it includes a
machine-learned language model that was carefully trained by
researchers. If you want to replace the system dictionary, you have to
start by re-training the model. This requires expert knowledge, so I do
not recommend just modifying the CSVs and rebuilding (unless you
have an expert on hand).

As for "modern words" that are not included in the current
system dictionary, there are already a few options.

1. Use the neologd dictionary (an extension of MeCab IPADIC,
Kuromoji's default dictionary)

For Solr: https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
(The branch is mine. It is a little old, but you can cherry-pick the
changes to kuromoji's build.xml.)

For Elasticsearch:
https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd

2. Use the Sudachi dictionary

For Elasticsearch: https://github.com/WorksApplications/elasticsearch-sudachi
This includes a Lucene jar, so I think you can extract the jar for Solr
(I've never tried using it with Solr).

Both are actively maintained by linguistics & NLP researchers/engineers.
Please be careful, those are rather large jars...

Hope that helps.

Tomoko

On Sun, May 26, 2019 at 23:11, Trejkaz <tr...@trypticon.org> wrote:
>
> On Sun, 26 May 2019 at 23:49, Namgyu Kim <kn...@gmail.com> wrote:
>
> > I agree about that approach.
> > It's not user-friendly and it is not good for the user.
> > I think it's better to get the parameters in the constructor of
> > JapaneseTokenizer.
> >
> > What do you think about this?
>
> A way to override the system dictionary would be useful for us as well. We
> often get people complaining that the current dictionary is missing a lot
> of common modern words, and there are alternate mecab dictionaries sitting
> around already which solve this problem.
>
> TX


Re: JapaneseAnalyzer's system vs user dict

Posted by Trejkaz <tr...@trypticon.org>.

On Sun, 26 May 2019 at 23:49, Namgyu Kim <kn...@gmail.com> wrote:

> I agree about that approach.
> It's not user-friendly and it is not good for the user.
> I think it's better to get the parameters in the constructor of
> JapaneseTokenizer.
>
> What do you think about this?

A way to override the system dictionary would be useful for us as well. We
often get people complaining that the current dictionary is missing a lot
of common modern words, and there are alternate mecab dictionaries sitting
around already which solve this problem.

TX

Re: JapaneseAnalyzer's system vs user dict

Posted by Namgyu Kim <kn...@gmail.com>.

I've been able to build a dictionary using DictionaryBuilder (I guess that
is what the "regenerate" task must be using?)
=>
Yes, that's right.
The "regenerate" task runs these targets in the following order:
1) Compile the code (compile-tools)
2) Download the dictionary archive (download-dict)
3) Apply the Noun.proper.csv patch (patch-dict)
4) Run DictionaryBuilder (build-dict)

Not a very user-friendly approach
=>
I agree about that approach.
It's not user-friendly and it is not good for the user.
I think it would be better to pass the parameters in the constructor of
JapaneseTokenizer.

What do you think about this?
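
To show the kind of constructor I have in mind, here is a minimal sketch in
plain Java. Everything in it is hypothetical: DictionaryLookupSketch is not a
Lucene class, the dictionaries are modelled as simple maps from a surface
form to candidate entries, and real lookup goes through FSTs rather than
exact string matches. The lookup order follows what Tomoko described in this
thread: the user dictionary is consulted first, and the system dictionary
only when nothing matched.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch of passing both dictionaries in; not Lucene's API. */
final class DictionaryLookupSketch {
  private final Map<String, List<String>> systemDict;
  private final Map<String, List<String>> userDict; // may be null

  /** Both dictionaries come in through the constructor; userDict is optional. */
  DictionaryLookupSketch(
      Map<String, List<String>> systemDict, Map<String, List<String>> userDict) {
    this.systemDict = Objects.requireNonNull(systemDict, "systemDict");
    this.userDict = userDict;
  }

  /** Returns candidate entries for a surface form, user dictionary first. */
  List<String> candidates(String surface) {
    if (userDict != null) {
      List<String> fromUser = userDict.get(surface);
      if (fromUser != null && !fromUser.isEmpty()) {
        // A user entry suppresses every system candidate for this surface.
        return fromUser;
      }
    }
    return systemDict.getOrDefault(surface, List.of());
  }
}
```

With a constructor like this, swapping in a custom system dictionary would be
a matter of passing a different object, with no ant targets or jar rebuild
involved.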

Warm regards,
Namgyu Kim


On Sun, May 26, 2019 at 9:19 PM, Michael Sokolov <ms...@gmail.com> wrote:

> Thanks, Namgyu. I've been able to build a dictionary using
> DictionaryBuilder (I guess that is what the "regenerate" task must be
> using?) and I can replace the existing one on the classpath with jar
> surgery for now. Not a very user-friendly approach, but it will enable
> me to run some experiments and see whether this is truly necessary for
> my use case.

Re: JapaneseAnalyzer's system vs user dict

Posted by Michael Sokolov <ms...@gmail.com>.

Thanks, Namgyu. I've been able to build a dictionary using
DictionaryBuilder (I guess that is what the "regenerate" task must be
using?) and I can replace the existing one on the classpath with jar
surgery for now. Not a very user-friendly approach, but it will enable
me to run some experiments and see whether this is truly necessary for
my use case.

On Sun, May 26, 2019 at 7:56 AM Namgyu Kim <kn...@gmail.com> wrote:
>
> Sorry for the wrong information, Mike.
> Tomoko is right.
> I got it wrong.
>
> The user dictionary is independent of the system dictionary. If you
> provide user entries, JapaneseTokenizer builds two FSTs, one for the
> built-in dictionary and one for the user dictionary, and they are
> looked up separately.
>
> Please ignore the following lines in my e-mail.
> ================================================
> Japanese Analyzer does not load dictionaries by default.
> ...
> Since it is a way to create and pass the UserDictionary object, there is no
> conflict between user dictionary and system dictionary.
> (You may choose only one of them! -> means userFST instance in
> JapaneseTokenizer)
> =================================================
>
> The system dictionary and the user dictionary are separate, and you can
> use both at once.
>
> About the system dictionary:
> As far as I know, it is not possible to change the system dictionary at
> the code level.
> The part that reads the system dictionary is hard-coded
> (TokenInfoDictionary, UnknownDictionary, BinaryDictionary).
> If you really need it, could you open a JIRA issue and work on it with me?
>
> But there is a way to build a new kuromoji jar.
> 1. Modify your dictionary file and rebuild.
>   1-1) Install MeCab
>   1-2) Install MeCab Dictionary
>   1-3) Modify your dictionary file
>   1-4) Package it as tar.gz
> 2. Change kuromoji/ivy.xml from
> <artifact name="ipadic" type=".tar.gz" url="https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz"/>
> to
> <artifact name="ipadic" type=".tar.gz" url="file:///your/tar path/new_dic.tar.gz"/>
> 3. Run "ant regenerate" in /your/path/lucene-solr/lucene/analysis/kuromoji
> 4. Run "ant jar"
>
> I hope this helps.
>
> Warm regards,
> Namgyu Kim
>
> On Sun, May 26, 2019 at 9:03 AM, Michael Sokolov <ms...@gmail.com> wrote:
>
> > Thank you for the detailed responses! What Tomoko is saying seems
> > consistent with my cursory reading of the code. The reason I asked is
> > I have a customer that thinks they want to replace the system
> > dictionary, and I am trying to see if that is necessary. It seems as
> > if for the most part, we can supply a comprehensive user dictionary
> > and it would pretty much take the place of the system dictionary,
> > assuming it is a superset (covers at least the original system dict
> > tokens), but there is probably no way to "remove" a token that is
> > present in the system dictionary (or maybe it can effectively be
> > removed by adding it to user dictionary with a high penalty?). I'm not
> > sure why one would want to do this removal, just trying to understand
> > the design parameters.
> >
> > On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
> > <to...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > > If I provide entries in the user
> > > > dictionary is it just as if I had included them in the system
> > > > dictionary? If the same entry occurs in both, do the user dictionary
> > > > weights supersede those in the system dictionary? Is there some way to
> > > > suppress entries in the system dict?
> > >
> > > The user dictionary is independent of the system dictionary. If you
> > > provide user entries, JapaneseTokenizer builds two FSTs, one for the
> > > built-in dictionary and one for the user dictionary, and they are
> > > looked up separately.
> > >
> > > The user dictionary is looked up first, and the system dictionary is
> > > looked up only if no entries matched. So if any entry is
> > > found in the user dictionary, all possible candidates in the system
> > > dictionary are ignored (suppressed).
> > >
> > > (I think this is kuromoji-specific behaviour; the original mecab POS
> > > tagger looks up both the system dictionary and the user dictionary and
> > > compares their weights by running Viterbi. In fact this behaviour,
> > > always giving priority to the entries in the user dictionary, is a bit
> > > too aggressive from the point of view of tokenization consistency.
> > > I do not know why it works this way, but there may be some performance
> > > reasons?)
> > >
> > > I think you can easily find the lookup logic I described here in the
> > > JapaneseTokenizer#parse() method. (Let me know if my understanding is
> > > not correct.)
> > >
> > > Regards,
> > > Tomoko
> > >
> > > On Sun, May 26, 2019 at 5:08, 김남규 <kn...@gmail.com> wrote:
> > > >
> > > > Hi, Mike :D
> > > >
> > > > Japanese Analyzer does not load dictionaries by default.
> > > > If you look at the constructor, you can see that it is created as null
> > if
> > > > not set parameters.
> > > > (check testUserDict3() in TestJapaneseAnalyzer.java)
> > > >
> > > > In JapaneseTokenizer,
> > > > =============================================
> > > > if (userDictionary != null) {
> > > >   userFST = userDictionary.getFST();
> > > >   userFSTReader = userFST.getBytesReader();
> > > > } else {
> > > >   userFST = null;
> > > >   userFSTReader = null;
> > > > }
> > > > =============================================
> > > > Since it is a way to create and pass the UserDictionary object, there
> > is no
> > > > conflict between user dictionary and system dictionary.
> > > > (You may choose only one of them! -> means userFST instance in
> > > > JapaneseTokenizer)
> > > >
> > > > About dictionary,
> > > > Lucene has one pre-built dictionary by default since Lucene 3.6.
> > > > You can check it in org.apache.lucene.analysis.ja.dict.
> > > > It called MeCab which uses the Viterbi algorithm.
> > > > In Lucene, Convert MeCab dictionary(in Lucene, some dat files) to FST
> > and
> > > > use
> > > > But it can't satisfy all users.
> > > > Depending on the situation, some user may need a custom dictionary.
> > > > It is also same for Nori(Korean Analyzer) since Lucene 7.4. (The basic
> > > > logic(MeCab + FST) is similar to Japanese Analyzer)
> > > > The original Korean MeCab dictionary size is almost 220MB, but Lucene's
> > > > dictionary size is 24MB.
> > > > If the user needs a dictionary of 100MB size, the user must build and
> > use
> > > > it.
> > > > (Modify MeCab Dictionary -> Training -> Porting to Lucene)
> > > >
> > > > If anyone find some wrong information in my reply, please send a reply
> > with
> > > > the correction.
> > > >
> > > > Thank you,
> > > > Namgyu Kim
> > > >
> > > >
> > > > 2019년 5월 26일 (일) 오전 4:03, Michael Sokolov <ms...@gmail.com>님이 작성:
> > > >
> > > > > I'm trying to understand the relationship between the system and user
> > > > > dictionaries that JapaneseAnalyzer uses. The API allows a user to
> > > > > provide a user dictionary; the system one is built in. Are they
> > > > > otherwise the same kind of thing? If I provide entries in the user
> > > > > dictionary is it just as if I had included them in the system
> > > > > dictionary? If the same entry occurs in both, do the user dictionary
> > > > > weights supersede those in the system dictionary? Is there some way
> > to
> > > > > suppress entries in the system dict?  I hunted for documentation, but
> > > > > didn't find answers to these questions, and the code is pretty
> > > > > involved, so any pointers would be greatly appreciated.
> > > > >
> > > > > -Mike
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >


Re: JapaneseAnalyzer's system vs user dict

Posted by Namgyu Kim <kn...@gmail.com>.

Sorry for the wrong information, Mike.
Tomoko is right; I misread the code.

The user dictionary is independent of the system dictionary. If you supply
user entries, JapaneseTokenizer builds two FSTs, one for the
built-in dictionary and one for the user dictionary, and they are
looked up separately.

Please ignore the following lines in my e-mail.
================================================
Japanese Analyzer does not load dictionaries by default.
...
Since it is a way to create and pass the UserDictionary object, there is no
conflict between user dictionary and system dictionary.
(You may choose only one of them! -> means userFST instance in
JapaneseTokenizer)
=================================================

The system dictionary and the user dictionary are separate, and you can use
both at the same time.

About the system dictionary:
As far as I know, it is not possible to swap in a different system dictionary
through the API.
The code that loads the system dictionary is hard-coded
(TokenInfoDictionary, UnknownDictionary, BinaryDictionary).
If you really need this, could you open a JIRA issue so we can work on it together?

However, you can build a new kuromoji jar with your own dictionary:
1. Modify your dictionary file and rebuild.
  1-1) Install MeCab
  1-2) Install the MeCab dictionary
  1-3) Modify your dictionary file
  1-4) Package it as a tar.gz
2. Change the ipadic artifact in kuromoji/ivy.xml from
<artifact name="ipadic" type=".tar.gz" url="https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz"/>
to
<artifact name="ipadic" type=".tar.gz" url="file:///your/tar path/new_dic.tar.gz"/>
3. Run "ant regenerate" in /your/path/lucene-solr/lucene/analysis/kuromoji
4. Run "ant jar"

I hope this helps.

Warm regards,
Namgyu Kim

On Sun, May 26, 2019 at 9:03 AM, Michael Sokolov <ms...@gmail.com> wrote:

> Thank you for the detailed responses! What Tomoko is saying seems
> consistent with my cursory reading of the code. The reason I asked is
> I have a customer that thinks they want to replace the system
> dictionary, and I am trying to see if that is necessary. It seems as
> if for the most part, we can supply a comprehensive user dictionary
> and it would pretty much take the place of the system dictionary,
> assuming it is a superset (covers at least the original system dict
> tokens), but there is probably no way to "remove" a token that is
> present in the system dictionary (or maybe it can effectively be
> removed by adding it to user dictionary with a high penalty?). I'm not
> sure why one would want to do this removal, just trying to understand
> the design parameters.

Re: JapaneseAnalyzer's system vs user dict

Posted by Michael Sokolov <ms...@gmail.com>.

Thank you for the detailed responses! What Tomoko is saying seems
consistent with my cursory reading of the code. The reason I asked is
I have a customer that thinks they want to replace the system
dictionary, and I am trying to see if that is necessary. It seems as
if for the most part, we can supply a comprehensive user dictionary
and it would pretty much take the place of the system dictionary,
assuming it is a superset (covers at least the original system dict
tokens), but there is probably no way to "remove" a token that is
present in the system dictionary (or maybe it can effectively be
removed by adding it to user dictionary with a high penalty?). I'm not
sure why one would want to do this removal, just trying to understand
the design parameters.

On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
<to...@gmail.com> wrote:
>
> Hi,
>
> > If I provide entries in the user
> > dictionary is it just as if I had included them in the system
> > dictionary? If the same entry occurs in both, do the user dictionary
> > weights supersede those in the system dictionary? Is there some way to
> > suppress entries in the system dict?
>
> User dictionary is independent from the system dictionary. If you give
> the user entries, JapaneseTokenizer builds two FSTs one for the
> built-in dictionary and one for the user dictionary and they are
> retrieved separately.
>
> First the user dictionary is retrieved, and if there are no entries
> matched then the system dictionary is retrieved. So if any entry is
> found in the user dictionary, all possible candidates in the system
> dictionary are ignored (suppressed).
>
> (I think this is kuromoji specific behaviour, the original mecab pos
> tagger retrieves both of the system dictionary and user dictionary and
> compares their weights by performing Viterbi. In fact the behaviour -
> always gives priority to the entries in the user dictionary - is a bit
> too aggressive from the point of view of the consistency of
> tokenization. I do not know why, but there may be some performance
> reasons?)
>
> I think you can easily find the retrieval logic I described here in
> JapaneseTokenizer#parse() method. (Let me know if my understanding is
> not correct.)
>
> Regards,
> Tomoko
>


Re: JapaneseAnalyzer's system vs user dict

Posted by Tomoko Uchida <to...@gmail.com>.

Hi,

> If I provide entries in the user
> dictionary is it just as if I had included them in the system
> dictionary? If the same entry occurs in both, do the user dictionary
> weights supersede those in the system dictionary? Is there some way to
> suppress entries in the system dict?

The user dictionary is independent of the system dictionary. If you supply
user entries, JapaneseTokenizer builds two FSTs, one for the
built-in dictionary and one for the user dictionary, and they are
looked up separately.

The user dictionary is looked up first, and the system dictionary is
consulted only if no user entry matches. So if any entry is
found in the user dictionary, all possible candidates in the system
dictionary are ignored (suppressed).

(I think this is Kuromoji-specific behaviour; the original MeCab POS
tagger looks up both the system dictionary and the user dictionary and
compares their weights when running Viterbi. In fact this behaviour,
always giving priority to entries in the user dictionary, is a bit
too aggressive from the point of view of tokenization consistency.
I do not know why it was designed this way, but there may be performance
reasons.)

I think you can easily find the lookup logic I described here in the
JapaneseTokenizer#parse() method. (Let me know if my understanding is
not correct.)
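
To make the difference concrete, here is a minimal sketch of the two lookup
strategies described above. It is a toy model written for this explanation,
not Lucene code: the class DictLookupSketch, its methods, and the use of
plain sets in place of FSTs are all assumptions made for illustration, and
word costs and the Viterbi search itself are left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the lookup order discussed in this thread (hypothetical, not Lucene code). */
class DictLookupSketch {
    // Plain sets stand in for the two FSTs; real dictionaries also carry costs.
    private final Set<String> userDict = new HashSet<>();
    private final Set<String> systemDict = new HashSet<>();

    void addUser(String surface) { userDict.add(surface); }
    void addSystem(String surface) { systemDict.add(surface); }

    /** Returns every entry of dict that matches text starting at pos, tagged with its source. */
    private static List<String> prefixMatches(Set<String> dict, String source, String text, int pos) {
        List<String> out = new ArrayList<>();
        for (int end = pos + 1; end <= text.length(); end++) {
            String candidate = text.substring(pos, end);
            if (dict.contains(candidate)) {
                out.add(source + ":" + candidate);
            }
        }
        return out;
    }

    /** User-first lookup: the system dictionary is consulted only when no user entry matches at pos. */
    List<String> userFirst(String text, int pos) {
        List<String> user = prefixMatches(userDict, "user", text, pos);
        if (!user.isEmpty()) {
            return user; // any user match suppresses all system candidates at this position
        }
        return prefixMatches(systemDict, "system", text, pos);
    }

    /** Merged lookup: both dictionaries contribute candidates; costs would decide later. */
    List<String> merged(String text, int pos) {
        List<String> all = prefixMatches(userDict, "user", text, pos);
        all.addAll(prefixMatches(systemDict, "system", text, pos));
        return all;
    }
}
```

For example, with 関西国際空港 in the user dictionary and both 関西 and
関西国際空港 in the system dictionary, userFirst("関西国際空港", 0) returns only
the user entry, while merged("関西国際空港", 0) also returns the two system
entries and would leave the choice to the cost comparison.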

Regards,
Tomoko

On Sun, May 26, 2019 at 5:08, 김남규 <kn...@gmail.com> wrote:
>
> Hi, Mike :D
>
> Japanese Analyzer does not load dictionaries by default.
> If you look at the constructor, you can see that it is created as null if
> not set parameters.
> (check testUserDict3() in TestJapaneseAnalyzer.java)
>
> In JapaneseTokenizer,
> =============================================
> if (userDictionary != null) {
>   userFST = userDictionary.getFST();
>   userFSTReader = userFST.getBytesReader();
> } else {
>   userFST = null;
>   userFSTReader = null;
> }
> =============================================
> Since it is a way to create and pass the UserDictionary object, there is no
> conflict between user dictionary and system dictionary.
> (You may choose only one of them! -> means userFST instance in
> JapaneseTokenizer)
>
> About dictionary,
> Lucene has one pre-built dictionary by default since Lucene 3.6.
> You can check it in org.apache.lucene.analysis.ja.dict.
> It called MeCab which uses the Viterbi algorithm.
> In Lucene, Convert MeCab dictionary(in Lucene, some dat files) to FST and
> use
> But it can't satisfy all users.
> Depending on the situation, some user may need a custom dictionary.
> It is also same for Nori(Korean Analyzer) since Lucene 7.4. (The basic
> logic(MeCab + FST) is similar to Japanese Analyzer)
> The original Korean MeCab dictionary size is almost 220MB, but Lucene's
> dictionary size is 24MB.
> If the user needs a dictionary of 100MB size, the user must build and use
> it.
> (Modify MeCab Dictionary -> Training -> Porting to Lucene)
>
> If anyone find some wrong information in my reply, please send a reply with
> the correction.
>
> Thank you,
> Namgyu Kim


Re: JapaneseAnalyzer's system vs user dict

Posted by 김남규 <kn...@gmail.com>.

Hi, Mike :D

Japanese Analyzer does not load dictionaries by default.
If you look at the constructor, you can see that it is set to null if
no parameters are given.
(check testUserDict3() in TestJapaneseAnalyzer.java)

In JapaneseTokenizer,
=============================================
if (userDictionary != null) {
  userFST = userDictionary.getFST();
  userFSTReader = userFST.getBytesReader();
} else {
  userFST = null;
  userFSTReader = null;
}
=============================================
Since it is a way to create and pass the UserDictionary object, there is no
conflict between user dictionary and system dictionary.
(You may choose only one of them! -> means userFST instance in
JapaneseTokenizer)

About the dictionary:
Lucene has shipped one pre-built dictionary by default since Lucene 3.6.
You can find it in org.apache.lucene.analysis.ja.dict.
It comes from MeCab, which uses the Viterbi algorithm.
Lucene converts the MeCab dictionary (stored in Lucene as some .dat files)
to an FST and uses that.
But it can't satisfy all users.
Depending on the situation, some users may need a custom dictionary.
The same is true of Nori (the Korean analyzer), available since Lucene 7.4.
(The basic logic, MeCab + FST, is similar to the Japanese analyzer.)
The original Korean MeCab dictionary is almost 220MB, but Lucene's
dictionary is 24MB.
If a user needs a 100MB dictionary, the user must build it and use it.
(Modify MeCab dictionary -> Training -> Porting to Lucene)

If anyone finds wrong information in my reply, please reply with
the correction.

Thank you,
Namgyu Kim


On Sun, May 26, 2019 at 4:03 AM, Michael Sokolov <ms...@gmail.com> wrote:

> I'm trying to understand the relationship between the system and user
> dictionaries that JapaneseAnalyzer uses. The API allows a user to
> provide a user dictionary; the system one is built in. Are they
> otherwise the same kind of thing? If I provide entries in the user
> dictionary is it just as if I had included them in the system
> dictionary? If the same entry occurs in both, do the user dictionary
> weights supersede those in the system dictionary? Is there some way to
> suppress entries in the system dict?  I hunted for documentation, but
> didn't find answers to these questions, and the code is pretty
> involved, so any pointers would be greatly appreciated.
>
> -Mike