Posted to dev@lucene.apache.org by "Tomoko Uchida (JIRA)" <ji...@apache.org> on 2019/04/24 10:48:00 UTC
[jira] [Commented] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
[ https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825017#comment-16825017 ]
Tomoko Uchida commented on LUCENE-4056:
---------------------------------------
Hi,
Regarding licensing, UniDic is now distributed under the GPL, LGPL, and BSD 3-Clause licenses. To my knowledge, the last one is compatible with ALv2.
Please see: [https://unidic.ninjal.ac.jp/download] and [https://unidic.ninjal.ac.jp/copying/BSD]
Personally, I would like to use UniDic from Kuromoji, because the dictionary is still maintained by researchers and is more suitable for search purposes than the current search mode based on mecab-ipadic.
If there is a possibility of moving this issue forward, I'd like to help.
> Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
> ------------------------------------------------------------
>
> Key: LUCENE-4056
> URL: https://issues.apache.org/jira/browse/LUCENE-4056
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 3.6
> Environment: Solr 3.6
> UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
> Reporter: Kazuaki Hiraga
> Priority: Major
>
> I tried to build a UniDic dictionary for use with Kuromoji on Solr 3.6. I think UniDic is a better dictionary than the IPA dictionary, so Kuromoji for Lucene/Solr should support the UniDic dictionary as standalone Kuromoji does.
> The following is my procedure:
> I modified build.xml under the lucene/contrib/analyzers/kuromoji directory and ran 'ant build-dict', which produced the error below.
> build-dict:
> [java] dictionary builder
> [java]
> [java] dictionary format: UNIDIC
> [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
> [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
> [java] input encoding: utf-8
> [java] normalize entries: false
> [java]
> [java] building tokeninfo dict...
> [java] parse...
> [java] sort...
> [java] Exception in thread "main" java.lang.AssertionError
> [java] encode...
> [java] at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
> [java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
> [java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
> [java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
> [java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
> And the diff of build.xml:
> ===================================================================
> --- build.xml (revision 1338023)
> +++ build.xml (working copy)
> @@ -28,19 +28,31 @@
> <property name="maven.dist.dir" location="../../../dist/maven" />
>
> <!-- default configuration: uses mecab-ipadic -->
> + <!--
> <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
> <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
> <property name="dict.url" value="http://mecab.googlecode.com/files/${dict.src.file}"/>
> + -->
>
> <!-- alternative configuration: uses mecab-naist-jdic
> <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
> <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
> <property name="dict.url" value="http://sourceforge.jp/frs/redir.php?m=iij&f=/naist-jdic/53500/${dict.src.file}"/>
> -->
> -
> +
> + <!-- alternative configuration: uses UniDic -->
> + <property name="ipadic.version" value="unidic-mecab1312src" />
> + <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
> + <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
> +
> <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
> + <!--
> <property name="dict.encoding" value="euc-jp"/>
> <property name="dict.format" value="ipadic"/>
> + -->
> + <property name="dict.encoding" value="utf-8"/>
> + <property name="dict.format" value="unidic"/>
> +
> <property name="dict.normalize" value="false"/>
> <property name="dict.target.dir" location="./src/resources"/>
>
> @@ -58,7 +70,8 @@
>
> <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
> <target name="download-dict" unless="dict.available">
> - <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
> + <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
> + <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
> <gunzip src="${build.dir}/${dict.src.file}"/>
> <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
> </target>
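The stack trace quoted above shows an AssertionError raised in BinaryDictionaryWriter.put, but it does not say which condition failed. One hypothesis, which is an assumption and not confirmed anywhere in this thread, is that the writer expects context IDs shaped like those in mecab-ipadic (for example, a bounded ID range, or equal left and right IDs per entry). The following is a minimal sketch of a standalone diagnostic for checking that against the UniDic lexicon CSV files. ContextIdCheck and its methods are hypothetical names and are not part of Lucene; the only format assumption is the MeCab lexicon layout of surface, left-id, right-id, cost, followed by feature columns.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic (not part of Lucene): summarizes context IDs in MeCab-format lexicon CSV rows. */
final class ContextIdCheck {

  /** Counters collected while scanning lexicon rows. */
  static final class Stats {
    int rows;            // number of lexicon rows parsed
    int maxLeftId = -1;  // largest left context ID seen
    int maxRightId = -1; // largest right context ID seen
    int mismatched;      // rows where leftId != rightId
  }

  /** Splits one CSV row, honoring double-quoted fields (a surface form may itself be a comma). */
  static List<String> splitCsv(String line) {
    List<String> fields = new ArrayList<>();
    StringBuilder cur = new StringBuilder();
    boolean quoted = false;
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (c == '"') {
        if (quoted && i + 1 < line.length() && line.charAt(i + 1) == '"') {
          cur.append('"'); // doubled quote inside a quoted field
          i++;
        } else {
          quoted = !quoted; // opening or closing quote
        }
      } else if (c == ',' && !quoted) {
        fields.add(cur.toString());
        cur.setLength(0);
      } else {
        cur.append(c);
      }
    }
    fields.add(cur.toString());
    return fields;
  }

  /** Scans rows laid out as surface,leftId,rightId,cost,features... and collects ID statistics. */
  static Stats scan(Iterable<String> lines) {
    Stats s = new Stats();
    for (String line : lines) {
      if (line.isEmpty()) continue;
      List<String> f = splitCsv(line);
      if (f.size() < 4) continue; // too short to be a lexicon row
      int left = Integer.parseInt(f.get(1).trim());
      int right = Integer.parseInt(f.get(2).trim());
      s.rows++;
      s.maxLeftId = Math.max(s.maxLeftId, left);
      s.maxRightId = Math.max(s.maxRightId, right);
      if (left != right) s.mismatched++;
    }
    return s;
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: java ContextIdCheck <lexicon.csv>");
      return;
    }
    // The quoted build used utf-8 as the input encoding, so the same is assumed here.
    Stats s = scan(Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8));
    System.out.printf("rows=%d maxLeftId=%d maxRightId=%d leftId!=rightId=%d%n",
        s.rows, s.maxLeftId, s.maxRightId, s.mismatched);
  }
}
```

Running this against each lexicon CSV in the unpacked unidic-mecab1312src directory and comparing the numbers with the same run on mecab-ipadic would show whether the two dictionaries differ in ID range or in left/right ID symmetry. Whether either difference is what triggers the assertion still has to be confirmed against the source of BinaryDictionaryWriter.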
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)