Posted to commits@hivemall.apache.org by my...@apache.org on 2021/04/23 10:14:00 UTC
[incubator-hivemall-site] branch asf-site updated: Updated tokenizer usages
This is an automated email from the ASF dual-hosted git repository.
myui pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hivemall-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new c13e9d3 Updated tokenizer usages
c13e9d3 is described below
commit c13e9d34cbd85e7e076b2f76c250224ab2dbffdc
Author: Makoto Yui <my...@apache.org>
AuthorDate: Fri Apr 23 19:13:46 2021 +0900
Updated tokenizer usages
---
userguide/misc/funcs.html | 8 ++--
userguide/misc/tokenizer.html | 98 ++++++++++++++++++++++++++++++++++---------
2 files changed, 83 insertions(+), 23 deletions(-)
diff --git a/userguide/misc/funcs.html b/userguide/misc/funcs.html
index 2404931..d92bf2f 100644
--- a/userguide/misc/funcs.html
+++ b/userguide/misc/funcs.html
@@ -3414,19 +3414,19 @@ limit 100;
</li>
<li><p><code>tokenize_cn(String line [, const list<string> stopWords])</code> - returns tokenized strings in array<string></p>
</li>
-<li><p><code>tokenize_ja(String line [, const string mode = "normal", const array<string> stopWords, const array<string> stopTags, const array<string> userDict (or string userDictURL)</code>]) - returns tokenized strings in array<string></p>
+<li><p><code>tokenize_ja(String line [, const string mode = "normal", const array<string> stopWords, const array<string> stopTags, const array<string> userDict (or const string userDictURL)</code>]) - returns tokenized strings in array<string></p>
<pre><code class="lang-sql">select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
> ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal"," モード"]
</code></pre>
</li>
-<li><p><code>tokenize_ja_neologd(String line [, const string mode = "normal", const array<string> stopWords, const array<string> stopTags, const array<string> userDict (or string userDictURL)</code>]) - returns tokenized strings in array<string></p>
+<li><p><code>tokenize_ja_neologd(String line [, const string mode = "normal", const array<string> stopWords, const array<string> stopTags, const array<string> userDict (or const string userDictURL)</code>]) - returns tokenized strings in array<string></p>
<pre><code class="lang-sql">select tokenize_ja_neologd("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
> ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal"," モード"]
</code></pre>
</li>
-<li><p><code>tokenize_ko(String line [, const array<string> userDict, const string mode = "discard", const array<string> stopTags, boolean outputUnknownUnigrams])</code> - returns tokenized strings in array<string></p>
+<li><p><code>tokenize_ko(String line [, const string mode = "discard" (or const string opts)</code>, const array<string> stopWords, const array<string> stopTags, const array<string> userDict (or const string userDictURL)]) - returns tokenized strings in array<string></p>
<pre><code class="lang-sql">select tokenize_ko("소설 무궁화꽃이 피었습니다.");
> ["소설","무궁","화","꽃","피"]
@@ -3505,7 +3505,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda
<script>
var gitbook = gitbook || [];
gitbook.push(function() {
- gitbook.page.hasChanged({"page":{"title":"List of Functions","level":"1.3","depth":1,"next":{"title":"Tips for Effective Hivemall","level":"1.4","depth":1,"path":"tips/README.md","ref":"tips/README.md","articles":[{"title":"Explicit add_bias() for better prediction","level":"1.4.1","depth":2,"path":"tips/addbias.md","ref":"tips/addbias.md","articles":[]},{"title":"Use rand_amplify() to better prediction results","level":"1.4.2","depth":2,"path":"tips/rand_amplify.md","ref":"t [...]
+ gitbook.page.hasChanged({"page":{"title":"List of Functions","level":"1.3","depth":1,"next":{"title":"Tips for Effective Hivemall","level":"1.4","depth":1,"path":"tips/README.md","ref":"tips/README.md","articles":[{"title":"Explicit add_bias() for better prediction","level":"1.4.1","depth":2,"path":"tips/addbias.md","ref":"tips/addbias.md","articles":[]},{"title":"Use rand_amplify() to better prediction results","level":"1.4.2","depth":2,"path":"tips/rand_amplify.md","ref":"t [...]
});
</script>
</div>
diff --git a/userguide/misc/tokenizer.html b/userguide/misc/tokenizer.html
index ede3604..bf1abf0 100644
--- a/userguide/misc/tokenizer.html
+++ b/userguide/misc/tokenizer.html
@@ -2391,10 +2391,16 @@
<ul>
<li><a href="#tokenizer-for-english-texts">Tokenizer for English Texts</a></li>
<li><a href="#tokenizer-for-non-english-texts">Tokenizer for Non-English Texts</a><ul>
-<li><a href="#japanese-tokenizer">Japanese Tokenizer</a></li>
+<li><a href="#japanese-tokenizer">Japanese Tokenizer</a><ul>
+<li><a href="#custom-dictionary">Custom dictionary</a></li>
<li><a href="#part-of-speech">Part-of-speech</a></li>
+</ul>
+</li>
<li><a href="#chinese-tokenizer">Chinese Tokenizer</a></li>
-<li><a href="#korean-tokenizer">Korean Tokenizer</a></li>
+<li><a href="#korean-tokenizer">Korean Tokenizer</a><ul>
+<li><a href="#custom-dictionary-1">Custom dictionary</a></li>
+</ul>
+</li>
</ul>
</li>
</ul>
@@ -2463,6 +2469,7 @@ select tokenize_ja_neologd();
詞-形容詞接続","接頭詞-数接","未知語","記号","記号-アルファベット","記号-一般","記号-句点","記号-括弧閉
","記号-括弧開","記号-空白","記号-読点","語断片","連体詞","非言語音"]</p>
</blockquote>
+<h3 id="custom-dictionary">Custom dictionary</h3>
<p>Moreover, the fifth argument <code>userDict</code> enables you to register a user-defined custom dictionary in <a href="https://github.com/atilika/kuromoji/blob/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/resources/userdict.txt" target="_blank">Kuromoji official format</a>:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">select</span> tokenize_ja(<span class="hljs-string">"日本経済新聞&関西国際空港"</span>, <span class="hljs-string">"normal"</span>, <span class="hljs-literal">null</span>, <span class="hljs-literal">null</span>,
<span class="hljs-built_in">array</span>(
@@ -2482,7 +2489,7 @@ select tokenize_ja_neologd();
</code></pre>
<div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>The dictionary SHOULD be accessible via the http/https protocol, and it SHOULD be gzip-compressed with a <code>.gz</code> suffix because the maximum dictionary size is limited to 32MB, the read timeout is 60 sec, and the connection must be established within 10 sec.</p><p>If you want to use HTTP Basic Authentication, please us [...]
<p>For detailed APIs, please refer to the Javadoc of <a href="https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html" target="_blank">JapaneseAnalyzer</a> as well.</p>
-<h2 id="part-of-speech">Part-of-speech</h2>
+<h3 id="part-of-speech">Part-of-speech</h3>
<p>From Hivemall v0.6.0, the second argument can also accept the following option format:</p>
<pre><code> -mode <arg> The tokenization mode. One of ['normal', 'search',
'extended', 'default' (normal)]
@@ -2533,12 +2540,27 @@ select tokenize_ja_neologd();
<h2 id="korean-tokenizer">Korean Tokenizer</h2>
<p>The Korean tokenizer internally uses <a href="analyzers-nori: Korean Morphological Analyzer" target="_blank">lucene-analyzers-nori</a> for tokenization.</p>
<p>The signature of the UDF is as follows:</p>
-<pre><code class="lang-sql">tokenize_ko(String line [,
- const array<string> userDict,
- const string mode = "discard",
- const array<string> stopTags,
- boolean outputUnknownUnigrams
- ]) - returns tokenized strings in array<string>
+<pre><code class="lang-sql">tokenize_ko(
+ String line [, const string mode = "discard" (or const string opts),
+ const array<string> stopWords,
+ const array<string> stopTags,
+ const array<string> userDict (or const string userDictURL)]
+) - returns tokenized strings in array<string>
+</code></pre>
+<div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>Instead of a mode string, the 2nd argument can also take an option string in which each option starts with <code>-</code>.</p></div></div>
+<p>You can get usage as follows:</p>
+<pre><code class="lang-sql">select tokenize_ko("", "-help");
+
+usage: tokenize_ko(String line [, const string mode = "discard" (or const
+ string opts), const array<string> stopWords, const array<string>
+ stopTags, const array<string> userDict (or const string
+ userDictURL)]) - returns tokenized strings in array<string> [-help]
+ [-mode <arg>] [-outputUnknownUnigrams]
+ -help Show function help
+ -mode <arg> The tokenization mode. One of ['none', 'discard'
+ (default), 'mixed']
+ -outputUnknownUnigrams outputs unigrams for unknown words.
</code></pre>
<div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>For detailed options, please refer to the <a href="https://lucene.apache.org/core/8_8_2/analyzers-nori/org/apache/lucene/analysis/ko/KoreanAnalyzer.html" target="_blank">Lucene API documentation</a>. <code>none</code>, <code>discard</code> (default), or <code>mixed</code> are supported for the mode argument.</p></div></div>
<p>See the following examples for the usage.</p>
@@ -2546,27 +2568,65 @@ select tokenize_ja_neologd();
select tokenize_ko();
> 8.8.2
-select tokenize_ko("소설 무궁화꽃이 피었습니다.");
+select tokenize_ko('소설 무궁화꽃이 피었습니다.');
> ["소설","무궁","화","꽃","피"]
-select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "mixed");
+select tokenize_ko('소설 무궁화꽃이 피었습니다.', '-mode discard');
+> ["소설","무궁","화","꽃","피"]
+
+select tokenize_ko('소설 무궁화꽃이 피었습니다.', 'mixed');
> ["소설","무궁화","무궁","화","꽃","피"]
-select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "discard", array("E", "VV"));
-> ["소설","무궁","화","꽃","이"]
+select tokenize_ko('소설 무궁화꽃이 피었습니다.', '-mode mixed');
+> ["소설","무궁화","무궁","화","꽃","피"]
-select tokenize_ko("Hello, world.", null, "none", array(), true);
-> ["h","e","l","l","o","w","o","r","l","d"]
+select tokenize_ko('소설 무궁화꽃이 피었습니다.', '-mode none');
+> ["소설","무궁화","꽃","피"]
-select tokenize_ko("Hello, world.", null, "none", array(), false);
+select tokenize_ko('Hello, world.', '-mode none');
> ["hello","world"]
-select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", null, "discard", array());
+select tokenize_ko('Hello, world.', '-mode none -outputUnknownUnigrams');
+> ["h","e","l","l","o","w","o","r","l","d"]
+
+-- default stopwords (null), with stoptags
+select tokenize_ko('소설 무궁화꽃이 피었습니다.', 'discard', null, array('E'));
+> ["소설","무궁","화","꽃","이","피"]
+
+select tokenize_ko('소설 무궁화꽃이 피었습니다.', 'discard', null, array('E', 'VV'));
+> ["소설","무궁","화","꽃","이"]
+
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard');
+> ["나","c","언어","프로그래밍","언어","사랑"]
+
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', array(), null);
> ["나","는","c","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]
-select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", array("C++"), "discard", array());
+-- default stopwords (null), default stoptags (null)
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard');
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', null, null);
+> ["나","c","언어","프로그래밍","언어","사랑"]
+
+-- no stopwords (empty array), default stoptags (null)
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', array());
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', array(), null);
+> ["나","c","언어","프로그래밍","언어","사랑"]
+
+-- no stopwords (empty array), no stoptags (empty array), custom dict
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', array(), array(), array('C++'));
> ["나","는","c++","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]
+
+-- default stopwords (null), default stoptags (null), custom dict
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', null, null, array('C++'));
+> ["나","c++","언어","프로그래밍","언어","사랑"]
+</code></pre>
+<h3 id="custom-dictionary">Custom dictionary</h3>
+<p>Moreover, the fifth argument <code>userDictURL</code> enables you to register a user-defined custom dictionary hosted on an http/https-accessible external site. The dictionary format follows <a href="https://raw.githubusercontent.com/apache/lucene/main/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt" target="_blank">Lucene's userdict.txt format</a>.</p>
+<pre><code class="lang-sql">select tokenize_ko('나는 c++ 프로그래밍을 즐긴다.', '-mode discard', null, null, 'https://raw.githubusercontent.com/apache/lucene/main/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt');
+
+> ["나","c++","프로그래밍","즐기"]
</code></pre>
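+<p>Building on the example above, <code>userDictURL</code> can be combined with explicit <code>stopTags</code>. The following is a sketch; the gzip-compressed dictionary URL is a hypothetical placeholder:</p>
+<pre><code class="lang-sql">-- sketch: combine a hosted user dictionary (hypothetical URL) with stoptags
+select tokenize_ko(
+  '나는 c++ 프로그래밍을 즐긴다.',
+  '-mode discard',
+  null,              -- default stopwords
+  array('E', 'VV'),  -- drop verbal endings and verbs
+  'https://example.com/userdict.txt.gz'
+);
+</code></pre>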
+<div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>The dictionary SHOULD be accessible via the http/https protocol, and it SHOULD be gzip-compressed with a <code>.gz</code> suffix because the maximum dictionary size is limited to 32MB, the read timeout is 60 sec, and the connection must be established within 10 sec.</p></div></div>
<p><div id="page-footer" class="localized-footer"><hr><!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -2622,7 +2682,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda
<script>
var gitbook = gitbook || [];
gitbook.push(function() {
- gitbook.page.hasChanged({"page":{"title":"Text Tokenizer","level":"2.3","depth":1,"next":{"title":"Approximate Aggregate Functions","level":"2.4","depth":1,"path":"misc/approx.md","ref":"misc/approx.md","articles":[]},"previous":{"title":"Efficient Top-K Query Processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","etoc","callouts","toggle-chapters","anchorjs", [...]
+ gitbook.page.hasChanged({"page":{"title":"Text Tokenizer","level":"2.3","depth":1,"next":{"title":"Approximate Aggregate Functions","level":"2.4","depth":1,"path":"misc/approx.md","ref":"misc/approx.md","articles":[]},"previous":{"title":"Efficient Top-K Query Processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","etoc","callouts","toggle-chapters","anchorjs", [...]
});
</script>
</div>