Posted to commits@hivemall.apache.org by my...@apache.org on 2021/04/23 00:54:59 UTC
[incubator-hivemall-site] branch asf-site updated: Add description about Korean tokenizer
This is an automated email from the ASF dual-hosted git repository.
myui pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hivemall-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 5073e89 Add description about Korean tokenizer
5073e89 is described below
commit 5073e89ebcfc1db091283313264b1c76d8723cc7
Author: Makoto Yui <my...@apache.org>
AuthorDate: Fri Apr 23 09:54:49 2021 +0900
Add description about Korean tokenizer
---
userguide/misc/funcs.html | 8 +++++++-
userguide/misc/tokenizer.html | 47 ++++++++++++++++++++++++++++++++++++++-----
2 files changed, 49 insertions(+), 6 deletions(-)
diff --git a/userguide/misc/funcs.html b/userguide/misc/funcs.html
index 42a674f..2404931 100644
--- a/userguide/misc/funcs.html
+++ b/userguide/misc/funcs.html
@@ -3426,6 +3426,12 @@ limit 100;
> ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal"," モード"]
</code></pre>
</li>
+<li><p><code>tokenize_ko(String line [, const array<string> userDict, const string mode = "discard", const array<string> stopTags, boolean outputUnknownUnigrams])</code> - returns tokenized strings in array<string></p>
+<pre><code class="lang-sql">select tokenize_ko("소설 무궁화꽃이 피었습니다.");
+
+> ["소설","무궁","화","꽃","피"]
+</code></pre>
+</li>
</ul>
<h1 id="others">Others</h1>
<ul>
@@ -3499,7 +3505,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda
<script>
var gitbook = gitbook || [];
gitbook.push(function() {
- gitbook.page.hasChanged({"page":{"title":"List of Functions","level":"1.3","depth":1,"next":{"title":"Tips for Effective Hivemall","level":"1.4","depth":1,"path":"tips/README.md","ref":"tips/README.md","articles":[{"title":"Explicit add_bias() for better prediction","level":"1.4.1","depth":2,"path":"tips/addbias.md","ref":"tips/addbias.md","articles":[]},{"title":"Use rand_amplify() to better prediction results","level":"1.4.2","depth":2,"path":"tips/rand_amplify.md","ref":"t [...]
+ gitbook.page.hasChanged({"page":{"title":"List of Functions","level":"1.3","depth":1,"next":{"title":"Tips for Effective Hivemall","level":"1.4","depth":1,"path":"tips/README.md","ref":"tips/README.md","articles":[{"title":"Explicit add_bias() for better prediction","level":"1.4.1","depth":2,"path":"tips/addbias.md","ref":"tips/addbias.md","articles":[]},{"title":"Use rand_amplify() to better prediction results","level":"1.4.2","depth":2,"path":"tips/rand_amplify.md","ref":"t [...]
});
</script>
</div>
diff --git a/userguide/misc/tokenizer.html b/userguide/misc/tokenizer.html
index 8c29133..ede3604 100644
--- a/userguide/misc/tokenizer.html
+++ b/userguide/misc/tokenizer.html
@@ -2394,6 +2394,7 @@
<li><a href="#japanese-tokenizer">Japanese Tokenizer</a></li>
<li><a href="#part-of-speech">Part-of-speech</a></li>
<li><a href="#chinese-tokenizer">Chinese Tokenizer</a></li>
+<li><a href="#korean-tokenizer">Korean Tokenizer</a></li>
</ul>
</li>
</ul>
@@ -2524,12 +2525,48 @@ select tokenize_ja_neologd();
<pre><code class="lang-sql">tokenize_cn(string line, optional const array<string> stopWords)
</code></pre>
<p>Its basic usage is as follows:</p>
-<pre><code class="lang-sql"><span class="hljs-keyword">select</span> tokenize_cn(<span class="hljs-string">"Smartcn为Apache2.0协议的开源中文分词系统,Java语言编写,修改的中科院计算所ICTCLAS分词系统。"</span>);
+<pre><code class="lang-sql">select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统,Java语言编写,修改的中科院计算所ICTCLAS分词系统。");
+
+> [smartcn, 为, apach, 2, 0, 协议, 的, 开源, 中文, 分词, 系统, java, 语言, 编写, 修改, 的, 中科院, 计算, 所, ictcla, 分词, 系统]
</code></pre>
-<blockquote>
-<p>[smartcn, 为, apach, 2, 0, 协议, 的, 开源, 中文, 分词, 系统, java, 语言, 编写, 修改, 的, 中科院, 计算, 所, ictcla, 分词, 系统]</p>
-</blockquote>
+<p>For detailed APIs, please refer to the Javadoc of <a href="https://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html" target="_blank">SmartChineseAnalyzer</a> as well.</p>
+<h2 id="korean-tokenizer">Korean Tokenizer</h2>
+<p>The Korean tokenizer internally uses <a href="https://lucene.apache.org/core/8_8_2/analyzers-nori/" target="_blank">lucene-analyzers-nori</a> (a Korean morphological analyzer) for tokenization.</p>
+<p>The signature of the UDF is as follows:</p>
+<pre><code class="lang-sql">tokenize_ko(String line [,
+ const array<string> userDict,
+ const string mode = "discard",
+ const array<string> stopTags,
+ boolean outputUnknownUnigrams
+ ]) - returns tokenized strings in array<string>
+</code></pre>
+<div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>For detailed options, please refer to the <a href="https://lucene.apache.org/core/8_8_2/analyzers-nori/org/apache/lucene/analysis/ko/KoreanAnalyzer.html" target="_blank">Lucene API document</a>. <code>none</code>, <code>discard</code> (default), or <code>mixed</code> are supported for the mode argument.</p></div></div>
+<p>See the following examples for usage.</p>
+<pre><code class="lang-sql">-- show version of lucene-analyzers-nori
+select tokenize_ko();
+> 8.8.2
+
+select tokenize_ko("소설 무궁화꽃이 피었습니다.");
+> ["소설","무궁","화","꽃","피"]
+
+select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "mixed");
+> ["소설","무궁화","무궁","화","꽃","피"]
+
+select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "discard", array("E", "VV"));
+> ["소설","무궁","화","꽃","이"]
+
+select tokenize_ko("Hello, world.", null, "none", array(), true);
+> ["h","e","l","l","o","w","o","r","l","d"]
+
+select tokenize_ko("Hello, world.", null, "none", array(), false);
+> ["hello","world"]
+
+select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", null, "discard", array());
+> ["나","는","c","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]
+
+select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", array("C++"), "discard", array());
+> ["나","는","c++","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]
+</code></pre>
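As a quick illustration beyond the commit above, `tokenize_ko` composes with standard Hive constructs; a common pattern is a word-count query via `LATERAL VIEW explode`. This is a sketch only: the `docs` table and its `contents` column are hypothetical, and it assumes the `tokenize_ko` UDF shown above is registered in the session.

```sql
-- Hypothetical word-count over a docs(contents STRING) table,
-- assuming the Hivemall tokenize_ko UDF is registered.
SELECT word, COUNT(*) AS freq
FROM docs
LATERAL VIEW explode(tokenize_ko(contents)) t AS word
GROUP BY word
ORDER BY freq DESC
LIMIT 10;
```

Each array of tokens returned by `tokenize_ko` is flattened into one row per token by `explode`, after which the aggregation is ordinary SQL.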
<p><div id="page-footer" class="localized-footer"><hr><!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -2585,7 +2622,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda
<script>
var gitbook = gitbook || [];
gitbook.push(function() {
- gitbook.page.hasChanged({"page":{"title":"Text Tokenizer","level":"2.3","depth":1,"next":{"title":"Approximate Aggregate Functions","level":"2.4","depth":1,"path":"misc/approx.md","ref":"misc/approx.md","articles":[]},"previous":{"title":"Efficient Top-K Query Processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","etoc","callouts","toggle-chapters","anchorjs", [...]
+ gitbook.page.hasChanged({"page":{"title":"Text Tokenizer","level":"2.3","depth":1,"next":{"title":"Approximate Aggregate Functions","level":"2.4","depth":1,"path":"misc/approx.md","ref":"misc/approx.md","articles":[]},"previous":{"title":"Efficient Top-K Query Processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","etoc","callouts","toggle-chapters","anchorjs", [...]
});
</script>
</div>