Posted to commits@hivemall.apache.org by my...@apache.org on 2021/04/23 10:14:00 UTC

[incubator-hivemall-site] branch asf-site updated: Updated tokenizer usages

This is an automated email from the ASF dual-hosted git repository.

myui pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hivemall-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new c13e9d3  Updated tokenizer usages
c13e9d3 is described below

commit c13e9d34cbd85e7e076b2f76c250224ab2dbffdc
Author: Makoto Yui <my...@apache.org>
AuthorDate: Fri Apr 23 19:13:46 2021 +0900

    Updated tokenizer usages
---
 userguide/misc/funcs.html     |  8 ++--
 userguide/misc/tokenizer.html | 98 ++++++++++++++++++++++++++++++++++---------
 2 files changed, 83 insertions(+), 23 deletions(-)

diff --git a/userguide/misc/funcs.html b/userguide/misc/funcs.html
index 2404931..d92bf2f 100644
--- a/userguide/misc/funcs.html
+++ b/userguide/misc/funcs.html
@@ -3414,19 +3414,19 @@ limit 100;
 </li>
 <li><p><code>tokenize_cn(String line [, const list&lt;string&gt; stopWords])</code> - returns tokenized strings in array&lt;string&gt;</p>
 </li>
-<li><p><code>tokenize_ja(String line [, const string mode = &quot;normal&quot;, const array&lt;string&gt; stopWords, const array&lt;string&gt; stopTags, const array&lt;string&gt; userDict (or string userDictURL)</code>]) - returns tokenized strings in array&lt;string&gt;</p>
+<li><p><code>tokenize_ja(String line [, const string mode = &quot;normal&quot;, const array&lt;string&gt; stopWords, const array&lt;string&gt; stopTags, const array&lt;string&gt; userDict (or const string userDictURL)</code>]) - returns tokenized strings in array&lt;string&gt;</p>
 <pre><code class="lang-sql">select tokenize_ja(&quot;kuromoji&#x3092;&#x4F7F;&#x3063;&#x305F;&#x5206;&#x304B;&#x3061;&#x66F8;&#x304D;&#x306E;&#x30C6;&#x30B9;&#x30C8;&#x3067;&#x3059;&#x3002;&#x7B2C;&#x4E8C;&#x5F15;&#x6570;&#x306B;&#x306F;normal/search/extended&#x3092;&#x6307;&#x5B9A;&#x3067;&#x304D;&#x307E;&#x3059;&#x3002;&#x30C7;&#x30D5;&#x30A9;&#x30EB;&#x30C8;&#x3067;&#x306F;normal&#x30E2;&#x30FC;&#x30C9;&#x3067;&#x3059;&#x3002;&quot;);
 
 &gt; [&quot;kuromoji&quot;,&quot;&#x4F7F;&#x3046;&quot;,&quot;&#x5206;&#x304B;&#x3061;&#x66F8;&#x304D;&quot;,&quot;&#x30C6;&#x30B9;&#x30C8;&quot;,&quot;&#x7B2C;&quot;,&quot;&#x4E8C;&quot;,&quot;&#x5F15;&#x6570;&quot;,&quot;normal&quot;,&quot;search&quot;,&quot;extended&quot;,&quot;&#x6307;&#x5B9A;&quot;,&quot;&#x30C7;&#x30D5;&#x30A9;&#x30EB;&#x30C8;&quot;,&quot;normal&quot;,&quot; &#x30E2;&#x30FC;&#x30C9;&quot;]
 </code></pre>
 </li>
-<li><p><code>tokenize_ja_neologd(String line [, const string mode = &quot;normal&quot;, const array&lt;string&gt; stopWords, const array&lt;string&gt; stopTags, const array&lt;string&gt; userDict (or string userDictURL)</code>]) - returns tokenized strings in array&lt;string&gt;</p>
+<li><p><code>tokenize_ja_neologd(String line [, const string mode = &quot;normal&quot;, const array&lt;string&gt; stopWords, const array&lt;string&gt; stopTags, const array&lt;string&gt; userDict (or const string userDictURL)</code>]) - returns tokenized strings in array&lt;string&gt;</p>
 <pre><code class="lang-sql">select tokenize_ja_neologd(&quot;kuromoji&#x3092;&#x4F7F;&#x3063;&#x305F;&#x5206;&#x304B;&#x3061;&#x66F8;&#x304D;&#x306E;&#x30C6;&#x30B9;&#x30C8;&#x3067;&#x3059;&#x3002;&#x7B2C;&#x4E8C;&#x5F15;&#x6570;&#x306B;&#x306F;normal/search/extended&#x3092;&#x6307;&#x5B9A;&#x3067;&#x304D;&#x307E;&#x3059;&#x3002;&#x30C7;&#x30D5;&#x30A9;&#x30EB;&#x30C8;&#x3067;&#x306F;normal&#x30E2;&#x30FC;&#x30C9;&#x3067;&#x3059;&#x3002;&quot;);
 
 &gt; [&quot;kuromoji&quot;,&quot;&#x4F7F;&#x3046;&quot;,&quot;&#x5206;&#x304B;&#x3061;&#x66F8;&#x304D;&quot;,&quot;&#x30C6;&#x30B9;&#x30C8;&quot;,&quot;&#x7B2C;&quot;,&quot;&#x4E8C;&quot;,&quot;&#x5F15;&#x6570;&quot;,&quot;normal&quot;,&quot;search&quot;,&quot;extended&quot;,&quot;&#x6307;&#x5B9A;&quot;,&quot;&#x30C7;&#x30D5;&#x30A9;&#x30EB;&#x30C8;&quot;,&quot;normal&quot;,&quot; &#x30E2;&#x30FC;&#x30C9;&quot;]
 </code></pre>
 </li>
-<li><p><code>tokenize_ko(String line [, const array&lt;string&gt; userDict, const string mode = &quot;discard&quot;, const array&lt;string&gt; stopTags, boolean outputUnknownUnigrams])</code> - returns tokenized strings in array&lt;string&gt;</p>
+<li><p><code>tokenize_ko(String line [, const string mode = &quot;discard&quot; (or const string opts), const array&lt;string&gt; stopWords, const array&lt;string&gt; stopTags, const array&lt;string&gt; userDict (or const string userDictURL)</code>]) - returns tokenized strings in array&lt;string&gt;</p>
 <pre><code class="lang-sql">select tokenize_ko(&quot;&#xC18C;&#xC124; &#xBB34;&#xAD81;&#xD654;&#xAF43;&#xC774; &#xD53C;&#xC5C8;&#xC2B5;&#xB2C8;&#xB2E4;.&quot;);
 
 &gt; [&quot;&#xC18C;&#xC124;&quot;,&quot;&#xBB34;&#xAD81;&quot;,&quot;&#xD654;&quot;,&quot;&#xAF43;&quot;,&quot;&#xD53C;&quot;]
@@ -3505,7 +3505,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda
     <script>
         var gitbook = gitbook || [];
         gitbook.push(function() {
-            gitbook.page.hasChanged({"page":{"title":"List of Functions","level":"1.3","depth":1,"next":{"title":"Tips for Effective Hivemall","level":"1.4","depth":1,"path":"tips/README.md","ref":"tips/README.md","articles":[{"title":"Explicit add_bias() for better prediction","level":"1.4.1","depth":2,"path":"tips/addbias.md","ref":"tips/addbias.md","articles":[]},{"title":"Use rand_amplify() to better prediction results","level":"1.4.2","depth":2,"path":"tips/rand_amplify.md","ref":"t [...]
+            gitbook.page.hasChanged({"page":{"title":"List of Functions","level":"1.3","depth":1,"next":{"title":"Tips for Effective Hivemall","level":"1.4","depth":1,"path":"tips/README.md","ref":"tips/README.md","articles":[{"title":"Explicit add_bias() for better prediction","level":"1.4.1","depth":2,"path":"tips/addbias.md","ref":"tips/addbias.md","articles":[]},{"title":"Use rand_amplify() to better prediction results","level":"1.4.2","depth":2,"path":"tips/rand_amplify.md","ref":"t [...]
         });
     </script>
 </div>
diff --git a/userguide/misc/tokenizer.html b/userguide/misc/tokenizer.html
index ede3604..bf1abf0 100644
--- a/userguide/misc/tokenizer.html
+++ b/userguide/misc/tokenizer.html
@@ -2391,10 +2391,16 @@
 <ul>
 <li><a href="#tokenizer-for-english-texts">Tokenizer for English Texts</a></li>
 <li><a href="#tokenizer-for-non-english-texts">Tokenizer for Non-English Texts</a><ul>
-<li><a href="#japanese-tokenizer">Japanese Tokenizer</a></li>
+<li><a href="#japanese-tokenizer">Japanese Tokenizer</a><ul>
+<li><a href="#custom-dictionary">Custom dictionary</a></li>
 <li><a href="#part-of-speech">Part-of-speech</a></li>
+</ul>
+</li>
 <li><a href="#chinese-tokenizer">Chinese Tokenizer</a></li>
-<li><a href="#korean-tokenizer">Korean Tokenizer</a></li>
+<li><a href="#korean-tokenizer">Korean Tokenizer</a><ul>
+<li><a href="#custom-dictionary-1">Custom dictionary</a></li>
+</ul>
+</li>
 </ul>
 </li>
 </ul>
@@ -2463,6 +2469,7 @@ select tokenize_ja_neologd();
 &#x8A5E;-&#x5F62;&#x5BB9;&#x8A5E;&#x63A5;&#x7D9A;&quot;,&quot;&#x63A5;&#x982D;&#x8A5E;-&#x6570;&#x63A5;&quot;,&quot;&#x672A;&#x77E5;&#x8A9E;&quot;,&quot;&#x8A18;&#x53F7;&quot;,&quot;&#x8A18;&#x53F7;-&#x30A2;&#x30EB;&#x30D5;&#x30A1;&#x30D9;&#x30C3;&#x30C8;&quot;,&quot;&#x8A18;&#x53F7;-&#x4E00;&#x822C;&quot;,&quot;&#x8A18;&#x53F7;-&#x53E5;&#x70B9;&quot;,&quot;&#x8A18;&#x53F7;-&#x62EC;&#x5F27;&#x9589;
 &quot;,&quot;&#x8A18;&#x53F7;-&#x62EC;&#x5F27;&#x958B;&quot;,&quot;&#x8A18;&#x53F7;-&#x7A7A;&#x767D;&quot;,&quot;&#x8A18;&#x53F7;-&#x8AAD;&#x70B9;&quot;,&quot;&#x8A9E;&#x65AD;&#x7247;&quot;,&quot;&#x9023;&#x4F53;&#x8A5E;&quot;,&quot;&#x975E;&#x8A00;&#x8A9E;&#x97F3;&quot;]</p>
 </blockquote>
+<h3 id="custom-dictionary">Custom dictionary</h3>
 <p>Moreover, the fifth argument <code>userDict</code> enables you to register a user-defined custom dictionary in <a href="https://github.com/atilika/kuromoji/blob/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/resources/userdict.txt" target="_blank">Kuromoji official format</a>:</p>
 <pre><code class="lang-sql"><span class="hljs-keyword">select</span> tokenize_ja(<span class="hljs-string">&quot;&#x65E5;&#x672C;&#x7D4C;&#x6E08;&#x65B0;&#x805E;&#xFF06;&#x95A2;&#x897F;&#x56FD;&#x969B;&#x7A7A;&#x6E2F;&quot;</span>, <span class="hljs-string">&quot;normal&quot;</span>, <span class="hljs-literal">null</span>, <span class="hljs-literal">null</span>, 
                    <span class="hljs-built_in">array</span>(
@@ -2482,7 +2489,7 @@ select tokenize_ja_neologd();
 </code></pre>
 <div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>Dictionary SHOULD be accessible through http/https protocol. And, it SHOULD be compressed using gzip with <code>.gz</code> suffix because the maximum dictionary size is limited to 32MB and read timeout is set to 60 sec. Also, connection must be established in 10 sec.</p><p>If you want to use HTTP Basic Authentication, please us [...]
 <p>For detailed APIs, please refer Javadoc of <a href="https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html" target="_blank">JapaneseAnalyzer</a> as well.</p>
-<h2 id="part-of-speech">Part-of-speech</h2>
+<h3 id="part-of-speech">Part-of-speech</h3>
 <p>From Hivemall v0.6.0, the second argument can also accept the following option format:</p>
 <pre><code> -mode &lt;arg&gt;   The tokenization mode. One of [&apos;normal&apos;, &apos;search&apos;,
                &apos;extended&apos;, &apos;default&apos; (normal)]
@@ -2533,12 +2540,27 @@ select tokenize_ja_neologd();
 <h2 id="korean-tokenizer">Korean Tokenizer</h2>
 <p>The Korean tokenizer internally uses <a href="https://lucene.apache.org/core/8_8_2/analyzers-nori/org/apache/lucene/analysis/ko/KoreanAnalyzer.html" target="_blank">lucene-analyzers-nori</a> (the Korean Morphological Analyzer) for tokenization.</p>
 <p>The signature of the UDF is as follows:</p>
-<pre><code class="lang-sql">tokenize_ko(String line [,
-            const array&lt;string&gt; userDict,
-            const string mode = &quot;discard&quot;,
-            const array&lt;string&gt; stopTags,
-            boolean outputUnknownUnigrams
-           ]) - returns tokenized strings in array&lt;string&gt;
+<pre><code class="lang-sql">tokenize_ko(
+       String line [, const string mode = &quot;discard&quot; (or const string opts),
+       const array&lt;string&gt; stopWords,
+       const array&lt;string&gt; stopTags,
+       const array&lt;string&gt; userDict (or const string userDictURL)]
+) - returns tokenized strings in array&lt;string&gt;
+</code></pre>
+<div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>Instead of a mode name, the second argument can also take options starting with <code>-</code>.</p></div></div>
+<p>You can get usage as follows:</p>
+<pre><code class="lang-sql">select tokenize_ko(&quot;&quot;, &quot;-help&quot;);
+
+usage: tokenize_ko(String line [, const string mode = &quot;discard&quot; (or const
+       string opts), const array&lt;string&gt; stopWords, const array&lt;string&gt;
+       stopTags, const array&lt;string&gt; userDict (or const string
+       userDictURL)]) - returns tokenized strings in array&lt;string&gt; [-help]
+       [-mode &lt;arg&gt;] [-outputUnknownUnigrams]
+ -help                    Show function help
+ -mode &lt;arg&gt;              The tokenization mode. One of [&apos;none&apos;, &apos;discard&apos;
+                          (default), &apos;mixed&apos;]
+ -outputUnknownUnigrams   outputs unigrams for unknown words.
 </code></pre>
 <div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>For detailed options, please refer to the <a href="https://lucene.apache.org/core/8_8_2/analyzers-nori/org/apache/lucene/analysis/ko/KoreanAnalyzer.html" target="_blank">Lucene API document</a>. <code>none</code>, <code>discard</code> (default), or <code>mixed</code> are supported for the mode argument.</p></div></div>
 <p>See the following examples for the usage.</p>
@@ -2546,27 +2568,65 @@ select tokenize_ja_neologd();
 select tokenize_ko();
 &gt; 8.8.2
 
-select tokenize_ko(&quot;&#xC18C;&#xC124; &#xBB34;&#xAD81;&#xD654;&#xAF43;&#xC774; &#xD53C;&#xC5C8;&#xC2B5;&#xB2C8;&#xB2E4;.&quot;);
+select tokenize_ko(&apos;&#xC18C;&#xC124; &#xBB34;&#xAD81;&#xD654;&#xAF43;&#xC774; &#xD53C;&#xC5C8;&#xC2B5;&#xB2C8;&#xB2E4;.&apos;);
 &gt; [&quot;&#xC18C;&#xC124;&quot;,&quot;&#xBB34;&#xAD81;&quot;,&quot;&#xD654;&quot;,&quot;&#xAF43;&quot;,&quot;&#xD53C;&quot;]
 
-select tokenize_ko(&quot;&#xC18C;&#xC124; &#xBB34;&#xAD81;&#xD654;&#xAF43;&#xC774; &#xD53C;&#xC5C8;&#xC2B5;&#xB2C8;&#xB2E4;.&quot;, null, &quot;mixed&quot;);
+select tokenize_ko(&apos;&#xC18C;&#xC124; &#xBB34;&#xAD81;&#xD654;&#xAF43;&#xC774; &#xD53C;&#xC5C8;&#xC2B5;&#xB2C8;&#xB2E4;.&apos;, &apos;-mode discard&apos;);
+&gt; [&quot;&#xC18C;&#xC124;&quot;,&quot;&#xBB34;&#xAD81;&quot;,&quot;&#xD654;&quot;,&quot;&#xAF43;&quot;,&quot;&#xD53C;&quot;]
+
+select tokenize_ko(&apos;&#xC18C;&#xC124; &#xBB34;&#xAD81;&#xD654;&#xAF43;&#xC774; &#xD53C;&#xC5C8;&#xC2B5;&#xB2C8;&#xB2E4;.&apos;, &apos;mixed&apos;);
 &gt; [&quot;&#xC18C;&#xC124;&quot;,&quot;&#xBB34;&#xAD81;&#xD654;&quot;,&quot;&#xBB34;&#xAD81;&quot;,&quot;&#xD654;&quot;,&quot;&#xAF43;&quot;,&quot;&#xD53C;&quot;]
 
-select tokenize_ko(&quot;&#xC18C;&#xC124; &#xBB34;&#xAD81;&#xD654;&#xAF43;&#xC774; &#xD53C;&#xC5C8;&#xC2B5;&#xB2C8;&#xB2E4;.&quot;, null, &quot;discard&quot;, array(&quot;E&quot;, &quot;VV&quot;));
-&gt; [&quot;&#xC18C;&#xC124;&quot;,&quot;&#xBB34;&#xAD81;&quot;,&quot;&#xD654;&quot;,&quot;&#xAF43;&quot;,&quot;&#xC774;&quot;]
+select tokenize_ko(&apos;&#xC18C;&#xC124; &#xBB34;&#xAD81;&#xD654;&#xAF43;&#xC774; &#xD53C;&#xC5C8;&#xC2B5;&#xB2C8;&#xB2E4;.&apos;, &apos;-mode mixed&apos;);
+&gt; [&quot;&#xC18C;&#xC124;&quot;,&quot;&#xBB34;&#xAD81;&#xD654;&quot;,&quot;&#xBB34;&#xAD81;&quot;,&quot;&#xD654;&quot;,&quot;&#xAF43;&quot;,&quot;&#xD53C;&quot;]
 
-select tokenize_ko(&quot;Hello, world.&quot;, null, &quot;none&quot;, array(), true);
-&gt; [&quot;h&quot;,&quot;e&quot;,&quot;l&quot;,&quot;l&quot;,&quot;o&quot;,&quot;w&quot;,&quot;o&quot;,&quot;r&quot;,&quot;l&quot;,&quot;d&quot;]
+select tokenize_ko(&apos;&#xC18C;&#xC124; &#xBB34;&#xAD81;&#xD654;&#xAF43;&#xC774; &#xD53C;&#xC5C8;&#xC2B5;&#xB2C8;&#xB2E4;.&apos;, &apos;-mode none&apos;);
+&gt; [&quot;&#xC18C;&#xC124;&quot;,&quot;&#xBB34;&#xAD81;&#xD654;&quot;,&quot;&#xAF43;&quot;,&quot;&#xD53C;&quot;]
 
-select tokenize_ko(&quot;Hello, world.&quot;, null, &quot;none&quot;, array(), false);
+select tokenize_ko(&apos;Hello, world.&apos;, &apos;-mode none&apos;);
 &gt; [&quot;hello&quot;,&quot;world&quot;]
 
-select tokenize_ko(&quot;&#xB098;&#xB294; C++ &#xC5B8;&#xC5B4;&#xB97C; &#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D; &#xC5B8;&#xC5B4;&#xB85C; &#xC0AC;&#xB791;&#xD55C;&#xB2E4;.&quot;, null, &quot;discard&quot;, array());
+select tokenize_ko(&apos;Hello, world.&apos;, &apos;-mode none -outputUnknownUnigrams&apos;);
+&gt; [&quot;h&quot;,&quot;e&quot;,&quot;l&quot;,&quot;l&quot;,&quot;o&quot;,&quot;w&quot;,&quot;o&quot;,&quot;r&quot;,&quot;l&quot;,&quot;d&quot;]
+
+-- default stopword (null), with stoptags
+select tokenize_ko(&apos;&#xC18C;&#xC124; &#xBB34;&#xAD81;&#xD654;&#xAF43;&#xC774; &#xD53C;&#xC5C8;&#xC2B5;&#xB2C8;&#xB2E4;.&apos;, &apos;discard&apos;, null, array(&apos;E&apos;));
+&gt; [&quot;&#xC18C;&#xC124;&quot;,&quot;&#xBB34;&#xAD81;&quot;,&quot;&#xD654;&quot;,&quot;&#xAF43;&quot;,&quot;&#xC774;&quot;,&quot;&#xD53C;&quot;]
+
+select tokenize_ko(&apos;&#xC18C;&#xC124; &#xBB34;&#xAD81;&#xD654;&#xAF43;&#xC774; &#xD53C;&#xC5C8;&#xC2B5;&#xB2C8;&#xB2E4;.&apos;, &apos;discard&apos;, null, array(&apos;E&apos;, &apos;VV&apos;));
+&gt; [&quot;&#xC18C;&#xC124;&quot;,&quot;&#xBB34;&#xAD81;&quot;,&quot;&#xD654;&quot;,&quot;&#xAF43;&quot;,&quot;&#xC774;&quot;]
+
+select tokenize_ko(&apos;&#xB098;&#xB294; C++ &#xC5B8;&#xC5B4;&#xB97C; &#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D; &#xC5B8;&#xC5B4;&#xB85C; &#xC0AC;&#xB791;&#xD55C;&#xB2E4;.&apos;, &apos;-mode discard&apos;);
+&gt; [&quot;&#xB098;&quot;,&quot;c&quot;,&quot;&#xC5B8;&#xC5B4;&quot;,&quot;&#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D;&quot;,&quot;&#xC5B8;&#xC5B4;&quot;,&quot;&#xC0AC;&#xB791;&quot;]
+
+select tokenize_ko(&apos;&#xB098;&#xB294; C++ &#xC5B8;&#xC5B4;&#xB97C; &#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D; &#xC5B8;&#xC5B4;&#xB85C; &#xC0AC;&#xB791;&#xD55C;&#xB2E4;.&apos;, &apos;-mode discard&apos;, array(), null);
 &gt; [&quot;&#xB098;&quot;,&quot;&#xB294;&quot;,&quot;c&quot;,&quot;&#xC5B8;&#xC5B4;&quot;,&quot;&#xB97C;&quot;,&quot;&#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D;&quot;,&quot;&#xC5B8;&#xC5B4;&quot;,&quot;&#xB85C;&quot;,&quot;&#xC0AC;&#xB791;&quot;,&quot;&#xD558;&quot;,&quot;&#x11AB;&#xB2E4;&quot;]
 
-select tokenize_ko(&quot;&#xB098;&#xB294; C++ &#xC5B8;&#xC5B4;&#xB97C; &#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D; &#xC5B8;&#xC5B4;&#xB85C; &#xC0AC;&#xB791;&#xD55C;&#xB2E4;.&quot;, array(&quot;C++&quot;), &quot;discard&quot;, array());
+-- default stopword (null), default stoptags (null)
+select tokenize_ko(&apos;&#xB098;&#xB294; C++ &#xC5B8;&#xC5B4;&#xB97C; &#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D; &#xC5B8;&#xC5B4;&#xB85C; &#xC0AC;&#xB791;&#xD55C;&#xB2E4;.&apos;, &apos;-mode discard&apos;);
+select tokenize_ko(&apos;&#xB098;&#xB294; C++ &#xC5B8;&#xC5B4;&#xB97C; &#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D; &#xC5B8;&#xC5B4;&#xB85C; &#xC0AC;&#xB791;&#xD55C;&#xB2E4;.&apos;, &apos;-mode discard&apos;, null, null);
+&gt; [&quot;&#xB098;&quot;,&quot;c&quot;,&quot;&#xC5B8;&#xC5B4;&quot;,&quot;&#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D;&quot;,&quot;&#xC5B8;&#xC5B4;&quot;,&quot;&#xC0AC;&#xB791;&quot;]
+
+-- no stopword (empty array), default stoptags (null)
+select tokenize_ko(&apos;&#xB098;&#xB294; C++ &#xC5B8;&#xC5B4;&#xB97C; &#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D; &#xC5B8;&#xC5B4;&#xB85C; &#xC0AC;&#xB791;&#xD55C;&#xB2E4;.&apos;, &apos;-mode discard&apos;, array());
+select tokenize_ko(&apos;&#xB098;&#xB294; C++ &#xC5B8;&#xC5B4;&#xB97C; &#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D; &#xC5B8;&#xC5B4;&#xB85C; &#xC0AC;&#xB791;&#xD55C;&#xB2E4;.&apos;, &apos;-mode discard&apos;, array(), null);
+&gt; [&quot;&#xB098;&quot;,&quot;c&quot;,&quot;&#xC5B8;&#xC5B4;&quot;,&quot;&#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D;&quot;,&quot;&#xC5B8;&#xC5B4;&quot;,&quot;&#xC0AC;&#xB791;&quot;]
+
+-- no stopword (empty array), no stoptags (empty array), custom dict
+select tokenize_ko(&apos;&#xB098;&#xB294; C++ &#xC5B8;&#xC5B4;&#xB97C; &#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D; &#xC5B8;&#xC5B4;&#xB85C; &#xC0AC;&#xB791;&#xD55C;&#xB2E4;.&apos;, &apos;-mode discard&apos;, array(), array(), array(&apos;C++&apos;));
 &gt; [&quot;&#xB098;&quot;,&quot;&#xB294;&quot;,&quot;c++&quot;,&quot;&#xC5B8;&#xC5B4;&quot;,&quot;&#xB97C;&quot;,&quot;&#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D;&quot;,&quot;&#xC5B8;&#xC5B4;&quot;,&quot;&#xB85C;&quot;,&quot;&#xC0AC;&#xB791;&quot;,&quot;&#xD558;&quot;,&quot;&#x11AB;&#xB2E4;&quot;]
+
+-- default stopword (null), default stoptags (null), custom dict
+select tokenize_ko(&apos;&#xB098;&#xB294; C++ &#xC5B8;&#xC5B4;&#xB97C; &#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D; &#xC5B8;&#xC5B4;&#xB85C; &#xC0AC;&#xB791;&#xD55C;&#xB2E4;.&apos;, &apos;-mode discard&apos;, null, null, array(&apos;C++&apos;));
+&gt; [&quot;&#xB098;&quot;,&quot;c++&quot;,&quot;&#xC5B8;&#xC5B4;&quot;,&quot;&#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D;&quot;,&quot;&#xC5B8;&#xC5B4;&quot;,&quot;&#xC0AC;&#xB791;&quot;]
+</code></pre>
+<h3 id="custom-dictionary">Custom dictionary</h3>
+<p>Moreover, the fifth argument <code>userDictURL</code> enables you to register a user-defined custom dictionary hosted on an http/https-accessible external site. The dictionary format is described in <a href="https://raw.githubusercontent.com/apache/lucene/main/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt" target="_blank">Lucene&apos;s example user dictionary</a>.</p>
+<pre><code class="lang-sql">select tokenize_ko(&apos;&#xB098;&#xB294; c++ &#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D;&#xC744; &#xC990;&#xAE34;&#xB2E4;.&apos;, &apos;-mode discard&apos;, null, null, &apos;https://raw.githubusercontent.com/apache/lucene/main/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt&apos;);
+
+&gt; [&quot;&#xB098;&quot;,&quot;c++&quot;,&quot;&#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D;&quot;,&quot;&#xC990;&#xAE30;&quot;]
 </code></pre>
+<div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>The dictionary SHOULD be accessible via the http/https protocol, and it SHOULD be gzip-compressed with a <code>.gz</code> suffix because the maximum dictionary size is limited to 32MB. The read timeout is set to 60 sec, and the connection must be established within 10 sec.</p></div></div>
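+<p>As a sketch, a gzip-compressed dictionary is referenced the same way, just pointing at a <code>.gz</code> URL (the URL below is a hypothetical placeholder, not a real file):</p>
+<pre><code class="lang-sql">-- hypothetical URL for illustration only; the hosted file must be under 32MB
+select tokenize_ko(&apos;&#xB098;&#xB294; c++ &#xD504;&#xB85C;&#xADF8;&#xB798;&#xBC0D;&#xC744; &#xC990;&#xAE34;&#xB2E4;.&apos;, &apos;-mode discard&apos;, null, null,
+       &apos;https://example.com/userdict.txt.gz&apos;);
+</code></pre>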
 <p><div id="page-footer" class="localized-footer"><hr><!--
   Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
@@ -2622,7 +2682,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda
     <script>
         var gitbook = gitbook || [];
         gitbook.push(function() {
-            gitbook.page.hasChanged({"page":{"title":"Text Tokenizer","level":"2.3","depth":1,"next":{"title":"Approximate Aggregate Functions","level":"2.4","depth":1,"path":"misc/approx.md","ref":"misc/approx.md","articles":[]},"previous":{"title":"Efficient Top-K Query Processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","etoc","callouts","toggle-chapters","anchorjs", [...]
+            gitbook.page.hasChanged({"page":{"title":"Text Tokenizer","level":"2.3","depth":1,"next":{"title":"Approximate Aggregate Functions","level":"2.4","depth":1,"path":"misc/approx.md","ref":"misc/approx.md","articles":[]},"previous":{"title":"Efficient Top-K Query Processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","etoc","callouts","toggle-chapters","anchorjs", [...]
         });
     </script>
 </div>