Posted to commits@hivemall.apache.org by my...@apache.org on 2019/04/19 07:20:36 UTC
[incubator-hivemall-site] branch asf-site updated: Updated about PoS option
This is an automated email from the ASF dual-hosted git repository.
myui pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hivemall-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 9d96eed Updated about PoS option
9d96eed is described below
commit 9d96eedcc8b424e09dd6f7bc6c0457b06f29cd7c
Author: Makoto Yui <my...@apache.org>
AuthorDate: Fri Apr 19 16:20:25 2019 +0900
Updated about PoS option
---
userguide/misc/tokenizer.html | 42 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 41 insertions(+), 1 deletion(-)
diff --git a/userguide/misc/tokenizer.html b/userguide/misc/tokenizer.html
index 28827d3..bfcdd3e 100644
--- a/userguide/misc/tokenizer.html
+++ b/userguide/misc/tokenizer.html
@@ -2383,6 +2383,7 @@
<li><a href="#tokenizer-for-english-texts">Tokenizer for English Texts</a></li>
<li><a href="#tokenizer-for-non-english-texts">Tokenizer for Non-English Texts</a><ul>
<li><a href="#japanese-tokenizer">Japanese Tokenizer</a></li>
+<li><a href="#part-of-speech">Part-of-speech</a></li>
<li><a href="#chinese-tokenizer">Chinese Tokenizer</a></li>
</ul>
</li>
@@ -2458,7 +2459,46 @@
<blockquote>
<p>["日本","経済","新聞","関西","国際","空港"]</p>
</blockquote>
+<p>The dictionary SHOULD be accessible over the http/https protocol. It SHOULD also be compressed using gzip with a <code>.gz</code> suffix, because the maximum dictionary size is limited to 32MB and the read timeout is set to 60 sec. In addition, the connection must be established within 10 sec.</p>
+<p>To use HTTP Basic Authentication, specify the URL in the following form: <code>https://user:password@www.sitreurl.com/my_dict.txt.gz</code> (see Sec 3.1 of <a href="https://www.ietf.org/rfc/rfc1738.txt" target="_blank">RFC 1738</a>).</p>
<p>For detailed APIs, please refer to the Javadoc of <a href="https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html" target="_blank">JapaneseAnalyzer</a> as well.</p>
+<h2 id="part-of-speech">Part-of-speech</h2>
+<p>Since Hivemall v0.6.0, the second argument of <code>tokenize_ja</code> also accepts options in the following format:</p>
+<pre><code> -mode &lt;arg&gt; The tokenization mode. One of ['normal', 'search',
+ 'extended', 'default' (normal)]
+ -pos Return part-of-speech information
+</code></pre><p>Then, you can get part-of-speech information as follows:</p>
+<pre><code class="lang-sql">WITH tmp as (
+ <span class="hljs-keyword">select</span>
+ tokenize_ja(<span class="hljs-string">'kuromojiを使った分かち書きのテストです。'</span>,<span class="hljs-string">'-mode search -pos'</span>) <span class="hljs-keyword">as</span> r
+)
+<span class="hljs-keyword">select</span>
+ r.tokens,
+ r.pos,
+ r.tokens[<span class="hljs-number">0</span>] <span class="hljs-keyword">as</span> token0,
+ r.pos[<span class="hljs-number">0</span>] <span class="hljs-keyword">as</span> pos0
+<span class="hljs-keyword">from</span>
+ tmp;
+</code></pre>
+<table>
+<thead>
+<tr>
+<th style="text-align:center">tokens</th>
+<th style="text-align:center">pos</th>
+<th style="text-align:center">token0</th>
+<th style="text-align:center">pos0</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td style="text-align:center">["kuromoji","使う","分かち書き","テスト"]</td>
+<td style="text-align:center">["名詞-一般","動詞-自立","名詞-一般","名詞-サ変接続"]</td>
+<td style="text-align:center">kuromoji</td>
+<td style="text-align:center">名詞-一般</td>
+</tr>
+</tbody>
+</table>
+<p>Note that when the <code>-pos</code> option is specified, <code>tokenize_ja</code> returns a struct record containing <code>array&lt;string&gt; tokens</code> and <code>array&lt;string&gt; pos</code> as its elements.</p>
<h2 id="chinese-tokenizer">Chinese Tokenizer</h2>
<p>Chinese text tokenizer UDF uses <a href="https://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html" target="_blank">SmartChineseAnalyzer</a>. </p>
<p>The signature of the UDF is as follows:</p>
@@ -2526,7 +2566,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda
<script>
var gitbook = gitbook || [];
gitbook.push(function() {
- gitbook.page.hasChanged({"page":{"title":"Text Tokenizer","level":"2.3","depth":1,"next":{"title":"Approximate Aggregate Functions","level":"2.4","depth":1,"path":"misc/approx.md","ref":"misc/approx.md","articles":[]},"previous":{"title":"Efficient Top-K Query Processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters"," [...]
+ gitbook.page.hasChanged({"page":{"title":"Text Tokenizer","level":"2.3","depth":1,"next":{"title":"Approximate Aggregate Functions","level":"2.4","depth":1,"path":"misc/approx.md","ref":"misc/approx.md","articles":[]},"previous":{"title":"Efficient Top-K Query Processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters"," [...]
});
</script>
</div>
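
The dictionary paragraph added in this commit asks for a gzip-compressed file with a `.gz` suffix and cites a 32MB size limit. As a rough sketch of preparing such a file before uploading it to an http/https server, the Python below compresses a dictionary and checks its size. The helper names `gzip_dictionary` and `fits_limit` and the file name `userdict.txt` are hypothetical and not part of Hivemall. The text does not say whether the limit applies to the compressed or the uncompressed bytes, so the demo reports both.

```python
import gzip
import os
import shutil
import tempfile

# Size limit quoted in the documentation text (32MB).
MAX_DICT_BYTES = 32 * 1024 * 1024


def gzip_dictionary(src_path: str) -> str:
    """Write a gzip copy of src_path with a .gz suffix and return its path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


def fits_limit(path: str, limit: int = MAX_DICT_BYTES) -> bool:
    """Return True if the file at path is no larger than limit bytes."""
    return os.path.getsize(path) <= limit


if __name__ == "__main__":
    # Illustrative single-entry dictionary; the entry itself is an assumption.
    workdir = tempfile.mkdtemp()
    raw_path = os.path.join(workdir, "userdict.txt")
    with open(raw_path, "w", encoding="utf-8") as f:
        f.write("関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞\n")

    gz_path = gzip_dictionary(raw_path)
    # Report both sizes, since the text does not say which one the limit covers.
    print("uncompressed within limit:", fits_limit(raw_path))
    print("compressed within limit:", fits_limit(gz_path))
```

If either check fails, the safer choice is to trim the dictionary rather than rely on compression alone, since a large file also risks hitting the 60 sec read timeout mentioned in the same paragraph.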