Posted to issues@hivemall.apache.org by partyyoung <gi...@git.apache.org> on 2017/06/30 02:34:28 UTC

[GitHub] incubator-hivemall pull request #91: [HIVEMALL-122] Added tokenize_cn UDF

GitHub user partyyoung opened a pull request:

    https://github.com/apache/incubator-hivemall/pull/91

    [HIVEMALL-122] Added tokenize_cn UDF

    ## What changes were proposed in this pull request?
    
    Added tokenize_cn UDF to support word segmentation for Simplified Chinese text
    
    ## What type of PR is it?
    
    Feature
    
    ## What is the Jira issue?
    
    https://issues.apache.org/jira/browse/HIVEMALL-122
    
    ## How was this patch tested?
    
    unit tests, manual tests
    
    ## How to use this feature?
    
    udf


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/partyyoung/incubator-hivemall master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hivemall/pull/91.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #91
    
----
commit 5eb80373641920428d9f95f54b726995e89e8443
Author: partyyoung <pa...@126.com>
Date:   2017-06-29T10:32:05Z

    Added tokenize_cn UDF, using org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hivemall pull request #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/91#discussion_r125118143
  
    --- Diff: nlp/src/main/java/hivemall/nlp/tokenizer/SmartcnUDF.java ---
    @@ -0,0 +1,137 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.nlp.tokenizer;
    +
    +import hivemall.utils.hadoop.HiveUtils;
    +import hivemall.utils.io.IOUtils;
    +
    +import java.io.IOException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.List;
    +
    +import javax.annotation.Nonnull;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    +import org.apache.hadoop.io.Text;
    +import org.apache.lucene.analysis.TokenStream;
    +import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    +import org.apache.lucene.analysis.util.CharArraySet;
    +
    +@Description(
    +        name = "tokenize_cn",
    +        value = "_FUNC_(String line [, const list<string> stopWords])"
    +                + " - returns tokenized strings in array<string>")
    +@UDFType(deterministic = true, stateful = false)
    +public final class SmartcnUDF extends GenericUDF {
    +
    +    private String[] _stopWordsArray;
    +
    +    private transient SmartChineseAnalyzer _analyzer;
    +
    +    @Override
    +    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    +        final int arglen = arguments.length;
    +        if (arglen < 1 || arglen > 2) {
    +            throw new UDFArgumentException("Invalid number of arguments for `tokenize_cn`: "
    +                    + arglen);
    +        }
    +
    +        this._stopWordsArray = (arglen >= 2) ? HiveUtils.getConstStringArray(arguments[1]) : null;
    +        this._analyzer = null;
    +
    +        return ObjectInspectorFactory.getStandardListObjectInspector(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
    +    }
    +
    +    @Override
    +    public List<Text> evaluate(DeferredObject[] arguments) throws HiveException {
    +        SmartChineseAnalyzer analyzer = _analyzer;
    +        if (analyzer == null) {
    +			CharArraySet stopwords = stopWords(_stopWordsArray);
    +            analyzer= new SmartChineseAnalyzer(stopwords);
    --- End diff --
    
    missing space before `=`



[GitHub] incubator-hivemall pull request #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/91#discussion_r125152035
  
    --- Diff: docs/gitbook/misc/tokenizer.md ---
    @@ -46,4 +46,25 @@ select tokenize_ja("kuromojiを使った分かち書きのテストです。第
     ```
     > ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]
     
    -For detailed APIs, please refer Javadoc of [JapaneseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html) as well.
    \ No newline at end of file
    +For detailed APIs, please refer Javadoc of [JapaneseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html) as well.
    +
    +# Tokenizer for Chinese Texts
    +
    +Hivemall-NLP module provides a Chinese text tokenizer UDF using [SmartChineseAnalyzer](http://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html). 
    +
    +> add jar /tmp/[hivemall-nlp-xxx-with-dependencies.jar](https://github.com/myui/hivemall/releases);
    --- End diff --
    
    Also, it would be better to remove the link to the old page `https://github.com/myui/hivemall/releases`.
    
    You can remove the link in the `tokenize_ja` section as well in this PR.



[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    @partyyoung Merged. Thank you for your contribution!



[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    
    [![Coverage Status](https://coveralls.io/builds/12217081/badge)](https://coveralls.io/builds/12217081)
    
    Coverage increased (+0.4%) to 40.564% when pulling **efc3a6deecdc65eebf6946c6b1efb253debdca1b on partyyoung:master** into **9f01ebf20c74559be8a50d459103118a51c229bf on apache:master**.




[GitHub] incubator-hivemall pull request #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/91#discussion_r125117781
  
    --- Diff: nlp/src/main/java/hivemall/nlp/tokenizer/SmartcnUDF.java ---
    @@ -0,0 +1,137 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.nlp.tokenizer;
    +
    +import hivemall.utils.hadoop.HiveUtils;
    +import hivemall.utils.io.IOUtils;
    +
    +import java.io.IOException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.List;
    +
    +import javax.annotation.Nonnull;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    +import org.apache.hadoop.io.Text;
    +import org.apache.lucene.analysis.TokenStream;
    +import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    +import org.apache.lucene.analysis.util.CharArraySet;
    +
    +@Description(
    +        name = "tokenize_cn",
    +        value = "_FUNC_(String line [, const list<string> stopWords])"
    +                + " - returns tokenized strings in array<string>")
    +@UDFType(deterministic = true, stateful = false)
    +public final class SmartcnUDF extends GenericUDF {
    +
    +    private String[] _stopWordsArray;
    +
    +    private transient SmartChineseAnalyzer _analyzer;
    +
    +    @Override
    +    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    +        final int arglen = arguments.length;
    +        if (arglen < 1 || arglen > 2) {
    +            throw new UDFArgumentException("Invalid number of arguments for `tokenize_cn`: "
    +                    + arglen);
    +        }
    +
    +        this._stopWordsArray = (arglen >= 2) ? HiveUtils.getConstStringArray(arguments[1]) : null;
    +        this._analyzer = null;
    +
    +        return ObjectInspectorFactory.getStandardListObjectInspector(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
    +    }
    +
    +    @Override
    +    public List<Text> evaluate(DeferredObject[] arguments) throws HiveException {
    +        SmartChineseAnalyzer analyzer = _analyzer;
    +        if (analyzer == null) {
    +			CharArraySet stopwords = stopWords(_stopWordsArray);
    +            analyzer= new SmartChineseAnalyzer(stopwords);
    +            this._analyzer = analyzer;
    +        }
    +
    +        Object arg0 = arguments[0].get();
    +        if (arg0 == null) {
    +            return null;
    +        }
    +        String line = arg0.toString();
    +
    +        final List<Text> results = new ArrayList<Text>(32);
    +        TokenStream stream = null;
    +        try {
    +            stream = analyzer.tokenStream("", line);
    +            if (stream != null) {
    +                analyzeTokens(stream, results);
    +            }
    +        } catch (IOException e) {
    +            IOUtils.closeQuietly(analyzer);
    +            throw new HiveException(e);
    +        } finally {
    +            IOUtils.closeQuietly(stream);
    +        }
    +        return results;
    +    }
    +
    +    @Override
    +    public void close() throws IOException {
    +        IOUtils.closeQuietly(_analyzer);
    +    }
    +
    --- End diff --
    
    Remove line
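
The lazy initialization in the `evaluate()` diff above (build the analyzer on the first call, reuse it afterwards, release it in `close()`) can be sketched with the JDK alone. This is only an illustration, not the PR's code: `LazyAnalyzerSketch`, `FakeAnalyzer`, and `analyzersCreated` are hypothetical names, and `FakeAnalyzer` merely splits on whitespace as a stand-in for `SmartChineseAnalyzer`, which performs real word segmentation.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LazyAnalyzerSketch implements Closeable {

    /** Hypothetical stand-in for SmartChineseAnalyzer; it only splits on whitespace. */
    static final class FakeAnalyzer implements Closeable {
        boolean closed = false;

        List<String> tokenize(String line) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                return new ArrayList<String>();
            }
            return new ArrayList<String>(Arrays.asList(trimmed.split("\\s+")));
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Not created in the constructor: built on the first evaluate() call.
    private transient FakeAnalyzer analyzer;

    // Counts constructions so that the reuse is observable (illustration only).
    int analyzersCreated = 0;

    public List<String> evaluate(String line) {
        FakeAnalyzer a = analyzer;
        if (a == null) {
            a = new FakeAnalyzer();
            analyzersCreated++;
            this.analyzer = a;
        }
        if (line == null) {
            return null; // null input yields null output, as in the diff
        }
        return a.tokenize(line);
    }

    @Override
    public void close() {
        if (analyzer != null) {
            analyzer.close();
        }
    }

    public static void main(String[] args) {
        LazyAnalyzerSketch udf = new LazyAnalyzerSketch();
        System.out.println(udf.evaluate("hello tokenizer world"));
        System.out.println(udf.evaluate("second call"));
        System.out.println("analyzers created: " + udf.analyzersCreated);
        udf.close();
    }
}
```

As in the diff, the analyzer is created before the null check, so even a null first row pays the construction cost once; later rows reuse the same instance.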



[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by partyyoung <gi...@git.apache.org>.
Github user partyyoung commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    @takuti I am fixing the typos above. Thank you.



[GitHub] incubator-hivemall pull request #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/91#discussion_r125123847
  
    --- Diff: nlp/src/test/java/hivemall/nlp/tokenizer/SmartcnUDFTest.java ---
    @@ -0,0 +1,85 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.nlp.tokenizer;
    +
    +import java.io.IOException;
    +import java.util.List;
    +
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
    +import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    +import org.apache.hadoop.io.Text;
    +import org.junit.Assert;
    +import org.junit.Test;
    +
    +public class SmartcnUDFTest {
    +
    +	@Test
    +	public void testOneArgment() throws UDFArgumentException, IOException {
    --- End diff --
    
    Could you fix typos `Argment` => `Argument` both in `KuromojiUDFTest` and `SmartcnUDFTest`?




[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    @myui sure!



[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    
    [![Coverage Status](https://coveralls.io/builds/12199532/badge)](https://coveralls.io/builds/12199532)
    
    Coverage increased (+0.06%) to 40.186% when pulling **5eb80373641920428d9f95f54b726995e89e8443 on partyyoung:master** into **9f01ebf20c74559be8a50d459103118a51c229bf on apache:master**.




[GitHub] incubator-hivemall pull request #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/91#discussion_r125117516
  
    --- Diff: docs/gitbook/misc/tokenizer.md ---
    @@ -46,4 +46,25 @@ select tokenize_ja("kuromojiを使った分かち書きのテストです。第
     ```
     > ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]
     
    -For detailed APIs, please refer Javadoc of [JapaneseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html) as well.
    \ No newline at end of file
    +For detailed APIs, please refer Javadoc of [JapaneseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html) as well.
    +
    +# Tokenizer for Chinese Texts
    +
    +Hivemall-NLP module provides a Chinese text tokenizer UDF using [SmartChineseAnalyzer](http://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html). 
    +
    +> add jar /tmp/[hivemall-nlp-xxx-with-dependencies.jar](https://github.com/myui/hivemall/releases);
    --- End diff --
    
    The `add jar` and `source` instructions are similar to those for the Japanese tokenizer. If possible, you could organize the page as:
    
    ---
    
    # Tokenizer for Non-English Texts
    
    (explain `add jar` and `source` stuff)
    
    ## Japanese Tokenizer
    
    `tokenize_ja`
    
    ## Chinese Tokenizer
    
    `tokenize_cn`



[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    
    [![Coverage Status](https://coveralls.io/builds/12203226/badge)](https://coveralls.io/builds/12203226)
    
    Coverage increased (+0.06%) to 40.186% when pulling **e24c4fcc7c76d78ca0d8f2a18a5e7316318d0819 on partyyoung:master** into **9f01ebf20c74559be8a50d459103118a51c229bf on apache:master**.




[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    @partyyoung Thank you for fixing!
    
    @takuti Could you review this PR and merge it if it looks good to you (for your experience as a PPMC member)?
    
    Here is a typical flow for merging a pull request to ASF git.
    ```
    git checkout master
    git checkout -b bug123
    git pull http://repourl.git branch
    git log | grep "Author" | head -1
    git checkout master
    git merge --squash bug123
    git commit -a --author="Author <fo...@me.com>" --message="Close #1: fixed a bug in xxx"
    git push origin master
    ```



[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    Applying `mvn formatter:format` should be mentioned in the contribution guide.



[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    Hi @partyyoung 
    
    Thank you for your contribution!
    
    Could you describe the usage in https://github.com/apache/incubator-hivemall/blob/master/docs/gitbook/misc/tokenizer.md ?
    
    `npm install -g gitbook-cli`
    `cd docs/gitbook; gitbook serve` to check [the user guide](http://hivemall.incubator.apache.org/userguide/misc/tokenizer.html).
    
    Also, `sec 2.3.` can be renamed to `Text Tokenizer`.



[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    @myui It is! https://github.com/apache/incubator-hivemall/pull/92/files#diff-04c6e90faac2675aa89e2176d2eec7d8



[GitHub] incubator-hivemall pull request #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/91#discussion_r125120746
  
    --- Diff: resources/ddl/define-additional.hive ---
    @@ -9,6 +9,9 @@
     drop temporary function if exists tokenize_ja;
     create temporary function tokenize_ja as 'hivemall.nlp.tokenizer.KuromojiUDF';
     
    +drop temporary function if exists tokenize_cn;
    +create temporary function tokenize_cn as 'hivemall.nlp.tokenizer.SmartcnUDF';
    --- End diff --
    
    Could you also update `resources/ddl/define-udfs.td.hql`?
    
    ```
    $ grep -r 'tokenize_ja' resources/ddl
    resources/ddl/define-additional.hive:drop temporary function if exists tokenize_ja;
    resources/ddl/define-additional.hive:create temporary function tokenize_ja as 'hivemall.nlp.tokenizer.KuromojiUDF';
    resources/ddl/define-udfs.td.hql:create temporary function tokenize_ja as 'hivemall.nlp.tokenizer.KuromojiUDF';
    ```



[GitHub] incubator-hivemall pull request #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/91#discussion_r125123547
  
    --- Diff: nlp/src/test/java/hivemall/nlp/tokenizer/SmartcnUDFTest.java ---
    @@ -0,0 +1,85 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.nlp.tokenizer;
    +
    +import java.io.IOException;
    +import java.util.List;
    +
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
    +import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    +import org.apache.hadoop.io.Text;
    +import org.junit.Assert;
    +import org.junit.Test;
    +
    +public class SmartcnUDFTest {
    +
    +	@Test
    +	public void testOneArgment() throws UDFArgumentException, IOException {
    +		GenericUDF udf = new SmartcnUDF();
    +		ObjectInspector[] argOIs = new ObjectInspector[1];
    +		// line
    +		argOIs[0] = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    +		udf.initialize(argOIs);
    +		udf.close();
    +	}
    +
    +	@Test
    +	public void testTwoArgment() throws UDFArgumentException, IOException {
    +		GenericUDF udf = new SmartcnUDF();
    +		ObjectInspector[] argOIs = new ObjectInspector[2];
    +		// line
    +		argOIs[0] = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    +		// stopWords
    +		argOIs[1] = ObjectInspectorFactory
    +				.getStandardConstantListObjectInspector(
    +						PrimitiveObjectInspectorFactory.javaStringObjectInspector,
    +						null);
    +		udf.initialize(argOIs);
    +		udf.close();
    +	}
    +
    +	@Test
    +	public void testEvalauteOneRow() throws IOException, HiveException {
    --- End diff --
    
    Typo `Evalaute` => `Evaluate`
    
    I know the typo originally comes from `KuromojiUDFTest.java` :) Could you fix both typos?



[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    
    [![Coverage Status](https://coveralls.io/builds/12203197/badge)](https://coveralls.io/builds/12203197)
    
    Coverage increased (+0.06%) to 40.186% when pulling **e24c4fcc7c76d78ca0d8f2a18a5e7316318d0819 on partyyoung:master** into **9f01ebf20c74559be8a50d459103118a51c229bf on apache:master**.




[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    It seems that the code indentation is strange in the latest commit. I'll fix it.
    It would be better to apply `mvn formatter:format`.



[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    @myui Yep, that's why I've created another PR, #92. As long as the coding guidelines are not clearly specified, we should not force contributors to follow a specific coding convention :)



[GitHub] incubator-hivemall pull request #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-hivemall/pull/91



[GitHub] incubator-hivemall issue #91: [HIVEMALL-122] Added tokenize_cn UDF

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/91
  
    Also, `tokenize_cn` should be added to the [DDLs](https://github.com/apache/incubator-hivemall/tree/master/resources/ddl).
    
    Grep for `tokenize_ja` in that directory for reference.

