Posted to issues@hivemall.apache.org by takuti <gi...@git.apache.org> on 2017/10/02 05:50:27 UTC

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

GitHub user takuti opened a pull request:

    https://github.com/apache/incubator-hivemall/pull/118

    [HIVEMALL-146] Yet another UDF to generate n-grams

    ## What changes were proposed in this pull request?
    
    Add a new UDF `to_ngrams(array<string> words, int minSize, int maxSize)` that returns the list of n-grams with `minSize <= n <= maxSize` for the given words. This UDF can serve as an alternative to the original Hive `ngrams` function.
    
    ## What type of PR is it?
    
    Feature
    
    ## What is the Jira issue?
    
    https://issues.apache.org/jira/browse/HIVEMALL-146
    
    ## How was this patch tested?
    
    Unit tests, plus manual tests on both EMR and local Hive
    
    ## How to use this feature?
    
    as documented
    
    ## Checklist
    
    (Please remove this section if not needed; check `x` for YES, blank for NO)
    
    - [x] Did you apply source code formatter, i.e., `mvn formatter:format`, for your commit?
    - [x] Did you run system tests on Hive (or Spark)?
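
The behaviour described under "What changes were proposed" can be sketched in plain Java. This is a minimal illustration, not the Hivemall UDF: the class `ToNgramsSketch` and the method `toNgrams` are hypothetical names, and plain `String` lists stand in for Hive's `array<string>`. It assumes n-grams are joined with a single space and emitted start position by start position, as in the diff under review.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; it does not depend on Hive or Hivemall.
class ToNgramsSketch {

    // Returns every n-gram with minSize <= n <= maxSize, joined by a single space.
    static List<String> toNgrams(List<String> words, int minSize, int maxSize) {
        if (minSize <= 0 || minSize > maxSize) {
            throw new IllegalArgumentException("require 0 < minSize <= maxSize");
        }
        List<String> ngrams = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            // Stop growing n once the window would run past the last word.
            for (int n = minSize; n <= maxSize && i + n <= words.size(); n++) {
                ngrams.add(String.join(" ", words.subList(i, i + n)));
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        // prints [machine, machine learning, learning, learning library, library]
        System.out.println(toNgrams(Arrays.asList("machine", "learning", "library"), 1, 2));
    }
}
```

With `minSize = 1` and `maxSize = 2`, three words yield three unigrams and two bigrams.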


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/takuti/incubator-hivemall HIVEMALL-146-ngrams

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hivemall/pull/118.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #118
    
----
commit 6e9d08f264c173410e4a90fa4533db0dd28836ca
Author: Takuya Kitazawa <k....@gmail.com>
Date:   2017-10-02T05:38:31Z

    Implement `to_ngrams` UDF

commit df81ee2de13666636068c1691f595b775e39a6f5
Author: Takuya Kitazawa <k....@gmail.com>
Date:   2017-10-02T05:46:35Z

    Update document

----


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142313252
  
    --- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
    @@ -0,0 +1,76 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.tools.text;
    +
    +import hivemall.utils.lang.StringUtils;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    --- End diff --
    
    Sounds good.


---

[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/118
  
    
    [![Coverage Status](https://coveralls.io/builds/13539099/badge)](https://coveralls.io/builds/13539099)
    
    Coverage increased (+0.05%) to 41.107% when pulling **022d23c61a44337eb1172214daf43321eb8f021e on takuti:HIVEMALL-146-ngrams** into **1e42387576fabbb326d451f4a00ac22d57828711 on apache:master**.



---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142300766
  
    --- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
    @@ -0,0 +1,76 @@
    +package hivemall.tools.text;
    +
    +import hivemall.utils.lang.StringUtils;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    --- End diff --
    
    Since the function is also applicable to a list of characters (character-based n-grams), using the word `word` in the name, e.g., `wordgram`, sounds inappropriate to me.
    
    How about:
    
    - `ngrams_joined_all`
      - in contrast to the original `ngrams`, which returns the top-k most frequent n-grams, each represented as a list of words
    - `ngrams_between`, `ngrams_in_range`
      - our UDF returns n-grams where `min <= n <= max`
    - `to_ngram_list`
      - a little longer than `to_ngrams` and less likely to be confusing
      - `ngrams` creates a list of named structs, but ours simply returns a list of n-grams


---

[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/118
  
    
    [![Coverage Status](https://coveralls.io/builds/13537452/badge)](https://coveralls.io/builds/13537452)
    
    Coverage increased (+0.03%) to 41.094% when pulling **e4240d5175a2fca0f9e99bc2ade95e24655084df on takuti:HIVEMALL-146-ngrams** into **1e42387576fabbb326d451f4a00ac22d57828711 on apache:master**.



---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142323455
  
    --- Diff: core/src/main/java/hivemall/tools/text/WordNgramsUDF.java ---
    @@ -0,0 +1,85 @@
    +package hivemall.tools.text;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +import javax.annotation.Nullable;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "word_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    +        + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
    +@UDFType(deterministic = true, stateful = false)
    +public final class WordNgramsUDF extends UDF {
    +
    +    @Nullable
    +    public List<Text> evaluate(@Nullable final List<Text> words, final int minSize,
    +            final int maxSize) throws HiveException {
    +        if (words == null) {
    +            return null;
    +        }
    +        if (minSize <= 0) {
    +            throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
    +        }
    +        if (minSize > maxSize) {
    +            throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
    +                    + maxSize);
    +        }
    +        return getNgrams(words, minSize, maxSize);
    +    }
    +
    +    @Nonnull
    +    private static List<Text> getNgrams(@Nonnull final List<Text> words,
    +            @Nonnegative final int minSize, @Nonnegative final int maxSize) throws HiveException {
    +        final List<Text> ngrams = new ArrayList<Text>();
    +        for (int i = 0, numWords = words.size(); i < numWords; i++) {
    +            for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
    +                final int end = i + ngramSize;
    +                if (end > numWords) { // exceeds the final element
    +                    continue;
    +                }
    +
    +                final StringBuilder ngram = new StringBuilder();
    +                for (int j = i; j < end; j++) {
    +                    final Text word = words.get(j);
    +                    if (word == null) {
    +                        throw new HiveException(
    +                            "`array<string> words` must not contain NULL element");
    +                    }
    +                    if (j > i) { // insert single whitespace between elements
    +                        ngram.append(" ");
    +                    }
    +                    ngram.append(word.toString());
    +                }
    +                ngrams.add(new Text(ngram.toString()));
    +            }
    --- End diff --
    
    @takuti oops... Could you reuse the `StringBuilder` instance, as commented?
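
One way to read the request above is to allocate the `StringBuilder` once and clear it with `setLength(0)` before each n-gram. The sketch below is an assumption about what was meant, not the merged code: `ReuseBuilderSketch` and its `ngrams` method are hypothetical names, and plain `String` lists replace Hadoop's `Text` so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of reusing a single StringBuilder across all n-grams.
class ReuseBuilderSketch {

    static List<String> ngrams(List<String> words, int minSize, int maxSize) {
        final List<String> ngrams = new ArrayList<>();
        final StringBuilder buf = new StringBuilder(); // allocated once, reused below
        final int numWords = words.size();
        for (int i = 0; i < numWords; i++) {
            for (int n = minSize; n <= maxSize; n++) {
                final int end = i + n;
                if (end > numWords) {
                    break; // every larger n would exceed the final element as well
                }
                buf.setLength(0); // clear the previous n-gram without reallocating
                for (int j = i; j < end; j++) {
                    if (j > i) {
                        buf.append(' '); // single space between elements
                    }
                    buf.append(words.get(j));
                }
                ngrams.add(buf.toString());
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        // prints [a b, a b c, b c]
        System.out.println(ngrams(Arrays.asList("a", "b", "c"), 2, 3));
    }
}
```

`toString()` copies the buffer contents, so clearing the builder afterwards does not affect n-grams that were already added to the result.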


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142307304
  
    --- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
    @@ -0,0 +1,76 @@
    +package hivemall.tools.text;
    +
    +import hivemall.utils.lang.StringUtils;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    --- End diff --
    
    If a user wants character-level n-grams, the existing `ngrams` function is fine.
    
    This function is for word-level n-grams, and thus `wordgram` seems fine.


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142312467
  
    --- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
    @@ -0,0 +1,76 @@
    +package hivemall.tools.text;
    +
    +import hivemall.utils.lang.StringUtils;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    +        + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
    +@UDFType(deterministic = true, stateful = false)
    +public final class NgramsUDF extends UDF {
    +
    +    public List<Text> evaluate(final List<Text> words, final int minSize, final int maxSize)
    +            throws HiveException {
    +        if (words == null) {
    +            return null;
    +        }
    +        if (minSize <= 0) {
    +            throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
    +        }
    +        if (minSize > maxSize) {
    +            throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
    +                    + maxSize);
    +        }
    +        return getNgrams(words, minSize, maxSize);
    +    }
    +
    +    @Nonnull
    +    private List<Text> getNgrams(@Nonnull final List<Text> words, @Nonnegative final int minSize,
    +            @Nonnegative final int maxSize) {
    +        final List<Text> ngrams = new ArrayList<Text>();
    +        for (int i = 0, numWords = words.size(); i < numWords; i++) {
    +            for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
    +                if (i + ngramSize > numWords) { // exceeds the final element
    +                    continue;
    +                }
    +
    +                final List<String> ngram = new ArrayList<String>();
    +                for (int j = i; j < i + ngramSize; j++) {
    +                    ngram.add(words.get(j).toString());
    +                }
    +                ngrams.add(new Text(StringUtils.join(ngram, " ")));
    --- End diff --
    
    Good point - `StringUtils.join()` adds whitespace even for a NULL element.
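
The concern about NULL elements can be illustrated with the JDK alone. The sketch below deliberately swaps in `java.lang.String.join` for Hivemall's `StringUtils.join`, whose exact behaviour is not shown in this thread. The JDK method renders a null element as the text `null` and still inserts separators around it. `NullJoinSketch` and `joinSkippingNulls` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; uses the JDK's String.join, not hivemall.utils.lang.StringUtils.
class NullJoinSketch {

    // Joins the non-null elements with single spaces and skips nulls entirely.
    static String joinSkippingNulls(List<String> parts) {
        final StringBuilder buf = new StringBuilder();
        for (String p : parts) {
            if (p == null) {
                continue;
            }
            if (buf.length() > 0) {
                buf.append(' ');
            }
            buf.append(p);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        final List<String> parts = Arrays.asList("a", null, "b");
        System.out.println(String.join(" ", parts));  // prints "a null b"
        System.out.println(joinSkippingNulls(parts)); // prints "a b"
    }
}
```

Skipping nulls silently is only one option. Throwing an exception on a null element keeps every n-gram at exactly `n` words.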


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142065112
  
    --- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
    @@ -0,0 +1,76 @@
    +package hivemall.tools.text;
    +
    +import hivemall.utils.lang.StringUtils;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    +        + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
    +@UDFType(deterministic = true, stateful = false)
    +public final class NgramsUDF extends UDF {
    +
    +    public List<Text> evaluate(final List<Text> words, final int minSize, final int maxSize)
    +            throws HiveException {
    +        if (words == null) {
    +            return null;
    +        }
    +        if (minSize <= 0) {
    +            throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
    +        }
    +        if (minSize > maxSize) {
    +            throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
    +                    + maxSize);
    +        }
    +        return getNgrams(words, minSize, maxSize);
    +    }
    +
    +    @Nonnull
    +    private List<Text> getNgrams(@Nonnull final List<Text> words, @Nonnegative final int minSize,
    --- End diff --
    
    `private static`


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142069282
  
    --- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
    @@ -0,0 +1,76 @@
    +package hivemall.tools.text;
    +
    +import hivemall.utils.lang.StringUtils;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    +        + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
    +@UDFType(deterministic = true, stateful = false)
    +public final class NgramsUDF extends UDF {
    +
    +    public List<Text> evaluate(final List<Text> words, final int minSize, final int maxSize)
    +            throws HiveException {
    +        if (words == null) {
    +            return null;
    +        }
    +        if (minSize <= 0) {
    +            throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
    +        }
    +        if (minSize > maxSize) {
    +            throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
    +                    + maxSize);
    +        }
    +        return getNgrams(words, minSize, maxSize);
    +    }
    +
    +    @Nonnull
    +    private List<Text> getNgrams(@Nonnull final List<Text> words, @Nonnegative final int minSize,
    +            @Nonnegative final int maxSize) {
    +        final List<Text> ngrams = new ArrayList<Text>();
    +        for (int i = 0, numWords = words.size(); i < numWords; i++) {
    +            for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
    +                if (i + ngramSize > numWords) { // exceeds the final element
    +                    continue;
    +                }
    +
    +                final List<String> ngram = new ArrayList<String>();
    +                for (int j = i; j < i + ngramSize; j++) {
    +                    ngram.add(words.get(j).toString());
    +                }
    +                ngrams.add(new Text(StringUtils.join(ngram, " ")));
    --- End diff --
    
    Is an empty string added when `i + ngramSize == numWords`?
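
The boundary case raised above can be checked in isolation. In this minimal sketch (`BoundarySketch` and `window` are hypothetical names, and plain strings stand in for `Text`), the window covers indices `i` up to but excluding `i + n`, so when `i + n` equals the number of words it still contains exactly `n` elements and the result is not empty. Only `i + n > numWords` is rejected, mirroring the guard in the diff.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch isolating the window-boundary check from the reviewed loop.
class BoundarySketch {

    // Joins words[i .. i+n) with spaces, or returns null if the window runs past the list.
    static String window(List<String> words, int i, int n) {
        final int end = i + n;
        if (end > words.size()) { // same guard as `i + ngramSize > numWords`
            return null;
        }
        return String.join(" ", words.subList(i, end));
    }

    public static void main(String[] args) {
        final List<String> words = Arrays.asList("a", "b", "c");
        System.out.println(window(words, 1, 2)); // i + n == 3 == numWords: prints "b c"
        System.out.println(window(words, 2, 2)); // i + n == 4 > numWords: prints "null"
    }
}
```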


---

[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/118
  
    @takuti LGTM. Could you squash and merge this PR?


---

[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/118
  
    
    [![Coverage Status](https://coveralls.io/builds/13521010/badge)](https://coveralls.io/builds/13521010)
    
    Coverage increased (+0.03%) to 41.092% when pulling **6e6de58a36bdc50a205e7c3f3d97f884012761e2 on takuti:HIVEMALL-146-ngrams** into **1e42387576fabbb326d451f4a00ac22d57828711 on apache:master**.



---

[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/118
  
    
    [![Coverage Status](https://coveralls.io/builds/13537534/badge)](https://coveralls.io/builds/13537534)
    
    Coverage increased (+0.03%) to 41.095% when pulling **e4240d5175a2fca0f9e99bc2ade95e24655084df on takuti:HIVEMALL-146-ngrams** into **1e42387576fabbb326d451f4a00ac22d57828711 on apache:master**.



---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142065541
  
    --- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
    @@ -0,0 +1,76 @@
    +package hivemall.tools.text;
    +
    +import hivemall.utils.lang.StringUtils;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    --- End diff --
    
    An `ngrams` function already exists in Hive, so the name is confusing.
    https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining
    
    How about renaming `to_ngrams` to `wordgram`?


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142308265
  
    --- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
    @@ -0,0 +1,76 @@
    +package hivemall.tools.text;
    +
    +import hivemall.utils.lang.StringUtils;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    +        + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
    +@UDFType(deterministic = true, stateful = false)
    +public final class NgramsUDF extends UDF {
    +
    +    public List<Text> evaluate(final List<Text> words, final int minSize, final int maxSize)
    +            throws HiveException {
    +        if (words == null) {
    +            return null;
    +        }
    +        if (minSize <= 0) {
    +            throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
    +        }
    +        if (minSize > maxSize) {
    +            throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
    +                    + maxSize);
    +        }
    +        return getNgrams(words, minSize, maxSize);
    +    }
    +
    +    @Nonnull
    +    private List<Text> getNgrams(@Nonnull final List<Text> words, @Nonnegative final int minSize,
    +            @Nonnegative final int maxSize) {
    +        final List<Text> ngrams = new ArrayList<Text>();
    +        for (int i = 0, numWords = words.size(); i < numWords; i++) {
    +            for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
    +                if (i + ngramSize > numWords) { // exceeds the final element
    +                    continue;
    +                }
    +
    +                final List<String> ngram = new ArrayList<String>();
    +                for (int j = i; j < i + ngramSize; j++) {
    +                    ngram.add(words.get(j).toString());
    +                }
    +                ngrams.add(new Text(StringUtils.join(ngram, " ")));
    --- End diff --
    
    I see. Then avoid the intermediate ArrayList, as follows, and handle null elements in the words array.
    
    ```java
    final int numWords = words.size();
    for (int i = 0; i < numWords; i++) {
      for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
        final int end = i + ngramSize;
        if (end > numWords) {
          continue;
        }
        final StringBuilder buf = new StringBuilder();
        for (int j = i; j < end; j++) {
          Text w = words.get(j);
          if (w == null) { continue; } // avoid appending "null"; alternatively, throw an exception
          if (buf.length() > 0) { buf.append(' '); } // separator only between appended words
          buf.append(w.toString());
        }
        ngrams.add(new Text(buf.toString()));
      }
    }
    ```
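
The suggestion above can be tried outside Hive with a minimal, self-contained sketch. It assumes plain `String` in place of Hadoop's `Text`, and the names `WordNgramsSketch` and `ngrams` are hypothetical rather than part of the PR. Unlike the snippet above, it rejects null elements instead of skipping them, and it leaves the inner size loop with `break`, since every larger size would also run past the last element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; uses String instead of org.apache.hadoop.io.Text.
final class WordNgramsSketch {

    // Returns every n-gram with minSize <= n <= maxSize, joined by a single space.
    static List<String> ngrams(List<String> words, int minSize, int maxSize) {
        if (minSize <= 0 || minSize > maxSize) {
            throw new IllegalArgumentException("require 0 < minSize <= maxSize");
        }
        final List<String> result = new ArrayList<>();
        final int numWords = words.size();
        for (int i = 0; i < numWords; i++) {
            for (int n = minSize; n <= maxSize; n++) {
                final int end = i + n;
                if (end > numWords) {
                    break; // any larger n would also exceed the final element
                }
                final StringBuilder buf = new StringBuilder();
                for (int j = i; j < end; j++) {
                    final String w = words.get(j);
                    if (w == null) {
                        throw new IllegalArgumentException("words must not contain null");
                    }
                    if (j > i) {
                        buf.append(' '); // single space between words
                    }
                    buf.append(w);
                }
                result.add(buf.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [machine, machine learning, learning, learning library, library]
        System.out.println(ngrams(Arrays.asList("machine", "learning", "library"), 1, 2));
    }
}
```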


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-hivemall/pull/118


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142328137
  
    --- Diff: core/src/main/java/hivemall/tools/text/WordNgramsUDF.java ---
    @@ -0,0 +1,85 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.tools.text;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +import javax.annotation.Nullable;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "word_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    +        + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
    +@UDFType(deterministic = true, stateful = false)
    +public final class WordNgramsUDF extends UDF {
    +
    +    @Nullable
    +    public List<Text> evaluate(@Nullable final List<Text> words, final int minSize,
    +            final int maxSize) throws HiveException {
    +        if (words == null) {
    +            return null;
    +        }
    +        if (minSize <= 0) {
    +            throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
    +        }
    +        if (minSize > maxSize) {
    +            throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
    +                    + maxSize);
    +        }
    +        return getNgrams(words, minSize, maxSize);
    +    }
    +
    +    @Nonnull
    +    private static List<Text> getNgrams(@Nonnull final List<Text> words,
    +            @Nonnegative final int minSize, @Nonnegative final int maxSize) throws HiveException {
    +        final List<Text> ngrams = new ArrayList<Text>();
    +        for (int i = 0, numWords = words.size(); i < numWords; i++) {
    +            for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
    +                final int end = i + ngramSize;
    +                if (end > numWords) { // exceeds the final element
    +                    continue;
    +                }
    +
    +                final StringBuilder ngram = new StringBuilder();
    +                for (int j = i; j < end; j++) {
    +                    final Text word = words.get(j);
    +                    if (word == null) {
    +                        throw new HiveException(
    +                            "`array<string> words` must not contain NULL element");
    +                    }
    +                    if (j > i) { // insert single whitespace between elements
    +                        ngram.append(" ");
    +                    }
    +                    ngram.append(word.toString());
    +                }
    +                ngrams.add(new Text(ngram.toString()));
    +            }
    --- End diff --
    
    👍 


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142323289
  
    --- Diff: core/src/main/java/hivemall/tools/text/WordNgramsUDF.java ---
    @@ -0,0 +1,85 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.tools.text;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +import javax.annotation.Nullable;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "word_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    +        + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
    +@UDFType(deterministic = true, stateful = false)
    +public final class WordNgramsUDF extends UDF {
    +
    +    @Nullable
    +    public List<Text> evaluate(@Nullable final List<Text> words, final int minSize,
    +            final int maxSize) throws HiveException {
    +        if (words == null) {
    +            return null;
    +        }
    +        if (minSize <= 0) {
    +            throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
    +        }
    +        if (minSize > maxSize) {
    +            throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
    +                    + maxSize);
    +        }
    +        return getNgrams(words, minSize, maxSize);
    +    }
    +
    +    @Nonnull
    +    private static List<Text> getNgrams(@Nonnull final List<Text> words,
    +            @Nonnegative final int minSize, @Nonnegative final int maxSize) throws HiveException {
    +        final List<Text> ngrams = new ArrayList<Text>();
    --- End diff --
    
    final StringBuilder ngram = new StringBuilder();


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142298457
  
    --- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
    @@ -0,0 +1,76 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.tools.text;
    +
    +import hivemall.utils.lang.StringUtils;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    --- End diff --
    
    I agree, but at the same time I would like to use an intuitive, well-known term as the name of such a simple function. Let me think.


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142296940
  
    --- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
    @@ -0,0 +1,76 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.tools.text;
    +
    +import hivemall.utils.lang.StringUtils;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    +        + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
    +@UDFType(deterministic = true, stateful = false)
    +public final class NgramsUDF extends UDF {
    +
    +    public List<Text> evaluate(final List<Text> words, final int minSize, final int maxSize)
    +            throws HiveException {
    +        if (words == null) {
    +            return null;
    +        }
    +        if (minSize <= 0) {
    +            throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
    +        }
    +        if (minSize > maxSize) {
    +            throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
    +                    + maxSize);
    +        }
    +        return getNgrams(words, minSize, maxSize);
    +    }
    +
    +    @Nonnull
    +    private List<Text> getNgrams(@Nonnull final List<Text> words, @Nonnegative final int minSize,
    +            @Nonnegative final int maxSize) {
    +        final List<Text> ngrams = new ArrayList<Text>();
    +        for (int i = 0, numWords = words.size(); i < numWords; i++) {
    +            for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
    +                if (i + ngramSize > numWords) { // exceeds the final element
    +                    continue;
    +                }
    +
    +                final List<String> ngram = new ArrayList<String>();
    +                for (int j = i; j < i + ngramSize; j++) {
    +                    ngram.add(words.get(j).toString());
    +                }
    +                ngrams.add(new Text(StringUtils.join(ngram, " ")));
    --- End diff --
    
    `ngram` will never be empty thanks to `if (i + ngramSize > numWords) continue;`. Notice that the for-loop creating `ngram` increments `j` from `i` to `i + ngramSize - 1`.
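
The bound described above can be checked mechanically. The sketch below is a hypothetical stand-alone helper (`WindowBoundsSketch.countWindows` is not part of the PR) that replays the same three loops over indices only and verifies that every window passing the guard contains exactly `ngramSize` elements. It assumes `minSize >= 1`, which `evaluate` already enforces.

```java
// Hypothetical helper: replays the loop structure of getNgrams on indices only.
final class WindowBoundsSketch {

    // Counts the windows that pass the guard; fails if any window is not ngramSize long.
    static int countWindows(int numWords, int minSize, int maxSize) {
        int count = 0;
        for (int i = 0; i < numWords; i++) {
            for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
                if (i + ngramSize > numWords) { // exceeds the final element
                    continue;
                }
                int length = 0;
                for (int j = i; j < i + ngramSize; j++) { // j runs from i to i + ngramSize - 1
                    length++;
                }
                if (length != ngramSize) {
                    throw new IllegalStateException("unexpected window length: " + length);
                }
                count++;
            }
        }
        return count;
    }
}
```

For example, `countWindows(3, 1, 2)` visits five windows, matching the five n-grams a three-word input yields for sizes 1 and 2.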


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142323365
  
    --- Diff: core/src/main/java/hivemall/tools/text/WordNgramsUDF.java ---
    @@ -0,0 +1,85 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.tools.text;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +import javax.annotation.Nullable;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "word_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    +        + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
    +@UDFType(deterministic = true, stateful = false)
    +public final class WordNgramsUDF extends UDF {
    +
    +    @Nullable
    +    public List<Text> evaluate(@Nullable final List<Text> words, final int minSize,
    +            final int maxSize) throws HiveException {
    +        if (words == null) {
    +            return null;
    +        }
    +        if (minSize <= 0) {
    +            throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
    +        }
    +        if (minSize > maxSize) {
    +            throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
    +                    + maxSize);
    +        }
    +        return getNgrams(words, minSize, maxSize);
    +    }
    +
    +    @Nonnull
    +    private static List<Text> getNgrams(@Nonnull final List<Text> words,
    +            @Nonnegative final int minSize, @Nonnegative final int maxSize) throws HiveException {
    +        final List<Text> ngrams = new ArrayList<Text>();
    +        for (int i = 0, numWords = words.size(); i < numWords; i++) {
    +            for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
    +                final int end = i + ngramSize;
    +                if (end > numWords) { // exceeds the final element
    +                    continue;
    +                }
    +
    +                final StringBuilder ngram = new StringBuilder();
    +                for (int j = i; j < end; j++) {
    +                    final Text word = words.get(j);
    +                    if (word == null) {
    +                        throw new HiveException(
    +                            "`array<string> words` must not contain NULL element");
    +                    }
    +                    if (j > i) { // insert single whitespace between elements
    +                        ngram.append(" ");
    +                    }
    +                    ngram.append(word.toString());
    +                }
    +                ngrams.add(new Text(ngram.toString()));
    +            }
    --- End diff --
    
    StringUtils.clear(ngram);
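
The comment above appears to ask for one `StringBuilder` to be reused across n-grams and cleared after each one. A minimal sketch of that pattern follows. It is an assumption that clearing means resetting the buffer length, so the standard `setLength(0)` stands in for the Hivemall helper, and `ReusedBufferSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of reusing one buffer instead of allocating one per n-gram.
final class ReusedBufferSketch {

    // Builds word bigrams with a single StringBuilder that is cleared after each use.
    static List<String> bigrams(List<String> words) {
        final List<String> result = new ArrayList<>();
        final StringBuilder buf = new StringBuilder(); // allocated once
        for (int i = 0; i + 2 <= words.size(); i++) {
            buf.append(words.get(i)).append(' ').append(words.get(i + 1));
            result.add(buf.toString()); // toString() copies, so clearing afterwards is safe
            buf.setLength(0); // reset for the next n-gram
        }
        return result;
    }
}
```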


---

[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/118
  
    
    Coverage increased (+0.05%) to 41.109% when pulling **6b467688338c1824f52f507852421ad30222002b on takuti:HIVEMALL-146-ngrams** into **1e42387576fabbb326d451f4a00ac22d57828711 on apache:master**.



---

[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/118
  
    I personally prefer `wordgrams` though.
    http://search.cpan.org/dist/Text-WordGrams/lib/Text/WordGrams.pm


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142312317
  
    --- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
    @@ -0,0 +1,76 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.tools.text;
    +
    +import hivemall.utils.lang.StringUtils;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    --- End diff --
    
    Also, I've noticed that, since the results are joined with whitespace, using this new UDF for character-based n-grams is infeasible.
    
    Let's call it `word_ngrams`.
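
The whitespace issue can be illustrated with a small sketch. `JoinDelimiterSketch` and its `delimiter` parameter are hypothetical and only for illustration; the UDF in this PR always joins with a single space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the same bigram loop with a configurable delimiter.
final class JoinDelimiterSketch {

    // Joins each pair of adjacent tokens with the given delimiter.
    static List<String> bigrams(String[] tokens, String delimiter) {
        final List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            result.add(tokens[i] + delimiter + tokens[i + 1]);
        }
        return result;
    }
}
```

With the characters of "cat" as tokens, a space delimiter yields `c a` and `a t`, whereas character n-grams would need `ca` and `at`, which is why the word-oriented name fits better.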


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142323566
  
    --- Diff: core/src/main/java/hivemall/tools/text/WordNgramsUDF.java ---
    @@ -0,0 +1,85 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.tools.text;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +import javax.annotation.Nullable;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "word_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    +        + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
    +@UDFType(deterministic = true, stateful = false)
    +public final class WordNgramsUDF extends UDF {
    +
    +    @Nullable
    +    public List<Text> evaluate(@Nullable final List<Text> words, final int minSize,
    +            final int maxSize) throws HiveException {
    +        if (words == null) {
    +            return null;
    +        }
    +        if (minSize <= 0) {
    +            throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
    +        }
    +        if (minSize > maxSize) {
    +            throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
    +                    + maxSize);
    +        }
    +        return getNgrams(words, minSize, maxSize);
    +    }
    +
    +    @Nonnull
    +    private static List<Text> getNgrams(@Nonnull final List<Text> words,
    +            @Nonnegative final int minSize, @Nonnegative final int maxSize) throws HiveException {
    +        final List<Text> ngrams = new ArrayList<Text>();
    +        for (int i = 0, numWords = words.size(); i < numWords; i++) {
    +            for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
    +                final int end = i + ngramSize;
    +                if (end > numWords) { // exceeds the final element
    +                    continue;
    +                }
    +
    +                final StringBuilder ngram = new StringBuilder();
    +                for (int j = i; j < end; j++) {
    +                    final Text word = words.get(j);
    +                    if (word == null) {
    +                        throw new HiveException(
    --- End diff --
    
    UDFArgumentException is more appropriate.


---

[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/118#discussion_r142062132
  
    --- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
    @@ -0,0 +1,76 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package hivemall.tools.text;
    +
    +import hivemall.utils.lang.StringUtils;
    +
    +import org.apache.hadoop.hive.ql.exec.Description;
    +import org.apache.hadoop.hive.ql.exec.UDF;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.ql.udf.UDFType;
    +import org.apache.hadoop.io.Text;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
    +        + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
    +@UDFType(deterministic = true, stateful = false)
    +public final class NgramsUDF extends UDF {
    +
    +    public List<Text> evaluate(final List<Text> words, final int minSize, final int maxSize)
    --- End diff --
    
    ```java
    @Nullable
    public List<Text> evaluate( @Nullable final List<Text> words, ...
    ```


---