Posted to issues@hivemall.apache.org by takuti <gi...@git.apache.org> on 2017/10/02 05:50:27 UTC
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
GitHub user takuti opened a pull request:
https://github.com/apache/incubator-hivemall/pull/118
[HIVEMALL-146] Yet another UDF to generate n-grams
## What changes were proposed in this pull request?
Add a new UDF `to_ngrams(array<string> words, int minSize, int maxSize)` that returns the list of n-grams with `minSize <= n <= maxSize` for the given words. This UDF can serve as an alternative to Hive's original `ngrams` function.
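As a rough illustration of the behavior described above, the following standalone sketch enumerates n-grams in the same order as the loops in the diff (by start position, then by size). It is not the code from this PR: plain `String` stands in for Hive's `Text`, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: plain Strings stand in for Hive's Text type,
// and the class/method names are hypothetical, not taken from the PR.
final class NgramRangeSketch {

    // Returns every n-gram with minSize <= n <= maxSize, joined by a single space.
    static List<String> ngrams(List<String> words, int minSize, int maxSize) {
        if (minSize <= 0 || minSize > maxSize) {
            throw new IllegalArgumentException("require 0 < minSize <= maxSize");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            for (int n = minSize; n <= maxSize; n++) {
                int end = i + n;
                if (end > words.size()) {
                    break; // longer n-grams starting at i cannot fit either
                }
                result.add(String.join(" ", words.subList(i, end)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [machine, machine learning, learning, learning library, library]
        System.out.println(ngrams(Arrays.asList("machine", "learning", "library"), 1, 2));
    }
}
```

With `minSize = 1` and `maxSize = 2`, each start position contributes its unigram and then its bigram, provided the bigram still fits in the list.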
## What type of PR is it?
Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-146
## How was this patch tested?
Unit test, manual tests both on EMR and local Hive
## How to use this feature?
as documented
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `mvn formatter:format`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/takuti/incubator-hivemall HIVEMALL-146-ngrams
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-hivemall/pull/118.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #118
----
commit 6e9d08f264c173410e4a90fa4533db0dd28836ca
Author: Takuya Kitazawa <k....@gmail.com>
Date: 2017-10-02T05:38:31Z
Implement `to_ngrams` UDF
commit df81ee2de13666636068c1691f595b775e39a6f5
Author: Takuya Kitazawa <k....@gmail.com>
Date: 2017-10-02T05:46:35Z
Update document
----
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142313252
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
--- End diff --
Sounds good.
---
[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/118
[![Coverage Status](https://coveralls.io/builds/13539099/badge)](https://coveralls.io/builds/13539099)
Coverage increased (+0.05%) to 41.107% when pulling **022d23c61a44337eb1172214daf43321eb8f021e on takuti:HIVEMALL-146-ngrams** into **1e42387576fabbb326d451f4a00ac22d57828711 on apache:master**.
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142300766
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
--- End diff --
Since the function is also applicable to a list of characters (character-based n-grams), a name containing `word`, e.g., `wordgram`, sounds inappropriate to me.
How about:
- `ngrams_joined_all`
  - in contrast to the original `ngrams`, which returns the top-k most frequent n-grams, each represented as a list of words
- `ngrams_between`, `ngrams_in_range`
  - our UDF returns n-grams where `min <= n <= max`
- `to_ngram_list`
  - a little longer than `to_ngrams` and less likely to be confusing
  - `ngrams` creates a list of named structs, but ours simply returns a list of n-grams
---
[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/118
[![Coverage Status](https://coveralls.io/builds/13537452/badge)](https://coveralls.io/builds/13537452)
Coverage increased (+0.03%) to 41.094% when pulling **e4240d5175a2fca0f9e99bc2ade95e24655084df on takuti:HIVEMALL-146-ngrams** into **1e42387576fabbb326d451f4a00ac22d57828711 on apache:master**.
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142323455
--- Diff: core/src/main/java/hivemall/tools/text/WordNgramsUDF.java ---
@@ -0,0 +1,85 @@
+package hivemall.tools.text;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "word_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
+ + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
+@UDFType(deterministic = true, stateful = false)
+public final class WordNgramsUDF extends UDF {
+
+ @Nullable
+ public List<Text> evaluate(@Nullable final List<Text> words, final int minSize,
+ final int maxSize) throws HiveException {
+ if (words == null) {
+ return null;
+ }
+ if (minSize <= 0) {
+ throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
+ }
+ if (minSize > maxSize) {
+ throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
+ + maxSize);
+ }
+ return getNgrams(words, minSize, maxSize);
+ }
+
+ @Nonnull
+ private static List<Text> getNgrams(@Nonnull final List<Text> words,
+ @Nonnegative final int minSize, @Nonnegative final int maxSize) throws HiveException {
+ final List<Text> ngrams = new ArrayList<Text>();
+ for (int i = 0, numWords = words.size(); i < numWords; i++) {
+ for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
+ final int end = i + ngramSize;
+ if (end > numWords) { // exceeds the final element
+ continue;
+ }
+
+ final StringBuilder ngram = new StringBuilder();
+ for (int j = i; j < end; j++) {
+ final Text word = words.get(j);
+ if (word == null) {
+ throw new HiveException(
+ "`array<string> words` must not contain NULL element");
+ }
+ if (j > i) { // insert single whitespace between elements
+ ngram.append(" ");
+ }
+ ngram.append(word.toString());
+ }
+ ngrams.add(new Text(ngram.toString()));
+ }
--- End diff --
@takuti oops... Could you reuse StringBuilder instance as commented?
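One common way to reuse a single `StringBuilder` across iterations is to allocate it once and clear it with `setLength(0)` before building each n-gram. The sketch below shows that pattern; whether it matches exactly what the reviewer had in mind is an assumption, and plain `String` again stands in for Hive's `Text` (the class name is hypothetical).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; plain Strings stand in for Hive's Text type.
final class ReusedBuilderSketch {

    static List<String> ngrams(List<String> words, int minSize, int maxSize) {
        List<String> result = new ArrayList<>();
        StringBuilder buf = new StringBuilder(); // allocated once, outside the loops
        int numWords = words.size();
        for (int i = 0; i < numWords; i++) {
            for (int n = minSize; n <= maxSize; n++) {
                int end = i + n;
                if (end > numWords) { // exceeds the final element
                    continue;
                }
                buf.setLength(0); // clear previous contents, keep the backing array
                for (int j = i; j < end; j++) {
                    if (j > i) { // single space between elements
                        buf.append(' ');
                    }
                    buf.append(words.get(j));
                }
                result.add(buf.toString());
            }
        }
        return result;
    }
}
```

`toString()` copies the current contents, so clearing the builder afterwards does not affect strings already added to the result.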
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142307304
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
--- End diff --
If users want character-level n-grams, the existing `ngrams` function is fine.
This function is for word-level n-grams, and thus `wordgram` seems fine.
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142312467
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
+ + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
+@UDFType(deterministic = true, stateful = false)
+public final class NgramsUDF extends UDF {
+
+ public List<Text> evaluate(final List<Text> words, final int minSize, final int maxSize)
+ throws HiveException {
+ if (words == null) {
+ return null;
+ }
+ if (minSize <= 0) {
+ throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
+ }
+ if (minSize > maxSize) {
+ throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
+ + maxSize);
+ }
+ return getNgrams(words, minSize, maxSize);
+ }
+
+ @Nonnull
+ private List<Text> getNgrams(@Nonnull final List<Text> words, @Nonnegative final int minSize,
+ @Nonnegative final int maxSize) {
+ final List<Text> ngrams = new ArrayList<Text>();
+ for (int i = 0, numWords = words.size(); i < numWords; i++) {
+ for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
+ if (i + ngramSize > numWords) { // exceeds the final element
+ continue;
+ }
+
+ final List<String> ngram = new ArrayList<String>();
+ for (int j = i; j < i + ngramSize; j++) {
+ ngram.add(words.get(j).toString());
+ }
+ ngrams.add(new Text(StringUtils.join(ngram, " ")));
--- End diff --
Good point - `StringUtils.join()` adds a whitespace even for a NULL element.
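The behavior of Hivemall's `StringUtils.join()` is not shown in this thread, so the sketch below uses the JDK's `String.join` purely as a stand-in to show how a null element can leak into joined output, next to a hypothetical guard (`joinStrict`) that rejects nulls up front.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in sketch: String.join is the JDK method, not Hivemall's StringUtils.join.
final class NullElementSketch {

    // Joins words with a single space, rejecting null elements up front.
    static String joinStrict(List<String> words) {
        for (String w : words) {
            if (w == null) {
                throw new IllegalArgumentException("words must not contain null");
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", null, "c");
        // String.join renders a null element as the literal text "null".
        System.out.println(String.join(" ", words)); // prints: a null c
        System.out.println(joinStrict(Arrays.asList("a", "b"))); // prints: a b
    }
}
```

Failing fast on a null element, rather than silently skipping it, is the choice the later `WordNgramsUDF` revision in this thread makes.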
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142065112
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
+ + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
+@UDFType(deterministic = true, stateful = false)
+public final class NgramsUDF extends UDF {
+
+ public List<Text> evaluate(final List<Text> words, final int minSize, final int maxSize)
+ throws HiveException {
+ if (words == null) {
+ return null;
+ }
+ if (minSize <= 0) {
+ throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
+ }
+ if (minSize > maxSize) {
+ throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
+ + maxSize);
+ }
+ return getNgrams(words, minSize, maxSize);
+ }
+
+ @Nonnull
+ private List<Text> getNgrams(@Nonnull final List<Text> words, @Nonnegative final int minSize,
--- End diff --
`private static`
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142069282
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
+ + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
+@UDFType(deterministic = true, stateful = false)
+public final class NgramsUDF extends UDF {
+
+ public List<Text> evaluate(final List<Text> words, final int minSize, final int maxSize)
+ throws HiveException {
+ if (words == null) {
+ return null;
+ }
+ if (minSize <= 0) {
+ throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
+ }
+ if (minSize > maxSize) {
+ throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
+ + maxSize);
+ }
+ return getNgrams(words, minSize, maxSize);
+ }
+
+ @Nonnull
+ private List<Text> getNgrams(@Nonnull final List<Text> words, @Nonnegative final int minSize,
+ @Nonnegative final int maxSize) {
+ final List<Text> ngrams = new ArrayList<Text>();
+ for (int i = 0, numWords = words.size(); i < numWords; i++) {
+ for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
+ if (i + ngramSize > numWords) { // exceeds the final element
+ continue;
+ }
+
+ final List<String> ngram = new ArrayList<String>();
+ for (int j = i; j < i + ngramSize; j++) {
+ ngram.add(words.get(j).toString());
+ }
+ ngrams.add(new Text(StringUtils.join(ngram, " ")));
--- End diff --
Is an empty string added when `i + ngramSize == numWords`?
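To make the boundary case concrete, here is a small sketch of the window selection alone, assuming the same `i + ngramSize > numWords` check as in the diff. The helper name `window` is hypothetical. When `i + n` equals the list size, the window still covers the last `n` words; only a strictly larger end index is skipped.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper isolating the bounds check from the diff above.
final class BoundarySketch {

    // Words covered by the n-gram of size n starting at index i, or an empty
    // list when it would run past the end (the `continue` case in the diff).
    static List<String> window(List<String> words, int i, int n) {
        int end = i + n;
        if (end > words.size()) {
            return Collections.emptyList();
        }
        return words.subList(i, end);
    }
}
```

For `["a", "b", "c"]`, `window(words, 1, 2)` has `end == 3 == numWords` and yields `["b", "c"]`, whereas `window(words, 2, 2)` has `end == 4` and is skipped.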
---
[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/118
@takuti LGTM. Can you squash and merge this PR?
---
[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/118
[![Coverage Status](https://coveralls.io/builds/13521010/badge)](https://coveralls.io/builds/13521010)
Coverage increased (+0.03%) to 41.092% when pulling **6e6de58a36bdc50a205e7c3f3d97f884012761e2 on takuti:HIVEMALL-146-ngrams** into **1e42387576fabbb326d451f4a00ac22d57828711 on apache:master**.
---
[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/118
[![Coverage Status](https://coveralls.io/builds/13537534/badge)](https://coveralls.io/builds/13537534)
Coverage increased (+0.03%) to 41.095% when pulling **e4240d5175a2fca0f9e99bc2ade95e24655084df on takuti:HIVEMALL-146-ngrams** into **1e42387576fabbb326d451f4a00ac22d57828711 on apache:master**.
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142065541
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
--- End diff --
An `ngrams` function already exists in Hive, so the name is confusing.
https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining
How about renaming `to_ngrams` to `wordgram`?
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142308265
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
+ + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
+@UDFType(deterministic = true, stateful = false)
+public final class NgramsUDF extends UDF {
+
+ public List<Text> evaluate(final List<Text> words, final int minSize, final int maxSize)
+ throws HiveException {
+ if (words == null) {
+ return null;
+ }
+ if (minSize <= 0) {
+ throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
+ }
+ if (minSize > maxSize) {
+ throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
+ + maxSize);
+ }
+ return getNgrams(words, minSize, maxSize);
+ }
+
+ @Nonnull
+ private List<Text> getNgrams(@Nonnull final List<Text> words, @Nonnegative final int minSize,
+ @Nonnegative final int maxSize) {
+ final List<Text> ngrams = new ArrayList<Text>();
+ for (int i = 0, numWords = words.size(); i < numWords; i++) {
+ for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
+ if (i + ngramSize > numWords) { // exceeds the final element
+ continue;
+ }
+
+ final List<String> ngram = new ArrayList<String>();
+ for (int j = i; j < i + ngramSize; j++) {
+ ngram.add(words.get(j).toString());
+ }
+ ngrams.add(new Text(StringUtils.join(ngram, " ")));
--- End diff --
I see. Then avoid using `ArrayList`, as follows, and handle null elements in the `words` array.
```java
final int numWords = words.size();
for (int i = 0; i < numWords; i++) {
    for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
        final int end = i + ngramSize;
        if (end > numWords) { // n-gram would run past the final element
            continue;
        }
        final StringBuilder buf = new StringBuilder();
        for (int j = i; j < end; j++) {
            final Text w = words.get(j);
            if (w == null) { // avoid appending "null"; alternatively, throw an exception
                continue;
            }
            if (buf.length() > 0) { // single space between elements
                buf.append(' ');
            }
            buf.append(w.toString());
        }
        ngrams.add(new Text(buf.toString()));
    }
}
```
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/incubator-hivemall/pull/118
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142328137
--- Diff: core/src/main/java/hivemall/tools/text/WordNgramsUDF.java ---
@@ -0,0 +1,85 @@
+package hivemall.tools.text;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "word_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
+ + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
+@UDFType(deterministic = true, stateful = false)
+public final class WordNgramsUDF extends UDF {
+
+ @Nullable
+ public List<Text> evaluate(@Nullable final List<Text> words, final int minSize,
+ final int maxSize) throws HiveException {
+ if (words == null) {
+ return null;
+ }
+ if (minSize <= 0) {
+ throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
+ }
+ if (minSize > maxSize) {
+ throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
+ + maxSize);
+ }
+ return getNgrams(words, minSize, maxSize);
+ }
+
+ @Nonnull
+ private static List<Text> getNgrams(@Nonnull final List<Text> words,
+ @Nonnegative final int minSize, @Nonnegative final int maxSize) throws HiveException {
+ final List<Text> ngrams = new ArrayList<Text>();
+ for (int i = 0, numWords = words.size(); i < numWords; i++) {
+ for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
+ final int end = i + ngramSize;
+ if (end > numWords) { // exceeds the final element
+ continue;
+ }
+
+ final StringBuilder ngram = new StringBuilder();
+ for (int j = i; j < end; j++) {
+ final Text word = words.get(j);
+ if (word == null) {
+ throw new HiveException(
+ "`array<string> words` must not contain NULL element");
+ }
+ if (j > i) { // insert single whitespace between elements
+ ngram.append(" ");
+ }
+ ngram.append(word.toString());
+ }
+ ngrams.add(new Text(ngram.toString()));
+ }
--- End diff --
👍
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142323289
--- Diff: core/src/main/java/hivemall/tools/text/WordNgramsUDF.java ---
@@ -0,0 +1,85 @@
+package hivemall.tools.text;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "word_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
+ + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
+@UDFType(deterministic = true, stateful = false)
+public final class WordNgramsUDF extends UDF {
+
+ @Nullable
+ public List<Text> evaluate(@Nullable final List<Text> words, final int minSize,
+ final int maxSize) throws HiveException {
+ if (words == null) {
+ return null;
+ }
+ if (minSize <= 0) {
+ throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
+ }
+ if (minSize > maxSize) {
+ throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
+ + maxSize);
+ }
+ return getNgrams(words, minSize, maxSize);
+ }
+
+ @Nonnull
+ private static List<Text> getNgrams(@Nonnull final List<Text> words,
+ @Nonnegative final int minSize, @Nonnegative final int maxSize) throws HiveException {
+ final List<Text> ngrams = new ArrayList<Text>();
--- End diff --
final StringBuilder ngram = new StringBuilder();
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142298457
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
--- End diff --
I agree, but at the same time I would like to use an intuitive, well-known term as the name of such a simple function. Let me think.
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142296940
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
+ + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
+@UDFType(deterministic = true, stateful = false)
+public final class NgramsUDF extends UDF {
+
+ public List<Text> evaluate(final List<Text> words, final int minSize, final int maxSize)
+ throws HiveException {
+ if (words == null) {
+ return null;
+ }
+ if (minSize <= 0) {
+ throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
+ }
+ if (minSize > maxSize) {
+ throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
+ + maxSize);
+ }
+ return getNgrams(words, minSize, maxSize);
+ }
+
+ @Nonnull
+ private List<Text> getNgrams(@Nonnull final List<Text> words, @Nonnegative final int minSize,
+ @Nonnegative final int maxSize) {
+ final List<Text> ngrams = new ArrayList<Text>();
+ for (int i = 0, numWords = words.size(); i < numWords; i++) {
+ for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
+ if (i + ngramSize > numWords) { // exceeds the final element
+ continue;
+ }
+
+ final List<String> ngram = new ArrayList<String>();
+ for (int j = i; j < i + ngramSize; j++) {
+ ngram.add(words.get(j).toString());
+ }
+ ngrams.add(new Text(StringUtils.join(ngram, " ")));
--- End diff --
`ngram` will never be empty thanks to `if (i + ngramSize > numWords) continue;`. Notice that the for-loop creating `ngram` increments `j` from `i` to `i + ngramSize - 1`.
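The bound argument above can be checked in isolation. The following is a minimal sketch, not the PR's code: it uses plain `String` instead of Hadoop `Text`, `String.join` instead of `StringUtils.join`, and a hypothetical class name `NgramBoundsSketch`. The explicit size check exists only to make the invariant visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone class; not part of Hivemall.
final class NgramBoundsSketch {

    static List<String> wordNgrams(List<String> words, int minSize, int maxSize) {
        final List<String> ngrams = new ArrayList<String>();
        for (int i = 0, numWords = words.size(); i < numWords; i++) {
            for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
                if (i + ngramSize > numWords) { // window would run past the last word
                    continue;
                }
                // j runs over i, i + 1, ..., i + ngramSize - 1: exactly ngramSize steps.
                final List<String> ngram = new ArrayList<String>();
                for (int j = i; j < i + ngramSize; j++) {
                    ngram.add(words.get(j));
                }
                // With minSize >= 1 (validated by the caller), ngram cannot be empty here.
                if (ngram.size() != ngramSize) {
                    throw new IllegalStateException("unexpected n-gram size: " + ngram.size());
                }
                ngrams.add(String.join(" ", ngram));
            }
        }
        return ngrams;
    }
}
```

For the words `a`, `b`, `c` with `minSize = 1` and `maxSize = 2`, this returns `a`, `a b`, `b`, `b c`, `c`; the size-2 window starting at `c` is skipped by the bound check.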
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142323365
--- Diff: core/src/main/java/hivemall/tools/text/WordNgramsUDF.java ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.tools.text;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "word_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
+ + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
+@UDFType(deterministic = true, stateful = false)
+public final class WordNgramsUDF extends UDF {
+
+ @Nullable
+ public List<Text> evaluate(@Nullable final List<Text> words, final int minSize,
+ final int maxSize) throws HiveException {
+ if (words == null) {
+ return null;
+ }
+ if (minSize <= 0) {
+ throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
+ }
+ if (minSize > maxSize) {
+ throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
+ + maxSize);
+ }
+ return getNgrams(words, minSize, maxSize);
+ }
+
+ @Nonnull
+ private static List<Text> getNgrams(@Nonnull final List<Text> words,
+ @Nonnegative final int minSize, @Nonnegative final int maxSize) throws HiveException {
+ final List<Text> ngrams = new ArrayList<Text>();
+ for (int i = 0, numWords = words.size(); i < numWords; i++) {
+ for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
+ final int end = i + ngramSize;
+ if (end > numWords) { // exceeds the final element
+ continue;
+ }
+
+ final StringBuilder ngram = new StringBuilder();
+ for (int j = i; j < end; j++) {
+ final Text word = words.get(j);
+ if (word == null) {
+ throw new HiveException(
+ "`array<string> words` must not contain NULL element");
+ }
+ if (j > i) { // insert single whitespace between elements
+ ngram.append(" ");
+ }
+ ngram.append(word.toString());
+ }
+ ngrams.add(new Text(ngram.toString()));
+ }
--- End diff --
StringUtils.clear(ngram);
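Read together with the earlier comment that places `final StringBuilder ngram = new StringBuilder();` next to the `ngrams` allocation, this appears to suggest allocating one buffer up front and resetting it for each n-gram. Below is a minimal sketch of that pattern under stated assumptions: it uses plain `String` instead of `Text`, a hypothetical class name `ReusedBuilderSketch`, and `setLength(0)` as a stand-in for `StringUtils.clear`, which is assumed (not verified here) to reset the buffer in the same way. The reset is placed before the buffer is filled rather than after, which has the same effect.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone class; not part of Hivemall.
final class ReusedBuilderSketch {

    static List<String> wordNgrams(List<String> words, int minSize, int maxSize) {
        final List<String> ngrams = new ArrayList<String>();
        final StringBuilder ngram = new StringBuilder(); // allocated once, reused below
        for (int i = 0, numWords = words.size(); i < numWords; i++) {
            for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
                final int end = i + ngramSize;
                if (end > numWords) { // exceeds the final element
                    continue;
                }
                ngram.setLength(0); // assumption: equivalent to StringUtils.clear(ngram)
                for (int j = i; j < end; j++) {
                    if (j > i) { // single whitespace between elements
                        ngram.append(' ');
                    }
                    ngram.append(words.get(j));
                }
                ngrams.add(ngram.toString()); // toString() copies, so reuse is safe
            }
        }
        return ngrams;
    }
}
```

The output is the same as with a fresh `StringBuilder` per n-gram; only the number of buffer allocations changes.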
---
[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/118
[![Coverage Status](https://coveralls.io/builds/13540113/badge)](https://coveralls.io/builds/13540113)
Coverage increased (+0.05%) to 41.109% when pulling **6b467688338c1824f52f507852421ad30222002b on takuti:HIVEMALL-146-ngrams** into **1e42387576fabbb326d451f4a00ac22d57828711 on apache:master**.
---
[GitHub] incubator-hivemall issue #118: [HIVEMALL-146] Yet another UDF to generate n-...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/118
I personally prefer `wordgrams` though.
http://search.cpan.org/dist/Text-WordGrams/lib/Text/WordGrams.pm
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142312317
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
--- End diff --
In addition, I've noticed that, since the results are joined with whitespace, this new UDF cannot be used for character-based n-grams.
Let's call it `word_ngrams`.
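To illustrate the whitespace point, here is a small sketch contrasting the two behaviours. Both methods and the class name `WordVsCharNgramsSketch` are hypothetical; `joinWithSpace` mimics only the joining rule of the UDF for a single size `n`, and `charNgrams` is one possible reading of "character-based n-grams", not something this PR provides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone class; not part of Hivemall.
final class WordVsCharNgramsSketch {

    // Joins each window of n tokens with a single space, as the UDF does.
    static List<String> joinWithSpace(List<String> tokens, int n) {
        final List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    // Character-level n-grams over UTF-16 code units, with no separator.
    static List<String> charNgrams(String s, int n) {
        final List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }
}
```

Passing the characters `a`, `b`, `c` as tokens to `joinWithSpace` with `n = 2` gives `a b` and `b c`, whereas `charNgrams("abc", 2)` gives `ab` and `bc`. The inserted space is what makes the UDF word-oriented.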
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142323566
--- Diff: core/src/main/java/hivemall/tools/text/WordNgramsUDF.java ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.tools.text;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "word_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
+ + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
+@UDFType(deterministic = true, stateful = false)
+public final class WordNgramsUDF extends UDF {
+
+ @Nullable
+ public List<Text> evaluate(@Nullable final List<Text> words, final int minSize,
+ final int maxSize) throws HiveException {
+ if (words == null) {
+ return null;
+ }
+ if (minSize <= 0) {
+ throw new UDFArgumentException("`minSize` must be greater than zero: " + minSize);
+ }
+ if (minSize > maxSize) {
+ throw new UDFArgumentException("`maxSize` must be greater than or equal to `minSize`: "
+ + maxSize);
+ }
+ return getNgrams(words, minSize, maxSize);
+ }
+
+ @Nonnull
+ private static List<Text> getNgrams(@Nonnull final List<Text> words,
+ @Nonnegative final int minSize, @Nonnegative final int maxSize) throws HiveException {
+ final List<Text> ngrams = new ArrayList<Text>();
+ for (int i = 0, numWords = words.size(); i < numWords; i++) {
+ for (int ngramSize = minSize; ngramSize <= maxSize; ngramSize++) {
+ final int end = i + ngramSize;
+ if (end > numWords) { // exceeds the final element
+ continue;
+ }
+
+ final StringBuilder ngram = new StringBuilder();
+ for (int j = i; j < end; j++) {
+ final Text word = words.get(j);
+ if (word == null) {
+ throw new HiveException(
--- End diff --
UDFArgumentException is more appropriate.
---
[GitHub] incubator-hivemall pull request #118: [HIVEMALL-146] Yet another UDF to gene...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142062132
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
+ + " - Returns list of n-grams for given words, where `minSize <= n <= maxSize`")
+@UDFType(deterministic = true, stateful = false)
+public final class NgramsUDF extends UDF {
+
+ public List<Text> evaluate(final List<Text> words, final int minSize, final int maxSize)
--- End diff --
```java
@Nullable
public List<Text> evaluate(@Nullable final List<Text> words, ...
```
---