Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:00:26 UTC

[jira] [Updated] (SPARK-20028) Implement NGrams aggregate function

     [ https://issues.apache.org/jira/browse/SPARK-20028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-20028:
---------------------------------
    Labels: bulk-closed  (was: )

> Implement NGrams aggregate function
> -----------------------------------
>
>                 Key: SPARK-20028
>                 URL: https://issues.apache.org/jira/browse/SPARK-20028
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Chenzhao Guo
>            Priority: Major
>              Labels: bulk-closed
>
> This is an implementation of the `ngrams` aggregate expression, which Hive also provides. It uses the n-gram concept from natural language processing to analyze text.
> Spark currently does not support the Hive UDAF GenericUDAFnGrams, so this is a missing feature.
> An n-gram is a contiguous subsequence of n item(s) drawn from a given sequence. This expression finds the k most frequent n-grams from one or more sequences. 
> The expression has the form ngrams(children: Array[Array[String]] (or Array[String]), n: Int, k: Int, accuracy: Int). It can be used in conjunction with `sentences` to split a String column into arrays. The parameters are:
> Children is the 'given sequence' from which n-grams are collected;
> N is the number of elements in each n-gram: size 1 is referred to as a "unigram", size 2 is a "bigram", size 3 is a "trigram", and so on;
> K is the number of most frequent n-grams to return (the top k);
> Accuracy controls the memory used for frequency estimation; more memory gives more accurate frequency counts.
> A simple example:
> `SELECT ngrams(array("abc", "abc", "bcd", "abc", "bcd"), 2, 4);` returns
> `[{["abc","bcd"]:2.0}, 
> {["abc","abc"]:1.0}, 
> {["bcd","abc"]:1.0}]`. The input contains four 2-grams: `["abc", "abc"], ["abc", "bcd"], ["bcd", "abc"], ["abc", "bcd"]`. `["abc", "bcd"]` occurs twice and the other two 2-grams occur once each; `["abc","abc"]` sorts alphabetically before `["bcd","abc"]`, which gives the order shown.
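
The counting and ordering described in the example can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `top_k_ngrams` is hypothetical, the counts are exact (so there is no `accuracy` parameter and no approximate frequency estimation), and it is not the Hive or Spark implementation.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_k_ngrams(
    sequences: Sequence[Sequence[str]], n: int, k: int
) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the k most frequent n-grams across all sequences (exact counts).

    Hypothetical helper for illustration; not part of Spark or Hive.
    """
    counts: Counter = Counter()
    for seq in sequences:
        # Slide a window of width n over each sequence; an n-gram never
        # spans two sequences. Sequences shorter than n contribute nothing.
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    # Highest count first; ties are broken by the n-gram's lexicographic order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


result = top_k_ngrams([["abc", "abc", "bcd", "abc", "bcd"]], n=2, k=4)
print(result)
# [(('abc', 'bcd'), 2), (('abc', 'abc'), 1), (('bcd', 'abc'), 1)]
```

Although k is 4, only three distinct 2-grams exist in the input, so three entries come back, matching the example above.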



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org