Posted to dev@hive.apache.org by "Mayank Lahiri (JIRA)" <ji...@apache.org> on 2010/07/22 23:25:50 UTC

[jira] Created: (HIVE-1481) ngrams() UDAF for estimating top-k n-gram frequencies

ngrams() UDAF for estimating top-k n-gram frequencies
-----------------------------------------------------

                 Key: HIVE-1481
                 URL: https://issues.apache.org/jira/browse/HIVE-1481
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
    Affects Versions: 0.7.0
            Reporter: Mayank Lahiri
            Assignee: Mayank Lahiri
             Fix For: 0.7.0


[ngrams|http://en.wikipedia.org/wiki/N-gram] are fixed-length subsequences of longer sequences. This patch adds a new ngrams() UDAF that heuristically estimates the top-k most frequent n-grams in a set of sequences.

_Example_: *top bigrams in natural language text*

Say you have a column of movie or product reviews from users, expressed as natural-language strings, and you want to find the most frequent word pairs. First, pipe the text through the sentences() UDF from HIVE-1438, which tokenizes natural language text into an array of sentences, where each sentence is an array of words.

SELECT sentences("I hated this movie. I hated watching it and this movie made me unhappy.") FROM reviews;

_gives_:

[  ["I", "hated", "this", "movie"], ["I", "hated", "watching", "it", "and", "this", "movie", "made", "me", "unhappy"] ]
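(For illustration only, not part of the patch: the tokenization above can be sketched in Python, assuming a simple split on sentence-ending punctuation rather than the actual Hive tokenizer.)

```python
import re

def sentences(text):
    """Rough Python analogue of Hive's sentences() tokenizer:
    split text into sentences, then each sentence into words."""
    # Split on runs of sentence-ending punctuation; drop empty fragments.
    parts = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    # Tokenize each sentence into word-like tokens (letters, digits, apostrophes).
    return [re.findall(r"[\w']+", s) for s in parts]

print(sentences("I hated this movie. I hated watching it and this movie made me unhappy."))
```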

SELECT ngrams(sentences("I hated this movie. I hated watching it and this movie made me unhappy."), 2, 5) FROM reviews;

_gives the *5* most frequent *2-grams*_:
[ { ngram: ["I", "hated"] , estfrequency: 2 },
  { ngram: ["this", "movie"], estfrequency: 2},
  { ngram: ["hated", "this"], estfrequency: 1},
  { ngram: ["hated", "watching"], estfrequency: 1},
  { ngram: ["made", "me"], estfrequency: 1} ]


It can also be used to find common sequences of URL accesses, for example, or n-grams in any data that can be represented as sequences of strings. More examples will be put up on a separate wiki page after this UDAF is fully developed.

The algorithm is a heuristic. For relatively small values of k, in the range of 10-1000, it appears to perform well: frequency estimates come within 5% of the true counts, and always undercount. Again, more results will be posted on a separate wiki page.
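To make the flavor of such a heuristic concrete, here is a rough Python sketch of one bounded-memory approach, a Misra-Gries style counter (an assumption for illustration, not necessarily the algorithm in this patch): when the counter map exceeds a fixed capacity, every count is decremented and zeroed entries are dropped, which is why surviving estimates can only undercount true frequencies.

```python
def topk_ngrams(sequences, n, k, capacity=1000):
    """Estimate the k most frequent n-grams across sequences using a
    bounded counter: once the map exceeds `capacity`, every count is
    decremented and zeroed entries are dropped, so any surviving
    count is a lower bound on the true frequency (undercounting)."""
    counts = {}
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            gram = tuple(seq[i:i + n])
            counts[gram] = counts.get(gram, 0) + 1
            if len(counts) > capacity:
                # Decrement all counts; drop any that reach zero.
                counts = {g: c - 1 for g, c in counts.items() if c > 1}
    top = sorted(counts.items(), key=lambda kv: -kv[1])[:k]
    return [{"ngram": list(g), "estfrequency": c} for g, c in top]
```

On the review example above, the two bigrams ["I", "hated"] and ["this", "movie"] come out with estfrequency 2, followed by three of the singleton bigrams.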

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1481) ngrams() UDAF for estimating top-k n-gram frequencies

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-1481:
-----------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
      Resolution: Fixed

Committed.  Thanks Mayank!




[jira] Updated: (HIVE-1481) ngrams() UDAF for estimating top-k n-gram frequencies

Posted by "Mayank Lahiri (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayank Lahiri updated HIVE-1481:
--------------------------------

    Attachment: HIVE-1481.1.patch

(1) Added a new test data file with natural language English text from Project Gutenberg.
(2) Added the ngrams() UDAF and associated test cases.

Will upload some experimental results showing the heuristic's performance relative to exact computation of n-gram frequencies.



[jira] Commented: (HIVE-1481) ngrams() UDAF for estimating top-k n-gram frequencies

Posted by "Mayank Lahiri (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892990#action_12892990 ] 

Mayank Lahiri commented on HIVE-1481:
-------------------------------------

Nice catch. :)



[jira] Commented: (HIVE-1481) ngrams() UDAF for estimating top-k n-gram frequencies

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892987#action_12892987 ] 

John Sichi commented on HIVE-1481:
----------------------------------

+1.  Will commit when tests pass.




[jira] Commented: (HIVE-1481) ngrams() UDAF for estimating top-k n-gram frequencies

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892986#action_12892986 ] 

John Sichi commented on HIVE-1481:
----------------------------------

I like the fact that now if I am ever stuck on a desert island with nothing but my laptop, I will be able to read some Metamorphosis in the Hive codebase.  :)




[jira] Updated: (HIVE-1481) ngrams() UDAF for estimating top-k n-gram frequencies

Posted by "Mayank Lahiri (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayank Lahiri updated HIVE-1481:
--------------------------------

    Status: Patch Available  (was: Open)
