Posted to dev@pig.apache.org by "Jonathan Coveney (JIRA)" <ji...@apache.org> on 2013/02/22 15:42:12 UTC

[jira] [Commented] (PIG-3190) Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization

    [ https://issues.apache.org/jira/browse/PIG-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584312#comment-13584312 ] 

Jonathan Coveney commented on PIG-3190:
---------------------------------------

Can you throw this in RB? Either way, some initial comments...

1. ExecException is a subclass of IOException...why do you just catch and rethrow it?
2. On the TokenStream stream declaration you have too many parens.
3. I personally don't like functions whose output aliases depend on the alias of an input unless it really makes sense. IMHO this is not one of those cases. I'd just nix it and use the @OutputSchema annotation.

Many of these apply to the SnowballTokenizer as well.
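
To illustrate point 1, here is a minimal, self-contained sketch. It is not taken from the patch: the nested ExecException is a hypothetical stand-in for Pig's class of the same name (kept local so the example compiles without Pig on the classpath), and readField, execWithRethrow, execWithoutRethrow, and outcomeOf are made-up names. The assumption is that the UDF's exec method is already declared as "throws IOException", in which case a catch block that only rethrows an ExecException changes nothing.

```java
import java.io.IOException;

public class RethrowSketch {
    // Hypothetical stand-in for Pig's ExecException, which is a subclass of IOException.
    static class ExecException extends IOException {
        ExecException(String message) {
            super(message);
        }
    }

    // Stand-in for a call that is declared to throw ExecException (for example, reading a tuple field).
    static String readField(String field) throws ExecException {
        if (field == null) {
            throw new ExecException("field is null");
        }
        return field;
    }

    // Redundant: the catch block rethrows exactly what would have propagated anyway.
    static String execWithRethrow(String field) throws IOException {
        try {
            return readField(field).toLowerCase();
        } catch (ExecException e) {
            throw e;
        }
    }

    // Equivalent behavior: "throws IOException" already covers ExecException.
    static String execWithoutRethrow(String field) throws IOException {
        return readField(field).toLowerCase();
    }

    // Demo helper: returns the result, or the exception's type and message.
    static String outcomeOf(String field, boolean useRethrow) {
        try {
            return useRethrow ? execWithRethrow(field) : execWithoutRethrow(field);
        } catch (IOException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Both variants yield the same result on success and the same exception on failure.
        System.out.println(outcomeOf("Hello", true));
        System.out.println(outcomeOf("Hello", false));
        System.out.println(outcomeOf(null, true));
        System.out.println(outcomeOf(null, false));
    }
}
```

Callers see the same exception type and message either way, so the try/catch can simply be dropped.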
                
> Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization
> ---------------------------------------------------------------------------
>
>                 Key: PIG-3190
>                 URL: https://issues.apache.org/jira/browse/PIG-3190
>             Project: Pig
>          Issue Type: Bug
>          Components: internal-udfs
>    Affects Versions: 0.11
>            Reporter: Russell Jurney
>            Assignee: Russell Jurney
>             Fix For: 0.12
>
>         Attachments: PIG-3190-2.patch, PIG-3190.patch
>
>
> TOKENIZE is literally useless. The Lucene Standard/Snowball tokenizers, as used by varaha, are much more useful for actual tasks: https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java
