Posted to dev@pig.apache.org by "Earl Cahill (JIRA)" <ji...@apache.org> on 2008/10/07 07:10:44 UTC

[jira] Created: (PIG-472) load files based on user provided regular expressions

load files based on user provided regular expressions
-----------------------------------------------------

                 Key: PIG-472
                 URL: https://issues.apache.org/jira/browse/PIG-472
             Project: Pig
          Issue Type: New Feature
          Components: data, grunt
            Reporter: Earl Cahill


I want to be able to load files based on regular expressions.  Each group specified in parentheses should end up as a DataAtom, and the list of DataAtoms should end up in a Tuple.
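
A minimal sketch of the idea in plain Java, assuming nothing about the eventual patch: each capturing group of a user-supplied pattern becomes one field, and the fields are collected in order. The class name RegexGroupsSketch and the use of List<String> in place of Pig's DataAtom and Tuple are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the code from the attached patch.
class RegexGroupsSketch {
    // Returns the capturing groups of the first match in `line`, in order,
    // or null when the pattern does not match the line at all.
    static List<String> parse(Pattern pattern, String line) {
        Matcher matcher = pattern.matcher(line);
        if (!matcher.find()) {
            return null;
        }
        List<String> fields = new ArrayList<>();
        // Group 0 is the whole match; the groups written in parentheses start at 1.
        for (int i = 1; i <= matcher.groupCount(); i++) {
            fields.add(matcher.group(i));
        }
        return fields;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\w+)=(\\d+)");
        System.out.println(parse(p, "hits=42")); // prints [hits, 42]
    }
}
```

In a real loader the strings would be wrapped in whatever atom and tuple types the framework provides.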

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-472) load files based on user provided regular expressions

Posted by "Earl Cahill (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Earl Cahill updated PIG-472:
----------------------------

    Attachment:     (was: RegExLoader-PIG-472)

> load files based on user provided regular expressions
> -----------------------------------------------------
>
>                 Key: PIG-472
>                 URL: https://issues.apache.org/jira/browse/PIG-472
>             Project: Pig
>          Issue Type: New Feature
>          Components: data, grunt
>    Affects Versions: 0.1.0
>            Reporter: Earl Cahill
>             Fix For: 0.1.0
>
>
> I want to be able to load files based on regular expressions.  Each group specified in parentheses should end up as a DataAtom, and the list of DataAtoms should end up in a Tuple.



[jira] Updated: (PIG-472) load files based on user provided regular expressions

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-472:
---------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks, Earl, for contributing to Pig.

> load files based on user provided regular expressions
> -----------------------------------------------------
>
>                 Key: PIG-472
>                 URL: https://issues.apache.org/jira/browse/PIG-472
>             Project: Pig
>          Issue Type: New Feature
>          Components: data, grunt
>    Affects Versions: 0.1.0
>            Reporter: Earl Cahill
>             Fix For: 0.1.0
>
>         Attachments: RegExLoader-PIG-472
>
>
> I want to be able to load files based on regular expressions.  Each group specified in parentheses should end up as a DataAtom, and the list of DataAtoms should end up in a Tuple.



[jira] Updated: (PIG-472) load files based on user provided regular expressions

Posted by "Earl Cahill (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Earl Cahill updated PIG-472:
----------------------------

    Attachment: RegExLoader-PIG-472



[jira] Reopened: (PIG-472) load files based on user provided regular expressions

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reopened PIG-472:
----------------------------


In general the patch looks good.  A couple of comments and a question:

1) You need to add the Apache License comment to the header of some of the files.  You put it in some files, but not others.

2) When you submit a patch, mark the JIRA as patch available.  The committer will mark it as resolved when it's checked in.  I'm reopening all three and setting them to patch available.

The question: in RegExLoader.getNext(), you construct a new Matcher for every line.  Would it be faster to construct one Matcher and call reset() on it for each line?
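
For reference, the two approaches can be sketched side by side in plain Java. This is only an illustration with hypothetical names (MatcherReuseSketch and its methods are not part of the patch), and it makes no claim about which variant is faster; it only shows that reusing one Matcher through reset(CharSequence) gives the same results as creating a new one per line. Both methods assume the pattern has at least one capturing group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison for illustration; not taken from RegExLoader.
class MatcherReuseSketch {
    // Variant 1: allocate a fresh Matcher for every input line.
    static List<String> firstGroupsFresh(Pattern pattern, List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                out.add(m.group(1));
            }
        }
        return out;
    }

    // Variant 2: allocate one Matcher up front and point it at each line with reset().
    static List<String> firstGroupsReused(Pattern pattern, List<String> lines) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher("");
        for (String line : lines) {
            m.reset(line); // discards previous match state and switches to the new input
            if (m.find()) {
                out.add(m.group(1));
            }
        }
        return out;
    }
}
```

A Matcher is not safe for concurrent use, so the reused instance should stay private to a single reader.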



[jira] Commented: (PIG-472) load files based on user provided regular expressions

Posted by "Earl Cahill (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639494#action_12639494 ] 

Earl Cahill commented on PIG-472:
---------------------------------

Sounds like a winner.  In a past life, we wouldn't want parsing to end for any reason, so I think logging and moving on sounds great.

Earl







[jira] Commented: (PIG-472) load files based on user provided regular expressions

Posted by "Earl Cahill (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637776#action_12637776 ] 

Earl Cahill commented on PIG-472:
---------------------------------

1) Think I got the Apache License in each file.

2) Sorry, at my work I resolve issues and then our test team closes the issues.

I changed RegExLoader.getNext() to reuse a Matcher for each line, relying on reset.  I didn't do any timings, but the tests still pass.

I deleted the old patch and added the new.  Hope that was the right thing to do.

Thanks for the feedback.



[jira] Commented: (PIG-472) load files based on user provided regular expressions

Posted by "Ian Holsman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638468#action_12638468 ] 

Ian Holsman commented on PIG-472:
---------------------------------

hey guys.

I just noticed that RegExLoader stops when it finds a bad line (one that doesn't match the regex pattern).
I modified the loader so that it skips those bad lines and continues, logging a warning for each one (by changing the 'if' to a 'while').

svn isn't working for me right now, so excuse the paste:

        // Keep reading until a line matches; non-matching lines are logged and skipped.
        while ((line = in.readLine(utf8, recordDel)) != null) {
            // Strip a trailing carriage return left over from CRLF line endings.
            if (line.length() > 0 && line.charAt(line.length() - 1) == '\r')
                line = line.substring(0, line.length() - 1);

            matcher.reset(line);
            if (matcher.find()) {
                ArrayList<Datum> list = new ArrayList<Datum>();

                // Group 0 is the whole match, so the captured groups start at 1.
                for (int i = 1; i <= matcher.groupCount(); i++) {
                    list.add(new DataAtom(matcher.group(i)));
                }
                return new Tuple(list);
            }
            else {
                log.warn("Warning: Line " + line + " did not match the regex. Skipping it");
            }
        }
        // End of input: no more tuples.
        return null;
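
The paste above depends on Pig classes, so here is a self-contained sketch of the same skip-and-continue loop in plain Java. Everything in it is an assumption made for illustration: the class name SkippingReaderSketch, the Iterator<String> standing in for the input stream, the List<String> standing in for a Tuple, and the skipped counter standing in for the log call.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the loader loop; not the actual RegExLoader code.
class SkippingReaderSketch {
    private final Iterator<String> lines;
    private final Matcher matcher;
    int skipped = 0; // number of lines that did not match and were passed over

    SkippingReaderSketch(Iterator<String> lines, Pattern pattern) {
        this.lines = lines;
        this.matcher = pattern.matcher("");
    }

    // Returns the groups of the next matching line, or null once input is exhausted.
    List<String> getNext() {
        // A 'while' rather than an 'if': a bad line no longer ends the read early.
        while (lines.hasNext()) {
            String line = lines.next();
            matcher.reset(line);
            if (matcher.find()) {
                List<String> fields = new ArrayList<>();
                for (int i = 1; i <= matcher.groupCount(); i++) {
                    fields.add(matcher.group(i));
                }
                return fields;
            }
            skipped++; // the real loader would log a warning here
        }
        return null;
    }
}
```

Callers keep invoking getNext() until it returns null, so a non-matching line in the middle of a file costs one skipped record rather than the rest of the input.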





[jira] Updated: (PIG-472) load files based on user provided regular expressions

Posted by "Earl Cahill (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Earl Cahill updated PIG-472:
----------------------------

    Status: Patch Available  (was: Open)

This patch satisfies the requirements of PIG-472.  The patch contains

org.apache.pig.piggybank.storage.RegExLoader
org.apache.pig.piggybank.test.storage.TestHelper
org.apache.pig.piggybank.test.storage.TestRegExLoader


> load files based on user provided regular expressions
> -----------------------------------------------------
>
>                 Key: PIG-472
>                 URL: https://issues.apache.org/jira/browse/PIG-472
>             Project: Pig
>          Issue Type: New Feature
>          Components: data, grunt
>            Reporter: Earl Cahill
>
> I want to be able to load files based on regular expressions.  Each group specified in parentheses should end up as a DataAtom, and the list of DataAtoms should end up in a Tuple.



[jira] Updated: (PIG-472) load files based on user provided regular expressions

Posted by "Earl Cahill (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Earl Cahill updated PIG-472:
----------------------------

    Attachment: RegExLoader-PIG-472

Not sure if this is exactly where to put this, but here you go.  Attached is my patch.

> load files based on user provided regular expressions
> -----------------------------------------------------
>
>                 Key: PIG-472
>                 URL: https://issues.apache.org/jira/browse/PIG-472
>             Project: Pig
>          Issue Type: New Feature
>          Components: data, grunt
>            Reporter: Earl Cahill
>         Attachments: RegExLoader-PIG-472
>
>
> I want to be able to load files based on regular expressions.  Each group specified in parentheses should end up as a DataAtom, and the list of DataAtoms should end up in a Tuple.



[jira] Updated: (PIG-472) load files based on user provided regular expressions

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-472:
---------------------------

        Fix Version/s: 0.1.0
    Affects Version/s: 0.1.0
               Status: Patch Available  (was: Reopened)



[jira] Updated: (PIG-472) load files based on user provided regular expressions

Posted by "Earl Cahill (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Earl Cahill updated PIG-472:
----------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Handled by the attached patch.



[jira] Commented: (PIG-472) load files based on user provided regular expressions

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639490#action_12639490 ] 

Alan Gates commented on PIG-472:
--------------------------------

Earl, thoughts on this proposed change?  Was your intention that a line that doesn't match the regex means the file is bad and all processing should stop?  Or should it just throw out that line and continue?  I'd think the latter.
