You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Cheolsoo Park (JIRA)" <ji...@apache.org> on 2013/02/28 03:43:14 UTC

[jira] [Comment Edited] (PIG-3142) Fixed-width load and store functions for the Piggybank

    [ https://issues.apache.org/jira/browse/PIG-3142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589116#comment-13589116 ] 

Cheolsoo Park edited comment on PIG-3142 at 2/28/13 2:42 AM:
-------------------------------------------------------------

[~jpacker], thank you very much for the patch! This looks very useful. I have a few minor comments:
* Can you move the following code to prepareToRead() and remove requiredFieldsInitialized? The comment in LoadFunc says "This (prepareToRead) will be called during execution before any calls to getNext", so I think this kind of initialization should be done there not in getNext().
{code:title=FixedWidthLoader.java}
+        if (!requiredFieldsInitialized) {
+            UDFContext udfc = UDFContext.getUDFContext();
+            Properties p = udfc.getUDFProperties(this.getClass(), new String[] { udfContextSignature });
+            requiredFields = (boolean[]) ObjectSerializer.deserialize(p.getProperty(REQUIRED_FIELDS_SIGNATURE));
+
+            if (requiredFields != null) {
+                numRequiredFields = 0;
+                for (int i = 0; i < requiredFields.length; i++) {
+                    if (requiredFields[i])
+                        numRequiredFields++;
+                }
+            }
+            
+            requiredFieldsInitialized = true;
+        }
{code}
In fact, I found almost all LoadFunc implementations (including PigStorage) do the same trick. Looking at the commit history, I believe that this is a legacy pattern that was introduced before the LoadFunc resign. It seems that we never cleaned up this pattern when merging the LoadFunc redesign to trunk. At least, let's stop using it from now on. 
* Can you update the usage messages with Apache Pig namespace?
{code:title=FixedWidthLoader.java}
"Usage: com.mortardata.pig.FixedWidthLoader(" +
{code}
Also, the usage message in FixedWidthStorage incorrectly refers to FixedWidthLoader.
{code:title=FixedWidthStorage.java}
"Usage: com.mortardata.pig.FixedWidthLoader(" +
{code}
* Can you factor out parseColumnSpec(String spec) and reuse it in both FixedWidthLoader and FixedWidthStorage instead have duplicate code?
* Can you rename FixedWidthStorage to FixedWidthStorer? I think Storage refers to Loader + Storer (e.g. PigStorage, AvroStorage, etc). So FixedWidthStorer is probably a better name (e.g. HCatStorer).

Please let me know what you think. Thanks!
                
      was (Author: cheolsoo):
    [~jpacker], thank you very much for the patch! This looks very useful. I have a few minor comments:
* Can you move the following code to prepareToRead() and remove requiredFieldsInitialized? The comment in LoadFunc says "This will be called during execution before any calls to getNext()", so I think this kind of initialization should be done there not in getNext().
{code:title=FixedWidthLoader.java}
+        if (!requiredFieldsInitialized) {
+            UDFContext udfc = UDFContext.getUDFContext();
+            Properties p = udfc.getUDFProperties(this.getClass(), new String[] { udfContextSignature });
+            requiredFields = (boolean[]) ObjectSerializer.deserialize(p.getProperty(REQUIRED_FIELDS_SIGNATURE));
+
+            if (requiredFields != null) {
+                numRequiredFields = 0;
+                for (int i = 0; i < requiredFields.length; i++) {
+                    if (requiredFields[i])
+                        numRequiredFields++;
+                }
+            }
+            
+            requiredFieldsInitialized = true;
+        }
{code}
In fact, I found almost all LoadFunc implementations (including PigStorage) do the same trick. Looking at the commit history, I believe that this is a legacy pattern that was introduced before the LoadFunc resign. It seems that we never cleaned up this pattern when merging the LoadFunc redesign to trunk. At least, let's stop using it from now on. 
* Can you update the usage messages with Apache Pig namespace?
{code:title=FixedWidthLoader.java}
"Usage: com.mortardata.pig.FixedWidthLoader(" +
{code}
Also, the usage message in FixedWidthStorage incorrectly refers to FixedWidthLoader.
{code:title=FixedWidthStorage.java}
"Usage: com.mortardata.pig.FixedWidthLoader(" +
{code}
* Can you factor out parseColumnSpec(String spec) and reuse it in both FixedWidthLoader and FixedWidthStorage instead have duplicate code?
* Can you rename FixedWidthStorage to FixedWidthStorer? I think Storage refers to Loader + Storer (e.g. PigStorage, AvroStorage, etc). So FixedWidthStorer is probably a better name (e.g. HCatStorer).

Please let me know what you think. Thanks!
                  
> Fixed-width load and store functions for the Piggybank
> ------------------------------------------------------
>
>                 Key: PIG-3142
>                 URL: https://issues.apache.org/jira/browse/PIG-3142
>             Project: Pig
>          Issue Type: New Feature
>          Components: piggybank
>    Affects Versions: 0.11
>            Reporter: Jonathan Packer
>         Attachments: fixed-width.patch, fixed-width-updated.patch
>
>
> Adds load/store functions for fixed width data to the Piggybank. They use the syntax of the unix "cut" command to specify column positions, and have an option to skip the header row when loading or to write a header row when storing.
> The header handling works properly with multiple small files each with a header being combined into one split, or a large file with a single header being split into multiple splits.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira