Posted to dev@pig.apache.org by "Gerrit Jansen van Vuuren (JIRA)" <ji...@apache.org> on 2010/07/30 12:13:17 UTC

[jira] Created: (PIG-1526) HiveColumnarLoader Partitioning Support

HiveColumnarLoader Partitioning Support
---------------------------------------

                 Key: PIG-1526
                 URL: https://issues.apache.org/jira/browse/PIG-1526
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.8.0
            Reporter: Gerrit Jansen van Vuuren
            Assignee: Gerrit Jansen van Vuuren
             Fix For: 0.8.0



I've made a lot of improvements to the HiveColumnarLoader:
-> Added support for LoadMetadata and data path partitioning
-> Improved and simplified column loading

Data Path Partitioning:

Hive stores partitions as folders, like /mytable/partition1=[value]/partition2=[value]. That is, the table mytable contains 2 partitions [partition1, partition2].
The HiveColumnarLoader will scan the input path /mytable and add the columns partition1 and partition2 to the Pig schema.
These columns can then be used in filtering.
For example: we've got year, month, day and hour partitions in our data uploads.
So a table might look like mytable/year=2010/month=02/day=01.
Loading with the HiveColumnarLoader allows our Pig scripts to filter by date using the standard Pig FILTER operator.
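
As a rough illustration of the idea (not the code in the patch), the partition columns can be read straight off the directory names. The class and method names below are hypothetical, and the sketch assumes the path points at a directory, so that every key=value component is a partition:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; this is not the PathPartitioner class from the patch. */
class PartitionPathSketch {

    /** Extracts key=value directory components from a path, in path order. */
    static Map<String, String> parsePartitions(String path) {
        Map<String, String> partitions = new LinkedHashMap<>();
        for (String component : path.split("/")) {
            int eq = component.indexOf('=');
            // Only components of the form key=value count as partition directories.
            if (eq > 0 && eq < component.length() - 1) {
                partitions.put(component.substring(0, eq), component.substring(eq + 1));
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        // prints {year=2010, month=02, day=01}
        System.out.println(parsePartitions("/mytable/year=2010/month=02/day=01"));
    }
}
```

Each key found this way would become an extra column in the schema, with the directory value repeated for every row loaded from that directory.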

I've added 2 classes for this:
-> PathPartitioner
-> PathPartitionHelper

These classes are not Hive dependent and could be used by any other loader that wants to support partitioning; they also help with implementing the LoadMetadata interface.
For this reason I thought it best to put them into the package org.apache.pig.piggybank.storage.partition.
It would be nice if, in the future, PigStorage also used these 2 classes to provide automatic path partitioning support.
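
One way any loader might put path-derived partition values to use is to skip whole directories before reading data. The sketch below is an assumption about how that could look in plain Java; it does not use the Pig or Hadoop APIs, and PartitionPruningSketch, partitionsOf and prune are hypothetical names, not the PathPartitionHelper API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch; this is not the PathPartitionHelper class from the patch. */
class PartitionPruningSketch {

    /** Reads key=value directory components from a path. */
    static Map<String, String> partitionsOf(String path) {
        Map<String, String> partitions = new LinkedHashMap<>();
        for (String component : path.split("/")) {
            int eq = component.indexOf('=');
            if (eq > 0 && eq < component.length() - 1) {
                partitions.put(component.substring(0, eq), component.substring(eq + 1));
            }
        }
        return partitions;
    }

    /** Keeps only the directories whose partition values satisfy the filter. */
    static List<String> prune(List<String> directories, Predicate<Map<String, String>> filter) {
        List<String> kept = new ArrayList<>();
        for (String directory : directories) {
            if (filter.test(partitionsOf(directory))) {
                kept.add(directory);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> directories = Arrays.asList(
                "/mytable/year=2010/month=01/day=31",
                "/mytable/year=2010/month=02/day=01",
                "/mytable/year=2009/month=02/day=01");
        // Similar in spirit to filtering on year == '2010' and month == '02'.
        // prints [/mytable/year=2010/month=02/day=01]
        System.out.println(prune(directories,
                p -> "2010".equals(p.get("year")) && "02".equals(p.get("month"))));
    }
}
```

In a real loader the filter would come from Pig rather than a hand-written lambda; the predicate here only stands in for that step.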




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Gerrit Jansen van Vuuren (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894868#action_12894868 ] 

Gerrit Jansen van Vuuren commented on PIG-1526:
-----------------------------------------------

Hi, 

I've made the above changes to this patch, but the logic and classes remain the same.
If you're OK with the tests passing, this is OK to commit from my side.

I've been testing this new release for the last 48 hours with continuously running queries, and haven't seen any new errors or bugs appear.

I ran the tests manually on my local machine and they all passed.





[jira] Commented: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894087#action_12894087 ] 

Hadoop QA commented on PIG-1526:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12450900/PIG-1526.patch
  against trunk revision 980276.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/367/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/367/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/367/console

This message is automatically generated.



[jira] Updated: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1526:
----------------------------

    Attachment: PIG-1526-fix.patch

Hi, Gerrit,
TestHiveColumnarLoader fails due to an OOM on my machine; I increased the heap size in build.xml to solve it.

I also found that TestPathPartitionHelper and TestPathPartitioner do not work if I have a hadoop-site file in the classpath, so I added the following code to deal with it:

{code}
File oldConf = new File(System.getProperty("user.home")+"/pigtest/conf/hadoop-site.xml");
oldConf.delete();
{code}

Please take a look at the attached patch; if it is OK, I will commit it.

Thanks



[jira] Commented: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894719#action_12894719 ] 

Olga Natkovich commented on PIG-1526:
-------------------------------------

Is this ready to be committed?



[jira] Updated: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Gerrit Jansen van Vuuren (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerrit Jansen van Vuuren updated PIG-1526:
------------------------------------------

    Status: Patch Available  (was: Open)
      Tags: PIG-1526.patch



[jira] Commented: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896458#action_12896458 ] 

Daniel Dai commented on PIG-1526:
---------------------------------

Fix committed. Thanks.



[jira] Updated: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Gerrit Jansen van Vuuren (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerrit Jansen van Vuuren updated PIG-1526:
------------------------------------------

       Priority: Minor  (was: Major)



[jira] Updated: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Gerrit Jansen van Vuuren (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerrit Jansen van Vuuren updated PIG-1526:
------------------------------------------

    Attachment: TestHiveColumnarLoader.java
                TestPathPartitioner.java
                TestPathPartitionHelper.java

I've attached the 3 test source files.




[jira] Updated: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Gerrit Jansen van Vuuren (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerrit Jansen van Vuuren updated PIG-1526:
------------------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896117#action_12896117 ] 

Daniel Dai commented on PIG-1526:
---------------------------------

Hi, Gerrit, 
The Piggybank tests TestHiveColumnarLoader, TestPathPartitionHelper and TestPathPartitioner fail. Can you take a look? I will temporarily drop these test cases from trunk until they are fixed.

Thanks



[jira] Updated: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Gerrit Jansen van Vuuren (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerrit Jansen van Vuuren updated PIG-1526:
------------------------------------------

    Attachment: PIG-1526.patch



[jira] Updated: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Gerrit Jansen van Vuuren (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerrit Jansen van Vuuren updated PIG-1526:
------------------------------------------

    Status: Patch Available  (was: Open)
      Tags: PIG-1526-2.patch  (was: PIG-1526.patch)

> HiveColumnarLoader Partitioning Support
> ---------------------------------------
>
>                 Key: PIG-1526
>                 URL: https://issues.apache.org/jira/browse/PIG-1526
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Gerrit Jansen van Vuuren
>            Assignee: Gerrit Jansen van Vuuren
>            Priority: Minor
>             Fix For: 0.8.0
>
>         Attachments: PIG-1526-2.patch, PIG-1526.patch
>
>



[jira] Commented: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Gerrit Jansen van Vuuren (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896248#action_12896248 ] 

Gerrit Jansen van Vuuren commented on PIG-1526:
-----------------------------------------------

Looks good. Thanks.

> HiveColumnarLoader Partitioning Support
> ---------------------------------------
>
>                 Key: PIG-1526
>                 URL: https://issues.apache.org/jira/browse/PIG-1526
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Gerrit Jansen van Vuuren
>            Assignee: Gerrit Jansen van Vuuren
>            Priority: Minor
>             Fix For: 0.8.0
>
>         Attachments: PIG-1526-2.patch, PIG-1526-fix.patch, PIG-1526.patch, TestHiveColumnarLoader.java, TestPathPartitioner.java, TestPathPartitionHelper.java
>
>



[jira] Commented: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894933#action_12894933 ] 

Hadoop QA commented on PIG-1526:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12451115/PIG-1526-2.patch
  against trunk revision 980930.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/369/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/369/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/369/console

This message is automatically generated.

> HiveColumnarLoader Partitioning Support
> ---------------------------------------
>
>                 Key: PIG-1526
>                 URL: https://issues.apache.org/jira/browse/PIG-1526
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Gerrit Jansen van Vuuren
>            Assignee: Gerrit Jansen van Vuuren
>            Priority: Minor
>             Fix For: 0.8.0
>
>         Attachments: PIG-1526-2.patch, PIG-1526.patch
>
>



[jira] Updated: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Gerrit Jansen van Vuuren (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerrit Jansen van Vuuren updated PIG-1526:
------------------------------------------

    Attachment: PIG-1526-2.patch

The previous patch did not use the UDFContext signature, which caused the partition keys and expression to be overwritten if the loader was used for more than one table. That is fixed now.
Also added filtering of hidden files to PathPartitionHelper, i.e. files or directories starting with "_" are now ignored.
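
The two fixes can be sketched in plain Java without depending on Pig. Everything below is an assumption made for illustration: the class SignatureScopedStateSketch, the property name "partition.keys" and the signatures "tableA"/"tableB" are hypothetical, and the map merely stands in for the per-signature properties that Pig's UDFContext provides to a loader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical illustration only; not the code from PIG-1526-2.patch.
public class SignatureScopedStateSketch {

    // One Properties object per loader signature, so two LOAD statements
    // using the same loader class do not overwrite each other's state.
    private static final Map<String, Properties> STATE = new HashMap<>();

    public static Properties propertiesFor(String signature) {
        return STATE.computeIfAbsent(signature, s -> new Properties());
    }

    /** Hidden entries are those whose name starts with "_", as in the comment above. */
    public static boolean isHidden(String name) {
        return name.startsWith("_");
    }

    public static void main(String[] args) {
        propertiesFor("tableA").setProperty("partition.keys", "year,month,day");
        propertiesFor("tableB").setProperty("partition.keys", "country");
        // tableA still sees its own keys after tableB stored different ones.
        System.out.println(propertiesFor("tableA").getProperty("partition.keys"));
        System.out.println(isHidden("_logs"));
    }
}
```

Keying state by class alone would reproduce the bug described above: the second table's partition keys would replace the first's.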


> HiveColumnarLoader Partitioning Support
> ---------------------------------------
>
>                 Key: PIG-1526
>                 URL: https://issues.apache.org/jira/browse/PIG-1526
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Gerrit Jansen van Vuuren
>            Assignee: Gerrit Jansen van Vuuren
>            Priority: Minor
>             Fix For: 0.8.0
>
>         Attachments: PIG-1526-2.patch, PIG-1526.patch
>
>



[jira] Updated: (PIG-1526) HiveColumnarLoader Partitioning Support

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1526:
--------------------------------

        Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Patch committed to trunk. Thanks, Gerrit!

> HiveColumnarLoader Partitioning Support
> ---------------------------------------
>
>                 Key: PIG-1526
>                 URL: https://issues.apache.org/jira/browse/PIG-1526
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Gerrit Jansen van Vuuren
>            Assignee: Gerrit Jansen van Vuuren
>            Priority: Minor
>             Fix For: 0.8.0
>
>         Attachments: PIG-1526-2.patch, PIG-1526.patch
>
>
