You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by "Jay Tang (JIRA)" <ji...@apache.org> on 2009/06/04 20:29:07 UTC

[jira] Created: (PIG-833) Storage access layer

Storage access layer
--------------------

                 Key: PIG-833
                 URL: https://issues.apache.org/jira/browse/PIG-833
             Project: Pig
          Issue Type: New Feature
            Reporter: Jay Tang


A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-833) Storage access layer

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-833:
---------------------------

    Attachment: TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt

Okay, now that I've first built Pig's test, I run the tests and I get:

{code}
 [delete] Deleting directory /Users/gates/src/pig/apache/top/zebra/trunk/build/contrib/zebra/test/logs
    [mkdir] Created dir: /Users/gates/src/pig/apache/top/zebra/trunk/build/contrib/zebra/test/logs
    [junit] Running org.apache.hadoop.zebra.io.TestCheckin
    [junit] Tests run: 125, Failures: 0, Errors: 0, Time elapsed: 16.894 sec
    [junit] Running org.apache.hadoop.zebra.mapred.TestCheckin
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 158.741 sec
    [junit] Running org.apache.hadoop.zebra.pig.TestCheckin1
    [junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.13 sec
    [junit] Test org.apache.hadoop.zebra.pig.TestCheckin1 FAILED
    [junit] Running org.apache.hadoop.zebra.pig.TestCheckin2
    [junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.131 sec
    [junit] Test org.apache.hadoop.zebra.pig.TestCheckin2 FAILED
    [junit] Running org.apache.hadoop.zebra.pig.TestCheckin3
    [junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.133 sec
    [junit] Test org.apache.hadoop.zebra.pig.TestCheckin3 FAILED
    [junit] Running org.apache.hadoop.zebra.pig.TestCheckin4
    [junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.128 sec
    [junit] Test org.apache.hadoop.zebra.pig.TestCheckin4 FAILED
    [junit] Running org.apache.hadoop.zebra.pig.TestCheckin5
    [junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.128 sec
    [junit] Test org.apache.hadoop.zebra.pig.TestCheckin5 FAILED
    [junit] Running org.apache.hadoop.zebra.types.TestCheckin
    [junit] Tests run: 45, Failures: 0, Errors: 0, Time elapsed: 0.253 sec
{code}

I've attached the output from one of the tests.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716462#action_12716462 ] 

Jeff Hammerbacher commented on PIG-833:
---------------------------------------

You may want to see the Hive project, where a columnar storage format has been developed and benchmarked: https://issues.apache.org/jira/browse/HIVE-352.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744323#action_12744323 ] 

Jeff Hammerbacher commented on PIG-833:
---------------------------------------

Hey,

Raghu, you mention that a design document is forthcoming. It would be great to have a PDF design document, like Matei's for the fair scheduler, in addition to the Javadoc and wiki page. Any progress on that front? I'm quite interested in learning more about Zebra's use and implementation.

On a larger note, it would be great if Pig moved to the Hadoop model for new features, where a design document and test plan is required to commit. See https://issues.apache.org/jira/browse/HADOOP-5587. It's tough to digest the bulk dumps of Owl, Zebra, and Giraffe, though we certainly appreciate the work Yahoo has done on these projects!

Thanks,
Jeff

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736405#action_12736405 ] 

Zheng Shao commented on PIG-833:
--------------------------------

Hive has such a storage access layer. It consists of Hadoop File Format (which provides row-level access via RecordReader), and SerDe and ObjectInspector (which get the column values from the row).
Because of that, we were able to integrate RCFile (the columnar storage file format) fair easily (NOTE: in the columnar storage case, the row is just a handle).  We did look at TFile at that time, but it was not complete yet so we implemented RCFile by modifying SequenceFile instead of TFile.

Experiments have shown that Hive SerDe/ObjectInspector has a pretty good performance because we always reuse the same objects for different rows. 

Most of the SerDe/ObjectInspector code is in hive/trunk/serde. There is a set of slides on http://www.slideshare.net/zshao/hive-object-model
If the Pig team is interested in taking a deeper look at Hive's SerDe/ObjectInspector infrastructure, I am happy to provide more information.


> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (PIG-833) Storage access layer

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-833.
----------------------------

       Resolution: Fixed
    Fix Version/s: 0.4.0

Patch was checked in a while ago.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>             Fix For: 0.4.0
>
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736964#action_12736964 ] 

Jeff Hammerbacher commented on PIG-833:
---------------------------------------

Hey Raghu,

Good stuff! Do you guys have any internal benchmarks that you could add to the docs on design and usage?

Thanks,
Jeff

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Jing Huang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745125#action_12745125 ] 

Jing Huang commented on PIG-833:
--------------------------------

Zebra supports int, long, float, double, bool, collection (equivalent to Pig Bag), map, record (equivalent to Pig Tuple), string, bytes (equivalent to Pig Bytearray)

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716470#action_12716470 ] 

Hong Tang commented on PIG-833:
-------------------------------

Jeff, just like the SQL effort, the space of columnar storage is also wide open, and I think it is more beneficial to the overall healthy of the hadoop ecosystem.

With that being said, I also looked at the patch attached with HIVE-352. It appears that what the patch does is a level below our stated objectives. Specifically, the guts of the implementation (RCFile) is very close in spirit to TFile as described HADOOP-3315, which seems to have its first comprehensive patch back in December 2008. 

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742100#action_12742100 ] 

Alan Gates commented on PIG-833:
--------------------------------

Patch checked in.  All the unit tests passed.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742170#action_12742170 ] 

Dmitriy V. Ryaboy commented on PIG-833:
---------------------------------------

Alan, this means Pig contrib/ is no longer compatible with Hadoop 18.
Which probably means that you need to either rolls this back or roll 660 in (and add the hadoop20.jar file to lib/ )
Otherwise the build is broken.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716471#action_12716471 ] 

Jeff Hammerbacher commented on PIG-833:
---------------------------------------

Hey Hong,

I never mentioned SQL or an ecosystem in my comment, but thanks for your observation. I was simply referring to the existence of a fairly detailed discussion in a related subproject that the Pig team may not have been following. I'll add an additional one here: https://issues.apache.org/jira/browse/HIVE-279 addresses the predicate pushdown feature.

Regards,
Jeff 

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-833) Storage access layer

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated PIG-833:
-----------------------------

    Attachment: PIG-833-zebra.patch.bz2

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742069#action_12742069 ] 

Raghu Angadi commented on PIG-833:
----------------------------------

Alan, in order to run unit tests you need to build pig test-core.

As mentioned in the instructions above please run {{'ant -Dtestcase=none test-core'}} under top level directory before running 'ant test' under contrib/zebra.


> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-833) Storage access layer

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated PIG-833:
-----------------------------

    Attachment: PIG-833-zebra.patch

The first cut of contrib/zebra. The patch is very large and should probably compress the subsequent versions of it.

More documentation on design and usage will be added to the jira.

How to compile :
----------------------
 * check out latest PIG trunk
 * Apply the latest patch from PIG-660
 * copy attached hadoop20.jar to ./lib
 * run '{{ant jar}}' (and {{'ant -Dtestcase=none test-core'}} for zebra tests).
 * cd contrib/zebra
 * ant jar
 * ant test (for tests).

Currently there are compile time deprecation warnings related to use of deprecated mapred API (JobConf). There is will be fixed later.


> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-833) Storage access layer

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated PIG-833:
-----------------------------

    Attachment: zebra-javadoc.tgz

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Jing Huang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750093#action_12750093 ] 

Jing Huang commented on PIG-833:
--------------------------------

Hi Yongqiang, 
Sorry for the late reply. I was out of town last week. 
Right, SF_F is not defined in the schema, query a none-existing column is allowed and it will return null.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742093#action_12742093 ] 

Alan Gates commented on PIG-833:
--------------------------------

My bad.  I missed the line in the instructions where it said to apply the PIG-660 patch.  I applied that and am trying again.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745250#action_12745250 ] 

He Yongqiang commented on PIG-833:
----------------------------------

Thanks Jing.
I now have a better understand of schema and columngroups.
What the projection and partition are used for?

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744361#action_12744361 ] 

Raghu Angadi commented on PIG-833:
----------------------------------


will try to get some initial docs attached to this jira asap. I think the current plan is to have proper wiki pages (and attached here). This is part of the reason by we would like to keep this jira open.

The bulk initial dump is certainly not desirable but has been fairly common for many contrib projects in Hadoop. A bit of rush to get this committed to contrib is in part to avoid such large changes going again. The longer we delay larger the patch is going to get. We want to get the subsequent patches and discussions to public jira asap and we are already doing that.

I would like to clarify that this is not a PIG feature but rather a contrib project. We would not want this commit to be generalized for PIG commits. All the responsibility is with Zebra team. This patch is the initial verion. It does include many tests. 






> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Jing Huang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745758#action_12745758 ] 

Jing Huang commented on PIG-833:
--------------------------------

Zebra supports vertical partition, meaning the rows of table can be splitted into columns and according to user specification(i.e. STR_STORAGE), Zebra will store columns into column groups.   Column Group is written in Tfile.
For excample, 
final static String STR_STORAGE = 
      "[s1, s2]; [m1#{a}]; [r1.f1]; [s3, s4, r2.r3.f3]; [s5, s6, m2#{x|y}];  " +
      "[r1.f2, m1#{b}]; [r2.r3.f4, m2#{z}]";

each [ ] is a column group. so, in this case, column group 0 will contain s1 and s2. 

Projection is  a view of table. Say, if your projection is something like:
 String projection = new String("s1,s3");
Zebra will load you date of s1 s3 (in this case, stitch ColumnGroup0 and ColumnGroup3 )

This design is mainly for performance improvement. This is specially useful for the users who are only interested in certain columns of the data instead of the whole row. 

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742435#action_12742435 ] 

Raghu Angadi commented on PIG-833:
----------------------------------

>  this means Pig contrib/ is no longer compatible with Hadoop 18.

This is not desirable and expected to be temporary until PIG-660 is committed. PIG-660 has other dependencies different schedule. We thought committing zebra will make zebra builds and subsequent patches easier if it is committed. 

As such PIG does not build contrib from top level ('ant test-contrib' is a no-op). So each contrib project needs to be build explicitly anyway. This is different from Hadoop build. This this patch should not fail existing automated builds.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Jay Tang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742201#action_12742201 ] 

Jay Tang commented on PIG-833:
------------------------------

Zebra has a dependency on TFile that is available in Hadoop 20; that's why the compilation instruction is more complicated.  A new wiki at http://wiki.apache.org/pig/zebra will provide more information on Zebra.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745049#action_12745049 ] 

He Yongqiang commented on PIG-833:
----------------------------------

Can add more description/explain in this jira or wiki page about usage etc, such as schma format, storage format, projection, and partition?

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-833) Storage access layer

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated PIG-833:
-----------------------------

    Attachment: PIG-833-zebra.patch.bz2

Updated patch. Only change is that ant prints a descriptive error to user if hadoop20.jar does not exist in top level lib directory. It lists basic steps to get this built until PIG-660 is committed.


> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745219#action_12745219 ] 

Raghu Angadi commented on PIG-833:
----------------------------------

Thanks Jing. There are some PIG examples listed at the bottom of Zebra wiki : http://wiki.apache.org/pig/zebra (wiki is still under construction).

Just listing java strings in Jing's comment with out Jira formatting :

{noformat}
final static String STR_SCHEMA = 
     "s1:bool, s2:int, s3:long, s4:float, s5:string, s6:bytes, " +
     "r1:record(f1:int, f2:long), r2:record(r3:record(f3:float, f4)), " +
     "m1:map(string),m2:map(map(int)), c:collection(f13:double, f14:float, f15:bytes)";

final static String STR_STORAGE = 
      "[s1, s2]; [m1#{a}]; [r1.f1]; [s3, s4, r2.r3.f3]; [s5, s6, m2#{x|y}];  " +
      "[r1.f2, m1#{b}]; [r2.r3.f4, m2#{z}]";
{noformat}

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742318#action_12742318 ] 

Hudson commented on PIG-833:
----------------------------

Integrated in Pig-trunk #520 (See [http://hudson.zones.apache.org/hudson/job/Pig-trunk/520/])
    : Added Zebra, new columnar storage mechanism for HDFS.


> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-833) Storage access layer

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated PIG-833:
-----------------------------

    Attachment: hadoop20.jar.bz2

Attaching hadoop20.jar that needs to be placed under lib/ directory under the top level PIG directory. will included specific instructions later in the jira.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (PIG-833) Storage access layer

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-833:
------------------------------

    Assignee: Raghu Angadi

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>            Assignee: Raghu Angadi
>             Fix For: 0.4.0
>
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-833) Storage access layer

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-833:
---------------------------

    Attachment: test.out

When I run ant test in contrib/zebra, I get failures.  I've attached the output of the command.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742083#action_12742083 ] 

Dmitriy V. Ryaboy commented on PIG-833:
---------------------------------------

Alan -- if it's not finding .dfs , it's probably not linking hadoop20.jar

Try my patch in 660 :-)

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745770#action_12745770 ] 

He Yongqiang commented on PIG-833:
----------------------------------

Thanks Jing.
Yeah, i know the design of column groups and projection.
The reason i was asking is that i saw an usage in line 251 TestBasicTable.java:
{noformat}
doReadWrite(path, 2, 100, "SF_a,SF_b,SF_c,SF_d,SF_e", "[SF_a,SF_b,SF_c];[SF_d,SF_e]", "SF_f,SF_a,SF_c,SF_d", true, false);
{noformat}
where  "SF_f,SF_a,SF_c,SF_d" is passed as projection, but is there a column "SF_f" defined?

btw, can you give more detail about the design of Partition? ColumnGroup is much like projection in C-Store, so it can be more easily to be understood.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Jing Huang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745140#action_12745140 ] 

Jing Huang commented on PIG-833:
--------------------------------

Here is an example for different data types:
final static String STR_SCHEMA = "s1:bool, s2:int, s3:long, s4:float, s5:string, s6:bytes, r1:record(f1:int, f2:long), r2:record(r3:record(f3:float, f4)), m1:map(string),m2:map(map(int)), c:collection(f13:double, f14:float, f15:bytes)";

final static String STR_STORAGE = "[s1, s2]; [m1#{a}]; [r1.f1]; [s3, s4, r2.r3.f3]; [s5, s6, m2#{x|y}]; [r1.f2, m1#{b}]; [r2.r3.f4, m2#{z}]";
  
On schema side, s1, s2....s6 are simple data type.   m2:map(map(int)): meaning m2 is a map of map. m2's value is a map and this inner map's value is a int type.   (key is always string type)
 r2:record(r3:record(f3:float, f4)): meaning r2 is a record with one field which is a record (r3). r3 is a record with two fields: f3: float and f4: default type bytes.

On storage side, i.e  [m1#{a}] meaning map m1 with key 'a' in this column group.  [s5, s6, m2#{x|y}] meaning s5, s6 and map m2 with key 'x' or 'y' in this column group. [r2.r3.f4, m2#{z}] meaning record r2's record r3 with field f4 and map m2 2ith key 'z' in this column group.

 

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716578#action_12716578 ] 

Hong Tang commented on PIG-833:
-------------------------------

@Jeff, I might have read a bit too much into the lines. I was referring to your comments on PIG-824.

I also took a look at HIVE-279, the predicate push down mentioned there is actually a bit different from what we plan to do. HIVE-279 pushes predicates to map phase, whereby we would like to implement a storage layer where we can filter out bytes while reading data from disk.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736998#action_12736998 ] 

Raghu Angadi commented on PIG-833:
----------------------------------

There will be benchmark results either attached to this jira or to a subsequent jira.

I would like to compare to SequenceFiles and the new format in Hive. Should to see on par performance.

Major performance benefits come from commonly used projections (through column groups) and map side joins of sorted tables. An important part of motivation is some features like column security, ability to delete entire columns. 

We are running some larger scale benchmarks internally.. but these run on Yahoo's internal data sources.


> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745052#action_12745052 ] 

He Yongqiang commented on PIG-833:
----------------------------------

By schema format, i mean the string used to define column names and column types. What kind of data type zebra support? what's the format? It will be great if there are some example and explaination.
By storage format, i mean the string used to define column groups, such as "[r.f12, f1, m#{b}]; [m#{a}, r.f11]" in TestBasicTableMapSplits.java.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Amr Awadallah (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742321#action_12742321 ] 

Amr Awadallah commented on PIG-833:
-----------------------------------

I am out of office until Aug 14th. I will be checking my email
intermittently. If this is urgent then please call my cell phone,
otherwise I will reply to your email when I get back.

Thanks for your patience,

-- amr


> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-833) Storage access layer

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736424#action_12736424 ] 

Raghu Angadi commented on PIG-833:
----------------------------------


Will surely look at Hive's storage layer and SerDe. I will be able to better comment on specifics  once I get better handle. In the mean while I will attach the work that is already been done on Zebra. 

This is currently a contrib in PIG. Based on these experiences we could probably provide a common storage layer more widely suitable for multiple Hadoop related projects.

> Storage access layer
> --------------------
>
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2
>
>
> A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code.  This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata.  Eventually it could also support predicate pushdown for further performance improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.