Posted to dev@pig.apache.org by "Jonathan Coveney (JIRA)" <ji...@apache.org> on 2012/06/10 21:14:43 UTC

[jira] [Updated] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen

     [ https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Coveney updated PIG-2632:
----------------------------------

    Attachment: PIG-2632-3.patch

Oh snap! An update! Indeed. The code and logic are much cleaner, and it ~works~. Well, mostly. To turn it on, set the key "pig.schematuple" to true. In distributed mode, the generated code is shipped via the distributed cache. I have some documentation, and will work on adding more. I think it's at a point where it could use some eyes.
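
As a rough sketch of what that switch amounts to from Java: the only name taken from this comment is the key "pig.schematuple". The class SchemaTupleConfigSketch, the isEnabled helper, and the off-by-default behaviour are assumptions for illustration, not code from the patch, and how the Properties object reaches Pig is left out.

```java
import java.util.Properties;

// Hypothetical helper, not part of the patch.
final class SchemaTupleConfigSketch {
    // The key named in the comment above.
    static final String KEY = "pig.schematuple";

    // Assumption: the feature is off unless the key is explicitly set to "true".
    static boolean isEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(KEY, "false"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println("before: " + isEnabled(props));
        props.setProperty(KEY, "true");
        System.out.println("after: " + isEnabled(props));
    }
}
```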

One big issue is that you currently can't group on a SchemaTuple. This is actually a known limitation, and there are comments in Pig that UDFs have to emit a Tuple. Frustratingly, even though TupleFactory has a "tupleRawComparatorClass," it only works with default Tuples, and this assumption is baked into the code. There are a couple of ways to deal with this...

1) Change the default Tuple comparator so that it works with any sort of tuple.
2) Make it so that Tuples can return an instance of the factory that generated them, which could then be used to get the proper comparator... or something like that. The general idea being better (and by better I mean not-nonexistent) support for custom implementations of core types.

I think 2 is the way to go: as things are, 1 will not be easy to do and would share a lot in common with 2. 2 is nontrivial, but it will open the door to the big one I'm leading up to: a bag that is schema aware. That is when the gains go from great to massive.
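
To make option 2 concrete, here is a minimal sketch under stated assumptions. None of these types are Pig's: SketchTuple, SketchTupleFactory, IntLongTuple, IntLongTupleFactory and FactoryAwareSort are hypothetical stand-ins, and the (int, long) schema is made up. The point is only that the caller never names a tuple implementation; it asks a tuple for its factory and the factory for a comparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not Pig's Tuple or TupleFactory.
interface SketchTuple {
    int size();
    Object get(int index);
    SketchTupleFactory factory(); // option 2: a tuple can report the factory that made it
}

interface SketchTupleFactory {
    Comparator<SketchTuple> comparator(); // the factory knows how to order its own tuples
}

// Roughly what a generated tuple for the schema (int, long) might look like: primitive fields.
final class IntLongTuple implements SketchTuple {
    final int f0;
    final long f1;

    IntLongTuple(int f0, long f1) { this.f0 = f0; this.f1 = f1; }

    public int size() { return 2; }

    public Object get(int index) {
        switch (index) {
            case 0: return f0; // boxed only when accessed generically
            case 1: return f1;
            default: throw new IndexOutOfBoundsException("index " + index);
        }
    }

    public SketchTupleFactory factory() { return IntLongTupleFactory.INSTANCE; }
}

final class IntLongTupleFactory implements SketchTupleFactory {
    static final IntLongTupleFactory INSTANCE = new IntLongTupleFactory();

    private IntLongTupleFactory() { }

    public Comparator<SketchTuple> comparator() {
        // Compares the primitive fields directly, first field first.
        return (a, b) -> {
            IntLongTuple x = (IntLongTuple) a;
            IntLongTuple y = (IntLongTuple) b;
            int c = Integer.compare(x.f0, y.f0);
            return c != 0 ? c : Long.compare(x.f1, y.f1);
        };
    }
}

final class FactoryAwareSort {
    // Assumes every tuple in the list came from the same factory.
    static List<SketchTuple> sorted(List<SketchTuple> tuples) {
        List<SketchTuple> copy = new ArrayList<>(tuples);
        if (!copy.isEmpty()) {
            copy.sort(copy.get(0).factory().comparator());
        }
        return copy;
    }

    public static void main(String[] args) {
        List<SketchTuple> input = new ArrayList<>();
        input.add(new IntLongTuple(2, 10L));
        input.add(new IntLongTuple(1, 99L));
        input.add(new IntLongTuple(1, 5L));
        for (SketchTuple t : sorted(input)) {
            System.out.println(t.get(0) + "," + t.get(1));
        }
    }
}
```

The same lookup would in principle let grouping code pick a comparator per tuple implementation instead of assuming the default one, which is the assumption described above as baked in.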

Another next step is to get joins to work with this, but I want to deal with the above issues first.
                
> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
>                 Key: PIG-2632
>                 URL: https://issues.apache.org/jira/browse/PIG-2632
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11
>
>         Attachments: PIG-2632-0.patch, PIG-2632-1.patch, PIG-2632-3.patch, PIG-2632-3.patch, schematuple benchmarking.pptx
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (this depends heavily on the data, though). Need to do get/set tests, but assuming that it's on par with (or even faster than) Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDFs. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.
