Posted to user@crunch.apache.org by David Ortiz <dp...@gmail.com> on 2016/11/21 14:58:15 UTC

Best way to imitate Hive Partition

Hello,

     I am working on a Crunch pipeline where the output is going to be read
by subsequent Hive jobs.  I want to partition it by the timezone contained
in the data records.  What is the best way to support this in Crunch?

     From the googling I did, it looked like one approach would be to write
the data out into a PTable keyed by the timezone, then use the
AvroPathPerKeyTarget.  However, from what I can tell, this only works if I
am writing Avro output.  Is there similar functionality available for
Parquet output?
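
     In case it helps, here is roughly the shape of what I was picturing for
the Avro case. This is only a sketch: the Event class, its getTimezone()
field, and the output path are made-up stand-ins for my actual types.

import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.io.avro.AvroPathPerKeyTarget;
import org.apache.crunch.types.avro.Avros;
import org.apache.hadoop.fs.Path;

public class PartitionByTimezone {

  // "Event" stands in for my generated Avro record class, and getTimezone()
  // for the field I want to partition on.
  public static void writeByTimezone(PCollection<Event> events, String outputDir) {
    // Key every record by its timezone so that each distinct key becomes
    // its own output directory.
    PTable<String, Event> byTz = events.by(
        new MapFn<Event, String>() {
          @Override
          public String map(Event e) {
            return e.getTimezone().toString();
          }
        },
        Avros.strings());

    // AvroPathPerKeyTarget writes each key's records to a separate
    // subdirectory under outputDir, e.g. outputDir/UTC/part-*.avro.
    byTz.write(new AvroPathPerKeyTarget(new Path(outputDir)));
  }
}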

     Alternatively, is there a better way to do this?  I imagine I could
filter the collection for each timezone, but that doesn't seem like it
would be an efficient way to bucket the data.

Thanks,
     Dave

Re: Best way to imitate Hive Partition

Posted by David Ortiz <dp...@gmail.com>.
That looks like exactly what I'm after. Even better, it looks like it was
backported to 0.11.0 in CDH 5.7.

Thanks!
     Dave

On Mon, Nov 21, 2016 at 11:34 AM Josh Wills <jo...@gmail.com> wrote:

There is an AvroParquetPathPerKeyTarget, IIRC; I'm on my phone at the
moment, so I can't check the docs, but it's still the best option available.

Re: Best way to imitate Hive Partition

Posted by Josh Wills <jo...@gmail.com>.
There is an AvroParquetPathPerKeyTarget, IIRC; I'm on my phone at the
moment, so I can't check the docs, but it's still the best option available.
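
From memory, the swap would look roughly like the sketch below. It is
untested, I may have the package or constructor slightly off, and Event
and byTz are just stand-ins for the types from the original question.

import org.apache.crunch.PTable;
import org.apache.crunch.io.parquet.AvroParquetPathPerKeyTarget;
import org.apache.hadoop.fs.Path;

public class ParquetByKey {

  // "Event" is whatever Avro record type the table values use; byTz is the
  // timezone-keyed PTable from the earlier sketch.
  public static void writeParquetByTimezone(PTable<String, Event> byTz, String outputDir) {
    // Same per-key directory layout as AvroPathPerKeyTarget, but the files
    // under each key are written as Parquet rather than Avro.
    byTz.write(new AvroParquetPathPerKeyTarget(new Path(outputDir)));
  }
}

On the Hive side, each outputDir/<timezone>/ directory should then be
usable as a partition location (e.g. via ALTER TABLE ... ADD PARTITION
... LOCATION).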
On Mon, Nov 21, 2016 at 6:59 AM David Ortiz <dp...@gmail.com> wrote:

> Hello,
>
>      I am working on a Crunch pipeline where the output is going to be
> read by subsequent Hive jobs.  I want to partition it by the timezone
> contained in the data records.  What is the best way to support this in
> Crunch?
>
>      From the googling I did, it looked like one approach would be to
> write the data out into a PTable keyed by the timezone, then use the
> AvroPathPerKeyTarget.  However, from what I can tell, this only works if I
> am writing Avro output.  Is there similar functionality available for
> Parquet output?
>
>      Alternatively, is there a better way to do this?  I imagine I could
> filter the collection for each timezone, but that doesn't seem like it
> would be an efficient way to bucket the data.
>
> Thanks,
>      Dave
>