Posted to user@crunch.apache.org by Nipur Patodi <er...@gmail.com> on 2015/07/06 20:47:39 UTC
Multiple output from crunch
Hi All,
I am very new to Crunch.
I am trying to read data from a CSV file using MR pipelines. I need to
convert and bucketize this data on the basis of the timestamp, which is a
field in the CSV. I need to write the data for each timestamp into a single file.
This scenario is equivalent to writing the values (records) for each key (the
timestamp) to a different file.
I can achieve this using the multiple output format in MapReduce.
Do we have any equivalent concept or design pattern to achieve the same
behavior using Crunch?
Please suggest.
Thanks,
_Nipur
Re: Multiple output from crunch
Posted by Josh Wills <jw...@cloudera.com>.
Not right now, no. Is the intent that the output here will go into Hive
partitions?
On Mon, Jul 6, 2015 at 11:57 AM, Nipur Patodi <er...@gmail.com>
wrote:
> Thanks Much Josh,
>
> Do we have something for avro parquet file also?
>
> Thanks,
>
> _Nipur
--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
Re: Multiple output from crunch
Posted by Nipur Patodi <er...@gmail.com>.
Thanks much, Josh.
Do we have something for Avro Parquet files also?
Thanks,
_Nipur
Re: Multiple output from crunch
Posted by Josh Wills <jw...@cloudera.com>.
Hey Nipur,
AvroPathPerKeyTarget is the closest thing to what you want; you can use it
on a PTable<String, T> collection, where T is any type that Avro supports.
It will write multiple output files to a common base directory where the
name of the file depends on the value of the String key in the PTable.
Josh
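For illustration only, here is a minimal plain-Java sketch of the keying step described above: each CSV record is keyed by its timestamp field, and the records for each key are written to their own file under a common base directory. It uses only the JDK and is not Crunch code; in a Crunch pipeline the keyed collection would be the PTable<String, T> mentioned above, and the writing would be handled by AvroPathPerKeyTarget. The class name, the timestamp column index, and the ".csv" file naming are assumptions made for this sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PerKeyOutputSketch {

    // Assumption: the timestamp is the first comma-separated field of each record.
    static final int TIMESTAMP_COLUMN = 0;

    /** Extracts the bucket key (the timestamp field) from one CSV line. */
    static String keyOf(String csvLine) {
        // Naive split: does not handle quoted fields that contain commas.
        return csvLine.split(",", -1)[TIMESTAMP_COLUMN].trim();
    }

    /** Groups records by key, a plain-Java stand-in for a table keyed by String. */
    static Map<String, List<String>> bucketize(List<String> csvLines) {
        Map<String, List<String>> buckets = new TreeMap<>();
        for (String line : csvLines) {
            buckets.computeIfAbsent(keyOf(line), k -> new ArrayList<>()).add(line);
        }
        return buckets;
    }

    /** Writes one file per key under baseDir; the file name is derived from the key. */
    static void writePerKey(Map<String, List<String>> buckets, Path baseDir) throws IOException {
        Files.createDirectories(baseDir);
        for (Map.Entry<String, List<String>> e : buckets.entrySet()) {
            // Assumption: keys are safe to use as file names (no '/' or ':' characters).
            Files.write(baseDir.resolve(e.getKey() + ".csv"), e.getValue());
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Arrays.asList(
            "2015-07-06,a,1",
            "2015-07-07,b,2",
            "2015-07-06,c,3");
        Map<String, List<String>> buckets = bucketize(lines);
        Path out = Files.createTempDirectory("per-key-output");
        writePerKey(buckets, out);
        System.out.println("Wrote " + buckets.size() + " files to " + out);
    }
}
```

With the three sample records in main, two buckets result, one per distinct date. Keeping all records for a key together before writing mirrors the usual advice for per-key outputs, since it avoids opening many files at once.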