Posted to user@crunch.apache.org by Nipur Patodi <er...@gmail.com> on 2015/08/05 07:49:25 UTC

multiple output format with crunch pipeline

Hi All,

I am trying to write the contents of a PGroupedTable to multiple output files,
based on the PGroupedTable's key. I know we have AvroPathPerKeyTarget for Avro
objects.
Do we have something equivalent for Pair<String, String>?

Please suggest.

Thanks,

_Nipur

Re: multiple output format with crunch pipeline

Posted by Josh Wills <jw...@cloudera.com>.

Yeah, that we don't have right now. Writing custom Targets for this is
do-able (https://issues.apache.org/jira/browse/CRUNCH-555 ) but it isn't
super-fun.

J


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: multiple output format with crunch pipeline

Posted by Nipur Patodi <er...@gmail.com>.

Hey Josh,

Thanks, AvroPathPerKeyTarget works just fine. But I am looking for something
equivalent to it for Writable objects.

Thanks much,

_Nipur


Re: multiple output format with crunch pipeline

Posted by Josh Wills <jw...@cloudera.com>.

So that seems fine, although we just now added the support for creating
child directories in the keys:
https://issues.apache.org/jira/browse/CRUNCH-543

Are you running into a problem using the AvroPathPerKeyTarget as the output
of that table once you've called ungroup() on it?
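
For readers unfamiliar with ungroup(): the question above suggests the target is applied to the flat table rather than to the grouped one. The plain-Java sketch below only illustrates that flattening step on in-memory data. It does not use the Crunch API; the class and method names (UngroupSketch, ungroup, demo) are hypothetical, and the sample keys and values are the ones from Nipur's example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: mimics the effect of ungrouping on plain collections.
class UngroupSketch {

    /** Flattens (key, [v1, v2, ...]) into (key, v1), (key, v2), ... in encounter order. */
    public static List<Map.Entry<String, String>> ungroup(Map<String, List<String>> grouped) {
        List<Map.Entry<String, String>> flat = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            for (String value : entry.getValue()) {
                flat.add(new SimpleEntry<>(entry.getKey(), value));
            }
        }
        return flat;
    }

    /** Builds the grouped data from the thread's example and flattens it. */
    public static List<Map.Entry<String, String>> demo() {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        grouped.put("/root/test", Arrays.asList("data1", "data2"));
        grouped.put("/root/test2", Arrays.asList("data3", "data4"));
        return ungroup(grouped);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> pair : demo()) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Each flat (key, value) pair still carries the path-like key, which is what a path-per-key target would need in order to route the record.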


Re: multiple output format with crunch pipeline

Posted by Nipur Patodi <er...@gmail.com>.

Hey Josh,

Apologies for adding to the confusion.

The flow should be something like this:

    Pipeline pipeline = new MRPipeline(AggregatorDriver.class, conf);

    PCollection<String> record = pipeline.read(
        From.textFile(inputPath, WritableTypeFamily.getInstance().strings()));

    PTable<String, String> outputTable = record.parallelDo(
        new Processor(), Writables.tableOf(Writables.strings(), Writables.strings()));

    PGroupedTable<String, String> groupTable = outputTable.groupByKey();

    // Need to implement
    FilePathPerKeyTarget target = new FilePathPerKeyTarget(path);

    pipeline.write(groupTable, target, WriteMode.APPEND);
    PipelineResult result = pipeline.done();





Re: multiple output format with crunch pipeline

Posted by Nipur Patodi <er...@gmail.com>.

Hey Josh,

I want to write the output of a PGroupedTable<String, String> to multiple files,
where the file path is the key of the PGroupedTable.
For example, given PGroupedTable<String, String> table =

  [/root/test,  {data1, data2}],
  [/root/test2, {data3, data4}]

the output should be

$ hadoop fs -cat /root/test/part-m-00000
data1
data2

$ hadoop fs -cat /root/test2/part-m-00000
data3
data4


Thanks,

_Nipur
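
To make the requested layout concrete, here is a minimal plain-Java sketch that writes each key's values into a directory named after the key, on the local filesystem. It is not a Crunch Target and uses no Crunch or Hadoop classes. The names PathPerKeySketch, writePerKey, dirFor and demo are hypothetical, the part-m-00000 file name is simply copied from the example above, and stripping the leading slash from the key is an assumption made so that output stays under a temporary root directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: local-filesystem version of "one output path per key".
class PathPerKeySketch {

    private static final String PART_FILE = "part-m-00000";

    /** Maps a key such as "/root/test" to <root>/root/test (leading slashes stripped). */
    public static Path dirFor(Path root, String key) {
        return root.resolve(key.replaceFirst("^/+", ""));
    }

    /** Writes each key's values, one per line, to <root>/<key>/part-m-00000. */
    public static void writePerKey(Path root, Map<String, List<String>> grouped) {
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            Path dir = dirFor(root, entry.getKey());
            try {
                Files.createDirectories(dir);
                Files.write(dir.resolve(PART_FILE), entry.getValue(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Writes the thread's example into a temp directory and reads each file back. */
    public static Map<String, List<String>> demo() {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        grouped.put("/root/test", Arrays.asList("data1", "data2"));
        grouped.put("/root/test2", Arrays.asList("data3", "data4"));
        try {
            Path root = Files.createTempDirectory("path-per-key");
            writePerKey(root, grouped);
            Map<String, List<String>> readBack = new LinkedHashMap<>();
            for (String key : grouped.keySet()) {
                Path file = dirFor(root, key).resolve(PART_FILE);
                readBack.put(key, Files.readAllLines(file, StandardCharsets.UTF_8));
            }
            return readBack;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach((key, lines) -> System.out.println(key + " -> " + lines));
    }
}
```

A real per-key target for Writable types would have to do the equivalent routing inside the job's output format, which is the part the thread describes as not yet available.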




Re: multiple output format with crunch pipeline

Posted by Josh Wills <jw...@cloudera.com>.

Hey Nipur,

I'm not quite sure what you mean: do you want to output a PTable<String,
String> via an AvroPathPerKeyTarget? Or a PTable<String, Pair<String,
String>>?

J
