Posted to user@crunch.apache.org by Sankash Shankar <sa...@wealthfront.com> on 2015/06/22 22:17:31 UTC

How to write a generic transform method that will act upon generated avro objects in a generic fashion

Hello.

I am writing a Crunch job that takes in an arbitrary class that extends
SpecificRecord and performs a transformation on the fields in the class. I
am attempting to write a parallelDo function on these classes, but

public static PCollection<String> function(PCollection<? extends SpecificRecord> coll) {
  coll.parallelDo(new DoFn<? extends SpecificRecord, String>() {
    ...
  }, Avros.strings());
}

will not compile, since it expects a concrete type at compile time, while

public static PCollection<String> transformAvroToCsv(PCollection<SpecificRecord> coll) {
  coll.parallelDo(new DoFn<SpecificRecord, String>() {
    @Override
    public void process(SpecificRecord input, Emitter<String> emitter) {
    }
  }, Avros.strings());
  return null;
}

will fail at run time because SpecificRecord does not have an init
constructor.

What is the standard way to take in generic Avro records and call a generic
transform method on them?

Thanks.

Re: How to write a generic transform method that will act upon generated avro objects in a generic fashion

Posted by Sankash Shankar <sa...@wealthfront.com>.
The problem was solved by David's GenericAvroFunction solution.
Thanks again.

On Tue, Jun 23, 2015 at 1:57 AM, Josh Wills <jo...@gmail.com> wrote:

> Hey Sankash,
>
> I don't understand a couple of things here:
>
> 1) The init() error in SpecificRecord from your original email: I could
> see that sort of thing being a problem if you were trying to create a
> PType<SpecificRecord> vs. a PType<SomeImplOfSpecificRecord>, but I don't
> get why it would be a problem in defining an ordinary DoFn.
> 2) Why David's suggestion of GenericAvroFunction<T extends
> SpecificRecordBase> wouldn't be serializable.
>
> J

Re: How to write a generic transform method that will act upon generated avro objects in a generic fashion

Posted by Josh Wills <jo...@gmail.com>.
Hey Sankash,

I don't understand a couple of things here:

1) The init() error in SpecificRecord from your original email: I could see
that sort of thing being a problem if you were trying to create a
PType<SpecificRecord> vs. a PType<SomeImplOfSpecificRecord>, but I don't
get why it would be a problem in defining an ordinary DoFn.
2) Why David's suggestion of GenericAvroFunction<T extends
SpecificRecordBase> wouldn't be serializable.

J
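
On point 1, one plausible reading of the "init constructor" failure is reflective instantiation: code that builds record instances from a Class object needs a no-argument constructor, and an interface such as SpecificRecord has none. The sketch below reproduces that situation with plain JDK reflection only. RecordLike, ConcreteRecord, and ReflectiveInitSketch are hypothetical stand-ins, not Avro or Crunch classes, and whether the original job fails on exactly this path is an assumption.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-in for an interface such as SpecificRecord.
interface RecordLike {
  String name();
}

// Hypothetical stand-in for a concrete generated record class.
class ConcreteRecord implements RecordLike {
  public ConcreteRecord() {}

  @Override
  public String name() {
    return "concrete";
  }
}

class ReflectiveInitSketch {
  // Looks up the no-argument constructor and invokes it, which is how
  // reflection-based instantiation commonly works (assumed here, not
  // taken from the Crunch or Avro sources).
  static <T> String tryInstantiate(Class<T> type) {
    try {
      Constructor<T> ctor = type.getDeclaredConstructor();
      ctor.setAccessible(true);
      return "created " + ctor.newInstance().getClass().getSimpleName();
    } catch (NoSuchMethodException e) {
      // An interface declares no constructors, so the lookup itself fails.
      return "NoSuchMethodException: " + e.getMessage();
    } catch (ReflectiveOperationException e) {
      return e.getClass().getSimpleName();
    }
  }
}
```

Calling tryInstantiate(RecordLike.class) ends in the NoSuchMethodException branch, while tryInstantiate(ConcreteRecord.class) succeeds. That would be consistent with the distinction drawn above between a PType for the interface and a PType for a concrete implementation.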

On Mon, Jun 22, 2015 at 3:15 PM, David Ortiz <do...@videologygroup.com>
wrote:

>  How are you getting it into a PCollection?  Whatever you're doing there
> should work for the function shouldn't it?

Re: How to write a generic transform method that will act upon generated avro objects in a generic fashion

Posted by David Ortiz <do...@videologygroup.com>.
How are you getting it into a PCollection?  Whatever you're doing there should work for the function, shouldn't it?

On Jun 22, 2015 6:09 PM, Sankash Shankar <sa...@wealthfront.com> wrote:
Hello,

With regards to your question, we will know the class will be one of a pre-defined list of classes, but the exact class will not be known until runtime. In addition, the generic class GenericAvroFunction cannot be defined in a static manner and a generic type, which keeps it from being serializable.

Thanks.




Re: How to write a generic transform method that will act upon generated avro objects in a generic fashion

Posted by Sankash Shankar <sa...@wealthfront.com>.
Hello,

Regarding your question: we know the class will be one of a predefined
list of classes, but the exact class will not be known until run time. In
addition, the generic class GenericAvroFunction cannot be defined as both
static and generic, which keeps it from being serializable.

Thanks.
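
A minimal sketch of how a class picked at run time from a predefined list might be paired with a generic function, assuming the choice arrives as a string key. Java erases type parameters at run time, so a generic class serializes like any other; using a named top-level (or static nested) class also avoids capturing an enclosing instance. Everything here (AvroLike, Trade, Account, ToStringFn, RuntimeDispatchSketch) is a hypothetical stand-in built on the JDK only, not Crunch or Avro code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a generated Avro record base class.
abstract class AvroLike {
  abstract String describe();
}

class Trade extends AvroLike {
  @Override String describe() { return "trade"; }
}

class Account extends AvroLike {
  @Override String describe() { return "account"; }
}

// A top-level generic class; the type parameter does not affect serialization.
class ToStringFn<T extends AvroLike> implements Serializable {
  private static final long serialVersionUID = 1L;
  private final Class<T> type; // java.lang.Class is itself Serializable

  ToStringFn(Class<T> type) { this.type = type; }

  String apply(Object raw) {
    // Class.cast restores the check that erasure removed.
    return type.cast(raw).describe();
  }
}

class RuntimeDispatchSketch {
  // The predefined list of record classes, keyed by a name known at run time.
  static final Map<String, Class<? extends AvroLike>> KNOWN = new LinkedHashMap<>();
  static {
    KNOWN.put("trade", Trade.class);
    KNOWN.put("account", Account.class);
  }

  // Generic helper: the wildcard from the map lookup is captured as T here.
  static <T extends AvroLike> ToStringFn<T> fnFor(Class<T> type) {
    return new ToStringFn<>(type);
  }

  static ToStringFn<? extends AvroLike> fnForName(String name) {
    Class<? extends AvroLike> type = KNOWN.get(name);
    if (type == null) {
      throw new IllegalArgumentException("Unknown record type: " + name);
    }
    return fnFor(type);
  }

  // Serializes and deserializes a value to show the function survives the trip.
  static Object roundTrip(Serializable value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(value);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("Serialization round trip failed", e);
    }
  }
}
```

The wildcard never appears in a "new" expression: fnForName hands the looked-up class to the generic helper fnFor, which names the captured type as T. In a real pipeline the same pattern would presumably sit next to a PType built from the concrete class.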



On Mon, Jun 22, 2015 at 1:23 PM, David Ortiz <do...@videologygroup.com>
wrote:

> When you actually write the code will you know what the avro record is?
> I've been able to do something along the lines of
>
> public class GenericAvroFunction<T extends SpecificRecordBase> extends
> DoFn<T, String> {
>   …
>
>   public void process(T input, Emitter<String> emitter) {
>     …
>   }
> }
>
> then parameterizing it in the various pipelines that use it.  Not sure
> with regards to making it work at run time though.

RE: How to write a generic transform method that will act upon generated avro objects in a generic fashion

Posted by David Ortiz <do...@videologygroup.com>.
When you actually write the code, will you know what the Avro record is?  I've been able to do something along the lines of

public class GenericAvroFunction<T extends SpecificRecordBase> extends DoFn<T, String> {
  …

  public void process(T input, Emitter<String> emitter) {
    …
  }
}

then parameterizing it in the various pipelines that use it.  I'm not sure about making it work at run time, though.
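
To make the shape of this suggestion concrete, here is a self-contained sketch that runs on the JDK alone. RecordBase, Emitter, and DoFn below are hypothetical, reduced stand-ins for Avro's SpecificRecordBase and Crunch's Emitter and DoFn; User and GenericTransformSketch are likewise made up for illustration, and the CSV formatting is an assumption about what the transform should do. With real Avro records the field list would presumably come from getSchema().getFields() rather than a fieldCount() method.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical stand-in for SpecificRecordBase: positional field access only.
abstract class RecordBase {
  abstract int fieldCount();
  abstract Object get(int index);
}

// Hypothetical stand-ins for Crunch's Emitter and DoFn.
interface Emitter<T> {
  void emit(T value);
}

abstract class DoFn<S, T> implements Serializable {
  abstract void process(S input, Emitter<T> emitter);
}

// The type parameter is declared on the class, so no wildcard is ever
// needed in a "new" expression.
class GenericAvroFunction<T extends RecordBase> extends DoFn<T, String> {
  private static final long serialVersionUID = 1L;

  @Override
  void process(T input, Emitter<String> emitter) {
    // Join every field value into one comma-separated line.
    StringJoiner csv = new StringJoiner(",");
    for (int i = 0; i < input.fieldCount(); i++) {
      csv.add(String.valueOf(input.get(i)));
    }
    emitter.emit(csv.toString());
  }
}

// Hypothetical example record with two fields.
class User extends RecordBase {
  private final String name;
  private final int age;

  User(String name, int age) {
    this.name = name;
    this.age = age;
  }

  @Override int fieldCount() { return 2; }

  @Override Object get(int index) {
    return index == 0 ? name : Integer.valueOf(age);
  }
}

class GenericTransformSketch {
  // A generic method takes the place of the wildcard parameter:
  // T is inferred at each call site.
  static <T extends RecordBase> List<String> transformToCsv(List<T> records) {
    List<String> out = new ArrayList<>();
    DoFn<T, String> fn = new GenericAvroFunction<T>();
    for (T record : records) {
      fn.process(record, out::add);
    }
    return out;
  }
}
```

Declaring the method as `<T extends RecordBase>` instead of accepting a wildcard collection is what lets the function be instantiated with a named type argument, which the wildcard form in the original question cannot do.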
