Posted to user@avro.apache.org by James Campbell <ja...@breachintelligence.com> on 2014/05/13 22:03:55 UTC

Reading from disjoint schemas in map

I'm trying to read data into a MapReduce job, where the data may have been created by one of a few different schemas, none of which is an evolution of another (though they are related).

I have seen several people suggest using a union schema, such that during job setup, one would set the input schema to be the union:
ArrayList<Schema> schemas = new ArrayList<Schema>();
schemas.add(schema1);
...
Schema unionSchema = Schema.createUnion(schemas);
AvroJob.setInputKeySchema(job, unionSchema);

However, I don't know how to then extract the correctly typed record inside my mapper (apparently that part is trivial; sorry, I'm new to Avro).

I'd guess that the map function signature becomes map(AvroKey<GenericRecord> key, NullWritable value, ...), but how can I then get Avro to give me the correctly typed data from the GenericRecord?

Thanks!

James

Re: Reading from disjoint schemas in map

Posted by Lewis John Mcgibbney <le...@gmail.com>.
There are code examples of Martin's latter suggestion (the GenericRecord approach) in Apache Gora:

get -
https://github.com/apache/gora/blob/trunk/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/CassandraStore.java#L405

getFieldValue -
https://github.com/apache/gora/blob/trunk/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/CassandraStore.java#L435

getUnionSchema -
https://github.com/apache/gora/blob/trunk/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/CassandraStore.java#L617


These methods are called (sequentially) when we wish to write data into an
underlying data store using Avro.
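For readers without Gora, plain Avro can do the equivalent branch lookup with GenericData.resolveUnion, which returns the index of the union branch a datum matches. A minimal sketch (the record and field names here are made up for illustration; requires the Avro jar on the classpath):

```java
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class UnionBranch {
    public static void main(String[] args) {
        // Two unrelated record schemas, combined into a union as in the
        // original question.
        Schema a = SchemaBuilder.record("RecordA").fields()
                .requiredString("name").endRecord();
        Schema b = SchemaBuilder.record("RecordB").fields()
                .requiredLong("id").endRecord();
        Schema union = Schema.createUnion(Arrays.asList(a, b));

        GenericRecord datum = new GenericRecordBuilder(b).set("id", 42L).build();

        // resolveUnion returns the index of the branch this datum matches.
        int index = GenericData.get().resolveUnion(union, datum);
        Schema branch = union.getTypes().get(index);
        System.out.println(branch.getName()); // RecordB
    }
}
```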

hth

Lewis
On May 14, 2014 7:15 AM, "Martin Kleppmann" <mk...@linkedin.com> wrote:

>  Hi James,
>
>  If you're using code generation to create Java classes for the Avro
> schemas, you should be able to just use Java's instanceof.
>
>  If you're using GenericRecord, you can use GenericRecord.getSchema() to
> determine the type of a particular record.
>
>  Hope that helps,
> Martin
>
>

RE: Reading from disjoint schemas in map

Posted by James Campbell <ja...@breachintelligence.com>.
Martin,

Thanks very much! Setting the mapper to expect an AvroKey<Object> and using instanceof works nicely.

James
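James's fix above can be sketched without any Avro dependency. EventV1 and EventV2 below are hypothetical stand-ins for the classes Avro code generation would produce; with real generated classes, key.datum() from the AvroKey<Object> would be dispatched the same way:

```java
public class DispatchSketch {
    // Stand-ins for Avro-generated specific record classes.
    static class EventV1 { String name; }
    static class EventV2 { long id; }

    // Dispatch on the concrete type, as instanceof does in the mapper.
    static String describe(Object datum) {
        if (datum instanceof EventV1) {
            return "v1:" + ((EventV1) datum).name;
        } else if (datum instanceof EventV2) {
            return "v2:" + ((EventV2) datum).id;
        }
        throw new IllegalArgumentException("unexpected type: " + datum.getClass());
    }

    public static void main(String[] args) {
        EventV1 e1 = new EventV1();
        e1.name = "alice";
        EventV2 e2 = new EventV2();
        e2.id = 42L;
        System.out.println(describe(e1)); // v1:alice
        System.out.println(describe(e2)); // v2:42
    }
}
```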



Re: Reading from disjoint schemas in map

Posted by Martin Kleppmann <mk...@linkedin.com>.
Hi James,

If you're using code generation to create Java classes for the Avro schemas, you should be able to just use Java's instanceof.

If you're using GenericRecord, you can use GenericRecord.getSchema() to determine the type of a particular record.

Hope that helps,
Martin
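Martin's GenericRecord suggestion might look like the following sketch inside a mapper: dispatch on the full name of the schema the record was read with. This is an illustration only (the com.example schema names are placeholders, and it needs the Avro and Hadoop MapReduce jars to compile):

```java
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UnionMapper
        extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, NullWritable> {

    @Override
    protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        GenericRecord record = key.datum();
        // getSchema() returns the union branch this record was read with.
        switch (record.getSchema().getFullName()) {
            case "com.example.RecordA":
                context.write(new Text("A:" + record.get("name")), NullWritable.get());
                break;
            case "com.example.RecordB":
                context.write(new Text("B:" + record.get("id")), NullWritable.get());
                break;
            default:
                // Unknown branch; count it rather than failing the job.
                context.getCounter("union", "unknown").increment(1);
        }
    }
}
```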
