Posted to user@avro.apache.org by Luke Liu <lu...@gmail.com> on 2013/01/09 01:01:59 UTC

org.apache.avro.mapred.AvroMultipleOutputs (avro-1.7.3) does not allow different output schemas

Hi,

The issue is that AvroMultipleOutputs does not honor a schema for an
additional (named) output when it differs from the default output schema.

I am using "org.apache.avro.mapred.AvroMultipleOutputs" as described in
the Javadoc, passing in a "newSchema" that is different from the default
Avro output schema.

// Job configuration
JobConf job = new JobConf();
....
AvroMultipleOutputs.addNamedOutput(job, "avro1",
    AvroOutputFormat.class, newSchema);

// In Reducer
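MyReducer {

  private AvroMultipleOutputs amos;
  ...
  public void configure(JobConf conf) {
    ...
    // Built from the JobConf that the task receives
    amos = new AvroMultipleOutputs(conf);
  }

  public void reduce(...) {
    // Expected to write with newSchema; actually writes with the default schema
    amos.getCollector("avro1", reporter).collect(datum);
  }

  public void close() throws IOException {
    // Flushes and closes all named outputs
    amos.close();
  }

}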
....

With this setup, "amos.getCollector("avro1", reporter).collect(datum);"
always writes with the default Avro output schema from the JobConf. It
should use "newSchema".

Looking at "org.apache.avro.mapred.AvroMultipleOutputs.addNamedOutput(JobConf
conf, String namedOutput, Class<? extends OutputFormat>
outputFormatClass, Schema schema)", I found that it stores the
additional schemas in a static HashMap called "schemaList" at job
configuration time.

But reducer tasks run in separate JVMs, usually on different hosts,
where "schemaList" was never populated. So the reducer cannot find the
schema in the list and falls back to the default schema in the
JobConf.

I think "org.apache.avro.mapred.AvroMultipleOutputs.addNamedOutput(...)"
should store the passed-in schema in the JobConf, which is shipped to
every task, not in the static member "schemaList".

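A minimal sketch of the difference, runnable without Hadoop or Avro. It is
not the actual AvroMultipleOutputs code: java.util.Properties stands in for
the JobConf, schemas are plain JSON strings, and the class name, method
names, and property key are all hypothetical. Shipping the configuration to
a task is simulated by serializing it and clearing the static map, since a
fresh task JVM would start with that map empty.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class NamedOutputSchemaSketch {

    // Mirrors the static "schemaList" approach: exists only in the JVM that filled it.
    static final Map<String, String> STATIC_SCHEMAS = new HashMap<>();

    // Hypothetical key layout, not the real Avro property name.
    static String schemaKey(String namedOutput) {
        return "sketch.namedOutput." + namedOutput + ".schema";
    }

    // Client side: record the schema both in the static map and in the configuration.
    static void addNamedOutput(Properties conf, String namedOutput, String schemaJson) {
        STATIC_SCHEMAS.put(namedOutput, schemaJson);
        conf.setProperty(schemaKey(namedOutput), schemaJson);
    }

    // Simulates handing the job to a task: only the serialized configuration survives.
    static Properties shipToTask(Properties conf) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            conf.store(out, null);
            Properties onTask = new Properties();
            onTask.load(new ByteArrayInputStream(out.toByteArray()));
            STATIC_SCHEMAS.clear(); // a new task JVM has never run addNamedOutput
            return onTask;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Task side, static-map lookup: falls back to the default when the map is empty.
    static String lookupStatic(String namedOutput, String defaultSchema) {
        String schema = STATIC_SCHEMAS.get(namedOutput);
        return schema != null ? schema : defaultSchema;
    }

    // Task side, configuration lookup: finds the schema that was shipped with the job.
    static String lookupConf(Properties conf, String namedOutput, String defaultSchema) {
        return conf.getProperty(schemaKey(namedOutput), defaultSchema);
    }

    public static void main(String[] args) {
        String defaultSchema = "\"string\"";
        String newSchema = "\"long\"";

        Properties job = new Properties();
        addNamedOutput(job, "avro1", newSchema);
        Properties onTask = shipToTask(job);

        System.out.println("static map:    " + lookupStatic("avro1", defaultSchema));
        System.out.println("configuration: " + lookupConf(onTask, "avro1", defaultSchema));
    }
}
```

In the sketch, the static lookup on the simulated task returns the default
schema, which is the behavior described above, while the configuration
lookup returns the schema registered for "avro1".
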
Thanks,
Luke