Posted to dev@avro.apache.org by Lars Francke <la...@gmail.com> on 2020/09/21 18:15:23 UTC

Issues with AvroMultipleOutputs

Hi Avro Devs,

I am currently working with a customer whose long-running MapReduce job is
very slow (hours where I would expect minutes). I traced the issue back to
mapreduce.AvroMultipleOutputs.

My customer was using the write method that does not take a namedOutput.
The problem there is that a Job and a TaskContext are instantiated for
every record, which is a very slow operation (fixing this turned a 4h job
into a 45min one).
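
To illustrate the general idea of avoiding per-record instantiation, here
is a minimal sketch that caches one expensive object per output path. It
uses no Hadoop types: FakeContext and ContextCache are hypothetical
stand-ins, and this is not the actual AvroMultipleOutputs code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ContextCacheSketch {
    /** Hypothetical stand-in for an object that is expensive to build, e.g. a task context. */
    static final class FakeContext {
        final String outputPath;

        FakeContext(String outputPath) {
            this.outputPath = outputPath;
        }
    }

    /** Builds one context per base output path and reuses it for later records. */
    static final class ContextCache {
        private final Map<String, FakeContext> cache = new HashMap<>();
        private final Function<String, FakeContext> factory;
        int creations = 0; // counts how often the expensive factory actually ran

        ContextCache(Function<String, FakeContext> factory) {
            this.factory = factory;
        }

        FakeContext get(String baseOutputPath) {
            // The factory only runs the first time a given path is seen.
            return cache.computeIfAbsent(baseOutputPath, path -> {
                creations++;
                return factory.apply(path);
            });
        }
    }

    public static void main(String[] args) {
        ContextCache cache = new ContextCache(FakeContext::new);
        String[] pathsPerRecord = {"a", "b", "a", "a", "b"};
        for (String path : pathsPerRecord) {
            cache.get(path);
        }
        System.out.println(cache.creations); // prints 2: one per distinct path, not one per record
    }
}
```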

So instead we switched to namedOutputs but our problem is that we don't
know which outputs we'll have before we start the job.

Unfortunately, the class takes a copy of all named outputs at instantiation
time (from the job configuration at that time) so anything added after the
start is discarded.

It's been a long time since I worked with MR-related classes, but as far as
I remember the Job & Configuration are instantiated on the ApplicationMaster
and then serialized as Tasks to the Mappers & Reducers. So putting the named
outputs in some other structure in the class probably won't work, I guess?

But does anything speak against making the named outputs changeable for
each instance of the AvroMultipleOutputs class?
I'd like to add a non-static version of addNamedOutput - I can't think of
anything that would prevent this.
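
A rough sketch of what I have in mind, using plain Java collections as a
stand-in. The class name, the constructor argument and the exception
behaviour are assumptions for illustration only, not the real
AvroMultipleOutputs implementation:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class NamedOutputsSketch {
    /** Hypothetical simplification of a multiple-outputs helper. */
    static final class MultipleOutputsSketch {
        private final Set<String> namedOutputs;

        /** Takes a snapshot of the outputs configured before the job started. */
        MultipleOutputsSketch(Set<String> configuredOutputs) {
            this.namedOutputs = new HashSet<>(configuredOutputs);
        }

        /** Proposed instance-level addition: register an output after instantiation. */
        void addNamedOutput(String name) {
            namedOutputs.add(name);
        }

        /** Rejects outputs that are neither in the snapshot nor added later. */
        void write(String namedOutput, Object record) {
            if (!namedOutputs.contains(namedOutput)) {
                throw new IllegalArgumentException(
                    "Undefined named output '" + namedOutput + "'");
            }
            // Actual record writing is omitted in this sketch.
        }
    }

    public static void main(String[] args) {
        MultipleOutputsSketch outputs =
            new MultipleOutputsSketch(Collections.singleton("known"));
        outputs.write("known", "record-1");  // configured up front, accepted
        outputs.addNamedOutput("late");      // only discovered while the task runs
        outputs.write("late", "record-2");   // accepted after the instance-level add
    }
}
```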

I'm working on a patch for this (it'll bring larger changes to the class)
and I'll obviously keep the current API. I'm just wondering if there is
anything that would prevent this addition?

Cheers,
Lars