Posted to user@avro.apache.org by Pasquale Salza <pa...@gmail.com> on 2014/02/06 22:53:05 UTC

MapReduce and Avro split by number of records

Hi everybody,
I'm looking for a solution to my problem: splitting a group of Avro files by
number of records rather than by block size, as is the default.

For the moment, my strategy is:
- Iterate over the records of the input files;
- Create a new InputSplit when a limit is reached, storing the file paths,
the last sync point encountered in the first file, and an offset, i.e. the
number of records to skip after that sync point;
- The record reader opens the first file, seeks to the stored sync point,
skips the offset number of records by iterating, and then starts reading the
split's records (see the sketch after this list).
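Here is a minimal sketch of the record reader part, assuming a local file and
GenericRecord data; the class and method names are only illustrative, and in a
real MapReduce job you would use a SeekableInput (e.g. FsInput from
avro-mapred) instead of a local File:

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class RecordOffsetReader {

        // Reads up to 'count' records starting 'recordOffset' records
        // after the given sync point. 'syncPoint' and 'recordOffset' are
        // the values stored in the custom InputSplit described above.
        public static void readSplit(File avroFile, long syncPoint,
                                     long recordOffset, long count)
                throws IOException {
            DataFileReader<GenericRecord> reader = new DataFileReader<>(
                    avroFile, new GenericDatumReader<GenericRecord>());
            try {
                // Jump straight to the stored sync marker.
                reader.seek(syncPoint);

                // Skip the records that belong to the previous split.
                for (long i = 0; i < recordOffset && reader.hasNext(); i++) {
                    reader.next();
                }

                // Hand the next 'count' records to the mapper.
                for (long i = 0; i < count && reader.hasNext(); i++) {
                    GenericRecord record = reader.next();
                    // process(record);
                }
            } finally {
                reader.close();
            }
        }
    }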

I am obliged to split by records because, in my case, the MapReduce job is
compute-bound rather than data-bound.

Do you have any better solution?

Pasquale Salza