Posted to dev@avro.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/01/03 19:00:59 UTC
[jira] [Commented] (AVRO-1976) Add Input/OutputFormat to read/write encoded objects
[ https://issues.apache.org/jira/browse/AVRO-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15795859#comment-15795859 ]
ASF GitHub Bot commented on AVRO-1976:
--------------------------------------
GitHub user postamar opened a pull request:
https://github.com/apache/avro/pull/182
AVRO-1976: Add Input/OutputFormat to read/write encoded objects
`AvroEncodedInputFormat` reads a container file input split as key-value pairs in which the key is the file header and the value is the decompressed file data block. `AvroEncodedOutputFormat` follows the same logic for writing. See `TestAvroEncodedInputAndOutputFormats` for usage examples.
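To illustrate why the file header is emitted as the key: any data block can be turned back into a standalone, decodable container file only when it is paired with the header (schema, codec, sync marker) it was written under. A minimal sketch of that pairing, using plain JDK types rather than the actual Hadoop/Avro classes (the class and method names below are hypothetical, for illustration only):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class EncodedSplitDemo {
    // Pair the shared file header with each data block, mimicking the
    // key-value records the proposed AvroEncodedInputFormat is described
    // as emitting: key = header, value = one decompressed data block.
    static List<Map.Entry<String, String>> records(String header, List<String> blocks) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String block : blocks) {
            // Every record carries the same header, so a downstream task
            // can reconstruct a valid container file from any single block.
            out.add(new AbstractMap.SimpleEntry<>(header, block));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> recs =
                records("header-bytes", List.of("block-1", "block-2"));
        System.out.println(recs.size());
        System.out.println(recs.get(0).getKey());
    }
}
```

The point of the design is that decoding is deferred: blocks are shuffled as opaque compressed bytes, and only the task that finally needs the records pays the decoding cost.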
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/postamar/avro AVRO-1976
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/avro/pull/182.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #182
----
commit 74d904b463fd7ae6acc1998350de437ea2aa8a83
Author: Marius Posta <ma...@adgear.com>
Date: 2017-01-03T18:52:52Z
AVRO-1976: Add Input/OutputFormat to read/write encoded objects
----
> Add Input/OutputFormat to read/write encoded objects
> ----------------------------------------------------
>
> Key: AVRO-1976
> URL: https://issues.apache.org/jira/browse/AVRO-1976
> Project: Avro
> Issue Type: Improvement
> Components: java
> Environment: hadoop
> Reporter: Marius Posta
> Priority: Minor
> Labels: newbie
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> In certain cases, performance of some Avro map-reduce jobs can be considerably improved by de-coupling Avro encoding from actual Avro container file IO.
> In my case, a complex schema (100+ record fields) and large HDFS blocks resulted in Spark jobs where many workers sat idle while a few were busy decoding their input splits. Furthermore, the objects then needed to be re-encoded in order to be shuffled about, which crippled performance further.
> I propose the addition of an InputFormat which reads a container file input split as key-value pairs in which the key is the file header and the value is the decompressed file data block. Also, an OutputFormat which follows the same logic for writing.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)