Posted to common-issues@hadoop.apache.org by "Ivan Vladimirov Ivanov (JIRA)" <ji...@apache.org> on 2012/09/03 12:32:08 UTC

[jira] [Updated] (HADOOP-8597) FsShell's Text command should be able to read avro data files

     [ https://issues.apache.org/jira/browse/HADOOP-8597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Vladimirov Ivanov updated HADOOP-8597:
-------------------------------------------

    Attachment: HADOOP-8597.patch

The proposed patch adds the logic to output the content of Avro data files in JSON format.

The implementation does not use the DataFileReadTool class because, as it turned out, the org.apache.avro.tool package is not currently among the project's dependencies. A side benefit is a more memory-efficient implementation, which keeps only a constant number of Avro records in memory at any time.
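
To illustrate the constant-memory idea, here is a minimal sketch. It is not the code in the attached patch and does not use Avro: a plain java.util.Iterator of flat Map records stands in for an Avro data file reader, and the names JsonLineDump, dump and toJson are hypothetical. Each record is written out as one JSON line as soon as it is read, so only the current record is held in memory.

```java
import java.io.PrintWriter;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical helper (not part of Hadoop or Avro): one JSON object per line. */
final class JsonLineDump {

  /**
   * Writes every record as a JSON line and returns the number written.
   * Only the record currently being written is referenced, so memory use
   * does not grow with the number of records.
   */
  static long dump(Iterator<Map<String, Object>> records, PrintWriter out) {
    long count = 0;
    while (records.hasNext()) {
      Map<String, Object> record = records.next(); // previous record is no longer referenced
      out.print(toJson(record));
      out.print('\n');
      count++;
    }
    out.flush();
    return count;
  }

  /** Serializes a flat record; nested structures are out of scope for this sketch. */
  static String toJson(Map<String, Object> record) {
    StringBuilder sb = new StringBuilder("{");
    boolean first = true;
    for (Map.Entry<String, Object> e : record.entrySet()) {
      if (!first) {
        sb.append(',');
      }
      first = false;
      sb.append(quote(e.getKey())).append(':').append(value(e.getValue()));
    }
    return sb.append('}').toString();
  }

  /** Numbers and booleans are emitted bare, null as null, everything else as a string. */
  private static String value(Object v) {
    if (v == null) {
      return "null";
    }
    if (v instanceof Number || v instanceof Boolean) {
      return v.toString();
    }
    return quote(v.toString());
  }

  /** Wraps a string in double quotes and escapes characters JSON does not allow raw. */
  private static String quote(String s) {
    StringBuilder sb = new StringBuilder("\"");
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '"':  sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.append('"').toString();
  }
}
```

In a real implementation the iterator would presumably be Avro's own data file reader and the JSON would come from Avro's JSON encoding support rather than a hand-written serializer; the loop shape (read one record, write it, move on) is the part this sketch is meant to show.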
                
> FsShell's Text command should be able to read avro data files
> -------------------------------------------------------------
>
>                 Key: HADOOP-8597
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8597
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs
>    Affects Versions: 2.0.0-alpha
>            Reporter: Harsh J
>              Labels: newbie
>         Attachments: HADOOP-8597.patch
>
>
> Apache Avro's DataFiles are similar to SequenceFiles. Since they are becoming popular as a data format, it would be useful for {{fs -text}} to support reading them, just as it reads SequenceFiles. This should be easy, since Avro is already a dependency and provides the required classes.
> What remains to be discussed is the output we ought to emit. Avro DataFiles aren't as simple as text, nor do they have the single Key-Value pair structure of SequenceFiles. They usually contain a set of fields defined as a record, and the usual text output, as available from avro-tools via http://avro.apache.org/docs/current/api/java/org/apache/avro/tool/DataFileReadTool.html, is in proper JSON format.
> I think we should use JSON as the output format, rather than a delimited form, because Avro has many complex structures and JSON is the easiest, least-effort way to display them (Avro supports JSON dumping by itself).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira