Posted to dev@avro.apache.org by "Vincenz Priesnitz (JIRA)" <ji...@apache.org> on 2013/04/22 18:35:16 UTC

[jira] [Created] (AVRO-1307) Add an avro-tool to extract samples from avro files

Vincenz Priesnitz created AVRO-1307:
---------------------------------------

             Summary: Add an avro-tool to extract samples from avro files
                 Key: AVRO-1307
                 URL: https://issues.apache.org/jira/browse/AVRO-1307
             Project: Avro
          Issue Type: New Feature
          Components: java
         Environment: java

            Reporter: Vincenz Priesnitz
            Priority: Minor


It would be nice to have an avro-tools command that picks only some records from Avro files.

I implemented a new avro-tools command, cat, which takes a list of Avro files with identical schemas and concatenates them into a single file. It has options to discard the first n records (--offset), to limit the number of records written (--limit), and to sample records at a given rate (--samplerate).

This tool allows a quicker peek into large avro files, e.g.:
{code}
java -jar avro-tools.jar cat input.avro output.avro --offset 50 --limit 10
# creates output.avro that contains records
# 51 to 60 from input.avro.
{code}

{code}
java -jar avro-tools.jar cat input.avro output.avro --offset 1000 --limit 100 --samplerate .01
# samples every hundredth record from the input,
# skipping the first 1000 records and limiting
# the output to 100 records.
{code}
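
The way the three options combine might be sketched as follows. This is only an illustration of the semantics described above, not the code from the patch: the class and method names are hypothetical, plain integers stand in for Avro records, and it assumes that sampling is deterministic and keeps the first record after the offset, then every round(1/samplerate)-th record after it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of offset / limit / samplerate selection.
// Not the actual tool; integers stand in for Avro records.
public class CatSelectionSketch {

    static <T> List<T> select(Iterator<T> records, long offset, long limit, double sampleRate) {
        if (sampleRate <= 0 || sampleRate > 1) {
            throw new IllegalArgumentException("sampleRate must be in (0, 1]");
        }
        // Assumption: a rate of .01 means "keep one record out of every 100".
        long step = Math.max(1, Math.round(1.0 / sampleRate));
        List<T> out = new ArrayList<>();
        long index = 0; // zero-based position of the current record in the input
        while (records.hasNext() && out.size() < limit) {
            T record = records.next();
            long position = index++;
            if (position < offset) {
                continue; // discard the first `offset` records
            }
            if ((position - offset) % step == 0) {
                out.add(record); // keep every step-th record after the offset
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            input.add(i);
        }
        // Like --offset 50 --limit 10: records 51 to 60.
        System.out.println(select(input.iterator(), 50, 10, 1.0));
        // Like --limit 3 --samplerate .1: prints [1, 11, 21]
        System.out.println(select(input.iterator(), 0, 3, 0.1));
    }
}
```

Deriving an integer step from the rate avoids accumulating floating-point error over many records; a real implementation could just as well use a running counter or random sampling.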

The tool accepts multiple input files or folders. When a folder is given, all files inside it are used as input.
{code}
java -jar avro-tools.jar cat data_folder output.avro --samplerate .01
# reads all the files from the data folder and
# writes every 100th record into the output file.
{code}

This tool uses the Hadoop FileSystem API to handle files from any supported filesystem.
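
Expanding folder arguments into a flat list of input files could look roughly like the sketch below. It uses java.nio.file on the local filesystem purely as a stand-in for the Hadoop FileSystem API, so none of these calls are the ones the tool makes. The class and method names are hypothetical, and because the description does not say whether subfolders are traversed or files are filtered, this version assumes only the direct children of a folder are taken, unfiltered and sorted by name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: turn a mix of file and folder arguments into input files.
// Local-filesystem stand-in only; the tool itself goes through Hadoop's FileSystem.
public class InputExpansionSketch {

    static List<Path> expand(List<Path> arguments) throws IOException {
        List<Path> inputs = new ArrayList<>();
        for (Path argument : arguments) {
            if (Files.isDirectory(argument)) {
                // Assumption: non-recursive, regular files only, sorted by name.
                try (Stream<Path> children = Files.list(argument)) {
                    inputs.addAll(children
                            .filter(Files::isRegularFile)
                            .sorted()
                            .collect(Collectors.toList()));
                }
            } else {
                inputs.add(argument); // a plain file is used as given
            }
        }
        return inputs;
    }

    public static void main(String[] args) throws IOException {
        Path folder = Files.createTempDirectory("data_folder");
        Files.createFile(folder.resolve("part-2.avro"));
        Files.createFile(folder.resolve("part-1.avro"));
        Path single = Files.createTempFile("input", ".avro");
        // The single file comes first, followed by the folder's files in name order.
        for (Path input : expand(Arrays.asList(single, folder))) {
            System.out.println(input.getFileName());
        }
    }
}
```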


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira