Posted to dev@avro.apache.org by "Patrick Linehan (JIRA)" <ji...@apache.org> on 2010/10/25 21:56:24 UTC

[jira] Created: (AVRO-684) Java tool for altering the codec of an Avro data file stream.

Java tool for altering the codec of an Avro data file stream.
-------------------------------------------------------------

                 Key: AVRO-684
                 URL: https://issues.apache.org/jira/browse/AVRO-684
             Project: Avro
          Issue Type: New Feature
          Components: java
            Reporter: Patrick Linehan


An example is worth a thousand words:

  cat infile.avro | avro-tools recodec deflate - - > outfile.avro

The above example would create a new file, "outfile.avro", with the same contents as "infile.avro".  However, the codec of "outfile.avro" would be "deflate", regardless of the codec of "infile.avro".

Proposed features:

* The tool should preserve any metadata present in the input file.
* Supported codecs will be "deflate" and "null".
* Optionally add support for specifying the deflation level, perhaps with syntax as follows:  "deflate:N" where N is the deflation level, e.g. "deflate:4".
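A minimal sketch of how the proposed "deflate:N" syntax could be parsed, using only the JDK. The class name CodecSpec and its fields are hypothetical and are not taken from the attached patch; the accepted level range (0 to 9) is an assumption borrowed from java.util.zip.Deflater.

```java
import java.util.zip.Deflater;

/** Hypothetical holder for a parsed codec argument: "null", "deflate" or "deflate:N". */
final class CodecSpec {
    final String name;
    final int level; // Deflater.DEFAULT_COMPRESSION (-1) when no level was given

    private CodecSpec(String name, int level) {
        this.name = name;
        this.level = level;
    }

    static CodecSpec parse(String spec) {
        int colon = spec.indexOf(':');
        String name = colon < 0 ? spec : spec.substring(0, colon);

        if (name.equals("null")) {
            // The "null" codec stores blocks uncompressed, so a level makes no sense.
            if (colon >= 0) {
                throw new IllegalArgumentException("\"null\" takes no level: " + spec);
            }
            return new CodecSpec(name, Deflater.DEFAULT_COMPRESSION);
        }
        if (!name.equals("deflate")) {
            throw new IllegalArgumentException("unsupported codec: " + name);
        }
        if (colon < 0) {
            return new CodecSpec(name, Deflater.DEFAULT_COMPRESSION);
        }

        int level;
        try {
            level = Integer.parseInt(spec.substring(colon + 1));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("bad deflation level in: " + spec, e);
        }
        // Assumed range: 0 (NO_COMPRESSION) through 9 (BEST_COMPRESSION).
        if (level < Deflater.NO_COMPRESSION || level > Deflater.BEST_COMPRESSION) {
            throw new IllegalArgumentException("deflation level out of range: " + level);
        }
        return new CodecSpec(name, level);
    }
}
```

With this sketch, CodecSpec.parse("deflate:4") yields the name "deflate" with level 4, while a bare "deflate" falls back to the JDK's default level.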

Does this proposal sound reasonable?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-684) Java tool for altering the codec of an Avro data file stream.

Posted by "Patrick Linehan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Linehan updated AVRO-684:
---------------------------------

    Attachment: AVRO-684.patch

Added a test and metadata preservation.  Please let me know if the test is up to snuff.



[jira] Updated: (AVRO-684) Java tool for altering the codec of an Avro data file stream.

Posted by "Patrick Linehan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Linehan updated AVRO-684:
---------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (AVRO-684) Java tool for altering the codec of an Avro data file stream.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924709#action_12924709 ] 

Doug Cutting commented on AVRO-684:
-----------------------------------

The codec and deflation level might be specified with '-codec' and '-level', so that the command syntax might be:

  recodec [-codec codec] [-level level] [infile [outfile]]
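The command syntax above could be parsed roughly as follows. This is only a sketch: the class RecodecArgs is hypothetical, and the defaults ("null" codec, level -1 meaning "codec default", and "-" for stdin/stdout) are assumptions rather than anything decided in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for: recodec [-codec codec] [-level level] [infile [outfile]] */
final class RecodecArgs {
    String codec = "null"; // assumed default codec
    int level = -1;        // assumed: -1 means "use the codec's default level"
    String infile = "-";   // assumed: "-" means standard input
    String outfile = "-";  // assumed: "-" means standard output

    static RecodecArgs parse(String[] argv) {
        RecodecArgs args = new RecodecArgs();
        List<String> positional = new ArrayList<>();

        for (int i = 0; i < argv.length; i++) {
            String arg = argv[i];
            if (arg.equals("-codec") || arg.equals("-level")) {
                if (i + 1 >= argv.length) {
                    throw new IllegalArgumentException(arg + " requires a value");
                }
                String value = argv[++i];
                if (arg.equals("-codec")) {
                    args.codec = value;
                } else {
                    args.level = Integer.parseInt(value);
                }
            } else if (arg.startsWith("-") && arg.length() > 1) {
                // A lone "-" is a file argument; anything else starting with "-" is unknown.
                throw new IllegalArgumentException("unknown option: " + arg);
            } else {
                positional.add(arg);
            }
        }

        if (positional.size() > 2) {
            throw new IllegalArgumentException("at most two file arguments are allowed");
        }
        if (positional.size() >= 1) {
            args.infile = positional.get(0);
        }
        if (positional.size() == 2) {
            args.outfile = positional.get(1);
        }
        return args;
    }
}
```

Keeping the codec and level as separate options means neither value needs further string parsing, at the cost of allowing combinations (such as a level with the "null" codec) that a real tool would have to reject or ignore.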



[jira] Updated: (AVRO-684) Java tool for altering the codec of an Avro data file stream.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-684:
------------------------------

       Resolution: Fixed
    Fix Version/s: 1.5.0
         Assignee: Patrick Linehan
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Patrick!



[jira] Updated: (AVRO-684) Java tool for altering the codec of an Avro data file stream.

Posted by "Patrick Linehan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Linehan updated AVRO-684:
---------------------------------

    Attachment: AVRO-684.patch

I've finished a first draft.  Still to be done:

* Write the test.
* Preserve file metadata.
* Implement the concatenation described by Scott Carey.

I'm assuming that for concatenation, the following would be considered reasonable behavior:

* Only the metadata from the first input file is written to the output file.
* The schema from the first input file becomes the schema of the output file.  The remaining input file schemas only need to resolve with said schema, not be identical.
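The first assumption (metadata comes from the first input only) could look roughly like the sketch below. MetadataMerge is a hypothetical helper that works on plain maps rather than Avro's reader and writer classes. Keys beginning with "avro." are skipped on the assumption that the writer sets those itself (for example the schema and codec entries), since the Avro specification reserves that prefix.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: choose the user metadata for a concatenated output file. */
final class MetadataMerge {
    static Map<String, byte[]> outputMetadata(List<Map<String, byte[]>> inputs) {
        Map<String, byte[]> out = new LinkedHashMap<>();
        if (inputs.isEmpty()) {
            return out;
        }
        // Only the first input contributes metadata; later inputs are ignored.
        for (Map.Entry<String, byte[]> entry : inputs.get(0).entrySet()) {
            // Skip reserved "avro." keys; the output writer is assumed to set its own.
            if (!entry.getKey().startsWith("avro.")) {
                out.put(entry.getKey(), entry.getValue());
            }
        }
        return out;
    }
}
```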

Anyway, the first draft is here in case anyone gets the urge to finish it for me :)  Otherwise I hope to finish it in the next few weeks.



[jira] Commented: (AVRO-684) Java tool for altering the codec of an Avro data file stream.

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924710#action_12924710 ] 

Scott Carey commented on AVRO-684:
----------------------------------

Yes, this would be useful.

Most of the machinery for this is already in the DataFileWriter class.  It is not exposed in a command-line tool though.

I currently use this machinery to take a large list of small Avro files and merge them into one larger Avro file with a set compression type and level.

In addition to the compression level, there is the concept of forcing a re-encode.  By default, the current code will not re-encode unless required.  For example, it won't re-encode deflate:1 to deflate:3 unless the flag that forces a re-encode is passed in.  It will, however, decode deflate to null or encode null to deflate by default.  If a block is already compatible, it just copies the raw bytes of the block, which is very fast.
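The rule described above can be sketched as a small predicate. RecodePolicy and mustReencode are hypothetical names, not the DataFileWriter API. The sketch compares codec names only, on the assumption that a data file records the codec name but not the deflation level, which is why deflate:1 to deflate:3 counts as "already compatible".

```java
/** Hypothetical illustration of the "re-encode only when required" rule. */
final class RecodePolicy {
    /**
     * Returns true if blocks must be decoded and re-encoded with the output codec,
     * false if their raw bytes can be copied to the output unchanged.
     */
    static boolean mustReencode(String inputCodec, String outputCodec, boolean force) {
        // Same codec name: the block bytes are already valid for the output file,
        // so they are copied as-is unless the caller forces a re-encode.
        return force || !inputCodec.equals(outputCodec);
    }
}
```

Under this sketch, deflate to deflate without the force flag copies raw bytes, while null to deflate (or the reverse) always re-encodes.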

This tool should also support concatenation of files and creation of one larger file from a collection of smaller ones (of the same schema) with the requested encoding.  Maybe something like this:

{noformat}
$ avro-tools append_to -f outfile.avro -c deflate:5 infile.avro [infile2.avro, . . .]
{noformat}

This would create outfile.avro with codec deflate:5 from multiple source files.




[jira] Commented: (AVRO-684) Java tool for altering the codec of an Avro data file stream.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926468#action_12926468 ] 

Doug Cutting commented on AVRO-684:
-----------------------------------

This looks good to me.

A test is required before this can be committed.

Concatenation can and probably should be done as a separate follow-on issue.

Metadata would be nice to have from the start, but could also be done as a separate issue later.

