Posted to dev@avro.apache.org by "Frank Grimes (Created) (JIRA)" <ji...@apache.org> on 2012/01/13 16:07:39 UTC

[jira] [Created] (AVRO-991) Allow combining multiple Avro files within a stream. (no files on disk)

Allow combining multiple Avro files within a stream. (no files on disk)
-----------------------------------------------------------------------

                 Key: AVRO-991
                 URL: https://issues.apache.org/jira/browse/AVRO-991
             Project: Avro
          Issue Type: Improvement
          Components: java
    Affects Versions: 1.6.1
            Reporter: Frank Grimes


It would be nice to be able to do as follows:

  cat file1.avro file2.avro | java -jar avro-tools.jar streamcombine > combined-file.avro

or similarly
  
  hadoop dfs -cat hdfs://hadoop/file1.avro hdfs://hadoop/file2.avro | java -jar avro-tools.jar streamcombine | hadoop dfs -put - hdfs://hadoop/combined-file.avro

See the following thread for details: http://mail-archives.apache.org/mod_mbox/avro-user/201201.mbox/%3cC08F1DE9-97A8-4D28-B0AD-5E4A7F32F028@gmail.com%3e

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-991) Allow combining multiple Avro files within a stream. (no files on disk)

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188830#comment-13188830 ] 

Doug Cutting commented on AVRO-991:
-----------------------------------

> +1 for user-specified sync markers.

That should probably be a separate issue from the appended-stream tool.
                

[jira] [Commented] (AVRO-991) Allow combining multiple Avro files within a stream. (no files on disk)

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185746#comment-13185746 ] 

Doug Cutting commented on AVRO-991:
-----------------------------------

I think this would work.  We'd need to be able to distinguish the start of a block from the start of the next file.  A block starts with the count of items in it, encoded as a variable-length zig-zag-encoded long.  A file starts with ASCII 'O'.  Interpreted as a variable-length zig-zag encoded long, this is -40, which is an invalid item count.  So a DataFileStream would need to, when the item count is -40, try to read a file header, and if its schema is compatible, update its sync and codec and keep reading.
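The -40 figure can be checked by hand. The following is a minimal sketch, not Avro library code: the helper name `zigzag_decode_byte` is made up for illustration, and it only handles the single-byte case, which is all that is needed for the first byte of the file magic ('O', 'b', 'j', 1).

```python
def zigzag_decode_byte(b: int) -> int:
    """Decode a one-byte Avro long (continuation bit clear) from zig-zag form."""
    if b & 0x80:
        raise ValueError("continuation bit set; value spans more than one byte")
    # Zig-zag maps 0, 1, 2, 3, ... back to 0, -1, 1, -2, ...
    return (b >> 1) ^ -(b & 1)


first_byte = b"Obj\x01"[0]   # 0x4F, ASCII 'O'
print(zigzag_decode_byte(first_byte))   # -40
```

Since 'O' is 79, the decoder computes (79 >> 1) ^ -1, which is 39 with every bit flipped, giving -40. A block count can never legitimately be negative, so this value is free to act as the "new file header follows" signal.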

                

[jira] [Commented] (AVRO-991) Allow combining multiple Avro files within a stream. (no files on disk)

Posted by "Tom White (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187913#comment-13187913 ] 

Tom White commented on AVRO-991:
--------------------------------

+1 for user-specified sync markers. It would be sufficient to be able to say "use the default sync marker" for my use case (checking that an Avro data file generated from a MapReduce program is as expected).

> However, two files with the same logical content can differ in other ways too: User provided metadata, different block sizes, compression codecs, etc.

Right, but these can all be set by the user - so being able to control the sync marker would make everything deterministic.
                

[jira] [Commented] (AVRO-991) Allow combining multiple Avro files within a stream. (no files on disk)

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187346#comment-13187346 ] 

Doug Cutting commented on AVRO-991:
-----------------------------------

On second thought, I don't think we ought to add this to the spec.  I think a tool that can read appended streams and write a single file would be useful, but I don't think we should require every implementation to be able to parse appended files.  That would be an incompatible change, and, as Scott points out, would also create files that are difficult to split.

I also think Scott's idea of permitting user-spec'd sync markers could be useful.
                

[jira] [Commented] (AVRO-991) Allow combining multiple Avro files within a stream. (no files on disk)

Posted by "Scott Carey (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187338#comment-13187338 ] 

Scott Carey commented on AVRO-991:
----------------------------------

To help with Tom's md5 issue, we could support user-provided sync markers.  However, two files with the same logical content can differ in other ways too:  user-provided metadata, different block sizes, compression codecs, etc.

That brings up another tool -- we could provide a tool that does checksums on the binary content of an Avro data file and ignores the sync markers.
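A rough sketch of what such a checksum tool might look like, written against the container layout (magic, metadata map, 16-byte sync, then blocks of count, size, data, sync). This is not an existing avro-tools command; `content_digest` and its helpers are hypothetical names. It hashes only the block payloads, so it assumes both files use the same codec, and it deliberately leaves the header metadata out of the digest.

```python
import hashlib
import io

MAGIC = b"Obj\x01"
SYNC_SIZE = 16


def read_long(buf):
    """Read one zig-zag varint long; return None on a clean end of stream."""
    raw, shift = 0, 0
    while True:
        b = buf.read(1)
        if not b:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        raw |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            return (raw >> 1) ^ -(raw & 1)
        shift += 7


def read_sync_from_header(buf):
    """Skip the magic and the metadata map; return the file's sync marker."""
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro data file")
    while True:
        count = read_long(buf)
        if count is None:
            raise EOFError("truncated header")
        if count == 0:
            break
        if count < 0:            # negative count: a byte size follows
            count = -count
            read_long(buf)
        for _ in range(count):
            buf.read(read_long(buf))   # key
            buf.read(read_long(buf))   # value
    return buf.read(SYNC_SIZE)


def content_digest(buf):
    """Hash the block payloads only, ignoring sync markers and block framing."""
    sync = read_sync_from_header(buf)
    digest = hashlib.sha256()
    while True:
        count = read_long(buf)
        if count is None:
            return digest.hexdigest()
        size = read_long(buf)
        digest.update(buf.read(size))
        if buf.read(SYNC_SIZE) != sync:
            raise ValueError("sync marker mismatch")


# Two hand-built files: same schema and data, different sync markers.
header = MAGIC + b"\x02\x16avro.schema\x0a\"int\"\x00"
block = b"\x04\x04\x02\x04"      # count=2, size=2, ints 1 and 2
file_a = header + b"A" * 16 + block + b"A" * 16
file_b = header + b"B" * 16 + block + b"B" * 16
print(file_a == file_b)                                   # False
print(content_digest(io.BytesIO(file_a)) ==
      content_digest(io.BytesIO(file_b)))                 # True
```

Because the block counts and sizes are not hashed, two null-codec files with the same records but different block boundaries would also produce the same digest. With a compressing codec the payloads would have to be decompressed first for that to hold.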

                

[jira] [Commented] (AVRO-991) Allow combining multiple Avro files within a stream. (no files on disk)

Posted by "Scott Carey (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187335#comment-13187335 ] 

Scott Carey commented on AVRO-991:
----------------------------------

> A file starts with ASCII 'O'. Interpreted as a variable-length zig-zag encoded long, this is -40, which is an invalid item count. So a DataFileStream would need to, when the item count is -40, try to read a file header, and if its schema is compatible, update its sync and codec and keep reading.

* If reading sequentially this will work, but it means that the resulting concatenated file cannot be split.

I think the first thing we need to do is add a tool to avro-tools that can do the equivalent of 'cat file1.avro file2.avro > combined-file.avro'.  If the schemas are equal, this is extremely fast (blocks can be copied and new sync markers put between).  This requires no format change.   This same tool can be extended to 'recodec' or change the sync interval size.  It can also convert compatible schemas if need be.
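To make the block-copying idea concrete, here is a hedged sketch of a combiner that reads the inputs as one appended stream (as in the `cat file1.avro file2.avro | ...` usage from the issue) and writes a single file. It is not the avro-tools implementation; `stream_combine` and the helpers are hypothetical names. It assumes the schemas and codecs must match byte for byte, keeps only the first file's metadata, and reuses the first file's sync marker for every output block.

```python
import io

MAGIC = b"Obj\x01"        # first byte is ASCII 'O'
SYNC_SIZE = 16
MAGIC_AS_COUNT = -40      # 'O' read as a zig-zag varint long


def read_long(buf):
    """Read one zig-zag varint long; return None on a clean end of stream."""
    raw, shift = 0, 0
    while True:
        b = buf.read(1)
        if not b:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        raw |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            return (raw >> 1) ^ -(raw & 1)
        shift += 7


def write_long(n):
    """Encode a 64-bit signed integer as a zig-zag varint."""
    raw = (n << 1) ^ (n >> 63)
    out = bytearray()
    while raw > 0x7F:
        out.append((raw & 0x7F) | 0x80)
        raw >>= 7
    out.append(raw)
    return bytes(out)


def read_meta_and_sync(buf):
    """Read the metadata map and sync marker that follow the magic."""
    meta = {}
    while True:
        count = read_long(buf)
        if count is None:
            raise EOFError("truncated header")
        if count == 0:
            return meta, buf.read(SYNC_SIZE)
        if count < 0:            # negative count: a byte size follows
            count = -count
            read_long(buf)
        for _ in range(count):
            key = buf.read(read_long(buf))
            meta[key] = buf.read(read_long(buf))


def encode_meta(meta):
    out = bytearray()
    if meta:
        out += write_long(len(meta))
        for key, value in meta.items():
            out += write_long(len(key)) + key + write_long(len(value)) + value
    return bytes(out) + write_long(0)


def identity(meta):
    return meta.get(b"avro.schema"), meta.get(b"avro.codec", b"null")


def stream_combine(src, out):
    """Copy every block from an appended stream into one output file."""
    if src.read(4) != MAGIC:
        raise ValueError("not an Avro data file")
    first_meta, sync = read_meta_and_sync(src)
    out_sync = sync
    out.write(MAGIC + encode_meta(first_meta) + out_sync)
    while True:
        count = read_long(src)
        if count is None:
            return
        if count == MAGIC_AS_COUNT:          # start of an appended file
            if src.read(3) != MAGIC[1:]:
                raise ValueError("bad block count or corrupt header")
            meta, sync = read_meta_and_sync(src)
            if identity(meta) != identity(first_meta):
                raise ValueError("schema or codec differs from first file")
            continue
        if count < 0:
            raise ValueError("invalid block count")
        size = read_long(src)
        data = src.read(size)
        if len(data) != size or src.read(SYNC_SIZE) != sync:
            raise ValueError("truncated block or sync marker mismatch")
        # The payload is copied untouched; only the trailing sync changes.
        out.write(write_long(count) + write_long(size) + data + out_sync)
```

A caller could pass `sys.stdin.buffer` and `sys.stdout.buffer` as `src` and `out`. No record is ever decoded, which is why this path is fast when schemas are equal; converting between compatible schemas or changing the codec would require decoding and re-encoding each block instead.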

I find that in most cases, if I have a few hundred files that I want to lump into fewer, I'd be happy if the result were one file per schema.  IMO all we need is tool support for easy concatenation of same-schema files with some metadata preservation.
                

[jira] [Commented] (AVRO-991) Allow combining multiple Avro files within a stream. (no files on disk)

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187227#comment-13187227 ] 

Doug Cutting commented on AVRO-991:
-----------------------------------

For the record, the thinking behind the varied sync marker is that it makes collisions less likely.  In theory this is not true, but in practice my concern was that, once a value was fixed and known, there'd be a significantly higher probability that someone would include it in some data.  Perhaps that's not correct, though.

As for expanding the spec, as I mentioned above, we can do that at present, since the file's magic number can never be the start of a valid block.  So if a block ever starts with the magic number then a reader could assume that it's an appended file.  It's perhaps not the way one would design an appendable format from scratch, but I think it's workable.
                

[jira] [Issue Comment Edited] (AVRO-991) Allow combining multiple Avro files within a stream. (no files on disk)

Posted by "Scott Carey (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187338#comment-13187338 ] 

Scott Carey edited comment on AVRO-991 at 1/17/12 12:34 AM:
------------------------------------------------------------

To help with Tom's checksum issue, we could support user-provided sync markers.  However, two files with the same logical content can differ in other ways too:  user-provided metadata, different block sizes, compression codecs, etc.

That brings up another tool -- we could provide a tool that does checksums on the binary content of an Avro data file and ignores the sync markers.

                

[jira] [Commented] (AVRO-991) Allow combining multiple Avro files within a stream. (no files on disk)

Posted by "Scott Carey (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187351#comment-13187351 ] 

Scott Carey commented on AVRO-991:
----------------------------------

> For the record, the thinking behind the varied sync marker is that it makes collisions less likely. In theory this is not true, but in practice my concern was that, once a value was fixed and known, there'd be a significantly higher probability that someone would include it in some data. Perhaps that's not correct, though.

If the sync marker were known to have a few properties, it would reduce the collision rate with typical Avro data written with the 'null' codec:
* It could contain a sequence of bytes that cannot be interpreted as UTF-8 (e.g. insufficient or too many continuation bytes).
* It could contain a sequence of bytes that cannot be interpreted as an Avro-encoded int or long (e.g. 10 consecutive bytes with the MSB set).

In order to achieve the above you lose some randomness, and we may have to compensate with a couple extra bytes.

For each codec, there may be byte sequences that are impossible in the encoded data.  Each codec could have its own sync marker.  Files with incompatible codecs could not be concatenated together anyway.
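As an illustration of the two properties listed above, here is a small sketch of one possible marker layout. The layout itself (ten 0xFF bytes followed by six random bytes) and the names `make_marker`, `is_valid_utf8`, and `starts_with_avro_long` are assumptions made for this example, not anything proposed in the thread or defined by Avro.

```python
import os


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def starts_with_avro_long(data: bytes) -> bool:
    """True if a complete varint of at most 10 bytes starts at offset 0."""
    for byte in data[:10]:
        if not byte & 0x80:      # a byte with the MSB clear ends the varint
            return True
    return False


def make_marker(tail: bytes) -> bytes:
    """Hypothetical 16-byte marker: ten 0xFF bytes plus six caller-chosen bytes."""
    if len(tail) != 6:
        raise ValueError("tail must be exactly 6 bytes")
    return b"\xff" * 10 + tail


marker = make_marker(os.urandom(6))
print(len(marker), is_valid_utf8(marker), starts_with_avro_long(marker))
```

The byte 0xFF never appears in well-formed UTF-8, and ten bytes in a row with the MSB set overrun the longest legal varint. The cost is visible in the layout: only 6 of the 16 bytes remain random, which is the loss of randomness mentioned above. Note also that the varint check only covers a decoder starting at the first byte of the marker; one that starts partway through the 0xFF run could still see a short, valid-looking varint.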

                

[jira] [Commented] (AVRO-991) Allow combining multiple Avro files within a stream. (no files on disk)

Posted by "Tom White (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185815#comment-13185815 ] 

Tom White commented on AVRO-991:
--------------------------------

Is it worth considering expanding the spec to support concatenated Avro data files, just like gzip allows (http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage)? Then you'd be able to do

  cat file1.avro file2.avro > combined-file.avro

and read combined-file.avro as a regular Avro file.

However, I'm not sure how splitting would work since the sync markers would be different in the original Avro files. (If only Avro files had a fixed sync marker! Like bzip2 does. This also tripped me up recently for another reason: two Avro files with the same contents do not have the same binary representation, so you can't just checksum them to see if they are the same.) 
                