Posted to dev@avro.apache.org by "Michael Cooper (Created) (JIRA)" <ji...@apache.org> on 2011/12/22 05:09:31 UTC

[jira] [Created] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Avro files generated from avro-c dont work with the Java mapred implementation.
-------------------------------------------------------------------------------

                 Key: AVRO-986
                 URL: https://issues.apache.org/jira/browse/AVRO-986
             Project: Avro
          Issue Type: Bug
          Components: c, java
         Environment: avro-c 1.6.2-SNAPSHOT
avro-java 1.6.2-SNAPSHOT
hadoop 0.20.2
            Reporter: Michael Cooper
            Priority: Critical


When a file generated by the Avro-C implementation is fed into Hadoop, the job fails with "Block size invalid or too large for this implementation: -49".

This is caused by the sync marker, specifically the copy that Avro-C writes into the header metadata map.

org.apache.avro.mapred.AvroRecordReader uses a FileSplit object to work out where it should read from, but FileSplit is not particularly smart: it simply divides the file into equal-sized chunks, the first starting at position 0.

So AvroRecordReader gets 0 as the start of its chunk and calls
{code:title=AvroRecordReader.java}reader.sync(split.getStart());   // sync to start{code}
org.apache.avro.file.DataFileReader then seeks to position 0 and scans forward for a sync marker. It finds one at position 32: the value of the "avro.sync" entry in the header metadata map, which holds the same 16 bytes as the file's real sync marker. The reader is left positioned in the middle of the header, interprets the bytes that follow as a block count and block size, and fails with the error above.
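
For reference, the stray entry is easy to detect from Java using only the public metadata API. The snippet below is illustrative and not part of the attached patch; the class name and the command-line argument are placeholders.
{code:title=CheckSyncInMeta.java}
import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class CheckSyncInMeta {
  public static void main(String[] args) throws Exception {
    // Open the container file with the generic reader; no schema is needed up front.
    DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
        new File(args[0]), new GenericDatumReader<GenericRecord>());
    // Files written by the current avro-c store the sync marker under "avro.sync".
    byte[] syncInMeta = reader.getMeta("avro.sync");
    if (syncInMeta != null) {
      System.out.println("header metadata contains avro.sync ("
          + syncInMeta.length + " bytes); this file triggers the bug");
    } else {
      System.out.println("no avro.sync entry in the header metadata");
    }
    reader.close();
  }
}
{code}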

No other implementation adds the sync marker to the metadata map, and none reads it from there, not even the C implementation itself.

I suggest we simply remove this entry from the header; that is the simplest solution.
Another option would be to create an AvroFileSplit class in mapred that knows where the blocks are and provides the correct split locations in the first place.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Douglas Creager (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174955#comment-13174955 ] 

Douglas Creager commented on AVRO-986:
--------------------------------------

+1  Should we apply the patch to 1.5, too?

Also, any existing C-produced files will still be unreadable by the Java mapred code.  Should we also provide a fixup script of some kind?
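
One possible shape for such a fixup, sketched in Java under the assumption that rewriting the file is acceptable. This is not the avromod utility attached later on this issue; the class name is a placeholder, and a plain copy like this does not preserve the original codec, sync marker, or any custom metadata.
{code:title=FixupAvroSync.java}
import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class FixupAvroSync {
  public static void main(String[] args) throws Exception {
    File in = new File(args[0]);
    File out = new File(args[1]);
    DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
        in, new GenericDatumReader<GenericRecord>());
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
    // A fresh container file gets a clean header: the Java writer never emits
    // an "avro.sync" metadata entry, so the copy is readable by Java mapred.
    writer.create(reader.getSchema(), out);
    for (GenericRecord record : reader) {   // copy every record unchanged
      writer.append(record);
    }
    writer.close();
    reader.close();
  }
}
{code}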
                

        

[jira] [Commented] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Douglas Creager (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192181#comment-13192181 ] 

Douglas Creager commented on AVRO-986:
--------------------------------------

Actually, it looks like the Java patch file you uploaded doesn't contain the contents of the binary syncInMeta.avro file.  Can you upload that separately?
                

        

[jira] [Updated] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Douglas Creager (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Douglas Creager updated AVRO-986:
---------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed all of these to SVN.
                

        

[jira] [Updated] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Doug Cutting (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-986:
------------------------------

    Attachment: AVRO-986-java.patch

Here's a version of the Java changes that includes a test.
                

        

[jira] [Updated] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Douglas Creager (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Douglas Creager updated AVRO-986:
---------------------------------

    Attachment: quickstop.db

I've attached a copy of the quickstop.db Avro file; this is generated by one of the C test cases.  It contains the avro.sync metadata field.  I'm happy to add this to the share directory also, but unfortunately I don't know enough about the Java build scripts to write a test case for Doug's patch.
                

        

[jira] [Updated] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Doug Cutting (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-986:
------------------------------

    Fix Version/s: 1.6.2

It would be good to get this into 1.6.2.  Douglas, do you want to commit it, or should I?
                

        

[jira] [Updated] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Michael Cooper (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Cooper updated AVRO-986:
--------------------------------

    Attachment: 0001-Remove-sync-marker-from-metadata-in-header.patch

Attaching patch to remove sync marker from the metadata in avro-c.
                

        

[jira] [Commented] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Douglas Creager (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192180#comment-13192180 ] 

Douglas Creager commented on AVRO-986:
--------------------------------------

Sure, I'll commit the patches now.
                

        

[jira] [Updated] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Michael Cooper (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Cooper updated AVRO-986:
--------------------------------

    Status: Patch Available  (was: Open)
    

        

[jira] [Updated] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Doug Cutting (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-986:
------------------------------

    Attachment: AVRO-986-java.patch

Perhaps we should fix the Java code too.

Here's a patch that should do the trick.  To test it, we should probably add a file to share/test/data that has "avro.sync" in its metadata and verify that reads after a DataFileReader#sync(0) on that file work correctly.
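
The check such a test would perform could look roughly like this; the file name and path are placeholders, and this is a plain demonstration rather than the patch's actual test case.
{code:title=SyncInMetaCheck.java}
import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SyncInMetaCheck {
  public static void main(String[] args) throws Exception {
    // A container file whose header metadata contains an "avro.sync" entry.
    File file = new File("share/test/data/syncInMeta.avro");
    DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
        file, new GenericDatumReader<GenericRecord>());
    // With the bug, sync(0) matches the sync-marker bytes inside the header
    // metadata and the subsequent read fails with "Block size invalid or too
    // large for this implementation"; with the fix, all records are readable.
    reader.sync(0);
    long count = 0;
    while (reader.hasNext()) {
      reader.next();
      count++;
    }
    System.out.println("read " + count + " records after sync(0)");
    reader.close();
  }
}
{code}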
                

        

[jira] [Commented] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174936#comment-13174936 ] 

Doug Cutting commented on AVRO-986:
-----------------------------------

+1 This patch sounds like the right way to fix this to me.

If we were to fix this in Java instead, I don't think we should try to make the splitter smarter, since splitting is single-threaded and that wouldn't scale.  Rather, we should make sync(0) skip over the metadata.  But there probably shouldn't be any sync markers in the metadata anyway...
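
A simplified sketch of that idea (not the committed patch): if the requested position falls inside the header, jump directly to the first data block instead of scanning, so a sync-marker byte pattern stored in the header metadata can never be matched. The firstBlockOffset parameter is assumed to be known to the caller; inside DataFileReader itself it would be the offset recorded when the header is parsed.
{code:title=HeaderSafeSync.java}
import java.io.IOException;

import org.apache.avro.file.DataFileReader;

public class HeaderSafeSync {
  /** Position the reader for a split that starts at splitStart. */
  public static <D> void syncToSplit(DataFileReader<D> reader, long splitStart,
                                     long firstBlockOffset) throws IOException {
    if (splitStart < firstBlockOffset) {
      // The split starts inside the header: seek straight to the known
      // first-block boundary rather than scanning for a sync marker.
      reader.seek(firstBlockOffset);
    } else {
      // Otherwise scan forward for the next sync marker, as today.
      reader.sync(splitStart);
    }
  }
}
{code}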
                

        

[jira] [Commented] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192387#comment-13192387 ] 

Doug Cutting commented on AVRO-986:
-----------------------------------

share/test/data/syncInMeta.avro was just where I placed the quickstop.db file already attached to this issue.
                

        

[jira] [Updated] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.

Posted by "Douglas Creager (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Douglas Creager updated AVRO-986:
---------------------------------

    Attachment: 0001-avromod-utility.patch

Here's a patch that adds a new "avromod" command-line utility.  It can be used as a fixup script to remove the avro.sync field from the header (once I commit Michael's patch).  It's also useful in its own right since you can create copies of Avro files with different compression codecs and block sizes.  Eventually, we can also add options for changing the schema of the data in the file.
                