Posted to dev@avro.apache.org by "Catalin Alexandru Zamfir (JIRA)" <ji...@apache.org> on 2012/05/16 16:37:03 UTC

[jira] [Created] (AVRO-1093) DataFileWriter, appendEncoded causes AvroRuntimeException when read back

Catalin Alexandru Zamfir created AVRO-1093:
----------------------------------------------

             Summary: DataFileWriter, appendEncoded causes AvroRuntimeException when read back
                 Key: AVRO-1093
                 URL: https://issues.apache.org/jira/browse/AVRO-1093
             Project: Avro
          Issue Type: Bug
    Affects Versions: 1.6.3
            Reporter: Catalin Alexandru Zamfir


We're doing this:
{code}
// Check
if (!objRecordsBuffer.containsKey (objShardPath)) {
	// Set
	objRecordsBuffer.put (objShardPath, new ByteBufferOutputStream ());
}

// Set
Encoder objEncoder = EncoderFactory.get ()
	.binaryEncoder (objRecordsBuffer.get (objShardPath), null);

// Write
objGenericDatumWriter.write (objRecordConstructor.build (), objEncoder);
objEncoder.flush ();

// For
for (ByteBuffer objRecord : objRecordsBuffer.get (objKey).getBufferList ()) {
	// Append
	objRecordWriter.appendEncoded (objRecord);
}

// Erase
objRecordWriter.flush ();
objRecordWriter.close ();
{code}

It writes the data to HDFS. Reading it back produces the following exception:
{code}
Caused by: org.apache.avro.AvroRuntimeException: java.io.IOException: Block read partially, the data may be corrupt
        at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
        at net.RnD.FileUtils.TimestampedReader.hasNext(TimestampedReader.java:113)
        at net.RnD.Hadoop.App.read1BAvros(App.java:131)
        at net.RnD.Hadoop.App.executeCode(App.java:534)
        at net.RnD.Hadoop.App.main(App.java:453)
        ... 5 more
Caused by: java.io.IOException: Block read partially, the data may be corrupt
        at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:194)
        ... 9 more
{code}

The objRecordWriter is an instance obtained from DataFileWriter.create or DataFileWriter.appendTo (SeekableInput). This is related to ticket AVRO-1090.

Instead of keeping big "hashmaps" in memory, we've decided to serialize the data into in-memory "byte buffers", because it's faster. Although "appendEncoded" does seem to write something to HDFS, reading the data back produces this error.

Help would be appreciated. I've looked at appendEncoded in DataFileWriter but could not figure out whether it's our job to add a sync marker or whether appendEncoded does that for us.

Must the "ByteBuffer" we pass in contain exactly one record?
Examples and documentation for this method would be welcome.
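
On the question of buffer length, a likely explanation is sketched below. Avro's ByteBufferOutputStream appears to store its bytes in fixed-size chunks, and getBufferList () returns those chunks rather than one buffer per record, so the loop above would hand appendEncoded buffers that start and end in the middle of records. The class below is a hypothetical stand-in written only with the JDK (ChunkBoundarySketch and chunk are made-up names, and the chunk size of 8 is arbitrary); it is not Avro code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Toy chunked buffer for illustration only; NOT Avro's ByteBufferOutputStream. */
final class ChunkBoundarySketch {

    /** Splits a byte stream into fixed-size chunks, ignoring record boundaries. */
    static List<ByteBuffer> chunk(byte[] stream, int chunkSize) {
        List<ByteBuffer> chunks = new ArrayList<>();
        for (int start = 0; start < stream.length; start += chunkSize) {
            int length = Math.min(chunkSize, stream.length - start);
            chunks.add(ByteBuffer.wrap(stream, start, length).slice());
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Three hypothetical 5-byte records written back to back (15 bytes in total).
        byte[] stream = "alphabravodelta".getBytes(StandardCharsets.US_ASCII);

        // Assumption: the buffer hands its contents back in 8-byte chunks.
        List<ByteBuffer> chunks = chunk(stream, 8);

        System.out.println("records written: 3, chunks returned: " + chunks.size());
        for (ByteBuffer piece : chunks) {
            System.out.println("chunk of " + piece.remaining() + " bytes");
        }
    }
}
```

In this model three records come back as two chunks, and neither chunk ends on a record boundary, so a writer that treats each appended buffer as one record would store a wrong record count.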

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (AVRO-1093) DataFileWriter, appendEncoded causes AvroRuntimeException when read back

Posted by "Catalin Alexandru Zamfir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Catalin Alexandru Zamfir updated AVRO-1093:
-------------------------------------------

    

[jira] [Updated] (AVRO-1093) DataFileWriter, appendEncoded causes AvroRuntimeException when read back

Posted by "Catalin Alexandru Zamfir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Catalin Alexandru Zamfir updated AVRO-1093:
-------------------------------------------

    Affects Version/s: 1.7.0
    

[jira] [Resolved] (AVRO-1093) DataFileWriter, appendEncoded causes AvroRuntimeException when read back

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting resolved AVRO-1093.
--------------------------------

    Resolution: Invalid
    

[jira] [Commented] (AVRO-1093) DataFileWriter, appendEncoded causes AvroRuntimeException when read back

Posted by "Catalin Alexandru Zamfir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277779#comment-13277779 ] 

Catalin Alexandru Zamfir commented on AVRO-1093:
------------------------------------------------

Please close this as invalid; it is fixed. Instead of "ByteBufferOutputStream", use "ByteArrayOutputStream". After you flush the encoder (e.g. objEncoder.flush ()), MAKE SURE TO CALL ByteArrayOutputStream.close ()! Then, for each BAOS in memory, just call appendEncoded when you want to write it out.

{code}
for (ByteArrayOutputStream objStream : objRecordsBuffer.get (objKey)) {
	// Wrap the bytes of one encoded record and append them as a single datum
	ByteBuffer objBytes = ByteBuffer.wrap (objStream.toByteArray ());
	objRecordWriter.appendEncoded (objBytes);
}
{code}

Writing directly to a ByteArrayOutputStream through EncoderFactory.get ().binaryEncoder makes sure no record objects stay in memory; they're serialized as soon as possible. To use appendEncoded, you wrap the contents of the BAOS in a ByteBuffer and append it to the file.

We first tried to write more than one record per BAOS, but that did not work: we could not read the data back.
We then wrote one record per BAOS, closing the stream after each write. That worked, and the data could be read back properly.
I don't know why we can't write more than one record per BAOS.
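
A possible answer to that last question, offered as a sketch rather than a description of Avro's internals: appendEncoded is documented as appending one pre-encoded datum, and the writer appears to count one record per call. If a buffer holds two records, the block then declares fewer records than it contains, and a reader that decodes the declared number of records finds bytes left over, which would match the "Block read partially" message. (ByteArrayOutputStream.close () is documented as having no effect, so the switch to one record per buffer is probably what fixed the problem.) The class below models this with the JDK only; BlockCountSketch, its method names, the fixed-width int header and the writeUTF record encoding are all assumptions made for illustration, and Avro's real container format differs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of a counted block; NOT Avro's real file format or code. */
final class BlockCountSketch {

    /** Encodes one self-delimiting record (hypothetical stand-in for an Avro datum). */
    static byte[] encode(String value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeUTF(value);
        return bytes.toByteArray();
    }

    /** Concatenates several encoded records into a single buffer. */
    static byte[] concat(byte[]... parts) {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            joined.write(part, 0, part.length);
        }
        return joined.toByteArray();
    }

    /** Builds a block in which every appended buffer counts as exactly one record. */
    static byte[] writeBlock(List<byte[]> appended) throws IOException {
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        int recordCount = 0;
        for (byte[] datum : appended) {
            payload.write(datum, 0, datum.length);
            recordCount++; // one call, one record, whatever the buffer really holds
        }
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(block);
        out.writeInt(recordCount);    // header: declared number of records
        out.writeInt(payload.size()); // header: payload length in bytes
        payload.writeTo(out);
        out.flush();
        return block.toByteArray();
    }

    /** Decodes the declared number of records, then checks the payload is used up. */
    static List<String> readBlock(byte[] block) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(block));
        int recordCount = in.readInt();
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        DataInputStream data = new DataInputStream(new ByteArrayInputStream(payload));
        List<String> records = new ArrayList<>();
        for (int i = 0; i < recordCount; i++) {
            records.add(data.readUTF());
        }
        if (data.available() > 0) {
            throw new IOException("Block read partially, the data may be corrupt");
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[] first = encode("first");
        byte[] second = encode("second");

        // One record per appended buffer: the declared count matches the payload.
        System.out.println(readBlock(writeBlock(Arrays.asList(first, second))));

        // Two records in one appended buffer: the count says 1, so bytes are left over.
        try {
            readBlock(writeBlock(Collections.singletonList(concat(first, second))));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions, batching several records into one buffer only works if the writer is told how many records the buffer holds, which a single appendEncoded call does not convey.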
                

[jira] [Updated] (AVRO-1093) DataFileWriter, appendEncoded causes AvroRuntimeException when read back

Posted by "Catalin Alexandru Zamfir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Catalin Alexandru Zamfir updated AVRO-1093:
-------------------------------------------

    Description: 
The files are being created, as this listing shows:
{code}
-rw-r--r--   3 root supergroup  124901360 2012-05-17 10:09 /Streams/Timestamped/Threads/2012/05/17/10/09/Shard.avro
-rw-r--r--   3 root supergroup  124845625 2012-05-17 10:10 /Streams/Timestamped/Threads/2012/05/17/10/10/Shard.avro
-rw-r--r--   3 root supergroup   62378307 2012-05-17 10:11 /Streams/Timestamped/Threads/2012/05/17/10/11/Shard.avro
{code}
