
Posted to user@pig.apache.org by Markus Resch <ma...@adtech.de> on 2012/04/03 10:28:29 UTC

Sync Marker Issue while reading AVRO files written with FLUME with PIG

Hey everyone,

we're facing a problem while reading AVRO files that FLUME wrote into a
HADOOP cluster using the AVRO Java API 1.5.4. The Avro Data Store
complains about a missing sync marker. Investigating the problem shows
that the complaint is perfectly right: the sync marker is missing, so we
end up with a block of double the size.

Our software packages:
 rpm -qa | grep hadoop
hadoop-0.20-namenode-0.20.2+923.142-1
hadoop-0.20-0.20.2+923.142-1
hadoop-0.20-native-0.20.2+923.142-1
hadoop-hive-0.7.1+42.27-2
hadoop-pig-0.8.1+28.18-1

This is pretty much a basic Cloudera CDH3 Update 2 package installation
with a patched PIG version, which is the one from CDH3 Update 3.

Has anyone had a similar issue? Does this ring a bell?

Thanks

Markus

Re: Sync Marker Issue while reading AVRO files written with FLUME with PIG

Posted by Scott Carey <sc...@apache.org>.

I have not seen this issue before with 100 TB of Avro files, but am not
using Flume to write them.  We have moved on to Avro 1.6.x but were on the
1.5.x line for quite some time.  Perhaps while writing there was an
exception of some sort that was not handled correctly in Avro or Flume.

Looking at the DataFileWriter code, I can see how a file could get
truncated without a sync marker if the writing process crashes, but not
how it could successfully write two blocks in a row without a sync between.

You should be able to modify the file reader to recover and re-write the
data if it is only a missing sync marker, or skip over the block if it is
corrupt.
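
To make the recovery idea concrete, here is a rough Python sketch that
walks the blocks of a container file and flags any block that is not
followed by the sync marker. It is not Avro's reader code and uses no
Avro library calls. It only relies on the block layout from the Avro
specification: record count (long), byte size (long), the serialized
records, then the 16-byte sync marker. The names scan_blocks and
make_block are made up for illustration, header parsing is left out
(the sketch assumes you already have the sync marker from the header),
and the "rewind and treat the bytes as the next block" step is a
heuristic that only fits the case where the marker is simply missing.

```python
import io

SYNC_SIZE = 16  # Avro container files use a 16-byte sync marker


def write_long(value: int) -> bytes:
    """Encode an int as an Avro long (zigzag, then base-128 varint)."""
    n = (value << 1) ^ (value >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def read_long(stream) -> int:
    """Decode one Avro long from a binary stream."""
    n = 0
    shift = 0
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("stream ended inside a long")
        n |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            break
        shift += 7
    return (n >> 1) ^ -(n & 1)


def make_block(count: int, data: bytes, sync: bytes = b"") -> bytes:
    """Hypothetical helper: build one block, optionally without its marker."""
    return write_long(count) + write_long(len(data)) + data + sync


def scan_blocks(body: bytes, sync: bytes):
    """Yield (offset, count, size, sync_ok) for each block after the header."""
    stream = io.BytesIO(body)
    while stream.tell() < len(body):
        offset = stream.tell()
        count = read_long(stream)
        size = read_long(stream)
        if count < 0 or size < 0 or stream.tell() + size > len(body):
            raise ValueError(f"implausible block header at offset {offset}")
        stream.seek(size, io.SEEK_CUR)  # skip the serialized records
        marker = stream.read(SYNC_SIZE)
        sync_ok = marker == sync
        yield offset, count, size, sync_ok
        if len(marker) < SYNC_SIZE:
            return  # truncated tail, nothing more to read
        if not sync_ok:
            # Assumption: the marker is merely missing, so the bytes we just
            # read are really the start of the next block. Rewind and go on.
            stream.seek(-SYNC_SIZE, io.SEEK_CUR)


# Two blocks written back to back without a marker, then a normal block.
sync = bytes(range(16))
body = (make_block(2, b"aaaa")
        + make_block(1, b"bbbbbb", sync)
        + make_block(3, b"cc", sync))
report = list(scan_blocks(body, sync))
for offset, count, size, sync_ok in report:
    print(offset, count, size, "ok" if sync_ok else "MISSING SYNC")
```

If the block itself is corrupt rather than just missing its marker, the
rewind step will read garbage as a block header. In that case searching
forward for the next occurrence of the sync bytes and resuming there is
the safer choice, at the cost of losing the records in between.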
