Posted to user@avro.apache.org by Terry Healy <th...@bnl.gov> on 2013/01/14 20:22:16 UTC

Possible to include open .avro file in Map/Reduce job?

I have a log collection application that writes .avro files within HDFS.
Ideally I would like to include the current day's file (still open for
append) as one of the input files for a periodic M/R job.

I tried this but the Map job exited in error with the dreaded "Invalid
Sync!" IOException. I guess I should have expected this, but is there a
reasonable way around it? Can I catch the exception and just exit the
map at that point?

All suggestions appreciated.

-Terry

Re: Possible to include open .avro file in Map/Reduce job?

Posted by Terry Healy <th...@bnl.gov>.
Thanks Doug.

In this case I could roll the logs more frequently, but then I have to go
back at some point and recombine the small files. For now, I can live
with moving the files daily.

I was unable to find a way to trap the "Invalid sync" error
(org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid
sync! at
org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)).

Since my mapper extends AvroMapper, and map() just declares the
exceptions rather than handling them, I don't know where to trap them.
Another person suggested using low-level Avro functions for this.
Perhaps I need to write an Avro file validator of some sort to be run
before the Map/Reduce job? This seems nasty. But I had another M/R job
failure for this error overnight, and even finding the offending file
via the logs is quite a pain.
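For what it's worth, a pre-flight validator along those lines can be fairly small. This is only a sketch: the class and method names (AvroFileValidator, isReadable) are made up for illustration, and it reads a local File, so for files in HDFS you would open an org.apache.avro.mapred.FsInput instead.

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Sketch of a pre-flight check: returns false for any file that cannot be
// decoded end-to-end, e.g. a still-open file with a partial trailing block.
public class AvroFileValidator {
  public static boolean isReadable(File file) {
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<GenericRecord>(
                 file, new GenericDatumReader<GenericRecord>())) {
      while (reader.hasNext()) {
        reader.next(); // force every block to actually be decoded
      }
      return true;
    } catch (IOException | RuntimeException e) {
      // "Invalid sync!" surfaces here as an AvroRuntimeException from hasNext()
      return false;
    }
  }
}
```

Only the files that pass the check would then be added as inputs to the job.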

Any suggestions?

-Terry

On 01/17/2013 04:36 PM, Doug Cutting wrote:
> Folks often move files once they're closed into a directory where
> they're processed to avoid issues with partially written data.  Maybe
> you could start a new log file every hour rather than every day?
> 
> We could add an ignoreTruncation or ignoreCorruption option to
> DataFileReader that attempts to read files that might be truncated or
> corrupted.
> 
> And yes, you can probably just catch those exceptions and exit the map
> at that point.
> 
> Doug
> 
> On Mon, Jan 14, 2013 at 11:22 AM, Terry Healy <th...@bnl.gov> wrote:
>> I have a log collection application that writes .avro files within HDFS.
>> Ideally I would like to include the current day's file (still open for
>> append) as one of the input files for a periodic M/R job.
>>
>> I tried this but the Map job exited in error with the dreaded "Invalid
>> Sync!" IOException. I guess I should have expected this, but is there a
>> reasonable way around it? Can I catch the exception and just exit the
>> map at that point?
>>
>> All suggestions appreciated.
>>
>> -Terry

Re: Possible to include open .avro file in Map/Reduce job?

Posted by Doug Cutting <cu...@apache.org>.
Folks often move files once they're closed into a directory where
they're processed to avoid issues with partially written data.  Maybe
you could start a new log file every hour rather than every day?
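Rolling hourly mostly comes down to deriving the output path from the clock. A minimal sketch, where the class name and the /logs/ layout are just examples, not anything from Avro or Hadoop:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch of hourly rolling: name each file after the hour it was started in,
// so the previous hour's file is always closed and safe to hand to M/R.
public class HourlyRoller {
  private static final SimpleDateFormat FMT = new SimpleDateFormat("yyyy/MM/dd/HH");
  static {
    FMT.setTimeZone(TimeZone.getTimeZone("UTC"));
  }

  public static String pathFor(Date now) {
    return "/logs/" + FMT.format(now) + ".avro";
  }
}
```

A writer thread would compare pathFor(new Date()) against the file it currently has open, and close/reopen its DataFileWriter whenever the path changes.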

We could add an ignoreTruncation or ignoreCorruption option to
DataFileReader that attempts to read files that might be truncated or
corrupted.

And yes, you can probably just catch those exceptions and exit the map
at that point.
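The "ignore truncation" idea might look something like the sketch below: decode records until the first failure and keep what was read so far, instead of aborting the whole read. The class name is made up; this is not part of Avro today.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Sketch: read every record that precedes the first decode failure, so a
// still-open file's complete blocks are salvaged and only the partial
// trailing block is dropped.
public class TolerantAvroReader {
  public static List<GenericRecord> readAvailable(InputStream in)
      throws IOException {
    List<GenericRecord> records = new ArrayList<GenericRecord>();
    DataFileStream<GenericRecord> stream =
        new DataFileStream<GenericRecord>(
            in, new GenericDatumReader<GenericRecord>());
    try {
      while (stream.hasNext()) {
        records.add(stream.next());
      }
    } catch (RuntimeException e) {
      // likely a partial trailing block in a still-open file; stop here
    } finally {
      stream.close();
    }
    return records;
  }
}
```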

Doug

On Mon, Jan 14, 2013 at 11:22 AM, Terry Healy <th...@bnl.gov> wrote:
> I have a log collection application that writes .avro files within HDFS.
> Ideally I would like to include the current day's file (still open for
> append) as one of the input files for a periodic M/R job.
>
> I tried this but the Map job exited in error with the dreaded "Invalid
> Sync!" IOException. I guess I should have expected this, but is there a
> reasonable way around it? Can I catch the exception and just exit the
> map at that point?
>
> All suggestions appreciated.
>
> -Terry