Posted to user@avro.apache.org by Terry Healy <th...@bnl.gov> on 2013/01/03 21:36:52 UTC

Appending to .avro log files

Hello-

I'm upgrading a logging program to append GenericRecords to a .avro file
instead of text (.tsv). I have a working schema that is used to convert
existing .tsv of the same format into .avro and that works fine.

When I run a test writing 30,000 bogus records, it runs but when I try
to use "avro-tools-1.7.3.jar tojson" on the output file, it reports:

"AvroRuntimeException: java.io.IOException: Invalid sync!"

The file is still open at this point since the logging program is
running. Is this expected behavior because it is still open? (getmeta
and getschema work fine).

I'm not sure if it has any bearing, since I never really understood the
function of the Avro sync interval; in this and the working programs
it is set to 1000000.

Any ideas appreciated.

-Terry

Re: Appending to .avro log files

Posted by Terry Healy <th...@bnl.gov>.

Thank you Scott - that did the trick. It seems that I may need to reduce
my sync value as well.


On 01/08/2013 04:14 AM, Scott Carey wrote:
> A sync marker delimits each block in the avro file.  If you want to start
> reading data from the middle of a 100GB file, DataFileReader will seek to
> the middle and find the next sync marker.  Each block can be individually
> compressed, and by default when writing a file the writer will not
> compress the block and flush to disk until a block has gotten as large as
> the sync interval in bytes.  Alternatively, you can manually sync().
> 
> If you have a 1000000 byte sync interval, you may not see any data reach
> disk until that many bytes have been written (or sync() is called
> manually).
> 
> Your problem is likely that the first block in the file has not been
> flushed to disk yet, and therefore the file on disk is incomplete and
> missing a trailing sync marker.

Re: Appending to .avro log files

Posted by Scott Carey <sc...@apache.org>.

A sync marker delimits each block in the avro file.  If you want to start
reading data from the middle of a 100GB file, DataFileReader will seek to
the middle and find the next sync marker.  Each block can be individually
compressed, and by default when writing a file the writer will not
compress the block and flush to disk until a block has gotten as large as
the sync interval in bytes.  Alternatively, you can manually sync().

If you have a 1000000 byte sync interval, you may not see any data reach
disk until that many bytes have been written (or sync() is called
manually).
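
To make the buffering concrete, here is a toy model in Python. It is a
sketch, not the real Avro container format: ToyBlockWriter, its fields,
and the 19-byte "bogus" records are all made up for illustration. It only
mimics the rule described above: records collect in memory, and a block
plus its 16-byte sync marker is written out once the buffer reaches the
sync interval, or when sync() is called by hand.

```python
import io
import os

SYNC_SIZE = 16  # Avro sync markers are 16 bytes long


class ToyBlockWriter:
    """Toy model of block buffering; NOT the real Avro file format."""

    def __init__(self, out, sync_interval):
        self.out = out
        self.sync_interval = sync_interval
        self.sync_marker = os.urandom(SYNC_SIZE)  # random marker, one per file
        self.pending = bytearray()                # records not yet written out

    def append(self, record):
        self.pending += record
        # A block is only emitted once the buffer reaches the interval.
        if len(self.pending) >= self.sync_interval:
            self.sync()

    def sync(self):
        # End the current block: write its bytes, then the sync marker.
        if self.pending:
            self.out.write(bytes(self.pending))
            self.out.write(self.sync_marker)
            self.pending.clear()


out = io.BytesIO()
writer = ToyBlockWriter(out, sync_interval=1_000_000)
for i in range(30_000):
    writer.append(b"bogus record %05d\n" % i)  # 19 bytes per record

bytes_before_sync = len(out.getvalue())  # nothing has left the buffer yet
writer.sync()
bytes_after_sync = len(out.getvalue())   # one block plus one sync marker
print(bytes_before_sync, bytes_after_sync)
```

In this model 30,000 records add up to 570,000 bytes, which is below the
1,000,000 byte interval, so no block is written until sync() is called.
In the Java API the corresponding calls should be
DataFileWriter.setSyncInterval(), sync(), and flush(); flush() is the one
that also flushes the underlying stream. The real writer emits the file
header when the file is created, which would be consistent with getmeta
and getschema working on the open file.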

Your problem is likely that the first block in the file has not been
flushed to disk yet, and therefore the file on disk is incomplete and
missing a trailing sync marker.
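
The reader side can be sketched the same way. This is again a toy layout
and not the Avro format: write_block and read_blocks are hypothetical
helpers, and the 4-byte length prefix is an assumption made to keep the
example short. The point is only that a reader which expects a sync
marker after every block will fail on a file whose last block has been
partially written.

```python
import io
import os
import struct

SYNC_SIZE = 16  # Avro sync markers are 16 bytes long


def write_block(out, payload, marker):
    """Toy layout: 4-byte length, payload bytes, then the sync marker."""
    out.write(struct.pack(">I", len(payload)))
    out.write(payload)
    out.write(marker)


def read_blocks(data, marker):
    """Return all payloads; raise IOError if a block is cut short."""
    src = io.BytesIO(data)
    payloads = []
    while True:
        prefix = src.read(4)
        if not prefix:
            return payloads  # clean end of file
        if len(prefix) < 4:
            raise IOError("Truncated block header")
        (length,) = struct.unpack(">I", prefix)
        payload = src.read(length)
        # Every block must be followed by the file's sync marker.
        if src.read(SYNC_SIZE) != marker:
            raise IOError("Invalid sync!")
        payloads.append(payload)


marker = os.urandom(SYNC_SIZE)
buf = io.BytesIO()
write_block(buf, b"first block", marker)
write_block(buf, b"second block", marker)

complete = buf.getvalue()
partial = complete[:-10]  # simulate a writer that has not finished flushing

complete_result = read_blocks(complete, marker)
try:
    read_blocks(partial, marker)
    partial_error = None
except IOError as exc:
    partial_error = str(exc)
print(complete_result, partial_error)
```

The complete buffer reads back both blocks, while the truncated copy
fails at the second block's marker. Reading the file only after the
writer has called flush() or close() avoids this situation.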