Posted to user@avro.apache.org by ๏̯͡๏ <ÐΞ€ρ@Ҝ>, de...@gmail.com on 2013/12/21 13:35:03 UTC

Avro Read with sync() {java.io.IOException: Invalid sync}

Hello,
I have a 340 MB Avro data file that contains records sorted and identified
by a unique id (duplicate records exist). At the beginning of every unique
record a synchronization point is created with DataFileWriter.sync(). (I
cannot, or do not want to, save the sync points, and I do not want to use
SortedKeyValueFile as the output format for the M/R job.)

There are at least 25k synchronization points in the 340 MB file.

Ex:
Marker1_RecordA1_RecordA2_RecordA3_Marker2_RecordB1_RecordB2


As the records are sorted, a binary search is performed for efficient
retrieval, using the attached code.
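Since the attachment is not preserved in the archive, here is a rough sketch of such a binary search. The sync-point layout is modeled with plain arrays so the logic is self-contained and runnable; `searchStart`, `blockAtOrAfter`, and the array model are illustrative names, not Avro API. A real implementation would call `fileReader.sync(mid)` and peek at the first record of the block instead.

```java
// Illustrative sketch only -- models DataFileReader.sync() with arrays.
public class SyncBinarySearch {

    // Model of the file: syncOffsets[i] is the byte offset of the i-th
    // sync marker, ids[i] the unique id of the records that follow it.
    // Records were written in id order, so both arrays are sorted.

    // Models "fileReader.sync(pos)": Avro scans forward from pos to the
    // next sync marker; here that is the first offset >= pos.
    static int blockAtOrAfter(long[] syncOffsets, long pos) {
        int i = 0;
        while (i < syncOffsets.length && syncOffsets[i] < pos) i++;
        return i;                      // == length models hasNext() == false
    }

    // Byte position from which a forward scan is guaranteed to reach the
    // FIRST record with targetId (the strict '<' keeps duplicates that
    // begin in an earlier block from being skipped).
    static long searchStart(long[] syncOffsets, String[] ids,
                            long fileLen, String targetId) {
        long lo = 0, hi = fileLen;
        while (lo < hi) {
            long mid = lo + (hi - lo + 1) / 2; // bias up so lo = mid progresses
            int b = blockAtOrAfter(syncOffsets, mid);
            if (b < ids.length && ids[b].compareTo(targetId) < 0) {
                lo = mid;              // block reached from mid is still before target
            } else {
                hi = mid - 1;          // at/past target (or EOF): search left half
            }
        }
        return lo;                     // real code: sync(lo), then scan forward
    }

    public static void main(String[] args) {
        long[] off = {0, 100, 200};
        String[] ids = {"aa", "bb", "cc"};
        System.out.println(searchStart(off, ids, 300, "cc")); // prints 100
    }
}
```

In the real reader, `searchStart` would drive `fileReader.sync(mid)` followed by `fileReader.hasNext()` and a peek at the first id; the strict comparison means the forward scan may start one block early, which is cheap compared to a wrong answer when duplicate ids span blocks.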

Most of the time the search is successful, but at times the code throws the
following exception:
------
org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
------



Questions
1) Is it OK to have 25k sync points in a 340 MB file? Does it hurt read
performance?
2) I note down the position that was used to invoke fileReader.sync(mid).
If I catch the AvroRuntimeException, close and reopen the file, and call
sync(mid) again, I do not see the exception. Why does Avro throw the
exception the first time but not the second?
3) Is there a limit on the number of times sync() can be invoked?
4) When sync(position) is invoked, is any 0 <= position <= file.size()
valid? If yes, why do I see the AvroRuntimeException (#2)?

Regards,
Deepak

Re: Avro Read with sync() {java.io.IOException: Invalid sync}

Posted by ๏̯͡๏ <ÐΞ€ρ@Ҝ>, de...@gmail.com.
Any suggestions?



-- 
Deepak

Re: Avro Read with sync() {java.io.IOException: Invalid sync}

Posted by Doug Cutting <cu...@apache.org>.
Yes, Avro. A similar bug may exist in Avro's input buffering code.

Doug

Re: Avro Read with sync() {java.io.IOException: Invalid sync}

Posted by ๏̯͡๏ <ÐΞ€ρ@Ҝ>, de...@gmail.com.
Hi Doug,
Do you want me to raise a bug against Avro or Hadoop Core? My guess is Avro.
Regards,
Deepak


-- 
Deepak

Re: Avro Read with sync() {java.io.IOException: Invalid sync}

Posted by Doug Cutting <cu...@apache.org>.
This sounds like a bug.

I wonder if it is similar to a related bug in Hadoop?

https://issues.apache.org/jira/browse/HADOOP-9307

If so, please file an issue in Jira.

Doug
