Posted to common-issues@hadoop.apache.org by "Deepak Kumar V (JIRA)" <ji...@apache.org> on 2013/12/24 05:47:54 UTC

[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks

    [ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856136#comment-13856136 ] 

Deepak Kumar V commented on HADOOP-9307:
----------------------------------------

Doug pointed me here.

I see a similar error while reading an Avro file after performing a random number of seeks.


Details
=====
Hello,
I have a 340 MB avro data file that contains records sorted and identified by unique id (duplicate records exists). At the beginning of every unique record a synchronization point is created with DataFileWriter.sync(). (I cannot or do not want to save the sync points and i do not want to use SortedKeyValueFile as output format for M/R job)  

There are at least 25k synchronization points in the 340 MB file.

Ex:
Marker1_RecordA1_RecordA2_RecordA3_Marker2_RecordB1_RecordB2


As the records are sorted, a binary search is performed for efficient retrieval, using the attached code.
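The attached code is not included in this message; as a rough model of the approach (all names below are hypothetical, and sync_to_key is a toy stand-in for calling DataFileReader.sync(pos) and then reading the key of the next record), a binary search over byte offsets might look like:

```python
def binary_search_sync(file_size, sync_to_key, target):
    """Binary search over byte offsets of a sorted, sync-pointed file.

    sync_to_key(pos) models DataFileReader.sync(pos) followed by reading
    the key of the first record at or after offset pos; it returns None
    when pos lies past the last sync marker.
    Returns an offset from which sync() reaches the target key, or None.
    """
    lo, hi = 0, file_size
    found = None
    while lo < hi:
        mid = (lo + hi) // 2
        key = sync_to_key(mid)
        if key is None or key > target:
            hi = mid          # target, if present, starts before mid
        elif key < target:
            lo = mid + 1      # target starts strictly after this block
        else:
            found = mid       # hit; keep narrowing so the result is stable
            hi = mid
    return found


def make_sync_to_key(sync_offsets, keys):
    """Toy stand-in for a real file: keys[i] is the key of the record
    whose sync point is at byte offset sync_offsets[i]."""
    def sync_to_key(pos):
        for off, key in zip(sync_offsets, keys):
            if off >= pos:
                return key
        return None
    return sync_to_key
```

With sync points at offsets [0, 10, 25, 40, 60] holding keys [3, 7, 7, 12, 20] in an 80-byte file, searching for key 12 returns an offset from which sync() lands on that record, and searching for an absent key returns None.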

Most of the time the search is successful, but at times the code throws the following exception:
------
org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync! at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
------



Questions
1) Is it OK to have 25k sync points in a 340 MB file? Is there a performance cost while reading?
2) I note down the position that was used to invoke fileReader.sync(mid). If I catch the AvroRuntimeException, close and reopen the file, and call sync(mid) again, I do not see the exception. Why does Avro throw the exception the first time but not the second?
3) Is there a limit on the number of times sync() can be invoked?
4) When sync(position) is invoked, is any position with 0 <= position <= file.size() valid? If yes, why do I see the AvroRuntimeException in #2?
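On question 4, my understanding is that DataFileReader.sync(position) simply scans forward from position for the next occurrence of the file's 16-byte sync marker, so any position from 0 to the file length is acceptable input; syncing past the last marker just makes hasNext() return false rather than raising an error. A toy model of that scan (the marker bytes and block contents below are made up; a real Avro file stores a random 16-byte marker in its header):

```python
# Fixed 16-byte marker for determinism; real Avro markers are random per file.
SYNC_MARKER = bytes(range(16))

def next_sync(data, position):
    """Model of DataFileReader.sync(pos): return the offset just past the
    first sync marker at or after `position`, or len(data) if there is
    none (past-EOF is not an error)."""
    i = data.find(SYNC_MARKER, position)
    return len(data) if i < 0 else i + len(SYNC_MARKER)

# Layout mirrors the example above: marker, records, marker, records, ...
blob = b"block-A" + SYNC_MARKER + b"block-B" + SYNC_MARKER + b"block-C"
```

Under this model, syncing at any position inside a block lands at the start of the next block, which matches the documented behavior that readers resume at a block boundary.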

======

Some of the questions are irrelevant here.

Now that the patch has been committed, which version of hadoop-core will include this fix?

> BufferedFSInputStream.read returns wrong results after certain seeks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-9307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9307
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 1.1.1, 2.0.2-alpha
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: 3.0.0, 2.1.0-beta, 1.3.0
>
>         Attachments: hadoop-9307-branch-1.txt, hadoop-9307.txt
>
>
> After certain sequences of seek/read, BufferedFSInputStream can silently return data from the wrong part of the file. Further description in first comment below.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)