Posted to commits@beam.apache.org by "Łukasz Gajowy (JIRA)" <ji...@apache.org> on 2018/01/16 14:23:00 UTC

[jira] [Created] (BEAM-3484) HadoopInputFormatIO reads big datasets incorrectly

Łukasz Gajowy created BEAM-3484:
-----------------------------------

             Summary: HadoopInputFormatIO reads big datasets incorrectly
                 Key: BEAM-3484
                 URL: https://issues.apache.org/jira/browse/BEAM-3484
             Project: Beam
          Issue Type: Bug
          Components: beam-model, runner-dataflow
            Reporter: Łukasz Gajowy
            Assignee: Kenneth Knowles


For big datasets, HadoopInputFormatIO sometimes skips or duplicates elements from the database in the resulting PCollection, producing an incorrect read result.

I encountered this while developing HadoopInputFormatIOIT and running it on Dataflow. For datasets of 600 000 database rows or fewer I was not able to reproduce the issue; the bug appeared only for bigger sets, e.g. 700 000 or 1 000 000 rows.
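
For reference, here is a minimal sketch of the read-and-dump pipeline, in the spirit of the integration test but not the exact HadoopInputFormatIOIT code. The JDBC driver, connection string, table name and the TestRow class are hypothetical placeholders:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class HifioBigDatasetRepro {

  /** Hypothetical row wrapper: a single numeric "id" column, readable via JDBC and Hadoop serialization. */
  public static class TestRow implements Writable, DBWritable {
    private long id;
    @Override public void write(DataOutput out) throws IOException { out.writeLong(id); }
    @Override public void readFields(DataInput in) throws IOException { id = in.readLong(); }
    @Override public void write(PreparedStatement st) throws SQLException { st.setLong(1, id); }
    @Override public void readFields(ResultSet rs) throws SQLException { id = rs.getLong("id"); }
    @Override public String toString() { return Long.toString(id); }
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical JDBC settings; the integration test wires these to a real database.
    DBConfiguration.configureDB(conf, "org.postgresql.Driver",
        "jdbc:postgresql://localhost:5432/beam", "user", "password");
    conf.set(DBConfiguration.INPUT_TABLE_NAME_PROPERTY, "test_table");
    conf.setStrings(DBConfiguration.INPUT_FIELD_NAMES_PROPERTY, "id");
    conf.setClass(DBConfiguration.INPUT_CLASS_PROPERTY, TestRow.class, DBWritable.class);
    // Keys that HadoopInputFormatIO itself requires in the Configuration:
    conf.setClass("mapreduce.job.inputformat.class", DBInputFormat.class, InputFormat.class);
    conf.setClass("key.class", LongWritable.class, Object.class);
    conf.setClass("value.class", TestRow.class, Object.class);

    Pipeline p = Pipeline.create();
    p.apply(HadoopInputFormatIO.<LongWritable, TestRow>read().withConfiguration(conf))
        // Keep only the row id so that skipped or duplicated rows stand out.
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<LongWritable, TestRow> kv) -> kv.getValue().toString()))
        // A single unsharded file; once sorted, it can be diffed against the expected id sequence.
        .apply(TextIO.write().to("/tmp/hifio-read-result").withoutSharding());
    p.run().waitUntilFinish();
  }
}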

Attachments:
- a text file with the sorted HadoopInputFormat.read() result, saved using TextIO.write().to().withoutSharding(). Close inspection reveals duplicated and missing values that should not be there (a sketch of such a check follows this list).
- the same text file for 600 000 records, containing no duplicates or missing elements.
- a link to a PR with a HadoopInputFormatIO integration test that reproduces this issue. At the time of writing, this code is not yet merged.
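
For clarity, here is the kind of check the first attachment invites. This is only a sketch; it assumes each line of the file is a single numeric id and the file is sorted ascending:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CheckSortedOutput {
  public static void main(String[] args) throws IOException {
    List<String> lines = Files.readAllLines(Paths.get(args[0]));
    long expected = Long.parseLong(lines.get(0).trim());
    for (String line : lines) {
      long id = Long.parseLong(line.trim());
      if (id < expected) {
        // The same id was already seen: the read duplicated a row.
        System.out.println("duplicate id: " + id);
      } else {
        while (expected < id) {
          // A gap in the sequence: these rows were skipped by the read.
          System.out.println("missing id: " + expected++);
        }
        expected = id + 1;
      }
    }
  }
}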


