Posted to commits@beam.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/04/18 15:22:00 UTC

[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

     [ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92133&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92133 ]

ASF GitHub Bot logged work on BEAM-3484:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Apr/18 15:21
            Start Date: 18/Apr/18 15:21
    Worklog Time Spent: 10m 
      Work Description: aromanenko-dev opened a new pull request #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166
 
 
   When using DBInputFormat to fetch data from an RDBMS, Beam parallelises the read by adding LIMIT and OFFSET clauses to the SQL query, so that different workers fetch different ranges of records (splits). By default, an RDBMS does not guarantee a predictable order of results, and the order can differ between executions of the same query. As a consequence, some rows can be duplicated and others missing in the final result.
   To guarantee a stable order and a correct split of the results, the client must order them by one or more keys (either PRIMARY or UNIQUE). This can be done by setting the corresponding option in the Hadoop configuration, as sketched below.
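   
   For illustration, a minimal sketch of such a configuration (the JDBC driver, URL, table and column names below are made up, and `DBInputFormat.NullDBWritable` stands in for a real application-specific row class):
   
   ```java
   import org.apache.beam.sdk.Pipeline;
   import org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO;
   import org.apache.beam.sdk.options.PipelineOptionsFactory;
   import org.apache.beam.sdk.values.KV;
   import org.apache.beam.sdk.values.PCollection;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.mapreduce.InputFormat;
   import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
   import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
   import org.apache.hadoop.mapreduce.lib.db.DBWritable;
   
   public class OrderedDbReadSketch {
     public static void main(String[] args) {
       // Without an "orderby" setting, each split's generated
       // "SELECT ... LIMIT n OFFSET m" query may see rows in a different
       // order, so splits can overlap or miss rows.
       Configuration conf = new Configuration();
       conf.set(DBConfiguration.DRIVER_CLASS_PROPERTY, "org.postgresql.Driver");
       conf.set(DBConfiguration.URL_PROPERTY, "jdbc:postgresql://localhost:5432/beam_db");
       conf.set(DBConfiguration.INPUT_TABLE_NAME_PROPERTY, "test_table");
       conf.set(DBConfiguration.INPUT_FIELD_NAMES_PROPERTY, "id,name");
       // The fix itself: order rows by a PRIMARY/UNIQUE key so every split
       // query runs against the same, stable ordering
       // (property "mapreduce.jdbc.input.orderby").
       conf.set(DBConfiguration.INPUT_ORDER_BY_PROPERTY, "id");
       // Value class used by DBInputFormat itself; a real job would supply
       // its own DBWritable implementation here.
       conf.setClass(DBConfiguration.INPUT_CLASS_PROPERTY,
           DBInputFormat.NullDBWritable.class, DBWritable.class);
       // Classes HadoopInputFormatIO needs to build the source and infer coders.
       conf.setClass("mapreduce.job.inputformat.class", DBInputFormat.class, InputFormat.class);
       conf.setClass("key.class", LongWritable.class, Object.class);
       conf.setClass("value.class", DBInputFormat.NullDBWritable.class, Object.class);
   
       // Read via HadoopInputFormatIO using the configuration built above.
       Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
       PCollection<KV<LongWritable, DBInputFormat.NullDBWritable>> rows =
           p.apply(HadoopInputFormatIO.<LongWritable, DBInputFormat.NullDBWritable>read()
               .withConfiguration(conf));
       p.run().waitUntilFinish();
     }
   }
   ```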
   
   ------------------------
   
   Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [x] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it).  Trivial changes like typos do not require a JIRA issue.  Your pull request should address just this issue, without pulling in other changes.
    - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
    - [x] Write a pull request description that is detailed enough to understand:
      - [x] What the pull request does
      - [x] Why it does it
      - [x] How it does it
      - [x] Why this approach
    - [x] Each commit in the pull request should have a meaningful subject line and body.
    - [x] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 92133)
            Time Spent: 10m
    Remaining Estimate: 0h

> HadoopInputFormatIO reads big datasets invalid
> ----------------------------------------------
>
>                 Key: BEAM-3484
>                 URL: https://issues.apache.org/jira/browse/BEAM-3484
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-hadoop
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Łukasz Gajowy
>            Assignee: Alexey Romanenko
>            Priority: Minor
>             Fix For: 2.5.0
>
>         Attachments: result_sorted1000000, result_sorted600000
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> For big datasets, HadoopInputFormat sometimes skips or duplicates elements from the database in the resulting PCollection, which gives an incorrect read result.
> This occurred to me while developing HadoopInputFormatIOIT and running it on Dataflow. For datasets of 600 000 database rows or fewer I wasn't able to reproduce the issue; the bug appeared only for bigger sets, e.g. 700 000 or 1 000 000 rows.
> Attachments:
>   - a text file with the sorted HadoopInputFormat.read() result, saved using TextIO.write().to().withoutSharding(); if you look carefully you'll notice duplicates and missing values that should not be there
>   - the same text file for 600 000 records, which has no duplicates or missing elements
>   - a link to a PR with a HadoopInputFormatIO integration test that allows reproducing this issue. At the time of writing, this code is not merged yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)