
Posted to issues@beam.apache.org by "Luke Cwik (Jira)" <ji...@apache.org> on 2020/01/28 01:54:00 UTC

[jira] [Commented] (BEAM-4735) Make HBaseIO.read() based on SDF

    [ https://issues.apache.org/jira/browse/BEAM-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024812#comment-17024812 ] 

Luke Cwik commented on BEAM-4735:
---------------------------------

I noticed a bug in the `@SplitRestriction` method: the range input parameter is not used to restrict the splitRanges that are returned. If multiple rounds of splitting happen, `@SplitRestriction` may be invoked multiple times, once for each split. Because every invocation ignores its input range, the same ranges could be returned each time, leading to duplication of work.

 

https://github.com/apache/beam/blob/0a37f19e274b9d766f9eee2228460226c81b6b7c/sdks/java/io/hbase/src/main/java/org/apache/beam/sdk/io/hbase/HBaseReadSplittableDoFn.java#L87
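
To illustrate the problem described above, here is a minimal sketch in plain Java. It is not the HBaseIO code and does not use the Beam API: the `Range` class, the `long` keys, and both method names are hypothetical stand-ins for a byte-key range and a `@SplitRestriction` method. It only shows the difference between returning every region range and returning the region ranges clipped to the input restriction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitRestrictionSketch {

  /** Hypothetical half-open range [start, end) over long keys (stand-in for a byte-key range). */
  static final class Range {
    final long start;
    final long end;

    Range(long start, long end) {
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + start + ", " + end + ")";
    }
  }

  /** Problematic shape: the restriction is ignored, so every call returns all region ranges. */
  static List<Range> splitIgnoringRestriction(Range restriction, List<Range> regions) {
    return new ArrayList<>(regions);
  }

  /** Assumed fix: keep only the part of each region range that overlaps the restriction. */
  static List<Range> splitWithinRestriction(Range restriction, List<Range> regions) {
    List<Range> result = new ArrayList<>();
    for (Range region : regions) {
      long start = Math.max(region.start, restriction.start);
      long end = Math.min(region.end, restriction.end);
      if (start < end) { // skip regions that do not overlap the restriction
        result.add(new Range(start, end));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Range> regions =
        Arrays.asList(new Range(0, 10), new Range(10, 20), new Range(20, 30));
    Range restriction = new Range(5, 15);

    // All three regions come back even though the restriction covers only [5, 15).
    System.out.println(splitIgnoringRestriction(restriction, regions));
    // Only the overlapping pieces come back: [5, 10) and [10, 15).
    System.out.println(splitWithinRestriction(restriction, regions));
  }
}
```

With the clipped version, splitting an already-split restriction a second time can only return sub-ranges of that restriction, so repeated invocations do not hand out the same keys twice.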

> Make HBaseIO.read() based on SDF
> --------------------------------
>
>                 Key: BEAM-4735
>                 URL: https://issues.apache.org/jira/browse/BEAM-4735
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-hbase
>            Reporter: Ismaël Mejía
>            Priority: Minor
>
> BEAM-4020 introduces HBaseIO reads based on SDF. So far the read() method still uses the Source based API for two reasons:
> 1. Most distributed runners don't support Bounded SDF today.
> 2. SDF does not support Dynamic Work Rebalancing but the Source API of HBase already supports it so changing it means losing some functionality.
> Once there are improvements in both (1) and (2) we should consider moving the main read() function to use the SDF API and remove the Source based implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)