Posted to issues@flink.apache.org by "PJ Van Aeken (JIRA)" <ji...@apache.org> on 2016/01/27 17:40:40 UTC

[jira] [Commented] (FLINK-2055) Implement Streaming HBaseSink

    [ https://issues.apache.org/jira/browse/FLINK-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119736#comment-15119736 ] 

PJ Van Aeken commented on FLINK-2055:
-------------------------------------

What is the latest news on this?

Having read through the mailing list thread and the corresponding code, it seems like the current solution is more of a workaround. I can understand the desire to reuse what is already out there, but reusing the HBase TableOutputFormat feels a bit like making a sacrifice. I haven't had time to thoroughly investigate my suspicions, though, and am very interested to learn whether anyone else has. I am by no means an HBase expert, but based on what I think I know about HBase, this is the sacrifice I think we're making here:

The native HBase TableOutputFormat was built for use in batch jobs. It uses the BufferedMutator under the hood, which, as far as I understand, decides when to flush based on constraints determined by HBase itself, such as the cumulative size of the Puts. That means that while we may "write" to our TableOutputFormat every X milliseconds, HBase will still decide on its own when to actually flush the records. The HBase client, in order to avoid a large number of small files, also groups the Puts together, but in the meantime exposes them through a component called the memstore, making them available before the flush. I believe that by using the TableOutputFormat with the BufferedMutator, we are skipping the memstore and therefore new Puts remain unavailable until the flush. We could of course configure HBase to flush to disk more frequently, but should we really do that if we have an alternative?
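
To make the concern concrete, here is a toy sketch in plain Java of a write buffer that only releases records once a size threshold is reached, or when flush() is called explicitly. This is not the HBase API: the class SizeBoundedBuffer, its methods and the 10-byte threshold are all hypothetical, and the real BufferedMutator may well behave differently. It only models the behaviour I suspect, namely that "written" records stay invisible until a flush happens.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a size-bounded write buffer. Hypothetical; NOT the HBase BufferedMutator. */
public class SizeBoundedBuffer {
    private final long maxBytes;                              // flush threshold in bytes (assumed)
    private final List<byte[]> pending = new ArrayList<>();   // written, but not yet visible
    private final List<byte[]> flushed = new ArrayList<>();   // visible to readers
    private long pendingBytes = 0;

    public SizeBoundedBuffer(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Buffers a record; flushes only once the cumulative size reaches the threshold. */
    public void write(byte[] record) {
        pending.add(record);
        pendingBytes += record.length;
        if (pendingBytes >= maxBytes) {
            flush();
        }
    }

    /** Explicit flush, e.g. what a streaming sink could call on a timer. */
    public void flush() {
        flushed.addAll(pending);
        pending.clear();
        pendingBytes = 0;
    }

    public int pendingCount() {
        return pending.size();
    }

    public int flushedCount() {
        return flushed.size();
    }

    public static void main(String[] args) {
        SizeBoundedBuffer buf = new SizeBoundedBuffer(10);
        buf.write(new byte[4]);
        buf.write(new byte[4]);
        // Two records written (8 bytes), still below the threshold: nothing is visible yet.
        System.out.println(buf.pendingCount() + " pending, " + buf.flushedCount() + " visible");
        buf.write(new byte[4]);
        // The third record pushes the total to 12 bytes, which triggers the size-based flush.
        System.out.println(buf.pendingCount() + " pending, " + buf.flushedCount() + " visible");
    }
}
```

In this model, a sink that writes every X milliseconds without also calling flush() gives no bound on how long a record stays in the pending list, which is the latency issue I am worried about.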

Now, as mentioned, I'm not sure I fully grasped the inner workings of HBase so if I made some false assumptions, I'm sorry. But based on what I think I know now, it seems like we're making an unnecessary sacrifice here.

> Implement Streaming HBaseSink
> -----------------------------
>
>                 Key: FLINK-2055
>                 URL: https://issues.apache.org/jira/browse/FLINK-2055
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming, Streaming Connectors
>    Affects Versions: 0.9
>            Reporter: Robert Metzger
>            Assignee: Hilmi Yildirim
>
> As per : http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Write-Stream-to-HBase-td1300.html


