Posted to issues@flink.apache.org by "Ufuk Celebi (JIRA)" <ji...@apache.org> on 2015/06/09 14:50:01 UTC

[jira] [Comment Edited] (FLINK-2188) Reading from big HBase Tables

    [ https://issues.apache.org/jira/browse/FLINK-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578830#comment-14578830 ] 

Ufuk Celebi edited comment on FLINK-2188 at 6/9/15 12:49 PM:
-------------------------------------------------------------

Thanks! Do you have time to test it with the original TableInputFormat from Hadoop? I guess this is what you are using with Spark as well, right?

Hadoop InputFormats work out of the box with Flink as well (1).

{code}
DataSet<Tuple2<LongWritable, Text>> input =
    env.readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, textPath);
{code}

I will look into this and try to reproduce the problem locally. I can also provide you with the code snippet for the TableInputFormat if you don't have time to do it. Again, sorry that this has been so inconvenient.

(1) http://ci.apache.org/projects/flink/flink-docs-master/apis/hadoop_compatibility.html 
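A sketch of what such a TableInputFormat snippet could look like, assuming Flink's {{createHadoopInput}} method from the Hadoop compatibility layer and a placeholder table name ("myTable"); the scan and connection settings would of course need to match your cluster:

{code}
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.Job;

public class HBaseRowCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hadoop Job object carrying the HBase configuration;
        // "myTable" is a placeholder for the actual table name
        Job job = Job.getInstance(HBaseConfiguration.create());
        job.getConfiguration().set(TableInputFormat.INPUT_TABLE, "myTable");

        // Wrap Hadoop's HBase TableInputFormat as a Flink input
        DataSet<Tuple2<ImmutableBytesWritable, Result>> rows =
            env.createHadoopInput(new TableInputFormat(),
                ImmutableBytesWritable.class, Result.class, job);

        System.out.println("row count: " + rows.count());
    }
}
{code}

This would let you compare the count from Hadoop's own TableInputFormat against the one from Flink's HBase connector on the same table.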

PS: the code snippet is just the example from the docs



> Reading from big HBase Tables
> -----------------------------
>
>                 Key: FLINK-2188
>                 URL: https://issues.apache.org/jira/browse/FLINK-2188
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Hilmi Yildirim
>            Priority: Critical
>         Attachments: flinkTest.zip
>
>
> I detected a bug in reading from a big HBase table.
> I used a cluster of 13 machines with 13 processing slots per machine, which results in a total of 169 processing slots. Further, our cluster uses CDH 5.4.1 and the HBase version is 1.0.0-cdh5.4.1. There is an HBase table with nearly 100 million rows. I used Spark and Hive to count the number of rows, and both results are identical (nearly 100 million). 
> Then, I used Flink to count the number of rows. For that I added the hbase-client 1.0.0-cdh5.4.1 Java API as a dependency in Maven and excluded the other hbase-client dependencies. The result of the job is nearly 102 million rows, 2 million more than the result of Spark and Hive. Moreover, I ran the Flink job multiple times and sometimes the result fluctuates by +-5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)