Posted to issues@flink.apache.org by "luoyuxia (Jira)" <ji...@apache.org> on 2022/03/21 08:57:00 UTC

[jira] [Commented] (FLINK-26718) Limitations of flink+hive dimension table

    [ https://issues.apache.org/jira/browse/FLINK-26718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509692#comment-17509692 ] 

luoyuxia commented on FLINK-26718:
----------------------------------

[~kunghsu] For a Hive dim table, Flink loads all of the table's data into memory for the lookup, because Hive does not support querying by key. For your case, I think Hive may not be suitable as a dim table with such a large amount of data: it consumes a great deal of memory and lookups are slow. Two quick solutions:

1: Consider whether you really need all the data in the Hive table for the lookup; if not, limit the size of the Hive table.

2: Use a store that supports point lookups by key as the dim table, such as HBase (see the sketch below).
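
For reference, a minimal sketch of option 2 in Flink SQL, using the HBase connector for a lookup join. The connector version, table names, columns, and ZooKeeper address below are placeholder assumptions, not taken from your job:

-- HBase-backed dim table: lookups are served point-by-point via the
-- row key, so the table is never fully loaded into TaskManager memory.
CREATE TABLE user_dim (
  user_id STRING,                      -- HBase row key
  f ROW<user_name STRING>,             -- column family 'f'
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'hbase-2.2',
  'table-name' = 'user_dim',
  'zookeeper.quorum' = 'zk-host:2181'
);

-- Lookup join keyed on the row key; kafka_input is assumed to declare a
-- processing-time attribute, e.g. proc_time AS PROCTIME().
SELECT k.*, d.f.user_name
FROM kafka_input AS k
JOIN user_dim FOR SYSTEM_TIME AS OF k.proc_time AS d
  ON k.user_id = d.user_id;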

> Limitations of flink+hive dimension table
> -----------------------------------------
>
>                 Key: FLINK-26718
>                 URL: https://issues.apache.org/jira/browse/FLINK-26718
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Hive
>    Affects Versions: 1.12.7
>            Reporter: kunghsu
>            Priority: Major
>              Labels: HIVE
>
> Limitations of flink+hive dimension table
> The scenario I am dealing with is a join between a Kafka input table and a Hive dimension table. The Hive dimension table holds user data, and it is very large.
> When the Hive table is small, around a few hundred rows, everything is normal: the partitions are recognized automatically and the whole job runs fine.
> When the Hive table reached about 1.3 million rows, the TaskManager stopped working properly; it was difficult even to read the logs. I suspect it blew out the JVM memory when it tried to load the entire table into memory. The TaskManager reports a heartbeat timeout (Heartbeat TimeoutException). I even increased the parallelism, to no avail.
> Official website documentation: [https://nightlies.apache.org/flink/flink-docs-release-1.12/dev/table/connectors/hive/hive_read_write.html#source-parallelism-inference]
> So my question is: does flink+hive simply not support joining against large dimension tables so far?
> Is this solution unusable when the amount of data is too large?
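>
> For concreteness, the join is an ordinary lookup join against the Hive table, roughly of the following shape (table and column names are placeholders; kafka_input is assumed to declare proc_time AS PROCTIME()):
>
> SELECT k.*, u.user_name
> FROM kafka_input AS k
> JOIN hive_user_dim FOR SYSTEM_TIME AS OF k.proc_time AS u
>   ON k.user_id = u.user_id;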
>  
> As a rough estimate, how much memory would 25 million rows take up?
> Suppose each row is 1 KB: 25 million rows x 1 KB = 25,000,000 KB, which is about 25,000 MB, or roughly 25 GB of raw data (before any JVM object overhead).
> If the TaskManager memory is set to 32 GB, would that solve the problem?
> It doesn't seem to, either, because managed memory, network buffers, and overhead all come out of the process size, so only roughly 16 GB of it ends up as JVM heap.
> Assuming the official solution can support such a large volume, how should the TaskManager memory be configured?
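>
> For illustration, the kind of flink-conf.yaml knobs this question touches (the values are made up for this example, not a recommendation):
>
> # Total memory of the TaskManager process (heap + managed + network + overhead).
> taskmanager.memory.process.size: 32g
> # The default managed fraction is 0.4; shrinking it leaves more of the
> # process size available as JVM heap for the in-memory lookup table.
> taskmanager.memory.managed.fraction: 0.1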
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)