Posted to issues@flink.apache.org by "luoyuxia (Jira)" <ji...@apache.org> on 2022/06/30 04:11:00 UTC

[jira] [Resolved] (FLINK-26718) Limitations of flink+hive dimension table

     [ https://issues.apache.org/jira/browse/FLINK-26718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

luoyuxia resolved FLINK-26718.
------------------------------
    Resolution: Not A Problem

[~kunghsu] I think we can close this issue. Feel free to reopen it if you still have questions about it.

> Limitations of flink+hive dimension table
> -----------------------------------------
>
>                 Key: FLINK-26718
>                 URL: https://issues.apache.org/jira/browse/FLINK-26718
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Hive
>    Affects Versions: 1.12.7
>            Reporter: kunghsu
>            Priority: Major
>              Labels: HIVE
>
> Limitations of flink+hive dimension table
> The scenario involves a join between a Kafka input table and a Hive dimension table (a SQL sketch of this kind of join is included below, after the quoted issue). The Hive dimension table holds user data, and the data volume is very large.
> When the Hive table is small, around a few hundred rows, everything works: partitions are recognized automatically and the whole job runs normally.
> Once the Hive table reached about 1.3 million rows, the TaskManager stopped working properly and it became difficult even to read the logs. My guess is that loading the entire table into memory exhausted the JVM heap; the TaskManager reports heartbeat timeout exceptions (Heartbeat TimeoutException). Increasing the parallelism did not help.
> Official website documentation: [https://nightlies.apache.org/flink/flink-docs-release-1.12/dev/table/connectors/hive/hive_read_write.html#source-parallelism-inference]
> So my question is: does flink + hive not support joining against large dimension tables so far?
> Is this approach unusable when the data volume is too large?
>
> A rough estimate: how much memory would 25 million rows take up?
> Assuming each row is about 1 KB, 25 million KB is about 25,000 MB, i.e. roughly 25 GB.
> If the TaskManager memory is set to 32 GB, would that solve the problem?
> It seems not, because only roughly 16 GB of that can be allocated to the JVM heap.
> Assuming the official solution can support data at this scale, how should the TaskManager memory be configured?
>  
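For context, below is a minimal sketch of the kind of lookup join being described. Table and column names (kafka_orders, hive_user_dim, proc_time, user_id, ...) are hypothetical, and proc_time is assumed to be a processing-time attribute declared on the Kafka table. The OPTIONS hint keys ('streaming-source.enable', 'streaming-source.partition.include', 'lookup.join.cache.ttl') come from the Flink Hive connector documentation linked above; the exact names and defaults should be checked against the Flink version in use.

    -- Kafka stream joined against a Hive table used as a temporal (lookup) table.
    -- The hint asks the connector to read the Hive table as a bounded dimension
    -- table, cache all partitions, and reload the cache every 12 hours.
    SELECT o.order_id, o.user_id, d.user_name
    FROM kafka_orders AS o
    JOIN hive_user_dim /*+ OPTIONS(
          'streaming-source.enable' = 'false',
          'streaming-source.partition.include' = 'all',
          'lookup.join.cache.ttl' = '12 h') */
      FOR SYSTEM_TIME AS OF o.proc_time AS d
      ON o.user_id = d.user_id;

In this mode the connector caches the whole dimension table in TaskManager memory, and per the Hive connector docs each lookup-join subtask keeps its own copy of that cache, which is why raising the parallelism does not shrink the per-TaskManager footprint and why a multi-gigabyte dimension table runs into heap pressure. The 32 GB question above is governed by the taskmanager.memory.* options (e.g. taskmanager.memory.process.size, taskmanager.memory.task.heap.size): only part of the configured process size becomes JVM heap available to the lookup cache.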



--
This message was sent by Atlassian Jira
(v8.20.10#820010)