Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/09/24 16:04:45 UTC

[GitHub] [hudi] ziudu commented on issue #3344: [SUPPORT]Best way to ingest a large number of tables

ziudu commented on issue #3344:
URL: https://github.com/apache/hudi/issues/3344#issuecomment-926745203


   Sorry for the late reply.
   
   We used a 4-node Hadoop cluster for testing; each node is an ESXi virtual machine (8 cores @ 2.1 GHz, 32 GB RAM, HDD virtual disk).
   
   We tested several different ways for the initial load: 
   
   1) The fastest way is to load the data in Hive format and then convert it to Hudi format (see the first sketch after this list). We used DataX to extract data from the DB and load it into Hadoop. The load speed is 80K records per second; the conversion is slower, around 30K records per second, but it is very easy to parallelize and easy to manage. With 2 DataX instances the load speed was 150K records per second (linear scaling!). I think the speed is limited only by the hardware configuration. Since the result was already satisfactory, we didn't test further.
   
   2) We tested streaming upsert with a Scala application; the speed is 3K records per second (see the second sketch after this list). This is more than enough for a microservice application's continuous ingestion process.
   
   3) We tested bulk insert, but it was even slower than upsert (1.5K records per second).
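   
   To make 1) and 3) concrete, here is a minimal sketch of the conversion step, assuming a Spark job with Hive support; the table name, key/precombine columns, and target path are hypothetical placeholders:
   
   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}
   
   object HiveToHudiConversion {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("hive-to-hudi-conversion")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .enableHiveSupport()
         .getOrCreate()
   
       // Read the Hive-format staging table that DataX loaded (hypothetical name).
       val df = spark.table("staging_db.orders")
   
       // Rewrite it as a Hudi table. bulk_insert is Hudi's write operation
       // intended for one-off initial loads.
       df.write.format("hudi")
         .option("hoodie.table.name", "orders")
         .option("hoodie.datasource.write.recordkey.field", "order_id")    // hypothetical key column
         .option("hoodie.datasource.write.precombine.field", "updated_at") // hypothetical ordering column
         .option("hoodie.datasource.write.operation", "bulk_insert")
         .mode(SaveMode.Overwrite)
         .save("hdfs:///warehouse/hudi/orders")                            // hypothetical target path
   
       spark.stop()
     }
   }
   ```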
   
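   And a minimal sketch of the streaming upsert path from 2), assuming flat JSON change records on a single Kafka topic; the brokers, topic, schema, and paths are made up:
   
   ```scala
   import org.apache.spark.sql.{Dataset, Row, SparkSession}
   import org.apache.spark.sql.functions.{col, from_json}
   import org.apache.spark.sql.streaming.Trigger
   import org.apache.spark.sql.types._
   
   object StreamingUpsert {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("hudi-streaming-upsert")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate()
   
       // Hypothetical schema of the incoming JSON change records.
       val schema = StructType(Seq(
         StructField("order_id", LongType),
         StructField("status", StringType),
         StructField("updated_at", TimestampType)))
   
       val parsed = spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "kafka:9092") // hypothetical brokers
         .option("subscribe", "appdb.orders")             // hypothetical topic
         .load()
         .select(from_json(col("value").cast("string"), schema).as("r"))
         .select("r.*")
   
       // Upsert each micro-batch into the Hudi table created by the initial load.
       parsed.writeStream
         .option("checkpointLocation", "hdfs:///checkpoints/orders") // hypothetical path
         .trigger(Trigger.ProcessingTime("30 seconds"))
         .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
           batch.write.format("hudi")
             .option("hoodie.table.name", "orders")
             .option("hoodie.datasource.write.recordkey.field", "order_id")
             .option("hoodie.datasource.write.precombine.field", "updated_at")
             .option("hoodie.datasource.write.operation", "upsert")
             .mode("append")
             .save("hdfs:///warehouse/hudi/orders")
         }
         .start()
         .awaitTermination()
     }
   }
   ```
   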
   So what we are doing now is:
   A. Write an application to scan database metadata and store it in LinkedIn DataHub (see the first sketch below).
   B. Write an application to generate the various configuration files for DataX, Kafka, Debezium, etc. from LinkedIn DataHub, and automate the initial load and continuous ingestion process.
   C. Write a Scala application that can subscribe to a range of topics and ingest the data (see the second sketch below).
   D. We've chosen method 1 for the initial load for the moment. It is not beautiful, but it's fast.
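   
   For A., a minimal sketch of the metadata scan over plain JDBC (pushing the result into DataHub, e.g. through its REST emitter, is left out; the URL and credentials are hypothetical):
   
   ```scala
   import java.sql.DriverManager
   import scala.collection.mutable.ListBuffer
   
   object ScanDbMetadata {
     case class ColumnInfo(table: String, column: String, sqlType: String)
   
     def main(args: Array[String]): Unit = {
       // Hypothetical connection details.
       val conn = DriverManager.getConnection(
         "jdbc:mysql://dbhost:3306/appdb", "reader", "secret")
       try {
         val md = conn.getMetaData
         val cols = ListBuffer.empty[ColumnInfo]
         val tables = md.getTables(null, null, "%", Array("TABLE"))
         while (tables.next()) {
           val table = tables.getString("TABLE_NAME")
           val rs = md.getColumns(null, null, table, "%")
           while (rs.next())
             cols += ColumnInfo(table, rs.getString("COLUMN_NAME"), rs.getString("TYPE_NAME"))
           rs.close()
         }
         tables.close()
         // One entry per column; this is what would be emitted to DataHub.
         cols.foreach(println)
       } finally conn.close()
     }
   }
   ```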
   
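   For C., a sketch of one job subscribed to a whole range of topics, assuming Debezium events already flattened to plain rows (e.g. by the ExtractNewRecordState SMT); the hard-coded metadata map stands in for the DataHub lookup, and all names and paths are hypothetical:
   
   ```scala
   import org.apache.spark.sql.{Dataset, Row, SparkSession}
   import org.apache.spark.sql.functions.{col, from_json}
   import org.apache.spark.sql.types._
   
   object MultiTopicIngest {
     // Per-table schema and key fields; in our setup this would be generated
     // from DataHub instead of hard-coded.
     case class TableMeta(schema: StructType, recordKey: String, precombine: String)
     val meta: Map[String, TableMeta] = Map(
       "orders" -> TableMeta(
         StructType(Seq(
           StructField("order_id", LongType),
           StructField("status", StringType),
           StructField("updated_at", TimestampType))),
         "order_id", "updated_at"))
   
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("multi-topic-hudi-ingest").getOrCreate()
       import spark.implicits._
   
       val raw = spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "kafka:9092")      // hypothetical brokers
         .option("subscribePattern", "dbserver1\\.appdb\\..*") // hypothetical Debezium topics
         .load()
   
       raw.writeStream
         .option("checkpointLocation", "hdfs:///checkpoints/appdb") // hypothetical path
         .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
           // Route each topic present in this micro-batch to its own Hudi table.
           val topics = batch.select($"topic").distinct().as[String].collect()
           topics.foreach { topic =>
             val table = topic.split("\\.").last
             meta.get(table).foreach { m =>
               batch.filter($"topic" === topic)
                 .select(from_json(col("value").cast("string"), m.schema).as("r"))
                 .select("r.*")
                 .write.format("hudi")
                 .option("hoodie.table.name", table)
                 .option("hoodie.datasource.write.recordkey.field", m.recordKey)
                 .option("hoodie.datasource.write.precombine.field", m.precombine)
                 .option("hoodie.datasource.write.operation", "upsert")
                 .mode("append")
                 .save(s"hdfs:///warehouse/hudi/$table")
             }
           }
         }
         .start()
         .awaitTermination()
     }
   }
   ```
   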
   We haven't tested the Scala application's performance when there are lots of tables (e.g. > 1000) in a single DB.
   We haven't tested DeltaStreamer in Hudi 0.9 yet.
   
   

