Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/04 13:01:22 UTC

[GitHub] [hudi] QuChunhe opened a new issue, #6585: [SUPPORT]Hudi Java client writes data very very slowly

QuChunhe opened a new issue, #6585:
URL: https://github.com/apache/hudi/issues/6585

   Each row of data is about 0.72 KB. An insert commit with 300 rows takes about 2 minutes, but increasing the number of rows per commit barely increases the commit time: even if each insert commit has 4500 rows, every commit still takes just over 2 minutes.
   
   1. Hudi version: 0.12.0, Aliyun OSS file system, Flink 1.13.6.2.
   
   2. Spark Beeline is used to create the table. The table properties are as follows:
   
   ```sql
   USING hudi
   PARTITIONED BY (bizdate)
   TBLPROPERTIES (
     'primaryKey' = 'timestamp,serial_number,message_id',
     'type' = 'mor',
     'preCombineField' = 'timestamp',
     'hoodie.datasource.write.hive_style_partitioning' = 'false',
     'hoodie.database.name' = 'gs_ods',
     'hoodie.table.base.file.format' = 'parquet',
     'hoodie.parquet.writelegacyformat.enabled' = 'false'
   );
   ```
   
   
   3. A Hudi Java client inside a Flink pipeline writes the data coming from a Kafka cluster; the configuration is as follows:
   
   ```java
   private String baseFileFormat = "parquet";
   private String recordKeyFields;
   private int parallelism = 20;
   private WriteConcurrencyMode writeConcurrencyMode = WriteConcurrencyMode.SINGLE_WRITER;
   private HoodieLockConfig hoodieLockConfig = HoodieLockConfig.newBuilder().build();
   private HoodieFailedWritesCleaningPolicy hoodieFailedWritesCleaningPolicy
       = HoodieFailedWritesCleaningPolicy.EAGER;

   @Override
   public void open(Configuration parameters) throws Exception {
     // ...
     // Create the write client to write some records in
     HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
         .withPath(tablePath)
         .withSchema(schema)
         .forTable(tableName)
         .withAutoCommit(true)
         .withTableServicesEnabled(true)
         .withEmbeddedTimelineServerEnabled(true)
         .withMarkersType(MarkerType.TIMELINE_SERVER_BASED.name())
         .withRollbackUsingMarkers(true)
         .withDeleteParallelism(parallelism)
         .withParallelism(parallelism, parallelism)
         .withFinalizeWriteParallelism(parallelism)
         .withRollbackParallelism(parallelism / 2)
         .withWriteBufferLimitBytes(32 * 1024 * 1024)
         .withWriteConcurrencyMode(writeConcurrencyMode)
         .withLockConfig(hoodieLockConfig)
         //.withEngineType(EngineType.SPARK)
         .withCleanConfig(HoodieCleanConfig.newBuilder()
             .withAutoClean(true)
             .withFailedWritesCleaningPolicy(hoodieFailedWritesCleaningPolicy)
             .withAsyncClean(false)
             .build())
         .withStorageConfig(
             HoodieStorageConfig.newBuilder()
                 .parquetWriteLegacyFormat("false")
                 .build())
         .withMetadataConfig(
             HoodieMetadataConfig.newBuilder()
                 .withAsyncClean(false)
                 .withAsyncIndex(false)
                 .enable(true)
                 .build())
         .withIndexConfig(
             HoodieIndexConfig.newBuilder()
                 .withIndexType(IndexType.BLOOM)
                 .build())
         .withArchivalConfig(HoodieArchivalConfig.newBuilder()
             .archiveCommitsWith(40, 60)
             .build())
         .withCompactionConfig(
             HoodieCompactionConfig.newBuilder()
                 .withCompactionLazyBlockReadEnabled(true)
                 .build())
         .build();
     client = new HoodieJavaWriteClient<>(new HoodieJavaEngineContext(hadoopConf), cfg);

     gson = (new GsonBuilder()).setExclusionStrategies()
         .registerTypeAdapter(FieldStateEnum.class, new VoidJsonSerializer())
         .enableComplexMapKeySerialization()
         .setLenient()
         .setFieldNamingPolicy(doesUseLowerCaseWithUnderScores
             ? FieldNamingPolicy.LOWER_CASE_WITH_UNDERSCORES : FieldNamingPolicy.IDENTITY)
         .create();
   }

   @Override
   public void invoke(T value, Context context) throws Exception {
     List<HoodieRecord<HoodieJsonPayload>> records = toHoodieRecords(value);
     if (null == records || records.size() == 0) {
       return;
     }
     // A new Hudi commit is started for every invoke() call.
     List<WriteStatus> statusList = null;
     try {
       String newCommitTime = client.startCommit();
       statusList = client.insert(records, newCommitTime);
     } catch (Exception e) {
       log.error("Meet some errors " + Arrays.toString(records.toArray()), e);
       throw e;
     }

     // Only the first WriteStatus is inspected for errors.
     HashMap<HoodieKey, Throwable> errors = statusList.get(0).getErrors();
     if (null == errors || errors.size() == 0) {
       return;
     }
     for (Map.Entry<HoodieKey, Throwable> e : errors.entrySet()) {
       log.error("Can not insert into " + e.getKey().getRecordKey(), e.getValue());
     }
   }
   ```
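
   The numbers above (300 rows and 4500 rows both costing about 2 minutes per commit) suggest the cost is dominated by fixed per-commit overhead (timeline, marker, and metadata round trips against OSS) rather than by row count. A minimal sketch of one way to amortize that overhead, buffering records across invoke() calls and committing once per batch; BATCH_SIZE and the size-only flush policy are illustrative assumptions, not part of the original code:

   ```java
   // Illustrative variant of invoke(): accumulate records and commit once per
   // BATCH_SIZE rows, so the fixed per-commit cost is shared by many more rows.
   private static final int BATCH_SIZE = 100_000; // assumption: tune for row size and latency needs
   private final List<HoodieRecord<HoodieJsonPayload>> buffer = new ArrayList<>();

   @Override
   public void invoke(T value, Context context) throws Exception {
     List<HoodieRecord<HoodieJsonPayload>> records = toHoodieRecords(value);
     if (records == null || records.isEmpty()) {
       return;
     }
     buffer.addAll(records);
     if (buffer.size() < BATCH_SIZE) {
       return; // keep accumulating; a real sink would also flush on a timer or on checkpoint
     }
     String newCommitTime = client.startCommit();
     List<WriteStatus> statusList = client.insert(new ArrayList<>(buffer), newCommitTime);
     buffer.clear();
     // Inspect every WriteStatus, not just the first one.
     for (WriteStatus status : statusList) {
       status.getErrors().forEach((key, t) ->
           log.error("Can not insert into " + key.getRecordKey(), t));
     }
   }
   ```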




[GitHub] [hudi] zhengyuan-cn commented on issue #6585: [SUPPORT]Hudi Java client writes data very very slowly

Posted by GitBox <gi...@apache.org>.
zhengyuan-cn commented on issue #6585:
URL: https://github.com/apache/hudi/issues/6585#issuecomment-1237577659

   The Java client is bound to be slow; use Flink and it will be fast.




[GitHub] [hudi] QuChunhe commented on issue #6585: [SUPPORT]Hudi Java client writes data very very slowly

Posted by GitBox <gi...@apache.org>.
QuChunhe commented on issue #6585:
URL: https://github.com/apache/hudi/issues/6585#issuecomment-1236340053

   Furthermore, the performance does not improve when the parallelism is increased, for example to 400 or 1000.
   ```java
           .withDeleteParallelism(parallelism)
           .withParallelism(parallelism, parallelism)
           .withFinalizeWriteParallelism(parallelism)
        .withRollbackParallelism(parallelism / 2)
   ```
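
   This matches how those knobs behave: withParallelism, withDeleteParallelism, and related builders set engine shuffle parallelism (hoodie.insert.shuffle.parallelism and similar keys), which matters for Spark or Flink executors but has little effect on the single-JVM Java client, so raising them cannot remove fixed per-commit overhead. A quick way to check that the overhead is per commit rather than per row, assuming a hypothetical generateRecords helper:

   ```java
   // Hypothetical timing check: if commit latency stays flat as the batch grows,
   // the cost is per-commit overhead, not per-row work.
   for (int batchSize : new int[] {300, 1500, 4500}) {
     List<HoodieRecord<HoodieJsonPayload>> batch = generateRecords(batchSize); // assumed helper
     long start = System.nanoTime();
     String commitTime = client.startCommit();
     client.insert(batch, commitTime);
     long elapsedMs = (System.nanoTime() - start) / 1_000_000;
     log.info("batchSize={} commitMs={}", batchSize, elapsedMs);
   }
   ```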




[GitHub] [hudi] jhchee commented on issue #6585: [SUPPORT]Hudi Java client writes data very very slowly

Posted by "jhchee (via GitHub)" <gi...@apache.org>.
jhchee commented on issue #6585:
URL: https://github.com/apache/hudi/issues/6585#issuecomment-1501379290

   @QuChunhe Did you manage to solve this?




[GitHub] [hudi] danny0405 commented on issue #6585: [SUPPORT]Hudi Java client writes data very very slowly

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #6585:
URL: https://github.com/apache/hudi/issues/6585#issuecomment-1531070481

   Did you try the Flink Hudi sink instead? Why use the Java client then? The Java client is not good for scaling horizontally.
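
   For reference, a minimal sketch of what the suggested Flink Hudi sink could look like with the HoodiePipeline builder API from hudi-flink; the column types, table path, option values, and the buildKafkaSource helper are placeholders derived from the table definition earlier in the thread, not code from this issue:

   ```java
   import java.util.HashMap;
   import java.util.Map;

   import org.apache.flink.streaming.api.datastream.DataStream;
   import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
   import org.apache.flink.table.data.RowData;
   import org.apache.hudi.common.model.HoodieTableType;
   import org.apache.hudi.configuration.FlinkOptions;
   import org.apache.hudi.util.HoodiePipeline;

   public class KafkaToHudiJob {

     public static void main(String[] args) throws Exception {
       StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
       env.enableCheckpointing(60_000); // the Flink sink commits to Hudi on checkpoints

       DataStream<RowData> source = buildKafkaSource(env); // placeholder: Kafka -> RowData stream

       Map<String, String> options = new HashMap<>();
       options.put(FlinkOptions.PATH.key(), "oss://bucket/path/to/table"); // placeholder path
       options.put(FlinkOptions.TABLE_TYPE.key(), HoodieTableType.MERGE_ON_READ.name());
       options.put(FlinkOptions.PRECOMBINE_FIELD.key(), "timestamp");

       HoodiePipeline.Builder builder = HoodiePipeline.builder("hudi_table") // placeholder name
           .column("`timestamp` BIGINT")        // placeholder column types
           .column("serial_number VARCHAR(64)")
           .column("message_id VARCHAR(64)")
           .column("bizdate VARCHAR(10)")
           .pk("timestamp", "serial_number", "message_id")
           .partition("bizdate")
           .options(options);

       builder.sink(source, false); // bounded = false for a streaming Kafka source
       env.execute("kafka-to-hudi");
     }

     // Placeholder source builder; wire up a real Kafka -> RowData source in practice.
     private static DataStream<RowData> buildKafkaSource(StreamExecutionEnvironment env) {
       throw new UnsupportedOperationException("replace with a Kafka -> RowData source");
     }
   }
   ```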

