Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/04 16:57:53 UTC

[GitHub] [hudi] tommss opened a new issue, #6038: [SUPPORT] MOR taking more time using HoodieJavaWriteClient

tommss opened a new issue, #6038:
URL: https://github.com/apache/hudi/issues/6038

   I am doing a PoC of Hudi and noticed that, when using HoodieJavaWriteClient, writes to a MOR table take more time than writes to a COW table.
   But when using DataFrameReader, MOR is faster than COW for the same dataset.
   In both cases I am doing an 'insert' operation.
   With the Java client, I notice a gradual decrease in throughput. There are a total of 7 million rows, and I am writing them in batches of 200k rows.
   I have deployed the code in an Azure Databricks cluster, and the worker nodes pick up the task of executing the code snippet below.
   Attaching the code snippet using the Java client:
   
   ```
             HoodieKey key = new HoodieKey(UUID.randomUUID().toString(), "partitionPath");
             HoodieAvroPayload payload = new HoodieAvroPayload(Option.of(rec));
             HoodieAvroRecord<HoodieAvroPayload> record = new HoodieAvroRecord<>(key, payload);
             records.add(record);
           }
         }
   
         String tableName = "tableName";
         String tablePath = "abfss://xxx@xxx.dfs.core.windows.net/" + tableName;
         Configuration hadoopConf = new Configuration();
         hadoopConf.set("fs.azure.account.key","xxx");
         Path path = new Path(tablePath);
         FileSystem fs = FSUtils.getFs(tablePath, hadoopConf);
         if (!fs.exists(path)) {
           HoodieTableMetaClient.withPropertyBuilder()
               .setTableType(tableType)
               .setTableName(tableName)
               .setPayloadClassName(HoodieAvroPayload.class.getName())
               .initTable(hadoopConf, tablePath);
         }
   
         HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder().withPath(tablePath)
             .withSchema(avroSchema.toString())
             .withParallelism(2, 2)
             .withDeleteParallelism(2)
             .forTable(tableName)
             .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.INMEMORY).build())
             .withCompactionConfig(HoodieCompactionConfig.newBuilder().archiveCommitsWith(20, 30).build())
             .build();
         HoodieJavaWriteClient<HoodieAvroPayload> client =
             new HoodieJavaWriteClient<>(new HoodieJavaEngineContext(hadoopConf), cfg);
   
         String commitTime = client.startCommit();
         client.insert(records, commitTime);
         client.close();
   ```
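
One way to confirm the gradual slowdown described above is to time each batch separately. The following is a minimal sketch in plain Java and is not part of the original report: `BatchTimer`, `toBatches`, and `timeBatches` are hypothetical names, and the `writer` callback only stands in for whatever performs the real write (for example a call such as `client.insert`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Hudi: splits rows into batches and times each write.
final class BatchTimer {

  // Splits rows into consecutive batches of at most batchSize elements.
  static <T> List<List<T>> toBatches(List<T> rows, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < rows.size(); start += batchSize) {
      int end = Math.min(start + batchSize, rows.size());
      batches.add(new ArrayList<>(rows.subList(start, end)));
    }
    return batches;
  }

  // Calls writer once per batch and returns the elapsed nanoseconds for each call.
  static <T> List<Long> timeBatches(List<List<T>> batches, Consumer<List<T>> writer) {
    List<Long> elapsedNanos = new ArrayList<>();
    for (List<T> batch : batches) {
      long start = System.nanoTime();
      writer.accept(batch);
      elapsedNanos.add(System.nanoTime() - start);
    }
    return elapsedNanos;
  }

  public static void main(String[] args) {
    List<Integer> rows = new ArrayList<>();
    for (int i = 0; i < 1_000; i++) {
      rows.add(i);
    }
    List<List<Integer>> batches = toBatches(rows, 200);
    // The empty lambda is a placeholder for the real write call.
    List<Long> elapsed = timeBatches(batches, batch -> { });
    for (int i = 0; i < elapsed.size(); i++) {
      System.out.println("batch " + i + ": " + batches.get(i).size()
          + " rows in " + elapsed.get(i) + " ns");
    }
  }
}
```

If the per-batch times grow steadily from the first batch to the last, the slowdown is tied to the amount of data already in the table rather than to the size of the incoming batch.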
   -------------------------------------------------------------------------------------
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 3.2
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : Azure storage account - StorageV2 (general purpose v2)
   
   * Running on Docker? (yes/no) : no
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] fengjian428 commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1179203197

   Already discussed with @tommss via Slack; I recommend using SparkWriteClient.
   @tommss do you have any update on this?




[GitHub] [hudi] fengjian428 commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1175214688

   Go to https://hudi.apache.org/community/get-involved and just click the join group link.




[GitHub] [hudi] fengjian428 commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1175192155

   Why don't you create a DataFrame on top of the RDD and then save it to Hudi?
   By the way, have you joined Hudi's Slack? We can discuss this there.
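
A minimal sketch of the write options such a DataFrame save would typically need, written as a plain Java map so that it has no Spark dependency. The configuration keys are standard Hudi datasource options; the column names `id`, `ts`, and `partition` and the class name `HudiWriteOptions` are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects Hudi datasource write options for a DataFrame save.
final class HudiWriteOptions {

  // tableType is "COPY_ON_WRITE" or "MERGE_ON_READ"; operation is e.g. "insert" or "bulk_insert".
  static Map<String, String> build(String tableName, String tableType, String operation) {
    Map<String, String> opts = new LinkedHashMap<>();
    opts.put("hoodie.table.name", tableName);
    opts.put("hoodie.datasource.write.table.type", tableType);
    opts.put("hoodie.datasource.write.operation", operation);
    // Assumed column names; replace them with fields from the real schema.
    opts.put("hoodie.datasource.write.recordkey.field", "id");
    opts.put("hoodie.datasource.write.precombine.field", "ts");
    opts.put("hoodie.datasource.write.partitionpath.field", "partition");
    return opts;
  }

  public static void main(String[] args) {
    Map<String, String> opts = build("tableName", "MERGE_ON_READ", "insert");
    opts.forEach((key, value) -> System.out.println(key + "=" + value));
    // With Spark on the classpath, the map would be passed along these lines:
    // df.write().format("hudi").options(opts).mode(SaveMode.Append).save(tablePath);
  }
}
```

Keeping the options in one map makes it easy to run the same DataFrame write once as COW and once as MOR and compare the timings.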




[GitHub] [hudi] fengjian428 commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1175179593

   HoodieJavaWriteClient uses HoodieJavaMergeOnReadTable to handle MOR tables, and HoodieJavaMergeOnReadTable does nothing but inherit its functions from HoodieJavaCopyOnWriteTable.
   So if you cannot switch to SparkWriteClient or Flink, we need to implement HoodieJavaMergeOnReadTable's functions.
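
The point about inheritance can be illustrated with a toy model in plain Java. These are not Hudi's classes: `CowTableSketch` and `MorTableSketch` are hypothetical stand-ins, and the returned string only labels which code path ran. Because the subclass overrides nothing, an insert on the MOR stand-in takes exactly the COW path.

```java
// Toy stand-in for a copy-on-write table; not Hudi's actual implementation.
class CowTableSketch {
  // Describes how an insert is handled: by writing a new base file.
  String insert(int recordCount) {
    return "wrote base file with " + recordCount + " records";
  }
}

// Toy stand-in for a merge-on-read table that adds no behavior of its own.
class MorTableSketch extends CowTableSketch {
  // No override: insert() is inherited unchanged from CowTableSketch.
}

final class InheritanceDemo {
  public static void main(String[] args) {
    CowTableSketch cow = new CowTableSketch();
    CowTableSketch mor = new MorTableSketch();
    // Both calls run the same method body, so both lines print the same text.
    System.out.println(cow.insert(200_000));
    System.out.println(mor.insert(200_000));
  }
}
```

Under this model, choosing the MOR table type cannot make inserts cheaper than COW, because no MOR-specific write path exists to be selected.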




[GitHub] [hudi] nsivabalan closed issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient
URL: https://github.com/apache/hudi/issues/6038




[GitHub] [hudi] tommss commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
tommss commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1174356088

   Attached are the screenshots of files under the partition folder
   ![filesset2](https://user-images.githubusercontent.com/3656499/177216296-bc1a341a-1539-4691-b112-13407907469c.png)
   
   and some of the files in .hoodie folder
   ![hoodie_folder_fileset3](https://user-images.githubusercontent.com/3656499/177216308-4d118534-480d-4194-ab1f-d54e4a170c7e.png)
   




[GitHub] [hudi] nsivabalan commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1302872105

   Feel free to raise a new issue if you are looking for further enhancements.




[GitHub] [hudi] fengjian428 commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1174038108

   As long as you use bulk_insert with NonSortPartitioner, you can get better performance in COW mode than in MOR mode.
   Otherwise, a COW table still needs to copy small files into bigger ones on every commit, which also causes write amplification.
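
To make the write-amplification argument concrete, here is a back-of-the-envelope sketch in Java. It assumes a deliberately pessimistic toy model that is not taken from the thread: every commit rewrites all previously written rows together with the new batch, as if all data stayed in one small file that is copied on each commit. Real tables stop rewriting a file once it reaches the configured maximum size, so actual numbers will be lower. `WriteAmplification` and `rowsWritten` are hypothetical names.

```java
// Toy model of copy-on-write amplification; not a measurement of Hudi itself.
final class WriteAmplification {

  // Total rows physically written if each commit rewrites everything written so far
  // plus its own batch: batchSize * (1 + 2 + ... + batches).
  static long rowsWritten(long batchSize, int batches) {
    long total = 0;
    for (int commit = 1; commit <= batches; commit++) {
      total += batchSize * commit;
    }
    return total;
  }

  public static void main(String[] args) {
    long batchSize = 200_000L;   // batch size mentioned in the issue
    long totalRows = 7_000_000L; // total rows mentioned in the issue
    int batches = (int) (totalRows / batchSize); // 35 commits
    long written = rowsWritten(batchSize, batches);
    System.out.println("commits: " + batches);
    System.out.println("rows written: " + written);
    System.out.println("amplification: " + (double) written / totalRows);
  }
}
```

Under this model the 35 commits write 126,000,000 rows in total, 18 times the 7,000,000 logical rows, and each commit does more work than the one before it.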




[GitHub] [hudi] tommss commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
tommss commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1175174196

   - I changed the index to Bloom to see if it makes any difference, but it does not.
   - What do you mean by "HoodieJavaMergeOnReadTable is unfinished"?
   - Below is what we are trying to achieve in the cluster and the reason for using the Hudi Java client.
   
   
   ![image](https://user-images.githubusercontent.com/3656499/177359536-299c0c3f-bc68-4159-8b8e-197986d12139.png)
   
   




[GitHub] [hudi] xushiyan commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1296299606

   Thanks for the support @fengjian428!
   @tommss do you need further assistance? We may close this in a week's time due to inactivity.




[GitHub] [hudi] tommss commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
tommss commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1174042521

   But I am not using the bulk_insert option here, as the Java client does not support it.




[GitHub] [hudi] tommss commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
tommss commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1174063524

   OK, I noticed the same when using DataFrameReader. But after moving to the Java client, I see MOR being slower than COW.




[GitHub] [hudi] fengjian428 commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1174142907

   You can just provide screenshots of the data files and upload the files in .hoodie here.
   
   




[GitHub] [hudi] fengjian428 commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1174055566

   If you are not using bulk_insert, in my experience MOR mode should be faster than COW mode in most cases, because COW suffers from write amplification. I've tested it.




[GitHub] [hudi] tommss commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
tommss commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1174100066

   What exactly do you mean by file layout and timeline? Could you elaborate?
   I am reading just one table from a SQL DB, which has basic column types (around 15 columns). I have chosen the simplest table possible to begin with. I have created a default partition for now, and all 7 million rows go into the same partition.
   It has become necessary to use HoodieJavaWriteClient because SparkSession, SparkContext, and SQLContext are not available inside the worker nodes of a Databricks cluster. And I believe that without a SparkContext it is not possible to create a DataFrameReader.




[GitHub] [hudi] fengjian428 commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1174072185

   OK, understood. Could you provide more information, such as the file layout and timeline of the MOR table?
   By the way, is it necessary to use HoodieJavaWriteClient? I think there are also some limitations on MOR when using it.




[GitHub] [hudi] tommss commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
tommss commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1175202995

   I have sent a request to be added to the Slack group (https://github.com/apache/hudi/issues/143).
   Can you add me there?




[GitHub] [hudi] tommss commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
tommss commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1174535110

   Aren't delta-log files created only when there are updates? In my case I am doing only inserts. There are no updates in my batch. I am just dumping data from a DB table to files.




[GitHub] [hudi] fengjian428 commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1174529133

   @tommss there are no delta-logs in the path. Normally, if you use a global index, delta-logs should be generated first. I went through the source code of HoodieJavaCopyOnWriteTable and HoodieJavaMergeOnReadTable and found that there are no differences between them in insert behavior; HoodieJavaMergeOnReadTable just inherits the insert method from HoodieJavaCopyOnWriteTable.
   
   




[GitHub] [hudi] fengjian428 commented on issue #6038: [SUPPORT] MOR taking more time than COW using HoodieJavaWriteClient

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #6038:
URL: https://github.com/apache/hudi/issues/6038#issuecomment-1174554938

   Not exactly. In this case, you chose MemoryIndex, which is a CanIndexLog index, so data can be appended to a delta-log even when a new record is inserted. For now HoodieJavaMergeOnReadTable is unfinished, but this should be the normal behavior.
   
   I'm not exactly sure what your requirements really are. The DataFrame or RDD is already distributed, so why do you need to create a Spark session in the worker node?

