Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/18 08:07:27 UTC

[GitHub] [incubator-hudi] jenu9417 opened a new issue #1528: [SUPPORT] Issue while writing to HDFS via hudi. Only `/.hoodie` folder is written.

jenu9417 opened a new issue #1528:
URL: https://github.com/apache/incubator-hudi/issues/1528
 
 
   Hi, all.
   We are doing a POC experimenting with syncing our data in micro-batches from Kafka to HDFS. We are currently using the general Kafka consumer APIs, converting the records to a Dataset, and then writing it to HDFS via hudi. We are facing some problems with this.
   
   ```
   // `items` is a List<String> containing data read from Kafka
   final Dataset<Record> df = spark.createDataset(items, Encoders.STRING()).toDF()
           .map(new Mapper(), Encoders.bean(Record.class))
           .filter(new Column("name").equalTo("aaa"));

   df.write().format("hudi")
           .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "id")
           .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "batch")
           .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
           .option(HoodieWriteConfig.TABLE_NAME, table)
           .mode(SaveMode.Append)
           .save(output);
           //.parquet(output);
   ```
   
   a) When using the `save` option to write the dataset, only the `/.hoodie` folder exists after writing. No actual data is present. From the logs we are not seeing any errors; the following set of lines is repeated continuously during the write phase.
   ```
   91973 [Executor task launch worker for task 11129] INFO  org.apache.spark.storage.BlockManager  - Found block rdd_36_624 locally
   91974 [Executor task launch worker for task 11128] INFO  org.apache.spark.storage.BlockManager  - Found block rdd_36_623 locally
   91974 [Executor task launch worker for task 11129] INFO  org.apache.spark.executor.Executor  - Finished task 624.0 in stage 19.0 (TID 11129). 699 bytes result sent to driver
   91975 [dispatcher-event-loop-0] INFO  org.apache.spark.scheduler.TaskSetManager  - Starting task 625.0 in stage 19.0 (TID 11130, localhost, executor driver, partition 625, PROCESS_LOCAL, 7193 bytes)
   91975 [Executor task launch worker for task 11130] INFO  org.apache.spark.executor.Executor  - Running task 625.0 in stage 19.0 (TID 11130)
   91975 [task-result-getter-0] INFO  org.apache.spark.scheduler.TaskSetManager  - Finished task 624.0 in stage 19.0 (TID 11129) in 16 ms on localhost (executor driver) (624/1500)
   91985 [Executor task launch worker for task 11128] INFO  org.apache.spark.executor.Executor  - Finished task 623.0 in stage 19.0 (TID 11128). 871 bytes result sent to driver
   91985 [dispatcher-event-loop-0] INFO  org.apache.spark.scheduler.TaskSetManager  - Starting task 626.0 in stage 19.0 (TID 11131, localhost, executor driver, partition 626, PROCESS_LOCAL, 7193 bytes)
   91985 [task-result-getter-1] INFO  org.apache.spark.scheduler.TaskSetManager  - Finished task 623.0 in stage 19.0 (TID 11128) in 27 ms on localhost (executor driver) (625/1500)
   91986 [Executor task launch worker for task 11131] INFO  org.apache.spark.executor.Executor  - Running task 626.0 in stage 19.0 (TID 11131)
   ```
   We have verified that there is no issue with fetching data from Kafka or creating the dataset. The only issue seems to be with the write.
   
   b) When using the `parquet` option to write the dataset, actual data is written in parquet file format in the output directory, but without any partition folders. Is this expected? What is the difference between `save` and `parquet`? Also, while querying this parquet data in the Spark shell via Spark SQL, I was not able to find any hudi meta fields. For example:
   ```
   spark.sql("select id, name, `_hoodie_commit_time` from table1 limit 5").show();
   ```
   The query threw an error saying there is no such field as `_hoodie_commit_time`.
   
   c) Where can I find the metadata about the data currently present in hudi tables, i.e., what are the new commits, when was the last commit, etc.? From the documentation it seemed this data is managed by hudi.
   
   d) How is data compaction managed by hudi? Are there any background jobs running?
   
   Sorry if these are naive questions, but we are completely new to this. It would also be helpful if someone could point us to more detailed documentation on these topics.
   
   Thanks.
   
   
   **Steps to reproduce the behavior:**
   
   1. Code snippet used for write has been shared.
   
   
   **Expected behavior**
   
Currently, when writing, only the `/.hoodie` folder is written, without any data. The expected behaviour is that the data should also be written.
   
   
   **Environment Description**
   
   * Hudi version :  0.5.2-incubating
   
   * Spark version :  2.4.0
   
   * Hive version : - 
   
   * Hadoop version : 2.9.2
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : No
   


[GitHub] [incubator-hudi] lamber-ken commented on issue #1528: [SUPPORT] Issue while writing to HDFS via hudi. Only `/.hoodie` folder is written.

lamber-ken commented on issue #1528:
URL: https://github.com/apache/incubator-hudi/issues/1528#issuecomment-617028414


   hi @jenu9417, the default parallelism is `1500`; you can adjust it according to your actual situation.
   ```
   // parallelisms
   "hoodie.insert.shuffle.parallelism" -> "10",
   "hoodie.upsert.shuffle.parallelism" -> "10",
   "hoodie.delete.shuffle.parallelism" -> "10",
   "hoodie.bulkinsert.shuffle.parallelism" -> "10"
   
   
   // demo
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
       --driver-memory 6G \
       --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
       --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_tablen"
   
   val hudiOptions = Map[String,String](
     "hoodie.insert.shuffle.parallelism" -> "10",
     "hoodie.upsert.shuffle.parallelism" -> "10",
     "hoodie.delete.shuffle.parallelism" -> "10",
     "hoodie.bulkinsert.shuffle.parallelism" -> "10",
     "hoodie.datasource.write.recordkey.field" -> "key",
     "hoodie.datasource.write.partitionpath.field" -> "dt", 
     "hoodie.table.name" -> tableName,
     "hoodie.datasource.write.precombine.field" -> "timestamp",
     "hoodie.table.base.file.format" -> "PARQUET"
   )
   
   val inputDF = spark.range(1, 5).
      withColumn("key", $"id").
      withColumn("data", lit("data")).
      withColumn("timestamp", current_timestamp()).
      withColumn("dt", date_format($"timestamp", "yyyy-MM-dd"))
   
   inputDF.write.format("org.apache.hudi").
     options(hudiOptions).
     mode("Overwrite").
     save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/*").show();
   ```
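   
   Applied to the Java writer from the original snippet, a minimal hedged sketch (assuming the `df`, `table`, and `output` variables defined there) could look like:
   ```
   // Hedged sketch: the same shuffle-parallelism keys as above, passed as
   // writer options. `df`, `table`, and `output` come from the issue snippet.
   df.write().format("hudi")
           .option("hoodie.insert.shuffle.parallelism", "10")
           .option("hoodie.upsert.shuffle.parallelism", "10")
           .option("hoodie.bulkinsert.shuffle.parallelism", "10")
           .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "id")
           .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "batch")
           .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
           .option(HoodieWriteConfig.TABLE_NAME, table)
           .mode(SaveMode.Append)
           .save(output);
   ```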
   
   



[GitHub] [incubator-hudi] vinothchandar commented on issue #1528: [SUPPORT] Issue while writing to HDFS via hudi. Only `/.hoodie` folder is written.

vinothchandar commented on issue #1528:
URL: https://github.com/apache/incubator-hudi/issues/1528#issuecomment-616877781


   @jenu9417 Thanks for taking the time to report this. 
   
   a) is weird.. The logs do indicate that tasks got scheduled at least.. but I think the job died before getting to write any data.. Do you have access to the Spark UI, to see how the jobs are doing?
   
   b) So `.parquet()` does not use hudi at all (I suspect).. It uses the Spark parquet datasource, and you can look at the official Spark docs to understand how to partition that write (I think `.partitionBy("batch")`). `.save()` will invoke the save method of the datasource you configured using `format(...)`.. Spark docs will do a better job of explaining this than me :) 
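   
   To make the contrast concrete, a hedged Java sketch (reusing `df` and `output` from the original snippet; `.partitionBy("batch")` here follows the suggestion above):
   ```
   // Plain Spark parquet datasource: Hudi is not involved, so no _hoodie_*
   // meta columns are written, and partition folders appear only if
   // requested explicitly.
   df.write()
           .partitionBy("batch")
           .mode(SaveMode.Append)
           .parquet(output);

   // By contrast, .format("hudi") followed by .save(output), as in the
   // original snippet, routes the write through the Hudi datasource, which
   // derives partition folders from the configured partition-path field and
   // stamps each record with the _hoodie_* meta columns.
   ```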
   
   >The query was throwing error that there are no such field called _hoodie_commit_time
   
   parquet and hudi are different things.. Only hudi datasets have this field 
   
   c) `.hoodie` will contain all the metadata
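   
   For illustration, one hedged way to see the commit metadata from Spark (the load pattern follows the demo above; `output` is the base path from the issue):
   ```
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;

   // Read the Hudi table back and list the distinct commit instants stamped
   // on the records; the timeline files themselves (e.g. <instant>.commit)
   // live under the .hoodie folder.
   Dataset<Row> hudiDf = spark.read().format("org.apache.hudi")
           .load(output + "/*/*");
   hudiDf.select("_hoodie_commit_time").distinct()
           .orderBy("_hoodie_commit_time")
           .show();
   ```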
   
   d) You can find more on compaction here https://cwiki.apache.org/confluence/display/HUDI/Design+And+Architecture#DesignAndArchitecture-Compaction 



[GitHub] [incubator-hudi] jenu9417 commented on issue #1528: [SUPPORT] Issue while writing to HDFS via hudi. Only `/.hoodie` folder is written.

jenu9417 commented on issue #1528:
URL: https://github.com/apache/incubator-hudi/issues/1528#issuecomment-618336942


   @lamber-ken  @vinothchandar 
   The above-mentioned suggestions work fine. The time to write has now reduced drastically.
   Thank you for the continued support.
   
   Closing the ticket, since the original issue is resolved now.



[GitHub] [incubator-hudi] jenu9417 commented on issue #1528: [SUPPORT] Issue while writing to HDFS via hudi. Only `/.hoodie` folder is written.

jenu9417 commented on issue #1528:
URL: https://github.com/apache/incubator-hudi/issues/1528#issuecomment-616968792


   @vinothchandar 
   Thanks for replying in detail.
   As you pointed out, premature termination of the job seems to be the problem. Since this was a POC and a dry run, I was using a timer to close the job after x seconds, which seems to close the job before the write phase is finished.
   
   But now, the problem is why the write takes more than 40 seconds, even for as few as 10 records, where the average record size is less than a KB.
   
   ```
   91975 [dispatcher-event-loop-0] INFO  org.apache.spark.scheduler.TaskSetManager  - Starting task 625.0 in stage 19.0 (TID 11130, localhost, executor driver, partition 625, PROCESS_LOCAL, 7193 bytes)
   91975 [Executor task launch worker for task 11130] INFO  org.apache.spark.executor.Executor  - Running task 625.0 in stage 19.0 (TID 11130)
   91975 [task-result-getter-0] INFO  org.apache.spark.scheduler.TaskSetManager  - Finished task 624.0 in stage 19.0 (TID 11129) in 16 ms on localhost (executor driver) (624/1500)
   ```
   From the logs, the above set of lines kept repeating multiple times.
   The stage number was increasing and the same 1500 tasks were run again and again. I presume these 1500 are partitions in the RDD? If so, is it possible/advisable to reduce the number of partitions in the RDD?
   
   And what would be the general suggestions to speed up the write here?
   
   Happy to provide any other supporting data, if needed.



[GitHub] [incubator-hudi] vinothchandar commented on issue #1528: [SUPPORT] Issue while writing to HDFS via hudi. Only `/.hoodie` folder is written.

vinothchandar commented on issue #1528:
URL: https://github.com/apache/incubator-hudi/issues/1528#issuecomment-617575183


   +1, the defaults are for a large production setup.. if you want to play around, I suggest following the quickstart, which sets these configs up for smaller scale and avoids these overheads

