Posted to commits@hudi.apache.org by "Nishith Agarwal (Jira)" <ji...@apache.org> on 2021/05/24 22:23:00 UTC
[jira] [Commented] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350651#comment-17350651 ]
Nishith Agarwal commented on HUDI-1668:
---------------------------------------
[~shivnarayan] [~sugamberku] Spark sorts data in three stages:
# In the first stage, Spark does a sample scan of the data to determine the range boundaries the data should be partitioned on if it were sorted.
# In the second stage, Spark performs the actual shuffle write: the mappers route each record to one of N buckets/partitions based on the sampled range boundaries.
# In the third stage, the reducers pull their partition's data and sort it using external sorting techniques (this stage only shows up in the job that consumes the data).
Seeing two stages for a single sort is therefore expected. If you need more information, see [https://stackoverflow.com/questions/51831696/what-happens-under-the-hood-when-you-sort-a-dataframe-in-spark]
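To make the three stages above concrete, here is a minimal Python sketch of sample-based range partitioning. This is an illustration of the technique only, not Spark's actual implementation; the function name and parameters are made up for the example.

```python
import random

def range_partition(data, num_partitions, sample_size=100):
    """Sketch of sample-based range partitioning (hypothetical helper,
    not Spark's RangePartitioner)."""
    # Stage 1: sample the data to pick partition boundaries.
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    step = max(1, len(sample) // num_partitions)
    boundaries = sample[step::step][:num_partitions - 1]

    # Stage 2: "map side" -- route each record to its bucket.
    buckets = [[] for _ in range(num_partitions)]
    for x in data:
        # Linear scan for clarity; a real implementation would binary-search.
        idx = sum(1 for b in boundaries if x > b)
        buckets[idx].append(x)

    # Stage 3: "reduce side" -- each partition sorts its own records.
    return [sorted(b) for b in buckets]

random.seed(0)
data = list(range(1000))
random.shuffle(data)
parts = range_partition(data, 4)
# Because buckets cover disjoint, ordered ranges, concatenating the
# sorted partitions yields a globally sorted sequence.
flat = [x for p in parts for x in p]
assert flat == sorted(data)
```

The sampling pass (stage 1) is what shows up as the extra `sortBy` stage in the Spark UI: the data is scanned once to estimate boundaries, then shuffled and sorted.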
> GlobalSortPartitioner is getting called twice during bulk_insert.
> -----------------------------------------------------------------
>
> Key: HUDI-1668
> URL: https://issues.apache.org/jira/browse/HUDI-1668
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Sugamber
> Priority: Minor
> Labels: sev:high, user-support-issues
> Attachments: 1st.png, 2nd.png, Screen Shot 2021-04-17 at 11.23.17 AM.png, Screenshot 2021-04-21 at 6.40.19 PM.png, Screenshot 2021-04-21 at 6.40.40 PM.png
>
>
> Hi Team,
> I'm using the bulk insert option to load close to 2 TB of data. The process takes nearly 2 hours to complete. While looking at the job log, I identified that [sortBy at GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1] is running twice.
> It is first triggered at one stage. *Refer to this screenshot -> [^1st.png]*.
> The second time it is triggered from the *[count at HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]* step.
> In both cases, the same number of jobs were triggered and their running times are close to each other. *Refer to this screenshot* -> [^2nd.png]
> Is there any way to run it only once so that the data can be loaded faster, or is this expected behaviour?
> *Spark and Hudi versions*
>
> {code:java}
> Spark - 2.3.0
> Scala- 2.11.12
> Hudi - 0.7.0
>
> {code}
>
> Hudi Configuration
> {code:java}
> "hoodie.cleaner.commits.retained" = 2
> "hoodie.bulkinsert.shuffle.parallelism"=2000
> "hoodie.parquet.small.file.limit" = 100000000
> "hoodie.parquet.max.file.size" = 128000000
> "hoodie.index.bloom.num_entries" = 1800000
> "hoodie.bloom.index.filter.type" = "DYNAMIC_V0"
> "hoodie.bloom.index.filter.dynamic.max.entries" = 2500000
> "hoodie.bloom.index.bucketized.checking" = "false"
> "hoodie.datasource.write.operation" = "bulk_insert"
> "hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
> {code}
>
> Spark Configuration -
> {code:java}
> --num-executors 180
> --executor-cores 4
> --executor-memory 16g
> --driver-memory=24g
> --conf spark.rdd.compress=true
> --queue=default
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
> --conf spark.executor.memoryOverhead=1600
> --conf spark.driver.memoryOverhead=1200
> --conf spark.driver.maxResultSize=2g
> --conf spark.kryoserializer.buffer.max=512m
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)