Posted to dev@carbondata.apache.org by Anshul Jain <an...@impetus.co.in.INVALID> on 2019/05/20 05:04:23 UTC
Query About Carbon Write Process: why are always 10 tasks created when we write a dataframe or RDD in carbon format in a write/save job
Hi Dev team,
I am running a test with CarbonData: loading a 600 GB CSV file and writing it in carbon format to an S3 location. While writing, I can see only 10 tasks created in the final stage of the execution job, matching my 10 nodes, even though num-executors is set to 18. This is degrading my job's performance. How can I make the number of tasks equal to the number of executors for best performance?
Thanks & Regards,
Anshul Jain
Big Data Engineer
Impetus Infotech (India) Pvt. Ltd.
Tel: +91-0731-4743600/3662
________________________________
Re: Query About Carbon Write Process: why are always 10 tasks created when we write a dataframe or RDD in carbon format in a write/save job
Posted by Jacky Li <ja...@qq.com>.
One correction to my last reply: the property that controls the number of threads used for sorting during data load is "carbon.number.of.cores.while.loading".
You can set it like this (note that addProperty takes the value as a String):
CarbonProperties.getInstance().addProperty("carbon.number.of.cores.while.loading", "8")
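As a minimal sketch, here is how this property might be applied before a carbon-format dataframe write. The "carbondata" data source name and the "tableName" option follow the CarbonData Spark integration; the table name and dataframe are illustrative assumptions, not from the thread:

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Raise the number of sort threads per executor during data load.
// addProperty takes the value as a String.
CarbonProperties.getInstance()
  .addProperty("carbon.number.of.cores.while.loading", "8")

// Hypothetical write, assuming `df` holds the input DataFrame:
df.write
  .format("carbondata")
  .option("tableName", "my_table") // illustrative table name
  .mode("overwrite")
  .save()
```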
Regards,
Jacky
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Re: Query About Carbon Write Process: why are always 10 tasks created when we write a dataframe or RDD in carbon format in a write/save job
Posted by Jacky Li <ja...@qq.com>.
Hi Anshul Jain,
If you specified the SORT_COLUMNS table property when creating the table, by default Carbon sorts the input data during data loading (to build the index). The sorting is controlled by a table property called SORT_SCOPE; the default is LOCAL_SORT, which sorts the data locally within each Spark executor without shuffling across executors. There are other options too; see http://carbondata.apache.org/ddl-of-carbondata.html
In your case, I guess it is using LOCAL_SORT. This sorting is multi-threaded inside the executor, controlled by a CarbonProperty called "NUM_THREAD_WHILE_LOADING".
If you want the default Spark behavior, as when loading Parquet, you can set SORT_SCOPE to NO_SORT.
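For reference, a sketch of setting SORT_SCOPE at table-creation time. The table and column names are illustrative, and the exact DDL syntax may differ by CarbonData version; see the DDL page linked above for the authoritative form:

```sql
-- Illustrative: skip the local sort during load, as with Parquet.
CREATE TABLE events (
  id BIGINT,
  ts TIMESTAMP,
  payload STRING
)
STORED AS carbondata
TBLPROPERTIES ('SORT_SCOPE'='NO_SORT');
```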
Regards,
Jacky