Posted to issues@carbondata.apache.org by GitBox <gi...@apache.org> on 2021/06/23 03:16:39 UTC

[GitHub] [carbondata] 01lin opened a new issue #4160: Why opened task less than available executors in case of insert into/load data

01lin opened a new issue #4160:
URL: https://github.com/apache/carbondata/issues/4160


   For insert into or load data, the total number of tasks in the stage is almost equal to the number of hosts, which is generally much smaller than the number of available executors. The low parallelism of the stage slows down execution. Why must the parallelism be limited to the number of distinct hosts? Could more tasks be started to increase parallelism and improve resource utilization? Thanks.
   
   org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala: loadDataFrame
   ```
     /**
      * Execute load process to load from input dataframe
      */
     private def loadDataFrame(
         sqlContext: SQLContext,
         dataFrame: Option[DataFrame],
         carbonLoadModel: CarbonLoadModel
     ): Array[(String, (LoadMetadataDetails, ExecutionErrors))] = {
       try {
         val rdd = dataFrame.get.rdd
         // Get the preferred locations via getPreferredLocs and take the distinct values to obtain the host list
         val nodeNumOfData = rdd.partitions.flatMap[String, Array[String]] { p =>
           DataLoadPartitionCoalescer.getPreferredLocs(rdd, p).map(_.host)
         }.distinct.length
         val nodes = DistributionUtil.ensureExecutorsByNumberAndGetNodeList(
           nodeNumOfData,
           sqlContext.sparkContext)  // ensure the number of executors matches the number of data nodes
         val newRdd = new DataLoadCoalescedRDD[Row](sqlContext.sparkSession, rdd, nodes.toArray
           .distinct)
   
         new NewDataFrameLoaderRDD(
           sqlContext.sparkSession,
           new DataLoadResultImpl(),
           carbonLoadModel,
           newRdd
         ).collect()
       } catch {
         case ex: Exception =>
           LOGGER.error("load data frame failed", ex)
           throw ex
       }
     }
   ```
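
To make the effect described above concrete, here is a minimal sketch in plain Scala (no Spark or CarbonData dependency) of how counting distinct preferred hosts caps the task count. `Part`, `HostCoalesceSketch`, `distinctHostCount` and `coalesceByHost` are hypothetical names used only for illustration. Grouping by the first preferred host is a simplifying assumption, not the actual logic of `DataLoadCoalescedRDD`.

```scala
// Hypothetical stand-in for an RDD partition and its preferred hosts.
final case class Part(index: Int, preferredHosts: Seq[String])

object HostCoalesceSketch {
  // Mirrors the `.flatMap(...).distinct.length` step in loadDataFrame:
  // the result depends on how many hosts hold data, not on how many
  // input partitions or executors exist.
  def distinctHostCount(parts: Seq[Part]): Int =
    parts.flatMap(_.preferredHosts).distinct.length

  // Simplified assumption: assign each partition to its first preferred
  // host, so each host ends up with one group (one task).
  def coalesceByHost(parts: Seq[Part]): Map[String, Seq[Int]] =
    parts
      .filter(_.preferredHosts.nonEmpty)
      .groupBy(_.preferredHosts.head)
      .map { case (host, ps) => host -> ps.map(_.index) }
}
```

With four input partitions spread over two hosts, this sketch produces two groups, which corresponds to two tasks no matter how many executors are available on those hosts.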
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [carbondata] QiangCai commented on issue #4160: Why opened task less than available executors in case of insert into/load data

Posted by GitBox <gi...@apache.org>.

QiangCai commented on issue #4160:
URL: https://github.com/apache/carbondata/issues/4160#issuecomment-869274861


   This only applies to local_sort loading.
   It helps avoid shuffling data between executors.

