Posted to user@spark.apache.org by 崔苗 <cu...@danale.com> on 2018/05/26 06:24:09 UTC

what defines dataset partition number in spark sql

Hi,
When I create a Dataset in Spark SQL by reading files from HDFS, for example:
Dataset<Row> user = spark.read().format("json").load(filePath);
what determines the number of partitions of the Dataset?
And what happens if filePath is a directory instead of a single file?
Why can't we get the number of partitions directly with dataset.getNumPartitions()? Why must we first convert the Dataset to an RDD, as in dataset.rdd().getNumPartitions()?
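
For reference, here is a minimal sketch of the workaround I am using now. The class name, the local master setting, and taking the path from the first command-line argument are placeholders for illustration only, not part of my real job:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical class name, used only for this illustration.
public class PartitionCountExample {
    public static void main(String[] args) {
        // Placeholder: filePath may point to a single JSON file or to a directory.
        String filePath = args[0];

        SparkSession spark = SparkSession.builder()
                .appName("PartitionCountExample")
                .master("local[*]") // assumption: local run, adjust for a real cluster
                .getOrCreate();

        Dataset<Row> user = spark.read().format("json").load(filePath);

        // Dataset itself has no getNumPartitions(), so go through the underlying RDD.
        int numPartitions = user.rdd().getNumPartitions();
        System.out.println("Number of partitions: " + numPartitions);

        spark.stop();
    }
}
```

Running this once against a single file and once against a directory shows the partition count for each case, but it does not tell me how that count is chosen.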


Thanks