Posted to user@spark.apache.org by "McBeath, Darin W (ELS-STL)" <D....@elsevier.com> on 2016/09/20 18:22:26 UTC

Dataset doesn't have partitioner after a repartition on one of the columns

I'm using Spark 2.0.

I've created a dataset from a parquet file, repartitioned it on one of the columns (docId), and persisted the repartitioned dataset.

val om = ds.repartition($"docId").persist(StorageLevel.MEMORY_AND_DISK)

When I try to confirm the partitioner, with

om.rdd.partitioner

I get

Option[org.apache.spark.Partitioner] = None

I would have thought it would be HashPartitioner.

Does anyone know why this would be None and not HashPartitioner?
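(For comparison, the expectation is natural because the RDD API does report a partitioner after an explicit partitionBy. A minimal local sketch, with illustrative data and app name:)

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("partitioner-sketch").setMaster("local[2]"))

// A keyed RDD partitioned explicitly does carry a Partitioner:
val keyed = sc
  .parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
  .partitionBy(new HashPartitioner(4))

// keyed.partitioner is a defined Some(...), unlike om.rdd.partitioner above.
println(keyed.partitioner)
```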

Thanks.

Darin.



Re: Dataset doesn't have partitioner after a repartition on one of the columns

Posted by Igor Berman <ig...@gmail.com>.
Michael, can you please explain why bucketBy is supported when writing to a
table with saveAsTable() but not with parquet()?
Is that the only difference between the table API and the DataFrame/Dataset
API, or are there others?

org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;
  at org.apache.spark.sql.DataFrameWriter.assertNotBucketed(DataFrameWriter.scala:310)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:203)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:478)
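(A sketch of the difference, with illustrative data, table name, and path; the table write goes through saveAsTable, which is the only writer that accepts bucketBy here:)

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketing-sketch")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

val ds = Seq((1, "doc a"), (2, "doc b")).toDF("docId", "body")

// bucketBy is accepted when the write goes through the table API,
// because bucketing metadata is recorded in the catalog:
ds.write.bucketBy(4, "docId").sortBy("docId").saveAsTable("docs_bucketed")

// ...but the path-based writers reject it with the AnalysisException above:
// ds.write.bucketBy(4, "docId").parquet("/tmp/docs_bucketed")
```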


thanks in advance


On 28 September 2016 at 21:26, Michael Armbrust <mi...@databricks.com>
wrote:

> Hi Darin,
>
> In SQL we have finer grained information about partitioning, so we don't
> use the RDD Partitioner.  Here's a notebook
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3633335638369146/2840265927289860/latest.html> that
> walks through what we do expose and how it is used by the query planner.
>
> Michael
>

Re: Dataset doesn't have partitioner after a repartition on one of the columns

Posted by Michael Armbrust <mi...@databricks.com>.
Hi Darin,

In SQL we have finer grained information about partitioning, so we don't
use the RDD Partitioner.  Here's a notebook
<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3633335638369146/2840265927289860/latest.html> that
walks through what we do expose and how it is used by the query planner.
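(A minimal sketch of the two views side by side, with illustrative data; outputPartitioning is the SQL-level partitioning information on the physical plan:)

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning

val spark = SparkSession.builder()
  .appName("partitioning-sketch")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

val ds = Seq((1, "doc a"), (2, "doc b")).toDF("docId", "body")
val om = ds.repartition($"docId")

// The RDD view reports no Partitioner:
println(om.rdd.partitioner)  // None

// ...but the physical plan records the hash partitioning from repartition:
om.queryExecution.executedPlan.outputPartitioning match {
  case h: HashPartitioning => println(s"hash partitioned on ${h.expressions}")
  case other               => println(s"other partitioning: $other")
}
```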

Michael
