Posted to dev@spark.apache.org by Matt Cheah <mc...@palantir.com> on 2015/06/02 07:21:56 UTC

[SQL] Write parquet files under partition directories?

Hi there,

I noticed that the latest Spark SQL programming guide
<https://spark.apache.org/docs/latest/sql-programming-guide.html> describes
support for optimized reading of partitioned Parquet files that follow a
particular directory structure (year=1/month=10/day=3, for example).
However, I see no analogous way to write DataFrames as Parquet files with
similar directory structures based on user-provided partitioning.

Generally, is it possible to write DataFrames as partitioned Parquet files
that downstream partition discovery can take advantage of later? I
considered extending the Parquet output format, but it looks like
ParquetTableOperations.scala has fixed the output format to
AppendingParquetOutputFormat.

Also, I was wondering if it would be valuable to contribute writing Parquet
in partition directories as a PR.

Thanks,

-Matt Cheah
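For reference, the year=1/month=10/day=3 layout described in the guide is
Hive-style partitioning, where each path segment encodes a column=value pair
that partition discovery maps back onto columns. A minimal, Spark-independent
Python sketch of that decoding (the helper name is mine, not a Spark API):

```python
def parse_partition_path(path):
    """Decode Hive-style key=value path segments into partition columns."""
    columns = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            columns[key] = value
    return columns

print(parse_partition_path("year=1/month=10/day=3"))
# {'year': '1', 'month': '10', 'day': '3'}
```

Spark's actual partition discovery additionally infers column types from the
values; the sketch keeps everything as strings for simplicity.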



Re: [SQL] Write parquet files under partition directories?

Posted by Reynold Xin <rx...@databricks.com>.
Almost all DataFrame work is tracked by this umbrella ticket:
https://issues.apache.org/jira/browse/SPARK-6116

For the reader/writer interface, it's here:

https://issues.apache.org/jira/browse/SPARK-7654

https://github.com/apache/spark/pull/6175

On Tue, Jun 2, 2015 at 3:57 PM, Matt Cheah <mc...@palantir.com> wrote:

> Excellent! Where can I find the code, pull request, and Spark ticket where
> this was introduced?
>
> Thanks,
>
> -Matt Cheah
>
> From: Reynold Xin <rx...@databricks.com>
> Date: Monday, June 1, 2015 at 10:25 PM
> To: Matt Cheah <mc...@palantir.com>
> Cc: "dev@spark.apache.org" <de...@spark.apache.org>, Mingyu Kim <
> mkim@palantir.com>, Andrew Ash <aa...@palantir.com>
> Subject: Re: [SQL] Write parquet files under partition directories?
>
> There will be in 1.4.
>
> df.write.partitionBy("year", "month", "day").parquet("/path/to/output")

Re: [SQL] Write parquet files under partition directories?

Posted by Matt Cheah <mc...@palantir.com>.
Excellent! Where can I find the code, pull request, and Spark ticket where
this was introduced?

Thanks,

-Matt Cheah

From:  Reynold Xin <rx...@databricks.com>
Date:  Monday, June 1, 2015 at 10:25 PM
To:  Matt Cheah <mc...@palantir.com>
Cc:  "dev@spark.apache.org" <de...@spark.apache.org>, Mingyu Kim
<mk...@palantir.com>, Andrew Ash <aa...@palantir.com>
Subject:  Re: [SQL] Write parquet files under partition directories?

There will be in 1.4.

df.write.partitionBy("year", "month", "day").parquet("/path/to/output")




Re: [SQL] Write parquet files under partition directories?

Posted by Reynold Xin <rx...@databricks.com>.
There will be in 1.4.

df.write.partitionBy("year", "month", "day").parquet("/path/to/output")
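To illustrate the directory structure that call produces, here is a rough,
Spark-free Python sketch that groups rows by their partition column values and
writes one file per year=.../month=.../day=... directory (the file naming and
contents are simplified stand-ins; real output is Parquet part files):

```python
import os
from collections import defaultdict

def write_partitioned(rows, partition_cols, output_dir):
    """Group rows by partition column values and write each group under a
    Hive-style directory such as year=2015/month=6/day=1."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple((c, str(row[c])) for c in partition_cols)
        # Partition columns are encoded in the path, not stored in the file.
        groups[key].append({k: v for k, v in row.items() if k not in partition_cols})
    for key, group in groups.items():
        part_dir = os.path.join(output_dir, *("%s=%s" % kv for kv in key))
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-00000.txt"), "w") as f:
            for row in group:
                f.write(repr(row) + "\n")

rows = [
    {"year": 2015, "month": 6, "day": 1, "event": "a"},
    {"year": 2015, "month": 6, "day": 2, "event": "b"},
]
write_partitioned(rows, ["year", "month", "day"], "/tmp/partitioned_out")
```

As in Spark's implementation, the partition column values live only in the
directory names, which is what lets the reader prune directories and recover
the columns at scan time.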
