Posted to dev@parquet.apache.org by ashokkumar rajendran <as...@gmail.com> on 2016/03/04 07:50:38 UTC

Re: Do we need schema for Parquet files with Spark?

Hi Ted,

Thanks for pointing this out. That page has a mailing list for developers but
not one for users yet, it seems. Including only the developers mailing list.

Hi Parquet Team,

Could you please clarify the question below? Please let me know if there is
a separate mailing list for users rather than developers.

Regards
Ashok

On Fri, Mar 4, 2016 at 11:01 AM, Ted Yu <yu...@gmail.com> wrote:

> Have you taken a look at https://parquet.apache.org/community/ ?
>
> On Thu, Mar 3, 2016 at 7:32 PM, ashokkumar rajendran <
> ashokkumar.rajendran@gmail.com> wrote:
>
>> Hi,
>>
>> I am exploring using Apache Parquet with Spark SQL in our project. I
>> notice that Apache Parquet uses different encodings for different columns.
>> The dictionary encoding in Parquet looks like a good fit for our
>> performance needs. I do not see much documentation in Spark or Parquet on
>> how to configure this. For example, how would Parquet know the dictionary
>> of words if no schema is provided by the user? Where/how do I specify my
>> schema / config for the Parquet format?
>>
>> I could not find the Apache Parquet mailing list on the official site. It
>> would be great if anyone could share it as well.
>>
>> Regards
>> Ashok
>>
>
>

Re: Do we need schema for Parquet files with Spark?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Ashok,

The schema for your data comes from the data frame you're using in Spark,
and it is resolved with a Hive table schema if you are writing to one. For
encodings, you don't need to configure them because they are selected for
your data automatically. For example, Parquet will try dictionary encoding
first and fall back to a non-dictionary encoding if it looks like
dictionary encoding would take more space.
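The size comparison described here can be sketched in plain Python. This is a toy illustration of the heuristic only, not Parquet's actual writer code; the byte-size estimates below are simplifying assumptions (the real format uses bit-packed/RLE dictionary indices and falls back at the page level):

```python
# Toy sketch of the dictionary-vs-plain size comparison a Parquet
# writer makes per column of string values. The 4-byte length prefix
# mirrors PLAIN encoding for BYTE_ARRAY; the 4-byte index per row is
# a deliberate over-simplification of the bit-packed indices.

def plain_size(values):
    """Rough size if every value is written out in full."""
    return sum(4 + len(v.encode("utf-8")) for v in values)

def dictionary_size(values):
    """Rough size of a dictionary page (each distinct value once)
    plus one small index per row."""
    dict_page = sum(4 + len(v.encode("utf-8")) for v in set(values))
    indices = 4 * len(values)
    return dict_page + indices

def choose_encoding(values):
    """Pick dictionary encoding only when it is estimated to be smaller,
    mirroring the try-dictionary-first, fall-back-to-plain behavior."""
    if dictionary_size(values) < plain_size(values):
        return "DICTIONARY"
    return "PLAIN"

# A low-cardinality column benefits from a dictionary...
repetitive = ["apache", "parquet", "apache", "parquet"] * 1000
# ...while a column of all-unique values does not.
unique = [f"row-{i:08d}" for i in range(4000)]

print(choose_encoding(repetitive))  # DICTIONARY
print(choose_encoding(unique))      # PLAIN
```

Note that no schema beyond the column's type is needed for this decision: the writer discovers the dictionary from the data itself as it buffers values, which is why there is nothing for the user to configure.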

I recommend writing out a data frame to Parquet and then just taking a look
at the result using parquet-tools, which you can download from Maven
Central.

rb

On Thu, Mar 3, 2016 at 10:50 PM, ashokkumar rajendran <
ashokkumar.rajendran@gmail.com> wrote:

> Hi Ted,
>
> Thanks for pointing this out. That page has a mailing list for developers but
> not one for users yet, it seems. Including only the developers mailing list.
>
> Hi Parquet Team,
>
> Could you please clarify the question below? Please let me know if there is
> a separate mailing list for users rather than developers.
>
> Regards
> Ashok
>
> On Fri, Mar 4, 2016 at 11:01 AM, Ted Yu <yu...@gmail.com> wrote:
>
> > Have you taken a look at https://parquet.apache.org/community/ ?
> >
> > On Thu, Mar 3, 2016 at 7:32 PM, ashokkumar rajendran <
> > ashokkumar.rajendran@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I am exploring using Apache Parquet with Spark SQL in our project. I
> >> notice that Apache Parquet uses different encodings for different columns.
> >> The dictionary encoding in Parquet looks like a good fit for our
> >> performance needs. I do not see much documentation in Spark or Parquet on
> >> how to configure this. For example, how would Parquet know the dictionary
> >> of words if no schema is provided by the user? Where/how do I specify my
> >> schema / config for the Parquet format?
> >>
> >> I could not find the Apache Parquet mailing list on the official site. It
> >> would be great if anyone could share it as well.
> >>
> >> Regards
> >> Ashok
> >>
> >
> >
>



-- 
Ryan Blue
Software Engineer
Netflix
