Posted to dev@arrow.apache.org by Lucas Pickup <Lu...@microsoft.com.INVALID> on 2017/08/16 15:58:15 UTC

Major difference between Spark and Arrow Parquet Implementations

Hello,

I have been using pyarrow and PySpark to write Parquet files. With pyarrow I can successfully write out a Parquet file whose column names contain spaces, e.g. 'X Coordinate'.
When I try to write out the same dataset using Spark's Parquet writer, it fails with:
"Attribute name "X Coordinate" contains invalid character(s) among " ,;{}()\n\t="".
It seems that Spark's Parquet implementation disallows these characters in a Parquet schema because they carry special meaning.
The code that performs this check is here: https://github.com/apache/spark/blob/cba826d00173a945b0c9a7629c66e36fa73b723e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565

I was wondering if there is a reason the two implementations differ so significantly when it comes to schema generation?

Cheers, Lucas Pickup

RE: Major difference between Spark and Arrow Parquet Implementations

Posted by Erin Sobkow <es...@parklandvalley.ca>.
Thanks.  

Erin Sobkow, BA Kin, RMT
Community Consultant
Parkland Valley Sport, Culture & Recreation District

Box 263, Yorkton, SK  S3N 2V7
Phone: (306) 786-6585
Fax: (306) 782-0474
Email:  esobkow@parklandvalley.ca
Website:  www.parklandvalley.ca

If you no longer wish to receive electronic messages from Parkland Valley Sport, Culture & Recreation District please reply with the word 'STOP'.

 

Together...building healthy communities through sport, culture and recreation


Re: Major difference between Spark and Arrow Parquet Implementations

Posted by Wes McKinney <we...@gmail.com>.
hi Erin -- please send a separate e-mail to dev-unsubscribe@arrow.apache.org

Thanks


RE: Major difference between Spark and Arrow Parquet Implementations

Posted by Erin Sobkow <es...@parklandvalley.ca>.
Hi Wes:

Somehow I have been inadvertently added to your list and am getting all these emails that make no sense to me at all.  I'm in on some conversation I know nothing about and am getting up to 20 emails a day from different people.  Can I ask you to remove me from your list and can you get all the other people in your group to remove me as well?  Thanks!


Re: Major difference between Spark and Arrow Parquet Implementations

Posted by Wes McKinney <we...@gmail.com>.
hi Lucas,

My understanding is that the Parquet format by itself does not place
any such restrictions on the names of fields, and so this is a Spark
SQL-specific issue (anyone please correct me if I'm mistaken about
this). I would be happy to help add a schema cleaning option to
normalize field names for use in Spark. I just opened:

https://issues.apache.org/jira/browse/ARROW-1359

Thanks
Wes
