Posted to issues@spark.apache.org by "Daniel Darabos (Jira)" <ji...@apache.org> on 2022/10/21 13:12:00 UTC
[jira] [Created] (SPARK-40873) Spark doesn't see some Parquet columns written from r-arrow
Daniel Darabos created SPARK-40873:
--------------------------------------
Summary: Spark doesn't see some Parquet columns written from r-arrow
Key: SPARK-40873
URL: https://issues.apache.org/jira/browse/SPARK-40873
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.0
Reporter: Daniel Darabos
Attachments: part-0.parquet
I have a Parquet file that was created in R with the r-arrow package version 9.0.0 from Conda Forge with the write_dataset() function. It has four columns, but Spark 3.3.0 only sees two of them.
{{>>> df = spark.read.parquet('part-0.parquet')}}
{{>>> df.head()}}
{{Row(name='Adam', age=20.0)}}
{{>>> df.columns}}
{{['name', 'age']}}
{{>>> import pandas as pd}}
{{>>> pd.read_parquet('part-0.parquet')}}
{{           name   age   age_2      age_4}}
{{0          Adam  20.0   400.0   160000.0}}
{{1           Eve  18.0   324.0   104976.0}}
{{2           Bob  50.0  2500.0  6250000.0}}
{{3  Isolated Joe   2.0     4.0       16.0}}
{{>>> import pyarrow as pa}}
{{>>> import pyarrow.parquet as pq}}
{{>>> t = pq.read_table('part-0.parquet')}}
{{>>> t}}
{{pyarrow.Table}}
{{name: string}}
{{age: double}}
{{age_2: double}}
{{age_4: double}}
{{----}}
{{name: [["Adam","Eve","Bob","Isolated Joe"]]}}
{{age: [[20,18,50,2]]}}
{{age_2: [[400,324,2500,4]]}}
{{age_4: [[160000,104976,6250000,16]]}}
{{>>> pq.read_metadata('part-0.parquet')}}
{{<pyarrow._parquet.FileMetaData object at 0x7f13e9dee5e0>}}
{{ created_by: parquet-cpp-arrow version 9.0.0}}
{{ num_columns: 4}}
{{ num_rows: 4}}
{{ num_row_groups: 1}}
{{ format_version: 2.6}}
{{ serialized_size: 1510}}
{{>>> pq.read_metadata('part-0.parquet').schema}}
{{<pyarrow._parquet.ParquetSchema object at 0x7f13e9dc46c0>}}
{{required group field_id=-1 schema {}}
{{ optional binary field_id=-1 name (String);}}
{{ optional double field_id=-1 age;}}
{{ optional double field_id=-1 age_2;}}
{{ optional double field_id=-1 age_4;}}
{{}}}
"age_2" and "age_4" look no different from "age" in the schema. I also tried renaming the columns (to names made of letters only), but I get the same behavior.
Is something wrong with my file? Is something wrong with Spark?
(I'll attach the file in a minute; I just need to figure out how.)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org