Posted to issues@spark.apache.org by "Daniel Darabos (Jira)" <ji...@apache.org> on 2022/10/21 13:12:00 UTC

[jira] [Created] (SPARK-40873) Spark doesn't see some Parquet columns written from r-arrow

Daniel Darabos created SPARK-40873:
--------------------------------------

             Summary: Spark doesn't see some Parquet columns written from r-arrow
                 Key: SPARK-40873
                 URL: https://issues.apache.org/jira/browse/SPARK-40873
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Daniel Darabos
         Attachments: part-0.parquet

I have a Parquet file that was created in R with the r-arrow package version 9.0.0 from Conda Forge with the write_dataset() function. It has four columns, but Spark 3.3.0 only sees two of them.

{{>>> df = spark.read.parquet('part-0.parquet')}}
{{>>> df.head()}}
{{Row(name='Adam', age=20.0)}}
{{>>> df.columns}}
{{['name', 'age']}}
{{>>> import pandas as pd}}
{{>>> pd.read_parquet('part-0.parquet')}}
{{           name   age   age_2      age_4}}
{{0          Adam  20.0   400.0   160000.0}}
{{1           Eve  18.0   324.0   104976.0}}
{{2           Bob  50.0  2500.0  6250000.0}}
{{3  Isolated Joe   2.0     4.0       16.0}}
{{>>> import pyarrow as pa}}
{{>>> import pyarrow.parquet as pq}}
{{>>> t = pq.read_table('part-0.parquet')}}
{{>>> t}}
{{pyarrow.Table}}
{{name: string}}
{{age: double}}
{{age_2: double}}
{{age_4: double}}
{{----}}
{{name: [["Adam","Eve","Bob","Isolated Joe"]]}}
{{age: [[20,18,50,2]]}}
{{age_2: [[400,324,2500,4]]}}
{{age_4: [[160000,104976,6250000,16]]}}
{{>>> pq.read_metadata('part-0.parquet')}}
{{<pyarrow._parquet.FileMetaData object at 0x7f13e9dee5e0>}}
{{  created_by: parquet-cpp-arrow version 9.0.0}}
{{  num_columns: 4}}
{{  num_rows: 4}}
{{  num_row_groups: 1}}
{{  format_version: 2.6}}
{{  serialized_size: 1510}}
{{>>> pq.read_metadata('part-0.parquet').schema}}
{{<pyarrow._parquet.ParquetSchema object at 0x7f13e9dc46c0>}}
{{required group field_id=-1 schema {}}
{{  optional binary field_id=-1 name (String);}}
{{  optional double field_id=-1 age;}}
{{  optional double field_id=-1 age_2;}}
{{  optional double field_id=-1 age_4;}}
{{}}}

"age_2" and "age_4" look no different from "age" in the schema. I also tried renaming the columns (plain letters only, no underscores or digits), but I still get the same behavior.

Is something wrong with my file? Is something wrong with Spark?

(I'll attach the file in a minute, I just need to figure out how.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
