Posted to issues@spark.apache.org by "Daniel Darabos (Jira)" <ji...@apache.org> on 2022/10/21 13:18:00 UTC

[jira] [Commented] (SPARK-40873) Spark doesn't see some Parquet columns written from r-arrow

    [ https://issues.apache.org/jira/browse/SPARK-40873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622251#comment-17622251 ] 

Daniel Darabos commented on SPARK-40873:
----------------------------------------

Oh, I think I got it! With debug logging Spark prints a lot of stuff, including the metadata:

{{"keyValueMetaData" : {}}
{{      "ARROW:schema" : "/////zACAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAACwBAAAEAAAAAgAAAOgAAAAEAAAAKP///wgAAAA0AAAAKQAAAG9yZy5hcGFjaGUuc3Bhcmsuc3FsLnBhcnF1ZXQucm93Lm1ldGFkYXRhAAAAlwAAAHsidHlwZSI6InN0cnVjdCIsImZpZWxkcyI6W3sibmFtZSI6Im5hbWUiLCJ0eXBlIjoic3RyaW5nIiwibnVsbGFibGUiOnRydWUsIm1ldGFkYXRhIjp7fX0seyJuYW1lIjoiYWdlIiwidHlwZSI6ImRvdWJsZSIsIm51bGxhYmxlIjp0cnVlLCJtZXRhZGF0YSI6e319XX0ACAAMAAQACAAIAAAACAAAACQAAAAYAAAAb3JnLmFwYWNoZS5zcGFyay52ZXJzaW9uAAAAAAUAAAAzLjMuMAAAAAQAAACoAAAAZAAAADQAAAAEAAAAeP///wAAAQMQAAAAGAAAAAQAAAAAAAAABQAAAGFnZV80AAAAqv///wAAAgCk////AAABAxAAAAAYAAAABAAAAAAAAAAFAAAAYWdlXzIAAADW////AAACAND///8AAAEDEAAAABwAAAAEAAAAAAAAAAMAAABhZ2UAAAAGAAgABgAGAAAAAAACABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAEAAAAbmFtZQAAAAAEAAQABAAAAA==",}}
{{      "org.apache.spark.version" : "3.3.0",}}
{{      "org.apache.spark.sql.parquet.row.metadata" : "\{\"type\":\"struct\",\"fields\":[{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},\{\"name\":\"age\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}"}}
{{    },}}

This file is based on a Parquet file originally written by Spark; the "age_2" and "age_4" columns were added later in R. It looks like r-arrow carried the key-value metadata over from the original file, so the "{{org.apache.spark.sql.parquet.row.metadata}}" key still describes only "name" and "age", and Spark trusts that key over the actual Parquet schema, which explains why it only sees two columns.
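
For anyone else hitting this: the stale key can be confirmed with pyarrow alone, no debug logging needed. A minimal sketch against the attached part-0.parquet ({{FileMetaData.metadata}} exposes the Parquet key/value pairs as a bytes-to-bytes dict):

{{import pyarrow.parquet as pq}}
{{# Reads only the footer; no data pages are loaded.}}
{{meta = pq.read_metadata('part-0.parquet')}}
{{for key, value in (meta.metadata or {}).items():}}
{{    # A stale "org.apache.spark.sql.parquet.row.metadata" entry here is}}
{{    # what Spark picks up instead of the actual Parquet schema.}}
{{    print(key.decode(), '=>', value[:80])}}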

I'll see if I can drop the metadata in r-arrow.
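
In the meantime, a workaround sketch with pyarrow (assumes rewriting the file is acceptable; the output name part-0-fixed.parquet is just for illustration):

{{import pyarrow.parquet as pq}}
{{t = pq.read_table('part-0.parquet')}}
{{# Schema metadata keys are bytes; drop only the stale Spark schema key.}}
{{kv = dict(t.schema.metadata or {})}}
{{kv.pop(b'org.apache.spark.sql.parquet.row.metadata', None)}}
{{# replace_schema_metadata returns a copy of the table with the new metadata.}}
{{pq.write_table(t.replace_schema_metadata(kv), 'part-0-fixed.parquet')}}

With the key gone, {{spark.read.parquet('part-0-fixed.parquet')}} should fall back to the Parquet schema itself and see all four columns.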

> Spark doesn't see some Parquet columns written from r-arrow
> -----------------------------------------------------------
>
>                 Key: SPARK-40873
>                 URL: https://issues.apache.org/jira/browse/SPARK-40873
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Daniel Darabos
>            Priority: Minor
>         Attachments: part-0.parquet
>
>
> I have a Parquet file that was created in R using the r-arrow package (version 9.0.0 from Conda Forge) and its write_dataset() function. It has four columns, but Spark 3.3.0 only sees two of them.
> {{>>> df = spark.read.parquet('part-0.parquet')}}
> {{>>> df.head()}}
> {{Row(name='Adam', age=20.0)}}
> {{>>> df.columns}}
> {{['name', 'age']}}
> {{>>> import pandas as pd}}
> {{>>> pd.read_parquet('part-0.parquet')}}
> {{           name   age   age_2      age_4}}
> {{0          Adam  20.0   400.0   160000.0}}
> {{1           Eve  18.0   324.0   104976.0}}
> {{2           Bob  50.0  2500.0  6250000.0}}
> {{3  Isolated Joe   2.0     4.0       16.0}}
> {{>>> import pyarrow as pa}}
> {{>>> import pyarrow.parquet as pq}}
> {{>>> t = pq.read_table('part-0.parquet')}}
> {{>>> t}}
> {{pyarrow.Table}}
> {{name: string}}
> {{age: double}}
> {{age_2: double}}
> {{age_4: double}}
> {{----}}
> {{name: [["Adam","Eve","Bob","Isolated Joe"]]}}
> {{age: [[20,18,50,2]]}}
> {{age_2: [[400,324,2500,4]]}}
> {{age_4: [[160000,104976,6250000,16]]}}
> {{>>> pq.read_metadata('part-0.parquet')}}
> {{<pyarrow._parquet.FileMetaData object at 0x7f13e9dee5e0>}}
> {{  created_by: parquet-cpp-arrow version 9.0.0}}
> {{  num_columns: 4}}
> {{  num_rows: 4}}
> {{  num_row_groups: 1}}
> {{  format_version: 2.6}}
> {{  serialized_size: 1510}}
> {{>>> pq.read_metadata('part-0.parquet').schema}}
> {{<pyarrow._parquet.ParquetSchema object at 0x7f13e9dc46c0>}}
> {{required group field_id=-1 schema {}}
> {{  optional binary field_id=-1 name (String);}}
> {{  optional double field_id=-1 age;}}
> {{  optional double field_id=-1 age_2;}}
> {{  optional double field_id=-1 age_4;}}
> {{}}}
> "age_2" and "age_4" look no different from "age" based on the schema. I tried changing the names (just letters) but I still get the same behavior.
> Is something wrong with my file? Is something wrong with Spark?
> (I'll attach the file in a minute, I just need to figure out how.)


