You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/08/07 02:55:00 UTC

[jira] [Commented] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

    [ https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172785#comment-17172785 ] 

Apache Spark commented on SPARK-31703:
--------------------------------------

User 'tinhto-000' has created a pull request for this issue:
https://github.com/apache/spark/pull/29383

> Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31703
>                 URL: https://issues.apache.org/jira/browse/SPARK-31703
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.5, 3.0.0
>         Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>            Reporter: Michail Giannakopoulos
>            Priority: Critical
>              Labels: BigEndian
>         Attachments: Data_problem_Spark.gif
>
>
> Trying to upgrade to Apache Spark 2.4.5 in our IBM systems (AIX and PowerPC) so as to be able to read data stored in parquet format, we notice that values associated with DOUBLE and DECIMAL types are parsed in the wrong form.
> According toe parquet documentation, they always opt to store the values using left-endian representation for values:
> [https://github.com/apache/parquet-format/blob/master/Encodings.md]
> {noformat}
> The plain encoding is used whenever a more efficient encoding can not be used. It
> stores the data in the following format:
> BOOLEAN: Bit Packed, LSB first
> INT32: 4 bytes little endian
> INT64: 8 bytes little endian
> INT96: 12 bytes little endian (deprecated)
> FLOAT: 4 bytes IEEE little endian
> DOUBLE: 8 bytes IEEE little endian
> BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array
> FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
> For native types, this outputs the data as little endian. Floating
> point types are encoded in IEEE.
> For the byte array type, it encodes the length as a 4 byte little
> endian, followed by the bytes.{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org