Posted to user@spark.apache.org by Jim Green <op...@gmail.com> on 2015/08/28 00:53:02 UTC

Array column stored as “.bag” in parquet file instead of “REPEATED INT64”

Hi Team,

Say I have a test.json file: {"c1":[1,2,3]}
I can create a parquet file like:
var df = sqlContext.load("/tmp/test.json","json")
var df_c = df.repartition(1)
df_c.select("*").save("/tmp/testjson_spark","parquet")

The output parquet file’s schema is like:
c1:          OPTIONAL F:1
.bag:        REPEATED F:1
..array:     OPTIONAL INT64 R:1 D:3

Is there any way to avoid using “.bag” and instead create the parquet
file with column type “REPEATED INT64”?
The expected data type is:
c1:          REPEATED INT64 R:1 D:1

Thanks!
-- 
Thanks,
www.openkb.info
(Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)

Re: Array column stored as “.bag” in parquet file instead of “REPEATED INT64”

Posted by Cheng Lian <li...@gmail.com>.
Hi Jim,

Unfortunately this is neither possible in Spark nor a standard practice for
Parquet.

In your case, repeated int64 c1 actually doesn't capture the full
semantics, because it represents a *required* array of long values
containing zero or more *non-null* elements. However, when inferring the
schema from JSON files, it's not safe to assume any field is non-nullable,
so we always generate nullable schemas for JSON. The schema generated by
Spark SQL for the JSON snippet you provided is:

message root {
  optional group c1 (LIST) {
    repeated group bag {
      optional int64 array;
    }
  }
}

The outer optional means the array field c1 itself can be null, and the
inner optional means elements contained in the array can also be null.
That's the reason why parquet-format defines a 3-level structure to
represent LIST; this is different from Protocol Buffers. Another thing to
note is that extra nested levels are super cheap in Parquet (almost zero
cost), because only leaf nodes are materialized in the physical file. If
you are worried about interoperability with other Parquet libraries like
parquet-protobuf then, to the best of my knowledge, Spark SQL 1.5 is
currently the only system that can correctly interpret Parquet files
generated by other systems, because it is the only one that implements all
of the backwards-compatibility rules defined in parquet-format.
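
As a side note, if you want the non-nullable semantics at the Spark SQL
level, you can bypass JSON schema inference and supply an explicit schema
yourself. A rough, untested sketch (reusing your /tmp/test.json); note that
this only changes the logical Spark SQL schema, it still won't make the
writer emit a bare repeated int64 column:

// Sketch: supply an explicit schema with non-nullable array elements
// instead of relying on JSON schema inference.
import org.apache.spark.sql.types._

val explicitSchema = StructType(Seq(
  StructField("c1", ArrayType(LongType, containsNull = false), nullable = false)))
val dfExplicit = sqlContext.read.schema(explicitSchema).json("/tmp/test.json")
dfExplicit.printSchema()
// printSchema() should report c1 with nullable = false and
// element containsNull = false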

As for Parquet compatibility, it's a little bit complicated and requires
some background knowledge to understand. You may find more details in this
section of the parquet-format spec
<https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>.
Although Parquet was designed with interoperability in mind, in the early
days the format spec didn't explicitly specify how nested types like LIST
and MAP should be represented. The consequence is that different Parquet
libraries, including Spark SQL, all use different representations and are
incompatible with each other in many cases. For example, to represent a
required list of strings containing no null values, all of the Parquet
schemas below are valid:

// parquet-protobuf style
message m0 {
  repeated binary f (UTF8);
}

// parquet-avro style
message m1 {
  required group f (LIST) {
    repeated binary array (UTF8);
  }
}

// parquet-thrift style
message m2 {
  required group f (LIST) {
    repeated binary f_tuple (UTF8);
  }
}

// standard layout defined in the most recent parquet-format
message m3 {
  required group f (LIST) {
    repeated group list {
      required binary element (UTF8);
    }
  }
}

Obviously, this badly hurts Parquet interoperability. To fix this issue,
parquet-format recently defined standard layouts for nested types
<https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types>,
as well as backwards-compatibility rules for reading legacy Parquet files (1
<https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>,
2
<https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps>).
We implemented all of these rules on the read path in Spark SQL 1.5, which
means we can now read non-standard legacy Parquet files generated by
various systems. However, we haven't refactored the write path to follow
the spec yet; that is a task for 1.6.
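
For a quick sanity check of the read path, you can read the file you just
wrote back with Spark 1.5 and look at the logical schema; despite the
bag/array layout on disk it should come back as a plain array-of-long
column (sketch, output shown roughly):

// Sketch: the legacy layout written above should still read back as
// an ordinary array<long> column in Spark SQL 1.5.
val readBack = sqlContext.read.parquet("/tmp/testjson_spark")
readBack.printSchema()
// root
//  |-- c1: array (nullable = true)
//  |    |-- element: long (containsNull = true)

HTH.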

Cheng

On Fri, Aug 28, 2015 at 6:53 AM, Jim Green <op...@gmail.com> wrote:

> Hi Team,
>
> Say I have a test.json file: {"c1":[1,2,3]}
> I can create a parquet file like:
> var df = sqlContext.load("/tmp/test.json","json")
> var df_c = df.repartition(1)
> df_c.select("*").save("/tmp/testjson_spark","parquet")
>
> The output parquet file’s schema is like:
> c1:          OPTIONAL F:1
> .bag:        REPEATED F:1
> ..array:     OPTIONAL INT64 R:1 D:3
>
> Is there any way to avoid using “.bag” and instead create the parquet
> file with column type “REPEATED INT64”?
> The expected data type is:
> c1:          REPEATED INT64 R:1 D:1
>
> Thanks!
> --
> Thanks,
> www.openkb.info
> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>