Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/10/29 15:31:27 UTC
[GitHub] [incubator-iceberg] andrei-ionescu commented on issue #575: After writing Iceberg dataset with nested partitions it cannot be read anymore
URL: https://github.com/apache/incubator-iceberg/issues/575#issuecomment-547481132
The steps to reproduce are:
1. Have a schema and its JSON data
2. Add a simple (top-level) partition and a nested partition
3. Write the data in `iceberg` format using the Iceberg data frame writer
4. Read the data back into a new Iceberg dataset
The last step fails with the error above.
Here is some code illustrating the issue (extracted from the [gist above](https://gist.github.com/andrei-ionescu/b3e5f5345df3166af7562a830d3dc57d)).
```java
// Reproduction: create a table partitioned by a top-level column and a
// nested column, write with the Iceberg writer, then read it back.
Schema nestedSchema = new Schema(
    optional(1, "id", Types.IntegerType.get()),
    optional(2, "data", Types.StringType.get()),
    optional(3, "nestedData", Types.StructType.of(
        optional(4, "id", Types.IntegerType.get()),
        optional(5, "moreData", Types.StringType.get())))
);

File parent = temp.newFolder("parquet");
File location = new File(parent, "test");

HadoopTables tables = new HadoopTables(new Configuration());

// Identity partitions: one on a top-level column, one on a nested column.
PartitionSpec spec = PartitionSpec.builderFor(nestedSchema)
    .identity("id")
    .identity("nestedData.moreData")
    .build();

Table table = tables.create(nestedSchema, spec, location.toString());

List<String> jsons = Lists.newArrayList(
    "{ \"id\": 1, \"data\": \"a\", \"nestedData\": { \"id\": 100, \"moreData\": \"p1\"} }",
    "{ \"id\": 2, \"data\": \"b\", \"nestedData\": { \"id\": 200, \"moreData\": \"p1\"} }",
    "{ \"id\": 3, \"data\": \"c\", \"nestedData\": { \"id\": 300, \"moreData\": \"p2\"} }",
    "{ \"id\": 4, \"data\": \"d\", \"nestedData\": { \"id\": 400, \"moreData\": \"p2\"} }"
);

Dataset<Row> df = spark
    .read()
    .schema(SparkSchemaUtil.convert(nestedSchema))
    .json(spark.createDataset(jsons, Encoders.STRING()));

// TODO: incoming columns must be ordered according to the table's schema
df.select("id", "data", "nestedData").write()
    .format("iceberg")
    .mode("append")
    .save(location.toString());

table.refresh();

// This read fails with the error reported above.
Dataset<Row> result = spark.read()
    .format("iceberg")
    .load(location.toString());
```
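For context, an identity partition on a nested column means the writer has to resolve the dot-separated source path (here `nestedData.moreData`) against each incoming record to obtain the partition value. The following is only an illustrative sketch of that resolution step using plain maps, not Iceberg's actual implementation; the class and method names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only (not Iceberg's code): walk a dot-separated field
// path such as "nestedData.moreData" down through nested struct-like
// records, which is what an identity partition on a nested column must
// effectively do at write time.
public class NestedFieldResolver {
    static Object resolve(Map<String, Object> record, String path) {
        Object current = record;
        for (String part : path.split("\\.")) {
            if (!(current instanceof Map)) {
                throw new IllegalArgumentException(
                    "Cannot descend into non-struct value at: " + part);
            }
            current = ((Map<?, ?>) current).get(part);
        }
        return current;
    }

    public static void main(String[] args) {
        // Mirror the first JSON record from the reproduction above.
        Map<String, Object> nested = new HashMap<>();
        nested.put("id", 100);
        nested.put("moreData", "p1");
        Map<String, Object> record = new HashMap<>();
        record.put("id", 1);
        record.put("data", "a");
        record.put("nestedData", nested);

        System.out.println(resolve(record, "nestedData.moreData")); // prints p1
    }
}
```

If either the writer or the reader fails to apply this kind of path resolution consistently for nested partition sources, the round trip in the reproduction breaks.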