Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/10/29 15:31:27 UTC

[GitHub] [incubator-iceberg] andrei-ionescu commented on issue #575: After writing Iceberg dataset with nested partitions it cannot be read anymore

URL: https://github.com/apache/incubator-iceberg/issues/575#issuecomment-547481132
 
 
   The steps are:
   
   1. Define a schema and its JSON data
   2. Add an identity partition on a top-level field and one on a nested field
   3. Write the data in `iceberg` format using the Iceberg DataFrame writer
   4. Read the data back as an Iceberg dataset
   
   The last step fails with the error above. 
   
   Here is some code that reproduces the issue (extracted from the [gist above](https://gist.github.com/andrei-ionescu/b3e5f5345df3166af7562a830d3dc57d)).
   
   ```java
   // Imports for the snippet below; `temp` is a JUnit TemporaryFolder and
   // `spark` is an existing SparkSession, both set up outside this snippet.
   import java.io.File;
   import java.util.List;

   import com.google.common.collect.Lists;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.hadoop.HadoopTables;
   import org.apache.iceberg.spark.SparkSchemaUtil;
   import org.apache.iceberg.types.Types;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Encoders;
   import org.apache.spark.sql.Row;

   import static org.apache.iceberg.types.Types.NestedField.optional;

   Schema nestedSchema = new Schema(
       optional(1, "id", Types.IntegerType.get()),
       optional(2, "data", Types.StringType.get()),
       optional(3, "nestedData", Types.StructType.of(
           optional(4, "id", Types.IntegerType.get()),
           optional(5, "moreData", Types.StringType.get())))
   );
   
   File parent = temp.newFolder("parquet");
   File location = new File(parent, "test");
   
   // Create the table with a partition spec that uses both a top-level column ("id")
   // and a nested field ("nestedData.moreData")
   HadoopTables tables = new HadoopTables(new Configuration());
   PartitionSpec spec = PartitionSpec.builderFor(nestedSchema)
       .identity("id")
       .identity("nestedData.moreData")
       .build();
   Table table = tables.create(nestedSchema, spec, location.toString());
   
   List<String> jsons = Lists.newArrayList(
       "{ \"id\": 1, \"data\": \"a\", \"nestedData\": { \"id\": 100, \"moreData\": \"p1\"} }",
       "{ \"id\": 2, \"data\": \"b\", \"nestedData\": { \"id\": 200, \"moreData\": \"p1\"} }",
       "{ \"id\": 3, \"data\": \"c\", \"nestedData\": { \"id\": 300, \"moreData\": \"p2\"} }",
       "{ \"id\": 4, \"data\": \"d\", \"nestedData\": { \"id\": 400, \"moreData\": \"p2\"} }"
   );
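   // Build a DataFrame from the JSON rows using the Spark schema converted
   // from the Iceberg schema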
   Dataset<Row> df = spark
       .read()
       .schema(SparkSchemaUtil.convert(nestedSchema))
       .json(spark.createDataset(jsons, Encoders.STRING()));
   
   // TODO: incoming columns must be ordered according to the table's schema
   df.select("id", "data", "nestedData").write()
       .format("iceberg")
       .mode("append")
       .save(location.toString());
   
   table.refresh();
   
   // This read is the step that fails once the table has a nested partition field
   Dataset<Row> result = spark.read()
       .format("iceberg")
       .load(location.toString());
   ```
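   
   For reference, here is a minimal sketch (not part of the original gist; the class and
   method names are made up, and it assumes the `table` handle created above) that prints
   the partition spec and the partition tuple recorded for each data file, which can help
   confirm what the write step actually produced before attempting the failing read:
   
   ```java
   import java.io.IOException;

   import org.apache.iceberg.FileScanTask;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.io.CloseableIterable;

   public class PartitionDebug {
     // Print the table's partition spec and the partition values recorded for each data file
     static void printPartitions(Table table) throws IOException {
       System.out.println("spec: " + table.spec());
       try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
         for (FileScanTask task : tasks) {
           System.out.println(task.file().path() + " -> " + task.file().partition());
         }
       }
     }
   }
   ```
   
   With the sample rows above, each data file's partition tuple should carry the row's
   `id` value and its `nestedData.moreData` value.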
   
