Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/10/30 21:38:25 UTC

[GitHub] [incubator-iceberg] rdblue commented on issue #585: Fix Iceberg Reader for nested partitions (#575)

rdblue commented on issue #585: Fix Iceberg Reader for nested partitions (#575)
URL: https://github.com/apache/incubator-iceberg/pull/585#issuecomment-548123812
 
 
   I don't think that this fix is correct. It looks like this allows creating a projection schema with a `.` in the name, but that doesn't return the data in the shape it was written.
   
   Say I have records like this: `{"id": 1, "info": {"type": "x", ...} }` and I want to store those records partitioned by `identity(info.type)` and `bucket(id, 16)`. When I read the data, I should get the original record back, with "type" nested within "info". But this approach would instead return `{"id": 1, "info.type": "x", "type": ...}`. Right?
   
   I think the solution is to update the Spark code to use constant readers, like we recently added to Pig.
   
   The current Spark implementation creates a join row of partition data that must be flat, and then combines it with a row that is materialized from the data file. Instead, we need to materialize the entire record, but set the partition values to constants rather than reading them. That's what the ConstantReader in Pig does: rather than reading from the file to get the data, it just returns the known partition value. This fixes the current problem because the constant reader sits at the right place in the value reader tree, so it looks like this:
   
   ```
   StructReader(
     LongReader(pos=0, name="id", decoder=...),
     StructReader(name="info",
       ConstantReader(pos=0, name="type", value="x"),
       StringReader(pos=1, decoder=...),
       ...
     )
   );
   ```
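
To make the tree above concrete, here is a minimal Java sketch of the idea. It assumes a toy model, not Iceberg's actual reader classes: `ValueReader`, `FileValueReader`, `ConstantReader`, `StructReader`, and the second field name `detail` are all hypothetical names chosen for illustration, and the data file is modeled as a plain iterator over the stored values. The point it tries to show is that the constant reader sits inside the nested struct, so the partition value comes back in its original position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these types are illustrative and are not Iceberg's reader classes.
final class ConstantReaderSketch {

  /** Produces one value, optionally consuming stored values from the decoder. */
  interface ValueReader {
    Object read(Iterator<Object> decoder);
  }

  /** Reads the next stored value from the data file (modeled here as an iterator). */
  static final class FileValueReader implements ValueReader {
    @Override
    public Object read(Iterator<Object> decoder) {
      return decoder.next();
    }
  }

  /** Returns a known partition value and never touches the decoder. */
  static final class ConstantReader implements ValueReader {
    private final Object value;

    ConstantReader(Object value) {
      this.value = value;
    }

    @Override
    public Object read(Iterator<Object> decoder) {
      return value;
    }
  }

  /** Reads child fields in order and assembles them into a nested record. */
  static final class StructReader implements ValueReader {
    private final List<String> names = new ArrayList<>();
    private final List<ValueReader> readers = new ArrayList<>();

    StructReader field(String name, ValueReader reader) {
      names.add(name);
      readers.add(reader);
      return this;
    }

    @Override
    public Object read(Iterator<Object> decoder) {
      Map<String, Object> record = new LinkedHashMap<>();
      for (int i = 0; i < names.size(); i++) {
        record.put(names.get(i), readers.get(i).read(decoder));
      }
      return record;
    }
  }

  /** Builds the reader tree from the comment and reads one record. */
  static Map<String, Object> readExample() {
    StructReader info = new StructReader()
        .field("type", new ConstantReader("x"))   // partition value, not read from the file
        .field("detail", new FileValueReader());  // "detail" is a made-up second field
    StructReader root = new StructReader()
        .field("id", new FileValueReader())
        .field("info", info);

    // Assumption: the file stores only the non-partition values, in field order.
    Iterator<Object> decoder = Arrays.<Object>asList(1L, "d").iterator();

    @SuppressWarnings("unchecked")
    Map<String, Object> record = (Map<String, Object>) root.read(decoder);
    return record;
  }

  public static void main(String[] args) {
    System.out.println(readExample()); // prints {id=1, info={type=x, detail=d}}
  }
}
```

Because the constant is supplied where the field lives in the tree, no flat join row is needed and no `info.type` column appears at the top level of the result.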
