Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/17 07:45:52 UTC

[GitHub] [iceberg] huadongliu opened a new issue #2490: Table SortOrder not being respected in Spark write

huadongliu opened a new issue #2490:
URL: https://github.com/apache/iceberg/issues/2490


   The table is created with `SortOrder.builderFor(schema).withOrderId(1).asc("userId", NULLS_LAST).build()` and populated with `df.write().format("iceberg").mode("append").save(tableLocation)`. The table has the schema and partition spec below. I am trying to speed up joins and queries on `userId` by sorting the rows by `userId` within the Parquet files of each partition.
   
   ```
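   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.types.Types;
   import static org.apache.iceberg.types.Types.NestedField.optional;
   
   // Table schema: all three columns are optional (nullable).
   Schema schema = new Schema(
           optional(1, "userId", Types.StringType.get()),
           optional(2, "eventTime", Types.TimestampType.withZone()),
           optional(3, "count", Types.IntegerType.get())
   );
   
   // Partition by day of eventTime, then by the first 2 characters of userId.
   PartitionSpec spec = PartitionSpec.builderFor(schema)
           .day("eventTime")
           .truncate("userId", 2)
           .build();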
   ```
   
   `userId` is not sorted in the Parquet data files. Is the Spark Iceberg write supposed to respect the table's SortOrder? `df.sortWithinPartitions("userId").write().format("iceberg").mode("append").save(tableLocation)` seems to be a workaround.
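   
   To make the intended result concrete, here is a minimal plain-Java sketch of "sort within each partition, ascending, nulls last". It does not use Spark or Iceberg: the class name, method name, and sample values are hypothetical, and each inner list only stands in for the rows that one write task handles.
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.Comparator;
   import java.util.List;
   
   public class SortWithinPartitionsSketch {
   
       // Ascending order with nulls last, mirroring asc("userId", NULLS_LAST).
       static final Comparator<String> USER_ID_ORDER =
               Comparator.nullsLast(Comparator.<String>naturalOrder());
   
       // Sorts each partition independently; rows never move between partitions.
       static List<List<String>> sortWithinPartitions(List<List<String>> partitions) {
           List<List<String>> result = new ArrayList<>();
           for (List<String> partition : partitions) {
               List<String> sorted = new ArrayList<>(partition); // copy, input is left untouched
               sorted.sort(USER_ID_ORDER);
               result.add(sorted);
           }
           return result;
       }
   
       public static void main(String[] args) {
           // Hypothetical userId values, grouped as two partitions.
           List<List<String>> partitions = Arrays.asList(
                   Arrays.asList("ab9", null, "ab1"),
                   Arrays.asList("cd2", "cd1"));
           System.out.println(sortWithinPartitions(partitions));
       }
   }
   ```
   
   One caveat on the workaround, as far as I understand Spark's defaults: an ascending sort there places nulls first, so matching the table's NULLS_LAST order would probably need `df.sortWithinPartitions(col("userId").asc_nulls_last())` rather than the plain column name.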


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shay1bz commented on issue #2490: Table SortOrder not being respected in Spark write

Posted by GitBox <gi...@apache.org>.
shay1bz commented on issue #2490:
URL: https://github.com/apache/iceberg/issues/2490#issuecomment-1002157812


   Thanks @RussellSpitzer. I'm using 0.12.1, so I did not see this addition. Do you have any idea when the next release will be? I would prefer to use a release package rather than a custom build.




[GitHub] [iceberg] yyanyy commented on issue #2490: Table SortOrder not being respected in Spark write

Posted by GitBox <gi...@apache.org>.
yyanyy commented on issue #2490:
URL: https://github.com/apache/iceberg/issues/2490#issuecomment-829728213


   I think sort order support is currently implemented only at the API level, and no engine integration has been completed yet. I don't think the situation has changed much since the email thread here, in case you need more information: https://lists.apache.org/thread.html/r4d41a78b722af230738e597203d42e524f5f09738fcf041ba78f613a%40%3Cdev.iceberg.apache.org%3E




[GitHub] [iceberg] RussellSpitzer commented on issue #2490: Table SortOrder not being respected in Spark write

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2490:
URL: https://github.com/apache/iceberg/issues/2490#issuecomment-1002139816


   > Any update on this? Would a PR for Spark rewrite action be any good or are you looking for a more holistic solution?
   
   The Spark rewrite action already uses SortOrder. The SparkWrite code needed the distribution and ordering changes in Spark 3.2; I think the current snapshot already has this implemented, so it should be done.




[GitHub] [iceberg] shay1bz commented on issue #2490: Table SortOrder not being respected in Spark write

Posted by GitBox <gi...@apache.org>.
shay1bz commented on issue #2490:
URL: https://github.com/apache/iceberg/issues/2490#issuecomment-1001974725


   Any update on this? 
   Would a PR for Spark rewrite action be any good or are you looking for a more holistic solution?




[GitHub] [iceberg] RussellSpitzer edited a comment on issue #2490: Table SortOrder not being respected in Spark write

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #2490:
URL: https://github.com/apache/iceberg/issues/2490#issuecomment-1002139816


   > Any update on this? Would a PR for Spark rewrite action be any good or are you looking for a more holistic solution?
   
   The Spark rewrite action already uses SortOrder. The SparkWrite code needed the distribution and ordering changes in Spark 3.2; I think the current snapshot already has this implemented, so it should be done.
   
   https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L132-L140
   
   https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkWriteBuilder.java#L140-L156

