Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/04 01:05:29 UTC

[GitHub] [iceberg] samredai commented on a diff in pull request #4463: Docs: update Spark Write doc for partitioned tables

samredai commented on code in PR #4463:
URL: https://github.com/apache/iceberg/pull/4463#discussion_r864397904


##########
docs/spark/spark-writes.md:
##########
@@ -311,7 +311,11 @@ distribution & sort order to Spark.
 {{< /hint >}}
 
 {{< hint info >}}
-Both global sort (`orderBy`/`sort`) and local sort (`sortWithinPartitions`) work for the requirement.
+Both global sort (sorting all the data in the write) and local sort (sorting the data within a Spark task) can be used to write against partitioned table.

Review Comment:
   ```suggestion
   Both global sort (sorting all the data in the write) and local sort (sorting the data within a Spark task) can be used to write against a partitioned table.
   ```
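
For readers following the suggestion above, a minimal sketch of the two sort modes in the DataFrame API, assuming a source table `another_table` and the docs' `prod.db.sample` target (illustrative, not part of the PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("iceberg-sort-example").getOrCreate()
val data = spark.table("another_table")

// Global sort: a full shuffle totally orders rows across all tasks before the write.
data.orderBy("ts", "category")
  .writeTo("prod.db.sample")
  .append()

// Local sort: each task sorts only the rows it already holds, with no extra shuffle.
data.sortWithinPartitions("ts", "category")
  .writeTo("prod.db.sample")
  .append()
```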



##########
docs/spark/spark-writes.md:
##########
@@ -376,17 +413,17 @@ Explicit registration of the function is necessary because Spark doesn't allow I
 which can be used in query.
 {{< /hint >}}
 
-Here we just registered the bucket function as `iceberg_bucket16`, which can be used in sort clause.
+Here the bucket function is registered as `iceberg_bucket16`, which can be used in sort clause.

Review Comment:
   ```suggestion
   Here the bucket function is registered as `iceberg_bucket16`, which can be used in a sort clause.
   ```
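
To make the registration concrete, a sketch of what the surrounding docs describe, using Iceberg's `IcebergSpark` helper; the `LongType` source column and the bucket width of 16 are assumptions inferred from the function name:

```scala
import org.apache.iceberg.spark.IcebergSpark
import org.apache.spark.sql.types.DataTypes

// Register Iceberg's bucket transform as a Spark UDF named `iceberg_bucket16`,
// hashing LongType values into 16 buckets (type and width assumed).
IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)

// The registered function can then appear in a sort clause.
spark.sql("""
  INSERT INTO prod.db.sample
  SELECT id, data, category, ts FROM another_table
  SORT BY iceberg_bucket16(id)
""")
```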



##########
docs/spark/spark-writes.md:
##########
@@ -326,28 +330,61 @@ USING iceberg
 PARTITIONED BY (days(ts), category)
 ```
 
-To write data to the sample table, your data needs to be sorted by `days(ts), category`.
+#### In Spark SQL

Review Comment:
   ```suggestion
   #### Sort Order Using Spark SQL
   ```



##########
docs/spark/spark-writes.md:
##########
@@ -326,28 +330,61 @@ USING iceberg
 PARTITIONED BY (days(ts), category)
 ```
 
-To write data to the sample table, your data needs to be sorted by `days(ts), category`.
+#### In Spark SQL
 
-If you're inserting data with SQL statement, you can use `ORDER BY` to achieve it, like below:
+To globally sort data based on `ts` and `category`:
 
 ```sql
 INSERT INTO prod.db.sample
 SELECT id, data, category, ts FROM another_table
 ORDER BY ts, category
 ```
 
-If you're inserting data with DataFrame, you can use either `orderBy`/`sort` to trigger global sort, or `sortWithinPartitions`
-to trigger local sort. Local sort for example:
+To locally sort data based on `ts` and `category`:
+
+```sql
+INSERT INTO prod.db.sample
+SELECT id, data, category, ts FROM another_table
+SORT BY ts, category
+```
+
+`SORT BY` clauses can also be used with partition transforms. The [date-and-timestamp-functions](https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#date-and-timestamp-functions) should be used when partition transforms are time related. Truncate related functions such as `substr` should be used when the partition transform is `truncate[W]`. It is required to [define and register UDFs](https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html) when the partition transform is [bucket transform](##Bucket Transform).
+
+```sql
+INSERT INTO prod.db.sample
+SELECT id, data, category, ts FROM another_table
+SORT BY day(ts), category
+```
+
+#### In the Dataframe API

Review Comment:
   ```suggestion
   #### Sort Order Using the Dataframe API
   ```
   This is so the item in the table of contents is more meaningful.
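
The hunk above also notes that `SORT BY` can mirror partition transforms. A hedged sketch of the `truncate[W]` case, assuming a hypothetical table partitioned by `truncate[4]` on `data` (not the schema shown in this PR), where `substr` reproduces the matching prefix:

```scala
// Hypothetical: the table is partitioned by a truncate[4] transform on `data`,
// so sorting by the matching 4-character prefix clusters rows per partition.
spark.sql("""
  INSERT INTO prod.db.sample
  SELECT id, data, category, ts FROM another_table
  SORT BY substr(data, 1, 4), category
""")
```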



##########
docs/spark/spark-writes.md:
##########
@@ -311,7 +311,11 @@ distribution & sort order to Spark.
 {{< /hint >}}
 
 {{< hint info >}}
-Both global sort (`orderBy`/`sort`) and local sort (`sortWithinPartitions`) work for the requirement.
+Both global sort (sorting all the data in the write) and local sort (sorting the data within a Spark task) can be used to write against partitioned table.
+
+In SQL, the [`ORDER BY`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-orderby.html) will achieve global sorting and [`SORT BY`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sortby.html) will achieve local sorting.

Review Comment:
   ```suggestion
   In SQL, [`ORDER BY`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-orderby.html) will achieve global sorting and [`SORT BY`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sortby.html) will achieve local sorting.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

