Posted to issues@paimon.apache.org by "leaves12138 (via GitHub)" <gi...@apache.org> on 2023/11/09 08:30:32 UTC

[PR] [spark] support spark sort compact procedure for unaware-bucket table [incubator-paimon]

leaves12138 opened a new pull request, #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296

   
   ### Purpose
   
   Support the Spark sort compact procedure with order and zorder sort strategies.
   
   Usage:
   CALL sort_compact('<database_name>.<table_name>', '<order/zorder>', '<column1>[,column2]...', ['<conditions>'])
   
   Example:
   
   set spark.sql.shuffle.partitions=10;
   CALL sort_compact('my_db.Orders1', 'zorder', 'f1,f2',  'f0=0');
   
   conditions: "," means "AND"; ";" means "OR"
   If you want to sort-compact the two partitions date=01 and date=02, you need to write 'date=01;date=02'
   If you want to sort-compact one partition with date=01 and day=01, you need to write 'date=01,day=01'
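   
   For instance, two illustrative calls reusing the table and columns from the example above (the partition values date=01, date=02 and day=01 are assumed to exist):
   
   CALL sort_compact('my_db.Orders1', 'zorder', 'f1,f2', 'date=01;date=02');  -- compacts both partitions
   CALL sort_compact('my_db.Orders1', 'order', 'f1,f2', 'date=01,day=01');    -- compacts the single matching partition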
   
   ### Documentation
   
   To be added.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@paimon.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [spark] Introduce spark compact procedure [incubator-paimon]

Posted by "leaves12138 (via GitHub)" <gi...@apache.org>.
leaves12138 commented on code in PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#discussion_r1393561499


##########
docs/content/engines/spark3.md:
##########
@@ -425,6 +425,29 @@ val query = spark.readStream
 */
 ```
 
+## Spark Procedure
+
+This section introduces all available Spark procedures for Paimon.
+
+<table class="table table-bordered">
+    <thead>
+    <tr>
+      <th class="text-left" style="width: 4%">Procedure Name</th>
+      <th class="text-left" style="width: 4%">Usage</th>
+      <th class="text-left" style="width: 20%">Explaination</th>
+      <th class="text-left" style="width: 4%">Example</th>
+    </tr>
+    </thead>
+    <tbody style="font-size: 12px; ">
+    <tr>
+      <td>compact</td>
+      <td><nobr>CALL compact('&lt;identifier&gt;','&lt;partitions&gt;','&lt;sort_type&gt;','&lt;columns&gt;')</nobr><br>CALL compact(table => '&lt;identifier&gt;' [, partitions => '&lt;partitions&gt;'] [, order_strategy => '&lt;sort_type&gt;'] [, order_by => '&lt;columns&gt;'])</td>
+      <td>identifier: the target table identifier<br><br><nobr>partitions: partition filter<br> "," means "AND"<br>";" means "OR"</nobr><br><br>order_strategy: 'order', 'zorder' or 'none'<br><br><nobr>order_columns: the columns to be sorted</nobr><br><br>If you want to sort-compact the two partitions date=01 and date=02, write 'date=01;date=02'<br><br>If you want to sort-compact one partition with date=01 and day=01, write 'date=01,day=01'</td>
+      <td><nobr>SET spark.sql.shuffle.partitions=10; --set the sort parallelism</nobr> <nobr>CALL paimon.sys.compact('my_db.Orders1','f0=0,f1=1;f0=1,f1=1', 'zorder', 'f1,f2');</nobr><br><nobr>CALL paimon.sys.compact(table => 'T', partitions => 'p=0', order_strategy => 'zorder', order_by => 'a,b')</nobr></td>

Review Comment:
   We just support this style like Iceberg; we don't recommend using it. Users can decide which form they prefer.



##########
docs/content/engines/spark3.md:
##########
@@ -425,6 +425,29 @@ val query = spark.readStream
 */
 ```
 
+## Spark Procedure
+
+This section introduces all available Spark procedures for Paimon.
+
+<table class="table table-bordered">
+    <thead>
+    <tr>
+      <th class="text-left" style="width: 4%">Procedure Name</th>
+      <th class="text-left" style="width: 4%">Usage</th>
+      <th class="text-left" style="width: 20%">Explaination</th>
+      <th class="text-left" style="width: 4%">Example</th>
+    </tr>
+    </thead>
+    <tbody style="font-size: 12px; ">
+    <tr>
+      <td>compact</td>
+      <td><nobr>CALL compact('&lt;identifier&gt;','&lt;partitions&gt;','&lt;sort_type&gt;','&lt;columns&gt;')</nobr><br>CALL compact(table => '&lt;identifier&gt;' [, partitions => '&lt;partitions&gt;'] [, order_strategy => '&lt;sort_type&gt;'] [, order_by => '&lt;columns&gt;'])</td>
+      <td>identifier: the target table identifier<br><br><nobr>partitions: partition filter<br> "," means "AND"<br>";" means "OR"</nobr><br><br>order_strategy: 'order', 'zorder' or 'none'<br><br><nobr>order_columns: the columns to be sorted</nobr><br><br>If you want to sort-compact the two partitions date=01 and date=02, write 'date=01;date=02'<br><br>If you want to sort-compact one partition with date=01 and day=01, write 'date=01,day=01'</td>
+      <td><nobr>SET spark.sql.shuffle.partitions=10; --set the sort parallelism</nobr> <nobr>CALL paimon.sys.compact('my_db.Orders1','f0=0,f1=1;f0=1,f1=1', 'zorder', 'f1,f2');</nobr><br><nobr>CALL paimon.sys.compact(table => 'T', partitions => 'p=0', order_strategy => 'zorder', order_by => 'a,b')</nobr></td>

Review Comment:
   > Here we may need to note that "paimon" is the customized catalog name, or just use sys.xxx under the paimon catalog namespace
   
   Fixed this.
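   
   For context, "paimon" above is just the catalog name configured on the Spark side. A minimal sketch, assuming the standard Paimon Spark catalog class and a placeholder local warehouse path:
   
   spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog
   spark.sql.catalog.paimon.warehouse=file:/tmp/paimon
   
   With this configuration the procedures resolve as paimon.sys.<procedure>; a differently named catalog changes the prefix accordingly.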





Re: [PR] [spark] Introduce spark compact procedure [incubator-paimon]

Posted by "JingsongLi (via GitHub)" <gi...@apache.org>.
JingsongLi commented on PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#issuecomment-1808185917

   CC @Zouxxyy to review again~




Re: [PR] [spark] Introduce spark compact procedure [incubator-paimon]

Posted by "leaves12138 (via GitHub)" <gi...@apache.org>.
leaves12138 commented on code in PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#discussion_r1393565761


##########
paimon-spark/paimon-spark-3.4/src/test/scala/org/apache/paimon/spark/sql/CreateAndDeleteTagProcedureTest.scala:
##########
@@ -68,13 +68,14 @@ class CreateAndDeleteTagProcedureTest extends PaimonSparkTestBase with StreamTes
             stream.processAllAvailable()
             checkAnswer(query(), Row(1, "a") :: Row(2, "b2") :: Nil)
             checkAnswer(
-              spark.sql("CALL create_tag(table => 'test.T', tag => 'test_tag', snapshot => 2)"),
+              spark.sql(
+                "CALL paimon.sys.create_tag(table => 'test.T', tag => 'test_tag', snapshot => 2)"),

Review Comment:
   Tagged in docs





Re: [PR] [spark] Introduce spark sort compact procedure for unaware-bucket table [incubator-paimon]

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#discussion_r1388107790


##########
paimon-spark/paimon-spark-common/src/main/java/org/apache/paimon/spark/procedure/SortCompactProcedure.java:
##########
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.paimon.spark.procedure;
+
+import org.apache.paimon.CoreOptions;
+import org.apache.paimon.annotation.VisibleForTesting;
+import org.apache.paimon.options.Options;
+import org.apache.paimon.spark.SaveMode;
+import org.apache.paimon.spark.commands.WriteIntoPaimonTable;
+import org.apache.paimon.spark.sort.TableSorter;
+import org.apache.paimon.table.AppendOnlyFileStoreTable;
+import org.apache.paimon.table.FileStoreTable;
+import org.apache.paimon.utils.ParameterUtils;
+import org.apache.paimon.utils.Preconditions;
+import org.apache.paimon.utils.StringUtils;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+import javax.annotation.Nullable;
+
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+
+import static org.apache.spark.sql.types.DataTypes.StringType;
+
+/** Sort compact procedure for sort unaware-bucket table. */
+public class SortCompactProcedure extends BaseProcedure {
+
+    private static final ProcedureParameter[] PARAMETERS =
+            new ProcedureParameter[] {
+                ProcedureParameter.required("table", StringType),
+                ProcedureParameter.required("order_type", StringType),
+                ProcedureParameter.required("columns", StringType),
+                ProcedureParameter.optional("conditions", StringType),
+            };
+
+    private static final StructType OUTPUT_TYPE =
+            new StructType(
+                    new StructField[] {
+                        new StructField("result", DataTypes.BooleanType, true, Metadata.empty())
+                    });
+
+    protected SortCompactProcedure(TableCatalog tableCatalog) {
+        super(tableCatalog);
+    }
+
+    @Override
+    public ProcedureParameter[] parameters() {
+        return PARAMETERS;
+    }
+
+    @Override
+    public StructType outputType() {
+        return OUTPUT_TYPE;
+    }
+
+    @Override
+    public InternalRow[] call(InternalRow args) {
+        Preconditions.checkArgument(args.numFields() >= 3);
+        Identifier tableIdent = toIdentifier(args.getString(0), PARAMETERS[0].name());
+        String sortType = args.getString(1);
+        List<String> sortColumns = Arrays.asList(args.getString(2).split(","));
+
+        String partitionFilter = args.isNullAt(3) ? null : toWhere(args.getString(3));
+
+        return modifyPaimonTable(
+                tableIdent,
+                table -> {
+                    Preconditions.checkArgument(table instanceof FileStoreTable);
+                    InternalRow internalRow =
+                            newInternalRow(
+                                    execute(
+                                            (FileStoreTable) table,
+                                            sortType,
+                                            sortColumns,
+                                            partitionFilter));
+                    return new InternalRow[] {internalRow};
+                });
+    }
+
+    @Override
+    public String description() {
+        return "This procedure execute sort compact action on unaware-bucket table.";
+    }
+
+    private boolean execute(
+            FileStoreTable table,
+            String sortType,
+            List<String> sortColumns,
+            @Nullable String filter) {
+        CoreOptions coreOptions = table.store().options();
+
+        if (!(table instanceof AppendOnlyFileStoreTable) || coreOptions.bucket() != -1) {
+            throw new UnsupportedOperationException(
+                    "Spark sort compact only support unaware-bucket append-only table yet.");
+        }
+
+        Dataset<Row> row = spark().read().format("paimon").load(coreOptions.path().getPath());
+        row = StringUtils.isBlank(filter) ? row : row.where(filter);
+        new WriteIntoPaimonTable(
+                        table,
+                        SaveMode.dynamic(),
+                        TableSorter.getSorter(table, sortType, sortColumns).sort(row),
+                        new Options())
+                .run(spark());
+        return true;
+    }
+
+    @VisibleForTesting
+    static String toWhere(String partitions) {
+        if (StringUtils.isBlank(partitions)) {
+            return null;
+        }
+
+        List<Map<String, String>> maps = ParameterUtils.getPartitions(partitions.split(";"));
+
+        return maps.stream()

Review Comment:
   Why use this special format? We can directly pass a WHERE condition string (such as p1 = a and p2 > b) and then let Spark analyze it by itself.
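   
   For example, a sketch of this suggestion with a hypothetical 'condition' parameter standing in for 'conditions' (not the current signature):
   
   CALL compact(table => 'T', order_strategy => 'zorder', order_by => 'a,b', condition => 'p1 = ''a'' AND p2 > ''b''')
   
   Spark would then parse and push down the predicate itself instead of Paimon translating a custom partition-spec format.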





Re: [PR] [spark] Introduce spark compact procedure [incubator-paimon]

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#discussion_r1392119659


##########
paimon-spark/paimon-spark-3.4/src/test/scala/org/apache/paimon/spark/sql/CreateAndDeleteTagProcedureTest.scala:
##########
@@ -68,13 +68,14 @@ class CreateAndDeleteTagProcedureTest extends PaimonSparkTestBase with StreamTes
             stream.processAllAvailable()
             checkAnswer(query(), Row(1, "a") :: Row(2, "b2") :: Nil)
             checkAnswer(
-              spark.sql("CALL create_tag(table => 'test.T', tag => 'test_tag', snapshot => 2)"),
+              spark.sql(
+                "CALL paimon.sys.create_tag(table => 'test.T', tag => 'test_tag', snapshot => 2)"),

Review Comment:
   here we can just use sys.create_tag



##########
docs/content/engines/spark3.md:
##########
@@ -425,6 +425,29 @@ val query = spark.readStream
 */
 ```
 
+## Spark Procedure
+
+This section introduces all available Spark procedures for Paimon.
+
+<table class="table table-bordered">
+    <thead>
+    <tr>
+      <th class="text-left" style="width: 4%">Procedure Name</th>
+      <th class="text-left" style="width: 4%">Usage</th>
+      <th class="text-left" style="width: 20%">Explaination</th>
+      <th class="text-left" style="width: 4%">Example</th>
+    </tr>
+    </thead>
+    <tbody style="font-size: 12px; ">
+    <tr>
+      <td>compact</td>
+      <td><nobr>CALL compact('&lt;identifier&gt;','&lt;partitions&gt;','&lt;sort_type&gt;','&lt;columns&gt;')</nobr><br>CALL compact(table => '&lt;identifier&gt;' [, partitions => '&lt;partitions&gt;'] [, order_strategy => '&lt;sort_type&gt;'] [, order_by => '&lt;columns&gt;'])</td>
+      <td>identifier: the target table identifier<br><br><nobr>partitions: partition filter<br> "," means "AND"<br>";" means "OR"</nobr><br><br>order_strategy: 'order', 'zorder' or 'none'<br><br><nobr>order_columns: the columns to be sorted</nobr><br><br>If you want to sort-compact the two partitions date=01 and date=02, write 'date=01;date=02'<br><br>If you want to sort-compact one partition with date=01 and day=01, write 'date=01,day=01'</td>
+      <td><nobr>SET spark.sql.shuffle.partitions=10; --set the sort parallelism</nobr> <nobr>CALL paimon.sys.compact('my_db.Orders1','f0=0,f1=1;f0=1,f1=1', 'zorder', 'f1,f2');</nobr><br><nobr>CALL paimon.sys.compact(table => 'T', partitions => 'p=0', order_strategy => 'zorder', order_by => 'a,b')</nobr></td>

Review Comment:
   I don't think omitting the parameter names is a good way to use procedures; it's not clear enough. Though Flink currently does not support (parameterName => value)?



##########
docs/content/engines/spark3.md:
##########
@@ -425,6 +425,29 @@ val query = spark.readStream
 */
 ```
 
+## Spark Procedure
+
+This section introduces all available Spark procedures for Paimon.
+
+<table class="table table-bordered">
+    <thead>
+    <tr>
+      <th class="text-left" style="width: 4%">Procedure Name</th>
+      <th class="text-left" style="width: 4%">Usage</th>
+      <th class="text-left" style="width: 20%">Explaination</th>
+      <th class="text-left" style="width: 4%">Example</th>
+    </tr>
+    </thead>
+    <tbody style="font-size: 12px; ">
+    <tr>
+      <td>compact</td>
+      <td><nobr>CALL compact('&lt;identifier&gt;','&lt;partitions&gt;','&lt;sort_type&gt;','&lt;columns&gt;')</nobr><br>CALL compact(table => '&lt;identifier&gt;' [, partitions => '&lt;partitions&gt;'] [, order_strategy => '&lt;sort_type&gt;'] [, order_by => '&lt;columns&gt;'])</td>
+      <td>identifier: the target table identifier<br><br><nobr>partitions: partition filter<br> "," means "AND"<br>";" means "OR"</nobr><br><br>order_strategy: 'order', 'zorder' or 'none'<br><br><nobr>order_columns: the columns to be sorted</nobr><br><br>If you want to sort-compact the two partitions date=01 and date=02, write 'date=01;date=02'<br><br>If you want to sort-compact one partition with date=01 and day=01, write 'date=01,day=01'</td>

Review Comment:
   Need to explain the default behavior of the optional parameters.



##########
docs/content/engines/spark3.md:
##########
@@ -425,6 +425,29 @@ val query = spark.readStream
 */
 ```
 
+## Spark Procedure
+
+This section introduces all available Spark procedures for Paimon.
+
+<table class="table table-bordered">
+    <thead>
+    <tr>
+      <th class="text-left" style="width: 4%">Procedure Name</th>
+      <th class="text-left" style="width: 4%">Usage</th>
+      <th class="text-left" style="width: 20%">Explaination</th>
+      <th class="text-left" style="width: 4%">Example</th>
+    </tr>
+    </thead>
+    <tbody style="font-size: 12px; ">
+    <tr>
+      <td>compact</td>
+      <td><nobr>CALL compact('&lt;identifier&gt;','&lt;partitions&gt;','&lt;sort_type&gt;','&lt;columns&gt;')</nobr><br>CALL compact(table => '&lt;identifier&gt;' [, partitions => '&lt;partitions&gt;'] [, order_strategy => '&lt;sort_type&gt;'] [, order_by => '&lt;columns&gt;'])</td>
+      <td>identifier: the target table identifier<br><br><nobr>partitions: partition filter<br> "," means "AND"<br>";" means "OR"</nobr><br><br>order_strategy: 'order', 'zorder' or 'none'<br><br><nobr>order_columns: the columns to be sorted</nobr><br><br>If you want to sort-compact the two partitions date=01 and date=02, write 'date=01;date=02'<br><br>If you want to sort-compact one partition with date=01 and day=01, write 'date=01,day=01'</td>
+      <td><nobr>SET spark.sql.shuffle.partitions=10; --set the sort parallelism</nobr> <nobr>CALL paimon.sys.compact('my_db.Orders1','f0=0,f1=1;f0=1,f1=1', 'zorder', 'f1,f2');</nobr><br><nobr>CALL paimon.sys.compact(table => 'T', partitions => 'p=0', order_strategy => 'zorder', order_by => 'a,b')</nobr></td>

Review Comment:
   Here we may need to note that "paimon" is the customized catalog name, or just use sys.xxx under the paimon catalog namespace.





Re: [PR] [spark] Introduce spark compact procedure [incubator-paimon]

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#discussion_r1392125881


##########
docs/content/engines/spark3.md:
##########
@@ -425,6 +425,29 @@ val query = spark.readStream
 */
 ```
 
+## Spark Procedure
+
+This section introduces all available Spark procedures for Paimon.
+
+<table class="table table-bordered">
+    <thead>
+    <tr>
+      <th class="text-left" style="width: 4%">Procedure Name</th>
+      <th class="text-left" style="width: 4%">Usage</th>
+      <th class="text-left" style="width: 20%">Explaination</th>
+      <th class="text-left" style="width: 4%">Example</th>
+    </tr>
+    </thead>
+    <tbody style="font-size: 12px; ">
+    <tr>
+      <td>compact</td>
+      <td><nobr>CALL compact('&lt;identifier&gt;','&lt;partitions&gt;','&lt;sort_type&gt;','&lt;columns&gt;')</nobr><br>CALL compact(table => '&lt;identifier&gt;' [, partitions => '&lt;partitions&gt;'] [, order_strategy => '&lt;sort_type&gt;'] [, order_by => '&lt;columns&gt;'])</td>
+      <td>identifier: the target table identifier<br><br><nobr>partitions: partition filter<br> "," means "AND"<br>";" means "OR"</nobr><br><br>order_strategy: 'order', 'zorder' or 'none'<br><br><nobr>order_columns: the columns to be sorted</nobr><br><br>If you want to sort-compact the two partitions date=01 and date=02, write 'date=01;date=02'<br><br>If you want to sort-compact one partition with date=01 and day=01, write 'date=01,day=01'</td>
+      <td><nobr>SET spark.sql.shuffle.partitions=10; --set the sort parallelism</nobr> <nobr>CALL paimon.sys.compact('my_db.Orders1','f0=0,f1=1;f0=1,f1=1', 'zorder', 'f1,f2');</nobr><br><nobr>CALL paimon.sys.compact(table => 'T', partitions => 'p=0', order_strategy => 'zorder', order_by => 'a,b')</nobr></td>

Review Comment:
   I don't think omitting the parameter names is a good way to use procedures; it's not clear enough and does not support order swapping. Though Flink currently does not support (parameterName => value)?
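   
   For illustration, named arguments are self-describing and order-independent:
   
   CALL paimon.sys.compact(order_strategy => 'zorder', order_by => 'a,b', table => 'T')
   
   whereas the positional form CALL paimon.sys.compact('T', 'p=0', 'zorder', 'a,b') only works in the declared parameter order.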





Re: [PR] [spark] Introduce spark compact procedure [incubator-paimon]

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#discussion_r1392118326


##########
docs/content/engines/spark3.md:
##########
@@ -425,6 +425,29 @@ val query = spark.readStream
 */
 ```
 
+## Spark Procedure
+
+This section introduces all available Spark procedures for Paimon.
+
+<table class="table table-bordered">
+    <thead>
+    <tr>
+      <th class="text-left" style="width: 4%">Procedure Name</th>
+      <th class="text-left" style="width: 4%">Usage</th>
+      <th class="text-left" style="width: 20%">Explaination</th>
+      <th class="text-left" style="width: 4%">Example</th>
+    </tr>
+    </thead>
+    <tbody style="font-size: 12px; ">
+    <tr>
+      <td>compact</td>
+      <td><nobr>CALL compact('&lt;identifier&gt;','&lt;partitions&gt;','&lt;sort_type&gt;','&lt;columns&gt;')</nobr><br>CALL compact(table => '&lt;identifier&gt;' [, partitions => '&lt;partitions&gt;'] [, order_strategy => '&lt;sort_type&gt;'] [, order_by => '&lt;columns&gt;'])</td>
+      <td>identifier: the target table identifier<br><br><nobr>partitions: partition filter<br> "," means "AND"<br>";" means "OR"</nobr><br><br>order_strategy: 'order', 'zorder' or 'none'<br><br><nobr>order_columns: the columns to be sorted</nobr><br><br>If you want to sort-compact the two partitions date=01 and date=02, write 'date=01;date=02'<br><br>If you want to sort-compact one partition with date=01 and day=01, write 'date=01,day=01'</td>
+      <td><nobr>SET spark.sql.shuffle.partitions=10; --set the sort parallelism</nobr> <nobr>CALL paimon.sys.compact('my_db.Orders1','f0=0,f1=1;f0=1,f1=1', 'zorder', 'f1,f2');</nobr><br><nobr>CALL paimon.sys.compact(table => 'T', partitions => 'p=0', order_strategy => 'zorder', order_by => 'a,b')</nobr></td>

Review Comment:
   Here we may need to note that "paimon" is the customized catalog name, or just use sys.xxx under the paimon catalog namespace.








Re: [PR] [spark] Introduce spark sort compact procedure for unaware-bucket table [incubator-paimon]

Posted by "leaves12138 (via GitHub)" <gi...@apache.org>.
leaves12138 commented on code in PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#discussion_r1388837880


##########
paimon-spark/paimon-spark-common/src/main/java/org/apache/paimon/spark/procedure/SortCompactProcedure.java:
##########
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.paimon.spark.procedure;
+
+import org.apache.paimon.CoreOptions;
+import org.apache.paimon.annotation.VisibleForTesting;
+import org.apache.paimon.options.Options;
+import org.apache.paimon.spark.SaveMode;
+import org.apache.paimon.spark.commands.WriteIntoPaimonTable;
+import org.apache.paimon.spark.sort.TableSorter;
+import org.apache.paimon.table.AppendOnlyFileStoreTable;
+import org.apache.paimon.table.FileStoreTable;
+import org.apache.paimon.utils.ParameterUtils;
+import org.apache.paimon.utils.Preconditions;
+import org.apache.paimon.utils.StringUtils;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+import javax.annotation.Nullable;
+
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+
+import static org.apache.spark.sql.types.DataTypes.StringType;
+
+/** Sort compact procedure for sort unaware-bucket table. */
+public class SortCompactProcedure extends BaseProcedure {
+
+    private static final ProcedureParameter[] PARAMETERS =
+            new ProcedureParameter[] {
+                ProcedureParameter.required("table", StringType),
+                ProcedureParameter.required("order_type", StringType),
+                ProcedureParameter.required("columns", StringType),
+                ProcedureParameter.optional("conditions", StringType),
+            };
+
+    private static final StructType OUTPUT_TYPE =
+            new StructType(
+                    new StructField[] {
+                        new StructField("result", DataTypes.BooleanType, true, Metadata.empty())
+                    });
+
+    protected SortCompactProcedure(TableCatalog tableCatalog) {
+        super(tableCatalog);
+    }
+
+    @Override
+    public ProcedureParameter[] parameters() {
+        return PARAMETERS;
+    }
+
+    @Override
+    public StructType outputType() {
+        return OUTPUT_TYPE;
+    }
+
+    @Override
+    public InternalRow[] call(InternalRow args) {
+        Preconditions.checkArgument(args.numFields() >= 3);
+        Identifier tableIdent = toIdentifier(args.getString(0), PARAMETERS[0].name());
+        String sortType = args.getString(1);
+        List<String> sortColumns = Arrays.asList(args.getString(2).split(","));
+
+        String partitionFilter = args.isNullAt(3) ? null : toWhere(args.getString(3));
+
+        return modifyPaimonTable(
+                tableIdent,
+                table -> {
+                    Preconditions.checkArgument(table instanceof FileStoreTable);
+                    InternalRow internalRow =
+                            newInternalRow(
+                                    execute(
+                                            (FileStoreTable) table,
+                                            sortType,
+                                            sortColumns,
+                                            partitionFilter));
+                    return new InternalRow[] {internalRow};
+                });
+    }
+
+    @Override
+    public String description() {
+        return "This procedure execute sort compact action on unaware-bucket table.";
+    }
+
+    private boolean execute(
+            FileStoreTable table,
+            String sortType,
+            List<String> sortColumns,
+            @Nullable String filter) {
+        CoreOptions coreOptions = table.store().options();
+
+        if (!(table instanceof AppendOnlyFileStoreTable) || coreOptions.bucket() != -1) {
+            throw new UnsupportedOperationException(
+                    "Spark sort compact only support unaware-bucket append-only table yet.");
+        }
+
+        Dataset<Row> row = spark().read().format("paimon").load(coreOptions.path().getPath());
+        row = StringUtils.isBlank(filter) ? row : row.where(filter);
+        new WriteIntoPaimonTable(
+                        table,
+                        SaveMode.dynamic(),
+                        TableSorter.getSorter(table, sortType, sortColumns).sort(row),
+                        new Options())
+                .run(spark());
+        return true;
+    }
+
+    @VisibleForTesting
+    static String toWhere(String partitions) {
+        if (StringUtils.isBlank(partitions)) {
+            return null;
+        }
+
+        List<Map<String, String>> maps = ParameterUtils.getPartitions(partitions.split(";"));
+
+        return maps.stream()

Review Comment:
   For compatibility with the Flink action, but Iceberg does it the way you said.
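   
   Roughly, the Flink-style partition spec stays the user-facing input and is expanded into a predicate internally, e.g. (a sketch of the intended mapping; the exact quoting depends on ParameterUtils):
   
   'date=01;date=02'  ->  (date=01) OR (date=02)
   'date=01,day=01'   ->  (date=01 AND day=01)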





Re: [PR] [spark] Introduce spark compact procedure [incubator-paimon]

Posted by "leaves12138 (via GitHub)" <gi...@apache.org>.
leaves12138 commented on code in PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#discussion_r1393562178


##########
docs/content/engines/spark3.md:
##########
@@ -425,6 +425,29 @@ val query = spark.readStream
 */
 ```
 
+## Spark Procedure
+
+This section introduces all available Spark procedures for Paimon.
+
+<table class="table table-bordered">
+    <thead>
+    <tr>
+      <th class="text-left" style="width: 4%">Procedure Name</th>
+      <th class="text-left" style="width: 4%">Usage</th>
+      <th class="text-left" style="width: 20%">Explaination</th>
+      <th class="text-left" style="width: 4%">Example</th>
+    </tr>
+    </thead>
+    <tbody style="font-size: 12px; ">
+    <tr>
+      <td>compact</td>
+      <td><nobr>CALL compact('&lt;identifier&gt;','&lt;partitions&gt;','&lt;sort_type&gt;','&lt;columns&gt;')</nobr><br>CALL compact(table => '&lt;identifier&gt;' [, partitions => '&lt;partitions&gt;'] [, order_strategy => '&lt;sort_type&gt;'] [, order_by => '&lt;columns&gt;'])</td>
+      <td>identifier: the target table identifier<br><br><nobr>partitions: partition filter<br> "," means "AND"<br>";" means "OR"</nobr><br><br>order_strategy: 'order', 'zorder' or 'none'<br><br><nobr>order_columns: the columns to be sorted</nobr><br><br>If you want to sort-compact the two partitions date=01 and date=02, write 'date=01;date=02'<br><br>If you want to sort-compact one partition with date=01 and day=01, write 'date=01,day=01'</td>

Review Comment:
   Fixed this.







Re: [PR] [spark] Introduce spark compact procedure [incubator-paimon]

Posted by "yuzelin (via GitHub)" <gi...@apache.org>.
yuzelin commented on code in PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#discussion_r1389014029


##########
docs/content/engines/spark3.md:
##########
@@ -425,6 +425,29 @@ val query = spark.readStream
 */
 ```
 
+## Spark Procedure
+
+This section introduces all available Spark procedures for Paimon.
+
+<table class="table table-bordered">
+    <thead>
+    <tr>
+      <th class="text-left" style="width: 4%">Procedure Name</th>
+      <th class="text-left" style="width: 4%">Usage</th>
+      <th class="text-left" style="width: 20%">Explaination</th>
+      <th class="text-left" style="width: 4%">Example</th>
+    </tr>
+    </thead>
+    <tbody style="font-size: 12px; ">
+    <tr>
+      <td>sort_compact</td>

Review Comment:
   -> compact









Re: [PR] [spark] Introduce spark sort compact procedure for unaware-bucket table [incubator-paimon]

Posted by "leaves12138 (via GitHub)" <gi...@apache.org>.
leaves12138 commented on code in PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#discussion_r1388837045


##########
paimon-spark/paimon-spark-common/src/main/java/org/apache/paimon/spark/procedure/SortCompactProcedure.java:
##########
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.paimon.spark.procedure;
+
+import org.apache.paimon.CoreOptions;
+import org.apache.paimon.annotation.VisibleForTesting;
+import org.apache.paimon.options.Options;
+import org.apache.paimon.spark.SaveMode;
+import org.apache.paimon.spark.commands.WriteIntoPaimonTable;
+import org.apache.paimon.spark.sort.TableSorter;
+import org.apache.paimon.table.AppendOnlyFileStoreTable;
+import org.apache.paimon.table.FileStoreTable;
+import org.apache.paimon.utils.ParameterUtils;
+import org.apache.paimon.utils.Preconditions;
+import org.apache.paimon.utils.StringUtils;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+import javax.annotation.Nullable;
+
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+
+import static org.apache.spark.sql.types.DataTypes.StringType;
+
+/** Sort compact procedure for sort unaware-bucket table. */
+public class SortCompactProcedure extends BaseProcedure {

Review Comment:
   Renamed this to CompactProcedure and made it support all compaction; for now, sorting only works on unaware-bucket tables.
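
   With the rename, plain compaction and sorted compaction share one entry point. A hedged usage sketch (the `sys` namespace and the parameter names are assumptions based on the signature in this diff, not the merged API):

       // Hypothetical invocation through an existing SparkSession `spark`.
       spark.sql("CALL sys.compact(table => 'my_db.T', order_type => 'zorder', columns => 'f1,f2')");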





Re: [PR] [spark] Introduce spark sort compact procedure for unaware-bucket table [incubator-paimon]

Posted by "leaves12138 (via GitHub)" <gi...@apache.org>.
leaves12138 closed pull request #2296: [spark] Introduce spark sort compact procedure for unaware-bucket table
URL: https://github.com/apache/incubator-paimon/pull/2296




Re: [PR] [spark] Introduce spark sort compact procedure for unaware-bucket table [incubator-paimon]

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#discussion_r1388128061


##########
paimon-spark/paimon-spark-common/src/main/java/org/apache/paimon/spark/procedure/SortCompactProcedure.java:
##########
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.paimon.spark.procedure;
+
+import org.apache.paimon.CoreOptions;
+import org.apache.paimon.annotation.VisibleForTesting;
+import org.apache.paimon.options.Options;
+import org.apache.paimon.spark.SaveMode;
+import org.apache.paimon.spark.commands.WriteIntoPaimonTable;
+import org.apache.paimon.spark.sort.TableSorter;
+import org.apache.paimon.table.AppendOnlyFileStoreTable;
+import org.apache.paimon.table.FileStoreTable;
+import org.apache.paimon.utils.ParameterUtils;
+import org.apache.paimon.utils.Preconditions;
+import org.apache.paimon.utils.StringUtils;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+import javax.annotation.Nullable;
+
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+
+import static org.apache.spark.sql.types.DataTypes.StringType;
+
+/** Sort compact procedure for sort unaware-bucket table. */
+public class SortCompactProcedure extends BaseProcedure {
+
+    private static final ProcedureParameter[] PARAMETERS =
+            new ProcedureParameter[] {
+                ProcedureParameter.required("table", StringType),
+                ProcedureParameter.required("order_type", StringType),
+                ProcedureParameter.required("columns", StringType),
+                ProcedureParameter.optional("conditions", StringType),
+            };
+
+    private static final StructType OUTPUT_TYPE =
+            new StructType(
+                    new StructField[] {
+                        new StructField("result", DataTypes.BooleanType, true, Metadata.empty())
+                    });
+
+    protected SortCompactProcedure(TableCatalog tableCatalog) {
+        super(tableCatalog);
+    }
+
+    @Override
+    public ProcedureParameter[] parameters() {
+        return PARAMETERS;
+    }
+
+    @Override
+    public StructType outputType() {
+        return OUTPUT_TYPE;
+    }
+
+    @Override
+    public InternalRow[] call(InternalRow args) {
+        Preconditions.checkArgument(args.numFields() >= 3);
+        Identifier tableIdent = toIdentifier(args.getString(0), PARAMETERS[0].name());
+        String sortType = args.getString(1);
+        List<String> sortColumns = Arrays.asList(args.getString(2).split(","));
+
+        String partitionFilter = args.isNullAt(3) ? null : toWhere(args.getString(3));
+
+        return modifyPaimonTable(
+                tableIdent,
+                table -> {
+                    Preconditions.checkArgument(table instanceof FileStoreTable);
+                    InternalRow internalRow =
+                            newInternalRow(
+                                    execute(
+                                            (FileStoreTable) table,
+                                            sortType,
+                                            sortColumns,
+                                            partitionFilter));
+                    return new InternalRow[] {internalRow};
+                });
+    }
+
+    @Override
+    public String description() {
+        return "This procedure execute sort compact action on unaware-bucket table.";
+    }
+
+    private boolean execute(
+            FileStoreTable table,
+            String sortType,
+            List<String> sortColumns,
+            @Nullable String filter) {
+        CoreOptions coreOptions = table.store().options();
+
+        if (!(table instanceof AppendOnlyFileStoreTable) || coreOptions.bucket() != -1) {
+            throw new UnsupportedOperationException(
+                    "Spark sort compact only support unaware-bucket append-only table yet.");
+        }
+
+        Dataset<Row> row = spark().read().format("paimon").load(coreOptions.path().getPath());
+        row = StringUtils.isBlank(filter) ? row : row.where(filter);
+        new WriteIntoPaimonTable(
+                        table,
+                        SaveMode.dynamic(),

Review Comment:
   DynamicOverWrite$.MODULE$
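
   For context: the suggestion reads as it does because a Scala case object is reached from Java through its generated companion class. A minimal sketch, assuming a Scala `case object DynamicOverWrite extends SaveMode`:

       // Java side: the case object's singleton instance lives in the
       // companion's static MODULE$ field, so no factory method is needed.
       SaveMode mode = DynamicOverWrite$.MODULE$;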





Re: [PR] [spark] Introduce spark sort compact procedure for unaware-bucket table [incubator-paimon]

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#discussion_r1388084015


##########
paimon-spark/paimon-spark-common/src/main/java/org/apache/paimon/spark/procedure/SortCompactProcedure.java:
##########
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.paimon.spark.procedure;
+
+import org.apache.paimon.CoreOptions;
+import org.apache.paimon.annotation.VisibleForTesting;
+import org.apache.paimon.options.Options;
+import org.apache.paimon.spark.SaveMode;
+import org.apache.paimon.spark.commands.WriteIntoPaimonTable;
+import org.apache.paimon.spark.sort.TableSorter;
+import org.apache.paimon.table.AppendOnlyFileStoreTable;
+import org.apache.paimon.table.FileStoreTable;
+import org.apache.paimon.utils.ParameterUtils;
+import org.apache.paimon.utils.Preconditions;
+import org.apache.paimon.utils.StringUtils;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+import javax.annotation.Nullable;
+
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+
+import static org.apache.spark.sql.types.DataTypes.StringType;
+
+/** Sort compact procedure for sort unaware-bucket table. */
+public class SortCompactProcedure extends BaseProcedure {

Review Comment:
   Why is there a separate procedure called sort_compact, instead of a single compact procedure that takes something like sort_type = NONE/ORDER/ZORDER?





Re: [PR] [spark] Introduce spark compact procedure [incubator-paimon]

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296#issuecomment-1813797033

   +1




Re: [PR] [spark] Introduce spark compact procedure [incubator-paimon]

Posted by "JingsongLi (via GitHub)" <gi...@apache.org>.
JingsongLi merged PR #2296:
URL: https://github.com/apache/incubator-paimon/pull/2296

