Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/28 12:58:35 UTC

[GitHub] [iceberg] brysd opened a new issue, #6290: Spark SQL support for partition spec?

brysd opened a new issue, #6290:
URL: https://github.com/apache/iceberg/issues/6290

   Hi, while going through the documentation on partition evolution and metadata, I was wondering whether it's possible to retrieve the actual partition definition of a table: which column or columns it is partitioned on, and with which transform(s), if any.
   (https://iceberg.apache.org/docs/latest/spark-queries/, and https://iceberg.apache.org/spec/)
   
   With Spark SQL, the `partitions` metadata table only shows the actual partition values, not how they are constructed. It points to partition spec IDs, but where can we find the actual partition spec using Spark SQL or the pyiceberg API?
   
   It's not clear whether this is supported in PySpark.
   
   Since we want to change the partitioning dynamically, we first need to know whether the table is already partitioned and, if so, which table fields and/or transform functions are used.
   
   thx!
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] brysd commented on issue #6290: Spark SQL support for partition spec?

Posted by GitBox <gi...@apache.org>.

brysd commented on issue #6290:
URL: https://github.com/apache/iceberg/issues/6290#issuecomment-1330360184

Hi, thanks for your feedback.

Maybe I'm getting the use case wrong, but our intention is to create Iceberg tables dynamically, where the 'user' of the application can configure the partition columns. That configuration can change over time, in which case we'd like to change the partition columns accordingly. The 'user' might also remove the partition columns from the configuration, in which case the partition field needs to be dropped.

So we would certainly use the `alter table ... replace partition field ... with ...`, `alter table ... drop partition field ...` and `alter table ... add partition field ...` Spark SQL statements. However, to generate these statements dynamically in code, we need to know what the current partition columns actually are.

For example, suppose we have a table A with a column ts, originally partitioned on months(ts). Someone changes the configuration to days(ts). We then want to generate a Spark SQL statement like 'alter table A replace partition field months(ts) with days(ts)'. This requires knowing that the current partition field is months(ts) before we execute the alter statement, and our configuration no longer tells us that because it has been overwritten.
So can we somehow retrieve the current partition field definition(s) through Spark SQL or the Python API?

We know we can use `select * from A.partitions` to get the partition instances, but it is not clear how to derive the actual partition fields from them. We can retrieve the spec_id through this `partitions` table, but how do we then find the transform and table field used for partitioning, so that we can dynamically create the Spark SQL alter statement?
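
To make the goal concrete, here is a minimal Python sketch of the statement generation described above, assuming the current partition fields are already known as strings such as "months(ts)". The function name `partition_evolution_statements` is hypothetical, and fields are compared as exact strings, so "truncate(x, 5)" and "truncate(x,5)" would be treated as different fields.

```python
def partition_evolution_statements(table, current, desired):
    """Build Spark SQL DDL strings that move `table` from the `current`
    list of partition field expressions to the `desired` list.

    Hypothetical helper: expressions are compared as exact strings.
    """
    current_set = set(current)
    desired_set = set(desired)
    to_drop = [f for f in current if f not in desired_set]
    to_add = [f for f in desired if f not in current_set]

    # Exactly one field out and one field in: express it as a single REPLACE.
    if len(to_drop) == 1 and len(to_add) == 1:
        return [
            f"ALTER TABLE {table} REPLACE PARTITION FIELD {to_drop[0]} WITH {to_add[0]}"
        ]

    # Otherwise drop what is no longer wanted, then add what is new.
    statements = [f"ALTER TABLE {table} DROP PARTITION FIELD {f}" for f in to_drop]
    statements += [f"ALTER TABLE {table} ADD PARTITION FIELD {f}" for f in to_add]
    return statements
```

Called with `("A", ["months(ts)"], ["days(ts)"])`, this yields the single replace statement from the example above. The open question is only where the `current` list comes from.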


[GitHub] [iceberg] github-actions[bot] commented on issue #6290: Spark SQL support for partition spec?

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #6290:
URL: https://github.com/apache/iceberg/issues/6290#issuecomment-1566319769

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.



[GitHub] [iceberg] brysd commented on issue #6290: Spark SQL support for partition spec?

Posted by GitBox <gi...@apache.org>.

brysd commented on issue #6290:
URL: https://github.com/apache/iceberg/issues/6290#issuecomment-1330907397

   @RussellSpitzer thanks for the suggestion. Describe will also work from PySpark, so I'll give it a try. Thanks!



[GitHub] [iceberg] RussellSpitzer commented on issue #6290: Spark SQL support for partition spec?

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #6290:
URL: https://github.com/apache/iceberg/issues/6290#issuecomment-1330830688

   Via SQL you can see the current partitioning with "describe table". Again, this is really just the description of how new files will be added to the table; old partitioning still works even when the current spec changes.
   
   ```scala
   scala> spark.sql("describe table parttruncate").show
   +--------------+--------------+-------+
   |      col_name|     data_type|comment|
   +--------------+--------------+-------+
   |             x|           int|       |
   |             y|           int|       |
   |              |              |       |
   |# Partitioning|              |       |
   |        Part 0|truncate(x, 5)|       |
   +--------------+--------------+-------+
   ```
   
   But if I were doing something programmatically, I would probably use:
   ```scala
   
   scala> import org.apache.iceberg.spark.Spark3Util
   import org.apache.iceberg.spark.Spark3Util
   scala> Spark3Util.loadIcebergTable(spark, "parttruncate").spec
   res17: org.apache.iceberg.PartitionSpec =
   [
     1000: x_trunc: truncate[5](1)
   ]
   ```
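
For PySpark, where the Scala helper is not directly at hand, the "describe table" output shown earlier can be scanned for the rows that follow the "# Partitioning" marker. The sketch below is an illustration only: `partition_fields_from_describe` is a hypothetical name, and it assumes the three-column layout (col_name, data_type, comment) and the "Part N" naming seen in that output, which may differ between Spark and Iceberg versions.

```python
def partition_fields_from_describe(rows):
    """Extract partition transform expressions from DESCRIBE TABLE rows.

    `rows` is an iterable of (col_name, data_type, comment) tuples.
    Assumes the layout shown in the describe output above.
    """
    fields = []
    in_partitioning = False
    for col_name, data_type, _comment in rows:
        name = col_name.strip()
        if name == "# Partitioning":
            in_partitioning = True
            continue
        if not in_partitioning:
            continue
        if name.startswith("Part "):
            # data_type holds the transform, e.g. "truncate(x, 5)"
            fields.append(data_type.strip())
        elif name.startswith("#"):
            # A different section header ends the partitioning block.
            break
    return fields


# Rows copied from the describe output above.
sample_rows = [
    ("x", "int", ""),
    ("y", "int", ""),
    ("", "", ""),
    ("# Partitioning", "", ""),
    ("Part 0", "truncate(x, 5)", ""),
]
current_fields = partition_fields_from_describe(sample_rows)
```

In a live session the rows would presumably come from something like `spark.sql("describe table parttruncate").collect()`, since PySpark `Row` objects unpack like tuples. A table without any "Part N" rows simply yields an empty list.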



[GitHub] [iceberg] github-actions[bot] commented on issue #6290: Spark SQL support for partition spec?

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #6290:
URL: https://github.com/apache/iceberg/issues/6290#issuecomment-1590236602

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'



[GitHub] [iceberg] RussellSpitzer commented on issue #6290: Spark SQL support for partition spec?

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #6290:
URL: https://github.com/apache/iceberg/issues/6290#issuecomment-1329119528

   This is probably not necessary unless I am missing something. Unlike in systems like Hive, individual partitions in Iceberg are implicit metadata constructs and don't have to be explicitly created or modified through DDL.
   
   The "spec" is the description of how fields are transformed to generate that metadata. For example, a spec like identity(column a) lets each file state that it contains only rows with one specific value of column a. Modifying this spec is allowed via our custom alter table commands.
   
   Commands like drop partition x=5 just become delete statements in Iceberg. The engine determines whether the delete can be a metadata-only delete by analyzing all files in the table, their respective partition values, and the spec used to write them.
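
As a rough illustration of what "how fields are transformed" means, the following Python sketch mimics three transforms on plain values. It is not Iceberg code: the function names are made up for this example, and the arithmetic follows my reading of the transform definitions in the Iceberg table spec (integer truncation rounds down to a multiple of the width, and the month transform counts months since 1970-01).

```python
from datetime import date


def identity(value):
    # identity(col): the partition value is the column value itself.
    return value


def truncate_int(value, width):
    # truncate(col, width) for integers: round down to a multiple of width.
    # Python's % is non-negative for a positive width, so negative values
    # round toward negative infinity.
    return value - (value % width)


def months_since_epoch(d):
    # months(col): whole months between 1970-01 and the date's month.
    return (d.year - 1970) * 12 + (d.month - 1)


# Two rows with x=5 and x=7 fall into the same truncate(x, 5) partition.
same_partition = truncate_int(5, 5) == truncate_int(7, 5)
```

Each data file records partition values produced this way under the spec it was written with, which is why changing the spec only affects how new files are laid out.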



[GitHub] [iceberg] github-actions[bot] closed issue #6290: Spark SQL support for partition spec?

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed issue #6290: Spark SQL support for partition spec?
URL: https://github.com/apache/iceberg/issues/6290

