Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/25 12:09:58 UTC

[GitHub] [iceberg] Chronos-LYH opened a new issue, #5631: API/Core: Add metadata field to NestedField

Chronos-LYH opened a new issue, #5631:
URL: https://github.com/apache/iceberg/issues/5631

   ### Feature Request / Improvement
   
   In ML scenarios, we may want Iceberg schemas to include additional information about a field. For example:
   For an integer field representing a feature, we need information indicating whether the feature is continuous or categorical:
   ```
   {"type": "continuous"}
   {"type": "categorical", "categories": ["US", "CA", "CN", ...]}
   ```
   For a list field representing multiple features, we may want information on some of the features:
   ```
   {
     "features": [
       {
         "index": 0,
         "name": "age",
         "type": "continuous"
       },
       {
         "index": 5,
         "name": "gender",
         "type": "categorical",
         "categories": [
           "male",
           "female"
         ]
       }
     ]
   }
   ```
   For a binary field representing a custom-encoded feature, we need information on the encoding.
   ```
    {"encoding": "feature_id_v1"}
    {"encoding": "feature_id_v2"}
   ```
   Spark has had a metadata field in its StructField class since Spark 1.2 (https://issues.apache.org/jira/browse/SPARK-3569), so Spark DataFrames can carry ML-specific information such as the examples above.
   
   Following Spark's implementation, Iceberg could add a metadata field to the NestedField class. When declaring an Iceberg NestedField, users could pass a "metadata" argument with additional information about the field.
   ```
   Schema schema = new Schema(
           // Proposed API: the trailing Metadata argument does not exist in Iceberg today.
           required(1, "feature1", Types.IntegerType.get(), null, Metadata.fromJson(
                   "{\"type\": \"categorical\", \"categories\": [\"US\", \"CA\", \"CN\"]}")),
           required(2, "feature2", Types.IntegerType.get(), null, Metadata.fromJson(
                   "{\"type\": \"continuous\"}"))
   );
   ```
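
As a rough illustration of the shape such a field could take, here is a minimal plain-Java sketch. Everything in it is hypothetical: `FieldMetadataSketch`, `FieldMetadata`, and `Field` are stand-ins invented for this example and are not part of Iceberg's API. It assumes metadata is an immutable string-keyed map and that a field declared without metadata gets an empty instance rather than null.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: these classes are NOT part of Iceberg's API.
final class FieldMetadataSketch {

    /** Immutable key/value metadata attached to one field. */
    static final class FieldMetadata {
        private final Map<String, Object> values;

        private FieldMetadata(Map<String, Object> values) {
            // Defensive copy so later changes to the caller's map are not visible here.
            this.values = Collections.unmodifiableMap(new LinkedHashMap<>(values));
        }

        static FieldMetadata of(Map<String, Object> values) {
            return new FieldMetadata(values);
        }

        static FieldMetadata empty() {
            return new FieldMetadata(Collections.emptyMap());
        }

        Object get(String key) {
            return values.get(key);
        }

        boolean isEmpty() {
            return values.isEmpty();
        }

        @Override
        public String toString() {
            return values.toString();
        }
    }

    /** Stand-in for NestedField: id, name, type name, doc, plus the proposed metadata. */
    static final class Field {
        final int id;
        final String name;
        final String type;
        final String doc;
        final FieldMetadata metadata;

        Field(int id, String name, String type, String doc, FieldMetadata metadata) {
            this.id = id;
            this.name = name;
            this.type = type;
            this.doc = doc;
            // A field declared without metadata gets an empty instance, never null.
            this.metadata = metadata == null ? FieldMetadata.empty() : metadata;
        }
    }

    public static void main(String[] args) {
        Field feature1 = new Field(1, "feature1", "int", null,
                FieldMetadata.of(Map.of(
                        "type", "categorical",
                        "categories", List.of("US", "CA", "CN"))));
        Field feature2 = new Field(2, "feature2", "int", null, null);

        System.out.println(feature1.name + " -> " + feature1.metadata.get("type"));
        System.out.println(feature2.name + " has metadata: " + !feature2.metadata.isEmpty());
    }
}
```

Keeping the metadata immutable and never null would let existing code that ignores it keep working unchanged, though whether that is the right trade-off is a design question for the actual proposal.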
   Iceberg and Spark metadata should be convertible to each other, so that field metadata in Iceberg can be passed to Spark DataFrames, and DataFrames can preserve field metadata when saved in Iceberg format.
   
   See also:
   https://docs.google.com/document/d/1RGJgVJhCebnilpL15ODcq0EWBeVjl9ltoHUvosWodPg/edit#
   
   ### Query engine
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] linyanghao commented on issue #5631: API/Core: Add metadata field to NestedField

Posted by GitBox <gi...@apache.org>.
linyanghao commented on issue #5631:
URL: https://github.com/apache/iceberg/issues/5631#issuecomment-1288361002

   > @linyanghao, can you bring this up as a discussion on the dev list?
   > 
   > The reason why we haven't done this is that metadata is not part of SQL and is not supported across engines. As a result, operations will drop metadata when people expect it to be carried through:
   > 
   > ```sql
   > CREATE TABLE copy AS SELECT * FROM original
   > ```
   > 
   > Any field metadata in `original` is not present in `copy`. I think that's confusing enough that we have so far chosen to not add this "feature" to the format.
   > 
   > If we want to change that decision, we'll need to document this in the spec and define compatibility rules across spec versions.
   
   Thanks for pointing out the issue. I will bring up a discussion when I have some thoughts on it.




[GitHub] [iceberg] rdblue commented on issue #5631: API/Core: Add metadata field to NestedField

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #5631:
URL: https://github.com/apache/iceberg/issues/5631#issuecomment-1288161731

   @linyanghao, can you bring this up as a discussion on the dev list?
   
   The reason why we haven't done this is that metadata is not part of SQL and is not supported across engines. As a result, operations will drop metadata when people expect it to be carried through:
   
   ```sql
   CREATE TABLE copy AS SELECT * FROM original
   ```
   
   Any field metadata in `original` is not present in `copy`. I think that's confusing enough that we have so far chosen to not add this "feature" to the format.
   
   If we want to change that decision, we'll need to document this in the spec and define compatibility rules across spec versions.
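
The CTAS concern above can be illustrated with a toy model in plain Java. This is only a sketch under stated assumptions: `CtasMetadataLossSketch`, `TableField`, `SqlColumn`, `selectStar`, and `createTableAs` are names invented for this example and do not correspond to Iceberg or to any engine's API. It assumes the engine's column model carries only a name and a type, so there is nowhere for field metadata to travel between the read and the write.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Toy model only: these classes do not correspond to Iceberg or any engine's API.
final class CtasMetadataLossSketch {

    /** A table field as the format would store it, with optional metadata. */
    static final class TableField {
        final String name;
        final String type;
        final Map<String, String> metadata;

        TableField(String name, String type, Map<String, String> metadata) {
            this.name = name;
            this.type = type;
            this.metadata = metadata;
        }
    }

    /** A column as a SQL engine sees it: only a name and a type. */
    static final class SqlColumn {
        final String name;
        final String type;

        SqlColumn(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    /** SELECT *: the engine projects each field into its own column model. */
    static List<SqlColumn> selectStar(List<TableField> table) {
        List<SqlColumn> columns = new ArrayList<>();
        for (TableField field : table) {
            // The metadata has no slot in SqlColumn, so it is dropped here.
            columns.add(new SqlColumn(field.name, field.type));
        }
        return columns;
    }

    /** CREATE TABLE ... AS: the new table is built from the query's output columns. */
    static List<TableField> createTableAs(List<SqlColumn> queryOutput) {
        List<TableField> table = new ArrayList<>();
        for (SqlColumn column : queryOutput) {
            table.add(new TableField(column.name, column.type, Collections.emptyMap()));
        }
        return table;
    }

    public static void main(String[] args) {
        List<TableField> original = List.of(
                new TableField("feature1", "int", Map.of("type", "continuous")));
        List<TableField> copy = createTableAs(selectStar(original));

        System.out.println("original metadata: " + original.get(0).metadata);
        System.out.println("copy metadata:     " + copy.get(0).metadata);
    }
}
```

In this model the copy keeps the column name and type but ends up with empty metadata, which is the surprise described above.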




[GitHub] [iceberg] github-actions[bot] commented on issue #5631: API/Core: Add metadata field to NestedField

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #5631:
URL: https://github.com/apache/iceberg/issues/5631#issuecomment-1619282809

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'




[GitHub] [iceberg] github-actions[bot] commented on issue #5631: API/Core: Add metadata field to NestedField

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #5631:
URL: https://github.com/apache/iceberg/issues/5631#issuecomment-1519204691

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.




[GitHub] [iceberg] github-actions[bot] closed issue #5631: API/Core: Add metadata field to NestedField

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #5631: API/Core: Add metadata field to NestedField
URL: https://github.com/apache/iceberg/issues/5631

