You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/06/05 19:17:55 UTC

[GitHub] [iceberg] schatterjee10 opened a new issue, #4849: Ignore downcasting of column types when "mergeSchema" is set.

schatterjee10 opened a new issue, #4849:
URL: https://github.com/apache/iceberg/issues/4849

   We encountered an error while writing to iceberg table
   `java.lang.IllegalArgumentException: Cannot change column type: myCol: long->int`
   The table was created with long type for myCol. We are writing to Iceberg table from Spark application like this :
   `df.writeTo(icebergTable).option("mergeSchema", "true").overwritePartitions()`
    
   Since we have datatype for myCol as int in the dataframe and we are using 'mergeSchema' option, I believe the writer is trying to downcast the column type which is failing. Would it be better if we ignore the downcasting of column types with 'mergeSchema' option?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #4849: Ignore downcasting of column types when "mergeSchema" is set.

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4849:
URL: https://github.com/apache/iceberg/issues/4849#issuecomment-1138192430

   Thanks for reporting this @schatterjee10.
   
   I’d be open to a quick demo PR, but I do worry about trying to be very “smart” with casting rules. We do allow for “incompatible” schema updates to be made, but this would require casting the data afaik and `mergeSchema` should really err on the side of caution in my opinion.
   
   That said, if you have a sample draft POC to show this is possible, I think we’d be willing to take a look to consider it. Again though, to be really correct the data should be stored in the files as long (in my opinion) and should likely be casted.
   
   But I will admit I’m not super familiar with this code so if it can be easily resolved, feel free to present it or discuss it. But in general, I’m hesitant about overriding the built in rules about dangerous casting. Eventually, I worry it will come back to bite us.
   
   I’m just one person so others might disagree, but that’s my 2 cents fwiw. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #4849: Ignore downcasting of column types when "mergeSchema" is set.

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4849:
URL: https://github.com/apache/iceberg/issues/4849#issuecomment-1138197433

   For what it’s worth, here’s the original PR that added support.
   
   There’s a function in `SchemaUtils` it seems you can look into to check viability / ease of adding this change: https://github.com/apache/iceberg/commit/69c06cead6e5ddfe58c17630c4863eb98e04ff4e


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #4849: Ignore downcasting of column types when "mergeSchema" is set.

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4849:
URL: https://github.com/apache/iceberg/issues/4849#issuecomment-1146868577

   Ok. I’m good with that argument.
   
   Specifically that this write would succeed otherwise. I do worry about unintended consequences from eventually aggressively checking widening, but int and long are good in my book.
   
   Thanks for weighing in @RussellSpitzer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] github-actions[bot] closed issue #4849: Ignore downcasting of column types when "mergeSchema" is set.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed issue #4849: Ignore downcasting of column types when "mergeSchema" is set.
URL: https://github.com/apache/iceberg/issues/4849


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [I] Ignore downcasting of column types when "mergeSchema" is set. [iceberg]

Posted by "yyy1000 (via GitHub)" <gi...@apache.org>.
yyy1000 commented on issue #4849:
URL: https://github.com/apache/iceberg/issues/4849#issuecomment-1842133002

   Since no one is working on this, I'd like a try.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick closed issue #4849: Ignore downcasting of column types when "mergeSchema" is set.

Posted by GitBox <gi...@apache.org>.
kbendick closed issue #4849: Ignore downcasting of column types when "mergeSchema" is set.
URL: https://github.com/apache/iceberg/issues/4849


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #4849: Ignore downcasting of column types when "mergeSchema" is set.

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4849:
URL: https://github.com/apache/iceberg/issues/4849#issuecomment-1146869834

   If anybody is interested in working on this, as I mentioned before the best place to start the work would be around here as this method is where the `mergeSchema` logic begins: https://github.com/apache/iceberg/commit/69c06cead6e5ddfe58c17630c4863eb98e04ff4e#diff-9eb9256d82cd6ab19f73c941a7011ff4973eb61255c7fee0b38402f922a2e48cR239


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #4849: Ignore downcasting of column types when "mergeSchema" is set.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4849:
URL: https://github.com/apache/iceberg/issues/4849#issuecomment-1146868271

   @kbendick I do think we should allow this, just ignoring the schema change. Currently the behavior of merge schema invalidates a write that would succeed if mergeSchema had been set to false.
   
   I think in cases like this merge schema should just not downcast and allow the write to proceed as normal. This would be really useful for users who have multiple numeric sources and want to use mergeSchema functionality. Mostly the merge schema is being used for adding new missing columns but if an existing column has a wider type (or was widened by another operation) I don't think that should break anything


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #4849: Ignore downcasting of column types when "mergeSchema" is set.

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4849:
URL: https://github.com/apache/iceberg/issues/4849#issuecomment-1146735641

   But TLDR is that downcasting from long -> int is not safe.
   
   It could break existing data in the table.
   
   If you’re _sure_ that the change is allowed, I’d suggest you use the Java API to allow for incompatible changes.
   
   Even with `overwritePartitions()`, there could be data in _other_partitions that has true long types that don’t fit into int.
   
   I’m going to close this issue as this is the expected behavior and downcasting in an unsafe way is not something `mergeSchema` should do on the users behalf.
   
   You can use the Java API to allow for incompatible schema changes (at your own risk as long -> int is a lossy conversion for values greater than INT_MAX). Feel free to open a new issue if you still have issues 🙂 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] github-actions[bot] commented on issue #4849: Ignore downcasting of column types when "mergeSchema" is set.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #4849:
URL: https://github.com/apache/iceberg/issues/4849#issuecomment-1335978175

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] github-actions[bot] commented on issue #4849: Ignore downcasting of column types when "mergeSchema" is set.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #4849:
URL: https://github.com/apache/iceberg/issues/4849#issuecomment-1374284039

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org