You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/08/18 09:51:40 UTC

[GitHub] [iceberg] zhangdove opened a new pull request #1355: Fixed non-greenway time zone, data loss with day partition

zhangdove opened a new pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355


   Fix #1354 .
   
   I have made some attempts to fix the data loss caused by the time zone, I don't know if it is feasible.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhangdove commented on pull request #1355: Fixed non-greenway time zone, data loss with day partition

Posted by GitBox <gi...@apache.org>.
zhangdove commented on pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355#issuecomment-677154515


   >One could imagine if you have a system running in both China and in US East you would get different answers with the same timestamps and this to me is the crux of the problem. System locale should not effect data partitioning.
   
   @RussellSpitzer Thanks for your explanation, maybe I know why the community is using UTC time zone.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1355: Fixed non-greenway time zone, data loss with day partition

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355#issuecomment-677772657


   @zhangdove, thanks for bringing this up, I think it is a valuable discussion to have. I'm definitely open to adding date functions that are parameterized by a zone offset because there are times when the partition boundaries matter. If you're interested in that, it would be great to have you contribute them!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhangdove commented on pull request #1355: Fixed non-greenway time zone, data loss with day partition

Posted by GitBox <gi...@apache.org>.
zhangdove commented on pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355#issuecomment-675874398


   @rdblue Thanks for your reply. And I still have some questions about time zones.
   
   >Timestamp modification or adjustment to display a value by time zone is done by processing engines, not by Iceberg itself. Iceberg must only store values and return the same values.
   
   1.I don't think that such an operation changes the value of the time, and Iceberg does not change the real data of the time type, but only changes the value of the time partition directory to meet the user's current time zone.
   
   2.I don't understand why iceberg must use UTC time zone in the time partition directory. The user's time data in other time zones is prone to the problems described in #1354 .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1355: Fixed non-greenway time zone, data loss with day partition

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355#issuecomment-675604146


   @zhangdove, Iceberg does not modify timestamps other than converting between representations (like micros to millis).
   
   Timestamp modification or adjustment to display a value by time zone is done by processing engines, not by Iceberg itself. Iceberg must only store values and return the same values.
   
   Timestamp values are in microseconds and come in two flavors: timestamp is a zoneless date and time represented as microseconds from the Unix epoch in UTC, and timestamptz is an instant in time represented as microseconds from the Unix epoch in UTC.
   
   The report in #1354 uses Spark. Spark only supports timestamptz and it passes values to Iceberg as microseconds from epoch, which Iceberg returns to Spark unmodified. The `day` partition function is not specific to a zone, which is probably the source of the confusion. The daily boundaries used by that function are UTC day boundaries, if I remember correctly.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhangdove closed pull request #1355: Fixed non-greenway time zone, data loss with day partition

Posted by GitBox <gi...@apache.org>.
zhangdove closed pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1355: Fixed non-greenway time zone, data loss with day partition

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355#issuecomment-676488569


   The real issue here is not that the "Day" has to be in UTC, but that timestamps are in many cases transmitted without any timezone information, assuming that we can add that information back based on the System locale is going to essentially be buggy especially when we don't know where the System's running the code are actually located and this may be different for multiple users of the same table.
   
   One could imagine if you have a system running in both China and in US East you would get different answers with the same timestamps and this to me is the crux of the problem. System locale should not effect data partitioning.
   
   This is one of the (many) reasons that almost all modern apis for calculating Day or Day since Epoch don't represent the underlying time as a timestamp. There is just too much ambiguity in what the "day" should actually be. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #1355: Fixed non-greenway time zone, data loss with day partition

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355#issuecomment-676488569


   One issue here is not that the "Day" has to be in UTC, but that timestamps are in many cases transmitted without any timezone information, assuming that we can add that information back based on the System locale is going to essentially be buggy especially when we don't know where the System's running the code are actually located and this may be different for multiple users of the same table.
   
   One could imagine if you have a system running in both China and in US East you would get different answers with the same timestamps and this to me is the crux of the problem. System locale should not effect data partitioning.
   
   This is one of the (many) reasons that almost all modern apis for calculating Day or Day since Epoch don't represent the underlying time as a timestamp. There is just too much ambiguity in what the "day" should actually be. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #1355: Fixed non-greenway time zone, data loss with day partition

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355#issuecomment-676488569


   One issue here is not that the "Day" has to be in UTC, but that timestamps are in many cases transmitted without any timezone information, assuming that we can add that information back based on the System locale is going to essentially be buggy especially when we don't know where the System's running the code is and the location may change depending on deployment.
   
   One could imagine if you have a system running in both China and in US East you would get different answers with the same timestamps and this to me is the crux of the problem. System locale should not effect data partitioning.
   
   This is one of the (many) reasons that almost all modern apis for calculating Day or Day since Epoch don't represent the underlying time as a timestamp. There is just too much ambiguity in what the "day" should actually be. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhangdove commented on pull request #1355: Fixed non-greenway time zone, data loss with day partition

Posted by GitBox <gi...@apache.org>.
zhangdove commented on pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355#issuecomment-677179685


    I am now closing the relevant issue and this PR


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhangdove commented on pull request #1355: Fixed non-greenway time zone, data loss with day partition

Posted by GitBox <gi...@apache.org>.
zhangdove commented on pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355#issuecomment-677169468


   @rdblue Thank you very much for your patience.
   After internal consideration of my team and I, we decided to privatize this function locally. At least so far, we have no internal problem of processing data across time zones. 
   Of course, I am more looking forward to relevant modifications by the community, as you mentioned before, adding an offset parameter in the function. If you don't mind, maybe I'll take the time to provide a PR that supports the time zone parameter function.
   
   The reason why I use 'dynamicOverwrite' is because of the problem from #1341 . However, when I want to reproduce the problem, I find that the problem no longer exists. Next, I'll use the recommended API `(overwrite(Expression))`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1355: Fixed non-greenway time zone, data loss with day partition

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355#issuecomment-677710648


   Timezones are the worst :( 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1355: Fixed non-greenway time zone, data loss with day partition

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355#issuecomment-676591216


   > I don't think that such an operation changes the value of the time, and Iceberg does not change the real data of the time type, but only changes the value of the time partition directory to meet the user's current time zone.
   
   Okay, I see that you're just adjusting the time zone in the partition function. That's similar to a solution we may want to use, although I think we would need to do it differently: the partition function must be well-defined and cannot depend on the environment. I think we would want to add either 23 new functions (one for each offset) or add an offset parameter to the function, similar to the bucket width and truncation length.
   
   The reason why the partitioning is UTC is that we just need to break data up into day-sized partitions. The actual partition boundaries would ideally not matter. If you need data split across files at some boundary for deletes, then we would normally recommend hourly partitioning to ensure that you can delete any hour individually.
   
   > I don't understand why iceberg must use UTC time zone in the time partition directory. The user's time data in other time zones is prone to the problems described in #1354 .
   
   The requirement is that Iceberg must be consistent. Whatever one engine uses, all the others must as well. That's why I'm saying that the function would need to be parameterized and we would need to add the offsets to the spec.
   
   Also, I should note that `dynamicOverwrite` is supported, but not considered a best practice. We don't recommend using `dynamicOverwrite` because it can lead to situations like what you hit in #1354. The problem is that the data being overwritten is _implicit_ based on the data written.
   
   If your job had a bug and wrote a single record from another day -- even if the zone problem were fixed -- then that record would overwrite that day worth of data. This is why we added the `overwrite(Expression)` action in `DataFrameWriterV2`. That is explicit about what data is deleted when your job commits. That is a much more reliable pattern.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org