You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/30 21:02:28 UTC

[GitHub] [iceberg] jun-he opened a new pull request, #4908: [Python] add identity transform

jun-he opened a new pull request, #4908:
URL: https://github.com/apache/iceberg/pull/4908

   To split the PR https://github.com/apache/iceberg/pull/3450, open a new PR for identity transform here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885097455


##########
python/src/iceberg/utils/binary.py:
##########
@@ -0,0 +1,25 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+"""Helper methods for working with bytes (FixedType or BinaryType) data
+"""
+
+import base64
+
+
+def base64encode(buffer: bytes) -> str:

Review Comment:
   I think the utils methods added in this PR should be private methods in transforms.py instead because they are not used in Iceberg Java outside of the `transforms` package. There's no need to expose them in Python either.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#issuecomment-1153338536

   @rdblue  updated the PR and removed tox.ini


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885096266


##########
python/src/iceberg/utils/binary.py:
##########
@@ -0,0 +1,25 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+"""Helper methods for working with bytes (FixedType or BinaryType) data
+"""
+
+import base64
+
+
+def base64encode(buffer: bytes) -> str:
+    """Converts bytes to bases64 string"""

Review Comment:
   Typo: should be `base64` not `bases64`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885105191


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)
+
+
+def to_human_time(micros_from_midnight: int) -> str:
+    """Converts a TimeType value to human string"""
+    to_day = EPOCH_TIMESTAMP + timedelta(microseconds=micros_from_midnight)
+    return f"{to_day.time()}"
+
+
+def to_human_timestamptz(timestamp_micros: int) -> str:
+    """Converts a TimestamptzType value to human string"""
+    to_timestamptz = EPOCH_TIMESTAMP + timedelta(microseconds=timestamp_micros)
+    return pytz.timezone("UTC").localize(to_timestamptz).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
+
+
+def to_human_timestamp(timestamp_micros: int) -> str:
+    """Converts a TimestampType value to human string"""
+    to_timestamp = EPOCH_TIMESTAMP + timedelta(microseconds=timestamp_micros)
+    return to_timestamp.isoformat()

Review Comment:
   This works for me to produce the correct timestamp string:
   
   ```python
       return (EPOCH_TIMESTAMP + timedelta(microseconds=1653926816315057)).isoformat()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885104803


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)
+
+
+def to_human_time(micros_from_midnight: int) -> str:
+    """Converts a TimeType value to human string"""
+    to_day = EPOCH_TIMESTAMP + timedelta(microseconds=micros_from_midnight)
+    return f"{to_day.time()}"
+
+
+def to_human_timestamptz(timestamp_micros: int) -> str:
+    """Converts a TimestamptzType value to human string"""
+    to_timestamptz = EPOCH_TIMESTAMP + timedelta(microseconds=timestamp_micros)

Review Comment:
   I think this should use `EPOCH_TIMESTAMPTZ`. That avoids the need for pytz.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r895225327


##########
python/src/iceberg/transforms.py:
##########
@@ -224,6 +227,80 @@ def hash(self, value: UUID) -> int:
         )
 
 
+def _base64encode(buffer: bytes) -> str:
+    """Converts bytes to base64 string"""
+    return base64.b64encode(buffer).decode("ISO-8859-1")
+
+
+class IdentityTransform(Transform[S, S]):
+    """Transforms a value into itself.
+
+    Example:
+        >>> transform = IdentityTransform(StringType())
+        >>> transform.apply('hello-world')
+        'hello-world'
+    """
+
+    def __init__(self, source_type: IcebergType):
+        super().__init__(
+            "identity",
+            f"transforms.identity(source_type={repr(source_type)})",
+        )
+        self._type = source_type
+
+    def apply(self, value: Optional[S]) -> Optional[S]:
+        return value
+
+    def can_transform(self, source: IcebergType) -> bool:
+        return source.is_primitive
+
+    def result_type(self, source: IcebergType) -> IcebergType:
+        return source
+
+    @property
+    def preserves_order(self) -> bool:
+        return True
+
+    def satisfies_order_of(self, other: Transform) -> bool:
+        """ordering by value is the same as long as the other preserves order"""
+        return other.preserves_order
+
+    def to_human_string(self, value: Optional[S]) -> str:
+        return self._human_string(value)
+
+    @singledispatchmethod
+    def _human_string(self, value: Optional[S]) -> str:

Review Comment:
   Minor: There's no need for these to be in the class, and it's actually faster if they are simple methods.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r895252659


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)
+
+
+def to_human_time(micros_from_midnight: int) -> str:
+    """Converts a TimeType value to human string"""
+    to_day = EPOCH_TIMESTAMP + timedelta(microseconds=micros_from_midnight)

Review Comment:
   👌  Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#issuecomment-1154385684

   Thanks, @jun-he!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r889757148


##########
python/src/iceberg/transforms.py:
##########
@@ -224,6 +226,67 @@ def hash(self, value: UUID) -> int:
         )
 
 
+class IdentityTransform(Transform[S, S]):
+    """Transforms a value into itself.
+
+    Example:
+        >>> transform = IdentityTransform(StringType())
+        >>> transform.apply('hello-world')
+        'hello-world'
+    """
+
+    def __init__(self, source_type: IcebergType):
+        super().__init__(
+            "identity",
+            f"transforms.identity(source_type={repr(source_type)})",
+        )
+        self._type = source_type
+
+    def apply(self, value: Optional[S]) -> Optional[S]:
+        return value
+
+    def can_transform(self, source: IcebergType) -> bool:
+        return source.is_primitive
+
+    def result_type(self, source: IcebergType) -> IcebergType:
+        return source
+
+    @property
+    def preserves_order(self) -> bool:
+        return True
+
+    def satisfies_order_of(self, other: Transform) -> bool:
+        """ordering by value is the same as long as the other preserves order"""
+        return other.preserves_order
+
+    def to_human_string(self, value: Optional[S]) -> str:
+        return self._human_string(value)
+
+    @singledispatchmethod
+    def _human_string(self, value: Optional[S]) -> str:
+        return str(value) if value is not None else "null"
+
+    @_human_string.register(int)
+    def _(self, value: int) -> str:

Review Comment:
   Sure, updated it. A small issue is that `valueType` is unused.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#issuecomment-1146918517

   @rdblue Yep, just addressed the comments accordingly. Can you take a look? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885104124


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)
+
+
+def to_human_time(micros_from_midnight: int) -> str:
+    """Converts a TimeType value to human string"""
+    to_day = EPOCH_TIMESTAMP + timedelta(microseconds=micros_from_midnight)
+    return f"{to_day.time()}"

Review Comment:
   Isn't this just `str(to_day.time())`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r889757256


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)
+
+
+def to_human_time(micros_from_midnight: int) -> str:
+    """Converts a TimeType value to human string"""
+    to_day = EPOCH_TIMESTAMP + timedelta(microseconds=micros_from_midnight)
+    return f"{to_day.time()}"

Review Comment:
   Done.



##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)
+
+
+def to_human_time(micros_from_midnight: int) -> str:
+    """Converts a TimeType value to human string"""
+    to_day = EPOCH_TIMESTAMP + timedelta(microseconds=micros_from_midnight)
+    return f"{to_day.time()}"
+
+
+def to_human_timestamptz(timestamp_micros: int) -> str:
+    """Converts a TimestamptzType value to human string"""
+    to_timestamptz = EPOCH_TIMESTAMP + timedelta(microseconds=timestamp_micros)

Review Comment:
   👌 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#issuecomment-1146902590

   @jun-he do you have time to update this? If not, we can probably find someone to pick it up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r895226010


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)
+
+
+def to_human_time(micros_from_midnight: int) -> str:
+    """Converts a TimeType value to human string"""
+    to_day = EPOCH_TIMESTAMP + timedelta(microseconds=micros_from_midnight)

Review Comment:
   This should use the code provided here, rather than creating a datetime and converting it to a time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r895252601


##########
python/tox.ini:
##########
@@ -0,0 +1,111 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[tox]

Review Comment:
   👌 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r889757312


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r889756967


##########
python/setup.cfg:
##########
@@ -44,6 +44,7 @@ packages = find:
 python_requires = >=3.8
 install_requires =
     mmh3
+    pytz

Review Comment:
   Updated the PR and removed it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885105459


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)
+
+
+def to_human_time(micros_from_midnight: int) -> str:
+    """Converts a TimeType value to human string"""
+    to_day = EPOCH_TIMESTAMP + timedelta(microseconds=micros_from_midnight)
+    return f"{to_day.time()}"

Review Comment:
   I think this should also use `.isoformat()` after constructing a time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885105811


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)

Review Comment:
   This should not use `EPOCH_TIMESTAMP` because that includes time. Instead, use `EPOCH_DATE`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885103817


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)
+
+
+def to_human_time(micros_from_midnight: int) -> str:
+    """Converts a TimeType value to human string"""
+    to_day = EPOCH_TIMESTAMP + timedelta(microseconds=micros_from_midnight)

Review Comment:
   Time shouldn't use epoch. There's no need to mix dates into this.
   
   Getting a time is actually straightforward:
   
   ```python
   def time_from_micros(micros: int) -> time:
       seconds = micros // 1_000_000
       minutes = seconds // 60
       hours = minutes // 60
       time(hour=hours, minute=minutes % 60, second=seconds % 60, microsecond=micros % 1_000_000)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885105053


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)
+
+
+def to_human_time(micros_from_midnight: int) -> str:
+    """Converts a TimeType value to human string"""
+    to_day = EPOCH_TIMESTAMP + timedelta(microseconds=micros_from_midnight)
+    return f"{to_day.time()}"
+
+
+def to_human_timestamptz(timestamp_micros: int) -> str:
+    """Converts a TimestamptzType value to human string"""
+    to_timestamptz = EPOCH_TIMESTAMP + timedelta(microseconds=timestamp_micros)

Review Comment:
   This works for me to produce the correct timestamptz string:
   ```py
       return (EPOCH_TIMESTAMPTZ + timedelta(microseconds=1653926816315057)).isoformat()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885105749


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)

Review Comment:
   This should use `.isoformat()`:
   
   ```python
       return (EPOCH_DATE + timedelta(days=days)).isoformat()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885101804


##########
python/tests/test_transforms.py:
##########
@@ -129,6 +131,52 @@ def test_string_with_surrogate_pair():
     assert bucket_transform.hash(string_with_surrogate_pair) == mmh3.hash(as_bytes)
 
 
+@pytest.mark.parametrize(
+    "type_var,value,expected",
+    [
+        (LongType(), None, "null"),
+        (DateType(), 17501, "2017-12-01"),
+        (TimeType(), 36775038194, "10:12:55.038194"),
+        (TimestamptzType(), 1512151975038194, "2017-12-01T18:12:55.038194Z"),

Review Comment:
   Timestamptz literals must always include a valid zone offset that matches this regex: `[+-]\d\d:\d\d$`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885095705


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -17,7 +17,9 @@
 """Helper methods for working with date/time representations
 """
 import re
-from datetime import date, datetime, time
+from datetime import date, datetime, time, timedelta
+
+import pytz

Review Comment:
   I don't think this is needed. Instead of using pytz, this should convert using the python built-ins.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885095496


##########
python/setup.cfg:
##########
@@ -44,6 +44,7 @@ packages = find:
 python_requires = >=3.8
 install_requires =
     mmh3
+    pytz

Review Comment:
   Is this needed? We have time utils already.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r889757068


##########
python/src/iceberg/utils/binary.py:
##########
@@ -0,0 +1,25 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+"""Helper methods for working with bytes (FixedType or BinaryType) data
+"""
+
+import base64
+
+
+def base64encode(buffer: bytes) -> str:

Review Comment:
   Done and removed binary.py.



##########
python/src/iceberg/transforms.py:
##########
@@ -224,6 +226,67 @@ def hash(self, value: UUID) -> int:
         )
 
 
+class IdentityTransform(Transform[S, S]):
+    """Transforms a value into itself.
+
+    Example:
+        >>> transform = IdentityTransform(StringType())
+        >>> transform.apply('hello-world')
+        'hello-world'
+    """
+
+    def __init__(self, source_type: IcebergType):
+        super().__init__(
+            "identity",
+            f"transforms.identity(source_type={repr(source_type)})",
+        )
+        self._type = source_type
+
+    def apply(self, value: Optional[S]) -> Optional[S]:
+        return value
+
+    def can_transform(self, source: IcebergType) -> bool:
+        return source.is_primitive
+
+    def result_type(self, source: IcebergType) -> IcebergType:
+        return source
+
+    @property
+    def preserves_order(self) -> bool:
+        return True
+
+    def satisfies_order_of(self, other: Transform) -> bool:
+        """ordering by value is the same as long as the other preserves order"""
+        return other.preserves_order
+
+    def to_human_string(self, value: Optional[S]) -> str:
+        return self._human_string(value)
+
+    @singledispatchmethod
+    def _human_string(self, value: Optional[S]) -> str:
+        return str(value) if value is not None else "null"
+
+    @_human_string.register(int)
+    def _(self, value: int) -> str:
+        if isinstance(self._type, DateType):
+            return datetime.to_human_day(value)
+        elif isinstance(self._type, TimeType):
+            return datetime.to_human_time(value)
+        elif isinstance(self._type, TimestampType):
+            return datetime.to_human_timestamp(value)
+        elif isinstance(self._type, TimestamptzType):
+            return datetime.to_human_timestamptz(value)
+        else:
+            return str(value)
+
+    @_human_string.register(bytes)
+    def _(self, value: bytes) -> str:
+        if type(self._type) in {FixedType, BinaryType}:

Review Comment:
   👌 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r889757285


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)
+
+
+def to_human_time(micros_from_midnight: int) -> str:
+    """Converts a TimeType value to human string"""
+    to_day = EPOCH_TIMESTAMP + timedelta(microseconds=micros_from_midnight)
+    return f"{to_day.time()}"
+
+
+def to_human_timestamptz(timestamp_micros: int) -> str:
+    """Converts a TimestamptzType value to human string"""
+    to_timestamptz = EPOCH_TIMESTAMP + timedelta(microseconds=timestamp_micros)
+    return pytz.timezone("UTC").localize(to_timestamptz).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
+
+
+def to_human_timestamp(timestamp_micros: int) -> str:
+    """Converts a TimestampType value to human string"""
+    to_timestamp = EPOCH_TIMESTAMP + timedelta(microseconds=timestamp_micros)
+    return to_timestamp.isoformat()

Review Comment:
   👌 



##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r895225857


##########
python/tox.ini:
##########
@@ -0,0 +1,111 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[tox]

Review Comment:
   Since we no longer use tox, can you remove this file?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #4908:
URL: https://github.com/apache/iceberg/pull/4908


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885104317


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)

Review Comment:
   This conversion looks good.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885101579


##########
python/src/iceberg/transforms.py:
##########
@@ -224,6 +226,67 @@ def hash(self, value: UUID) -> int:
         )
 
 
+class IdentityTransform(Transform[S, S]):
+    """Transforms a value into itself.
+
+    Example:
+        >>> transform = IdentityTransform(StringType())
+        >>> transform.apply('hello-world')
+        'hello-world'
+    """
+
+    def __init__(self, source_type: IcebergType):
+        super().__init__(
+            "identity",
+            f"transforms.identity(source_type={repr(source_type)})",
+        )
+        self._type = source_type
+
+    def apply(self, value: Optional[S]) -> Optional[S]:
+        return value
+
+    def can_transform(self, source: IcebergType) -> bool:
+        return source.is_primitive
+
+    def result_type(self, source: IcebergType) -> IcebergType:
+        return source
+
+    @property
+    def preserves_order(self) -> bool:
+        return True
+
+    def satisfies_order_of(self, other: Transform) -> bool:
+        """ordering by value is the same as long as the other preserves order"""
+        return other.preserves_order
+
+    def to_human_string(self, value: Optional[S]) -> str:
+        return self._human_string(value)
+
+    @singledispatchmethod
+    def _human_string(self, value: Optional[S]) -> str:
+        return str(value) if value is not None else "null"
+
+    @_human_string.register(int)
+    def _(self, value: int) -> str:

Review Comment:
   Rather than using `isinstance`, why not add another `@singledispatchmethod`: `_int_to_human_string(valueType: IcebergType, value: int)`? That seems a bit cleaner to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r889757035


##########
python/src/iceberg/utils/binary.py:
##########
@@ -0,0 +1,25 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+"""Helper methods for working with bytes (FixedType or BinaryType) data
+"""
+
+import base64
+
+
+def base64encode(buffer: bytes) -> str:
+    """Converts bytes to bases64 string"""

Review Comment:
   Corrected.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r889757195


##########
python/tests/test_transforms.py:
##########
@@ -129,6 +131,52 @@ def test_string_with_surrogate_pair():
     assert bucket_transform.hash(string_with_surrogate_pair) == mmh3.hash(as_bytes)
 
 
+@pytest.mark.parametrize(
+    "type_var,value,expected",
+    [
+        (LongType(), None, "null"),
+        (DateType(), 17501, "2017-12-01"),
+        (TimeType(), 36775038194, "10:12:55.038194"),
+        (TimestamptzType(), 1512151975038194, "2017-12-01T18:12:55.038194Z"),

Review Comment:
   👌 Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885097696


##########
python/src/iceberg/transforms.py:
##########
@@ -224,6 +226,67 @@ def hash(self, value: UUID) -> int:
         )
 
 
+class IdentityTransform(Transform[S, S]):
+    """Transforms a value into itself.
+
+    Example:
+        >>> transform = IdentityTransform(StringType())
+        >>> transform.apply('hello-world')
+        'hello-world'
+    """
+
+    def __init__(self, source_type: IcebergType):
+        super().__init__(
+            "identity",
+            f"transforms.identity(source_type={repr(source_type)})",
+        )
+        self._type = source_type
+
+    def apply(self, value: Optional[S]) -> Optional[S]:
+        return value
+
+    def can_transform(self, source: IcebergType) -> bool:
+        return source.is_primitive
+
+    def result_type(self, source: IcebergType) -> IcebergType:
+        return source
+
+    @property
+    def preserves_order(self) -> bool:
+        return True
+
+    def satisfies_order_of(self, other: Transform) -> bool:
+        """ordering by value is the same as long as the other preserves order"""
+        return other.preserves_order
+
+    def to_human_string(self, value: Optional[S]) -> str:
+        return self._human_string(value)
+
+    @singledispatchmethod
+    def _human_string(self, value: Optional[S]) -> str:
+        return str(value) if value is not None else "null"
+
+    @_human_string.register(int)
+    def _(self, value: int) -> str:
+        if isinstance(self._type, DateType):
+            return datetime.to_human_day(value)
+        elif isinstance(self._type, TimeType):
+            return datetime.to_human_time(value)
+        elif isinstance(self._type, TimestampType):
+            return datetime.to_human_timestamp(value)
+        elif isinstance(self._type, TimestamptzType):
+            return datetime.to_human_timestamptz(value)
+        else:
+            return str(value)
+
+    @_human_string.register(bytes)
+    def _(self, value: bytes) -> str:
+        if type(self._type) in {FixedType, BinaryType}:

Review Comment:
   There is no need for an `if` here. The only way to handle binary is to base64 encode it. It doesn't depend on the type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885104716


##########
python/src/iceberg/utils/binary.py:
##########
@@ -0,0 +1,25 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+"""Helper methods for working with bytes (FixedType or BinaryType) data
+"""
+
+import base64
+
+
+def base64encode(buffer: bytes) -> str:

Review Comment:
   Actually, I think I'd just move this one to avoid needing to move EPOCH constants.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r885104317


##########
python/src/iceberg/utils/datetime.py:
##########
@@ -63,3 +65,27 @@ def timestamptz_to_micros(timestamptz_str: str) -> int:
     if ISO_TIMESTAMPTZ.fullmatch(timestamptz_str):
         return datetime_to_micros(datetime.fromisoformat(timestamptz_str))
     raise ValueError(f"Invalid timestamp with zone: {timestamptz_str} (must be ISO-8601)")
+
+
+def to_human_day(day_ordinal: int) -> str:
+    """Converts a DateType value to human string"""
+    to_time = EPOCH_TIMESTAMP + timedelta(days=day_ordinal)
+    return "{0:0=4d}-{1:0=2d}-{2:0=2d}".format(to_time.year, to_time.month, to_time.day)

Review Comment:
   This conversion looks good.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a diff in pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
jun-he commented on code in PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#discussion_r895253856


##########
python/src/iceberg/transforms.py:
##########
@@ -224,6 +227,80 @@ def hash(self, value: UUID) -> int:
         )
 
 
+def _base64encode(buffer: bytes) -> str:
+    """Converts bytes to base64 string"""
+    return base64.b64encode(buffer).decode("ISO-8859-1")
+
+
+class IdentityTransform(Transform[S, S]):
+    """Transforms a value into itself.
+
+    Example:
+        >>> transform = IdentityTransform(StringType())
+        >>> transform.apply('hello-world')
+        'hello-world'
+    """
+
+    def __init__(self, source_type: IcebergType):
+        super().__init__(
+            "identity",
+            f"transforms.identity(source_type={repr(source_type)})",
+        )
+        self._type = source_type
+
+    def apply(self, value: Optional[S]) -> Optional[S]:
+        return value
+
+    def can_transform(self, source: IcebergType) -> bool:
+        return source.is_primitive
+
+    def result_type(self, source: IcebergType) -> IcebergType:
+        return source
+
+    @property
+    def preserves_order(self) -> bool:
+        return True
+
+    def satisfies_order_of(self, other: Transform) -> bool:
+        """ordering by value is the same as long as the other preserves order"""
+        return other.preserves_order
+
+    def to_human_string(self, value: Optional[S]) -> str:
+        return self._human_string(value)
+
+    @singledispatchmethod
+    def _human_string(self, value: Optional[S]) -> str:

Review Comment:
   You mean moving them out of the class?
   
   Wondering if we can save `_human_string` but only keeps `_int_to_human_string`.
   like
   ```
       def to_human_string(self, value: Optional[S]) -> str:
           if value is None:
               return "null"
           elif isinstance(value, bytes):
               return _base64encode(value)
           elif isinstance(value, int):
               return self._int_to_human_string(self._type, value)
           else:
               return str(value)
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4908: [Python] add identity transform

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4908:
URL: https://github.com/apache/iceberg/pull/4908#issuecomment-1153277273

   @jun-he, this is about ready to go. The main problem is that it reintroduces `tox.ini`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org