Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/06 22:57:57 UTC

[GitHub] [iceberg] dramaticlly opened a new pull request, #4717: Python: Add PartitionSpec

dramaticlly opened a new pull request, #4717:
URL: https://github.com/apache/iceberg/pull/4717

   This is the second step toward completing https://github.com/apache/iceberg/issues/3228.
   
   This change includes PartitionSpec but not its builder, as I want to keep the changelist as small as possible. As suggested in https://github.com/apache/iceberg/issues/4631, construction of a PartitionSpec should rely on its Builder class, which performs the proper checks; that logic should not be duplicated in the dunder init method.
   
   ```python
   from iceberg.schema import Schema
   from iceberg.transforms import bucket
   from iceberg.types import BooleanType, IntegerType, NestedField, StringType
   from iceberg.table.partitioning import PartitionSpec, PartitionField
   
   table_schema = Schema(
       NestedField(field_id=1, name="foo", field_type=StringType(), is_optional=False),
       NestedField(field_id=2, name="bar", field_type=IntegerType(), is_optional=True),
       NestedField(field_id=3, name="baz", field_type=BooleanType(), is_optional=False),
       schema_id=1,
       identifier_field_ids=[1],
   )
   bucket_transform = bucket(IntegerType(), 100)
   foo_field = PartitionField(source_id=1, field_id=1001, transform=bucket_transform, name="foo_bucket")
   partition_spec = PartitionSpec(schema=table_schema, spec_id=0, fields=(foo_field,), last_assigned_field_id=1001)
   
   >>> partition_spec
   PartitionSpec: [
     1001: foo_bucket: bucket[100](1)
   ]
   ``` 
   
   I will follow up with the PartitionSpecBuilder class in a separate PR.
   
   CC @samredai @rdblue @dhruv-pratap




[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r885926747


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,79 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def name(self) -> str:
-        return self._name
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START

Review Comment:
   Thanks Fokko, I think this is where I want to leave out the example, since I intend to construct the PartitionSpec via the [dedicated builder](https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L337) (not included in this PR); that is the preferred way to construct a PartitionSpec, with convenient transform helper methods and built-in validation. I can include the example there, what do you think?





[GitHub] [iceberg] dhruv-pratap commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dhruv-pratap commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r869581923


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +69,95 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Iterable[PartitionField], last_assigned_field_id: int):

Review Comment:
   Since we do not have a "builder" for it, should we enforce keyword-only arguments here for better readability? The same goes for `PartitionField` as well.
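   
   For illustration, a minimal sketch of what keyword-only enforcement could look like with a plain `__init__` (simplified, untyped `schema`; not the PR's actual class):
   ```python
   from typing import Iterable, Tuple
   
   
   class PartitionSpecSketch:
       """Sketch only: the bare `*` makes every constructor argument keyword-only."""
   
       def __init__(self, *, schema: object, spec_id: int, fields: Iterable, last_assigned_field_id: int):
           self._schema = schema
           self._spec_id = spec_id
           self._fields: Tuple = tuple(fields)
           self._last_assigned_field_id = last_assigned_field_id
   
   
   # PartitionSpecSketch(None, 0, (), 1000) would raise TypeError;
   # callers must spell out each argument:
   spec = PartitionSpecSketch(schema=None, spec_id=0, fields=(), last_assigned_field_id=1000)
   ```
   If the class later becomes a dataclass, `@dataclass(kw_only=True)` (Python 3.10+) gives the same behaviour.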
   



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))

Review Comment:
   @dramaticlly Why not use `@dataclasses.dataclass` or `@attrs.frozen`, which will implement these dunder methods for you and reduce the boilerplate code?



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +69,95 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Iterable[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = tuple(fields)
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self._fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField, ...]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
+        if not self._fields_by_source_id:
+            for partition_field in self.fields:
+                source_column = self.schema.find_column_name(partition_field.source_id)
+                if not source_column:
+                    raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+                existing = self._fields_by_source_id.get(partition_field.source_id, [])
+                existing.append(partition_field)
+                self._fields_by_source_id[partition_field.source_id] = existing

Review Comment:
   I feel this field value should be derived in `__init__`, or `__post_init__` if you are using `@dataclass` or `@attrs`. The reason is that it validates the correctness of the object state and should raise `ValueError` as soon as the object is created. Raising the error here seems too late.



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():

Review Comment:
   @dramaticlly Again, I would just use `@dataclasses.dataclass` or `@attrs.frozen`, which will implement all the dunder methods, and override them only where you need special behavior.



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"

Review Comment:
   IMO, and this is just me, this feels like too much code for just `__str__` and `__repr__`. I would just take what `dataclass` or `attrs` gives me, even for the special unpartitioned case.





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r872653887


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +69,95 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Iterable[PartitionField], last_assigned_field_id: int):

Review Comment:
   Hey @dhruv-pratap, as I mentioned, this is the second part of rolling out partitioning; we will have a follow-up PR for the PartitionSpec builder, which will be the right way to construct a PartitionSpec. More context is in https://github.com/apache/iceberg/issues/4631. I purposely left the builder outside the scope of this PR.





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r872612178


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():

Review Comment:
   > I think it's a bit nasty to construct it exactly like what Java implementation does
   
   I don't think it is necessary to construct it exactly like Java. But it should not have a special case here, since `is_unpartitioned` can include specs with `void` transforms. This should show all of the transforms rather than special-casing.





[GitHub] [iceberg] samredai commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
samredai commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r872701045


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -14,8 +14,13 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Dict, Iterable, List, Tuple
+
+from iceberg.schema import Schema
 from iceberg.transforms import Transform
 
+_PARTITION_DATA_ID_START: int = 1000
+
 
 class PartitionField:

Review Comment:
   @dramaticlly that's an interesting idea I hadn't thought of. A big argument for using the builder pattern was that we wanted `PartitionSpec` to be immutable, which would require us to include a ton of validation logic (everything that would be in a builder) in the `__init__` method. If I understand your suggestion, using `__post_init__` would allow us to have a typical init method, but then include the builder-type validation logic in `__post_init__`, which would fail the initialization of any invalid `PartitionSpec`.
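   
   For illustration, a minimal sketch of that idea, assuming a frozen dataclass (field names simplified; not the PR's actual validation):
   ```python
   from dataclasses import dataclass
   from typing import Tuple
   
   
   @dataclass(frozen=True)
   class SpecSketch:
       spec_id: int
       fields: Tuple[str, ...]
   
       def __post_init__(self):
           # Builder-style check that runs at construction time: an invalid
           # object can never be observed because initialization raises here.
           if self.spec_id < 0:
               raise ValueError(f"Invalid spec_id: {self.spec_id}")
   
   
   SpecSketch(spec_id=0, fields=("foo_bucket",))  # ok
   # SpecSketch(spec_id=-1, fields=())            # raises ValueError immediately
   ```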





[GitHub] [iceberg] dramaticlly commented on pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#issuecomment-1152864897

   Thank you Fokko and Sam for reviewing. @rdblue, can you take another look?
   
   




[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r876112058


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -14,8 +14,13 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Dict, Iterable, List, Tuple
+
+from iceberg.schema import Schema
 from iceberg.transforms import Transform
 
+_PARTITION_DATA_ID_START: int = 1000
+
 
 class PartitionField:

Review Comment:
   Is the plan to use `@dataclass`? I like that idea, but I won't hold off on reviewing if we want to get it in like this first.





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r876496116


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,76 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+
+
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def name(self) -> str:
-        return self._name
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False)
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)

Review Comment:
   Based on https://docs.python.org/3.8/library/dataclasses.html#frozen-instances and https://stackoverflow.com/questions/53756788/how-to-set-the-value-of-dataclass-field-in-post-init-when-frozen-true



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,76 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+
+
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def name(self) -> str:
-        return self._name
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False)
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Equality check on spec_id and partition fields only
+        """
+        return self.spec_id == other.spec_id and self.fields == other.fields
 
     def __str__(self):
-        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+        """
+        PartitionSpec str method highlight the partition field only
+        """
+        result_str = "["
+        for partition_field in self.fields:
+            result_str += f"\n  {str(partition_field)}"
+        if len(self.fields) > 0:
+            result_str += "\n"
+        result_str += "]"
+        return result_str
 
-    def __repr__(self):
-        return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"

Review Comment:
   FYI, the default `__repr__` from dataclass now also includes derived attributes such as `source_id_to_fields_map`; please let me know if anyone feels strongly about overriding it to keep the original behaviour.
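   
   For illustration, a minimal sketch of that behaviour with simplified names (not the PR's classes): the generated `__repr__` includes `field(init=False)` attributes by default, and passing `repr=False` to `field()` is one way to keep the derived map out of it.
   ```python
   from dataclasses import dataclass, field
   from typing import Dict
   
   
   @dataclass(frozen=True)
   class SpecReprSketch:
       spec_id: int
       by_source: Dict[int, str] = field(init=False)  # add repr=False here to hide it
   
       def __post_init__(self):
           # frozen=True blocks normal assignment, so the derived map is set this way
           object.__setattr__(self, "by_source", {1: "foo_bucket"})
   
   
   print(SpecReprSketch(spec_id=0))
   # SpecReprSketch(spec_id=0, by_source={1: 'foo_bucket'})
   ```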





[GitHub] [iceberg] Fokko commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r882821386


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -14,8 +14,13 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Dict, Iterable, List, Tuple
+
+from iceberg.schema import Schema
 from iceberg.transforms import Transform
 
+_PARTITION_DATA_ID_START: int = 1000
+
 
 class PartitionField:

Review Comment:
   We can also go fully Pythonic and bump the dataclass to a Pydantic model. With Pydantic you can annotate the fields with validators: https://pydantic-docs.helpmanual.io/usage/validators/
   
   We could use the generated OpenAPI classes as the base classes and extend from those:
   https://github.com/apache/iceberg/pull/4858/files#diff-4f32e455c8da9fc5dc641048dc398741b72e928f359bfb9e5ef3640e7d32873e
   
   This also allows us to add validation. For example, `BaseUserModel` is the class generated from OpenAPI, and `UserModel` is the one extended with all the (convenience) methods attached to it:
   ```python
   from pydantic import BaseModel, ValidationError, validator
   
   class BaseUserModel(BaseModel):
       name: str
       username: str
       password1: str
       password2: str
   
       
   class UserModel(BaseUserModel):
       @validator('name')
       def name_must_contain_space(cls, v):
           if ' ' not in v:
               raise ValueError('must contain a space')
           return v.title()
       @validator('password2')
       def passwords_match(cls, v, values, **kwargs):
           if 'password1' in values and v != values['password1']:
               raise ValueError('passwords do not match')
           return v
       @validator('username')
       def username_alphanumeric(cls, v):
           assert v.isalnum(), 'must be alphanumeric'
           return v
   
   user = UserModel(
       name='samuel colvin',
       username='scolvin',
       password1='zxcvbn',
       password2='zxcvbn',
   )
   print(user)
   #> name='Samuel Colvin' username='scolvin' password1='zxcvbn' password2='zxcvbn'
   
   try:
       UserModel(
           name='samuel',
           username='scolvin',
           password1='zxcvbn',
           password2='zxcvbn2',
       )
   except ValidationError as e:
       print(e)
       """
       2 validation errors for UserModel
       name
         must contain a space (type=value_error)
       password2
         passwords do not match (type=value_error)
       """
   ```





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868762001


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():

Review Comment:
   Rather than special-casing `is_unpartitioned`, I think this should just construct the inner string differently. The Java implementation doesn't use special cases.





[GitHub] [iceberg] Fokko commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r872655581


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +69,95 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Iterable[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = tuple(fields)
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self._fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField, ...]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
+        if not self._fields_by_source_id:
+            for partition_field in self.fields:
+                source_column = self.schema.find_column_name(partition_field.source_id)
+                if not source_column:
+                    raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+                existing = self._fields_by_source_id.get(partition_field.source_id, [])
+                existing.append(partition_field)
+                self._fields_by_source_id[partition_field.source_id] = existing
+        return self._fields_by_source_id[field_id]
+
+    def compatible_with(self, other) -> bool:

Review Comment:
   ```suggestion
       def compatible_with(self, other: "PartitionSpec") -> bool:
   ```



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -14,8 +14,13 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Dict, Iterable, List, Tuple
+
+from iceberg.schema import Schema
 from iceberg.transforms import Transform
 
+_PARTITION_DATA_ID_START: int = 1000
+
 
 class PartitionField:

Review Comment:
   Instead of implementing `__eq__` and `__hash__` ourselves, we could leverage the dataclasses library. If we set `eq=True` and `frozen=True` (which makes it immutable, which is also nice), then we get `__hash__` automatically:
   
   If `eq` and `frozen` are both true, by default [dataclass()](https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass) will generate a `__hash__()` method for you. If eq is true and frozen is false, `__hash__()` will be set to `None`, marking it unhashable (which it is, since it is mutable). If `eq` is false, `__hash__()` will be left untouched meaning the `__hash__()` method of the superclass will be used (if the superclass is [object](https://docs.python.org/3/library/functions.html#object), this means it will fall back to id-based hashing).
   
   More information here: https://docs.python.org/3/library/dataclasses.html
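   
   For illustration, a minimal sketch of that behaviour with a simplified field class (not the PR's definition):
   ```python
   from dataclasses import dataclass
   
   
   @dataclass(frozen=True)  # eq=True is the default; eq + frozen generates __hash__
   class FieldSketch:
       source_id: int
       field_id: int
       name: str
   
   
   a = FieldSketch(1, 1001, "foo_bucket")
   b = FieldSketch(1, 1001, "foo_bucket")
   assert a == b and hash(a) == hash(b)  # value-based equality and hashing for free
   assert len({a, b}) == 1               # usable in sets and as dict keys
   ```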



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +69,95 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Iterable[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = tuple(fields)
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self._fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField, ...]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
+        if not self._fields_by_source_id:
+            for partition_field in self.fields:
+                source_column = self.schema.find_column_name(partition_field.source_id)
+                if not source_column:
+                    raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+                existing = self._fields_by_source_id.get(partition_field.source_id, [])
+                existing.append(partition_field)
+                self._fields_by_source_id[partition_field.source_id] = existing
+        return self._fields_by_source_id[field_id]
+
+    def compatible_with(self, other) -> bool:
+        """
+        Returns true if this partition spec is equivalent to the other, with partition field_id ignored.
+        That is, if both specs have the same number of fields, field order, field name, source column ids, and transforms.
+        """
+        if self == other:
+            return True
+        if len(self.fields) != len(other.fields):
+            return False
+        for index in range(len(self.fields)):

Review Comment:
   I would write this as:
   ```python
   return all(
       this_field.source_id == that_field.source_id and this_field.transform == that_field.transform and this_field.name == that_field.name
       for this_field, that_field
       in zip(self.fields, other.fields)
   )
   ```
   Or rely on the `eq` method :)





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r885199154


##########
python/tests/table/test_partitioning.py:
##########
@@ -29,8 +30,37 @@ def test_partition_field_init():
     assert partition_field.transform == bucket_transform
     assert partition_field.name == "id"
     assert partition_field == partition_field
+    print(str(partition_field))
+    print("repr")
+    print(repr(partition_field))

Review Comment:
   Thank you, I was testing and forgot to remove these. Removed.





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868758762


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))

Review Comment:
   Why is `__hash__` needed? Are partition fields used as map keys?





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868762915


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def get_fields_by_source_id(self, filed_id: int) -> List[PartitionField]:
+        if not self.fields_by_source_id:
+            for partition_field in self.fields:
+                source_column = self.schema.find_column_name(partition_field.source_id)
+                if not source_column:
+                    raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+                existing = self.fields_by_source_id.get(partition_field.source_id, [])
+                existing.append(partition_field)
+                self.fields_by_source_id[partition_field.source_id] = existing
+        return self.fields_by_source_id.get(filed_id)  # type: ignore

Review Comment:
   I think this relates to the mypy hint: `dict.get(field_id)` returns `Optional[List[PartitionField]]` instead of `List[PartitionField]`.
   
   Do you prefer `self.fields_by_source_id[field_id]`, or should I adjust the return type hint to be optional?
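   
   For illustration, a minimal sketch of the two options mypy would accept (simplified value types):
   ```python
   from typing import Dict, List, Optional
   
   fields_by_source_id: Dict[int, List[str]] = {1: ["foo_bucket"]}
   
   
   def strict_lookup(field_id: int) -> List[str]:
       # Indexing keeps the non-optional return type; a missing key raises KeyError.
       return fields_by_source_id[field_id]
   
   
   def lenient_lookup(field_id: int) -> Optional[List[str]]:
       # .get() returns None for a missing key, so the hint must be Optional.
       return fields_by_source_id.get(field_id)
   ```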





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r872613738


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))

Review Comment:
   > because it's being used in the hash of the PartitionSpec at line 129
   
   Why does `PartitionSpec` define `__hash__`? Will a spec be used as a key in a dict or map? I doubt it, so I'd probably remove both.





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r872689598


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -14,8 +14,13 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Dict, Iterable, List, Tuple
+
+from iceberg.schema import Schema
 from iceberg.transforms import Transform
 
+_PARTITION_DATA_ID_START: int = 1000
+
 
 class PartitionField:

Review Comment:
   Yeah, I agree that PartitionField is an immutable class after construction, so a dataclass with both `eq` and `frozen` sounds fair to me.
   
   On the other side, I think the biggest benefit of the dataclass is the `__post_init__` method, which allows for processing equivalent to the Java builder pattern when we build the PartitionSpec. There is a collection of validations that need to happen, and I am discussing this with @samredai in https://github.com/apache/iceberg/issues/4631#issuecomment-1113632408.
   
   From what I can tell, we will need a `PartitionSpecBuilder` class as a convenient way to construct the PartitionSpec, but we also want to make sure we avoid duplicating the builder logic in an overly complex init method for PartitionSpec.





[GitHub] [iceberg] samredai commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
samredai commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r882941190


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -14,8 +14,13 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Dict, Iterable, List, Tuple
+
+from iceberg.schema import Schema
 from iceberg.transforms import Transform
 
+_PARTITION_DATA_ID_START: int = 1000
+
 
 class PartitionField:

Review Comment:
   I really like this idea of wrapping the classes generated from the OpenAPI spec. The class naming here might be tricky. Module namespacing allows us to re-use the same name if we want, something like:
   ```py
   from iceberg.openapi import rest_catalog
   
   class PartitionField(rest_catalog.PartitionField):
       ...
   ```
   We shouldn't expect users to import from the openapi module directly, so we shouldn't need to worry about naming conflicts, right? Maybe we should name it `_openapi` just to be super clear about that.





[GitHub] [iceberg] rdblue commented on pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#issuecomment-1153290729

   Looks great. Thanks, @dramaticlly! I merged this.




[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868761455


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def get_fields_by_source_id(self, filed_id: int) -> List[PartitionField]:
+        if not self.fields_by_source_id:
+            for partition_field in self.fields:
+                source_column = self.schema.find_column_name(partition_field.source_id)
+                if not source_column:
+                    raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+                existing = self.fields_by_source_id.get(partition_field.source_id, [])
+                existing.append(partition_field)
+                self.fields_by_source_id[partition_field.source_id] = existing
+        return self.fields_by_source_id.get(filed_id)  # type: ignore
+
+    def compatible_with(self, other) -> bool:
+        """
+        Returns true if this partition spec is equivalent to the other, with partition field_id ignored.
+        That is, if both specs have the same number of fields, field order, field name, source column ids, and transforms.
+        """
+        if self.__eq__(other):

Review Comment:
   I will update to `if self == other`





[GitHub] [iceberg] Fokko commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r894871233


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,81 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec captures the transformation from table data to partition values
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def name(self) -> str:
-        return self._name
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField, ...]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False, repr=False)
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Produce a boolean to return True if two objects are considered equal
+
+        Note:
+            Equality of PartitionSpec is determined by spec_id and partition fields only
+        """
+        if not isinstance(other, PartitionSpec):
+            return False
+        return self.spec_id == other.spec_id and self.fields == other.fields
 
     def __str__(self):
-        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+        """
+        Produce a human-readable string representation of PartitionSpec
 
-    def __repr__(self):
-        return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+        Note:
+            Only include list of partition fields in the PartitionSpec's string representation
+        """
+        result_str = "["
+        if self.fields:
+            result_str += "\n  " + "\n  ".join([str(field) for field in self.fields]) + "\n"
+        result_str += "]"
+        return result_str
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
+        return self.source_id_to_fields_map[field_id]
+
+    def compatible_with(self, other: "PartitionSpec") -> bool:

Review Comment:
   We should also assert the number of fields. The following test is passing:
   ```python
   def test_partition_compatible_with(table_schema_simple: Schema):
       bucket_transform = bucket(IntegerType(), 4)
       field1 = PartitionField(3, 100, bucket_transform, "id")
       field2 = PartitionField(3, 102, bucket_transform, "id")
       lhs = PartitionSpec(table_schema_simple, 0, (field1,), 1001)
       rhs = PartitionSpec(table_schema_simple, 0, (field1, field2), 1001)
       assert lhs.compatible_with(rhs)
   ```
   
   We also do this on the Java side: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L211
   
   Python's `zip` will only pair up as many fields as the shorter of the two tuples has, so the extra field on the longer side is silently ignored.
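   A minimal sketch of the check with the length guard added (attribute names are taken from the diff above; the final code may differ):
   ```python
   def compatible_with(self, other: "PartitionSpec") -> bool:
       if self == other:
           return True
       if len(self.fields) != len(other.fields):
           return False
       return all(
           this_field.source_id == that_field.source_id
           and this_field.transform == that_field.transform
           and this_field.name == that_field.name
           for this_field, that_field in zip(self.fields, other.fields)
       )
   ```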



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,81 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec captures the transformation from table data to partition values
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def name(self) -> str:
-        return self._name
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField, ...]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False, repr=False)
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Produce a boolean to return True if two objects are considered equal
+
+        Note:
+            Equality of PartitionSpec is determined by spec_id and partition fields only
+        """
+        if not isinstance(other, PartitionSpec):
+            return False
+        return self.spec_id == other.spec_id and self.fields == other.fields
 
     def __str__(self):
-        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+        """
+        Produce a human-readable string representation of PartitionSpec
 
-    def __repr__(self):
-        return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+        Note:
+            Only include list of partition fields in the PartitionSpec's string representation
+        """
+        result_str = "["
+        if self.fields:
+            result_str += "\n  " + "\n  ".join([str(field) for field in self.fields]) + "\n"
+        result_str += "]"
+        return result_str
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1

Review Comment:
   supernit: We could shorten this to:
   ```suggestion
           return not self.fields
   ```





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r885109922


##########
python/tests/table/test_partitioning.py:
##########
@@ -29,8 +30,37 @@ def test_partition_field_init():
     assert partition_field.transform == bucket_transform
     assert partition_field.name == "id"
     assert partition_field == partition_field
+    print(str(partition_field))
+    print("repr")
+    print(repr(partition_field))

Review Comment:
   Can you remove all print statements from the tests?





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r885919277


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,79 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def name(self) -> str:
-        return self._name
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False, repr=False)
+
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Equality check on spec_id and partition fields only
+        """
+        return self.spec_id == other.spec_id and self.fields == other.fields

Review Comment:
   Thanks Fokko, I realized this as well, and it also applies to a couple of other classes. Do you think we should let it throw as-is, or would you prefer a type check via `isinstance` so we can fail with our own message?
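   For example, the type-checked variant would look roughly like this (returning False for non-PartitionSpec objects; returning `NotImplemented` instead would be another option):
   ```python
   def __eq__(self, other):
       if not isinstance(other, PartitionSpec):
           return False
       return self.spec_id == other.spec_id and self.fields == other.fields
   ```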





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r889419341


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,79 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def name(self) -> str:
-        return self._name
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False, repr=False)
+
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Equality check on spec_id and partition fields only
+        """
+        return self.spec_id == other.spec_id and self.fields == other.fields

Review Comment:
   Thank you @samredai, updated.





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868773667


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def get_fields_by_source_id(self, filed_id: int) -> List[PartitionField]:
+        if not self.fields_by_source_id:
+            for partition_field in self.fields:
+                source_column = self.schema.find_column_name(partition_field.source_id)
+                if not source_column:
+                    raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+                existing = self.fields_by_source_id.get(partition_field.source_id, [])
+                existing.append(partition_field)
+                self.fields_by_source_id[partition_field.source_id] = existing
+        return self.fields_by_source_id.get(filed_id)  # type: ignore

Review Comment:
   Updated to `self.fields_by_source_id[field_id]` and removed the `# type: ignore` override.





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r872652195


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():

Review Comment:
   >I don't think it is necessary to construct it exactly like Java. 
   
   Got it, thanks for clarifying.
   
   > is_unpartitioned can include specs with void transforms
   
   But I am a bit confused on this point. From what I can tell, a spec with a void transform is never unpartitioned per the existing Java code https://github.com/apache/iceberg/blob/32841a273b6dda897f6efece4ebb178f8ad54719/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L97-L103. If a void transform is added to a PartitionSpec, that spec contains a partition field whose Transform is VoidTransform, so it is not `unpartitioned`, as shown below:
   
   ```
   > PartitionSpec ps = PartitionSpec.builderFor(SCHEMA).alwaysNull("ts").build();
   > System.out.println(ps.toString());
   [
     1000: ts_null: void(5)
   ]
   
   > System.out.println(ps.isUnpartitioned());
   false
   ```
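   
   On the Python side the same semantics would hold: `is_unpartitioned()` only checks whether the spec has zero fields. A quick sketch, assuming the `table_schema_simple` fixture and `bucket` transform used in the tests (any field, void or not, makes the spec partitioned):
   ```python
   from iceberg.table.partitioning import PartitionField, PartitionSpec
   from iceberg.transforms import bucket
   from iceberg.types import IntegerType
   
   # a spec with no fields is unpartitioned
   unpartitioned = PartitionSpec(table_schema_simple, 0, (), 1000)
   assert unpartitioned.is_unpartitioned()
   
   # a spec with any partition field is not
   partitioned = PartitionSpec(
       table_schema_simple, 0, (PartitionField(3, 1000, bucket(IntegerType(), 4), "id_bucket"),), 1000
   )
   assert not partitioned.is_unpartitioned()
   ```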





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868760787


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def get_fields_by_source_id(self, filed_id: int) -> List[PartitionField]:
+        if not self.fields_by_source_id:
+            for partition_field in self.fields:
+                source_column = self.schema.find_column_name(partition_field.source_id)
+                if not source_column:
+                    raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+                existing = self.fields_by_source_id.get(partition_field.source_id, [])
+                existing.append(partition_field)
+                self.fields_by_source_id[partition_field.source_id] = existing
+        return self.fields_by_source_id.get(filed_id)  # type: ignore
+
+    def compatible_with(self, other) -> bool:
+        """
+        Returns true if this partition spec is equivalent to the other, with partition field_id ignored.
+        That is, if both specs have the same number of fields, field order, field name, source column ids, and transforms.
+        """
+        if self.__eq__(other):

Review Comment:
   We usually don't call these methods directly.





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868760597


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def get_fields_by_source_id(self, filed_id: int) -> List[PartitionField]:

Review Comment:
   How about `find_field_by_source`?





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868759430


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):

Review Comment:
   Should `fields` be `Iterable` rather than being so strict? It could easily be converted to a tuple internally.
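   For example, a sketch of the looser signature, normalizing whatever iterable is passed into a tuple (only the changed lines of the `__init__` above are shown; `Iterable` and `Tuple` are already imported from `typing` in this file):
   ```python
   def __init__(self, schema: Schema, spec_id: int, fields: Iterable[PartitionField], last_assigned_field_id: int):
       self._schema = schema
       self._spec_id = spec_id
       self._fields: Tuple[PartitionField, ...] = tuple(fields)
       self._last_assigned_field_id = last_assigned_field_id
   ```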





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868759028


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000

Review Comment:
   This isn't used here, and it probably doesn't belong in the class anyway. Can you make it a top-level constant and prefix it with `_`?





[GitHub] [iceberg] dramaticlly commented on pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#issuecomment-1130873434

   Updated the PR to use what dataclasses can offer for PartitionField and PartitionSpec. The scope of this change stays the same; a follow-up PR will add a convenient Builder for PartitionSpec.
   
   I'd appreciate a review; if you can take a look, that would be great! @rdblue @samredai @Fokko




[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r894983066


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,81 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec captures the transformation from table data to partition values
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def name(self) -> str:
-        return self._name
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField, ...]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False, repr=False)
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Produce a boolean to return True if two objects are considered equal
+
+        Note:
+            Equality of PartitionSpec is determined by spec_id and partition fields only
+        """
+        if not isinstance(other, PartitionSpec):
+            return False
+        return self.spec_id == other.spec_id and self.fields == other.fields
 
     def __str__(self):
-        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+        """
+        Produce a human-readable string representation of PartitionSpec
 
-    def __repr__(self):
-        return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+        Note:
+            Only include list of partition fields in the PartitionSpec's string representation
+        """
+        result_str = "["
+        if self.fields:
+            result_str += "\n  " + "\n  ".join([str(field) for field in self.fields]) + "\n"
+        result_str += "]"
+        return result_str
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
+        return self.source_id_to_fields_map[field_id]
+
+    def compatible_with(self, other: "PartitionSpec") -> bool:

Review Comment:
   Thanks for catching this for me, I forgot about it when moving from the loop to zip. I added the equality and length checks back. Your proposed unit test also looks helpful, so I added it in the latest commit as well!





[GitHub] [iceberg] Fokko commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r884003558


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,76 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+
+
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def name(self) -> str:
-        return self._name
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False)
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Equality check on spec_id and partition fields only
+        """
+        return self.spec_id == other.spec_id and self.fields == other.fields
 
     def __str__(self):
-        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+        """
+        PartitionSpec str method highlight the partition field only
+        """
+        result_str = "["
+        for partition_field in self.fields:
+            result_str += f"\n  {str(partition_field)}"
+        if len(self.fields) > 0:
+            result_str += "\n"
+        result_str += "]"
+        return result_str
 
-    def __repr__(self):
-        return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"

Review Comment:
   You can easily restore the original behavior by changing:
   ```python
       source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False)
   ```
   into:
   ```
       source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False, repr=False)
   ```





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868771265


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():

Review Comment:
   I think it's a bit nasty to construct it exactly like the Java implementation does, so the special casing is essentially simplified from https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L299. The unpartitioned case differs from the rest only by one fewer `\n` right before the `]`.
   
   so here's my alternative way to construct it; not sure how you like it compared to what I have right now.
   
   
   
   ```python
       def __str__(self):
           result_str = "["
           for partition_field in self.fields:
               result_str += f"\n  {str(partition_field)}"
           if self.is_unpartitioned():
               result_str += "]"
           else:
               result_str += "\n]"
           return result_str
   ```
   



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"

Review Comment:
   
   Python's `str.join(iterable)` does not have a way to attach a head and tail, so I figured it might be easier to read it this way, as in `f"{head}{delimiter.join(partition_fields_in_str)}{tail}"`.
   
   But I can certainly change it quickly if the above is what you prefer.





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868761485


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def get_fields_by_source_id(self, filed_id: int) -> List[PartitionField]:
+        if not self.fields_by_source_id:
+            for partition_field in self.fields:
+                source_column = self.schema.find_column_name(partition_field.source_id)
+                if not source_column:
+                    raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+                existing = self.fields_by_source_id.get(partition_field.source_id, [])
+                existing.append(partition_field)
+                self.fields_by_source_id[partition_field.source_id] = existing
+        return self.fields_by_source_id.get(filed_id)  # type: ignore

Review Comment:
   Why is `type: ignore` needed? Can you get type checking working by adding annotations?
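
Not the PR's code, but a minimal sketch of the typing issue behind that `type: ignore`: `dict.get(key)` alone is typed as returning an `Optional`, while supplying a default (an empty list here) keeps the declared return type, so mypy accepts it without an ignore.

```python
from typing import Dict, List


def find_fields(cache: Dict[int, List[str]], source_id: int) -> List[str]:
    # cache.get(source_id) on its own would be Optional[List[str]];
    # the default keeps the result typed as List[str].
    return cache.get(source_id, [])


print(find_fields({1: ["foo_bucket"]}, 2))  # []
```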





[GitHub] [iceberg] samredai commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
samredai commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r883076143


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -14,8 +14,13 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Dict, Iterable, List, Tuple
+
+from iceberg.schema import Schema
 from iceberg.transforms import Transform
 
+_PARTITION_DATA_ID_START: int = 1000
+
 
 class PartitionField:

Review Comment:
   Great point. I just checked the [dependencies](https://github.com/samuelcolvin/pydantic/blob/master/setup.py#L132) for pydantic, and the good news is that all it requires is `typing-extensions`, which is probably just for some Python 3.7 backports. That will probably even get dropped at some point when they no longer support 3.7.





[GitHub] [iceberg] samredai commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
samredai commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r889277698


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,79 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def name(self) -> str:
-        return self._name
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False, repr=False)
+
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Equality check on spec_id and partition fields only
+        """
+        return self.spec_id == other.spec_id and self.fields == other.fields

Review Comment:
   I think what we want to do here, if `other` is not an instance of `PartitionSpec`, is just return `False`, so that `PartitionSpec(...) == "foo"` returns `False` instead of raising an error. Same thing for cases where `other` is `None`.
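
A minimal sketch of that pattern with a placeholder class (not the PR's final code): bail out with `False` before touching `other`'s attributes.

```python
class Spec:
    def __init__(self, spec_id, fields):
        self.spec_id = spec_id
        self.fields = fields

    def __eq__(self, other):
        # Non-Spec values (including None) compare unequal instead of raising.
        if not isinstance(other, Spec):
            return False
        return self.spec_id == other.spec_id and self.fields == other.fields


print(Spec(0, ()) == "foo")  # False
print(Spec(0, ()) == None)   # False, no AttributeError
```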





[GitHub] [iceberg] Fokko commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r885357634


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,79 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def name(self) -> str:
-        return self._name
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False, repr=False)
+
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Equality check on spec_id and partition fields only
+        """
+        return self.spec_id == other.spec_id and self.fields == other.fields
 
     def __str__(self):
-        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+        """
+        Produce a human-readable string representation of PartitionSpec
+
+        Note:
+            Only include list of partition fields in the PartitionSpec's string representation
+        """
+        result_str = "["
+        for partition_field in self.fields:
+            result_str += f"\n  {str(partition_field)}"

Review Comment:
   Instead of:
   
   ```python
   >>> from iceberg.types import NestedField, StringType, FixedType
   >>> 
   >>> fields = [
   ...   NestedField(
   ...     field_id=19,
   ...     name='foo',
   ...     field_type=FixedType(22),
   ...     is_optional=False,
   ...   ),
   ...   NestedField(
   ...     field_id=25,
   ...     name='foo',
   ...     field_type=StringType(),
   ...     is_optional=True,
   ...   ),
   ... ]
   
   >>> result_str = "["
   >>> for partition_field in fields:
   ...   result_str += f"\n  {str(partition_field)}"
   ... 
   >>> if len(fields) > 0:
   ...   result_str += "\n"
   ... 
   >>> result_str += "]"
   >>> 
   >>> print(result_str)
   [
     19: foo: required fixed[22]
     25: foo: optional string
   ]
   ```
   
   We could also write:
   ```python
   result_str = "["
   if fields:
     result_str += "\n  " + "\n  ".join([str(field) for field in fields]) + "\n"
   result_str += "]"
   
   print(result_str)
   ```
   
   Which is equivalent
   ```python
   >>> result_str = "["
   >>> if fields:
   ...   result_str += "\n  " + "\n  ".join([str(field) for field in fields]) + "\n"
   ... 
   >>> result_str += "]"
   >>> 
   >>> print(result_str)
   [
     19: foo: required fixed[22]
     25: foo: optional string
   ]
   ```
   This is more readable for me at least :)





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r885928150


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,79 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def name(self) -> str:
-        return self._name
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False, repr=False)
+
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Equality check on spec_id and partition fields only
+        """
+        return self.spec_id == other.spec_id and self.fields == other.fields
 
     def __str__(self):
-        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+        """
+        Produce a human-readable string representation of PartitionSpec
+
+        Note:
+            Only include list of partition fields in the PartitionSpec's string representation
+        """
+        result_str = "["
+        for partition_field in self.fields:
+            result_str += f"\n  {str(partition_field)}"

Review Comment:
   Thanks @Fokko, readability is all I care about; will update per your suggestion!





[GitHub] [iceberg] samredai commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
samredai commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r894764046


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,81 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values

Review Comment:
   nit: `s/capture/captures`





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r894775249


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,81 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values

Review Comment:
   Thanks, fixed it





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868759842


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"

Review Comment:
   Why not inline these?
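
Purely for illustration, an inlined version of that branch (assuming the same tuple of stringable partition fields) could look like:

```python
def fields_to_str(fields) -> str:
    # Unpartitioned specs render as "[]"; otherwise one indented line per field.
    if not fields:
        return "[]"
    return "[\n  " + "\n  ".join(str(f) for f in fields) + "\n]"


print(fields_to_str(()))  # []
print(fields_to_str(("1001: foo_bucket: bucket[100](1)",)))
```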





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868760540


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))

Review Comment:
   yeah, I added it back because it's being used in the hash of the PartitionSpec at line 129
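
For readers less familiar with the Python details being referenced: defining `__eq__` on a class sets `__hash__` to `None` unless a hash is also defined, and hashing a tuple delegates to each element's `__hash__`. A tiny sketch with a placeholder class (not the PR's code):

```python
class Field:
    def __init__(self, field_id, name):
        self.field_id = field_id
        self.name = name

    def __eq__(self, other):
        return isinstance(other, Field) and (self.field_id, self.name) == (other.field_id, other.name)

    # Without this, defining __eq__ above would leave __hash__ as None and the
    # tuple hash below would raise "TypeError: unhashable type".
    def __hash__(self):
        return hash((self.field_id, self.name))


spec_id, fields = 0, (Field(1001, "foo_bucket"),)
print(hash((spec_id, fields)))  # works because every tuple element is hashable
```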





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868761055


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def get_fields_by_source_id(self, filed_id: int) -> List[PartitionField]:

Review Comment:
   Sure, I can rename it to `find_field_by_source`





[GitHub] [iceberg] rdblue merged pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #4717:
URL: https://github.com/apache/iceberg/pull/4717




[GitHub] [iceberg] Fokko commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r885319727


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,79 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def name(self) -> str:
-        return self._name
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False, repr=False)
+
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Equality check on spec_id and partition fields only
+        """
+        return self.spec_id == other.spec_id and self.fields == other.fields

Review Comment:
   Nit: It would be good to check that `other` is not `None`, and maybe also `isinstance(other, PartitionSpec)`; otherwise `other.spec_id` could throw an exception.





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r885109575


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,76 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+
+
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def name(self) -> str:
-        return self._name
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False)
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Equality check on spec_id and partition fields only
+        """
+        return self.spec_id == other.spec_id and self.fields == other.fields
 
     def __str__(self):
-        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+        """
+        PartitionSpec str method highlight the partition field only

Review Comment:
   This isn't really a docstring. It is more of a comment to explain the logic in the method. If you want to add it as a docstring, then I think it should be real docs (like "Produce a human-readable string representation of PartitionSpec") and have this as an additional note.
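
A sketch of the docstring shape being suggested (the wording mirrors the phrasing that shows up in a later revision of this PR):

```python
def __str__(self):
    """
    Produce a human-readable string representation of PartitionSpec

    Note:
        Only include list of partition fields in the PartitionSpec's string representation
    """
    ...
```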





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r885199019


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,76 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+
+
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def name(self) -> str:
-        return self._name
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False)
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Equality check on spec_id and partition fields only
+        """
+        return self.spec_id == other.spec_id and self.fields == other.fields
 
     def __str__(self):
-        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+        """
+        PartitionSpec str method highlight the partition field only

Review Comment:
   Thanks @rdblue, updated per your suggestion





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r885926747


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,79 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def name(self) -> str:
-        return self._name
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START

Review Comment:
   Thanks Fokko, I think this is where I want to leave out the example, since I intend to construct the PartitionSpec via the [dedicated builder](https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L337) (which is not included in this PR); that's the desired way to construct a PartitionSpec, with convenient transform helper methods and built-in validation. I can include the example here; what do you think?





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r872689598


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -14,8 +14,13 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Dict, Iterable, List, Tuple
+
+from iceberg.schema import Schema
 from iceberg.transforms import Transform
 
+_PARTITION_DATA_ID_START: int = 1000
+
 
 class PartitionField:

Review Comment:
   yeah, I agree that PartitionField is an immutable class after construction, so a dataclass with both eq and frozen sounds fair to me.
   
   for reference, this is what an immutable PartitionField would look like, with all test cases passing (small ordering change on repr, but I think the default one is very close to what we have today in the Java impl):
   ```python
   @dataclass(frozen=True)
   class PartitionField:
       """
       PartitionField is a single element with name and unique id,
       It represents how one partition value is derived from the source column via transformation
   
       Attributes:
           source_id(int): The source column id of table's schema
           field_id(int): The partition field id across all the table metadata's partition specs
           transform(Transform): The transform used to produce partition values from source column
           name(str): The name of this partition field
       """
       source_id: int
       field_id: int
       transform: Transform
       name: str
   
       def __str__(self):
           return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
   ```
   
   On the other side, I think the biggest benefit of the dataclass is the [`__post_init__`](https://docs.python.org/3.7/library/dataclasses.html#post-init-processing) method, which allows for Java-builder-pattern-equivalent processing when we build the PartitionSpec. There's a collection of validations that need to happen, and I am discussing them with @samredai in https://github.com/apache/iceberg/issues/4631#issuecomment-1113632408.
   
   From what I can tell, we will need a `PartitionSpecBuilder` class with a convenient way to construct the PartitionSpec, but we also want to make sure we avoid duplicating the builder logic in an overly complex init method for PartitionSpec.
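
Purely as a hypothetical illustration of that builder idea (this API is not part of this PR, and the eventual `PartitionSpecBuilder` may look quite different), one could imagine something along these lines:

```python
from typing import List

from iceberg.table.partitioning import PartitionField, PartitionSpec

_PARTITION_DATA_ID_START = 1000  # mirrors the module-level constant quoted above


class PartitionSpecBuilder:
    """Hypothetical sketch: collect fields, validate, then build an immutable PartitionSpec."""

    def __init__(self, schema, spec_id: int = 0):
        self._schema = schema
        self._spec_id = spec_id
        self._fields: List[PartitionField] = []
        self._last_field_id = _PARTITION_DATA_ID_START - 1

    def add(self, source_id: int, transform, name: str) -> "PartitionSpecBuilder":
        # Checks such as "source column exists" or "unique field names" would live here.
        self._last_field_id += 1
        self._fields.append(PartitionField(source_id, self._last_field_id, transform, name))
        return self

    def build(self) -> PartitionSpec:
        return PartitionSpec(self._schema, self._spec_id, tuple(self._fields), self._last_field_id)
```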







[GitHub] [iceberg] dramaticlly commented on pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#issuecomment-1146180455

   Thanks @Fokko for reviewing; appreciate one more pass on this, @rdblue and @samredai




[GitHub] [iceberg] Fokko commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r885321594


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,79 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
 
-    @property
-    def name(self) -> str:
-        return self._name
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START

Review Comment:
   Would be great if we can also add an example, like https://github.com/apache/iceberg/blob/master/python/src/iceberg/types.py#L83-L87; this will also test the `__str__` method, as the examples are executed as tests as well 👍🏻
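
As a minimal sketch of that kind of example (hypothetical wording, and deliberately an unpartitioned spec so the doctest does not depend on a schema lookup), the class docstring could carry something like:

```python
    Example:
        >>> # schema=None is acceptable here because an empty spec never resolves columns
        >>> str(PartitionSpec(schema=None, spec_id=0, fields=(), last_assigned_field_id=1000))
        '[]'
```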





[GitHub] [iceberg] Fokko commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r885316537


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -14,8 +14,13 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Dict, Iterable, List, Tuple
+
+from iceberg.schema import Schema
 from iceberg.transforms import Transform
 
+_PARTITION_DATA_ID_START: int = 1000
+
 
 class PartitionField:

Review Comment:
   I would not consider this blocking for now.





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r894982929


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -29,38 +36,81 @@ class PartitionField:
         name(str): The name of this partition field
     """
 
-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str
+
+    def __str__(self):
+        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+
 
-    @property
-    def source_id(self) -> int:
-        return self._source_id
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec captures the transformation from table data to partition values
 
-    @property
-    def field_id(self) -> int:
-        return self._field_id
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
 
-    @property
-    def name(self) -> str:
-        return self._name
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField, ...]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False, repr=False)
 
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)
 
     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Produce a boolean to return True if two objects are considered equal
+
+        Note:
+            Equality of PartitionSpec is determined by spec_id and partition fields only
+        """
+        if not isinstance(other, PartitionSpec):
+            return False
+        return self.spec_id == other.spec_id and self.fields == other.fields
 
     def __str__(self):
-        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
+        """
+        Produce a human-readable string representation of PartitionSpec
 
-    def __repr__(self):
-        return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+        Note:
+            Only include list of partition fields in the PartitionSpec's string representation
+        """
+        result_str = "["
+        if self.fields:
+            result_str += "\n  " + "\n  ".join([str(field) for field in self.fields]) + "\n"
+        result_str += "]"
+        return result_str
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1

Review Comment:
   Thanks @Fokko, updated.





[GitHub] [iceberg] samredai commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
samredai commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r872701045


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -14,8 +14,13 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Dict, Iterable, List, Tuple
+
+from iceberg.schema import Schema
 from iceberg.transforms import Transform
 
+_PARTITION_DATA_ID_START: int = 1000
+
 
 class PartitionField:

Review Comment:
   @dramaticlly that's an interesting idea I hadn't thought of. A big argument for using the builder pattern was that we wanted `PartitionSpec` to be immutable, which would otherwise require us to include a ton of validation logic (everything that would be in a builder) in the `__init__` method. If I understand your suggestion, using `__post_init__` would allow us to have a typical init method but still include the builder-style validation logic in `__post_init__`, which would fail the initialization of any invalid `PartitionSpec`. cc: @rdblue what do you think?
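
The pattern being discussed can be sketched as follows: a frozen dataclass whose `__post_init__` performs builder-style validation and attaches derived state. The class name, fields, and validation rule below are chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImmutableSpec:
    spec_id: int
    fields: tuple

    def __post_init__(self) -> None:
        # Builder-style validation: reject invalid instances at construction time.
        if self.spec_id < 0:
            raise ValueError(f"spec_id must be non-negative, got {self.spec_id}")
        # Derived state can still be attached to a frozen dataclass
        # by bypassing the frozen __setattr__.
        object.__setattr__(self, "field_count", len(self.fields))


spec = ImmutableSpec(spec_id=0, fields=())   # constructs fine
# ImmutableSpec(spec_id=-1, fields=())       # would raise ValueError
```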





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r872653887


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +69,95 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Iterable[PartitionField], last_assigned_field_id: int):

Review Comment:
   hey @dhruv-pratap, I specifically mentioned that this is the second part of rolling out partitioning; we will have a follow-up PR for the PartitionSpec builder, as that will be the right way to construct a PartitionSpec. I left it outside the scope of this PR so it doesn't get too large.
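
A rough sketch of what such a builder might look like is shown below; the class name, method names, and starting field id are assumptions made for illustration and do not reflect the follow-up PR's actual design.

```python
from typing import Any, List, Tuple


class PartitionSpecBuilderSketch:
    """Hypothetical builder: accumulate fields, validate them, and only then
    hand out an immutable collection."""

    def __init__(self, schema: Any, spec_id: int = 0) -> None:
        self._schema = schema
        self._spec_id = spec_id
        self._fields: List[Tuple[int, int, Any, str]] = []
        self._last_assigned_field_id = 999  # assumption: partition field ids start at 1000

    def add_field(self, source_id: int, transform: Any, name: str) -> "PartitionSpecBuilderSketch":
        # Example of builder-side validation that should not live in __init__.
        if any(existing_name == name for _, _, _, existing_name in self._fields):
            raise ValueError(f"Duplicate partition field name: {name}")
        self._last_assigned_field_id += 1
        self._fields.append((source_id, self._last_assigned_field_id, transform, name))
        return self  # allow chaining

    def build(self) -> Tuple[Tuple[int, int, Any, str], ...]:
        # The real builder would validate against the schema and return a
        # PartitionSpec; this sketch just freezes the accumulated fields.
        return tuple(self._fields)
```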





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r872679457


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +69,95 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Iterable[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = tuple(fields)
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self._fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField, ...]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
+        if not self._fields_by_source_id:
+            for partition_field in self.fields:
+                source_column = self.schema.find_column_name(partition_field.source_id)
+                if not source_column:
+                    raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+                existing = self._fields_by_source_id.get(partition_field.source_id, [])
+                existing.append(partition_field)
+                self._fields_by_source_id[partition_field.source_id] = existing
+        return self._fields_by_source_id[field_id]
+
+    def compatible_with(self, other) -> bool:
+        """
+        Returns true if this partition spec is equivalent to the other, with partition field_id ignored.
+        That is, if both specs have the same number of fields, field order, field name, source column ids, and transforms.
+        """
+        if self == other:
+            return True
+        if len(self.fields) != len(other.fields):
+            return False
+        for index in range(len(self.fields)):

Review Comment:
   Thanks @Fokko, I think your suggestion looks clean, but I want to make sure we don't need a `]` after `self.fields`; I assume that's a typo and you meant:
   
   ```python
   in zip(self.fields, other.fields)
   ```
   
   On the other hand, I think this method differs from `__eq__` only in that it relaxes the `field_id` requirement for equality; everything else is the same.
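
To make the zip-based comparison concrete, here is a small duck-typed sketch of the relaxed equality being described; the field attributes are assumed to be `source_id`, `transform`, and `name`, with `field_id` deliberately ignored.

```python
from typing import Any


def compatible_with_sketch(this: Any, other: Any) -> bool:
    """True when both specs have the same number of fields and each pair of
    fields matches on source_id, transform, and name (field_id is ignored)."""
    if len(this.fields) != len(other.fields):
        return False
    return all(
        lhs.source_id == rhs.source_id
        and lhs.transform == rhs.transform
        and lhs.name == rhs.name
        for lhs, rhs in zip(this.fields, other.fields)
    )
```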





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868771265


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():

Review Comment:
   I think it's a bit awkward to construct it exactly the way the Java implementation does, so the special casing is essentially a simplification of https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L299. The unpartitioned case differs from the rest only by having one fewer `\n` right before the `]`.
   
   Here's my alternative way to construct it; I'm not sure how you like it compared to what I have right now.
   
   ```python
       def __str__(self):
           result_str = "["
           for partition_field in self.fields:
               result_str += f"\n  {str(partition_field)}"
           if self.is_unpartitioned():
               result_str += "]"
           else:
               result_str += "\n]"
           return result_str
   ```
   
   Python's `str.join(iterable)` does not have a way to attach a head and tail, so I figured it might be easier to read it this way, as in `f"{head}{delimiter.join(partition_fields_in_str)}{tail}"`.
   
   But I can certainly change it quickly if the above is what you prefer.
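
To check that the two constructions agree, here is a small standalone comparison; the field strings are placeholders rather than real `PartitionField` output.

```python
fields = ["1000: id_bucket: (1)", "1001: ts_day: (2)"]

# Variant 1: join the field strings with an explicit head and tail.
delimiter = "\n  "
joined = f"[{delimiter}{delimiter.join(fields)}\n]" if fields else "[]"

# Variant 2: accumulate a string, special-casing only the closing bracket.
result = "["
for field_str in fields:
    result += f"\n  {field_str}"
result += "]" if not fields else "\n]"

assert joined == result
```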





[GitHub] [iceberg] rdblue commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r868760692


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec capture the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of data table
+        spec_id(int): any change to PartitionSpec will produce a new specId
+        fields(List[PartitionField): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def get_fields_by_source_id(self, filed_id: int) -> List[PartitionField]:

Review Comment:
   Typo: `filed` should be `field`.





[GitHub] [iceberg] kbendick commented on a diff in pull request #4717: Python: Add PartitionSpec

Posted by GitBox <gi...@apache.org>.
kbendick commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r882945870


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -14,8 +14,13 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Dict, Iterable, List, Tuple
+
+from iceberg.schema import Schema
 from iceberg.transforms import Transform
 
+_PARTITION_DATA_ID_START: int = 1000
+
 
 class PartitionField:

Review Comment:
   I'll hold off on commenting too much until I have a chance to look into the `pydantic` project as well as the other PR.
   
   My first question would be: how many dependencies are we bringing in if we add `pydantic`? I know that some folks were concerned about adding too many external Python dependencies, so as not to conflict with their own, but if the benefit is very large I'm not personally opposed to it (I believe it was somebody / some group from Netflix that originally requested we keep the number of required dependencies down).
   
   But validation, whether via a library or via a common pattern we settle on, is something that would be very beneficial.
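
For context on what declarative validation with `pydantic` looks like, here is a minimal sketch using the v1-style API; the model and the validation rule are illustrative only, not a proposal for the actual Iceberg classes.

```python
from pydantic import BaseModel, validator


class SpecModel(BaseModel):
    spec_id: int
    field_names: tuple

    class Config:
        allow_mutation = False  # immutable, similar in spirit to a frozen dataclass

    @validator("spec_id")
    def spec_id_must_be_non_negative(cls, value: int) -> int:
        if value < 0:
            raise ValueError(f"spec_id must be non-negative, got {value}")
        return value


SpecModel(spec_id=0, field_names=("id_bucket",))   # validates and constructs
# SpecModel(spec_id=-1, field_names=())            # raises a ValidationError
```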


