Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/10 19:14:22 UTC

[GitHub] [iceberg] dhruv-pratap commented on a diff in pull request #4717: Python: Add PartitionSpec

dhruv-pratap commented on code in PR #4717:
URL: https://github.com/apache/iceberg/pull/4717#discussion_r869581923


##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +69,95 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec captures the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of the data table
+        spec_id(int): any change to PartitionSpec will produce a new spec_id
+        fields(List[PartitionField]): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Iterable[PartitionField], last_assigned_field_id: int):

Review Comment:
   Since we do not have a "builder" for it, should we enforce keyword-only arguments here for better readability? The same goes for `PartitionField` as well.
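
   A minimal sketch of the suggestion (the `Point` class and its fields are hypothetical, not the PR code): a bare `*` in the parameter list makes every following argument keyword-only, so call sites cannot silently swap positions.

```python
# Hypothetical sketch: the bare "*" forces callers to pass every
# argument by keyword; positional calls raise TypeError.
class Point:
    def __init__(self, *, x: int, y: int):
        self.x = x
        self.y = y

p = Point(x=1, y=2)   # OK: explicit keywords
# Point(1, 2)         # TypeError: takes 1 positional argument
```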
   



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))

Review Comment:
   @dramaticlly Why not use `@dataclasses.dataclass` or `@attrs.frozen`, which will implement these dunder methods for you and reduce the boilerplate code?
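
   A sketch of what that could look like (field types simplified; the real `transform` is a `Transform` object, not a `str`): with `frozen=True`, the dataclass generates `__init__`, `__eq__`, `__repr__`, and `__hash__` from the declared fields.

```python
from dataclasses import dataclass

# Sketch: frozen=True makes instances immutable and derives the
# dunder methods from the fields, replacing the hand-written ones.
@dataclass(frozen=True)
class PartitionField:
    source_id: int
    field_id: int
    name: str
    transform: str  # simplified stand-in for the real Transform type

f = PartitionField(source_id=1, field_id=1000, name="ts_day", transform="day")
```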



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +69,95 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec captures the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of the data table
+        spec_id(int): any change to PartitionSpec will produce a new spec_id
+        fields(List[PartitionField]): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Iterable[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = tuple(fields)
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self._fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField, ...]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"
+            return f"{head}{delimiter.join(partition_fields_in_str)}{tail}"
+
+    def __repr__(self):
+        return f"PartitionSpec: {str(self)}"
+
+    def __hash__(self):
+        return hash((self.spec_id, self.fields))
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
+        if not self._fields_by_source_id:
+            for partition_field in self.fields:
+                source_column = self.schema.find_column_name(partition_field.source_id)
+                if not source_column:
+                    raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+                existing = self._fields_by_source_id.get(partition_field.source_id, [])
+                existing.append(partition_field)
+                self._fields_by_source_id[partition_field.source_id] = existing

Review Comment:
   I feel this field value should be derived in `__init__`, or `__post_init__` if you are using `@dataclass` or `@attrs`. That way it validates the correctness of the object state and raises `ValueError` as soon as the object is created. Raising that error lazily here seems too late.
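
   A sketch of deriving the mapping eagerly (names and the schema lookup are simplified stand-ins for the classes in the diff): an unknown source column now fails at construction time instead of on the first lookup.

```python
from typing import Dict, Iterable, List, NamedTuple, Set

# Simplified stand-in for the PartitionField in the diff.
class PartitionField(NamedTuple):
    source_id: int
    field_id: int
    name: str

class PartitionSpec:
    def __init__(self, fields: Iterable[PartitionField], known_source_ids: Set[int]):
        self.fields = tuple(fields)
        # Derived eagerly: a bad source_id raises ValueError right here,
        # validating the object state at construction time.
        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
        for pf in self.fields:
            if pf.source_id not in known_source_ids:
                raise ValueError(f"Cannot find source column: {pf.source_id}")
            self.fields_by_source_id.setdefault(pf.source_id, []).append(pf)
```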



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec captures the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of the data table
+        spec_id(int): any change to PartitionSpec will produce a new spec_id
+        fields(List[PartitionField]): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():

Review Comment:
   @dramaticlly Again, I would just use `@dataclasses.dataclass` or `@attrs.frozen`, which will implement all the dunder methods, and override them only where you need special behavior.
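
   A sketch of that pattern (fields simplified from the diff): keep the generated `__init__`/`__eq__`/`__hash__`/`__repr__` and override only the one dunder that needs special behavior, here `__str__` for the unpartitioned case.

```python
from dataclasses import dataclass
from typing import Tuple

# Sketch: only __str__ is hand-written; every other dunder comes
# from the frozen dataclass machinery.
@dataclass(frozen=True)
class PartitionSpec:
    spec_id: int
    fields: Tuple[str, ...] = ()  # simplified: strings instead of PartitionField

    def __str__(self) -> str:
        return "[]" if not self.fields else "[" + ", ".join(self.fields) + "]"
```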



##########
python/src/iceberg/table/partitioning.py:
##########
@@ -64,3 +67,100 @@ def __str__(self):
 
     def __repr__(self):
         return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+
+    def __hash__(self):
+        return hash((self.source_id, self.field_id, self.name, self.transform))
+
+
+class PartitionSpec:
+    """
+    PartitionSpec captures the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of the data table
+        spec_id(int): any change to PartitionSpec will produce a new spec_id
+        fields(List[PartitionField]): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    PARTITION_DATA_ID_START: int = 1000
+
+    def __init__(self, schema: Schema, spec_id: int, fields: Tuple[PartitionField], last_assigned_field_id: int):
+        self._schema = schema
+        self._spec_id = spec_id
+        self._fields = fields
+        self._last_assigned_field_id = last_assigned_field_id
+        # derived
+        self.fields_by_source_id: Dict[int, List[PartitionField]] = {}
+
+    @property
+    def schema(self) -> Schema:
+        return self._schema
+
+    @property
+    def spec_id(self) -> int:
+        return self._spec_id
+
+    @property
+    def fields(self) -> Tuple[PartitionField]:
+        return self._fields
+
+    @property
+    def last_assigned_field_id(self) -> int:
+        return self._last_assigned_field_id
+
+    def __eq__(self, other):
+        return self.spec_id == other.spec_id and self.fields == other.fields
+
+    def __str__(self):
+        if self.is_unpartitioned():
+            return "[]"
+        else:
+            delimiter = "\n  "
+            partition_fields_in_str = (str(partition_field) for partition_field in self.fields)
+            head = f"[{delimiter}"
+            tail = f"\n]"

Review Comment:
   IMO, and this is just me, this feels like too much code for just `__str__` and `__repr__`. I would just take what `dataclass` or `attrs` gives me, even for the special unpartitioned case.
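
   For illustration (fields simplified from the diff): with no custom `__str__`/`__repr__` at all, the generated dataclass repr already prints every field in a stable, readable form, including the empty (unpartitioned) case.

```python
from dataclasses import dataclass
from typing import Tuple

# Sketch: no __str__/__repr__ overrides; the dataclass default repr
# covers both the partitioned and the unpartitioned case.
@dataclass(frozen=True)
class PartitionSpec:
    spec_id: int
    fields: Tuple[str, ...] = ()  # simplified: strings instead of PartitionField

print(repr(PartitionSpec(spec_id=0)))  # PartitionSpec(spec_id=0, fields=())
```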



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

