You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/30 21:18:42 UTC

[GitHub] [iceberg] rdblue commented on a diff in pull request #6490: Python: Replace Pydantic with StructRecord

rdblue commented on code in PR #6490:
URL: https://github.com/apache/iceberg/pull/6490#discussion_r1059522065


##########
python/pyiceberg/manifest.py:
##########
@@ -76,137 +66,283 @@ def __repr__(self) -> str:
         return f"FileFormat.{self.name}"
 
 
-class DataFile(IcebergBaseModel):
-    content: DataFileContent = Field(default=DataFileContent.DATA)
-    file_path: str = Field()
-    file_format: FileFormat = Field()
-    partition: Dict[str, Any] = Field()
-    record_count: int = Field()
-    file_size_in_bytes: int = Field()
-    block_size_in_bytes: Optional[int] = Field()
-    column_sizes: Optional[Dict[int, int]] = Field()
-    value_counts: Optional[Dict[int, int]] = Field()
-    null_value_counts: Optional[Dict[int, int]] = Field()
-    nan_value_counts: Optional[Dict[int, int]] = Field()
-    distinct_counts: Optional[Dict[int, int]] = Field()
-    lower_bounds: Optional[Dict[int, bytes]] = Field()
-    upper_bounds: Optional[Dict[int, bytes]] = Field()
-    key_metadata: Optional[bytes] = Field()
-    split_offsets: Optional[List[int]] = Field()
-    equality_ids: Optional[List[int]] = Field()
-    sort_order_id: Optional[int] = Field()
-
-
-class ManifestEntry(IcebergBaseModel):
-    status: ManifestEntryStatus = Field()
-    snapshot_id: Optional[int] = Field()
-    sequence_number: Optional[int] = Field()
-    data_file: DataFile = Field()
-
-
-class PartitionFieldSummary(IcebergBaseModel):
-    contains_null: bool = Field()
-    contains_nan: Optional[bool] = Field()
-    lower_bound: Optional[bytes] = Field()
-    upper_bound: Optional[bytes] = Field()
-
-
-class ManifestFile(IcebergBaseModel):
-    manifest_path: str = Field()
-    manifest_length: int = Field()
-    partition_spec_id: int = Field()
-    content: ManifestContent = Field(default=ManifestContent.DATA)
-    sequence_number: int = Field(default=0)
-    min_sequence_number: int = Field(default=0)
-    added_snapshot_id: Optional[int] = Field()
-    added_data_files_count: Optional[int] = Field()
-    existing_data_files_count: Optional[int] = Field()
-    deleted_data_files_count: Optional[int] = Field()
-    added_rows_count: Optional[int] = Field()
-    existing_rows_counts: Optional[int] = Field()
-    deleted_rows_count: Optional[int] = Field()
-    partitions: Optional[List[PartitionFieldSummary]] = Field()
-    key_metadata: Optional[bytes] = Field()
-
-    def fetch_manifest_entry(self, io: FileIO) -> List[ManifestEntry]:
+class DataFile(Record):
+    @staticmethod
+    def from_record(record: Record, format_version: int) -> Union[DataFileV1, DataFileV2]:
+        if format_version == 1:
+            return DataFileV1(*record)
+        elif format_version == 2:
+            return DataFileV2(*record)
+        else:
+            raise ValueError(f"Unknown format-version: {format_version}")
+
+    file_path: str
+    file_format: FileFormat
+    partition: Record
+    record_count: int
+    file_size_in_bytes: int
+    block_size_in_bytes: Optional[int] = None
+    column_sizes: Optional[Dict[int, int]] = None
+    value_counts: Optional[Dict[int, int]] = None
+    null_value_counts: Optional[Dict[int, int]] = None
+    nan_value_counts: Optional[Dict[int, int]] = None
+    distinct_counts: Optional[Dict[int, int]] = None  # Does not seem to be used on the Java side!?
+    lower_bounds: Optional[Dict[int, bytes]] = None
+    upper_bounds: Optional[Dict[int, bytes]] = None
+    key_metadata: Optional[bytes] = None
+    split_offsets: Optional[List[int]] = None
+    equality_ids: Optional[List[int]] = None
+    sort_order_id: Optional[int] = None
+    content: DataFileContent = DataFileContent.DATA
+
+
+class DataFileV1(DataFile):
+    def __setitem__(self, pos: int, value: Any) -> None:

Review Comment:
   I think that these methods need to be dynamic based on the schema that is used to read the file. That's why we have a mapping from the canonical (full schema) positions and the positions from the expected schema: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseFile.java#L240-L242
   
   That would also make it so you can have just one `DataFile` class because v1 is just a projection of v2.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org