You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/08 08:41:30 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #6145: Python: Add initial TableScan implementation

Fokko opened a new pull request, #6145:
URL: https://github.com/apache/iceberg/pull/6145

   This adds an implementation of TableScan that is an alternative to the one in https://github.com/apache/iceberg/pull/6131. This doesn't implement plan_files, it is just to demonstrate a possible scan API:
   
   ```python
   scan = table.scan(
       row_filter=In("id", [5, 6, 7]),
       selected_fields=("id", "data"),
       snapshot_id=1234567890
   )
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6145: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6145:
URL: https://github.com/apache/iceberg/pull/6145#issuecomment-1310622915

   Thanks, @Fokko!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #6145: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #6145:
URL: https://github.com/apache/iceberg/pull/6145#issuecomment-1310545882

   Also threw in some tests 👍🏻 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6145: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6145:
URL: https://github.com/apache/iceberg/pull/6145#discussion_r1017249517


##########
python/pyiceberg/table/__init__.py:
##########
@@ -90,3 +119,53 @@ def snapshot_by_name(self, name: str) -> Optional[Snapshot]:
     def history(self) -> List[SnapshotLogEntry]:
         """Get the snapshot history of this table."""
         return self.metadata.snapshot_log
+
+
+class TableScan:
+    table: Table
+    row_filter: BooleanExpression
+    partition_filter: BooleanExpression
+    selected_fields: Tuple[str]
+    case_sensitive: bool
+    snapshot_id: Optional[int]
+    options: Properties
+
+    def __init__(
+        self,
+        table: Table,
+        row_filter: Optional[BooleanExpression],
+        partition_filter: Optional[BooleanExpression],
+        selected_fields: Tuple[str] = ("*",),
+        case_sensitive: bool = True,
+        snapshot_id: Optional[int] = None,
+        options: Properties = EMPTY_DICT,
+    ):
+        self.table = table
+        self.row_filter = row_filter or AlwaysTrue()
+        self.partition_filter = partition_filter or AlwaysTrue()

Review Comment:
   Looks good.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6145: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6145:
URL: https://github.com/apache/iceberg/pull/6145#discussion_r1017251652


##########
python/pyiceberg/table/__init__.py:
##########
@@ -90,3 +119,53 @@ def snapshot_by_name(self, name: str) -> Optional[Snapshot]:
     def history(self) -> List[SnapshotLogEntry]:
         """Get the snapshot history of this table."""
         return self.metadata.snapshot_log
+
+
+class TableScan:
+    table: Table
+    row_filter: BooleanExpression
+    partition_filter: BooleanExpression
+    selected_fields: Tuple[str]
+    case_sensitive: bool
+    snapshot_id: Optional[int]
+    options: Properties
+
+    def __init__(
+        self,
+        table: Table,
+        row_filter: Optional[BooleanExpression],
+        partition_filter: Optional[BooleanExpression],
+        selected_fields: Tuple[str] = ("*",),
+        case_sensitive: bool = True,
+        snapshot_id: Optional[int] = None,
+        options: Properties = EMPTY_DICT,
+    ):
+        self.table = table
+        self.row_filter = row_filter or AlwaysTrue()
+        self.partition_filter = partition_filter or AlwaysTrue()
+        self.selected_fields = selected_fields
+        self.case_sensitive = case_sensitive
+        self.snapshot_id = snapshot_id
+        self.options = options
+
+    def snapshot(self) -> Optional[Snapshot]:
+        if self.snapshot_id:
+            return self.table.snapshot_by_id(self.snapshot_id)
+        return self.table.current_snapshot()
+
+    def projection(self) -> Schema:
+        snapshot_schema = self.table.schema()
+        if snapshot := self.snapshot():
+            if snapshot_schema_id := snapshot.schema_id:
+                snapshot_schema = self.table.schemas()[snapshot_schema_id]
+
+        if "*" in self.selected_fields:
+            return snapshot_schema
+
+        return snapshot_schema.select(*self.selected_fields, case_sensitive=self.case_sensitive)
+
+    def plan_files(self):

Review Comment:
   I think this also needs the rest of the methods that can be used to refine the scan: https://github.com/apache/iceberg/pull/6131/files#diff-893817e2d326eb9fad68cc96f7742b3ca654a2e05355f9d80ed59886b36ac121R155-R186



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #6145: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #6145:
URL: https://github.com/apache/iceberg/pull/6145


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6145: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6145:
URL: https://github.com/apache/iceberg/pull/6145#discussion_r1018424359


##########
python/tests/cli/test_console.py:
##########
@@ -538,6 +542,7 @@ def test_json_describe_namespace_does_not_exists(_):
 def test_json_describe_table(_):
     runner = CliRunner()
     result = runner.invoke(run, ["--output=json", "describe", "default.foo"])
+    print(result.output)

Review Comment:
   Is this intended?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6145: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6145:
URL: https://github.com/apache/iceberg/pull/6145#discussion_r1018424138


##########
python/pyiceberg/table/__init__.py:
##########
@@ -32,13 +33,19 @@
 from pyiceberg.table.snapshots import Snapshot, SnapshotLogEntry
 from pyiceberg.table.sorting import SortOrder
 from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
-from pyiceberg.utils.iceberg_base_model import IcebergBaseModel
 
 
-class Table(IcebergBaseModel):
+class Table:
     identifier: Identifier = Field()
-    metadata_location: str = Field()
     metadata: TableMetadata = Field()
+    metadata_location: str = Field()

Review Comment:
   Should this also remove `Field()`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6145: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6145:
URL: https://github.com/apache/iceberg/pull/6145#discussion_r1018428906


##########
python/tests/cli/test_console.py:
##########
@@ -538,6 +542,7 @@ def test_json_describe_namespace_does_not_exists(_):
 def test_json_describe_table(_):
     runner = CliRunner()
     result = runner.invoke(run, ["--output=json", "describe", "default.foo"])
+    print(result.output)

Review Comment:
   Oops! Thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org