You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/06 23:34:56 UTC

[GitHub] [iceberg] rdblue opened a new pull request, #6131: Python: Add initial TableScan implementation

rdblue opened a new pull request, #6131:
URL: https://github.com/apache/iceberg/pull/6131

   This adds an implementation of `TableScan` that is an alternative to the one in #6069. This doesn't implement `plan_files`, it is just to demonstrate a possible scan API.
   
   This scan API works like the Java scan API, but also allows passing scan options when creating an initial scan. Both of these are supported:
   
   ```python
   scan = table.scan(
       row_filter=In("id", [5, 6, 7]),
       selected_fields=("id", "data"),
       snapshot_id=1234567890
     )
   # OR
   scan = table.scan() \
       .filter_rows(In("id", [5, 6, 7]))
       .select("id", "data")
       .use_snapshot(1234567890)
   ```
   
   I think this is a reasonable way to get more pythonic (by passing optional arguments) and also mostly match the API and behavior in the JVM implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on pull request #6131: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.

Fokko commented on PR #6131:
URL: https://github.com/apache/iceberg/pull/6131#issuecomment-1305996028

   I would also like:
   ```python
   scan = table.scan(
       row_filter=col("id") in [5, 6, 7],
       selected_fields=("id", "data"),
       snapshot_id=1234567890
   )
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on a diff in pull request #6131: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.

Fokko commented on code in PR #6131:
URL: https://github.com/apache/iceberg/pull/6131#discussion_r1018441052


##########
python/pyiceberg/table/__init__.py:
##########
@@ -90,3 +103,90 @@ def snapshot_by_name(self, name: str) -> Optional[Snapshot]:
     def history(self) -> List[SnapshotLogEntry]:
         """Get the snapshot history of this table."""
         return self.metadata.snapshot_log
+
+
+class TableScan:
+    _always_true: ClassVar[BooleanExpression] = AlwaysTrue()
+    table: Table
+    row_filter: BooleanExpression
+    partition_filter: BooleanExpression
+    selected_fields: tuple[str]
+    case_sensitive: bool
+    snapshot_id: Optional[int]
+    options: Properties
+
+    def __init__(
+        self,
+        *,
+        table: Table,
+        row_filter: BooleanExpression = _always_true,
+        partition_filter: BooleanExpression = _always_true,
+        selected_fields: tuple[str] = ("*",),
+        case_sensitive: bool = True,
+        snapshot_id: Optional[int] = None,
+        options: Properties = EMPTY_DICT,
+    ):
+        self.table = table
+        self.row_filter = row_filter
+        self.partition_filter = partition_filter
+        self.selected_fields = selected_fields
+        self.case_sensitive = case_sensitive
+        self.snapshot_id = snapshot_id
+        self.options = options
+
+    def update(self, **overrides):
+        """Creates a copy of this table scan with updated fields."""
+        return TableScan(**{**self.__dict__, **overrides})
+
+    def snapshot(self):
+        if self.snapshot_id:
+            return self.table.snapshot_by_id(self.snapshot_id)
+
+        return self.table.current_snapshot()
+
+    def projection(self):
+        snapshot_schema = self.table.schemas().get(self.snapshot().schema_id) or self.table.schema()
+
+        if "*" in self.selected_fields:
+            return snapshot_schema
+
+        return snapshot_schema.select(*self.selected_fields, case_sensitive=self.case_sensitive)
+
+    def use_snapshot(self, snapshot_id: int):
+        if self.snapshot_id:
+            raise ValueError(f"Cannot override snapshot, already set snapshot id={self.snapshot_id}")
+        if self.table.snapshot_by_id(snapshot_id):
+            return self.update(snapshot_id=snapshot_id)
+
+        raise ValueError(f"Cannot scan unknown snapshot id={snapshot_id}")
+
+    def use_ref(self, name: str):
+        if self.snapshot_id:
+            raise ValueError(f"Cannot override ref, already set snapshot id={self.snapshot_id}")
+        if snapshot := self.table.snapshot_by_name(name):
+            return self.update(snapshot_id=snapshot.snapshot_id)
+
+        raise ValueError(f"Cannot scan unknown ref={name}")
+
+    def select(self, *field_names: str) -> "TableScan":
+        if "*" in self.selected_fields:
+            return self.update(selected_fields=field_names)
+        return self.update(selected_fields=tuple(set(self.selected_fields).intersection(field_names)))

Review Comment:
   Small bug:
   ```suggestion
           return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue closed pull request #6131: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.

rdblue closed pull request #6131: Python: Add initial TableScan implementation
URL: https://github.com/apache/iceberg/pull/6131


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on a diff in pull request #6131: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.

Fokko commented on code in PR #6131:
URL: https://github.com/apache/iceberg/pull/6131#discussion_r1016261138


##########
python/pyiceberg/table/__init__.py:
##########
@@ -14,30 +14,43 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
-
-
-from typing import Dict, List, Optional
-
-from pydantic import Field
-
+from typing import (
+    ClassVar,
+    Dict,
+    List,
+    Optional,
+)
+
+from pyiceberg.expressions import AlwaysTrue, And, BooleanExpression
 from pyiceberg.schema import Schema
 from pyiceberg.table.metadata import TableMetadata
 from pyiceberg.table.partitioning import PartitionSpec
 from pyiceberg.table.snapshots import Snapshot, SnapshotLogEntry
 from pyiceberg.table.sorting import SortOrder
-from pyiceberg.typedef import Identifier
-from pyiceberg.utils.iceberg_base_model import IcebergBaseModel
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+
 
+class Table:

Review Comment:
   The schema method was already there before this PR, why is removing it now? 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6131: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #6131:
URL: https://github.com/apache/iceberg/pull/6131#discussion_r1014914176


##########
python/pyiceberg/table/__init__.py:
##########
@@ -14,30 +14,43 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
-
-
-from typing import Dict, List, Optional
-
-from pydantic import Field
-
+from typing import (
+    ClassVar,
+    Dict,
+    List,
+    Optional,
+)
+
+from pyiceberg.expressions import AlwaysTrue, And, BooleanExpression
 from pyiceberg.schema import Schema
 from pyiceberg.table.metadata import TableMetadata
 from pyiceberg.table.partitioning import PartitionSpec
 from pyiceberg.table.snapshots import Snapshot, SnapshotLogEntry
 from pyiceberg.table.sorting import SortOrder
-from pyiceberg.typedef import Identifier
-from pyiceberg.utils.iceberg_base_model import IcebergBaseModel
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+
 
+class Table:

Review Comment:
   @Fokko, I removed Pydantic from this because it was conflicting with the `schema` method. I think we should probably make these changes to the `Table` class either way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on a diff in pull request #6131: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.

Fokko commented on code in PR #6131:
URL: https://github.com/apache/iceberg/pull/6131#discussion_r1015731779


##########
python/pyiceberg/table/__init__.py:
##########
@@ -14,30 +14,43 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
-
-
-from typing import Dict, List, Optional
-
-from pydantic import Field
-
+from typing import (
+    ClassVar,
+    Dict,
+    List,
+    Optional,
+)
+
+from pyiceberg.expressions import AlwaysTrue, And, BooleanExpression
 from pyiceberg.schema import Schema
 from pyiceberg.table.metadata import TableMetadata
 from pyiceberg.table.partitioning import PartitionSpec
 from pyiceberg.table.snapshots import Snapshot, SnapshotLogEntry
 from pyiceberg.table.sorting import SortOrder
-from pyiceberg.typedef import Identifier
-from pyiceberg.utils.iceberg_base_model import IcebergBaseModel
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+
 
+class Table:
+    identifier: Identifier
+    metadata: TableMetadata
+    metadata_location: Optional[str]
 
-class Table(IcebergBaseModel):
-    identifier: Identifier = Field()
-    metadata_location: str = Field()
-    metadata: TableMetadata = Field()
+    def __init__(self, identifier: Identifier, metadata: TableMetadata, metadata_location: Optional[str]):
+        self.identifier = identifier
+        self.metadata = metadata
+        self.metadata_location = metadata_location
 
     def refresh(self):
         """Refresh the current table metadata"""
         raise NotImplementedError("To be implemented")
 
+    def name(self) -> Identifier:
+        """Return the identifer of this table"""
+        return self.identifier
+
+    def scan(self, **kwargs):

Review Comment:
   ```suggestion
       def scan(self, **kwargs) -> TableScan:
   ```



##########
python/pyiceberg/table/__init__.py:
##########
@@ -90,3 +103,90 @@ def snapshot_by_name(self, name: str) -> Optional[Snapshot]:
     def history(self) -> List[SnapshotLogEntry]:
         """Get the snapshot history of this table."""
         return self.metadata.snapshot_log
+
+
+class TableScan:
+    _always_true: ClassVar[BooleanExpression] = AlwaysTrue()
+    table: Table
+    row_filter: BooleanExpression
+    partition_filter: BooleanExpression
+    selected_fields: tuple[str]
+    case_sensitive: bool
+    snapshot_id: Optional[int]
+    options: Properties
+
+    def __init__(
+        self,
+        *,
+        table: Table,
+        row_filter: BooleanExpression = _always_true,
+        partition_filter: BooleanExpression = _always_true,
+        selected_fields: tuple[str] = ("*",),
+        case_sensitive: bool = True,
+        snapshot_id: Optional[int] = None,
+        options: Properties = EMPTY_DICT,
+    ):
+        self.table = table
+        self.row_filter = row_filter
+        self.partition_filter = partition_filter
+        self.selected_fields = selected_fields
+        self.case_sensitive = case_sensitive
+        self.snapshot_id = snapshot_id
+        self.options = options
+
+    def update(self, **overrides):
+        """Creates a copy of this table scan with updated fields."""
+        return TableScan(**{**self.__dict__, **overrides})
+
+    def snapshot(self):
+        if self.snapshot_id:
+            return self.table.snapshot_by_id(self.snapshot_id)
+
+        return self.table.current_snapshot()
+
+    def projection(self):

Review Comment:
   ```suggestion
       def projection(self) -> Schema:
   ```



##########
python/pyiceberg/table/__init__.py:
##########
@@ -14,30 +14,43 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
-
-
-from typing import Dict, List, Optional
-
-from pydantic import Field
-
+from typing import (
+    ClassVar,
+    Dict,
+    List,
+    Optional,
+)
+
+from pyiceberg.expressions import AlwaysTrue, And, BooleanExpression
 from pyiceberg.schema import Schema
 from pyiceberg.table.metadata import TableMetadata
 from pyiceberg.table.partitioning import PartitionSpec
 from pyiceberg.table.snapshots import Snapshot, SnapshotLogEntry
 from pyiceberg.table.sorting import SortOrder
-from pyiceberg.typedef import Identifier
-from pyiceberg.utils.iceberg_base_model import IcebergBaseModel
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+
 
+class Table:

Review Comment:
   You could still override the method and it will work fine, only mypy will complain.



##########
python/pyiceberg/table/__init__.py:
##########
@@ -90,3 +103,90 @@ def snapshot_by_name(self, name: str) -> Optional[Snapshot]:
     def history(self) -> List[SnapshotLogEntry]:
         """Get the snapshot history of this table."""
         return self.metadata.snapshot_log
+
+
+class TableScan:
+    _always_true: ClassVar[BooleanExpression] = AlwaysTrue()

Review Comment:
   Not sure if this is necessary. `AlwaysTrue` is a singleton it will always return the same instance.



##########
python/pyiceberg/table/__init__.py:
##########
@@ -90,3 +103,90 @@ def snapshot_by_name(self, name: str) -> Optional[Snapshot]:
     def history(self) -> List[SnapshotLogEntry]:
         """Get the snapshot history of this table."""
         return self.metadata.snapshot_log
+
+
+class TableScan:
+    _always_true: ClassVar[BooleanExpression] = AlwaysTrue()
+    table: Table
+    row_filter: BooleanExpression
+    partition_filter: BooleanExpression
+    selected_fields: tuple[str]
+    case_sensitive: bool
+    snapshot_id: Optional[int]
+    options: Properties
+
+    def __init__(
+        self,
+        *,
+        table: Table,
+        row_filter: BooleanExpression = _always_true,
+        partition_filter: BooleanExpression = _always_true,
+        selected_fields: tuple[str] = ("*",),

Review Comment:
   ```suggestion
           selected_fields: Tuple[str] = ("*",),
   ```



##########
python/pyiceberg/table/__init__.py:
##########
@@ -90,3 +103,90 @@ def snapshot_by_name(self, name: str) -> Optional[Snapshot]:
     def history(self) -> List[SnapshotLogEntry]:
         """Get the snapshot history of this table."""
         return self.metadata.snapshot_log
+
+
+class TableScan:
+    _always_true: ClassVar[BooleanExpression] = AlwaysTrue()
+    table: Table
+    row_filter: BooleanExpression
+    partition_filter: BooleanExpression
+    selected_fields: tuple[str]
+    case_sensitive: bool
+    snapshot_id: Optional[int]
+    options: Properties
+
+    def __init__(
+        self,
+        *,
+        table: Table,
+        row_filter: BooleanExpression = _always_true,
+        partition_filter: BooleanExpression = _always_true,
+        selected_fields: tuple[str] = ("*",),
+        case_sensitive: bool = True,
+        snapshot_id: Optional[int] = None,
+        options: Properties = EMPTY_DICT,
+    ):
+        self.table = table
+        self.row_filter = row_filter
+        self.partition_filter = partition_filter
+        self.selected_fields = selected_fields
+        self.case_sensitive = case_sensitive
+        self.snapshot_id = snapshot_id
+        self.options = options
+
+    def update(self, **overrides):
+        """Creates a copy of this table scan with updated fields."""
+        return TableScan(**{**self.__dict__, **overrides})
+
+    def snapshot(self):

Review Comment:
   ```suggestion
       def snapshot(self) -> Snapshot:
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6131: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #6131:
URL: https://github.com/apache/iceberg/pull/6131#discussion_r1014914272


##########
python/pyiceberg/table/__init__.py:
##########
@@ -90,3 +103,90 @@ def snapshot_by_name(self, name: str) -> Optional[Snapshot]:
     def history(self) -> List[SnapshotLogEntry]:
         """Get the snapshot history of this table."""
         return self.metadata.snapshot_log
+
+
+class TableScan:
+    _always_true: ClassVar[BooleanExpression] = AlwaysTrue()

Review Comment:
   This was needed to avoid calling functions in arg defaults.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ddrinka commented on pull request #6131: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.

ddrinka commented on PR #6131:
URL: https://github.com/apache/iceberg/pull/6131#issuecomment-1305936075

   I'm just an outside observer here, but isn't there already a Python implementation that followed the Java API, but folks thought it would be good to do all this work to rewrite it to be more readable/usable for a Python-native developer?  Trying to include both options seems like it might bring added complexity and scope creep.
   
   For me the second example above (the Java-style) feels very Java and the first feels very Python.  If the whole goal is keeping the API Pythonic, shouldn't we stick with one Python-like API?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6131: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #6131:
URL: https://github.com/apache/iceberg/pull/6131#discussion_r1017248707


##########
python/pyiceberg/table/__init__.py:
##########
@@ -90,3 +103,90 @@ def snapshot_by_name(self, name: str) -> Optional[Snapshot]:
     def history(self) -> List[SnapshotLogEntry]:
         """Get the snapshot history of this table."""
         return self.metadata.snapshot_log
+
+
+class TableScan:
+    _always_true: ClassVar[BooleanExpression] = AlwaysTrue()

Review Comment:
   The problem is that `AlwaysTrue` is a class, and you have to use `AlwaysTrue()` to get the singleton.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #6131: Python: Add initial TableScan implementation

Posted by GitBox <gi...@apache.org>.

rdblue commented on PR #6131:
URL: https://github.com/apache/iceberg/pull/6131#issuecomment-1310623432

   Closing in favor of #6145.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org