You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/03 19:36:31 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #4685: First version of the Accessor

Fokko opened a new pull request, #4685:
URL: https://github.com/apache/iceberg/pull/4685

   First version of the visitor going over the struct to construct the accessors.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r864981327


##########
python/src/iceberg/schema.py:
##########
@@ -290,6 +292,8 @@ def _(obj: StructType, visitor: SchemaVisitor[T]) -> T:
         visitor.before_field(field)
         try:
             result = visit(field.type, visitor)
+        except Exception:
+            logger.exception("Unable to visit the field in the schema")

Review Comment:
   Should this also log which field wasn't handled? Something like `f"Unable to visit field: {field}"`?



##########
python/src/iceberg/schema.py:
##########
@@ -290,6 +292,8 @@ def _(obj: StructType, visitor: SchemaVisitor[T]) -> T:
         visitor.before_field(field)
         try:
             result = visit(field.type, visitor)
+        except Exception:
+            logger.exception("Unable to visit the field in the schema")

Review Comment:
   Should this also log which field wasn't handled? Something like `f"Unable to visit schema field: {field}"`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868327742


##########
python/src/iceberg/schema.py:
##########
@@ -262,10 +267,45 @@ def primitive(self, primitive: PrimitiveType) -> T:
         ...  # pragma: no cover
 
 
+@dataclass(init=True, eq=True)
+class Accessor:
+    """An accessor for a specific position in a container that implements the StructProtocol"""
+
+    position: int
+    inner: Optional["Accessor"] = None
+
+    def __str__(self):
+        return f"Accessor(position={self.position},inner={self.inner})"
+
+    def __repr__(self):
+        return self.__str__()
+
+    def with_inner(self, accessor: "Accessor") -> "Accessor":
+        parent = copy.deepcopy(self)
+        acc = parent
+        while acc.inner:
+            acc = acc.inner
+        acc.inner = accessor
+        return parent
+
+    def get(self, container: StructProtocol) -> Any:
+        """Returns the value at self.position in `container`
+
+        Args:
+            container(StructProtocol): A container to access at position `self.position`
+
+        Returns:
+            Any: The value at position `self.position` in the container
+        """
+        return container.get(self.position)

Review Comment:
   Great catch. Just updated the code and added a test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] samredai commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
samredai commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868742221


##########
python/src/iceberg/schema.py:
##########
@@ -262,10 +267,45 @@ def primitive(self, primitive: PrimitiveType) -> T:
         ...  # pragma: no cover
 
 
+@dataclass(init=True, eq=True)
+class Accessor:
+    """An accessor for a specific position in a container that implements the StructProtocol"""
+
+    position: int
+    inner: Optional["Accessor"] = None
+
+    def __str__(self):
+        return f"Accessor(position={self.position},inner={self.inner})"
+
+    def __repr__(self):
+        return self.__str__()
+
+    def with_inner(self, accessor: "Accessor") -> "Accessor":
+        parent = copy.deepcopy(self)
+        acc = parent
+        while acc.inner:
+            acc = acc.inner
+        acc.inner = accessor
+        return parent
+
+    def get(self, container: StructProtocol) -> Any:
+        """Returns the value at self.position in `container`
+
+        Args:
+            container(StructProtocol): A container to access at position `self.position`
+
+        Returns:
+            Any: The value at position `self.position` in the container
+        """
+        return container.get(self.position)
+
+
 @singledispatch
 def visit(obj, visitor: SchemaVisitor[T]) -> T:
     """A generic function for applying a schema visitor to any point within a schema
 
+    The function traverses the schema in pre-order fashion

Review Comment:
   I agree that pre-order feels more intuitive, especially since the index gets built in ascending field ID order. That being said, it's not worth it if there are gotchas that we're missing here.
   
   > ...and I think makes the visitor you're building here more complicated because it can't rely on traversal to create the child accessors it needs.
   
   This is a great point--I'm noticing now that the visit method for NestedField here is becoming responsible for much than just visiting the NestedField. I couldn't think of a post-order solution for creating the wrapped accessor after the inner accessor is visited, but I completely missed what we're doing there in `_IndexByName`. I think mimicking that is a good idea and keeping something like a `self._inner` property on the `_BuildPositionAccessors` class.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871844026


##########
python/src/iceberg/schema.py:
##########
@@ -301,11 +336,10 @@ def _(obj: StructType, visitor: SchemaVisitor[T]) -> T:
 @visit.register(ListType)
 def _(obj: ListType, visitor: SchemaVisitor[T]) -> T:
     """Visit a ListType with a concrete SchemaVisitor"""
+
     visitor.before_list_element(obj.element)
-    try:
-        result = visit(obj.element.type, visitor)
-    finally:

Review Comment:
   This was very confusing when debugging, and I don't think we should swallow (at least not the broad Exception) those in silence. When visiting an element caused an error, it would actually crash on something else because the state was broken. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868723254


##########
python/src/iceberg/schema.py:
##########
@@ -262,10 +267,52 @@ def primitive(self, primitive: PrimitiveType) -> T:
         ...  # pragma: no cover
 
 
+@dataclass(init=True, eq=True)
+class Accessor:
+    """An accessor for a specific position in a container that implements the StructProtocol"""
+
+    position: int
+    inner: Optional["Accessor"] = None
+
+    def __str__(self):
+        return f"Accessor(position={self.position},inner={self.inner})"
+
+    def __repr__(self):
+        return self.__str__()
+
+    def with_inner(self, accessor: "Accessor") -> "Accessor":

Review Comment:
   Is there a better name for this method? It sounds like it should replace this accessor's `inner` field, but it actually sets the leaf `inner`. Maybe `with_inner_most`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r864990932


##########
python/src/iceberg/schema.py:
##########
@@ -503,21 +507,34 @@ class _BuildPositionAccessors(SchemaVisitor[Dict[int, "Accessor"]]):
     def __init__(self) -> None:
         self._index: Dict[int, Accessor] = {}
 
-    def schema(self, schema, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    def schema(self, schema: Schema, field_results: Dict[int, Accessor]) -> Dict[int, Accessor]:
+        return field_results
+
+    def struct(self, struct: StructType, field_results: List[Dict[int, Accessor]]) -> Dict[int, Accessor]:
+        fields = struct.fields
+        for idx, field in enumerate(fields):
+            result = field_results[idx]
+            if result:
+                self._index.update(result)
+            else:
+                self._index[field.field_id] = Accessor(field.field_id)

Review Comment:
   Yeah, it looks like this is creating accessors based on the field ID, but what we need is to create accessors that use the current schema's positions and return a map of them by field ID.
   
   For example, this schema:
   ```
   table {
     2: int id,
     1: string data,
     3: struct location {
       5: float lat,
       6: float long
     }
   }
   ```
   
   Should produce the following accessor map:
   
   ```
   1 -> Accessor(position=1)
   2 -> Accessor(position=0)
   3 -> Accessor(position=2)
   5 -> Accessor(position=2, inner=Accessor(position=0))
   6 -> Accessor(position=2, inner=Accessor(position=1))
   ```
   
   That way when an expression uses, `location.lat`, the accessor will call `row.get(2).get(0)` to return the value.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871454516


##########
python/src/iceberg/schema.py:
##########
@@ -301,11 +336,10 @@ def _(obj: StructType, visitor: SchemaVisitor[T]) -> T:
 @visit.register(ListType)
 def _(obj: ListType, visitor: SchemaVisitor[T]) -> T:
     """Visit a ListType with a concrete SchemaVisitor"""
+
     visitor.before_list_element(obj.element)
-    try:
-        result = visit(obj.element.type, visitor)
-    finally:

Review Comment:
   Why remove the try/finally? It seems safer to have it, although I'll admit that it would be weird for an exception to be thrown while visiting.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r867670979


##########
python/src/iceberg/schema.py:
##########
@@ -290,6 +292,8 @@ def _(obj: StructType, visitor: SchemaVisitor[T]) -> T:
         visitor.before_field(field)
         try:
             result = visit(field.type, visitor)
+        except Exception:
+            logger.exception("Unable to visit the field in the schema")

Review Comment:
   I've changed the code to re-raise the exception. I had a couple of time during debugging that it silently swallowed the exception, which didn't really help in the process of debugging :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868342529


##########
python/src/iceberg/schema.py:
##########
@@ -262,10 +267,45 @@ def primitive(self, primitive: PrimitiveType) -> T:
         ...  # pragma: no cover
 
 
+@dataclass(init=True, eq=True)
+class Accessor:
+    """An accessor for a specific position in a container that implements the StructProtocol"""
+
+    position: int
+    inner: Optional["Accessor"] = None
+
+    def __str__(self):
+        return f"Accessor(position={self.position},inner={self.inner})"
+
+    def __repr__(self):
+        return self.__str__()
+
+    def with_inner(self, accessor: "Accessor") -> "Accessor":
+        parent = copy.deepcopy(self)
+        acc = parent
+        while acc.inner:
+            acc = acc.inner
+        acc.inner = accessor
+        return parent
+
+    def get(self, container: StructProtocol) -> Any:
+        """Returns the value at self.position in `container`
+
+        Args:
+            container(StructProtocol): A container to access at position `self.position`
+
+        Returns:
+            Any: The value at position `self.position` in the container
+        """
+        return container.get(self.position)
+
+
 @singledispatch
 def visit(obj, visitor: SchemaVisitor[T]) -> T:
     """A generic function for applying a schema visitor to any point within a schema
 
+    The function traverses the schema in pre-order fashion

Review Comment:
   For building the Accessors, the pre-order is much easier since we can just wrap the field in the `inner` function. Also, I think it is much easier for most people to reason because it follows the pattern of a recursive function.
   
   I was looking at the existing `_IndexByName` visitor, and it looks like we're actually using pre-order traversal as well. But we misuse the before and after hooks: https://github.com/apache/iceberg/blob/1bb526264b22372bf94a0d28ccf626da93ea6014/python/src/iceberg/schema.py#L391-L410
   
   I could implement a similar pattern for creating the accessors. WDYT @samredai ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868270096


##########
python/src/iceberg/schema.py:
##########
@@ -262,10 +267,45 @@ def primitive(self, primitive: PrimitiveType) -> T:
         ...  # pragma: no cover
 
 
+@dataclass(init=True, eq=True)
+class Accessor:
+    """An accessor for a specific position in a container that implements the StructProtocol"""
+
+    position: int
+    inner: Optional["Accessor"] = None
+
+    def __str__(self):
+        return f"Accessor(position={self.position},inner={self.inner})"
+
+    def __repr__(self):
+        return self.__str__()
+
+    def with_inner(self, accessor: "Accessor") -> "Accessor":
+        parent = copy.deepcopy(self)
+        acc = parent
+        while acc.inner:
+            acc = acc.inner
+        acc.inner = accessor
+        return parent
+
+    def get(self, container: StructProtocol) -> Any:
+        """Returns the value at self.position in `container`
+
+        Args:
+            container(StructProtocol): A container to access at position `self.position`
+
+        Returns:
+            Any: The value at position `self.position` in the container
+        """
+        return container.get(self.position)
+
+
 @singledispatch
 def visit(obj, visitor: SchemaVisitor[T]) -> T:
     """A generic function for applying a schema visitor to any point within a schema
 
+    The function traverses the schema in pre-order fashion

Review Comment:
   I don't think this change is a good idea. We use the visitor to create or process tree structures quite a bit and the post-order traversal works well in most cases because the visitor is passed the result of visiting children. For building accessors, the post-order traversal creates all of the child accessors so each level adds a new wrapper and returns a map with an additional accessor level.
   
   Changing this pattern requires changing quite a bit of code and I think makes the visitor you're building here more complicated because it can't rely on traversal to create the child accessors it needs. I'd recommend not changing the order.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868264489


##########
python/tests/conftest.py:
##########
@@ -19,7 +19,7 @@
 
 import pytest
 
-from iceberg import schema
+from iceberg.schema import Schema

Review Comment:
   I think that the module name was used on purpose. @samredai, didn't you say that using the module name as a prefix was recommended for readability?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868724187


##########
python/src/iceberg/files.py:
##########
@@ -46,8 +46,10 @@ class FileFormat(Enum):
 class StructProtocol(Protocol):  # pragma: no cover
     """A generic protocol used by accessors to get and set at positions of an object"""
 
+    @abstractmethod

Review Comment:
   Is this not implied by `Protocol`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #4685:
URL: https://github.com/apache/iceberg/pull/4685


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r873060015


##########
python/src/iceberg/schema.py:
##########
@@ -262,10 +262,44 @@ def primitive(self, primitive: PrimitiveType) -> T:
         ...  # pragma: no cover
 
 
+@dataclass(init=True, eq=True, frozen=True)
+class Accessor:
+    """An accessor for a specific position in a container that implements the StructProtocol"""
+
+    position: int
+    inner: Optional["Accessor"] = None
+
+    def __str__(self):
+        return f"Accessor(position={self.position},inner={self.inner})"
+
+    def __repr__(self):
+        return self.__str__()

Review Comment:
   Following up on this: should we use `@dataclass(repr=True, ...)` instead of defining this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868705720


##########
python/src/iceberg/schema.py:
##########
@@ -262,10 +267,45 @@ def primitive(self, primitive: PrimitiveType) -> T:
         ...  # pragma: no cover
 
 
+@dataclass(init=True, eq=True)
+class Accessor:
+    """An accessor for a specific position in a container that implements the StructProtocol"""
+
+    position: int
+    inner: Optional["Accessor"] = None
+
+    def __str__(self):
+        return f"Accessor(position={self.position},inner={self.inner})"
+
+    def __repr__(self):
+        return self.__str__()
+
+    def with_inner(self, accessor: "Accessor") -> "Accessor":
+        parent = copy.deepcopy(self)
+        acc = parent
+        while acc.inner:
+            acc = acc.inner
+        acc.inner = accessor
+        return parent
+
+    def get(self, container: StructProtocol) -> Any:
+        """Returns the value at self.position in `container`
+
+        Args:
+            container(StructProtocol): A container to access at position `self.position`
+
+        Returns:
+            Any: The value at position `self.position` in the container
+        """
+        return container.get(self.position)
+
+
 @singledispatch
 def visit(obj, visitor: SchemaVisitor[T]) -> T:
     """A generic function for applying a schema visitor to any point within a schema
 
+    The function traverses the schema in pre-order fashion

Review Comment:
   The purpose of the visitor pattern is to separate the logic that traverses the tree structure from the logic that operates on it. Moving to pre-order conflicts with that goal because you have to inspect the inner type using `instanceof`. It think it looks more like recursive method that traverses the schema because it has to redo the work of the visitor.
   
   Iceberg uses the pattern of using a visitor to build a tree structure quite a bit, and to do that you almost always want to use post-order so that you can wrap what was returned by visiting the children of a node. There are other ways to do it, but you end up keeping state in the visitor or needing to pass state down the tree. I think it's cleaner and easier to work with using the existing pattern. But even if that's not convincing, I'm concerned about having to rewrite all of the Java visitors to use different logic, rather than being able to see what they're doing and port the same logic. That's going to be quite a bit harder.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#issuecomment-1126575299

   Looks great! Thanks, @Fokko!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871847920


##########
python/tests/test_schema.py:
##########
@@ -347,3 +351,80 @@ def test_schema_find_type(table_schema_simple):
         == table_schema_simple.find_type("BAZ", case_sensitive=False)
         == BooleanType()
     )
+
+
+def test_build_position_accessors(table_schema_nested):
+    accessors = build_position_accessors(table_schema_nested)
+    assert accessors == {
+        1: Accessor(position=0, inner=None),
+        2: Accessor(position=1, inner=None),
+        3: Accessor(position=2, inner=None),
+        5: Accessor(position=3, inner=Accessor(position=0, inner=None)),
+        7: Accessor(position=4, inner=Accessor(position=0, inner=None)),
+        8: Accessor(position=4, inner=Accessor(position=1, inner=None)),
+        9: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=0, inner=None))),
+        10: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=1, inner=None))),
+        12: Accessor(position=5, inner=Accessor(position=0, inner=None)),
+        13: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=0, inner=None))),
+        14: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=1, inner=None))),
+    }
+
+
+def test_build_position_accessors_list():
+    iceberg_schema = schema.Schema(
+        NestedField(
+            field_id=4,
+            name="qux",
+            field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True),
+            is_optional=True,
+        ),
+        schema_id=1,
+    )
+    accessors = build_position_accessors(iceberg_schema)
+    assert accessors == {
+        5: Accessor(position=0, inner=Accessor(position=0, inner=None)),
+    }
+
+
+def test_build_position_accessors_map():
+    iceberg_schema = schema.Schema(
+        NestedField(
+            field_id=6,
+            name="quux",
+            field_type=MapType(
+                key_id=7,
+                key_type=StringType(),
+                value_id=8,
+                value_type=MapType(
+                    key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True
+                ),
+                value_is_optional=True,
+            ),
+            is_optional=True,
+        ),
+        schema_id=1,
+    )
+    accessors = build_position_accessors(iceberg_schema)
+    assert accessors == {
+        7: Accessor(position=0, inner=Accessor(position=0, inner=None)),
+        8: Accessor(position=0, inner=Accessor(position=1, inner=None)),
+        9: Accessor(position=0, inner=Accessor(position=1, inner=Accessor(position=0, inner=None))),
+        10: Accessor(position=0, inner=Accessor(position=1, inner=Accessor(position=1, inner=None))),
+    }
+
+
+class TestStructProtocol(StructProtocol):

Review Comment:
   I agree, but that would be outside of the scope of this PR. @samredai WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r864985851


##########
python/src/iceberg/schema.py:
##########
@@ -503,21 +507,34 @@ class _BuildPositionAccessors(SchemaVisitor[Dict[int, "Accessor"]]):
     def __init__(self) -> None:
         self._index: Dict[int, Accessor] = {}
 
-    def schema(self, schema, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    def schema(self, schema: Schema, field_results: Dict[int, Accessor]) -> Dict[int, Accessor]:
+        return field_results
+
+    def struct(self, struct: StructType, field_results: List[Dict[int, Accessor]]) -> Dict[int, Accessor]:
+        fields = struct.fields
+        for idx, field in enumerate(fields):
+            result = field_results[idx]
+            if result:
+                self._index.update(result)
+            else:
+                self._index[field.field_id] = Accessor(field.field_id)

Review Comment:
   Doesn't this need to create the accessor based on `idx` because the accessor uses position to locate a field?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #4685: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#issuecomment-1118679163

   Hey @samredai It took me a while, but I noticed that we're doing post-order traversal over the schema. This made it challenging to implement the Accessors because it would first create the object that should be set to `inner`, and then the parent objects. Instead of visiting the leaves first, I've changed the code to pre-order traversal. This way we can just recursively construct the accessors.
   
   Let's consider the following schema:
   ```
   Schema: table {
     2: id: required int
     1: data: optional string
     3: location: optional struct<5: latitude: required float, 6: longitude: required float>
   }
   ```
   
   Before it would visit:
   ```
   Field: 2: id: required int
   Field: 1: data: optional string
   Field: 5: latitude: required float
   Field: 6: longitude: required float
   Struct: struct<5: latitude: required float, 6: longitude: required float>
   Field: 3: location: optional struct<5: latitude: required float, 6: longitude: required float>
   Struct: struct<2: id: required int, 1: data: optional string, 3: location: optional struct<5: latitude: required float, 6: longitude: required float>>
   ```
   We first visit all the fields, and then the objects. I've changed this to pre-order:
   ```
   Struct: struct<2: id: required int, 1: data: optional string, 3: location: optional struct<5: latitude: required float, 6: longitude: required float>>
   Field: 2: id: required int
   Field: 1: data: optional string
   Field: 3: location: optional struct<5: latitude: required float, 6: longitude: required float>
   Struct: struct<5: latitude: required float, 6: longitude: required float>
   Field: 5: latitude: required float
   Field: 6: longitude: required float
   ```
   Which makes more sense to me and it also allows me to simplify the code. Please let me know if I'm missing anything here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868710344


##########
python/src/iceberg/schema.py:
##########
@@ -498,26 +565,76 @@ def index_name_by_id(schema_or_type) -> Dict[int, str]:
 
 
 class _BuildPositionAccessors(SchemaVisitor[Dict[int, "Accessor"]]):
-    """A schema visitor for generating a field ID to accessor index"""
+    """A schema visitor for generating a field ID to accessor index
+
+    Example:
+        >>> from iceberg.schema import Schema
+        >>> from iceberg.types import *
+        >>> schema = Schema(
+        ...     NestedField(field_id=2, name="id", field_type=IntegerType(), is_optional=False),
+        ...     NestedField(field_id=1, name="data", field_type=StringType(), is_optional=True),
+        ...     NestedField(
+        ...         field_id=3,
+        ...         name="location",
+        ...         field_type=StructType(
+        ...             NestedField(field_id=5, name="latitude", field_type=FloatType(), is_optional=False),
+        ...             NestedField(field_id=6, name="longitude", field_type=FloatType(), is_optional=False),
+        ...         ),
+        ...         is_optional=True,
+        ...     ),
+        ...     schema_id=1,
+        ...     identifier_field_ids=[1],
+        ... )
+        >>> result = build_position_accessors(schema)
+        >>> expected = {
+        ...      2: Accessor(position=0, inner=None),
+        ...      1: Accessor(position=1, inner=None),
+        ...      3: Accessor(position=2, inner=None),
+        ...      5: Accessor(position=2, inner=Accessor(position=0, inner=None)),
+        ...      6: Accessor(position=2, inner=Accessor(position=1, inner=None)),
+        ... }
+        >>> result == expected
+        True
+    """
 
     def __init__(self) -> None:
         self._index: Dict[int, Accessor] = {}
+        self._parents: Dict[int, int] = {}
+        self._pos: Dict[int, int] = defaultdict(lambda: 0)
 
-    def schema(self, schema, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
+    def schema(self, schema: Schema) -> Dict[int, Accessor]:
         return self._index
 
-    def struct(self, struct, result: List[Dict[int, Accessor]]) -> Dict[int, Accessor]:
-        # TODO: Populate the `self._index` dictionary where the key is the field ID and the value is an accessor for that field.
-        #   The equivalent java logic can be found here: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Accessors.java#L213-L230
+    def struct(self, struct: StructType) -> Dict[int, Accessor]:
         return self._index
 
-    def field(self, field: NestedField, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
+    def field(self, field: NestedField) -> Dict[int, Accessor]:
+        field_type = field.type
+        if isinstance(field_type, StructType):

Review Comment:
   I don't think it is necessary to duplicate the traversal logic. This can be done in `struct` before `field` is handled. Same for `map` and `list`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868724490


##########
python/tests/conftest.py:
##########
@@ -19,7 +19,7 @@
 
 import pytest
 
-from iceberg import schema
+from iceberg.schema import Schema

Review Comment:
   Unless there is a reason to touch this file, I'd probably prefer to revert this change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868718919


##########
python/src/iceberg/schema.py:
##########
@@ -498,26 +565,76 @@ def index_name_by_id(schema_or_type) -> Dict[int, str]:
 
 
 class _BuildPositionAccessors(SchemaVisitor[Dict[int, "Accessor"]]):
-    """A schema visitor for generating a field ID to accessor index"""
+    """A schema visitor for generating a field ID to accessor index
+
+    Example:
+        >>> from iceberg.schema import Schema
+        >>> from iceberg.types import *
+        >>> schema = Schema(
+        ...     NestedField(field_id=2, name="id", field_type=IntegerType(), is_optional=False),
+        ...     NestedField(field_id=1, name="data", field_type=StringType(), is_optional=True),
+        ...     NestedField(
+        ...         field_id=3,
+        ...         name="location",
+        ...         field_type=StructType(
+        ...             NestedField(field_id=5, name="latitude", field_type=FloatType(), is_optional=False),
+        ...             NestedField(field_id=6, name="longitude", field_type=FloatType(), is_optional=False),
+        ...         ),
+        ...         is_optional=True,
+        ...     ),
+        ...     schema_id=1,
+        ...     identifier_field_ids=[1],
+        ... )
+        >>> result = build_position_accessors(schema)
+        >>> expected = {
+        ...      2: Accessor(position=0, inner=None),
+        ...      1: Accessor(position=1, inner=None),
+        ...      3: Accessor(position=2, inner=None),
+        ...      5: Accessor(position=2, inner=Accessor(position=0, inner=None)),
+        ...      6: Accessor(position=2, inner=Accessor(position=1, inner=None)),
+        ... }
+        >>> result == expected
+        True
+    """
 
     def __init__(self) -> None:
         self._index: Dict[int, Accessor] = {}
+        self._parents: Dict[int, int] = {}
+        self._pos: Dict[int, int] = defaultdict(lambda: 0)
 
-    def schema(self, schema, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
+    def schema(self, schema: Schema) -> Dict[int, Accessor]:
         return self._index
 
-    def struct(self, struct, result: List[Dict[int, Accessor]]) -> Dict[int, Accessor]:
-        # TODO: Populate the `self._index` dictionary where the key is the field ID and the value is an accessor for that field.
-        #   The equivalent java logic can be found here: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Accessors.java#L213-L230
+    def struct(self, struct: StructType) -> Dict[int, Accessor]:
         return self._index
 
-    def field(self, field: NestedField, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
+    def field(self, field: NestedField) -> Dict[int, Accessor]:
+        field_type = field.type
+        if isinstance(field_type, StructType):
+            # In the case of a struct, we want to map which one the parent is
+            for inner_field in field_type.fields:
+                self._parents[inner_field.field_id] = field.field_id
+        elif isinstance(field_type, MapType):
+            self._parents[field_type.key.field_id] = field.field_id
+            self._parents[field_type.value.field_id] = field.field_id
+        elif isinstance(field_type, ListType):
+            self._parents[field_type.element.field_id] = field.field_id
+
+        parent = 0
+        if field.field_id in self._parents:
+            parent = self._parents[field.field_id]
+            self._index[field.field_id] = self._index[parent].with_inner(Accessor(position=self._pos[parent]))
+        else:
+            self._index[field.field_id] = Accessor(position=self._pos[parent])
+
+        self._pos[parent] = self._pos[parent] + 1

Review Comment:
   I see that this is reconstructing the position in the nested type based on the number of calls to `field`. This seems a bit hard to follow, at least to me. I prefer the more straightforward approach in the Java version, where position is handled with a local variable, `i`, in `struct`:
   
   ```java
         for (int i = 0; i < fieldResults.size(); i += 1) {
           Types.NestedField field = fields.get(i);
           Map<Integer, Accessor<StructLike>> result = fieldResults.get(i);
           if (result != null) {
             for (Map.Entry<Integer, Accessor<StructLike>> entry : result.entrySet()) {
               accessors.put(entry.getKey(), newAccessor(i, field.isOptional(), entry.getValue()));
             }
           } else {
             accessors.put(field.fieldId(), newAccessor(i, field.type()));
           }
         }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868265530


##########
python/src/iceberg/schema.py:
##########
@@ -262,10 +267,45 @@ def primitive(self, primitive: PrimitiveType) -> T:
         ...  # pragma: no cover
 
 
+@dataclass(init=True, eq=True)
+class Accessor:
+    """An accessor for a specific position in a container that implements the StructProtocol"""
+
+    position: int
+    inner: Optional["Accessor"] = None
+
+    def __str__(self):
+        return f"Accessor(position={self.position},inner={self.inner})"
+
+    def __repr__(self):
+        return self.__str__()
+
+    def with_inner(self, accessor: "Accessor") -> "Accessor":
+        parent = copy.deepcopy(self)
+        acc = parent
+        while acc.inner:
+            acc = acc.inner
+        acc.inner = accessor
+        return parent
+
+    def get(self, container: StructProtocol) -> Any:
+        """Returns the value at self.position in `container`
+
+        Args:
+            container(StructProtocol): A container to access at position `self.position`
+
+        Returns:
+            Any: The value at position `self.position` in the container
+        """
+        return container.get(self.position)

Review Comment:
   This also needs to call `inner` if it is defined.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871462282


##########
python/src/iceberg/schema.py:
##########
@@ -497,31 +530,78 @@ def index_name_by_id(schema_or_type) -> Dict[int, str]:
     return indexer.by_id()
 
 
-class _BuildPositionAccessors(SchemaVisitor[Dict[int, "Accessor"]]):
-    """A schema visitor for generating a field ID to accessor index"""
+Position = int
 
-    def __init__(self) -> None:
-        self._index: Dict[int, Accessor] = {}
 
-    def schema(self, schema, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+class _BuildPositionAccessors(SchemaVisitor[Dict[Position, Accessor]]):
+    """A schema visitor for generating a field ID to accessor index
 
-    def struct(self, struct, result: List[Dict[int, Accessor]]) -> Dict[int, Accessor]:
-        # TODO: Populate the `self._index` dictionary where the key is the field ID and the value is an accessor for that field.
-        #   The equivalent java logic can be found here: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Accessors.java#L213-L230
-        return self._index
+    Example:
+        >>> from iceberg.schema import Schema
+        >>> from iceberg.types import *
+        >>> schema = Schema(
+        ...     NestedField(field_id=2, name="id", field_type=IntegerType(), is_optional=False),
+        ...     NestedField(field_id=1, name="data", field_type=StringType(), is_optional=True),
+        ...     NestedField(
+        ...         field_id=3,
+        ...         name="location",
+        ...         field_type=StructType(
+        ...             NestedField(field_id=5, name="latitude", field_type=FloatType(), is_optional=False),
+        ...             NestedField(field_id=6, name="longitude", field_type=FloatType(), is_optional=False),
+        ...         ),
+        ...         is_optional=True,
+        ...     ),
+        ...     schema_id=1,
+        ...     identifier_field_ids=[1],
+        ... )
+        >>> result = build_position_accessors(schema)
+        >>> expected = {
+        ...      2: Accessor(position=0, inner=None),
+        ...      1: Accessor(position=1, inner=None),
+        ...      5: Accessor(position=2, inner=Accessor(position=0, inner=None)),
+        ...      6: Accessor(position=2, inner=Accessor(position=1, inner=None)),
+        ... }
+        >>> result == expected
+        True
+    """
 
-    def field(self, field: NestedField, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    @staticmethod
+    def _wrap_leaves(result: Dict[Position, Accessor], position: Position = 0) -> Dict[Position, Accessor]:
+        return {field_id: Accessor(position, inner=inner) for field_id, inner in result.items()}
 
-    def list(self, list_type: ListType, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    def schema(self, schema: Schema, result: Dict[Position, Accessor]) -> Dict[Position, Accessor]:
+        return result
 
-    def map(self, map_type: MapType, key_result: Dict[int, Accessor], value_result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    def struct(self, struct: StructType, field_results: List[Dict[Position, Accessor]]) -> Dict[Position, Accessor]:
+        result = {}
 
-    def primitive(self, primitive: PrimitiveType) -> Dict[int, Accessor]:
-        return self._index
+        for position, field in enumerate(struct.fields):
+            if field_results[position]:

Review Comment:
   Since `list`, `map`, and `primitive` all return a map, is the `else` case used? It is used in the JVM version for all 3 cases, but I think here it is probably just used for the `primitive` case because the accessor map is empty, right?
   
   I think that is correct (if `{}` uses the `else` branch) but I don't think there's a need to return accessors for fields inside maps or lists because we don't really know how to handle repeated elements. We could extend this later so that a repeated layer returns a tuple of the values, but right now we don't really need that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868342529


##########
python/src/iceberg/schema.py:
##########
@@ -262,10 +267,45 @@ def primitive(self, primitive: PrimitiveType) -> T:
         ...  # pragma: no cover
 
 
+@dataclass(init=True, eq=True)
+class Accessor:
+    """An accessor for a specific position in a container that implements the StructProtocol"""
+
+    position: int
+    inner: Optional["Accessor"] = None
+
+    def __str__(self):
+        return f"Accessor(position={self.position},inner={self.inner})"
+
+    def __repr__(self):
+        return self.__str__()
+
+    def with_inner(self, accessor: "Accessor") -> "Accessor":
+        parent = copy.deepcopy(self)
+        acc = parent
+        while acc.inner:
+            acc = acc.inner
+        acc.inner = accessor
+        return parent
+
+    def get(self, container: StructProtocol) -> Any:
+        """Returns the value at self.position in `container`
+
+        Args:
+            container(StructProtocol): A container to access at position `self.position`
+
+        Returns:
+            Any: The value at position `self.position` in the container
+        """
+        return container.get(self.position)
+
+
 @singledispatch
 def visit(obj, visitor: SchemaVisitor[T]) -> T:
     """A generic function for applying a schema visitor to any point within a schema
 
+    The function traverses the schema in pre-order fashion

Review Comment:
   For building the Accessors, the pre-order is much easier since we can just wrap the field in the `inner` function. Also, I think it is much easier for most people to reason because it follows the pattern of a recursive function.
   
   I was looking at the existing `_IndexByName` visitor, and it looks like we're actually using pre-order traversal as well. But mimicking this using the before and after hooks: https://github.com/apache/iceberg/blob/1bb526264b22372bf94a0d28ccf626da93ea6014/python/src/iceberg/schema.py#L391-L410
   
   I could implement a similar pattern for creating the accessors. WDYT @samredai ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868705720


##########
python/src/iceberg/schema.py:
##########
@@ -262,10 +267,45 @@ def primitive(self, primitive: PrimitiveType) -> T:
         ...  # pragma: no cover
 
 
+@dataclass(init=True, eq=True)
+class Accessor:
+    """An accessor for a specific position in a container that implements the StructProtocol"""
+
+    position: int
+    inner: Optional["Accessor"] = None
+
+    def __str__(self):
+        return f"Accessor(position={self.position},inner={self.inner})"
+
+    def __repr__(self):
+        return self.__str__()
+
+    def with_inner(self, accessor: "Accessor") -> "Accessor":
+        parent = copy.deepcopy(self)
+        acc = parent
+        while acc.inner:
+            acc = acc.inner
+        acc.inner = accessor
+        return parent
+
+    def get(self, container: StructProtocol) -> Any:
+        """Returns the value at self.position in `container`
+
+        Args:
+            container(StructProtocol): A container to access at position `self.position`
+
+        Returns:
+            Any: The value at position `self.position` in the container
+        """
+        return container.get(self.position)
+
+
 @singledispatch
 def visit(obj, visitor: SchemaVisitor[T]) -> T:
     """A generic function for applying a schema visitor to any point within a schema
 
+    The function traverses the schema in pre-order fashion

Review Comment:
   The purpose of the visitor pattern is to separate the logic that traverses the tree structure from the logic that operates on it. Iceberg uses the pattern of using a visitor to build a tree structure quite a bit, and to do that you almost always want to use post-order so that you can wrap what was returned by visiting the children of a node.
   
   There are other ways to do it, but you end up keeping state in the visitor (as in this PR) or needing to pass state down the tree. I think it's cleaner and easier to work with using the existing pattern. But even if that's not convincing, I'm concerned about having to rewrite all of the Java visitors to use different logic, rather than being able to see what they're doing and port the same logic. That's going to be quite a bit harder.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868721508


##########
python/src/iceberg/schema.py:
##########
@@ -279,53 +326,54 @@ def visit(obj, visitor: SchemaVisitor[T]) -> T:
 @visit.register(Schema)
 def _(obj: Schema, visitor: SchemaVisitor[T]) -> T:
     """Visit a Schema with a concrete SchemaVisitor"""
-    return visitor.schema(obj, visit(obj.as_struct(), visitor))
+    return visit(obj.as_struct(), visitor)
 
 
 @visit.register(StructType)
 def _(obj: StructType, visitor: SchemaVisitor[T]) -> T:
     """Visit a StructType with a concrete SchemaVisitor"""
-    results = []
+
+    result = visitor.struct(obj)
+
     for field in obj.fields:
+        visitor.field(field)
         visitor.before_field(field)
-        try:
-            result = visit(field.type, visitor)
-        finally:
-            visitor.after_field(field)
+        visit(field.type, visitor)
+        visitor.after_field(field)
 
-        results.append(visitor.field(field, result))
-
-    return visitor.struct(obj, results)
+    return result
 
 
 @visit.register(ListType)
 def _(obj: ListType, visitor: SchemaVisitor[T]) -> T:
     """Visit a ListType with a concrete SchemaVisitor"""
+    result = visitor.list(obj)
+
+    visitor.field(obj.element)
+
     visitor.before_list_element(obj.element)
-    try:
-        result = visit(obj.element.type, visitor)
-    finally:
-        visitor.after_list_element(obj.element)
+    visit(obj.element.type, visitor)
+    visitor.after_list_element(obj.element)
 
-    return visitor.list(obj, result)
+    return result
 
 
 @visit.register(MapType)
 def _(obj: MapType, visitor: SchemaVisitor[T]) -> T:
     """Visit a MapType with a concrete SchemaVisitor"""
+    result = visitor.map(obj)
+
+    visitor.field(obj.key)

Review Comment:
   Changing the behavior of `field` is probably going to cause problems later when translating between Java and Python implementations. `field` isn't called for map key, map value, or list element in any of the Java visitors.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871861711


##########
python/src/iceberg/schema.py:
##########
@@ -497,31 +530,78 @@ def index_name_by_id(schema_or_type) -> Dict[int, str]:
     return indexer.by_id()
 
 
-class _BuildPositionAccessors(SchemaVisitor[Dict[int, "Accessor"]]):
-    """A schema visitor for generating a field ID to accessor index"""
+Position = int
 
-    def __init__(self) -> None:
-        self._index: Dict[int, Accessor] = {}
 
-    def schema(self, schema, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+class _BuildPositionAccessors(SchemaVisitor[Dict[Position, Accessor]]):
+    """A schema visitor for generating a field ID to accessor index
 
-    def struct(self, struct, result: List[Dict[int, Accessor]]) -> Dict[int, Accessor]:
-        # TODO: Populate the `self._index` dictionary where the key is the field ID and the value is an accessor for that field.
-        #   The equivalent java logic can be found here: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Accessors.java#L213-L230
-        return self._index
+    Example:
+        >>> from iceberg.schema import Schema
+        >>> from iceberg.types import *
+        >>> schema = Schema(
+        ...     NestedField(field_id=2, name="id", field_type=IntegerType(), is_optional=False),
+        ...     NestedField(field_id=1, name="data", field_type=StringType(), is_optional=True),
+        ...     NestedField(
+        ...         field_id=3,
+        ...         name="location",
+        ...         field_type=StructType(
+        ...             NestedField(field_id=5, name="latitude", field_type=FloatType(), is_optional=False),
+        ...             NestedField(field_id=6, name="longitude", field_type=FloatType(), is_optional=False),
+        ...         ),
+        ...         is_optional=True,
+        ...     ),
+        ...     schema_id=1,
+        ...     identifier_field_ids=[1],
+        ... )
+        >>> result = build_position_accessors(schema)
+        >>> expected = {
+        ...      2: Accessor(position=0, inner=None),
+        ...      1: Accessor(position=1, inner=None),
+        ...      5: Accessor(position=2, inner=Accessor(position=0, inner=None)),
+        ...      6: Accessor(position=2, inner=Accessor(position=1, inner=None)),
+        ... }
+        >>> result == expected
+        True
+    """
 
-    def field(self, field: NestedField, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    @staticmethod
+    def _wrap_leaves(result: Dict[Position, Accessor], position: Position = 0) -> Dict[Position, Accessor]:
+        return {field_id: Accessor(position, inner=inner) for field_id, inner in result.items()}
 
-    def list(self, list_type: ListType, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    def schema(self, schema: Schema, result: Dict[Position, Accessor]) -> Dict[Position, Accessor]:
+        return result
 
-    def map(self, map_type: MapType, key_result: Dict[int, Accessor], value_result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    def struct(self, struct: StructType, field_results: List[Dict[Position, Accessor]]) -> Dict[Position, Accessor]:
+        result = {}
 
-    def primitive(self, primitive: PrimitiveType) -> Dict[int, Accessor]:
-        return self._index
+        for position, field in enumerate(struct.fields):
+            if field_results[position]:

Review Comment:
   > So you only want to support structs (including nested structs), but discard support for lists and maps?
   
   Yes, because maps and lists aren't really supported. An accessor that is produced will be able to load, for example, a map by position, but the inner accessor for key will load position 0. What does it mean to call `m.get(0)`? It seems to me that we need a different sort of accessor for repeated elements like map keys, map values, and list elements. One that gets the collection and then calls any inner accessor on each item in the collection. We can do that, but Java doesn't actually use it yet so there's not much point to implementing it over here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871454516


##########
python/src/iceberg/schema.py:
##########
@@ -301,11 +336,10 @@ def _(obj: StructType, visitor: SchemaVisitor[T]) -> T:
 @visit.register(ListType)
 def _(obj: ListType, visitor: SchemaVisitor[T]) -> T:
     """Visit a ListType with a concrete SchemaVisitor"""
+
     visitor.before_list_element(obj.element)
-    try:
-        result = visit(obj.element.type, visitor)
-    finally:

Review Comment:
   Why remove the try/finally? It seems safer to have it, although I'll admit that it would be weird for an exception to be thrown during operation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] samredai commented on pull request #4685: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
samredai commented on PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#issuecomment-1118746923

   @Fokko thanks for catching this! It makes sense to me and I don't think there are other areas that make post-order traversal a requirement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r866108031


##########
python/tests/test_schema.py:
##########
@@ -347,3 +349,17 @@ def test_schema_find_type(table_schema_simple):
         == table_schema_simple.find_type("BAZ", case_sensitive=False)
         == BooleanType()
     )
+
+
+def test_build_position_accessors(table_schema_nested):

Review Comment:
   As you can see here, the list/maps are still to be implemented :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871846400


##########
python/src/iceberg/schema.py:
##########
@@ -497,31 +530,78 @@ def index_name_by_id(schema_or_type) -> Dict[int, str]:
     return indexer.by_id()
 
 
-class _BuildPositionAccessors(SchemaVisitor[Dict[int, "Accessor"]]):
-    """A schema visitor for generating a field ID to accessor index"""
+Position = int
 
-    def __init__(self) -> None:
-        self._index: Dict[int, Accessor] = {}
 
-    def schema(self, schema, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+class _BuildPositionAccessors(SchemaVisitor[Dict[Position, Accessor]]):
+    """A schema visitor for generating a field ID to accessor index
 
-    def struct(self, struct, result: List[Dict[int, Accessor]]) -> Dict[int, Accessor]:
-        # TODO: Populate the `self._index` dictionary where the key is the field ID and the value is an accessor for that field.
-        #   The equivalent java logic can be found here: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Accessors.java#L213-L230
-        return self._index
+    Example:
+        >>> from iceberg.schema import Schema
+        >>> from iceberg.types import *
+        >>> schema = Schema(
+        ...     NestedField(field_id=2, name="id", field_type=IntegerType(), is_optional=False),
+        ...     NestedField(field_id=1, name="data", field_type=StringType(), is_optional=True),
+        ...     NestedField(
+        ...         field_id=3,
+        ...         name="location",
+        ...         field_type=StructType(
+        ...             NestedField(field_id=5, name="latitude", field_type=FloatType(), is_optional=False),
+        ...             NestedField(field_id=6, name="longitude", field_type=FloatType(), is_optional=False),
+        ...         ),
+        ...         is_optional=True,
+        ...     ),
+        ...     schema_id=1,
+        ...     identifier_field_ids=[1],
+        ... )
+        >>> result = build_position_accessors(schema)
+        >>> expected = {
+        ...      2: Accessor(position=0, inner=None),
+        ...      1: Accessor(position=1, inner=None),
+        ...      5: Accessor(position=2, inner=Accessor(position=0, inner=None)),
+        ...      6: Accessor(position=2, inner=Accessor(position=1, inner=None)),
+        ... }
+        >>> result == expected
+        True
+    """
 
-    def field(self, field: NestedField, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    @staticmethod
+    def _wrap_leaves(result: Dict[Position, Accessor], position: Position = 0) -> Dict[Position, Accessor]:
+        return {field_id: Accessor(position, inner=inner) for field_id, inner in result.items()}
 
-    def list(self, list_type: ListType, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    def schema(self, schema: Schema, result: Dict[Position, Accessor]) -> Dict[Position, Accessor]:
+        return result
 
-    def map(self, map_type: MapType, key_result: Dict[int, Accessor], value_result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    def struct(self, struct: StructType, field_results: List[Dict[Position, Accessor]]) -> Dict[Position, Accessor]:
+        result = {}
 
-    def primitive(self, primitive: PrimitiveType) -> Dict[int, Accessor]:
-        return self._index
+        for position, field in enumerate(struct.fields):
+            if field_results[position]:

Review Comment:
   This is Python:
   ```python
   ➜  python3
   Python 3.9.12 (main, Mar 26 2022, 15:44:31) 
   [Clang 13.1.6 (clang-1316.0.21.2)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   >>> if {}:
   ...   print('if')
   ... else:
   ...   print('else')
   ... 
   else
   ```
   
   > I think that is correct (if `{}` uses the else branch) but I don't think there's a need to return accessors for fields inside maps or lists because we don't really know how to handle repeated elements. We could extend this later so that a repeated layer returns a tuple of the values, but right now we don't really need that.
   
   So you only want to support structs (including nested structs), but discard support for lists and maps?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r872639988


##########
python/src/iceberg/schema.py:
##########
@@ -497,31 +530,78 @@ def index_name_by_id(schema_or_type) -> Dict[int, str]:
     return indexer.by_id()
 
 
-class _BuildPositionAccessors(SchemaVisitor[Dict[int, "Accessor"]]):
-    """A schema visitor for generating a field ID to accessor index"""
+Position = int
 
-    def __init__(self) -> None:
-        self._index: Dict[int, Accessor] = {}
 
-    def schema(self, schema, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+class _BuildPositionAccessors(SchemaVisitor[Dict[Position, Accessor]]):
+    """A schema visitor for generating a field ID to accessor index
 
-    def struct(self, struct, result: List[Dict[int, Accessor]]) -> Dict[int, Accessor]:
-        # TODO: Populate the `self._index` dictionary where the key is the field ID and the value is an accessor for that field.
-        #   The equivalent java logic can be found here: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Accessors.java#L213-L230
-        return self._index
+    Example:
+        >>> from iceberg.schema import Schema
+        >>> from iceberg.types import *
+        >>> schema = Schema(
+        ...     NestedField(field_id=2, name="id", field_type=IntegerType(), is_optional=False),
+        ...     NestedField(field_id=1, name="data", field_type=StringType(), is_optional=True),
+        ...     NestedField(
+        ...         field_id=3,
+        ...         name="location",
+        ...         field_type=StructType(
+        ...             NestedField(field_id=5, name="latitude", field_type=FloatType(), is_optional=False),
+        ...             NestedField(field_id=6, name="longitude", field_type=FloatType(), is_optional=False),
+        ...         ),
+        ...         is_optional=True,
+        ...     ),
+        ...     schema_id=1,
+        ...     identifier_field_ids=[1],
+        ... )
+        >>> result = build_position_accessors(schema)
+        >>> expected = {
+        ...      2: Accessor(position=0, inner=None),
+        ...      1: Accessor(position=1, inner=None),
+        ...      5: Accessor(position=2, inner=Accessor(position=0, inner=None)),
+        ...      6: Accessor(position=2, inner=Accessor(position=1, inner=None)),
+        ... }
+        >>> result == expected
+        True
+    """
 
-    def field(self, field: NestedField, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    @staticmethod
+    def _wrap_leaves(result: Dict[Position, Accessor], position: Position = 0) -> Dict[Position, Accessor]:
+        return {field_id: Accessor(position, inner=inner) for field_id, inner in result.items()}
 
-    def list(self, list_type: ListType, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    def schema(self, schema: Schema, result: Dict[Position, Accessor]) -> Dict[Position, Accessor]:
+        return result
 
-    def map(self, map_type: MapType, key_result: Dict[int, Accessor], value_result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    def struct(self, struct: StructType, field_results: List[Dict[Position, Accessor]]) -> Dict[Position, Accessor]:
+        result = {}
 
-    def primitive(self, primitive: PrimitiveType) -> Dict[int, Accessor]:
-        return self._index
+        for position, field in enumerate(struct.fields):
+            if field_results[position]:

Review Comment:
   @rdblue Interesting case, and I see your point, just updated the code and extended the test cases 👍🏻 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871463419


##########
python/tests/test_schema.py:
##########
@@ -347,3 +351,80 @@ def test_schema_find_type(table_schema_simple):
         == table_schema_simple.find_type("BAZ", case_sensitive=False)
         == BooleanType()
     )
+
+
+def test_build_position_accessors(table_schema_nested):
+    accessors = build_position_accessors(table_schema_nested)
+    assert accessors == {
+        1: Accessor(position=0, inner=None),
+        2: Accessor(position=1, inner=None),
+        3: Accessor(position=2, inner=None),
+        5: Accessor(position=3, inner=Accessor(position=0, inner=None)),
+        7: Accessor(position=4, inner=Accessor(position=0, inner=None)),
+        8: Accessor(position=4, inner=Accessor(position=1, inner=None)),
+        9: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=0, inner=None))),
+        10: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=1, inner=None))),
+        12: Accessor(position=5, inner=Accessor(position=0, inner=None)),
+        13: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=0, inner=None))),
+        14: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=1, inner=None))),
+    }
+
+
+def test_build_position_accessors_list():
+    iceberg_schema = schema.Schema(
+        NestedField(
+            field_id=4,
+            name="qux",
+            field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True),
+            is_optional=True,
+        ),
+        schema_id=1,
+    )
+    accessors = build_position_accessors(iceberg_schema)

Review Comment:
   I was just looking at that and commented above. How does that work? It looks like `list` and `map` are returning accessors in the visitor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871465203


##########
python/tests/test_schema.py:
##########
@@ -347,3 +351,80 @@ def test_schema_find_type(table_schema_simple):
         == table_schema_simple.find_type("BAZ", case_sensitive=False)
         == BooleanType()
     )
+
+
+def test_build_position_accessors(table_schema_nested):
+    accessors = build_position_accessors(table_schema_nested)
+    assert accessors == {
+        1: Accessor(position=0, inner=None),
+        2: Accessor(position=1, inner=None),
+        3: Accessor(position=2, inner=None),
+        5: Accessor(position=3, inner=Accessor(position=0, inner=None)),
+        7: Accessor(position=4, inner=Accessor(position=0, inner=None)),
+        8: Accessor(position=4, inner=Accessor(position=1, inner=None)),
+        9: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=0, inner=None))),
+        10: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=1, inner=None))),
+        12: Accessor(position=5, inner=Accessor(position=0, inner=None)),
+        13: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=0, inner=None))),
+        14: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=1, inner=None))),
+    }
+
+
+def test_build_position_accessors_list():
+    iceberg_schema = schema.Schema(
+        NestedField(
+            field_id=4,
+            name="qux",
+            field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True),
+            is_optional=True,
+        ),
+        schema_id=1,
+    )
+    accessors = build_position_accessors(iceberg_schema)
+    assert accessors == {
+        5: Accessor(position=0, inner=Accessor(position=0, inner=None)),
+    }
+
+
+def test_build_position_accessors_map():
+    iceberg_schema = schema.Schema(
+        NestedField(
+            field_id=6,
+            name="quux",
+            field_type=MapType(
+                key_id=7,
+                key_type=StringType(),
+                value_id=8,
+                value_type=MapType(
+                    key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True
+                ),
+                value_is_optional=True,
+            ),
+            is_optional=True,
+        ),
+        schema_id=1,
+    )
+    accessors = build_position_accessors(iceberg_schema)
+    assert accessors == {
+        7: Accessor(position=0, inner=Accessor(position=0, inner=None)),
+        8: Accessor(position=0, inner=Accessor(position=1, inner=None)),
+        9: Accessor(position=0, inner=Accessor(position=1, inner=Accessor(position=0, inner=None))),
+        10: Accessor(position=0, inner=Accessor(position=1, inner=Accessor(position=1, inner=None))),
+    }
+
+
+class TestStructProtocol(StructProtocol):

Review Comment:
   Since this is a struct-like object, would it make more sense to drop `Protocol` from the class name?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871847612


##########
python/tests/test_schema.py:
##########
@@ -347,3 +351,80 @@ def test_schema_find_type(table_schema_simple):
         == table_schema_simple.find_type("BAZ", case_sensitive=False)
         == BooleanType()
     )
+
+
+def test_build_position_accessors(table_schema_nested):
+    accessors = build_position_accessors(table_schema_nested)
+    assert accessors == {
+        1: Accessor(position=0, inner=None),
+        2: Accessor(position=1, inner=None),
+        3: Accessor(position=2, inner=None),
+        5: Accessor(position=3, inner=Accessor(position=0, inner=None)),
+        7: Accessor(position=4, inner=Accessor(position=0, inner=None)),
+        8: Accessor(position=4, inner=Accessor(position=1, inner=None)),
+        9: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=0, inner=None))),
+        10: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=1, inner=None))),
+        12: Accessor(position=5, inner=Accessor(position=0, inner=None)),
+        13: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=0, inner=None))),
+        14: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=1, inner=None))),
+    }
+
+
+def test_build_position_accessors_list():
+    iceberg_schema = schema.Schema(
+        NestedField(
+            field_id=4,
+            name="qux",
+            field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True),
+            is_optional=True,
+        ),
+        schema_id=1,
+    )
+    accessors = build_position_accessors(iceberg_schema)

Review Comment:
   Correct me if I'm wrong here. But I believe that the `list` itself isn't being returned (id 4 is missing).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] samredai commented on a diff in pull request #4685: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
samredai commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r865358439


##########
python/src/iceberg/expressions/base.py:
##########
@@ -278,6 +278,9 @@ def __str__(self):
     def __repr__(self):
         return f"Accessor(position={self._position})"
 
+    def __eq__(self, other):
+        return other and self._position == other.position

Review Comment:
   We don't manipulate the `self._position` property so this is just a nitpick, but should we use `self.position` here instead just so it goes through that path?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r865144092


##########
python/src/iceberg/schema.py:
##########
@@ -290,6 +292,8 @@ def _(obj: StructType, visitor: SchemaVisitor[T]) -> T:
         visitor.before_field(field)
         try:
             result = visit(field.type, visitor)
+        except Exception:
+            logger.exception("Unable to visit the field in the schema")

Review Comment:
   Great idea! I'm actually not sure if we should swallow those exceptions.



##########
python/src/iceberg/expressions/base.py:
##########
@@ -278,6 +278,9 @@ def __str__(self):
     def __repr__(self):
         return f"Accessor(position={self._position})"
 
+    def __eq__(self, other):
+        return other and self._position == other.position

Review Comment:
   We only use it for tests right now, but that's a good point! Thanks!



##########
python/src/iceberg/schema.py:
##########
@@ -503,21 +507,34 @@ class _BuildPositionAccessors(SchemaVisitor[Dict[int, "Accessor"]]):
     def __init__(self) -> None:
         self._index: Dict[int, Accessor] = {}
 
-    def schema(self, schema, result: Dict[int, Accessor]) -> Dict[int, Accessor]:
-        return self._index
+    def schema(self, schema: Schema, field_results: Dict[int, Accessor]) -> Dict[int, Accessor]:
+        return field_results
+
+    def struct(self, struct: StructType, field_results: List[Dict[int, Accessor]]) -> Dict[int, Accessor]:
+        fields = struct.fields
+        for idx, field in enumerate(fields):
+            result = field_results[idx]
+            if result:
+                self._index.update(result)
+            else:
+                self._index[field.field_id] = Accessor(field.field_id)

Review Comment:
   Thanks @rdblue the example is very helpful. I've added it to the docstring as well 👍🏻 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868271279


##########
python/src/iceberg/schema.py:
##########
@@ -279,53 +319,74 @@ def visit(obj, visitor: SchemaVisitor[T]) -> T:
 @visit.register(Schema)
 def _(obj: Schema, visitor: SchemaVisitor[T]) -> T:
     """Visit a Schema with a concrete SchemaVisitor"""
-    return visitor.schema(obj, visit(obj.as_struct(), visitor))
+    return visit(obj.as_struct(), visitor)
 
 
 @visit.register(StructType)
 def _(obj: StructType, visitor: SchemaVisitor[T]) -> T:
     """Visit a StructType with a concrete SchemaVisitor"""
-    results = []
+
+    result = visitor.struct(obj)
+
     for field in obj.fields:
+        visitor.field(field)
         visitor.before_field(field)
         try:
-            result = visit(field.type, visitor)
+            visit(field.type, visitor)
+        except Exception as e:
+            logger.exception(f"Unable to visit the field: {field}")
+            raise e
         finally:
             visitor.after_field(field)
 
-        results.append(visitor.field(field, result))
-
-    return visitor.struct(obj, results)
+    return result
 
 
 @visit.register(ListType)
 def _(obj: ListType, visitor: SchemaVisitor[T]) -> T:
     """Visit a ListType with a concrete SchemaVisitor"""
+    result = visitor.list(obj)
+
+    visitor.field(obj.element)
+
     visitor.before_list_element(obj.element)
     try:
-        result = visit(obj.element.type, visitor)
+        visit(obj.element.type, visitor)
+    except Exception as e:
+        logger.exception(f"Unable to visit the type: {obj}")
+        raise e
     finally:
         visitor.after_list_element(obj.element)
 
-    return visitor.list(obj, result)
+    return result
 
 
 @visit.register(MapType)
 def _(obj: MapType, visitor: SchemaVisitor[T]) -> T:
     """Visit a MapType with a concrete SchemaVisitor"""
+    result = visitor.map(obj)
+
+    visitor.field(obj.key)
     visitor.before_map_key(obj.key)
     try:
-        key_result = visit(obj.key.type, visitor)
+        visit(obj.key.type, visitor)
+    except Exception as e:
+        logger.exception(f"Unable to visit the map ket type: {obj.key}")
+        raise e
     finally:
         visitor.after_map_key(obj.key)
 
+    visitor.field(obj.value)
     visitor.before_map_value(obj.value)
     try:
-        value_result = visit(obj.value.type, visitor)
+        visit(obj.value.type, visitor)
+    except Exception as e:
+        logger.exception(f"Unable to visit the map value type: {obj.value}")

Review Comment:
   Do we need to log exceptions at every level? Why not propagate like normal to avoid too many logs?
   
   A heavily nested object with an exception would result in _a lot_ of logs, plus the exception getting handled and logged at a higher level.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r867671139


##########
python/tests/test_schema.py:
##########
@@ -347,3 +349,17 @@ def test_schema_find_type(table_schema_simple):
         == table_schema_simple.find_type("BAZ", case_sensitive=False)
         == BooleanType()
     )
+
+
+def test_build_position_accessors(table_schema_nested):

Review Comment:
   Also fixed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868346289


##########
python/src/iceberg/schema.py:
##########
@@ -279,53 +319,74 @@ def visit(obj, visitor: SchemaVisitor[T]) -> T:
 @visit.register(Schema)
 def _(obj: Schema, visitor: SchemaVisitor[T]) -> T:
     """Visit a Schema with a concrete SchemaVisitor"""
-    return visitor.schema(obj, visit(obj.as_struct(), visitor))
+    return visit(obj.as_struct(), visitor)
 
 
 @visit.register(StructType)
 def _(obj: StructType, visitor: SchemaVisitor[T]) -> T:
     """Visit a StructType with a concrete SchemaVisitor"""
-    results = []
+
+    result = visitor.struct(obj)
+
     for field in obj.fields:
+        visitor.field(field)
         visitor.before_field(field)
         try:
-            result = visit(field.type, visitor)
+            visit(field.type, visitor)
+        except Exception as e:
+            logger.exception(f"Unable to visit the field: {field}")
+            raise e
         finally:
             visitor.after_field(field)
 
-        results.append(visitor.field(field, result))
-
-    return visitor.struct(obj, results)
+    return result
 
 
 @visit.register(ListType)
 def _(obj: ListType, visitor: SchemaVisitor[T]) -> T:
     """Visit a ListType with a concrete SchemaVisitor"""
+    result = visitor.list(obj)
+
+    visitor.field(obj.element)
+
     visitor.before_list_element(obj.element)
     try:
-        result = visit(obj.element.type, visitor)
+        visit(obj.element.type, visitor)
+    except Exception as e:
+        logger.exception(f"Unable to visit the type: {obj}")
+        raise e
     finally:
         visitor.after_list_element(obj.element)
 
-    return visitor.list(obj, result)
+    return result
 
 
 @visit.register(MapType)
 def _(obj: MapType, visitor: SchemaVisitor[T]) -> T:
     """Visit a MapType with a concrete SchemaVisitor"""
+    result = visitor.map(obj)
+
+    visitor.field(obj.key)
     visitor.before_map_key(obj.key)
     try:
-        key_result = visit(obj.key.type, visitor)
+        visit(obj.key.type, visitor)
+    except Exception as e:
+        logger.exception(f"Unable to visit the map ket type: {obj.key}")
+        raise e
     finally:
         visitor.after_map_key(obj.key)
 
+    visitor.field(obj.value)
     visitor.before_map_value(obj.value)
     try:
-        value_result = visit(obj.value.type, visitor)
+        visit(obj.value.type, visitor)
+    except Exception as e:
+        logger.exception(f"Unable to visit the map value type: {obj.value}")

Review Comment:
   This was in the original code, but it would be best to remove this. If something fails while traversing the schema, we don't want to continue and just return a part of the schema. In that case we just want to throw an exception



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871849169


##########
python/tests/test_schema.py:
##########
@@ -347,3 +351,80 @@ def test_schema_find_type(table_schema_simple):
         == table_schema_simple.find_type("BAZ", case_sensitive=False)
         == BooleanType()
     )
+
+
+def test_build_position_accessors(table_schema_nested):
+    accessors = build_position_accessors(table_schema_nested)
+    assert accessors == {
+        1: Accessor(position=0, inner=None),
+        2: Accessor(position=1, inner=None),
+        3: Accessor(position=2, inner=None),
+        5: Accessor(position=3, inner=Accessor(position=0, inner=None)),
+        7: Accessor(position=4, inner=Accessor(position=0, inner=None)),
+        8: Accessor(position=4, inner=Accessor(position=1, inner=None)),
+        9: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=0, inner=None))),
+        10: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=1, inner=None))),
+        12: Accessor(position=5, inner=Accessor(position=0, inner=None)),
+        13: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=0, inner=None))),
+        14: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=1, inner=None))),
+    }
+
+
+def test_build_position_accessors_list():
+    iceberg_schema = schema.Schema(
+        NestedField(
+            field_id=4,
+            name="qux",
+            field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True),
+            is_optional=True,
+        ),
+        schema_id=1,
+    )
+    accessors = build_position_accessors(iceberg_schema)
+    assert accessors == {
+        5: Accessor(position=0, inner=Accessor(position=0, inner=None)),
+    }
+
+
+def test_build_position_accessors_map():
+    iceberg_schema = schema.Schema(
+        NestedField(
+            field_id=6,
+            name="quux",
+            field_type=MapType(
+                key_id=7,
+                key_type=StringType(),
+                value_id=8,
+                value_type=MapType(
+                    key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True
+                ),
+                value_is_optional=True,
+            ),
+            is_optional=True,
+        ),
+        schema_id=1,
+    )
+    accessors = build_position_accessors(iceberg_schema)
+    assert accessors == {
+        7: Accessor(position=0, inner=Accessor(position=0, inner=None)),
+        8: Accessor(position=0, inner=Accessor(position=1, inner=None)),
+        9: Accessor(position=0, inner=Accessor(position=1, inner=Accessor(position=0, inner=None))),
+        10: Accessor(position=0, inner=Accessor(position=1, inner=Accessor(position=1, inner=None))),
+    }

Review Comment:
   @rdblue this is an example for a map. I would expect id 7 to be in there as it is the key to the map. id 8 doesn't make much sense, since it will point at another map. id 9 and 10 should be in there since they are primitives.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871305407


##########
python/tests/test_schema.py:
##########
@@ -347,3 +351,80 @@ def test_schema_find_type(table_schema_simple):
         == table_schema_simple.find_type("BAZ", case_sensitive=False)
         == BooleanType()
     )
+
+
+def test_build_position_accessors(table_schema_nested):
+    accessors = build_position_accessors(table_schema_nested)
+    assert accessors == {
+        1: Accessor(position=0, inner=None),
+        2: Accessor(position=1, inner=None),
+        3: Accessor(position=2, inner=None),
+        5: Accessor(position=3, inner=Accessor(position=0, inner=None)),
+        7: Accessor(position=4, inner=Accessor(position=0, inner=None)),
+        8: Accessor(position=4, inner=Accessor(position=1, inner=None)),
+        9: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=0, inner=None))),
+        10: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=1, inner=None))),
+        12: Accessor(position=5, inner=Accessor(position=0, inner=None)),
+        13: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=0, inner=None))),
+        14: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=1, inner=None))),
+    }
+
+
+def test_build_position_accessors_list():
+    iceberg_schema = schema.Schema(
+        NestedField(
+            field_id=4,
+            name="qux",
+            field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True),
+            is_optional=True,
+        ),
+        schema_id=1,
+    )
+    accessors = build_position_accessors(iceberg_schema)

Review Comment:
   @rdblue You mentioned that in the Java implementation there is no accessor for the field list. I've changed this on the Python side as well to have the same behavior.



##########
python/src/iceberg/schema.py:
##########
@@ -279,53 +326,54 @@ def visit(obj, visitor: SchemaVisitor[T]) -> T:
 @visit.register(Schema)
 def _(obj: Schema, visitor: SchemaVisitor[T]) -> T:
     """Visit a Schema with a concrete SchemaVisitor"""
-    return visitor.schema(obj, visit(obj.as_struct(), visitor))
+    return visit(obj.as_struct(), visitor)
 
 
 @visit.register(StructType)
 def _(obj: StructType, visitor: SchemaVisitor[T]) -> T:
     """Visit a StructType with a concrete SchemaVisitor"""
-    results = []
+
+    result = visitor.struct(obj)
+
     for field in obj.fields:
+        visitor.field(field)
         visitor.before_field(field)
-        try:
-            result = visit(field.type, visitor)
-        finally:
-            visitor.after_field(field)
+        visit(field.type, visitor)
+        visitor.after_field(field)
 
-        results.append(visitor.field(field, result))
-
-    return visitor.struct(obj, results)
+    return result
 
 
 @visit.register(ListType)
 def _(obj: ListType, visitor: SchemaVisitor[T]) -> T:
     """Visit a ListType with a concrete SchemaVisitor"""
+    result = visitor.list(obj)
+
+    visitor.field(obj.element)
+
     visitor.before_list_element(obj.element)
-    try:
-        result = visit(obj.element.type, visitor)
-    finally:
-        visitor.after_list_element(obj.element)
+    visit(obj.element.type, visitor)
+    visitor.after_list_element(obj.element)
 
-    return visitor.list(obj, result)
+    return result
 
 
 @visit.register(MapType)
 def _(obj: MapType, visitor: SchemaVisitor[T]) -> T:
     """Visit a MapType with a concrete SchemaVisitor"""
+    result = visitor.map(obj)
+
+    visitor.field(obj.key)

Review Comment:
   I've reverted this 👍🏻 



##########
python/src/iceberg/schema.py:
##########
@@ -262,10 +267,45 @@ def primitive(self, primitive: PrimitiveType) -> T:
         ...  # pragma: no cover
 
 
+@dataclass(init=True, eq=True)
+class Accessor:
+    """An accessor for a specific position in a container that implements the StructProtocol"""
+
+    position: int
+    inner: Optional["Accessor"] = None
+
+    def __str__(self):
+        return f"Accessor(position={self.position},inner={self.inner})"
+
+    def __repr__(self):
+        return self.__str__()
+
+    def with_inner(self, accessor: "Accessor") -> "Accessor":
+        parent = copy.deepcopy(self)
+        acc = parent
+        while acc.inner:
+            acc = acc.inner
+        acc.inner = accessor
+        return parent
+
+    def get(self, container: StructProtocol) -> Any:
+        """Returns the value at self.position in `container`
+
+        Args:
+            container(StructProtocol): A container to access at position `self.position`
+
+        Returns:
+            Any: The value at position `self.position` in the container
+        """
+        return container.get(self.position)
+
+
 @singledispatch
 def visit(obj, visitor: SchemaVisitor[T]) -> T:
     """A generic function for applying a schema visitor to any point within a schema
 
+    The function traverses the schema in pre-order fashion

Review Comment:
   @rdblue I've seen the light, thank you. Just reverted it to post-order traversal. It took me a while to wrap my head around it, but I like the fact that there is no state in the visitor itself. This makes is much cleaner. 



##########
python/src/iceberg/schema.py:
##########
@@ -262,10 +267,52 @@ def primitive(self, primitive: PrimitiveType) -> T:
         ...  # pragma: no cover
 
 
+@dataclass(init=True, eq=True)
+class Accessor:
+    """An accessor for a specific position in a container that implements the StructProtocol"""
+
+    position: int
+    inner: Optional["Accessor"] = None
+
+    def __str__(self):
+        return f"Accessor(position={self.position},inner={self.inner})"
+
+    def __repr__(self):
+        return self.__str__()
+
+    def with_inner(self, accessor: "Accessor") -> "Accessor":

Review Comment:
   We don't need this one anymore :)



##########
python/src/iceberg/files.py:
##########
@@ -46,8 +46,10 @@ class FileFormat(Enum):
 class StructProtocol(Protocol):  # pragma: no cover
     """A generic protocol used by accessors to get and set at positions of an object"""
 
+    @abstractmethod

Review Comment:
   Unfortunately not. At least, PyCharm didn't recognize it and wasn't suggesting implementing the abstract classes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871451960


##########
python/src/iceberg/schema.py:
##########
@@ -40,6 +41,8 @@
 
 T = TypeVar("T")
 
+logger = logging.getLogger(__name__)

Review Comment:
   Looks like this is unused. Do we need to add it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871861840


##########
python/tests/test_schema.py:
##########
@@ -347,3 +351,80 @@ def test_schema_find_type(table_schema_simple):
         == table_schema_simple.find_type("BAZ", case_sensitive=False)
         == BooleanType()
     )
+
+
+def test_build_position_accessors(table_schema_nested):
+    accessors = build_position_accessors(table_schema_nested)
+    assert accessors == {
+        1: Accessor(position=0, inner=None),
+        2: Accessor(position=1, inner=None),
+        3: Accessor(position=2, inner=None),
+        5: Accessor(position=3, inner=Accessor(position=0, inner=None)),
+        7: Accessor(position=4, inner=Accessor(position=0, inner=None)),
+        8: Accessor(position=4, inner=Accessor(position=1, inner=None)),
+        9: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=0, inner=None))),
+        10: Accessor(position=4, inner=Accessor(position=1, inner=Accessor(position=1, inner=None))),
+        12: Accessor(position=5, inner=Accessor(position=0, inner=None)),
+        13: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=0, inner=None))),
+        14: Accessor(position=5, inner=Accessor(position=0, inner=Accessor(position=1, inner=None))),
+    }
+
+
+def test_build_position_accessors_list():
+    iceberg_schema = schema.Schema(
+        NestedField(
+            field_id=4,
+            name="qux",
+            field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True),
+            is_optional=True,
+        ),
+        schema_id=1,
+    )
+    accessors = build_position_accessors(iceberg_schema)
+    assert accessors == {
+        5: Accessor(position=0, inner=Accessor(position=0, inner=None)),
+    }
+
+
+def test_build_position_accessors_map():
+    iceberg_schema = schema.Schema(
+        NestedField(
+            field_id=6,
+            name="quux",
+            field_type=MapType(
+                key_id=7,
+                key_type=StringType(),
+                value_id=8,
+                value_type=MapType(
+                    key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True
+                ),
+                value_is_optional=True,
+            ),
+            is_optional=True,
+        ),
+        schema_id=1,
+    )
+    accessors = build_position_accessors(iceberg_schema)
+    assert accessors == {
+        7: Accessor(position=0, inner=Accessor(position=0, inner=None)),
+        8: Accessor(position=0, inner=Accessor(position=1, inner=None)),
+        9: Accessor(position=0, inner=Accessor(position=1, inner=Accessor(position=0, inner=None))),
+        10: Accessor(position=0, inner=Accessor(position=1, inner=Accessor(position=1, inner=None))),
+    }
+
+
+class TestStructProtocol(StructProtocol):

Review Comment:
   I mean just this test class: `TestStructProtocol` -> `TestStruct`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] samredai commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
samredai commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r868676643


##########
python/tests/conftest.py:
##########
@@ -19,7 +19,7 @@
 
 import pytest
 
-from iceberg import schema
+from iceberg.schema import Schema

Review Comment:
   I think it matters more for the files containing actual tests (for example `schema.py` should be tested in a file named `test_schema.py` which should import the module using `from iceberg import schema`). That way when you're looking at an individual test, it's almost always a safe assumption that the objects that start with `schema.` are what's being tested.
   
   That doesn't necessarily apply here in `conftest.py` which is mostly a collection of global fixtures so I think either is fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #4685: Python: First version of the Accessor

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4685:
URL: https://github.com/apache/iceberg/pull/4685#discussion_r871842905


##########
python/src/iceberg/schema.py:
##########
@@ -40,6 +41,8 @@
 
 T = TypeVar("T")
 
+logger = logging.getLogger(__name__)

Review Comment:
   No, let's remove it!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org