You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/22 15:42:13 UTC

[GitHub] [arrow] rok opened a new pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

rok opened a new pull request #8510:
URL: https://github.com/apache/arrow/pull/8510


   > [ARROW-1614](https://issues.apache.org/jira/browse/ARROW-1614): In an Arrow table, we would like to add support for a column that has values cells each containing a tensor value, with all tensors having the same dimensions. These would be stored as a binary value, plus some metadata to store type and shape/strides.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r722227879



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       I removed the `ndim` check and made equality comparison only check for type name (`ext-array-tensor-type`).
   
   > > In that case do we even want to keep ndim for equality comparison?
   >
   > This is a good question. I might lean towards saying not, but I'm not a maintainer. I guess it depends on how the type is used throughout the rest of the Arrow ecosystem -- you mentioned the Compute Engine for example.
   
   I'm not aware of active work on tensor computation in Arrow at the moment so I don't think there were any decisions made on this yet. It is super interesting to see what were the trade-offs in other places (numba, jax) though.
   
   At the moment this is an `ExtensionArray` in `cpp/src/arrow/extension_type_test.cc`. My understanding is there are no plans to make these extension arrays available elsewhere @jorisvandenbossche @wesm ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-1020954761


   > would there be interest in folding the Pandas side of these third-party extensions into Pandas also?
   
   That will be something to discuss in the pandas project. 
   (speaking as a pandas maintainer, for now we mostly encourage creating third-party extensions (that was the whole purpose of formalizing this ExtensionArray interface in pandas), but at some point we should also expand the types in pandas itself. Although I personally think we should rather start with adding a simple List type, than directly a tensor type)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r791700615



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       Hm, I would agree with Joris here. I'll reintroduce parameters into the name.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sjperkins commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
sjperkins commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r722148585



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       > Ah, I misunderstood your suggestion. Why would you need it so loose?
   
   Practically speaking, this means that one cannot have for e.g. a `(10, 5, 4)` shape Tensor and a `(10, 6, 2)` Tensor in separate parquet files in the same dataset -- IIRC the Dataset API will complain that their metadata doesn't agree (due to the strict equality comparison). I do run into these sort of cases in the datasets that I deal with.
   
   More formally, I would argue that parameterising Tensors Types on `shape` and `stride` introduces an infinite number of parameterisations and I'm not sure that this class of parameterisations is useful. This doesn't imply that `shape` and `stride` should not be attributes on a Tensor Type!
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-1016939781


   Indeed I think having a built-in Tensor value type (implemented using extension arrays) in Arrow/pyarrow would be better than having third party projects rolling their own. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sjperkins commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
sjperkins commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r722173704



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       > In that case do we even want to keep ndim for equality comparison?
   
   This is a good question. I might lean towards saying not, but I'm not a maintainer. I guess it depends on how the type is used throughout the rest of the Arrow ecosystem -- you mentioned the Compute Engine for example.
   
   For instance, Numba parameterise their [Array](https://github.com/numba/numba/blob/e617b39a0b4b23d7b69d16f482fd66b4ac6cc307/numba/core/types/npytypes.py#L423) type on `dtype`, `ndim`, `layout`, `readonly` and `alignment`, I would guess because these properties are useful for generating efficient LLVM code.
   
   I'm not suggesting that Arrow should copy Numba's Type parameterisation, but its similar enough to provide food for thought.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-934299009


   > I was thinking even looser:
   > 
   > ```python
   > def __eq__(self, other):
   >     len(self.shape) == len(other.shape)
   > ```
   
   Done.
   We could introduce comparison options in case there would be differing requirements here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-934236327


   > In the context of testing metadata equality withinin multiple parquet files in a dataset, equality on shape and strides may be a very strict requirement. Would relaxing the equality requirement to only compare the number of tensor dimensions negatively impact the design?
   
   Good point. By tensor dimensions you mean shape, right?
   I think it's ok to relax on the strides check. I've pushed a change, see latest commit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r722161976



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       > > Ah, I misunderstood your suggestion. Why would you need it so loose?
   > 
   > Practically speaking, this means that one cannot have for e.g. a `(10, 5, 4)` shape Tensor and a `(10, 6, 2)` Tensor in separate parquet files in the same dataset -- IIRC the Dataset API will complain that their metadata doesn't agree (due to the strict equality comparison). I do run into these sort of cases in the datasets that I deal with.
   > 
   > More formally, I would argue that parameterising Tensors Types on `shape` and `stride` introduces an infinite number of parameterisations and I'm not sure that this class of parameterisations is useful. This doesn't imply that `shape` and `stride` should not be attributes on a Tensor Type!
   
   Got it. Infinite types sounds a bit redundant indeed.
   The only reason I can think of to have dimensions and strides in equality comparison then is if we did some compute kernels that needed to identify type in advance. But even that can probably be solved at runtime.
   
   In that case do we even want to keep `ndim` for equality comparison?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r510457429



##########
File path: python/pyarrow/tests/test_extension_type.py
##########
@@ -76,6 +76,95 @@ def __reduce__(self):
         return MyListType, (self.storage_type,)
 
 
+def _tensor_to_array(obj, dtype):
+    batch_size = obj.shape[0]
+    element_shape = obj.shape[1:]
+    total_num_elements = obj.size
+    num_elements = 1 if len(obj.shape) == 1 else np.prod(element_shape)
+
+    child_buf = pa.py_buffer(obj)
+    child_array = pa.Array.from_buffers(
+        dtype, total_num_elements, [None, child_buf])
+
+    offset_buf = pa.py_buffer(
+        np.int32([i * num_elements for i in range(batch_size + 1)]))
+
+    storage = pa.Array.from_buffers(pa.list_(dtype), batch_size,
+                                    [None, offset_buf], children=[child_array])
+
+    typ = TensorType(element_shape, dtype)
+    return pa.ExtensionArray.from_storage(typ, storage)
+
+
+class TensorArray(pa.ExtensionArray):
+    """
+    Concrete class for Arrow arrays of Tensor data type.
+    """
+
+    @classmethod
+    def from_numpy(cls, obj):
+        """
+        Convert a single contiguous numpy.ndarray to TensorArray.
+        """
+        assert isinstance(obj, np.ndarray)
+        if not obj.flags.c_contiguous:
+            obj = np.ascontiguousarray(obj)
+        dtype = pa.from_numpy_dtype(obj.dtype)
+
+        return _tensor_to_array(obj, dtype)
+
+    @classmethod
+    def from_tensor(cls, obj):
+        """
+        Convert a single contiguous pyarrow.Tensor to a TensorArray.
+        """
+        assert isinstance(obj, pa.Tensor)
+        assert obj.is_contiguous
+        dtype = obj.type
+
+        return _tensor_to_array(obj, dtype)
+
+    def to_numpy(self):
+        """
+        Convert TensorArray to numpy.ndarray.
+        """
+        shape = (len(self),) + self.type.shape
+        buf = self.storage.buffers()[3]
+        storage_list_type = self.storage.type
+        ext_dtype = storage_list_type.value_type.to_pandas_dtype()
+
+        return np.ndarray(shape, buffer=buf, dtype=ext_dtype)
+
+    def to_tensor(self):
+        """
+        Convert TensorArray to pyarrow.Tensor.
+        """
+        return pa.Tensor.from_numpy(self.to_numpy())
+
+
+class TensorType(pa.PyExtensionType):

Review comment:
       Hi @rok , the code in this PR appears to be from https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/arrow_conversion.py#L306. I know you were looking at it to get started for a C++ implementation, so if your intention is to rewrite for C++, then that's fine. If this ends up being part of a Python implementation, I would appreciate being able to contribute that myself rather than duplicate.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-714589639


   https://issues.apache.org/jira/browse/ARROW-1614


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r510879292



##########
File path: python/pyarrow/tests/test_extension_type.py
##########
@@ -76,6 +76,95 @@ def __reduce__(self):
         return MyListType, (self.storage_type,)
 
 
+def _tensor_to_array(obj, dtype):
+    batch_size = obj.shape[0]
+    element_shape = obj.shape[1:]
+    total_num_elements = obj.size
+    num_elements = 1 if len(obj.shape) == 1 else np.prod(element_shape)
+
+    child_buf = pa.py_buffer(obj)
+    child_array = pa.Array.from_buffers(
+        dtype, total_num_elements, [None, child_buf])
+
+    offset_buf = pa.py_buffer(
+        np.int32([i * num_elements for i in range(batch_size + 1)]))
+
+    storage = pa.Array.from_buffers(pa.list_(dtype), batch_size,
+                                    [None, offset_buf], children=[child_array])
+
+    typ = TensorType(element_shape, dtype)
+    return pa.ExtensionArray.from_storage(typ, storage)
+
+
+class TensorArray(pa.ExtensionArray):
+    """
+    Concrete class for Arrow arrays of Tensor data type.
+    """
+
+    @classmethod
+    def from_numpy(cls, obj):
+        """
+        Convert a single contiguous numpy.ndarray to TensorArray.
+        """
+        assert isinstance(obj, np.ndarray)
+        if not obj.flags.c_contiguous:
+            obj = np.ascontiguousarray(obj)
+        dtype = pa.from_numpy_dtype(obj.dtype)
+
+        return _tensor_to_array(obj, dtype)
+
+    @classmethod
+    def from_tensor(cls, obj):
+        """
+        Convert a single contiguous pyarrow.Tensor to a TensorArray.
+        """
+        assert isinstance(obj, pa.Tensor)
+        assert obj.is_contiguous
+        dtype = obj.type
+
+        return _tensor_to_array(obj, dtype)
+
+    def to_numpy(self):
+        """
+        Convert TensorArray to numpy.ndarray.
+        """
+        shape = (len(self),) + self.type.shape
+        buf = self.storage.buffers()[3]
+        storage_list_type = self.storage.type
+        ext_dtype = storage_list_type.value_type.to_pandas_dtype()
+
+        return np.ndarray(shape, buffer=buf, dtype=ext_dtype)
+
+    def to_tensor(self):
+        """
+        Convert TensorArray to pyarrow.Tensor.
+        """
+        return pa.Tensor.from_numpy(self.to_numpy())
+
+
+class TensorType(pa.PyExtensionType):

Review comment:
       Hey @BryanCutler. Indeed this is meant just for discussing the design before implementing in C++.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r722138642



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       Added a `ndim` method and used it for equality comparison. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Hoeze commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
Hoeze commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-1016715324


   fyi, the ray project created its own Tensor type:
   https://docs.ray.io/en/latest/_modules/ray/data/extensions/tensor_extension.html#ArrowTensorArray


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sjperkins commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
sjperkins commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r722126155



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       I think this should be:
   
   ```suggestion
       return this->shape().size() == static_cast<const TensorArrayType&>(other).shape().size();
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-991150389


   @jorisvandenbossche @sjperkins @pitrou is there interest to get this in?
   If yes is `cpp/src/arrow/extension_type_test.cc` a good place to put it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r791482403



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       > I removed the `ndim` check and made equality comparison only check for type name (`ext-array-tensor-type`).
   
   I don't really understand the argument about not being strict in the type equality. 
   
   Currently, the example code here uses `FixedSizeListArray` as the underlying storage type, for _constant_ dimension (and I would say constant shape and strides, as well?) for the full array. 
   
   So if you have multiple files in the same dataset where the shape of the tensors differs between files, how can that ever be combined in a single schema / table? FixedSizeListArray with a different size are not compatible with each other. 
   
   If you want to be flexible regarding the shape of each tensor value, it seems that is rather what is covered in https://issues.apache.org/jira/browse/ARROW-8714, where the storage type could be a struct with one field the data as a variable sized list array and the other field storing the shape of each tensor value.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sjperkins commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
sjperkins commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r722184608



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       Also, I think that Google jax include shape in their Type Parameterisations, AFAIK because the XLA compiler can produce very optimal code when array bounds are known. But it does lead to these kind of sharp edges:
   
   - Search for "functions with argument-value dependent shapes" under https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html#structured-control-flow-primitives
   - https://github.com/google/jax/issues/5100#issuecomment-738904018




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sjperkins commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
sjperkins commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-934277406


   > 
   > 
   > > In the context of testing metadata equality withinin multiple parquet files in a dataset, equality on shape and strides may be a very strict requirement. Would relaxing the equality requirement to only compare the number of tensor dimensions negatively impact the design?
   > 
   > Good point. By tensor dimensions you mean shape, right? I think it's ok to relax on the strides check. I've pushed a change, see latest commit.
   
   I was thinking even looser:
   
   ```python
   def __eq__(self, other):
       len(self.shape) == len(other.shape)
   ```
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-992271954


   Currently we don't ship any standard extension types. I recommend discussing this on the mailing-list.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sjperkins commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
sjperkins commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r722173704



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       > In that case do we even want to keep ndim for equality comparison?
   
   This is a good question. I might lean towards saying not, but I'm not a maintainer. I guess it depends on how the type is used throughout the rest of the Arrow ecosystem -- you mentioned the Compute Engine for example.
   
   For instance, Numba parameterise their [Array](https://github.com/numba/numba/blob/e617b39a0b4b23d7b69d16f482fd66b4ac6cc307/numba/core/types/npytypes.py#L423) type on `dtype`, `ndim`, `layout`, `readonly` and `alignment`, I would guess because these properties are useful for generating efficient LLVM code.
   
   I'm not suggesting that Arrow should copy Numba's Type parameterisation, but there are enough similarities to provide food for thought.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r791482403



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       > I removed the `ndim` check and made equality comparison only check for type name (`ext-array-tensor-type`).
   
   I don't really understand the argument about not being strict in the type equality. 
   
   Currently, the example code here uses `FixedSizeListArray` as the underlying storage type, for _constant_ dimension (and I would say constant shape and strides, as well?) for the full array. 
   
   So if you have multiple files in the same dataset where the shape of the tensors differs between files, how can that ever be combined in a single schema / table? FixedSizeListArray with a different size are not compatible with each other. 
   
   If you want to be flexible regarding the shape of each tensor value, it seems that is rather what is covered in https://issues.apache.org/jira/browse/ARROW-8714, where the storage type could be a struct with one field the data as a variable sized list array and the other field storing the shape of each tensor value.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] frreiss commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
frreiss commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-1020670546


   @wesm would there be interest in folding the Pandas side of these third-party extensions into Pandas also?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sjperkins commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
sjperkins commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-934155105


   In the context of  testing metadata equality withinin multiple parquet files in a dataset, equality on shape and strides may be a very strict requirement. Would relaxing the equality requirement to only compare the number of tensor dimensions negatively impact the design?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r722135047



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       Ah, I misunderstood your suggestion. Why would you need it so loose? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sjperkins commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
sjperkins commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r722148585



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       > Ah, I misunderstood your suggestion. Why would you need it so loose?
   
   More practically, this means that one cannot have for e.g. a `(10, 5, 4)` shape Tensor and a `(10, 6, 2)` Tensor in separate parquet files in the same dataset -- IIRC the Dataset API will complain that their metadata doesn't agree (due to the strict equality comparison). I do run into these sort of cases in the datasets that I deal with.
   
   I would argue that parameterising Tensors Types on `shape` and `stride` introduces an infinite number of parameterisations and I'm not sure that this class of parameterisations is useful. This doesn't imply that `shape` and `stride` should not be attributes on a Tensor Type!
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#discussion_r791700615



##########
File path: cpp/src/arrow/extension_type_test.cc
##########
@@ -333,4 +334,144 @@ TEST_F(TestExtensionType, ValidateExtensionArray) {
   ASSERT_OK(ext_arr4->ValidateFull());
 }
 
+class TensorArray : public ExtensionArray {
+ public:
+  using ExtensionArray::ExtensionArray;
+};
+
+class TensorArrayType : public ExtensionType {
+ public:
+  explicit TensorArrayType(const std::shared_ptr<DataType>& type,
+                           const std::vector<int64_t>& shape,
+                           const std::vector<int64_t>& strides)
+      : ExtensionType(type), type_(type), shape_(shape), strides_(strides) {}
+
+  std::shared_ptr<DataType> type() const { return type_; }
+  std::vector<int64_t> shape() const { return shape_; }
+  std::vector<int64_t> strides() const { return strides_; }
+
+  std::string extension_name() const override {
+    std::stringstream s;
+    s << "ext-array-tensor-type<type=" << *storage_type() << ", shape=(";
+    for (uint64_t i = 0; i < shape_.size(); i++) {
+      s << shape_[i];
+      if (i < shape_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << "), strides=(";
+    for (uint64_t i = 0; i < strides_.size(); i++) {
+      s << strides_[i];
+      if (i < strides_.size() - 1) {
+        s << ", ";
+      }
+    }
+    s << ")>";
+    return s.str();
+  }
+
+  bool ExtensionEquals(const ExtensionType& other) const override {
+    return this->shape() == static_cast<const TensorArrayType&>(other).shape();

Review comment:
       Hm, I would agree with Joris here. I'll reintroduce parameters into the name.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-1020954761


   > would there be interest in folding the Pandas side of these third-party extensions into Pandas also?
   
   That will be something to discuss in the pandas project. 
   (speaking as a pandas maintainer, for now we mostly encourage creating third-party extensions (that was the whole purpose of formalizing this ExtensionArray interface in pandas), but at some point we should also expand the types in pandas itself. Although I personally think we should rather start with adding a simple List type, than directly a tensor type)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] frreiss commented on pull request #8510: ARROW-1614: [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

Posted by GitBox <gi...@apache.org>.
frreiss commented on pull request #8510:
URL: https://github.com/apache/arrow/pull/8510#issuecomment-1020670546


   @wesm would there be interest in folding the Pandas side of these third-party extensions into Pandas also?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org