You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/04/27 04:41:08 UTC

[GitHub] [arrow] mrkn opened a new pull request #7044: ARROW-6485: WIP: [Format][C++] Support the format of a COO sparse matrix that has separated row and column indices

mrkn opened a new pull request #7044:
URL: https://github.com/apache/arrow/pull/7044


   I'd like to add matrix-specialized COO sparse tensor format with split indices.
   It improves the interoperability among scipy because of reducing the copies of index arrays.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #7044: ARROW-6485: [Format][C++] Support the format of a COO sparse tensor that manages its indices in separated vectors

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #7044:
URL: https://github.com/apache/arrow/pull/7044#issuecomment-654824921


   > Not only scipy, but also [SuiteSparse](https://github.com/DrTimothyAldenDavis/SuiteSparse) employs the split format.
   
   I didn't realize, Then we should indeed have proper support for this option.
   
   > > Question: could we handle this as a special case of COO tensor rather than a new type? Could we serialize `(row, col)` data as a single row major tensor of COO type and only deserialize it into the SciPy layout if desired? (I'm asking because I'm not sure if such approach is feasible)
   > 
   > Although we can handle the split-format as an internal variation of SparseCOOIndex, we still need to introduce the new flatbuffer type.
   
   Could we use the original COO flatbuffer type by 'concatenating' vectors when serializing? Concatenated vectors would effectively be like a row-major 2D tensor when serialized. Again - I'm not sure this is feasible with flatbuffers or desirable that's why I ask. The reason I'm interested is that we could avoid introducing another format.
    
   
   > > If we go for a new type - could I propose a name SparseCOOMatrix (as opposed to n-dimensional SparseCOOTensor). It could perhaps be shortened to `COOM`?
   > 
   > The implementation in this pull-request can handle more than 2-dimension.
   
   Oh I see. In case we introduce a new format - your proposal was `SplitCOO` - could we shorten it to `SCO` to keep naming style?
   
   Just thinking out loud: we could also call this format coordinate list - `COL` ([wikipedia](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO)))?
   Or keep `COO` for vector-style and call tensor style `COT` (coordinate tensor).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #7044: ARROW-6485: [Format][C++] Support the format of a COO sparse matrix that has separated row and column indices

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #7044:
URL: https://github.com/apache/arrow/pull/7044#issuecomment-654457797


   Hey @mrkn - sorry I didn't have capacity to reply for a while.
   
   If I remember correctly only SciPy has this architecture of COO index having two vectors instead of one 2D matrix.
   
   Question: could we handle this as a special case of COO tensor rather than a new type? Could we serialize `(row, col)` data as a single row major tensor of COO type and only deserialize it into the SciPy layout if desired? (I'm asking because I'm not sure if such approach is feasible)
   
   If we go for a new type - could I propose a name SparseCOOMatrix (as opposed to n-dimensional SparseCOOTensor). It could perhaps be shortened to `COOM`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mrkn commented on a change in pull request #7044: ARROW-6485: [Format][C++] Support the format of a COO sparse matrix that has separated row and column indices

Posted by GitBox <gi...@apache.org>.
mrkn commented on a change in pull request #7044:
URL: https://github.com/apache/arrow/pull/7044#discussion_r438924682



##########
File path: cpp/src/arrow/tensor/util.h
##########
@@ -0,0 +1,38 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+namespace arrow {
+namespace internal {
+
+template <typename Int>
+inline char const* ordinal_suffix(const Int n) {

Review comment:
       I used this internal utility function to building error messages.
   An example use case is [here](https://github.com/apache/arrow/pull/7044/files#diff-4ff845cf63a5d797badfaff498817e8eR30-R34).
   
   I believe this file is not a suitable location for this function.
   Where is the best place?  May `util/formatting.h` be a candidate?

##########
File path: python/pyarrow/tests/test_sparse_tensor.py
##########
@@ -189,21 +231,50 @@ def test_sparse_coo_tensor_from_dense(dtype_str, arrow_type):
 
     # Test from numpy array
     sparse_tensor = pa.SparseCOOTensor.from_dense_numpy(array)
-    repr(sparse_tensor)

Review comment:
       @rok I guess this kind of `repr` calls in this file are for debugging and needless, so I removed them.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7044: ARROW-6485: [Format][C++] Support the format of a COO sparse tensor that manages its indices in separated vectors

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7044:
URL: https://github.com/apache/arrow/pull/7044#issuecomment-691806935


   @mrkn @rok what is the status of this PR?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mrkn commented on pull request #7044: ARROW-6485: [Format][C++] Support the format of a COO sparse matrix that has separated row and column indices

Posted by GitBox <gi...@apache.org>.
mrkn commented on pull request #7044:
URL: https://github.com/apache/arrow/pull/7044#issuecomment-654582143


   >> If we go for a new type - could I propose a name SparseCOOMatrix (as opposed to n-dimensional SparseCOOTensor). It could perhaps be shortened to COOM?
   >
   > The implementation in this pull-request can handle more than 2-dimension.
   
   Oh, I forgot to change the title of this pull-request. Sorry.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mrkn commented on pull request #7044: ARROW-6485: [Format][C++] Support the format of a COO sparse tensor that manages its indices in separated vectors

Posted by GitBox <gi...@apache.org>.
mrkn commented on pull request #7044:
URL: https://github.com/apache/arrow/pull/7044#issuecomment-691809348


   @emkornfield Presumably this need to fix for the canonical flag added in #7477.  I can do this next week.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mrkn commented on pull request #7044: ARROW-6485: [Format][C++] Support the format of a COO sparse tensor that manages its indices in separated vectors

Posted by GitBox <gi...@apache.org>.
mrkn commented on pull request #7044:
URL: https://github.com/apache/arrow/pull/7044#issuecomment-699815225


   I couldn't begin to work this pull-request last week. But I've started to do it today. I guess I can finish this by this weekend.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #7044: ARROW-6485: [Format][C++] Support the format of a COO sparse matrix that has separated row and column indices

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #7044:
URL: https://github.com/apache/arrow/pull/7044#discussion_r439226180



##########
File path: cpp/src/arrow/ipc/metadata_internal.cc
##########
@@ -882,7 +882,7 @@ Status MakeSparseTensorIndexCOO(FBB& fbb, const SparseCOOIndex& sparse_index,
                                 Offset* fb_sparse_index, size_t* num_buffers) {
   *fb_sparse_index_type = flatbuf::SparseTensorIndex::SparseTensorIndexCOO;
 
-  // We assume that the value type of indices tensor is an integer.
+  // We assume that the value type of indices tensor is an integer type.

Review comment:
       Perhaps: `// We assume that indices tensor has integer value type.` ?

##########
File path: cpp/src/arrow/ipc/metadata_internal.cc
##########
@@ -901,6 +901,35 @@ Status MakeSparseTensorIndexCOO(FBB& fbb, const SparseCOOIndex& sparse_index,
   return Status::OK();
 }
 
+Status MakeSparseTensorIndexSplitCOO(FBB& fbb, const SparseSplitCOOIndex& sparse_index,
+                                     const std::vector<BufferMetadata>& buffers,
+                                     flatbuf::SparseTensorIndex* fb_sparse_index_type,
+                                     Offset* fb_sparse_index, size_t* num_buffers) {
+  *fb_sparse_index_type = flatbuf::SparseTensorIndex::SparseTensorIndexSplitCOO;
+
+  using IntOffset = flatbuffers::Offset<flatbuf::Int>;
+  std::vector<IntOffset> indices_type_offsets;
+  for (const auto& tensor : sparse_index.indices()) {
+    // We assume that the value types of indices tensors is integer types.
+    const auto& type = checked_cast<const IntegerType&>(*tensor->type());
+    auto offset = flatbuf::CreateInt(fbb, type.bit_width(), type.is_signed());
+    indices_type_offsets.push_back(offset);
+  }
+  auto fb_indices_types = fbb.CreateVector(indices_type_offsets);
+
+  // NOTE: buffers contains ndim + 1 items: indices and a data

Review comment:
       `a data` -> `a data value`?

##########
File path: cpp/src/arrow/ipc/metadata_internal.cc
##########
@@ -1329,9 +1383,25 @@ Status GetSparseTensorMetadata(const Buffer& metadata, std::shared_ptr<DataType>
     return Status::IOError(
         "Header-type of flatbuffer-encoded Message is not SparseTensor.");
   }
+
+  const auto ndim_max = static_cast<size_t>(std::numeric_limits<int>::max());
+  if (ndim_max < sparse_tensor->shape()->size()) {
+    return Status::Invalid(
+        "The sparse tensor to be read has too much number of dimensions (",
+        sparse_tensor->shape()->size(), " is given)");

Review comment:
       How about something like `Input tensor has too many dimensions`?

##########
File path: cpp/src/arrow/ipc/metadata_internal.cc
##########
@@ -1292,6 +1326,26 @@ Status GetSparseCOOIndexMetadata(const flatbuf::SparseTensorIndexCOO* sparse_ind
   return IntFromFlatbuffer(sparse_index->indicesType(), indices_type);
 }
 
+Result<std::vector<std::shared_ptr<DataType>>> GetSparseSplitCOOIndexMetadata(
+    const flatbuf::SparseTensorIndexSplitCOO* sparse_index, const size_t ndim) {
+  const auto fb_indices_types = sparse_index->indicesTypes();
+  if (fb_indices_types->size() != ndim) {
+    return Status::Invalid(
+        "The number of indices types in a SparseSplitCOOIndex is inconsistent to the "
+        "number of dimensions");

Review comment:
       `to the` -> `with the`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7044: ARROW-6485: WIP: [Format][C++] Support the format of a COO sparse matrix that has separated row and column indices

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7044:
URL: https://github.com/apache/arrow/pull/7044#issuecomment-619714519


   https://issues.apache.org/jira/browse/ARROW-6485


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mrkn commented on pull request #7044: ARROW-6485: [Format][C++] Support the format of a COO sparse matrix that has separated row and column indices

Posted by GitBox <gi...@apache.org>.
mrkn commented on pull request #7044:
URL: https://github.com/apache/arrow/pull/7044#issuecomment-654581941


   > Hey @mrkn - sorry I didn't have capacity to reply for a while.
   
   No problem. Thank you for your cooperation!
   
   > If I remember correctly only SciPy has this architecture of COO index having two vectors instead of one 2D matrix.
   
   Not only scipy, but also [SuiteSparse](https://github.com/DrTimothyAldenDavis/SuiteSparse) employs the split format.
   
   > Question: could we handle this as a special case of COO tensor rather than a new type? Could we serialize `(row, col)` data as a single row major tensor of COO type and only deserialize it into the SciPy layout if desired? (I'm asking because I'm not sure if such approach is feasible)
   > 
   
   Although we can handle the split-format as an internal variation of SparseCOOIndex, we still need to introduce the new flatbuffer type.
   
   > If we go for a new type - could I propose a name SparseCOOMatrix (as opposed to n-dimensional SparseCOOTensor). It could perhaps be shortened to `COOM`?
   
   The implementation in this pull-request can handle more than 2-dimension.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org