Posted to commits@arrow.apache.org by pc...@apache.org on 2018/09/07 23:17:20 UTC

[arrow] branch master updated: ARROW-3127: [Doc] Add Tutorial for Sending Tensor from C++ to Python

This is an automated email from the ASF dual-hosted git repository.

pcmoritz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new f3247e8  ARROW-3127: [Doc] Add Tutorial for Sending Tensor from C++ to Python
f3247e8 is described below

commit f3247e8c8b6f3959b4c44a33ab98388b370f0daa
Author: Philipp Moritz <pc...@gmail.com>
AuthorDate: Fri Sep 7 16:17:11 2018 -0700

    ARROW-3127: [Doc] Add Tutorial for Sending Tensor from C++ to Python
    
    This PR adds a short tutorial showing how to
    1. Serialize a floating-point array in C++ into Tensor
    2. Save the Tensor to Plasma
    3. Access the Tensor in Python
    
    cc @pcmoritz
    
    Author: Philipp Moritz <pc...@gmail.com>
    Author: Simon Mo <xm...@berkeley.edu>
    
    Closes #2481 from simon-mo/arrow_tensor_doc and squashes the following commits:
    
    73cb8fe7 <Philipp Moritz> some small fixes
    5ad5f3dc <Simon Mo> Add Initial Draft of the Tutorial
---
 cpp/apidoc/index.md                  |   1 +
 cpp/apidoc/tutorials/tensor_to_py.md | 130 +++++++++++++++++++++++++++++++++++
 2 files changed, 131 insertions(+)

diff --git a/cpp/apidoc/index.md b/cpp/apidoc/index.md
index 25be1f2..46ee500 100644
--- a/cpp/apidoc/index.md
+++ b/cpp/apidoc/index.md
@@ -40,6 +40,7 @@ Table of Contents
  * Tutorials
    * [Convert a vector of row-wise data into an Arrow table](tutorials/row_wise_conversion.md)
    * [Using the Plasma In-Memory Object Store](tutorials/plasma.md)
+   * [Use Plasma to Access Tensors from C++ in Python](tutorials/tensor_to_py.md)
 
 Getting Started
 ---------------
diff --git a/cpp/apidoc/tutorials/tensor_to_py.md b/cpp/apidoc/tutorials/tensor_to_py.md
new file mode 100644
index 0000000..e7a7416
--- /dev/null
+++ b/cpp/apidoc/tutorials/tensor_to_py.md
@@ -0,0 +1,130 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+Use Plasma to Access Tensors from C++ in Python
+===============================================
+
+This short tutorial shows how to use Arrow and the Plasma Store to send data
+from C++ to Python.
+
+In detail, we will show how to:
+1. Serialize a floating-point array in C++ into an Arrow tensor
+2. Save the Arrow tensor to Plasma
+3. Access the Arrow tensor in a Python process
+
+This approach has the advantage that multiple Python processes can all read
+the tensor with zero-copy. Only one copy, the write into Plasma, is needed to
+send a tensor from one C++ process to many Python processes.
+
+
+Step 0: Set up
+--------------
+We will include the following header files and construct a Plasma client.
+
+```cpp
+#include <plasma/client.h>
+#include <arrow/tensor.h>
+#include <arrow/array.h>
+#include <arrow/buffer.h>
+#include <arrow/io/memory.h>
+#include <arrow/ipc/writer.h>
+
+PlasmaClient client_;
+ARROW_CHECK_OK(client_.Connect("/tmp/plasma", "", 0));
+```
+
+
+Step 1: Serialize a floating-point array in C++ into an Arrow Tensor
+--------------------------------------------------------------------
+In this step, we construct a floating-point array in C++ and wrap it in an Arrow tensor without copying the data.
+
+```cpp
+// Generate an Object ID for Plasma
+ObjectID object_id = ObjectID::from_binary("11111111111111111111");
+
+// Generate Float Array
+int64_t input_length = 1000;
+std::vector<float> input(input_length);
+for (int64_t i = 0; i < input_length; ++i) {
+  input[i] = 2.0f;
+}
+
+// Cast float array to bytes array
+const uint8_t* bytes_array = reinterpret_cast<const uint8_t*>(input.data());
+
+// Create Arrow Tensor Object, no copy made!
+// {input_length} is the shape of the tensor
+auto value_buffer = std::make_shared<Buffer>(bytes_array, sizeof(float) * input_length);
+Tensor t(float32(), value_buffer, {input_length});
+```
+
+Step 2: Save the Arrow Tensor to Plasma In-Memory Object Store
+--------------------------------------------------------------
+Continuing from Step 1, this step saves the tensor to the Plasma store. We
+use `arrow::ipc::WriteTensor` to write the data.
+
+The variable `meta_len` will contain the length of the tensor metadata
+after the call to `arrow::ipc::WriteTensor`.
+
+```cpp
+// Get the size of the tensor to be stored in Plasma
+int64_t datasize;
+ARROW_CHECK_OK(ipc::GetTensorSize(t, &datasize));
+int32_t meta_len = 0;
+
+// Create the Plasma Object
+// Plasma is responsible for initializing and resizing the buffer
+// This buffer will contain the _serialized_ tensor
+std::shared_ptr<Buffer> buffer;
+ARROW_CHECK_OK(
+    client_.Create(object_id, datasize, nullptr, 0, &buffer));
+
+// Writing Process, this will copy the tensor into Plasma
+io::FixedSizeBufferWriter stream(buffer);
+ARROW_CHECK_OK(arrow::ipc::WriteTensor(t, &stream, &meta_len, &datasize));
+
+// Seal Plasma Object
+// This computes a hash of the object data by default
+ARROW_CHECK_OK(client_.Seal(object_id));
+```
+
+Step 3: Access the Tensor in a Python Process
+---------------------------------------------
+In Python, we will construct a Plasma client and point it to the store's socket.
+The `inputs` variable will be a list of Object IDs in their raw byte string form.
+
+```python
+import numpy as np
+import pyarrow as pa
+import pyarrow.plasma as plasma
+
+plasma_client = plasma.connect('/tmp/plasma', '', 0)
+# inputs: a list of object ids
+inputs = [20 * b'1']
+
+# Construct Object ID and perform a batch get
+object_ids = [plasma.ObjectID(inp) for inp in inputs]
+buffers = plasma_client.get_buffers(object_ids)
+
+# Read the tensor and convert to numpy array for each object
+arrs = []
+for buffer in buffers:
+    reader = pa.BufferReader(buffer)
+    t = pa.read_tensor(reader)
+    arr = t.to_numpy()
+    arrs.append(arr)
+
+# arrs is now a list of numpy arrays
+assert np.all(arrs[0] == 2.0 * np.ones(1000, dtype="float32"))
+```