Posted to commits@arrow.apache.org by ko...@apache.org on 2023/01/24 08:14:30 UTC

[arrow] branch master updated: GH-32801: [C++][Docs] Delete outdated .md files (#33829)

This is an automated email from the ASF dual-hosted git repository.

kou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 3cf0b4d48b GH-32801: [C++][Docs] Delete outdated .md files (#33829)
3cf0b4d48b is described below

commit 3cf0b4d48b7fa2e1b25ca869b7e056a8f8525356
Author: Shaheer Ahmad <11...@users.noreply.github.com>
AuthorDate: Tue Jan 24 13:14:23 2023 +0500

    GH-32801: [C++][Docs] Delete outdated .md files (#33829)
    
    
    
    ### Rationale for this change
    According to the issue #32801 , there were some outdated files in cpp/apidoc which I deleted.
    
    ### What changes are included in this PR?
    Deleted Outdated MD Files (HDFS.md plasma.md and tensor_to_py.md)
    
    ### Are these changes tested?
    No.
    
    ### Are there any user-facing changes?
    No.
    
    * Closes: #32801
    
    Authored-by: Shaheer-Ahmd <sh...@gmail.com>
    Signed-off-by: Sutou Kouhei <ko...@clear-code.com>
---
 cpp/apidoc/HDFS.md                   |  83 -------
 cpp/apidoc/tutorials/plasma.md       | 450 -----------------------------------
 cpp/apidoc/tutorials/tensor_to_py.md | 127 ----------
 3 files changed, 660 deletions(-)

diff --git a/cpp/apidoc/HDFS.md b/cpp/apidoc/HDFS.md
deleted file mode 100644
index d3671fb769..0000000000
--- a/cpp/apidoc/HDFS.md
+++ /dev/null
@@ -1,83 +0,0 @@
-<!---
-  Licensed to the Apache Software Foundation (ASF) under one
-  or more contributor license agreements.  See the NOTICE file
-  distributed with this work for additional information
-  regarding copyright ownership.  The ASF licenses this file
-  to you under the Apache License, Version 2.0 (the
-  "License"); you may not use this file except in compliance
-  with the License.  You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing,
-  software distributed under the License is distributed on an
-  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-  KIND, either express or implied.  See the License for the
-  specific language governing permissions and limitations
-  under the License.
--->
-
-## Using Arrow's HDFS (Apache Hadoop Distributed File System) interface
-
-### Build requirements
-
-To build the integration, pass the following option to CMake
-
-```shell
--DARROW_HDFS=on
-```
-
-For convenience, we have bundled `hdfs.h` for libhdfs from Apache Hadoop in
-Arrow's thirdparty. If you wish to build against the `hdfs.h` in your installed
-Hadoop distribution, set the `$HADOOP_HOME` environment variable.
-
-### Runtime requirements
-
-By default, the HDFS client C++ class in `libarrow_io` uses the libhdfs JNI
-interface to the Java Hadoop client. This library is loaded **at runtime**
-(rather than at link / library load time, since the library may not be in your
-LD_LIBRARY_PATH), and relies on some environment variables.
-
-* `HADOOP_HOME`: the root of your installed Hadoop distribution. Often has
-`lib/native/libhdfs.so`.
-* `JAVA_HOME`: the location of your Java SDK installation.
-* `CLASSPATH`: must contain the Hadoop jars. You can set these using:
-
-```shell
-export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
-```
-
-* `ARROW_LIBHDFS_DIR` (optional): explicit location of `libhdfs.so` if it is
-installed somewhere other than `$HADOOP_HOME/lib/native`.
-
-To accommodate distribution-specific nuances, the `JAVA_HOME` variable may be
-set to the root path for the Java SDK, the JRE path itself, or to the directory
-containing the `libjvm` library.
-
-### Mac Specifics
-
-The installed location of Java on OS X can vary; however, the following snippet
-will set it automatically for you:
-
-```shell
-export JAVA_HOME=$(/usr/libexec/java_home)
-```
-
-Homebrew's Hadoop does not have native libs. Apache doesn't build these, so
-users must build Hadoop to get the native libs. See this Stack Overflow
-answer for details:
-
-http://stackoverflow.com/a/40051353/478288
-
-Be sure to include the path to the native libs in `JAVA_LIBRARY_PATH`:
-
-```shell
-export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
-```
-
-If you get an error about needing to install Java 6, then add *BundledApp* and
-*JNI* to the `JVMCapabilities` in `$JAVA_HOME/../Info.plist`. See
-
-https://oliverdowling.com.au/2015/10/09/oracles-jre-8-on-mac-os-x-el-capitan/
-
-https://derflounder.wordpress.com/2015/08/08/modifying-oracles-java-sdk-to-run-java-applications-on-os-x/
diff --git a/cpp/apidoc/tutorials/plasma.md b/cpp/apidoc/tutorials/plasma.md
deleted file mode 100644
index fef4522200..0000000000
--- a/cpp/apidoc/tutorials/plasma.md
+++ /dev/null
@@ -1,450 +0,0 @@
-<!---
-  Licensed under the Apache License, Version 2.0 (the "License");
-  you may not use this file except in compliance with the License.
-  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing, software
-  distributed under the License is distributed on an "AS IS" BASIS,
-  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  See the License for the specific language governing permissions and
-  limitations under the License. See accompanying LICENSE file.
--->
-
-Using the Plasma In-Memory Object Store from C++
-================================================
-
-Apache Arrow offers the ability to share your data structures among multiple
-processes simultaneously through Plasma, an in-memory object store.
-
-Note that **the Plasma API is not stable**.
-
-Plasma clients are processes that run on the same machine as the object store.
-They communicate with the object store over Unix domain sockets, and they read
-and write data in the object store through shared memory.
-
-Plasma objects are immutable once they have been created.
-
-The following goes over the basics so you can begin using Plasma in your big
-data applications.
-
-Starting the Plasma store
--------------------------
-
-To start running the Plasma object store so that clients may
-connect and access the data, run the following command:
-
-```
-plasma_store_server -m 1000000000 -s /tmp/plasma
-```
-
-The `-m` flag specifies the size of the object store in bytes. The `-s` flag
-specifies the path of the Unix domain socket that the store will listen on.
-
-Therefore, the above command initializes a Plasma store with up to 1 GB of
-memory and sets the socket to `/tmp/plasma`.
-
-The Plasma store will remain available as long as the `plasma_store_server` process is
-running in a terminal window. Messages, such as alerts for disconnecting
-clients, may occasionally be output. To stop running the Plasma store, you
-can press `Ctrl-C` in the terminal window.
-
-Alternatively, you can run the Plasma store in the background and ignore all
-message output with the following terminal command:
-
-```
-plasma_store_server -m 1000000000 -s /tmp/plasma 1> /dev/null 2> /dev/null &
-```
-
-The Plasma store will instead run silently in the background. To stop running
-the Plasma store in this case, issue the command below:
-
-```
-killall plasma_store_server
-```
-
-Creating a Plasma client
-------------------------
-
-Now that the Plasma object store is up and running, it is time to make a client
-process connect to it. To use the Plasma object store as a client, your
-application should initialize a `plasma::PlasmaClient` object and tell it to
-connect to the socket specified when starting up the Plasma object store.
-
-```cpp
-#include <plasma/client.h>
-
-using namespace plasma;
-
-int main(int argc, char** argv) {
-  // Start up and connect a Plasma client.
-  PlasmaClient client;
-  ARROW_CHECK_OK(client.Connect("/tmp/plasma"));
-  // Disconnect the Plasma client.
-  ARROW_CHECK_OK(client.Disconnect());
-}
-```
-
-Save this program in a file `test.cc` and compile it with
-
-```
-g++ test.cc `pkg-config --cflags --libs plasma` --std=c++11
-```
-
-Note that multiple clients can be created within the same process.
-
-If the Plasma store is still running, you can now execute the `a.out` executable
-and the store will print something like
-
-```
-Disconnecting client on fd 5
-```
-
-which shows that the client was successfully disconnected.
-
-Object IDs
-----------
-
-The Plasma object store uses twenty-byte identifiers for accessing objects
-stored in shared memory. Each object in the Plasma store should be associated
-with a unique ID. The Object ID is then a key that can be used by **any** client
-to fetch that object from the Plasma store.
-
-Random generation of Object IDs is often good enough to ensure unique IDs.
-For test purposes, you can use the function `random_object_id` from the header
-`plasma/test-util.h` to generate random Object IDs, which uses a global random
-number generator. In your own applications, we recommend generating a string of
-`ObjectID::size()` random bytes using your own random number generator
-and passing them to `ObjectID::from_binary` to generate the ObjectID.
-
-```cpp
-#include <plasma/test-util.h>
-
-// Randomly generate an Object ID.
-ObjectID object_id = random_object_id();
-```
-
-Now, any connected client that knows the object's Object ID can access the
-same object from the Plasma object store. For easy transportation of Object IDs,
-you can convert/serialize an Object ID into a binary string and back as
-follows:
-
-```cpp
-// From ObjectID to binary string
-std::string id_string = object_id.binary();
-
-// From binary string to ObjectID
-ObjectID id_object = ObjectID::from_binary(id_string);
-```
-
-You can also get a human readable representation of ObjectIDs in the same
-format that git uses for commit hashes by running `ObjectID::hex`.
-
-Here is a test program you can run:
-
-```cpp
-#include <iostream>
-#include <string>
-#include <plasma/client.h>
-#include <plasma/test-util.h>
-
-using namespace plasma;
-
-int main(int argc, char** argv) {
-  ObjectID object_id1 = random_object_id();
-  std::cout << "object_id1 is " << object_id1.hex() << std::endl;
-
-  std::string id_string = object_id1.binary();
-  ObjectID object_id2 = ObjectID::from_binary(id_string);
-  std::cout << "object_id2 is " << object_id2.hex() << std::endl;
-}
-```
-
-Creating an Object
-------------------
-
-Now that you learned about Object IDs that are used to refer to objects,
-let's look at how objects can be stored in Plasma.
-
-Storing objects is a two-stage process. First, a buffer is allocated with a call
-to `Create`, and the client constructs the object in place. Then the object is
-made immutable and shared with other clients via a call to `Seal`.
-
-The `Create` call blocks while the Plasma store allocates a buffer of the
-appropriate size. The client will then map the buffer into its own address
-space. At this point the object can be constructed in place using a pointer that
-was written by the `Create` command.
-
-```cpp
-int64_t data_size = 100;
-// The address of the buffer allocated by the Plasma store will be written at
-// this address.
-std::shared_ptr<Buffer> data;
-// Create a Plasma object by specifying its ID and size.
-ARROW_CHECK_OK(client.Create(object_id, data_size, NULL, 0, &data));
-```
-
-You can also specify metadata for the object; the third argument is the
-metadata (as raw bytes) and the fourth argument is the size of the metadata.
-
-```cpp
-// Create a Plasma object with metadata.
-int64_t data_size = 100;
-std::string metadata = "{'author': 'john'}";
-std::shared_ptr<Buffer> data;
-ARROW_CHECK_OK(client.Create(object_id, data_size, (uint8_t*) metadata.data(), metadata.size(), &data));
-```
-
-Now that we've obtained a pointer to our object's data, we can
-write our data to it:
-
-```cpp
-// Write some data for the Plasma object.
-for (int64_t i = 0; i < data_size; i++) {
-    data->mutable_data()[i] = static_cast<uint8_t>(i % 4);
-}
-```
-
-When the client is done, the client **seals** the buffer, making the object
-immutable, and making it available to other Plasma clients:
-
-```cpp
-// Seal the object. This makes it available for all clients.
-ARROW_CHECK_OK(client.Seal(object_id));
-```
-
-Here is an example that combines all these features:
-
-```cpp
-#include <plasma/client.h>
-
-using namespace plasma;
-
-int main(int argc, char** argv) {
-  // Start up and connect a Plasma client.
-  PlasmaClient client;
-  ARROW_CHECK_OK(client.Connect("/tmp/plasma"));
-  // Create an object with a fixed ObjectID.
-  ObjectID object_id = ObjectID::from_binary("00000000000000000000");
-  int64_t data_size = 1000;
-  std::shared_ptr<Buffer> data;
-  std::string metadata = "{'author': 'john'}";
-  ARROW_CHECK_OK(client.Create(object_id, data_size, (uint8_t*) metadata.data(), metadata.size(), &data));
-  // Write some data into the object.
-  auto d = data->mutable_data();
-  for (int64_t i = 0; i < data_size; i++) {
-    d[i] = static_cast<uint8_t>(i % 4);
-  }
-  // Seal the object.
-  ARROW_CHECK_OK(client.Seal(object_id));
-  // Disconnect the client.
-  ARROW_CHECK_OK(client.Disconnect());
-}
-```
-
-This example can be compiled with
-
-```
-g++ create.cc `pkg-config --cflags --libs plasma` --std=c++11 -o create
-```
-
-To verify that an object exists in the Plasma object store, you can
-call `PlasmaClient::Contains()` to check if an object has
-been created and sealed for a given Object ID. Note that this function
-will still report `false` if the object has been created but not yet
-sealed:
-
-```cpp
-// Check if an object has been created and sealed.
-bool has_object;
-client.Contains(object_id, &has_object);
-if (has_object) {
-    // Object has been created and sealed, proceed
-}
-```
-
-Getting an Object
------------------
-
-After an object has been sealed, any client who knows the Object ID can get
-the object. To store the retrieved object contents, you should create an
-`ObjectBuffer`, then call `PlasmaClient::Get()` as follows:
-
-```cpp
-// Get from the Plasma store by Object ID.
-ObjectBuffer object_buffer;
-client.Get(&object_id, 1, -1, &object_buffer);
-```
-
-`PlasmaClient::Get()` isn't limited to fetching a single object
-from the Plasma store at once. You can specify an array of Object IDs and
-`ObjectBuffers` to fetch at once, so long as you also specify the
-number of objects being fetched:
-
-```cpp
-// Get two objects at once from the Plasma store. This function
-// call will block until both objects have been fetched.
-ObjectBuffer multiple_buffers[2];
-ObjectID multiple_ids[2] = {object_id1, object_id2};
-client.Get(multiple_ids, 2, -1, multiple_buffers);
-```
-
-Since `PlasmaClient::Get()` is a blocking function call, it may be
-necessary to limit the amount of time the function is allowed to take
-when trying to fetch from the Plasma store. You can pass in a timeout
-in milliseconds when calling `PlasmaClient::Get()`. To use `PlasmaClient::Get()`
-without a timeout, just pass in -1 like in the previous example calls:
-
-```cpp
-// Make the function call give up fetching the object if it takes
-// more than 100 milliseconds.
-int64_t timeout = 100;
-client.Get(&object_id, 1, timeout, &object_buffer);
-```
-
-Finally, to access the object, you can access the `data` and
-`metadata` attributes of the `ObjectBuffer`. The `data` can be indexed
-like any array:
-
-```cpp
-// Access object data.
-const uint8_t* data = object_buffer.data->data();
-int64_t data_size = object_buffer.data->size();
-
-// Access object metadata.
-const uint8_t* metadata = object_buffer.metadata->data();
-int64_t metadata_size = object_buffer.metadata->size();
-
-// Index into data array.
-uint8_t first_data_byte = data[0];
-```
-
-Here is a longer example that shows these capabilities:
-
-```cpp
-#include <plasma/client.h>
-
-using namespace plasma;
-
-int main(int argc, char** argv) {
-  // Start up and connect a Plasma client.
-  PlasmaClient client;
-  ARROW_CHECK_OK(client.Connect("/tmp/plasma"));
-  ObjectID object_id = ObjectID::from_binary("00000000000000000000");
-  ObjectBuffer object_buffer;
-  ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));
-
-  // Retrieve object data.
-  auto buffer = object_buffer.data;
-  const uint8_t* data = buffer->data();
-  int64_t data_size = buffer->size();
-
-  // Check that the data agrees with what was written in the other process.
-  for (int64_t i = 0; i < data_size; i++) {
-    ARROW_CHECK(data[i] == static_cast<uint8_t>(i % 4));
-  }
-
-  // Disconnect the client.
-  ARROW_CHECK_OK(client.Disconnect());
-}
-```
-
-If you compile it with
-
-```
-g++ get.cc `pkg-config --cflags --libs plasma` --std=c++11 -o get
-```
-
-and run it with `./get`, all the assertions will pass if you run the `create`
-example from above on the same Plasma store.
-
-
-Object Lifetime Management
---------------------------
-
-The Plasma store internally does reference counting to make sure objects that
-are mapped into the address space of one of the clients with `PlasmaClient::Get`
-are accessible. To unmap objects from a client, call `PlasmaClient::Release`.
-All objects that are mapped into a client's address space will automatically
-be released when the client is disconnected from the store (this happens even
-if the client process crashes or otherwise fails to call `Disconnect`).
-
-If a new object is created and there is not enough space in the Plasma store,
-the store will evict the least recently used object (an object is in use if at
-least one client has gotten it but not released it).
-
-Object notifications
---------------------
-
-Additionally, you can arrange to have Plasma notify you when objects are
-sealed in the object store. This may especially be handy when your
-program is collaborating with other Plasma clients, and needs to know
-when they make objects available.
-
-First, you can subscribe your current Plasma client to such notifications
-by getting a file descriptor:
-
-```cpp
-// Start receiving notifications into file_descriptor.
-int fd;
-ARROW_CHECK_OK(client.Subscribe(&fd));
-```
-
-Once you have the file descriptor, you can have your current Plasma client
-wait to receive the next object notification. Object notifications
-include information such as Object ID, data size, and metadata size of
-the next newly available object:
-
-```cpp
-// Receive notification of the next newly available object.
-// Notification information is stored in object_id, data_size, and metadata_size
-ObjectID object_id;
-int64_t data_size;
-int64_t metadata_size;
-ARROW_CHECK_OK(client.GetNotification(fd, &object_id, &data_size, &metadata_size));
-
-// Get the newly available object.
-ObjectBuffer object_buffer;
-ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));
-```
-
-Here is a full program that shows this capability:
-
-```cpp
-#include <plasma/client.h>
-
-using namespace plasma;
-
-int main(int argc, char** argv) {
-  // Start up and connect a Plasma client.
-  PlasmaClient client;
-  ARROW_CHECK_OK(client.Connect("/tmp/plasma"));
-
-  int fd;
-  ARROW_CHECK_OK(client.Subscribe(&fd));
-
-  ObjectID object_id;
-  int64_t data_size;
-  int64_t metadata_size;
-  while (true) {
-    ARROW_CHECK_OK(client.GetNotification(fd, &object_id, &data_size, &metadata_size));
-
-    std::cout << "Received object notification for object_id = "
-              << object_id.hex() << ", with data_size = " << data_size
-              << ", and metadata_size = " << metadata_size << std::endl;
-  }
-
-  // Disconnect the client.
-  ARROW_CHECK_OK(client.Disconnect());
-}
-```
-
-If you compile it with
-
-```
-g++ subscribe.cc `pkg-config --cflags --libs plasma` --std=c++11 -o subscribe
-```
-
-and invoke `./create` and `./subscribe` while the Plasma store is running,
-you can observe the new object arriving.
diff --git a/cpp/apidoc/tutorials/tensor_to_py.md b/cpp/apidoc/tutorials/tensor_to_py.md
deleted file mode 100644
index cd191fea07..0000000000
--- a/cpp/apidoc/tutorials/tensor_to_py.md
+++ /dev/null
@@ -1,127 +0,0 @@
-<!---
-  Licensed under the Apache License, Version 2.0 (the "License");
-  you may not use this file except in compliance with the License.
-  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing, software
-  distributed under the License is distributed on an "AS IS" BASIS,
-  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  See the License for the specific language governing permissions and
-  limitations under the License. See accompanying LICENSE file.
--->
-
-Use Plasma to Access Tensors from C++ in Python
-==============================================
-
-This short tutorial shows how to use Arrow and the Plasma Store to send data
-from C++ to Python.
-
-In detail, we will show how to:
-1. Serialize a floating-point array in C++ into an Arrow tensor
-2. Save the Arrow tensor to Plasma
-3. Access the Tensor in a Python process
-
-This approach has the advantage that multiple Python processes can all read
-the tensor without copying it. Therefore, only one copy is necessary when we
-send a tensor from one C++ process to many Python processes.
-
-
-Step 0: Set up
-------
-We will include the following header files and construct a Plasma client.
-
-```cpp
-#include <plasma/client.h>
-#include <arrow/tensor.h>
-#include <arrow/array.h>
-#include <arrow/buffer.h>
-#include <arrow/io/memory.h>
-#include <arrow/ipc/writer.h>
-
-PlasmaClient client_;
-ARROW_CHECK_OK(client_.Connect("/tmp/plasma", "", 0));
-```
-
-
-Step 1: Serialize a floating point array in C++ into an Arrow Tensor
---------------------------------------------------------------------
-In this step, we will construct a floating-point array in C++.
-
-```cpp
-// Generate an Object ID for Plasma
-ObjectID object_id = ObjectID::from_binary("11111111111111111111");
-
-// Generate Float Array
-int64_t input_length = 1000;
-std::vector<float> input(input_length);
-for (int64_t i = 0; i < input_length; ++i) {
-  input[i] = 2.0;
-}
-
-// Create Arrow Tensor Object, no copy made!
-// {input_length} is the shape of the tensor
-auto value_buffer = Buffer::Wrap<float>(input);
-Tensor t(float32(), value_buffer, {input_length});
-```
-
-Step 2: Save the Arrow Tensor to Plasma In-Memory Object Store
---------------------------------------------------------------
-Continuing from Step 1, this step will save the tensor to Plasma Store. We
-use `arrow::ipc::WriteTensor` to write the data.
-
-The variable `meta_len` will contain the length of the tensor metadata
-after the call to `arrow::ipc::WriteTensor`.
-
-```cpp
-// Get the size of the tensor to be stored in Plasma
-int64_t datasize;
-ARROW_CHECK_OK(ipc::GetTensorSize(t, &datasize));
-int32_t meta_len = 0;
-
-// Create the Plasma Object
-// Plasma is responsible for initializing and resizing the buffer
-// This buffer will contain the _serialized_ tensor
-std::shared_ptr<Buffer> buffer;
-ARROW_CHECK_OK(
-    client_.Create(object_id, datasize, NULL, 0, &buffer));
-
-// Writing Process, this will copy the tensor into Plasma
-io::FixedSizeBufferWriter stream(buffer);
-ARROW_CHECK_OK(arrow::ipc::WriteTensor(t, &stream, &meta_len, &datasize));
-
-// Seal Plasma Object
-// This computes a hash of the object data by default
-ARROW_CHECK_OK(client_.Seal(object_id));
-```
-
-Step 3: Access the Tensor in a Python Process
----------------------------------------------
-In Python, we will construct a Plasma client and point it to the store's socket.
-The `inputs` variable will be a list of Object IDs in their raw byte string form.
-
-```python
-import pyarrow as pa
-import pyarrow.plasma as plasma
-
-plasma_client = plasma.connect('/tmp/plasma')
-
-# inputs: a list of object ids
-inputs = [20 * b'1']
-
-# Construct Object ID and perform a batch get
-object_ids = [plasma.ObjectID(inp) for inp in inputs]
-buffers = plasma_client.get_buffers(object_ids)
-
-# Read the tensor and convert to numpy array for each object
-arrs = []
-for buffer in buffers:
-    reader = pa.BufferReader(buffer)
-    t = pa.read_tensor(reader)
-    arr = t.to_numpy()
-    arrs.append(arr)
-
-# arrs is now a list of numpy arrays
-assert np.all(arrs[0] == 2.0 * np.ones(1000, dtype="float32"))
-```