Posted to commits@arrow.apache.org by we...@apache.org on 2020/05/04 18:00:24 UTC

[arrow-site] branch master updated (8c83c4a -> b944a1d)

This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git.


 discard 8c83c4a  ARROW-7847: [Website] Add blog post about fuzzing the IPC layer
     new b944a1d  ARROW-8023: [Website] Add blog post about the C data interface (#50)

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (8c83c4a)
            \
             N -- N -- N   refs/heads/master (b944a1d)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revision listed above as "new" is entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:

[arrow-site] 01/01: ARROW-8023: [Website] Add blog post about the C data interface (#50)

Posted by we...@apache.org.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git

commit b944a1d2ed196ba71b1331fece76ef1b78a4b633
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Mon May 4 12:58:47 2020 -0500

    ARROW-8023: [Website] Add blog post about the C data interface (#50)
---
 ...020-05-04-introducing-arrow-c-data-interface.md | 160 +++++++++++++++++++++
 1 file changed, 160 insertions(+)

diff --git a/_posts/2020-05-04-introducing-arrow-c-data-interface.md b/_posts/2020-05-04-introducing-arrow-c-data-interface.md
new file mode 100644
index 0000000..83136db
--- /dev/null
+++ b/_posts/2020-05-04-introducing-arrow-c-data-interface.md
@@ -0,0 +1,160 @@
+---
+layout: post
+title: "Introducing the Apache Arrow C Data Interface"
+description: "This post introduces the Arrow C Data Interface, a simple C-based
+interoperability standard to simplify interactions between independent users
+and implementors of the Arrow in-memory format."
+date: "2020-05-04 00:00:00 +0100"
+author: apitrou
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+Apache Arrow includes a cross-language, platform-independent in-memory
+[columnar format](https://arrow.apache.org/docs/format/Columnar.html)
+allowing zero-copy data sharing and transfer between heterogeneous runtimes
+and applications.
+
+The easiest way to use the Arrow columnar format has always been to depend
+on one of the concrete implementations developed by the Apache Arrow community.
+The project codebase contains libraries for 11 different programming languages
+so far, and will likely grow to include more languages in the future.
+
+However, some projects may wish to import and export the Arrow columnar format
+without taking on a new library dependency, such as the Arrow C++ library.
+We have therefore designed an alternative which exchanges data at the C level,
+conforming to a simple data definition.  The C Data Interface carries no dependencies
+except a shared C ABI between the binaries which use it.  C ABIs are platform-wide
+standards: every compiler targeting a platform must adhere to them, and they are
+extremely stable, which ensures portability of libraries and executable binaries.
+Two libraries that use the C structures defined by the C Data Interface can perform
+zero-copy data transfers at runtime without any build-time or link-time dependency
+requirements.
+
+The best way to learn about the C Data Interface is to read the
+[spec](https://arrow.apache.org/docs/format/CDataInterface.html).
+Here, we briefly go over its strong points.
+
+## Two simple struct definitions
+
+To interact with the C Data Interface at the C or C++ level, the only
+thing you have to include in your code is two struct type declarations
+(and a couple of `#define`s for constant values).  Those declarations
+only depend on standard C types, and can simply be pasted in a header
+file.  Other languages can also participate as long as they provide a
+Foreign Function Interface layer; this is the case for most modern
+languages, such as Python (with `ctypes` or `cffi`), Julia, Rust, Go, etc.
+
+## Zero-copy data sharing
+
+The C Data Interface passes Arrow data buffers through memory pointers.  So,
+by construction, it allows you to share data from one runtime to
+another without copying it.  Since the data is in standard
+[Arrow in-memory format](https://arrow.apache.org/docs/format/Columnar.html),
+its layout is well-defined and unambiguous.
+
+This design also restricts the C Data Interface to *in-process* data sharing.
+For interprocess communication, we recommend using the Arrow
+[IPC format](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc).
+
+## Reduced marshalling
+
+The C Data Interface stays close to the natural way of expressing Arrow-like
+data in C or C++.  Only two aspects involve non-trivial marshalling:
+
+* the encoding of data types, using a very simple string-based language
+* the encoding of optional metadata, using a very simple length-prefixed format
+
+## Separate type and data representation
+
+For applications which produce many instances of data of a single datatype
+(for example, as a stream of record batches), repeatedly reconstructing the
+datatype from its string encoding would represent unnecessary overhead.  To
+address this use case, the C Data Interface defines two independent structures:
+one representing a datatype (and optional metadata), one representing a piece
+of data.
+
+## Lifetime handling
+
+One common difficulty of data sharing between heterogeneous runtimes is
+correctly handling the lifetime of data.  The C Data Interface allows the producer
+to define its own memory management scheme through a release callback.
+This is a simple function pointer which consumers will call when they are
+finished using the data.  For example, when used as a producer, the Arrow C++
+library passes a release callback which simply decrements a `shared_ptr`'s
+reference count.
+
+## Application: passing data between R and Python
+
+The R and Python Arrow libraries are both based on the Arrow C++ library;
+however, their respective toolchains (mandated by the R and Python packaging
+standards) are ABI-incompatible.  It is therefore impossible to pass data
+directly at the C++ level between the R and Python bindings.
+
+Using the C Data Interface, we have circumvented this restriction and provide
+a zero-copy data sharing API between R and Python.  It is based on the R
+[`reticulate`](https://rstudio.github.io/reticulate/) library.
+
+Here is an example session mixing R and Python library calls:
+
+```r
+library(arrow)
+library(reticulate)
+use_virtualenv("arrow")
+pa <- import("pyarrow")
+
+# Create an array in PyArrow
+a <- pa$array(c(1, 2, 3))
+a
+
+## Array
+## <double>
+## [
+##   1,
+##   2,
+##   3
+## ]
+
+# Apply R methods on the PyArrow-created array:
+a[a > 1]
+
+## Array
+## <double>
+## [
+##   2,
+##   3
+## ]
+
+# Create an array in R and pass it to PyArrow
+b <- Array$create(c(5, 6, 7))
+a_and_b <- pa$concat_arrays(r_to_py(list(a, b)))
+a_and_b
+
+## Array
+## <double>
+## [
+##   1,
+##   2,
+##   3,
+##   5,
+##   6,
+##   7
+## ]
+```