Posted to commits@arrow.apache.org by gi...@apache.org on 2020/04/22 23:55:43 UTC
[arrow-site] branch asf-site updated: Updating built site (build f45dd485ccb491ea2681d369a0258f44cc560ac6)

This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new e9e47f1  Updating built site (build f45dd485ccb491ea2681d369a0258f44cc560ac6)
e9e47f1 is described below

commit e9e47f1c43a9c5c60934c496a9d4e333f65958f2
Author: Neal Richardson <ne...@gmail.com>
AuthorDate: Wed Apr 22 23:55:32 2020 +0000

    Updating built site (build f45dd485ccb491ea2681d369a0258f44cc560ac6)
---
 ...manifest-07b3643e10d26ac1b64aff61ff62464d.json} |   2 +-
 blog/2020/04/21/0.17.0-release/index.html          | 496 +++++++++++++++++++++
 blog/index.html                                    |  15 +
 feed.xml                                           | 453 ++++++++++---------
 4 files changed, 757 insertions(+), 209 deletions(-)

diff --git a/assets/.sprockets-manifest-9f55fef5b0b2da26929349fe08192161.json b/assets/.sprockets-manifest-07b3643e10d26ac1b64aff61ff62464d.json
similarity index 79%
rename from assets/.sprockets-manifest-9f55fef5b0b2da26929349fe08192161.json
rename to assets/.sprockets-manifest-07b3643e10d26ac1b64aff61ff62464d.json
index ec8cb97..2bf02f9 100644
--- a/assets/.sprockets-manifest-9f55fef5b0b2da26929349fe08192161.json
+++ b/assets/.sprockets-manifest-07b3643e10d26ac1b64aff61ff62464d.json
@@ -1 +1 @@
-{"files":{"main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js":{"logical_path":"main.js","mtime":"2020-04-21T08:23:41-04:00","size":124531,"digest":"18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33","integrity":"sha256-GM0wKVV/c8HuguQRExJ7BPb82ExW2dsMucQOvibvbjM="}},"assets":{"main.js":"main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js"}}
\ No newline at end of file
+{"files":{"main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js":{"logical_path":"main.js","mtime":"2020-04-22T19:55:24-04:00","size":124531,"digest":"18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33","integrity":"sha256-GM0wKVV/c8HuguQRExJ7BPb82ExW2dsMucQOvibvbjM="}},"assets":{"main.js":"main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js"}}
\ No newline at end of file
diff --git a/blog/2020/04/21/0.17.0-release/index.html b/blog/2020/04/21/0.17.0-release/index.html
new file mode 100644
index 0000000..cd67589
--- /dev/null
+++ b/blog/2020/04/21/0.17.0-release/index.html
@@ -0,0 +1,496 @@
+<!DOCTYPE html>
+<html lang="en-US">
+  <head>
+    <meta charset="UTF-8">
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <!-- The above meta tags *must* come first in the head; any other head content must come *after* these tags -->
+    
+    <title>Apache Arrow 0.17.0 Release | Apache Arrow</title>
+    
+    
+    <!-- Begin Jekyll SEO tag v2.6.1 -->
+<meta name="generator" content="Jekyll v3.8.4" />
+<meta property="og:title" content="Apache Arrow 0.17.0 Release" />
+<meta name="author" content="pmc" />
+<meta property="og:locale" content="en_US" />
+<meta name="description" content="The Apache Arrow team is pleased to announce the 0.17.0 release. This covers over 2 months of development work and includes 569 resolved issues from 79 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog. Community Since the 0 [...]
+<meta property="og:description" content="The Apache Arrow team is pleased to announce the 0.17.0 release. This covers over 2 months of development work and includes 569 resolved issues from 79 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog. Community Sinc [...]
+<link rel="canonical" href="https://arrow.apache.org/blog/2020/04/21/0.17.0-release/" />
+<meta property="og:url" content="https://arrow.apache.org/blog/2020/04/21/0.17.0-release/" />
+<meta property="og:site_name" content="Apache Arrow" />
+<meta property="og:image" content="https://arrow.apache.org/img/arrow.png" />
+<meta property="og:type" content="article" />
+<meta property="article:published_time" content="2020-04-21T02:00:00-04:00" />
+<meta name="twitter:card" content="summary_large_image" />
+<meta property="twitter:image" content="https://arrow.apache.org/img/arrow.png" />
+<meta property="twitter:title" content="Apache Arrow 0.17.0 Release" />
+<meta name="twitter:site" content="@ApacheArrow" />
+<meta name="twitter:creator" content="@pmc" />
+<script type="application/ld+json">
+{"headline":"Apache Arrow 0.17.0 Release","dateModified":"2020-04-21T02:00:00-04:00","datePublished":"2020-04-21T02:00:00-04:00","publisher":{"@type":"Organization","logo":{"@type":"ImageObject","url":"https://arrow.apache.org/img/logo.png"},"name":"pmc"},"@type":"BlogPosting","mainEntityOfPage":{"@type":"WebPage","@id":"https://arrow.apache.org/blog/2020/04/21/0.17.0-release/"},"description":"The Apache Arrow team is pleased to announce the 0.17.0 release. This covers over 2 months of d [...]
+<!-- End Jekyll SEO tag -->
+
+
+    <!-- favicons -->
+    <link rel="icon" type="image/png" sizes="16x16" href="/img/favicon-16x16.png" id="light1">
+    <link rel="icon" type="image/png" sizes="32x32" href="/img/favicon-32x32.png" id="light2">
+    <link rel="apple-touch-icon" type="image/png" sizes="180x180" href="/img/apple-touch-icon.png" id="light3">
+    <link rel="apple-touch-icon" type="image/png" sizes="120x120" href="/img/apple-touch-icon-120x120.png" id="light4">
+    <link rel="apple-touch-icon" type="image/png" sizes="76x76" href="/img/apple-touch-icon-76x76.png" id="light5">
+    <link rel="apple-touch-icon" type="image/png" sizes="60x60" href="/img/apple-touch-icon-60x60.png" id="light6">
+    <!-- dark mode favicons -->
+    <link rel="icon" type="image/png" sizes="16x16" href="/img/favicon-16x16-dark.png" id="dark1">
+    <link rel="icon" type="image/png" sizes="32x32" href="/img/favicon-32x32-dark.png" id="dark2">
+    <link rel="apple-touch-icon" type="image/png" sizes="180x180" href="/img/apple-touch-icon-dark.png" id="dark3">
+    <link rel="apple-touch-icon" type="image/png" sizes="120x120" href="/img/apple-touch-icon-120x120-dark.png" id="dark4">
+    <link rel="apple-touch-icon" type="image/png" sizes="76x76" href="/img/apple-touch-icon-76x76-dark.png" id="dark5">
+    <link rel="apple-touch-icon" type="image/png" sizes="60x60" href="/img/apple-touch-icon-60x60-dark.png" id="dark6">
+
+    <script>
+      // Switch to the dark-mode favicons if prefers-color-scheme: dark
+      function onUpdate() {
+        light1 = document.querySelector('link#light1');
+        light2 = document.querySelector('link#light2');
+        light3 = document.querySelector('link#light3');
+        light4 = document.querySelector('link#light4');
+        light5 = document.querySelector('link#light5');
+        light6 = document.querySelector('link#light6');
+
+        dark1 = document.querySelector('link#dark1');
+        dark2 = document.querySelector('link#dark2');
+        dark3 = document.querySelector('link#dark3');
+        dark4 = document.querySelector('link#dark4');
+        dark5 = document.querySelector('link#dark5');
+        dark6 = document.querySelector('link#dark6');
+
+        if (matcher.matches) {
+          light1.remove();
+          light2.remove();
+          light3.remove();
+          light4.remove();
+          light5.remove();
+          light6.remove();
+          document.head.append(dark1);
+          document.head.append(dark2);
+          document.head.append(dark3);
+          document.head.append(dark4);
+          document.head.append(dark5);
+          document.head.append(dark6);
+        } else {
+          dark1.remove();
+          dark2.remove();
+          dark3.remove();
+          dark4.remove();
+          dark5.remove();
+          dark6.remove();
+          document.head.append(light1);
+          document.head.append(light2);
+          document.head.append(light3);
+          document.head.append(light4);
+          document.head.append(light5);
+          document.head.append(light6);
+        }
+      }
+      matcher = window.matchMedia('(prefers-color-scheme: dark)');
+      matcher.addListener(onUpdate);
+      onUpdate();
+    </script>
+
+    <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+
+    <link href="/css/main.css" rel="stylesheet">
+    <link href="/css/syntax.css" rel="stylesheet">
+    <script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
+    <script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.3/umd/popper.min.js" integrity="sha384-ZMP7rVo3mIykV+2+9J3UJ46jBk0WLaUAdn689aCwoqbBJiSnjAK/l8WvCWPIPm49" crossorigin="anonymous"></script>
+    
+    <!-- Global Site Tag (gtag.js) - Google Analytics -->
+<script async src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script>
+<script>
+  window.dataLayer = window.dataLayer || [];
+  function gtag(){dataLayer.push(arguments)};
+  gtag('js', new Date());
+
+  gtag('config', 'UA-107500873-1');
+</script>
+
+    
+  </head>
+
+
+<body class="wrap">
+  <header>
+    <nav class="navbar navbar-expand-md navbar-dark bg-dark">
+  <a class="navbar-brand" href="/"><img src="/img/arrow-inverse-300px.png" height="60px"/></a>
+  <button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#arrow-navbar" aria-controls="arrow-navbar" aria-expanded="false" aria-label="Toggle navigation">
+    <span class="navbar-toggler-icon"></span>
+  </button>
+
+    <!-- Collect the nav links, forms, and other content for toggling -->
+    <div class="collapse navbar-collapse" id="arrow-navbar">
+      <ul class="nav navbar-nav">
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownProjectLinks" role="button" data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             Project Links
+          </a>
+          <div class="dropdown-menu" aria-labelledby="navbarDropdownProjectLinks">
+            <a class="dropdown-item" href="/install/">Installation</a>
+            <a class="dropdown-item" href="/release/">Releases</a>
+            <a class="dropdown-item" href="/faq/">FAQ</a>
+            <a class="dropdown-item" href="/blog/">Blog</a>
+            <a class="dropdown-item" href="https://github.com/apache/arrow">Source Code</a>
+            <a class="dropdown-item" href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a>
+          </div>
+        </li>
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownCommunity" role="button" data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             Community
+          </a>
+          <div class="dropdown-menu" aria-labelledby="navbarDropdownCommunity">
+            <a class="dropdown-item" href="http://mail-archives.apache.org/mod_mbox/arrow-user/">User Mailing List</a>
+            <a class="dropdown-item" href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Dev Mailing List</a>
+            <a class="dropdown-item" href="https://cwiki.apache.org/confluence/display/ARROW">Developer Wiki</a>
+            <a class="dropdown-item" href="/committers/">Committers</a>
+            <a class="dropdown-item" href="/powered_by/">Powered By</a>
+          </div>
+        </li>
+        <li class="nav-item">
+          <a class="nav-link" href="/docs/format/Columnar.html"
+             role="button" aria-haspopup="true" aria-expanded="false">
+             Specification
+          </a>
+        </li>
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownDocumentation" role="button" data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             Documentation
+          </a>
+          <div class="dropdown-menu" aria-labelledby="navbarDropdownDocumentation">
+            <a class="dropdown-item" href="/docs">Project Docs</a>
+            <a class="dropdown-item" href="/docs/python">Python</a>
+            <a class="dropdown-item" href="/docs/cpp">C++</a>
+            <a class="dropdown-item" href="/docs/java">Java</a>
+            <a class="dropdown-item" href="/docs/c_glib">C GLib</a>
+            <a class="dropdown-item" href="/docs/js">JavaScript</a>
+            <a class="dropdown-item" href="/docs/r">R</a>
+          </div>
+        </li>
+        <!-- <li><a href="/blog">Blog</a></li> -->
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownASF" role="button" data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             ASF Links
+          </a>
+          <div class="dropdown-menu" aria-labelledby="navbarDropdownASF">
+            <a class="dropdown-item" href="http://www.apache.org/">ASF Website</a>
+            <a class="dropdown-item" href="http://www.apache.org/licenses/">License</a>
+            <a class="dropdown-item" href="http://www.apache.org/foundation/sponsorship.html">Donate</a>
+            <a class="dropdown-item" href="http://www.apache.org/foundation/thanks.html">Thanks</a>
+            <a class="dropdown-item" href="http://www.apache.org/security/">Security</a>
+          </div>
+        </li>
+      </ul>
+      <div class="flex-row justify-content-end ml-md-auto">
+        <a class="d-sm-none d-md-inline pr-2" href="https://www.apache.org/events/current-event.html">
+          <img src="https://www.apache.org/events/current-event-234x60.png"/>
+        </a>
+        <a href="http://www.apache.org/">
+          <img src="/img/asf_logo.svg" width="120px"/>
+        </a>
+      </div>
+      </div><!-- /.navbar-collapse -->
+    </div>
+  </nav>
+
+  </header>
+
+  <div class="container p-lg-4">
+    <main role="main">
+    
+    
+    
+<h1>
+  Apache Arrow 0.17.0 Release
+</h1>
+
+
+
+<p>
+  <span class="badge badge-secondary">Published</span>
+  <span class="published">
+    21 Apr 2020
+  </span>
+  <br />
+  <span class="badge badge-secondary">By</span>
+  
+    <a href="https://arrow.apache.org">The Apache Arrow PMC (pmc) </a>
+  
+
+  
+</p>
+
+
+    <!--
+
+-->
+
+<p>The Apache Arrow team is pleased to announce the 0.17.0 release. This covers
+over 2 months of development work and includes <a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%200.17.0"><strong>569 resolved issues</strong></a>
+from <a href="https://arrow.apache.org/release/0.17.0.html#contributors"><strong>79 distinct contributors</strong></a>. See the Install Page to learn how to
+get the libraries for your platform.</p>
+
+<p>The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the <a href="https://arrow.apache.org/release/0.17.0.html">complete changelog</a>.</p>
+
+<h2 id="community">Community</h2>
+
+<p>Since the 0.16.0 release, two committers have joined the Project Management
+Committee (PMC):</p>
+
+<ul>
+  <li><a href="https://github.com/nealrichardson">Neal Richardson</a></li>
+  <li><a href="https://github.com/fsaintjacques">François Saint-Jacques</a></li>
+</ul>
+
+<p>Thank you for all your contributions!</p>
+
+<h2 id="columnar-format-notes">Columnar Format Notes</h2>
+
+<p>A <a href="https://arrow.apache.org/docs/format/CDataInterface.html">C-level Data Interface</a> was designed to ease data sharing inside a single
+process. It allows different runtimes or libraries to share Arrow data using a
+well-known binary layout and metadata representation, without any copies. Third
+party libraries can use the C interface to import and export the Arrow columnar
+format in-process without requiring any new code dependencies.</p>
+
+<p>The C++ library now includes an implementation of the C Data Interface, and
+Python and R have bindings to that implementation.</p>
+
+<h2 id="arrow-flight-rpc-notes">Arrow Flight RPC notes</h2>
+
+<ul>
+  <li>Adopted new DoExchange bi-directional data RPC</li>
+  <li>ListFlights supports being passed a Criteria argument in
+Java/C++/Python. This allows applications to search for flights satisfying a
+given query.</li>
+  <li>Custom metadata can be attached to errors that the server sends to the
+client, which can be used to encode richer application-specific information.</li>
+  <li>A number of minor bugs were fixed, including proper handling of empty null
+arrays in Java and round-tripping of certain Arrow status codes in
+C++/Python.</li>
+</ul>
+
+<h2 id="c-notes">C++ notes</h2>
+
+<h3 id="feather-v2">Feather V2</h3>
+
+<p>The “Feather V2” format based on the Arrow IPC file format was developed.
+Feather V2 features full support for all Arrow data types, and resolves the 2GB
+per-column limitation for large amounts of string data that the <a href="https://github.com/wesm/feather">original
+Feather implementation</a> had.  Feather V2 also introduces experimental IPC
+message compression using LZ4 frame format or ZSTD. This will be formalized
+later in the Arrow format.</p>
+
+<h3 id="c-datasets">C++ Datasets</h3>
+
+<ul>
+  <li>Improve speed on high-latency file systems by relaxing discovery validation</li>
+  <li>Better performance with Arrow IPC files using column projection</li>
+  <li>Add the ability to list files in FileSystemDataset</li>
+  <li>Add support for Parquet file reader options</li>
+  <li>Support dictionary columns in partition expression</li>
+  <li>Fix various crashes and other issues</li>
+</ul>
+
+<h3 id="c-parquet-notes">C++ Parquet notes</h3>
+
+<ul>
+  <li>Support for writing nested types to the Parquet format is now
+complete. The legacy code can be accessed through a Parquet write option in C++
+and an environment variable in Python. Read support will come in a future
+release.</li>
+  <li>The BYTE_STREAM_SPLIT encoding was implemented for floating-point types. It
+helps improve the efficiency of memory compression for high-entropy data.</li>
+  <li>Expose Parquet schema field_id as Arrow field metadata</li>
+  <li>Support for DataPageV2 data page format</li>
+</ul>
+
+<h3 id="c-build-notes">C++ build notes</h3>
+
+<ul>
+  <li>We continued to make the core C++ library build simpler and faster. Among the
+improvements are the removal of the dependency on Thrift IDL compiler at
+build time; while Parquet still requires the Thrift runtime C++ library, its
+dependencies are much lighter. We also further reduced the number of build
+configurations that require Boost, and when Boost does need to be built, we
+only download the components we need, reducing the size of the Boost bundle
+by 90%.</li>
+  <li>Improved support for building on ARM platforms</li>
+  <li>Upgraded LLVM version from 7 to 8</li>
+  <li>Simplified SIMD build configuration with ARROW_SIMD_LEVEL option allowing no
+SIMD, SSE4.2, AVX2, or AVX512 to be selected.</li>
+  <li>Fixed a number of bugs affecting compilation on aarch64 platforms</li>
+</ul>
+
+<h3 id="other-c-notes">Other C++ notes</h3>
+
+<ul>
+  <li>Many crashes on invalid input detected by <a href="https://google.github.io/oss-fuzz/">OSS-Fuzz</a> in the IPC reader and
+in Parquet-Arrow reading were fixed. See our recent <a href="https://arrow.apache.org/blog/2020/03/31/fuzzing-arrow-ipc/">blog post</a> for more
+details.</li>
+  <li>A “Device” abstraction was added to simplify buffer management and movement
+across heterogeneous hardware configurations, e.g. CPUs and GPUs.</li>
+  <li>A streaming CSV reader was implemented, yielding individual RecordBatches and
+helping limit overall memory occupation.</li>
+  <li>Array casting from Decimal128 to integer types and to Decimal128 with
+different scale/precision was added.</li>
+  <li>Sparse CSF tensors are now supported.</li>
+  <li>When creating an Array, the null bitmap is not kept if the null count is known to be zero</li>
+  <li>Compressor support for the LZ4 frame format (LZ4_FRAME) was added</li>
+  <li>An event-driven interface for reading IPC streams was added.</li>
+  <li>Further core APIs that required passing an explicit out-parameter were
+migrated to <code class="highlighter-rouge">Result&lt;T&gt;</code>.</li>
+  <li>New analytics kernels for match, sort indices / argsort, top-k</li>
+</ul>
+
+<h2 id="java-notes">Java notes</h2>
+
+<ul>
+  <li>Netty dependencies were removed for BufferAllocator and ReferenceManager
+classes. In the future, we plan to move netty related classes to a separate
+module.</li>
+  <li>New features were provided to support efficiently appending vector/vector
+schema root values in batch.</li>
+  <li>Comparing a range of values in dense union vectors is now supported.</li>
+  <li>The quick sort algorithm was improved to avoid degenerating to the worst case.</li>
+</ul>
+
+<h2 id="python-notes">Python notes</h2>
+
+<h3 id="datasets">Datasets</h3>
+
+<ul>
+  <li>Updated <code class="highlighter-rouge">pyarrow.dataset</code> module following the changes in the C++ Datasets
+project. This release also adds <a href="https://arrow.apache.org/docs/python/dataset.html">richer documentation</a> on the datasets
+module.</li>
+  <li>Support for the improved dataset functionality in
+<code class="highlighter-rouge">pyarrow.parquet.read_table/ParquetDataset</code>. To enable, pass
+<code class="highlighter-rouge">use_legacy_dataset=False</code>. Among other things, this allows specifying filters
+for all columns and not only the partition keys (using row group statistics)
+and enables different partitioning schemes. See the “note” in the
+<a href="https://arrow.apache.org/docs/python/parquet.html#reading-from-partitioned-datasets"><code class="highlighter-rouge">ParquetDataset</code> documentation</a>.</li>
+</ul>
+
+<h3 id="packaging">Packaging</h3>
+
+<ul>
+  <li>Wheels for Python 3.8 are now available</li>
+  <li>Support for Python 2.7 has been dropped as Python 2.x reached end-of-life in
+January 2020.</li>
+  <li>Nightly wheels and conda packages are now available for testing or other
+development purposes. See the <a href="https://arrow.apache.org/docs/python/install.html#installing-nightly-packages">installation guide</a>.</li>
+</ul>
+
+<h3 id="other-improvements">Other improvements</h3>
+
+<ul>
+  <li>Conversion to numpy/pandas for FixedSizeList, LargeString, LargeBinary</li>
+  <li>Sparse CSC matrices and Sparse CSF tensors support was added. (ARROW-7419,
+ARROW-7427)</li>
+</ul>
+
+<h2 id="r-notes">R notes</h2>
+
+<p>Highlights include support for the Feather V2 format and the C Data Interface,
+both described above. Along with low-level bindings for the C interface, this
+release adds tooling to work with Arrow data in Python using <code class="highlighter-rouge">reticulate</code>. See
+<a href="https://arrow.apache.org/docs/r/articles/python.html"><code class="highlighter-rouge">vignette("python", package = "arrow")</code></a> for a guide to getting started.</p>
+
+<p>Installation on Linux now builds the C++ library from source by default. For a
+faster, richer build, set the environment variable <code class="highlighter-rouge">NOT_CRAN=true</code>. See
+<a href="https://arrow.apache.org/docs/r/articles/install.html"><code class="highlighter-rouge">vignette("install", package = "arrow")</code></a> for details and more options.</p>
+
+<p>For more on what’s in the 0.17 R package, see the <a href="https://arrow.apache.org/docs/r/news/">R changelog</a>.</p>
+
+<h2 id="ruby-and-c-glib-notes">Ruby and C GLib notes</h2>
+
+<h3 id="ruby">Ruby</h3>
+
+<ul>
+  <li>Support Ruby 2.3 again</li>
+</ul>
+
+<h3 id="c-glib">C GLib</h3>
+
+<ul>
+  <li>Add GArrowRecordBatchIterator</li>
+  <li>Add support for GArrowFilterOptions</li>
+  <li>Add support for Peek() to GIOInputStream</li>
+  <li>Add some metadata bindings to GArrowSchema</li>
+  <li>Add LocalFileSystem support</li>
+  <li>Add support for writer properties of Parquet</li>
+  <li>Add support for MapArray</li>
+  <li>Add support for BooleanNode</li>
+</ul>
+
+<h2 id="rust-notes">Rust notes</h2>
+
+<ul>
+  <li>DictionaryArray support.</li>
+  <li>Various improvements to code safety.</li>
+  <li>Filter kernel now supports temporal types.</li>
+</ul>
+
+<h3 id="rust-parquet-notes">Rust Parquet notes</h3>
+
+<ul>
+  <li>Array reader now supports temporal types.</li>
+  <li>Parquet writer now supports custom meta-data key/value pairs.</li>
+</ul>
+
+<h3 id="rust-datafusion-notes">Rust DataFusion notes</h3>
+
+<ul>
+  <li>Logical plans can now reference columns by name (as well as by index) using
+the new <code class="highlighter-rouge">UnresolvedColumn</code> expression. There is a new optimizer rule to
+resolve these into column indices.</li>
+  <li>Scalar UDFs can now be registered with the execution context and used from
+logical query plans as well as from SQL. A number of math scalar functions
+have been implemented using this feature (sqrt, cos, sin, tan, asin, acos,
+atan, floor, ceil, round, trunc, abs, signum, exp, log, log2, log10).</li>
+  <li>Various SQL improvements, including support for <code class="highlighter-rouge">SELECT *</code> and <code class="highlighter-rouge">SELECT
+COUNT(*)</code>, and improvements to parsing of aggregate queries.</li>
+  <li>Flight examples are provided, with a client that sends a SQL statement to a
+Flight server and receives the results.</li>
+  <li>The interactive SQL command-line tool now has improved documentation and
+better formatting of query results.</li>
+</ul>
+
+<h2 id="project-operations">Project Operations</h2>
+
+<p>We’ve continued our migration of general automation toward GitHub Actions. The
+majority of our commit-by-commit continuous integration (CI) is now running on
+GitHub Actions. We are working on different solutions for using dedicated
+hardware as part of our CI. The <a href="https://buildkite.com/">Buildkite</a> self-hosted CI/CD platform is
+now supported on Apache repositories and GitHub Actions also supports
+self-hosted workers.</p>
+
+
+    </main>
+
+    <hr/>
+<footer class="footer">
+  <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p>
+  <p>&copy; 2016-2019 The Apache Software Foundation</p>
+  <script integrity="sha256-GM0wKVV/c8HuguQRExJ7BPb82ExW2dsMucQOvibvbjM=" crossorigin="anonymous" type="text/javascript" src="/assets/main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js"></script>
+</footer>
+
+  </div>
+</body>
+</html>
diff --git a/blog/index.html b/blog/index.html
index 40ecefa..1056b3a 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -217,6 +217,21 @@
   
   <p>
     <h3>
+      <a href="/blog/2020/04/21/0.17.0-release/">Apache Arrow 0.17.0 Release</a>
+    </h3>
+    
+    <p>
+    <span class="blog-list-date">
+      21 April 2020
+    </span>
+    </p>
+    The Apache Arrow team is pleased to announce the 0.17.0 release. This covers over 2 months of development work and includes 569 resolved issues from 79 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The release notes below are not exhaustive and...
+  </p>
+  
+
+  
+  <p>
+    <h3>
       <a href="/blog/2020/03/31/fuzzing-arrow-ipc/">Fuzzing the Arrow C++ IPC implementation</a>
     </h3>
     
diff --git a/feed.xml b/feed.xml
index 04feb74..b13ac67 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,247 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.4">Jekyll</generator><link href="https://arrow.apache.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://arrow.apache.org/" rel="alternate" type="text/html" /><updated>2020-04-21T08:23:33-04:00</updated><id>https://arrow.apache.org/feed.xml</id><title type="html">Apache Arrow</title><subtitle>Apache Arrow is a cross-language developm [...]
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.4">Jekyll</generator><link href="https://arrow.apache.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://arrow.apache.org/" rel="alternate" type="text/html" /><updated>2020-04-22T19:55:16-04:00</updated><id>https://arrow.apache.org/feed.xml</id><title type="html">Apache Arrow</title><subtitle>Apache Arrow is a cross-language developm [...]
+
+--&gt;
+
+&lt;p&gt;The Apache Arrow team is pleased to announce the 0.17.0 release. This covers
+over 2 months of development work and includes &lt;a href=&quot;https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%200.17.0&quot;&gt;&lt;strong&gt;569 resolved issues&lt;/strong&gt;&lt;/a&gt;
+from &lt;a href=&quot;https://arrow.apache.org/release/0.17.0.html#contributors&quot;&gt;&lt;strong&gt;79 distinct contributors&lt;/strong&gt;&lt;/a&gt;. See the Install Page to learn how to
+get the libraries for your platform.&lt;/p&gt;
+
+&lt;p&gt;The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the &lt;a href=&quot;https://arrow.apache.org/release/0.17.0.html&quot;&gt;complete changelog&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h2 id=&quot;community&quot;&gt;Community&lt;/h2&gt;
+
+&lt;p&gt;Since the 0.16.0 release, two committers have joined the Project Management
+Committee (PMC):&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;a href=&quot;https://github.com/nealrichardson&quot;&gt;Neal Richardson&lt;/a&gt;&lt;/li&gt;
+  &lt;li&gt;&lt;a href=&quot;https://github.com/fsaintjacques&quot;&gt;François Saint-Jacques&lt;/a&gt;&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Thank you for all your contributions!&lt;/p&gt;
+
+&lt;h2 id=&quot;columnar-format-notes&quot;&gt;Columnar Format Notes&lt;/h2&gt;
+
+&lt;p&gt;A &lt;a href=&quot;https://arrow.apache.org/docs/format/CDataInterface.html&quot;&gt;C-level Data Interface&lt;/a&gt; was designed to ease data sharing inside a single
+process. It allows different runtimes or libraries to share Arrow data using a
+well-known binary layout and metadata representation, without any copies. Third
+party libraries can use the C interface to import and export the Arrow columnar
+format in-process without requiring any new code dependencies.&lt;/p&gt;
+
+&lt;p&gt;The C++ library now includes an implementation of the C Data Interface, and
+Python and R have bindings to that implementation.&lt;/p&gt;
+
+&lt;h2 id=&quot;arrow-flight-rpc-notes&quot;&gt;Arrow Flight RPC notes&lt;/h2&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Adopted new DoExchange bi-directional data RPC&lt;/li&gt;
+  &lt;li&gt;ListFlights supports being passed a Criteria argument in
+Java/C++/Python. This allows applications to search for flights satisfying a
+given query.&lt;/li&gt;
+  &lt;li&gt;Custom metadata can be attached to errors that the server sends to the
+client, which can be used to encode richer application-specific information.&lt;/li&gt;
+  &lt;li&gt;A number of minor bugs were fixed, including proper handling of empty null
+arrays in Java and round-tripping of certain Arrow status codes in
+C++/Python.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;c-notes&quot;&gt;C++ notes&lt;/h2&gt;
+
+&lt;h3 id=&quot;feather-v2&quot;&gt;Feather V2&lt;/h3&gt;
+
+&lt;p&gt;The “Feather V2” format based on the Arrow IPC file format was developed.
+Feather V2 features full support for all Arrow data types, and resolves the 2GB
+per-column limitation for large amounts of string data that the &lt;a href=&quot;https://github.com/wesm/feather&quot;&gt;original
+Feather implementation&lt;/a&gt; had.  Feather V2 also introduces experimental IPC
+message compression using LZ4 frame format or ZSTD. This will be formalized
+later in the Arrow format.&lt;/p&gt;
+
+&lt;h3 id=&quot;c-datasets&quot;&gt;C++ Datasets&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Improve speed on high latency file system by relaxing discovery validation&lt;/li&gt;
+  &lt;li&gt;Better performance with Arrow IPC files using column projection&lt;/li&gt;
+  &lt;li&gt;Add the ability to list files in FileSystemDataset&lt;/li&gt;
+  &lt;li&gt;Add support for Parquet file reader options&lt;/li&gt;
+  &lt;li&gt;Support dictionary columns in partition expression&lt;/li&gt;
+  &lt;li&gt;Fix various crashes and other issues&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;c-parquet-notes&quot;&gt;C++ Parquet notes&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Full support for writing nested types to the Parquet format is now
+implemented. The legacy code can still be accessed through a Parquet write
+option in C++ and an environment variable in Python. Read support will come in
+a future release.&lt;/li&gt;
+  &lt;li&gt;The BYTE_STREAM_SPLIT encoding was implemented for floating-point types. It
+helps improve the efficiency of memory compression for high-entropy data.&lt;/li&gt;
+  &lt;li&gt;Expose Parquet schema field_id as Arrow field metadata&lt;/li&gt;
+  &lt;li&gt;Support for DataPageV2 data page format&lt;/li&gt;
+&lt;/ul&gt;
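+
+&lt;p&gt;The idea behind BYTE_STREAM_SPLIT can be sketched in a few lines of Python.
+This is only an illustration using the standard library, not the Arrow C++
+implementation, and the function names are hypothetical. For values that are
+&lt;em&gt;w&lt;/em&gt; bytes wide, the k-th byte of every value is gathered into the k-th
+of &lt;em&gt;w&lt;/em&gt; streams, so bytes that tend to look alike (for example sign and
+exponent bytes) sit next to each other before a general-purpose compressor
+runs:&lt;/p&gt;
+

```python
import struct


def byte_stream_split_encode(values, fmt="f"):
    """Split little-endian floats into per-byte streams (hypothetical helper)."""
    width = struct.calcsize("<" + fmt)  # 4 for float32, 8 for float64 ("d")
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    # Stream k holds byte k of every value; the streams are concatenated.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(data, fmt="f"):
    """Inverse of byte_stream_split_encode (hypothetical helper)."""
    width = struct.calcsize("<" + fmt)
    count = len(data) // width
    raw = bytearray(len(data))
    for k in range(width):
        # Scatter stream k back to byte position k of every value.
        raw[k::width] = data[k * count:(k + 1) * count]
    return list(struct.unpack("<%d%s" % (count, fmt), bytes(raw)))


values = [1.0, 2.5, -3.25]  # exactly representable as float32
encoded = byte_stream_split_encode(values)
decoded = byte_stream_split_decode(encoded)
```

+
+&lt;p&gt;The transform itself does not shrink the data; any gain comes from the
+compressor that is applied afterwards.&lt;/p&gt;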
+
+&lt;h3 id=&quot;c-build-notes&quot;&gt;C++ build notes&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;We continued to make the core C++ library build simpler and faster. Among the
+improvements are the removal of the dependency on Thrift IDL compiler at
+build time; while Parquet still requires the Thrift runtime C++ library, its
+dependencies are much lighter. We also further reduced the number of build
+configurations that require Boost, and when Boost does need to be built, we
+only download the components we need, reducing the size of the Boost bundle
+by 90%.&lt;/li&gt;
+  &lt;li&gt;Improved support for building on ARM platforms&lt;/li&gt;
+  &lt;li&gt;Upgraded LLVM version from 7 to 8&lt;/li&gt;
+  &lt;li&gt;Simplified SIMD build configuration with ARROW_SIMD_LEVEL option allowing no
+SIMD, SSE4.2, AVX2, or AVX512 to be selected.&lt;/li&gt;
+  &lt;li&gt;Fixed a number of bugs affecting compilation on aarch64 platforms&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;other-c-notes&quot;&gt;Other C++ notes&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Many crashes on invalid input detected by &lt;a href=&quot;https://google.github.io/oss-fuzz/&quot;&gt;OSS-Fuzz&lt;/a&gt; in the IPC reader and
+in Parquet-Arrow reading were fixed. See our recent &lt;a href=&quot;https://arrow.apache.org/blog/2020/03/31/fuzzing-arrow-ipc/&quot;&gt;blog post&lt;/a&gt; for more
+details.&lt;/li&gt;
+  &lt;li&gt;A “Device” abstraction was added to simplify buffer management and movement
+across heterogeneous hardware configurations, e.g. CPUs and GPUs.&lt;/li&gt;
+  &lt;li&gt;A streaming CSV reader was implemented, yielding individual RecordBatches and
+helping limit overall memory occupation.&lt;/li&gt;
+  &lt;li&gt;Array casting from Decimal128 to integer types and to Decimal128 with
+different scale/precision was added.&lt;/li&gt;
+  &lt;li&gt;Sparse CSF tensors are now supported.&lt;/li&gt;
+  &lt;li&gt;When creating an Array, the null bitmap is not kept if the null count is known to be zero&lt;/li&gt;
+  &lt;li&gt;Compressor support for the LZ4 frame format (LZ4_FRAME) was added&lt;/li&gt;
+  &lt;li&gt;An event-driven interface for reading IPC streams was added.&lt;/li&gt;
+  &lt;li&gt;Further core APIs that required passing an explicit out-parameter were
+migrated to &lt;code class=&quot;highlighter-rouge&quot;&gt;Result&amp;lt;T&amp;gt;&lt;/code&gt;.&lt;/li&gt;
+  &lt;li&gt;New analytics kernels for match, sort indices / argsort, top-k&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;java-notes&quot;&gt;Java notes&lt;/h2&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Netty dependencies were removed for BufferAllocator and ReferenceManager
+classes. In the future, we plan to move netty related classes to a separate
+module.&lt;/li&gt;
+  &lt;li&gt;New features were provided to support efficiently appending vector/vector
+schema root values in batch.&lt;/li&gt;
+  &lt;li&gt;Comparing a range of values in dense union vectors is now supported.&lt;/li&gt;
+  &lt;li&gt;The quick sort algorithm was improved to avoid degenerating to the worst case.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;python-notes&quot;&gt;Python notes&lt;/h2&gt;
+
+&lt;h3 id=&quot;datasets&quot;&gt;Datasets&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Updated &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow.dataset&lt;/code&gt; module following the changes in the C++ Datasets
+project. This release also adds &lt;a href=&quot;https://arrow.apache.org/docs/python/dataset.html&quot;&gt;richer documentation&lt;/a&gt; on the datasets
+module.&lt;/li&gt;
+  &lt;li&gt;Support for the improved dataset functionality in
+&lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow.parquet.read_table/ParquetDataset&lt;/code&gt;. To enable, pass
+&lt;code class=&quot;highlighter-rouge&quot;&gt;use_legacy_dataset=False&lt;/code&gt;. Among other things, this allows specifying filters
+for all columns and not only the partition keys (using row group statistics)
+and enables different partitioning schemes. See the “note” in the
+&lt;a href=&quot;https://arrow.apache.org/docs/python/parquet.html#reading-from-partitioned-datasets&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;ParquetDataset&lt;/code&gt; documentation&lt;/a&gt;.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;packaging&quot;&gt;Packaging&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Wheels for Python 3.8 are now available&lt;/li&gt;
+  &lt;li&gt;Support for Python 2.7 has been dropped as Python 2.x reached end-of-life in
+January 2020.&lt;/li&gt;
+  &lt;li&gt;Nightly wheels and conda packages are now available for testing or other
+development purposes. See the &lt;a href=&quot;https://arrow.apache.org/docs/python/install.html#installing-nightly-packages&quot;&gt;installation guide&lt;/a&gt;.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;other-improvements&quot;&gt;Other improvements&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Conversion to numpy/pandas for FixedSizeList, LargeString, LargeBinary&lt;/li&gt;
+  &lt;li&gt;Support for sparse CSC matrices and sparse CSF tensors was added
+(ARROW-7419, ARROW-7427).&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;r-notes&quot;&gt;R notes&lt;/h2&gt;
+
+&lt;p&gt;Highlights include support for the Feather V2 format and the C Data Interface,
+both described above. Along with low-level bindings for the C interface, this
+release adds tooling to work with Arrow data in Python using &lt;code class=&quot;highlighter-rouge&quot;&gt;reticulate&lt;/code&gt;. See
+&lt;a href=&quot;https://arrow.apache.org/docs/r/articles/python.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;vignette(&quot;python&quot;, package = &quot;arrow&quot;)&lt;/code&gt;&lt;/a&gt; for a guide to getting started.&lt;/p&gt;
+
+&lt;p&gt;Installation on Linux now builds the C++ library from source by default. For a
+faster, richer build, set the environment variable &lt;code class=&quot;highlighter-rouge&quot;&gt;NOT_CRAN=true&lt;/code&gt;. See
+&lt;a href=&quot;https://arrow.apache.org/docs/r/articles/install.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;vignette(&quot;install&quot;, package = &quot;arrow&quot;)&lt;/code&gt;&lt;/a&gt; for details and more options.&lt;/p&gt;
+
+&lt;p&gt;For more on what’s in the 0.17 R package, see the &lt;a href=&quot;https://arrow.apache.org/docs/r/news/&quot;&gt;R changelog&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h2 id=&quot;ruby-and-c-glib-notes&quot;&gt;Ruby and C GLib notes&lt;/h2&gt;
+
+&lt;h3 id=&quot;ruby&quot;&gt;Ruby&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Support Ruby 2.3 again&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;c-glib&quot;&gt;C GLib&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Add GArrowRecordBatchIterator&lt;/li&gt;
+  &lt;li&gt;Add support for GArrowFilterOptions&lt;/li&gt;
+  &lt;li&gt;Add support for Peek() to GIOInputStream&lt;/li&gt;
+  &lt;li&gt;Add some metadata bindings to GArrowSchema&lt;/li&gt;
+  &lt;li&gt;Add LocalFileSystem support&lt;/li&gt;
+  &lt;li&gt;Add support for writer properties of Parquet&lt;/li&gt;
+  &lt;li&gt;Add support for MapArray&lt;/li&gt;
+  &lt;li&gt;Add support for BooleanNode&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;rust-notes&quot;&gt;Rust notes&lt;/h2&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;DictionaryArray support.&lt;/li&gt;
+  &lt;li&gt;Various improvements to code safety.&lt;/li&gt;
+  &lt;li&gt;Filter kernel now supports temporal types.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;rust-parquet-notes&quot;&gt;Rust Parquet notes&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Array reader now supports temporal types.&lt;/li&gt;
+  &lt;li&gt;Parquet writer now supports custom meta-data key/value pairs.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;rust-datafusion-notes&quot;&gt;Rust DataFusion notes&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Logical plans can now reference columns by name (as well as by index) using
+the new &lt;code class=&quot;highlighter-rouge&quot;&gt;UnresolvedColumn&lt;/code&gt; expression. There is a new optimizer rule to
+resolve these into column indices.&lt;/li&gt;
+  &lt;li&gt;Scalar UDFs can now be registered with the execution context and used from
+logical query plans as well as from SQL. A number of math scalar functions
+have been implemented using this feature (sqrt, cos, sin, tan, asin, acos,
+atan, floor, ceil, round, trunc, abs, signum, exp, log, log2, log10).&lt;/li&gt;
+  &lt;li&gt;Various SQL improvements, including support for &lt;code class=&quot;highlighter-rouge&quot;&gt;SELECT *&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;SELECT
+COUNT(*)&lt;/code&gt;, and improvements to parsing of aggregate queries.&lt;/li&gt;
+  &lt;li&gt;Flight examples are provided, with a client that sends a SQL statement to a
+Flight server and receives the results.&lt;/li&gt;
+  &lt;li&gt;The interactive SQL command-line tool now has improved documentation and
+better formatting of query results.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;project-operations&quot;&gt;Project Operations&lt;/h2&gt;
+
+&lt;p&gt;We’ve continued our migration of general automation toward GitHub Actions. The
+majority of our commit-by-commit continuous integration (CI) is now running on
+GitHub Actions. We are working on different solutions for using dedicated
+hardware as part of our CI. The &lt;a href=&quot;https://buildkite.com/&quot;&gt;Buildkite&lt;/a&gt; self-hosted CI/CD platform is
+now supported on Apache repositories and GitHub Actions also supports
+self-hosted workers.&lt;/p&gt;</content><author><name>pmc</name></author><summary type="html">The Apache Arrow team is pleased to announce the 0.17.0 release. This covers over 2 months of development work and includes 569 resolved issues from 79 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: w [...]
 
 --&gt;
 
@@ -1780,210 +2023,4 @@ for C++&lt;/li&gt;
 data messaging use cases&lt;/li&gt;
   &lt;li&gt;&lt;strong&gt;Arrow Columnar Format evolution&lt;/strong&gt;: we are discussing a new “duration” or
 “time interval” type and some other additions to the Arrow columnar format.&lt;/li&gt;
-&lt;/ul&gt;</content><author><name>wesm</name></author><summary type="html">The Apache Arrow team is pleased to announce the 0.13.0 release. This covers more than 2 months of development work and includes 550 resolved issues from 81 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available. While it’s a large release, this post will give some brief highlights in the project since the 0.12.0 release from Janua [...]
-
---&gt;
-
-&lt;p&gt;Python users who upgrade to recently released &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow&lt;/code&gt; 0.12 may find that
-their applications use significantly less memory when converting Arrow string
-data to pandas format. This includes using &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow.parquet.read_table&lt;/code&gt; and
-&lt;code class=&quot;highlighter-rouge&quot;&gt;pandas.read_parquet&lt;/code&gt;. This article details some of what is going on under the
-hood, and why Python applications dealing with large amounts of strings are
-prone to memory use problems.&lt;/p&gt;
-
-&lt;h2 id=&quot;why-python-strings-can-use-a-lot-of-memory&quot;&gt;Why Python strings can use a lot of memory&lt;/h2&gt;
-
-&lt;p&gt;Let’s start with some possibly surprising facts. I’m going to create an empty
-&lt;code class=&quot;highlighter-rouge&quot;&gt;bytes&lt;/code&gt; object and an empty &lt;code class=&quot;highlighter-rouge&quot;&gt;str&lt;/code&gt; (unicode) object in Python 3.7:&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [1]: val = b''
-
-In [2]: unicode_val = u''
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;sys.getsizeof&lt;/code&gt; function accurately reports the number of bytes used by
-built-in Python objects. You might be surprised to find that:&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [4]: import sys
-In [5]: sys.getsizeof(val)
-Out[5]: 33
-
-In [6]: sys.getsizeof(unicode_val)
-Out[6]: 49
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;Since strings in Python are nul-terminated, we can infer that a bytes object
-has 32 bytes of overhead while unicode has 48 bytes. One must also account for
-&lt;code class=&quot;highlighter-rouge&quot;&gt;PyObject*&lt;/code&gt; pointer references to the objects, so the actual overhead is 40 and
-56 bytes, respectively. With large strings and text, this overhead may not
-matter much, but when you have a lot of small strings, such as those arising
-from reading a CSV or Apache Parquet file, they can take up an unexpected
-amount of memory. pandas represents strings in NumPy arrays of &lt;code class=&quot;highlighter-rouge&quot;&gt;PyObject*&lt;/code&gt;
-pointers, so the total memory used by a unique unicode string is&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;8 (PyObject*) + 48 (Python C struct) + string_length + 1
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;Suppose that we read a CSV file with&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;1 column&lt;/li&gt;
-  &lt;li&gt;1 million rows&lt;/li&gt;
-  &lt;li&gt;Each value in the column is a string with 10 characters&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;On disk this file would take approximately 10MB. Read into memory, however, it
-could take up over 60MB, as a 10 character string object takes up 67 bytes in a
-&lt;code class=&quot;highlighter-rouge&quot;&gt;pandas.Series&lt;/code&gt;.&lt;/p&gt;
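-
-&lt;p&gt;The arithmetic above can be checked with a short sketch. The constants are
-the figures quoted in this post for 64-bit CPython 3.7, not values measured by
-the code, and the function name is hypothetical:&lt;/p&gt;
-

```python
# Figures quoted in the text (assumed: 64-bit CPython 3.7).
POINTER_BYTES = 8       # one PyObject* slot in the NumPy object array
STR_STRUCT_BYTES = 48   # fixed overhead of a unicode object
NUL_BYTES = 1           # trailing nul terminator


def pandas_string_bytes(string_length):
    """Estimated bytes used by one unique string stored in a pandas.Series."""
    return POINTER_BYTES + STR_STRUCT_BYTES + string_length + NUL_BYTES


per_value = pandas_string_bytes(10)    # 10-character strings
total_bytes = per_value * 1_000_000    # 1 million rows, all values unique
print(per_value, total_bytes / 1e6)    # bytes per value, total in MB
```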
-
-&lt;h2 id=&quot;how-apache-arrow-represents-strings&quot;&gt;How Apache Arrow represents strings&lt;/h2&gt;
-
-&lt;p&gt;While a Python unicode string can have 57 bytes of overhead, a string in the
-Arrow columnar format has only 4 (32 bits) or 4.125 (33 bits) bytes of
-overhead. 32-bit integer offsets encode the position and size of a string
-value in a contiguous chunk of memory:&lt;/p&gt;
-
-&lt;div align=&quot;center&quot;&gt;
-&lt;img src=&quot;/img/20190205-arrow-string.png&quot; alt=&quot;Apache Arrow string memory layout&quot; width=&quot;80%&quot; class=&quot;img-responsive&quot; /&gt;
-&lt;/div&gt;
-
-&lt;p&gt;When you call &lt;code class=&quot;highlighter-rouge&quot;&gt;table.to_pandas()&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;array.to_pandas()&lt;/code&gt; with &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow&lt;/code&gt;, we
-have to convert this compact string representation back to pandas’s
-Python-based strings. This can use a huge amount of memory when we have a large
-number of small strings. It is a quite common occurrence when working with web
-analytics data, which compresses to a compact size when stored in the Parquet
-columnar file format.&lt;/p&gt;
-
-&lt;p&gt;Note that the Arrow string memory format has other benefits beyond memory
-use. It is also much more efficient for analytics due to the guarantee of data
-locality; all strings are next to each other in memory. In the case of pandas
-and Python strings, the string data can be located anywhere in the process
-heap. Arrow PMC member Uwe Korn did some work to &lt;a href=&quot;https://www.slideshare.net/xhochy/extending-pandas-using-apache-arrow-and-numba&quot;&gt;extend pandas with Arrow
-string arrays&lt;/a&gt; for improved performance and memory use.&lt;/p&gt;
-
-&lt;h2 id=&quot;reducing-pandas-memory-use-when-converting-from-arrow&quot;&gt;Reducing pandas memory use when converting from Arrow&lt;/h2&gt;
-
-&lt;p&gt;For many years, the &lt;code class=&quot;highlighter-rouge&quot;&gt;pandas.read_csv&lt;/code&gt; function has relied on a trick to limit
-the amount of string memory allocated. Because pandas uses arrays of
-&lt;code class=&quot;highlighter-rouge&quot;&gt;PyObject*&lt;/code&gt; pointers to refer to objects in the Python heap, we can avoid
-creating multiple strings with the same value, instead reusing existing objects
-and incrementing their reference counts.&lt;/p&gt;
-
-&lt;p&gt;Schematically, we have the following:&lt;/p&gt;
-
-&lt;div align=&quot;center&quot;&gt;
-&lt;img src=&quot;/img/20190205-numpy-string.png&quot; alt=&quot;pandas string memory optimization&quot; width=&quot;80%&quot; class=&quot;img-responsive&quot; /&gt;
-&lt;/div&gt;
-
-&lt;p&gt;In &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow&lt;/code&gt; 0.12, we have implemented this when calling &lt;code class=&quot;highlighter-rouge&quot;&gt;to_pandas&lt;/code&gt;. It
-requires using a hash table to deduplicate the Arrow string data as it’s being
-converted to pandas. Hashing data is not free, but counterintuitively it can be
-faster in addition to being vastly more memory efficient in the common case in
-analytics where we have table columns with many instances of the same string
-values.&lt;/p&gt;
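-
-&lt;p&gt;A minimal pure-Python sketch of the deduplication idea follows. It is not
-the &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow&lt;/code&gt; implementation, which is written in C++; the function name and
-the simplified layout (one data buffer plus a list of offsets, with no
-validity bitmap) are assumptions made for illustration:&lt;/p&gt;
-

```python
def to_object_list(data, offsets, deduplicate=True):
    """Convert an Arrow-style string layout to a list of Python str objects.

    data    -- bytes holding all string values back to back
    offsets -- offsets[i]:offsets[i + 1] delimits the i-th string
    """
    seen = {}  # hash table: raw bytes -> already created str object
    out = []
    for start, end in zip(offsets, offsets[1:]):
        raw = data[start:end]
        if deduplicate:
            value = seen.get(raw)
            if value is None:
                # First occurrence: create the object once and remember it.
                value = seen[raw] = raw.decode("utf-8")
        else:
            # A fresh Python object is decoded for every slot.
            value = raw.decode("utf-8")
        out.append(value)
    return out


data = b"foobarfoo"
offsets = [0, 3, 6, 9]
result = to_object_list(data, offsets)
```

-
-&lt;p&gt;With deduplication, repeated values in the result refer to the same Python
-object, so only the pointer slot is paid again for each repeat.&lt;/p&gt;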
-
-&lt;h2 id=&quot;memory-and-performance-benchmarks&quot;&gt;Memory and Performance Benchmarks&lt;/h2&gt;
-
-&lt;p&gt;We can use the &lt;a href=&quot;https://pypi.org/project/memory-profiler/&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;memory_profiler&lt;/code&gt;&lt;/a&gt; Python package to easily get process
-memory usage within a running Python application.&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;memory_profiler&lt;/span&gt;
-&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;mem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
-    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;memory_profiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memory_usage&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;In a new application I have:&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [7]: mem()
-Out[7]: 86.21875
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;I will generate approximately 1 gigabyte of string data represented as Python
-strings with length 10. The &lt;code class=&quot;highlighter-rouge&quot;&gt;pandas.util.testing&lt;/code&gt; module has a handy &lt;code class=&quot;highlighter-rouge&quot;&gt;rands&lt;/code&gt;
-function for generating random strings. Here is the data generation function:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas.util.testing&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rands&lt;/span&gt;
-&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;generate_strings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nunique&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string_length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1 [...]
-    &lt;span class=&quot;n&quot;&gt;unique_values&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rands&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt; [...]
-    &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unique_values&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;length&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;//&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nunique&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;This generates a certain number of unique strings, then duplicates them to
-yield the desired number of total strings. So I’m going to create 100 million
-strings with only 10000 unique values:&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [8]: values = generate_strings(100000000, 10000)
-
-In [9]: mem()
-Out[9]: 852.140625
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;100 million &lt;code class=&quot;highlighter-rouge&quot;&gt;PyObject*&lt;/code&gt; values is only 745 MB, so this increase of a little
-over 770 MB is consistent with what we know so far. Now I’m going to convert
-this to Arrow format:&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [11]: arr = pa.array(values)
-
-In [12]: mem()
-Out[12]: 2276.9609375
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;Since &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow&lt;/code&gt; exactly accounts for all of its memory allocations, we also
-check that&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [13]: pa.total_allocated_bytes()
-Out[13]: 1416777280
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;Since each string takes about 14 bytes (10 bytes plus 4 bytes of overhead),
-this is what we expect.&lt;/p&gt;
-
-&lt;p&gt;Now, converting &lt;code class=&quot;highlighter-rouge&quot;&gt;arr&lt;/code&gt; back to pandas is where things get tricky. The &lt;em&gt;minimum&lt;/em&gt;
-amount of memory that pandas can use is a little under 800 MB as above as we
-need 100 million &lt;code class=&quot;highlighter-rouge&quot;&gt;PyObject*&lt;/code&gt; values, which are 8 bytes each.&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [14]: arr_as_pandas = arr.to_pandas()
-
-In [15]: mem()
-Out[15]: 3041.78125
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;Doing the math, we used 765 MB which seems right. We can disable the string
-deduplication logic by passing &lt;code class=&quot;highlighter-rouge&quot;&gt;deduplicate_objects=False&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;to_pandas&lt;/code&gt;:&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [16]: arr_as_pandas_no_dedup = arr.to_pandas(deduplicate_objects=False)
-
-In [17]: mem()
-Out[17]: 10006.95703125
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;Without object deduplication, we use 6965 megabytes, or an average of 73 bytes
-per value. This is a little bit higher than the theoretical size of 67 bytes
-computed above.&lt;/p&gt;
-
-&lt;p&gt;One of the more surprising results is that the new behavior is about twice as fast:&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [18]: %time arr_as_pandas_time = arr.to_pandas()
-CPU times: user 2.94 s, sys: 213 ms, total: 3.15 s
-Wall time: 3.14 s
-
-In [19]: %time arr_as_pandas_no_dedup_time = arr.to_pandas(deduplicate_objects=False)
-CPU times: user 4.19 s, sys: 2.04 s, total: 6.23 s
-Wall time: 6.21 s
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;The reason for this is that creating so many Python objects is more expensive
-than hashing the 10 byte values and looking them up in a hash table.&lt;/p&gt;
-
-&lt;p&gt;Note that when you convert Arrow data with mostly unique values back to pandas,
-the memory use benefits here won’t have as much of an impact.&lt;/p&gt;
-
-&lt;h2 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h2&gt;
-
-&lt;p&gt;In Apache Arrow, our goal is to develop computational tools to operate natively
-on the cache- and SIMD-friendly efficient Arrow columnar format. In the
-meantime, though, we recognize that users have legacy applications using the
-native memory layout of pandas or other analytics tools. We will do our best to
-provide fast and memory-efficient interoperability with pandas and other
-popular libraries.&lt;/p&gt;</content><author><name>wesm</name></author><summary type="html">Python users who upgrade to recently released pyarrow 0.12 may find that their applications use significantly less memory when converting Arrow string data to pandas format. This includes using pyarrow.parquet.read_table and pandas.read_parquet. This article details some of what is going on under the hood, and why Python applications dealing with large amounts of strings are prone to memory use p [...]
\ No newline at end of file
+&lt;/ul&gt;</content><author><name>wesm</name></author><summary type="html">The Apache Arrow team is pleased to announce the 0.13.0 release. This covers more than 2 months of development work and includes 550 resolved issues from 81 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available. While it’s a large release, this post will give some brief highlights in the project since the 0.12.0 release from Janua [...]
\ No newline at end of file