Posted to commits@arrow.apache.org by we...@apache.org on 2019/02/05 14:38:19 UTC

[arrow-site] branch asf-site updated: Publish string memory use blog post

This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 04200bc  Publish string memory use blog post
04200bc is described below

commit 04200bcfc244252f0cec8825c5003b88bfc17d7f
Author: Wes McKinney <we...@apache.org>
AuthorDate: Tue Feb 5 08:36:30 2019 -0600

    Publish string memory use blog post
---
 .../02/05/python-string-memory-0.12/index.html     | 370 +++++++++++++++++++++
 blog/index.html                                    | 309 +++++++++++------
 feed.xml                                           | 302 +++++++++++------
 img/20190205-arrow-string.png                      | Bin 0 -> 4127 bytes
 img/20190205-numpy-string.png                      | Bin 0 -> 13714 bytes
 5 files changed, 766 insertions(+), 215 deletions(-)

diff --git a/blog/2019/02/05/python-string-memory-0.12/index.html b/blog/2019/02/05/python-string-memory-0.12/index.html
new file mode 100644
index 0000000..a0c1548
--- /dev/null
+++ b/blog/2019/02/05/python-string-memory-0.12/index.html
@@ -0,0 +1,370 @@
+<!DOCTYPE html>
+<html lang="en-US">
+  <head>
+    <meta charset="UTF-8">
+    <title>Apache Arrow Homepage</title>
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <meta name="generator" content="Jekyll v3.8.4">
+    <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
+    <link rel="icon" type="image/x-icon" href="/favicon.ico">
+
+    <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+
+    <link href="/css/main.css" rel="stylesheet">
+    <link href="/css/syntax.css" rel="stylesheet">
+    <script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
+    <script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.3/umd/popper.min.js" integrity="sha384-ZMP7rVo3mIykV+2+9J3UJ46jBk0WLaUAdn689aCwoqbBJiSnjAK/l8WvCWPIPm49" crossorigin="anonymous"></script>
+    
+    <!-- Global Site Tag (gtag.js) - Google Analytics -->
+<script async src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script>
+<script>
+  window.dataLayer = window.dataLayer || [];
+  function gtag(){dataLayer.push(arguments)};
+  gtag('js', new Date());
+
+  gtag('config', 'UA-107500873-1');
+</script>
+
+    
+  </head>
+
+
+
+<body class="wrap">
+  <header>
+    <nav class="navbar navbar-expand-md navbar-dark bg-dark">
+  <a class="navbar-brand" href="/">Apache Arrow&#8482;&nbsp;&nbsp;&nbsp;</a>
+  <button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbarSupportedContent" aria-controls="navbarSupportedContent" aria-expanded="false" aria-label="Toggle navigation">
+    <span class="navbar-toggler-icon"></span>
+  </button>
+
+    <!-- Collect the nav links, forms, and other content for toggling -->
+    <div class="collapse navbar-collapse" id="arrow-navbar">
+      <ul class="nav navbar-nav">
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownProjectLinks" role="button" data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             Project Links
+          </a>
+          <div class="dropdown-menu" aria-labelledby="navbarDropdownProjectLinks">
+            <a class="dropdown-item" href="/install/">Install</a>
+            <a class="dropdown-item" href="/blog/">Blog</a>
+            <a class="dropdown-item" href="/release/">Releases</a>
+            <a class="dropdown-item" href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a>
+            <a class="dropdown-item" href="https://github.com/apache/arrow">Source Code</a>
+          </div>
+        </li>
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownCommunity" role="button" data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             Community
+          </a>
+          <div class="dropdown-menu" aria-labelledby="navbarDropdownCommunity">
+            <a class="dropdown-item" href="http://mail-archives.apache.org/mod_mbox/arrow-user/">User Mailing List</a>
+            <a class="dropdown-item" href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Dev Mailing List</a>
+            <a class="dropdown-item" href="https://cwiki.apache.org/confluence/display/ARROW">Developer Wiki</a>
+            <a class="dropdown-item" href="/committers/">Committers</a>
+            <a class="dropdown-item" href="/powered_by/">Powered By</a>
+          </div>
+        </li>
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownSpecification" role="button" data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             Specification
+          </a>
+          <div class="dropdown-menu" aria-labelledby="navbarDropdownSpecification">
+            <a class="dropdown-item" href="/docs/memory_layout.html">Memory Layout</a>
+            <a class="dropdown-item" href="/docs/metadata.html">Metadata</a>
+            <a class="dropdown-item" href="/docs/ipc.html">Messaging / IPC</a>
+          </div>
+        </li>
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownDocumentation" role="button" data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             Documentation
+          </a>
+          <div class="dropdown-menu" aria-labelledby="navbarDropdownDocumentation">
+            <a class="dropdown-item" href="/docs">Project Docs</a>
+            <a class="dropdown-item" href="/docs/python">Python</a>
+            <a class="dropdown-item" href="/docs/cpp">C++</a>
+            <a class="dropdown-item" href="/docs/java">Java API</a>
+            <a class="dropdown-item" href="/docs/c_glib">C GLib API</a>
+            <a class="dropdown-item" href="/docs/js">Javascript API</a>
+          </div>
+        </li>
+        <!-- <li><a href="/blog">Blog</a></li> -->
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownASF" role="button" data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             ASF Links
+          </a>
+          <div class="dropdown-menu" aria-labelledby="navbarDropdownASF">
+            <a class="dropdown-item" href="http://www.apache.org/">ASF Website</a>
+            <a class="dropdown-item" href="http://www.apache.org/licenses/">License</a>
+            <a class="dropdown-item" href="http://www.apache.org/foundation/sponsorship.html">Donate</a>
+            <a class="dropdown-item" href="http://www.apache.org/foundation/thanks.html">Thanks</a>
+            <a class="dropdown-item" href="http://www.apache.org/security/">Security</a>
+          </div>
+        </li>
+      </ul>
+      <div class="flex-row justify-content-end ml-md-auto">
+        <a class="d-sm-none d-md-inline pr-2" href="https://www.apache.org/events/current-event.html">
+          <img src="https://www.apache.org/events/current-event-234x60.png"/>
+        </a>
+        <a href="http://www.apache.org/">
+          <img src="/img/asf_logo.svg" width="120px"/>
+        </a>
+      </div>
+      </div><!-- /.navbar-collapse -->
+    </div>
+  </nav>
+
+  </header>
+
+  <div class="container p-lg-4">
+    <main role="main">
+
+    <h1>
+      Reducing Python String Memory Use in Apache Arrow 0.12
+      <a href="/blog/2019/02/05/python-string-memory-0.12/" class="permalink" title="Permalink">∞</a>
+    </h1>
+
+    
+
+    <p>
+      <span class="badge badge-secondary">Published</span>
+      <span class="published">
+        05 Feb 2019
+      </span>
+      <br />
+      <span class="badge badge-secondary">By</span>
+      <a href="http://wesmckinney.com">Wes McKinney ()</a>
+    </p>
+
+    <!--
+
+-->
+
+<p>Python users who upgrade to recently released <code class="highlighter-rouge">pyarrow</code> 0.12 may find that
+their applications use significantly less memory when converting Arrow string
+data to pandas format. This includes using <code class="highlighter-rouge">pyarrow.parquet.read_table</code> and
+<code class="highlighter-rouge">pandas.read_parquet</code>. This article details some of what is going on under the
+hood, and why Python applications dealing with large amounts of strings are
+prone to memory use problems.</p>
+
+<h2 id="why-python-strings-can-use-a-lot-of-memory">Why Python strings can use a lot of memory</h2>
+
+<p>Let’s start with some possibly surprising facts. I’m going to create an empty
+<code class="highlighter-rouge">bytes</code> object and an empty <code class="highlighter-rouge">str</code> (unicode) object in Python 3.7:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [1]: val = b''
+
+In [2]: unicode_val = u''
+</code></pre></div></div>
+
+<p>The <code class="highlighter-rouge">sys.getsizeof</code> function accurately reports the number of bytes used by
+built-in Python objects. You might be surprised to find that:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [4]: import sys
+In [5]: sys.getsizeof(val)
+Out[5]: 33
+
+In [6]: sys.getsizeof(unicode_val)
+Out[6]: 49
+</code></pre></div></div>
+
+<p>Since strings in Python are nul-terminated, we can infer that a bytes object
+has 32 bytes of overhead while unicode has 48 bytes. One must also account for
+<code class="highlighter-rouge">PyObject*</code> pointer references to the objects, so the actual overhead is 40 and
+56 bytes, respectively. With large strings and text, this overhead may not
+matter much, but when you have a lot of small strings, such as those arising
+from reading a CSV or Apache Parquet file, they can take up an unexpected
+amount of memory. pandas represents strings in NumPy arrays of <code class="highlighter-rouge">PyObject*</code>
+pointers, so the total memory used by a unique unicode string is</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>8 (PyObject*) + 48 (Python C struct) + string_length + 1
+</code></pre></div></div>
+
+<p>Suppose that we read a CSV file with</p>
+
+<ul>
+  <li>1 column</li>
+  <li>1 million rows</li>
+  <li>Each value in the column is a string with 10 characters</li>
+</ul>
+
+<p>On disk this file would take approximately 10MB. Read into memory, however, it
+could take up over 60MB, as a 10 character string object takes up 67 bytes in a
+<code class="highlighter-rouge">pandas.Series</code>.</p>
+
+<h2 id="how-apache-arrow-represents-strings">How Apache Arrow represents strings</h2>
+
+<p>While a Python unicode string can have 57 bytes of overhead, a string in the
+Arrow columnar format has only 4 bytes (a 32-bit offset) or 4.125 bytes (a
+32-bit offset plus a validity bit) of overhead. The 32-bit integer offsets
+encode the position and size of a string value in a contiguous chunk of
+memory:</p>
+
+<div align="center">
+<img src="/img/20190205-arrow-string.png" alt="Apache Arrow string memory layout" width="80%" class="img-responsive" />
+</div>
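+
+<p>You can inspect this layout from Python. Here is a small sketch, assuming a
+pyarrow version in which <code class="highlighter-rouge">Array.buffers()</code> is available; for a string array it
+returns the validity bitmap (often <code class="highlighter-rouge">None</code> when there are no nulls), the int32
+offsets, and the UTF-8 character data:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pyarrow as pa
+
+arr = pa.array(['foo', 'ba', 'r'])
+
+# A string array is backed by up to three buffers: a validity bitmap,
+# a buffer of int32 offsets, and a buffer of packed UTF-8 data.
+validity, offsets, data = arr.buffers()
+
+# The offsets buffer holds len(arr) + 1 int32 values; the data buffer holds
+# the characters of every string laid out end to end.
+print(offsets.size, data.size)
+</code></pre></div></div>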
+
+<p>When you call <code class="highlighter-rouge">table.to_pandas()</code> or <code class="highlighter-rouge">array.to_pandas()</code> with <code class="highlighter-rouge">pyarrow</code>, we
+have to convert this compact string representation back to pandas’s
+Python-based strings. This can use a huge amount of memory when we have a large
+number of small strings. This is a common occurrence when working with web
+analytics data, which compresses to a compact size when stored in the Parquet
+columnar file format.</p>
+
+<p>Note that the Arrow string memory format has other benefits beyond memory
+use. It is also much more efficient for analytics due to the guarantee of data
+locality; all strings are next to each other in memory. In the case of pandas
+and Python strings, the string data can be located anywhere in the process
+heap. Arrow PMC member Uwe Korn did some work to <a href="https://www.slideshare.net/xhochy/extending-pandas-using-apache-arrow-and-numba">extend pandas with Arrow
+string arrays</a> for improved performance and memory use.</p>
+
+<h2 id="reducing-pandas-memory-use-when-converting-from-arrow">Reducing pandas memory use when converting from Arrow</h2>
+
+<p>For many years, the <code class="highlighter-rouge">pandas.read_csv</code> function has relied on a trick to limit
+the amount of string memory allocated. Because pandas uses arrays of
+<code class="highlighter-rouge">PyObject*</code> pointers to refer to objects in the Python heap, we can avoid
+creating multiple strings with the same value, instead reusing existing objects
+and incrementing their reference counts.</p>
+
+<p>Schematically, we have the following:</p>
+
+<div align="center">
+<img src="/img/20190205-numpy-string.png" alt="pandas string memory optimization" width="80%" class="img-responsive" />
+</div>
+
+<p>In <code class="highlighter-rouge">pyarrow</code> 0.12, we have implemented this when calling <code class="highlighter-rouge">to_pandas</code>. It
+requires using a hash table to deduplicate the Arrow string data as it’s being
+converted to pandas. Hashing data is not free, but counterintuitively it can be
+faster in addition to being vastly more memory efficient in the common case in
+analytics where we have table columns with many instances of the same string
+values.</p>
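+
+<p>The real conversion is implemented in C++ inside <code class="highlighter-rouge">pyarrow</code>; the pure-Python
+sketch below only illustrates the idea of keeping one object per distinct value
+and reusing it:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def convert_with_dedup(string_values):
+    # Keep a single Python str per distinct value and reuse it, so repeated
+    # values only increment a reference count instead of allocating a new
+    # string object each time.
+    interned = {}
+    out = []
+    for value in string_values:
+        obj = interned.get(value)
+        if obj is None:
+            obj = value
+            interned[value] = obj
+        out.append(obj)
+    return out
+</code></pre></div></div>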
+
+<h2 id="memory-and-performance-benchmarks">Memory and Performance Benchmarks</h2>
+
+<p>We can use the <code class="highlighter-rouge">memory_profiler</code> Python package to easily get process memory
+usage within a running Python application.</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import memory_profiler
+def mem():
+    return memory_profiler.memory_usage()[0]
+</code></pre></div></div>
+
+<p>In a new application I have:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [7]: mem()
+Out[7]: 86.21875
+</code></pre></div></div>
+
+<p>I will generate approximately 1 gigabyte of string data represented as Python
+strings with length 10. The <code class="highlighter-rouge">pandas.util.testing</code> module has a handy <code class="highlighter-rouge">rands</code>
+function for generating random strings. Here is the data generation function:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pandas.util.testing</span> <span class="kn">import</span> <span class="n">rands</span>
+<span class="k">def</span> <span class="nf">generate_strings</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">nunique</span><span class="p">,</span> <span class="n">string_length</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
+    <span class="n">unique_values</span> <span class="o">=</span> <span class="p">[</span><span class="n">rands</span><span class="p">(</span><span class="n">string_length</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nunique</span><span class="p">)]</span>
+    <span class="n">values</span> <span class="o">=</span> <span class="n">unique_values</span> <span class="o">*</span> <span class="p">(</span><span class="n">length</span> <span class="o">//</span> <span class="n">nunique</span><span class="p">)</span>
+    <span class="k">return</span> <span class="n">values</span>
+</code></pre></div></div>
+
+<p>This generates a certain number of unique strings, then duplicates them to
+yield the desired number of total strings. So I’m going to create 100 million
+strings with only 10000 unique values:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [8]: values = generate_strings(100000000, 10000)
+
+In [9]: mem()
+Out[9]: 852.140625
+</code></pre></div></div>
+
+<p>100 million <code class="highlighter-rouge">PyObject*</code> values is only 745 MB, so this increase of a little
+over 770 MB is consistent with what we know so far. Now I’m going to convert
+this to Arrow format:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [11]: arr = pa.array(values)
+
+In [12]: mem()
+Out[12]: 2276.9609375
+</code></pre></div></div>
+
+<p>Since <code class="highlighter-rouge">pyarrow</code> exactly accounts for all of its memory allocations, we also
+check that</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [13]: pa.total_allocated_bytes()
+Out[13]: 1416777280
+</code></pre></div></div>
+
+<p>Since each string takes about 14 bytes (10 bytes plus 4 bytes of overhead),
+this is what we expect.</p>
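+
+<p>A rough accounting of that figure (this sketch ignores any padding or spare
+capacity in the underlying buffers, so it lands slightly under the reported
+number):</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>character data:  100,000,000 * 10 bytes       = 1,000,000,000 bytes
+int32 offsets:   (100,000,000 + 1) * 4 bytes  =   400,000,004 bytes
+total:                                          ~ 1.4 GB
+</code></pre></div></div>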
+
+<p>Now, converting <code class="highlighter-rouge">arr</code> back to pandas is where things get tricky. The <em>minimum</em>
+amount of memory that pandas can use is a little under 800 MB, as above, since we
+need 100 million <code class="highlighter-rouge">PyObject*</code> values, which are 8 bytes each.</p>
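+
+<p>That floor is just the array of pointers itself:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>100,000,000 values * 8 bytes per PyObject* = 800,000,000 bytes (~763 MiB)
+</code></pre></div></div>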
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [14]: arr_as_pandas = arr.to_pandas()
+
+In [15]: mem()
+Out[15]: 3041.78125
+</code></pre></div></div>
+
+<p>Doing the math, we used 765 MB which seems right. We can disable the string
+deduplication logic by passing <code class="highlighter-rouge">deduplicate_objects=False</code> to <code class="highlighter-rouge">to_pandas</code>:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [16]: arr_as_pandas_no_dedup = arr.to_pandas(deduplicate_objects=False)
+
+In [17]: mem()
+Out[17]: 10006.95703125
+</code></pre></div></div>
+
+<p>Without object deduplication, we use 6965 megabytes, or an average of 73 bytes
+per value. This is a little bit higher than the theoretical size of 67 bytes
+computed above.</p>
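+
+<p>The 73 bytes per value figure follows directly from the measurements above:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>10006.95703125 MiB - 3041.78125 MiB = 6965.17578125 MiB
+6965.17578125 MiB * 2**20 bytes / 100,000,000 values ~ 73 bytes per value
+</code></pre></div></div>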
+
+<p>One of the more surprising results is that the new behavior is about twice as fast:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [18]: %time arr_as_pandas_time = arr.to_pandas()
+CPU times: user 2.94 s, sys: 213 ms, total: 3.15 s
+Wall time: 3.14 s
+
+In [19]: %time arr_as_pandas_no_dedup_time = arr.to_pandas(deduplicate_objects=False)
+CPU times: user 4.19 s, sys: 2.04 s, total: 6.23 s
+Wall time: 6.21 s
+</code></pre></div></div>
+
+<p>The reason for this is that creating so many Python objects is more expensive
+than hashing the 10 byte values and looking them up in a hash table.</p>
+
+<p>Note that when you convert Arrow data with mostly unique values back to pandas,
+the memory use benefits here won’t have as much of an impact.</p>
+
+<h2 id="takeaways">Takeaways</h2>
+
+<p>In Apache Arrow, our goal is to develop computational tools to operate natively
+on the cache- and SIMD-friendly Arrow columnar format. In the
+meantime, though, we recognize that users have legacy applications using the
+native memory layout of pandas or other analytics tools. We will do our best to
+provide fast and memory-efficient interoperability with pandas and other
+popular libraries.</p>
+
+
+    </main>
+
+    <hr/>
+<footer class="footer">
+  <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p>
+  <p>&copy; 2016-2019 The Apache Software Foundation</p>
+  <script src="/assets/main-8d2a359fd27a888246eb638b36a4e8b68ac65b9f11c48b9fac601fa0c9a7d796.js" integrity="sha256-jSo1n9J6iIJG62OLNqTotorGW58RxIufrGAfoMmn15Y=" crossorigin="anonymous" type="text/javascript"></script>
+</footer>
+
+  </div>
+</body>
+</html>
diff --git a/blog/index.html b/blog/index.html
index 0a60fe4..0f399f5 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -138,8 +138,8 @@
     
   <div class="blog-post" style="margin-bottom: 4rem">
     <h1>
-      DataFusion: A Rust-native Query Engine for Apache Arrow
-      <a href="/blog/2019/02/05/datafusion-donation/" class="permalink" title="Permalink">∞</a>
+      Reducing Python String Memory Use in Apache Arrow 0.12
+      <a href="/blog/2019/02/05/python-string-memory-0.12/" class="permalink" title="Permalink">∞</a>
     </h1>
 
     
@@ -151,125 +151,216 @@
       </span>
       <br />
       <span class="badge badge-secondary">By</span>
-      <a href=""> (agrove)</a>
+      <a href="http://wesmckinney.com">Wes McKinney (wesm)</a>
     </p>
     <!--
 
 -->
 
-<p>We are excited to announce that
-<a href="https://github.com/apache/arrow/tree/master/rust/datafusion">DataFusion</a> has
-been donated to the Apache Arrow project. DataFusion is an in-memory query
-engine for the Rust implementation of Apache Arrow.</p>
-
-<p>Although DataFusion was started two years ago, it was recently re-implemented
-to be Arrow-native and currently has limited capabilities but does support SQL
-queries against iterators of RecordBatch and has support for CSV files. There
-are plans to <a href="https://issues.apache.org/jira/browse/ARROW-4466">add support for Parquet
-files</a>.</p>
-
-<p>SQL support is limited to projection (<code class="highlighter-rouge">SELECT</code>), selection (<code class="highlighter-rouge">WHERE</code>), and
-simple aggregates (<code class="highlighter-rouge">MIN</code>, <code class="highlighter-rouge">MAX</code>, <code class="highlighter-rouge">SUM</code>) with an optional <code class="highlighter-rouge">GROUP BY</code> clause.</p>
-
-<p>Supported expressions are identifiers, literals, simple math operations (<code class="highlighter-rouge">+</code>,
-<code class="highlighter-rouge">-</code>, <code class="highlighter-rouge">*</code>, <code class="highlighter-rouge">/</code>), binary expressions (<code class="highlighter-rouge">AND</code>, <code class="highlighter-rouge">OR</code>), equality and comparison
-operators (<code class="highlighter-rouge">=</code>, <code class="highlighter-rouge">!=</code>, <code class="highlighter-rouge">&lt;</code>, <code class="highlighter-rouge">&lt;=</code>, <code class="highlighter-rouge">&gt;=</code>, <code class="highlighter-rouge">&gt;</code>), and <code class="highlighter-rouge">CAST(expr AS type)</code>.</p>
-
-<h2 id="example">Example</h2>
-
-<p>The following example demonstrates running a simple aggregate SQL query against
-a CSV file.</p>
-
-<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// create execution context</span>
-<span class="k">let</span> <span class="k">mut</span> <span class="n">ctx</span> <span class="o">=</span> <span class="nn">ExecutionContext</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
-
-<span class="c">// define schema for data source (csv file)</span>
-<span class="k">let</span> <span class="n">schema</span> <span class="o">=</span> <span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">Schema</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nd">vec!</span><span class="p">[</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c1"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">Utf8</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c2"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">UInt32</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c3"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">Int8</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c4"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">Int16</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c5"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">Int32</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c6"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">Int64</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c7"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">UInt8</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c8"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">UInt16</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c9"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">UInt32</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c10"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">UInt64</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c11"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">Float32</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c12"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">Float64</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-    <span class="nn">Field</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"c13"</span><span class="p">,</span> <span class="nn">DataType</span><span class="p">::</span><span class="n">Utf8</span><span class="p">,</span> <span class="kc">false</span><span class="p">),</span>
-<span class="p">]));</span>
-
-<span class="c">// register csv file with the execution context</span>
-<span class="k">let</span> <span class="n">csv_datasource</span> <span class="o">=</span>
-    <span class="nn">CsvDataSource</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"test/data/aggregate_test_100.csv"</span><span class="p">,</span> <span class="n">schema</span><span class="nf">.clone</span><span class="p">(),</span> <span class="mi">1024</span><span class="p">);</span>
-<span class="n">ctx</span><span class="nf">.register_datasource</span><span class="p">(</span><span class="s">"aggregate_test_100"</span><span class="p">,</span> <span class="nn">Rc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">RefCell</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">csv_datasource</span><span class="p">)));</span>
-
-<span class="k">let</span> <span class="n">sql</span> <span class="o">=</span> <span class="s">"SELECT c1, MIN(c12), MAX(c12) FROM aggregate_test_100 WHERE c11 &gt; 0.1 AND c11 &lt; 0.9 GROUP BY c1"</span><span class="p">;</span>
-
-<span class="c">// execute the query</span>
-<span class="k">let</span> <span class="n">relation</span> <span class="o">=</span> <span class="n">ctx</span><span class="nf">.sql</span><span class="p">(</span><span class="o">&amp;</span><span class="n">sql</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>
-<span class="k">let</span> <span class="k">mut</span> <span class="n">results</span> <span class="o">=</span> <span class="n">relation</span><span class="nf">.borrow_mut</span><span class="p">();</span>
-
-<span class="c">// iterate over the results</span>
-<span class="k">while</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span> <span class="o">=</span> <span class="n">results</span><span class="nf">.next</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span> <span class="p">{</span>
-    <span class="nd">println!</span><span class="p">(</span>
-        <span class="s">"RecordBatch has {} rows and {} columns"</span><span class="p">,</span>
-        <span class="n">batch</span><span class="nf">.num_rows</span><span class="p">(),</span>
-        <span class="n">batch</span><span class="nf">.num_columns</span><span class="p">()</span>
-    <span class="p">);</span>
-
-    <span class="k">let</span> <span class="n">c1</span> <span class="o">=</span> <span class="n">batch</span>
-        <span class="nf">.column</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
-        <span class="nf">.as_any</span><span class="p">()</span>
-        <span class="py">.downcast_ref</span><span class="p">::</span><span class="o">&lt;</span><span class="n">BinaryArray</span><span class="o">&gt;</span><span class="p">()</span>
-        <span class="nf">.unwrap</span><span class="p">();</span>
-
-    <span class="k">let</span> <span class="n">min</span> <span class="o">=</span> <span class="n">batch</span>
-        <span class="nf">.column</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
-        <span class="nf">.as_any</span><span class="p">()</span>
-        <span class="py">.downcast_ref</span><span class="p">::</span><span class="o">&lt;</span><span class="n">Float64Array</span><span class="o">&gt;</span><span class="p">()</span>
-        <span class="nf">.unwrap</span><span class="p">();</span>
-
-    <span class="k">let</span> <span class="n">max</span> <span class="o">=</span> <span class="n">batch</span>
-        <span class="nf">.column</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
-        <span class="nf">.as_any</span><span class="p">()</span>
-        <span class="py">.downcast_ref</span><span class="p">::</span><span class="o">&lt;</span><span class="n">Float64Array</span><span class="o">&gt;</span><span class="p">()</span>
-        <span class="nf">.unwrap</span><span class="p">();</span>
-
-    <span class="k">for</span> <span class="n">i</span> <span class="n">in</span> <span class="mi">0</span><span class="o">..</span><span class="n">batch</span><span class="nf">.num_rows</span><span class="p">()</span> <span class="p">{</span>
-        <span class="k">let</span> <span class="n">c1_value</span><span class="p">:</span> <span class="nb">String</span> <span class="o">=</span> <span class="nn">String</span><span class="p">::</span><span class="nf">from_utf8</span><span class="p">(</span><span class="n">c1</span><span class="nf">.value</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="nf">.to_vec</span><span class="p">())</span><span class="nf">.unwrap</span><span class="p">() [...]
-        <span class="nd">println!</span><span class="p">(</span><span class="s">"{}, Min: {}, Max: {}"</span><span class="p">,</span> <span class="n">c1_value</span><span class="p">,</span> <span class="n">min</span><span class="nf">.value</span><span class="p">(</span><span class="n">i</span><span class="p">),</span> <span class="n">max</span><span class="nf">.value</span><span class="p">(</span><span class="n">i</span><span class="p">),);</span>
-    <span class="p">}</span>
-<span class="p">}</span>
+<p>Python users who upgrade to recently released <code class="highlighter-rouge">pyarrow</code> 0.12 may find that
+their applications use significantly less memory when converting Arrow string
+data to pandas format. This includes using <code class="highlighter-rouge">pyarrow.parquet.read_table</code> and
+<code class="highlighter-rouge">pandas.read_parquet</code>. This article details some of what is going on under the
+hood, and why Python applications dealing with large amounts of strings are
+prone to memory use problems.</p>
+
+<h2 id="why-python-strings-can-use-a-lot-of-memory">Why Python strings can use a lot of memory</h2>
+
+<p>Let’s start with some possibly surprising facts. I’m going to create an empty
+<code class="highlighter-rouge">bytes</code> object and an empty <code class="highlighter-rouge">str</code> (unicode) object in Python 3.7:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [1]: val = b''
+
+In [2]: unicode_val = u''
 </code></pre></div></div>
 
-<h2 id="roadmap">Roadmap</h2>
+<p>The <code class="highlighter-rouge">sys.getsizeof</code> function accurately reports the number of bytes used by
+built-in Python objects. You might be surprised to find that:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [4]: import sys
+In [5]: sys.getsizeof(val)
+Out[5]: 33
+
+In [6]: sys.getsizeof(unicode_val)
+Out[6]: 49
+</code></pre></div></div>
 
-<p>The roadmap for DataFusion will depend on interest from the Rust community, but
-here are some of the short term items that are planned:</p>
+<p>Since strings in Python are nul-terminated, we can infer that a bytes object
+has 32 bytes of overhead while unicode has 48 bytes. One must also account for
+<code class="highlighter-rouge">PyObject*</code> pointer references to the objects, so the actual overhead is 40 and
+56 bytes, respectively. With large strings and text, this overhead may not
+matter much, but when you have a lot of small strings, such as those arising
+from reading a CSV or Apache Parquet file, they can take up an unexpected
+amount of memory. pandas represents strings in NumPy arrays of <code class="highlighter-rouge">PyObject*</code>
+pointers, so the total memory used by a unique unicode string is</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>8 (PyObject*) + 48 (Python C struct) + string_length + 1
+</code></pre></div></div>
+
+<p>Suppose that we read a CSV file with</p>
 
 <ul>
-  <li>Extending test coverage of the existing functionality</li>
-  <li>Adding support for Parquet data sources</li>
-  <li>Implementing more SQL features such as <code class="highlighter-rouge">JOIN</code>, <code class="highlighter-rouge">ORDER BY</code> and <code class="highlighter-rouge">LIMIT</code></li>
-  <li>Implement a DataFrame API as an alternative to SQL</li>
-  <li>Adding support for partitioning and parallel query execution using Rust’s
-async and await functionality</li>
-  <li>Creating a Docker image to make it easy to use DataFusion as a standalone
-query tool for interactive and batch queries</li>
+  <li>1 column</li>
+  <li>1 million rows</li>
+  <li>Each value in the column is a string with 10 characters</li>
 </ul>
 
-<h2 id="contributors-welcome">Contributors Welcome!</h2>
+<p>On disk this file would take approximately 10MB. Read into memory, however, it
+could take up over 60MB, as a 10 character string object takes up 67 bytes in a
+<code class="highlighter-rouge">pandas.Series</code>.</p>
+
+<h2 id="how-apache-arrow-represents-strings">How Apache Arrow represents strings</h2>
+
+<p>While a Python unicode string can have 57 bytes of overhead, a string in the
+Arrow columnar format has only 4 bytes (a 32-bit offset) or 4.125 bytes (a
+32-bit offset plus a validity bit) of overhead. The 32-bit integer offsets
+encode the position and size of a string value in a contiguous chunk of
+memory:</p>
+
+<div align="center">
+<img src="/img/20190205-arrow-string.png" alt="Apache Arrow string memory layout" width="80%" class="img-responsive" />
+</div>
+
+<p>When you call <code class="highlighter-rouge">table.to_pandas()</code> or <code class="highlighter-rouge">array.to_pandas()</code> with <code class="highlighter-rouge">pyarrow</code>, we
+have to convert this compact string representation back to pandas’s
+Python-based strings. This can use a huge amount of memory when we have a large
+number of small strings. This is a common occurrence when working with web
+analytics data, which compresses to a compact size when stored in the Parquet
+columnar file format.</p>
+
+<p>Note that the Arrow string memory format has other benefits beyond memory
+use. It is also much more efficient for analytics due to the guarantee of data
+locality; all strings are next to each other in memory. In the case of pandas
+and Python strings, the string data can be located anywhere in the process
+heap. Arrow PMC member Uwe Korn did some work to <a href="https://www.slideshare.net/xhochy/extending-pandas-using-apache-arrow-and-numba">extend pandas with Arrow
+string arrays</a> for improved performance and memory use.</p>
+
+<h2 id="reducing-pandas-memory-use-when-converting-from-arrow">Reducing pandas memory use when converting from Arrow</h2>
+
+<p>For many years, the <code class="highlighter-rouge">pandas.read_csv</code> function has relied on a trick to limit
+the amount of string memory allocated. Because pandas uses arrays of
+<code class="highlighter-rouge">PyObject*</code> pointers to refer to objects in the Python heap, we can avoid
+creating multiple strings with the same value, instead reusing existing objects
+and incrementing their reference counts.</p>
+
+<p>Schematically, we have the following:</p>
+
+<div align="center">
+<img src="/img/20190205-numpy-string.png" alt="pandas string memory optimization" width="80%" class="img-responsive" />
+</div>
+
+<p>In <code class="highlighter-rouge">pyarrow</code> 0.12, we have implemented this when calling <code class="highlighter-rouge">to_pandas</code>. It
+requires using a hash table to deduplicate the Arrow string data as it’s being
+converted to pandas. Hashing data is not free, but counterintuitively it can be
+faster in addition to being vastly more memory efficient in the common case in
+analytics where we have table columns with many instances of the same string
+values.</p>
+
+<h2 id="memory-and-performance-benchmarks">Memory and Performance Benchmarks</h2>
+
+<p>We can use the <code class="highlighter-rouge">memory_profiler</code> Python package to easily get process memory
+usage within a running Python application.</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import memory_profiler
+def mem():
+    return memory_profiler.memory_usage()[0]
+</code></pre></div></div>
+
+<p>In a new application I have:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [7]: mem()
+Out[7]: 86.21875
+</code></pre></div></div>
+
+<p>I will generate approximately 1 gigabyte of string data represented as Python
+strings with length 10. The <code class="highlighter-rouge">pandas.util.testing</code> module has a handy <code class="highlighter-rouge">rands</code>
+function for generating random strings. Here is the data generation function:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pandas.util.testing</span> <span class="kn">import</span> <span class="n">rands</span>
+<span class="k">def</span> <span class="nf">generate_strings</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">nunique</span><span class="p">,</span> <span class="n">string_length</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
+    <span class="n">unique_values</span> <span class="o">=</span> <span class="p">[</span><span class="n">rands</span><span class="p">(</span><span class="n">string_length</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nunique</span><span class="p">)]</span>
+    <span class="n">values</span> <span class="o">=</span> <span class="n">unique_values</span> <span class="o">*</span> <span class="p">(</span><span class="n">length</span> <span class="o">//</span> <span class="n">nunique</span><span class="p">)</span>
+    <span class="k">return</span> <span class="n">values</span>
+</code></pre></div></div>
+
+<p>This generates a certain number of unique strings, then duplicates them to
+yield the desired number of total strings. So I’m going to create 100 million
+strings with only 10000 unique values:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [8]: values = generate_strings(100000000, 10000)
+
+In [9]: mem()
+Out[9]: 852.140625
+</code></pre></div></div>
+
+<p>100 million <code class="highlighter-rouge">PyObject*</code> values is only 745 MB, so this increase of a little
+over 770 MB is consistent with what we know so far. Now I’m going to convert
+this to Arrow format:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [11]: arr = pa.array(values)
+
+In [12]: mem()
+Out[12]: 2276.9609375
+</code></pre></div></div>
+
+<p>Since <code class="highlighter-rouge">pyarrow</code> exactly accounts for all of its memory allocations, we also
+check that</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [13]: pa.total_allocated_bytes()
+Out[13]: 1416777280
+</code></pre></div></div>
+
+<p>Since each string takes about 14 bytes (10 bytes plus 4 bytes of overhead),
+this is what we expect.</p>
+
+<p>Now, converting <code class="highlighter-rouge">arr</code> back to pandas is where things get tricky. The <em>minimum</em>
+amount of memory that pandas can use is a little under 800 MB, as above, since we
+need 100 million <code class="highlighter-rouge">PyObject*</code> values, which are 8 bytes each.</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [14]: arr_as_pandas = arr.to_pandas()
+
+In [15]: mem()
+Out[15]: 3041.78125
+</code></pre></div></div>
+
+<p>Doing the math, we used 765 MB which seems right. We can disable the string
+deduplication logic by passing <code class="highlighter-rouge">deduplicate_objects=False</code> to <code class="highlighter-rouge">to_pandas</code>:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [16]: arr_as_pandas_no_dedup = arr.to_pandas(deduplicate_objects=False)
+
+In [17]: mem()
+Out[17]: 10006.95703125
+</code></pre></div></div>
+
+<p>Without object deduplication, we use 6965 megabytes, or an average of 73 bytes
+per value. This is a little bit higher than the theoretical size of 67 bytes
+computed above.</p>
+
+<p>One of the more surprising results is that the new behavior is about twice as fast:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In [18]: %time arr_as_pandas_time = arr.to_pandas()
+CPU times: user 2.94 s, sys: 213 ms, total: 3.15 s
+Wall time: 3.14 s
+
+In [19]: %time arr_as_pandas_no_dedup_time = arr.to_pandas(deduplicate_objects=False)
+CPU times: user 4.19 s, sys: 2.04 s, total: 6.23 s
+Wall time: 6.21 s
+</code></pre></div></div>
+
+<p>The reason for this is that creating so many Python objects is more expensive
+than hashing the 10 byte values and looking them up in a hash table.</p>
+
+<p>Note that when you convert Arrow data with mostly unique values back to pandas,
+the memory use benefits here won’t have as much of an impact.</p>
+
+<h2 id="takeaways">Takeaways</h2>
+
+<p>In Apache Arrow, our goal is to develop computational tools to operate natively
+on the cache- and SIMD-friendly Arrow columnar format. In the
+meantime, though, we recognize that users have legacy applications using the
+native memory layout of pandas or other analytics tools. We will do our best to
+provide fast and memory-efficient interoperability with pandas and other
+popular libraries.</p>
 
-<p>If you are excited about being able to use Rust for data science and would like
-to contribute to this work then there are many ways to get involved. The
-simplest way to get started is to try out DataFusion against your own data
-sources and file bug reports for any issues that you find. You could also check
-out the current <a href="https://cwiki.apache.org/confluence/display/ARROW/Rust+JIRA+Dashboard">list of
-issues</a>
-and have a go at fixing one. You can also join the <a href="http://mail-archives.apache.org/mod_mbox/arrow-user/">user mailing
-list</a> to ask questions.</p>
 
   </div>
 
@@ -292,7 +383,7 @@ list</a> to ask questions.</p>
       </span>
       <br />
       <span class="badge badge-secondary">By</span>
-      <a href=""> (javierluraschi)</a>
+      <a href="http://wesmckinney.com">Wes McKinney (javierluraschi)</a>
     </p>
     <!--
 
diff --git a/feed.xml b/feed.xml
index 8db0160..2642bf0 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,120 +1,210 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.4">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2019-02-05T09:30:27-05:00</updated><id>/feed.xml</id><entry><title type="html">DataFusion: A Rust-native Query Engine for Apache Arrow</title><link href="/blog/2019/02/05/datafusion-donation/" rel="alternate" type=" [...]
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.4">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2019-02-05T09:36:14-05:00</updated><id>/feed.xml</id><entry><title type="html">Reducing Python String Memory Use in Apache Arrow 0.12</title><link href="/blog/2019/02/05/python-string-memory-0.12/" rel="alternate" t [...]
 
 --&gt;
 
-&lt;p&gt;We are excited to announce that
-&lt;a href=&quot;https://github.com/apache/arrow/tree/master/rust/datafusion&quot;&gt;DataFusion&lt;/a&gt; has
-been donated to the Apache Arrow project. DataFusion is an in-memory query
-engine for the Rust implementation of Apache Arrow.&lt;/p&gt;
-
-&lt;p&gt;Although DataFusion was started two years ago, it was recently re-implemented
-to be Arrow-native and currently has limited capabilities but does support SQL
-queries against iterators of RecordBatch and has support for CSV files. There
-are plans to &lt;a href=&quot;https://issues.apache.org/jira/browse/ARROW-4466&quot;&gt;add support for Parquet
-files&lt;/a&gt;.&lt;/p&gt;
-
-&lt;p&gt;SQL support is limited to projection (&lt;code class=&quot;highlighter-rouge&quot;&gt;SELECT&lt;/code&gt;), selection (&lt;code class=&quot;highlighter-rouge&quot;&gt;WHERE&lt;/code&gt;), and
-simple aggregates (&lt;code class=&quot;highlighter-rouge&quot;&gt;MIN&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;MAX&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;SUM&lt;/code&gt;) with an optional &lt;code class=&quot;highlighter-rouge&quot;&gt;GROUP BY&lt;/code&gt; clause.&lt;/p&gt;
-
-&lt;p&gt;Supported expressions are identifiers, literals, simple math operations (&lt;code class=&quot;highlighter-rouge&quot;&gt;+&lt;/code&gt;,
-&lt;code class=&quot;highlighter-rouge&quot;&gt;-&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;*&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;/&lt;/code&gt;), binary expressions (&lt;code class=&quot;highlighter-rouge&quot;&gt;AND&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;OR&lt;/code&gt;), equality and comparison
-operators (&lt;code class=&quot;highlighter-rouge&quot;&gt;=&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;!=&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;=&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;gt;=&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;gt;&lt;/code&gt;), and &lt;code class=&quot;highlighter-rouge&quot;&gt;CAST(expr AS  [...]
-
-&lt;h2 id=&quot;example&quot;&gt;Example&lt;/h2&gt;
-
-&lt;p&gt;The following example demonstrates running a simple aggregate SQL query against
-a CSV file.&lt;/p&gt;
-
-&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;// create execution context&lt;/span&gt;
-&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;ExecutionContext&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
-
-&lt;span class=&quot;c&quot;&gt;// define schema for data source (csv file)&lt;/span&gt;
-&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;schema&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Arc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;Schema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;s [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Utf8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/s [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UInt32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt; [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c3&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Int8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/s [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c4&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Int16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/ [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c5&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Int32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/ [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c6&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Int64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/ [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c7&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UInt8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/ [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c8&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UInt16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt; [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c9&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UInt32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt; [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c10&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UInt64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c11&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&l [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c12&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Float64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&l [...]
-    &lt;span class=&quot;nn&quot;&gt;Field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;c13&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;DataType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Utf8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/ [...]
-&lt;span class=&quot;p&quot;&gt;]));&lt;/span&gt;
-
-&lt;span class=&quot;c&quot;&gt;// register csv file with the execution context&lt;/span&gt;
-&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;csv_datasource&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
-    &lt;span class=&quot;nn&quot;&gt;CsvDataSource&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;test/data/aggregate_test_100.csv&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;schema&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.clone&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; [...]
-&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.register_datasource&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;aggregate_test_100&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Rc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot [...]
-
-&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;SELECT c1, MIN(c12), MAX(c12) FROM aggregate_test_100 WHERE c11 &amp;gt; 0.1 AND c11 &amp;lt; 0.9 GROUP BY c1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
-
-&lt;span class=&quot;c&quot;&gt;// execute the query&lt;/span&gt;
-&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.unwrap&lt;/span& [...]
-&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;results&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relation&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.borrow_mut&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
-
-&lt;span class=&quot;c&quot;&gt;// iterate over the results&lt;/span&gt;
-&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Some&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;results&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.next&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&l [...]
-    &lt;span class=&quot;nd&quot;&gt;println!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
-        &lt;span class=&quot;s&quot;&gt;&quot;RecordBatch has {} rows and {} columns&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
-        &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.num_rows&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt;
-        &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.num_columns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
-    &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
-
-    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;
-        &lt;span class=&quot;nf&quot;&gt;.column&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-        &lt;span class=&quot;nf&quot;&gt;.as_any&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
-        &lt;span class=&quot;py&quot;&gt;.downcast_ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BinaryArray&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
-        &lt;span class=&quot;nf&quot;&gt;.unwrap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
-
-    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;
-        &lt;span class=&quot;nf&quot;&gt;.column&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-        &lt;span class=&quot;nf&quot;&gt;.as_any&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
-        &lt;span class=&quot;py&quot;&gt;.downcast_ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Float64Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
-        &lt;span class=&quot;nf&quot;&gt;.unwrap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
-
-    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;
-        &lt;span class=&quot;nf&quot;&gt;.column&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-        &lt;span class=&quot;nf&quot;&gt;.as_any&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
-        &lt;span class=&quot;py&quot;&gt;.downcast_ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Float64Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
-        &lt;span class=&quot;nf&quot;&gt;.unwrap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
-
-    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;..&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.num_rows&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-        &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c1_value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;from_utf8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1& [...]
-        &lt;span class=&quot;nd&quot;&gt;println!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;{}, Min: {}, Max: {}&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c1_value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class [...]
-    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;p&gt;Python users who upgrade to recently released &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow&lt;/code&gt; 0.12 may find that
+their applications use significantly less memory when converting Arrow string
+data to pandas format. This includes using &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow.parquet.read_table&lt;/code&gt; and
+&lt;code class=&quot;highlighter-rouge&quot;&gt;pandas.read_parquet&lt;/code&gt;. This article details some of what is going on under the
+hood, and why Python applications dealing with large numbers of strings are
+prone to memory use problems.&lt;/p&gt;
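+
+&lt;p&gt;As a simple illustration, both of the following code paths end in an
+Arrow-to-pandas conversion of any string columns (the file name here is
+hypothetical):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import pandas as pd
+import pyarrow.parquet as pq
+
+# read with pyarrow directly, then convert to pandas
+df1 = pq.read_table('strings.parquet').to_pandas()
+
+# read through pandas using the pyarrow engine
+df2 = pd.read_parquet('strings.parquet', engine='pyarrow')
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;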
+
+&lt;h2 id=&quot;why-python-strings-can-use-a-lot-of-memory&quot;&gt;Why Python strings can use a lot of memory&lt;/h2&gt;
+
+&lt;p&gt;Let’s start with some possibly surprising facts. I’m going to create an empty
+&lt;code class=&quot;highlighter-rouge&quot;&gt;bytes&lt;/code&gt; object and an empty &lt;code class=&quot;highlighter-rouge&quot;&gt;str&lt;/code&gt; (unicode) object in Python 3.7:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [1]: val = b''
+
+In [2]: unicode_val = u''
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;sys.getsizeof&lt;/code&gt; function accurately reports the number of bytes used by
+built-in Python objects. You might be surprised to find that:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [4]: import sys
+In [5]: sys.getsizeof(val)
+Out[5]: 33
+
+In [6]: sys.getsizeof(unicode_val)
+Out[6]: 49
 &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
 
-&lt;h2 id=&quot;roadmap&quot;&gt;Roadmap&lt;/h2&gt;
+&lt;p&gt;Since strings in Python are nul-terminated, we can infer that a bytes object
+has 32 bytes of overhead while unicode has 48 bytes. One must also account for
+&lt;code class=&quot;highlighter-rouge&quot;&gt;PyObject*&lt;/code&gt; pointer references to the objects, so the actual overhead is 40 and
+56 bytes, respectively. With large strings and text, this overhead may not
+matter much, but when you have a lot of small strings, such as those arising
+from reading a CSV or Apache Parquet file, they can take up an unexpected
+amount of memory. pandas represents strings in NumPy arrays of &lt;code class=&quot;highlighter-rouge&quot;&gt;PyObject*&lt;/code&gt;
+pointers, so the total memory used by a unique unicode string is&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;8 (PyObject*) + 48 (Python C struct) + string_length + 1
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
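+
+&lt;p&gt;As a quick sanity check of that formula on the same CPython 3.7 build, a
+10-character ASCII string measures 59 bytes with &lt;code class=&quot;highlighter-rouge&quot;&gt;sys.getsizeof&lt;/code&gt;, and adding
+the 8-byte pointer gives 67 bytes:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import sys
+
+s = '0123456789'      # 10 characters
+sys.getsizeof(s)      # 59: 48 bytes of struct overhead + 10 characters + 1 nul byte
+8 + sys.getsizeof(s)  # 67 once the PyObject* pointer is included
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;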
 
-&lt;p&gt;The roadmap for DataFusion will depend on interest from the Rust community, but
-here are some of the short term items that are planned:&lt;/p&gt;
+&lt;p&gt;Suppose that we read a CSV file with&lt;/p&gt;
 
 &lt;ul&gt;
-  &lt;li&gt;Extending test coverage of the existing functionality&lt;/li&gt;
-  &lt;li&gt;Adding support for Parquet data sources&lt;/li&gt;
-  &lt;li&gt;Implementing more SQL features such as &lt;code class=&quot;highlighter-rouge&quot;&gt;JOIN&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;LIMIT&lt;/code&gt;&lt;/li&gt;
-  &lt;li&gt;Implement a DataFrame API as an alternative to SQL&lt;/li&gt;
-  &lt;li&gt;Adding support for partitioning and parallel query execution using Rust’s
-async and await functionality&lt;/li&gt;
-  &lt;li&gt;Creating a Docker image to make it easy to use DataFusion as a standalone
-query tool for interactive and batch queries&lt;/li&gt;
+  &lt;li&gt;1 column&lt;/li&gt;
+  &lt;li&gt;1 million rows&lt;/li&gt;
+  &lt;li&gt;Each value in the column is a string with 10 characters&lt;/li&gt;
 &lt;/ul&gt;
 
-&lt;h2 id=&quot;contributors-welcome&quot;&gt;Contributors Welcome!&lt;/h2&gt;
+&lt;p&gt;On disk this file would take approximately 10MB. Read into memory, however, it
+could take up over 60MB, as a 10-character string object takes up 67 bytes in a
+&lt;code class=&quot;highlighter-rouge&quot;&gt;pandas.Series&lt;/code&gt;.&lt;/p&gt;
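+
+&lt;p&gt;A quick back-of-envelope calculation shows where that figure comes from:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;rows = 1_000_000
+bytes_per_string = 8 + 48 + 10 + 1    # pointer + struct overhead + data + nul
+rows * bytes_per_string / 2**20       # roughly 64 MiB before any other overhead
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;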
+
+&lt;h2 id=&quot;how-apache-arrow-represents-strings&quot;&gt;How Apache Arrow represents strings&lt;/h2&gt;
+
+&lt;p&gt;While a Python unicode string can have 57 bytes of overhead, a string in the
+Arrow columnar format has only 4 (32 bits) or 4.125 (33 bits) bytes of
+overhead. 32-bit integer offsets encode the position and size of a string
+value in a contiguous chunk of memory:&lt;/p&gt;
+
+&lt;div align=&quot;center&quot;&gt;
+&lt;img src=&quot;/img/20190205-arrow-string.png&quot; alt=&quot;Apache Arrow string memory layout&quot; width=&quot;80%&quot; class=&quot;img-responsive&quot; /&gt;
+&lt;/div&gt;
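+
+&lt;p&gt;A small sketch with &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow&lt;/code&gt; makes this layout concrete. The
+&lt;code class=&quot;highlighter-rouge&quot;&gt;Array.buffers()&lt;/code&gt; method exposes the validity bitmap, the 32-bit offsets, and
+the contiguous character data (the values in the comments are what I would
+expect for this input):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np
+import pyarrow as pa
+
+arr = pa.array(['foo', 'barbaz', 'qux'])
+
+# a string array is backed by three buffers; the validity bitmap may be
+# None when there are no nulls
+validity, offsets, data = arr.buffers()
+
+np.frombuffer(offsets, dtype=np.int32)  # array([ 0,  3,  9, 12], dtype=int32)
+data.to_pybytes()                       # b'foobarbazqux'
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;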
+
+&lt;p&gt;When you call &lt;code class=&quot;highlighter-rouge&quot;&gt;table.to_pandas()&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;array.to_pandas()&lt;/code&gt; with &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow&lt;/code&gt;, we
+have to convert this compact string representation back to pandas’s
+Python-based strings. This can use a huge amount of memory when we have a large
+number of small strings. This is quite a common occurrence when working with web
+analytics data, which compresses to a compact size when stored in the Parquet
+columnar file format.&lt;/p&gt;
+
+&lt;p&gt;Note that the Arrow string memory format has other benefits beyond memory
+use. It is also much more efficient for analytics due to the guarantee of data
+locality; all strings are next to each other in memory. In the case of pandas
+and Python strings, the string data can be located anywhere in the process
+heap. Arrow PMC member Uwe Korn did some work to &lt;a href=&quot;https://www.slideshare.net/xhochy/extending-pandas-using-apache-arrow-and-numba&quot;&gt;extend pandas with Arrow
+string arrays&lt;/a&gt; for improved performance and memory use.&lt;/p&gt;
+
+&lt;h2 id=&quot;reducing-pandas-memory-use-when-converting-from-arrow&quot;&gt;Reducing pandas memory use when converting from Arrow&lt;/h2&gt;
+
+&lt;p&gt;For many years, the &lt;code class=&quot;highlighter-rouge&quot;&gt;pandas.read_csv&lt;/code&gt; function has relied on a trick to limit
+the amount of string memory allocated. Because pandas uses arrays of
+&lt;code class=&quot;highlighter-rouge&quot;&gt;PyObject*&lt;/code&gt; pointers to refer to objects in the Python heap, we can avoid
+creating multiple strings with the same value, instead reusing existing objects
+and incrementing their reference counts.&lt;/p&gt;
+
+&lt;p&gt;Schematically, we have the following:&lt;/p&gt;
+
+&lt;div align=&quot;center&quot;&gt;
+&lt;img src=&quot;/img/20190205-numpy-string.png&quot; alt=&quot;pandas string memory optimization&quot; width=&quot;80%&quot; class=&quot;img-responsive&quot; /&gt;
+&lt;/div&gt;
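+
+&lt;p&gt;The actual implementation lives in the Arrow C++ library, but a pure-Python
+sketch of the idea looks roughly like this (the function name and details are
+only illustrative):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def strings_to_objects(strings):
+    # keep one Python object per distinct value; repeated values only
+    # increment the reference count of the object already created
+    interned = {}
+    out = []
+    for s in strings:
+        obj = interned.get(s)
+        if obj is None:
+            interned[s] = obj = s
+        out.append(obj)
+    return out
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;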
+
+&lt;p&gt;In &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow&lt;/code&gt; 0.12, we have implemented this when calling &lt;code class=&quot;highlighter-rouge&quot;&gt;to_pandas&lt;/code&gt;. It
+requires using a hash table to deduplicate the Arrow string data as it’s being
+converted to pandas. Hashing data is not free, but counterintuitively it can be
+faster, in addition to being vastly more memory efficient, in the common case in
+analytics where we have table columns with many instances of the same string
+values.&lt;/p&gt;
+
+&lt;h2 id=&quot;memory-and-performance-benchmarks&quot;&gt;Memory and Performance Benchmarks&lt;/h2&gt;
+
+&lt;p&gt;We can use the &lt;code class=&quot;highlighter-rouge&quot;&gt;memory_profiler&lt;/code&gt; Python package to easily get process memory
+usage within a running Python application.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import memory_profiler
+def mem():
+    return memory_profiler.memory_usage()[0]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;In a new application I have:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [7]: mem()
+Out[7]: 86.21875
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;I will generate approximately 1 gigabyte of string data represented as Python
+strings with length 10. The &lt;code class=&quot;highlighter-rouge&quot;&gt;pandas.util.testing&lt;/code&gt; module has a handy &lt;code class=&quot;highlighter-rouge&quot;&gt;rands&lt;/code&gt;
+function for generating random strings. Here is the data generation function:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas.util.testing&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rands&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;generate_strings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nunique&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string_length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1 [...]
+    &lt;span class=&quot;n&quot;&gt;unique_values&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rands&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt; [...]
+    &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unique_values&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;length&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;//&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nunique&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;This generates a certain number of unique strings, then duplicates them to
+yield the desired number of total strings. So I’m going to create 100 million
+strings with only 10000 unique values:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [8]: values = generate_strings(100000000, 10000)
+
+In [9]: mem()
+Out[9]: 852.140625
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;100 million &lt;code class=&quot;highlighter-rouge&quot;&gt;PyObject*&lt;/code&gt; values at 8 bytes each is about 763 MB, so this
+increase of a little over 765 MB is consistent with what we know so far. Now I’m
+going to convert this to Arrow format:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [11]: arr = pa.array(values)
+
+In [12]: mem()
+Out[12]: 2276.9609375
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Since &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow&lt;/code&gt; exactly accounts for all of its memory allocations, we can
+also check that&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [13]: pa.total_allocated_bytes()
+Out[13]: 1416777280
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Since each string takes about 14 bytes (10 bytes plus 4 bytes of overhead),
+this is what we expect.&lt;/p&gt;
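+
+&lt;p&gt;A rough breakdown of that number, ignoring any allocation padding:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;n = 100_000_000
+n * 10 + (n + 1) * 4 + n // 8  # data + int32 offsets + validity bitmap
+                               # = 1412500004, close to the reported 1416777280
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;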
+
+&lt;p&gt;Now, converting &lt;code class=&quot;highlighter-rouge&quot;&gt;arr&lt;/code&gt; back to pandas is where things get tricky. The &lt;em&gt;minimum&lt;/em&gt;
+amount of memory that pandas can use is a little under 800 MB, as above, since we
+need 100 million &lt;code class=&quot;highlighter-rouge&quot;&gt;PyObject*&lt;/code&gt; values, which are 8 bytes each.&lt;/p&gt;
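+
+&lt;p&gt;That floor is just the pointer array itself:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;100_000_000 * 8 / 2**20  # about 763 MiB of PyObject* pointers alone
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;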
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [14]: arr_as_pandas = arr.to_pandas()
+
+In [15]: mem()
+Out[15]: 3041.78125
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Doing the math, we used 765 MB, which seems right. We can disable the string
+deduplication logic by passing &lt;code class=&quot;highlighter-rouge&quot;&gt;deduplicate_objects=False&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;to_pandas&lt;/code&gt;:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [16]: arr_as_pandas_no_dedup = arr.to_pandas(deduplicate_objects=False)
+
+In [17]: mem()
+Out[17]: 10006.95703125
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Without object deduplication, we use 6965 megabytes, or an average of 73 bytes
+per value. This is a little bit higher than the theoretical size of 67 bytes
+computed above.&lt;/p&gt;
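+
+&lt;p&gt;The 73-byte figure comes straight from the profiler readings above:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mib_used = 10006.95703125 - 3041.78125  # memory_profiler readings, in MiB
+mib_used * 2**20 / 100_000_000          # about 73 bytes per string
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;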
+
+&lt;p&gt;One of the more surprising results is that the new behavior is about twice as fast:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;In [18]: %time arr_as_pandas_time = arr.to_pandas()
+CPU times: user 2.94 s, sys: 213 ms, total: 3.15 s
+Wall time: 3.14 s
+
+In [19]: %time arr_as_pandas_no_dedup_time = arr.to_pandas(deduplicate_objects=False)
+CPU times: user 4.19 s, sys: 2.04 s, total: 6.23 s
+Wall time: 6.21 s
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;The reason for this is that creating so many Python objects is more expensive
+than hashing the 10-byte values and looking them up in a hash table.&lt;/p&gt;
+
+&lt;p&gt;Note that when you convert Arrow data with mostly unique values back to pandas,
+the memory use benefits here won’t have as much of an impact.&lt;/p&gt;
+
+&lt;h2 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h2&gt;
 
-&lt;p&gt;If you are excited about being able to use Rust for data science and would like
-to contribute to this work then there are many ways to get involved. The
-simplest way to get started is to try out DataFusion against your own data
-sources and file bug reports for any issues that you find. You could also check
-out the current &lt;a href=&quot;https://cwiki.apache.org/confluence/display/ARROW/Rust+JIRA+Dashboard&quot;&gt;list of
-issues&lt;/a&gt;
-and have a go at fixing one. You can also join the &lt;a href=&quot;http://mail-archives.apache.org/mod_mbox/arrow-user/&quot;&gt;user mailing
-list&lt;/a&gt; to ask questions.&lt;/p&gt;</content><author><name>agrove</name></author></entry><entry><title type="html">Speeding up R and Apache Spark using Apache Arrow</title><link href="/blog/2019/01/25/r-spark-improvements/" rel="alternate" type="text/html" title="Speeding up R and Apache Spark using Apache Arrow" /><published>2019-01-25T01:00:00-05:00</published><updated>2019-01-25T01:00:00-05:00</updated><id>/blog/2019/01/25/r-spark-improvements</id><content type="html" xml:base= [...]
+&lt;p&gt;In Apache Arrow, our goal is to develop computational tools that operate natively
+on the efficient, cache- and SIMD-friendly Arrow columnar format. In the
+meantime, though, we recognize that users have legacy applications using the
+native memory layout of pandas or other analytics tools. We will do our best to
+provide fast and memory-efficient interoperability with pandas and other
+popular libraries.&lt;/p&gt;</content><author><name>wesm</name></author></entry><entry><title type="html">Speeding up R and Apache Spark using Apache Arrow</title><link href="/blog/2019/01/25/r-spark-improvements/" rel="alternate" type="text/html" title="Speeding up R and Apache Spark using Apache Arrow" /><published>2019-01-25T01:00:00-05:00</published><updated>2019-01-25T01:00:00-05:00</updated><id>/blog/2019/01/25/r-spark-improvements</id><content type="html" xml:base="/blog/2019/01/2 [...]
 
 --&gt;
 
diff --git a/img/20190205-arrow-string.png b/img/20190205-arrow-string.png
new file mode 100644
index 0000000..066386b
Binary files /dev/null and b/img/20190205-arrow-string.png differ
diff --git a/img/20190205-numpy-string.png b/img/20190205-numpy-string.png
new file mode 100644
index 0000000..ed048b0
Binary files /dev/null and b/img/20190205-numpy-string.png differ