Posted to commits@arrow.apache.org by we...@apache.org on 2020/04/07 18:14:04 UTC

[arrow-site] 01/01: ARROW-7847: [Website] Add blog post about fuzzing the IPC layer

This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git

commit 277255ada3d80ccb23053bee80229d7cfabf555a
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Tue Apr 7 13:12:37 2020 -0500

    ARROW-7847: [Website] Add blog post about fuzzing the IPC layer
---
 _data/contributors.yml                 |  3 ++
 _posts/2020-04-01-fuzzing-arrow-ipc.md | 89 ++++++++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+)

diff --git a/_data/contributors.yml b/_data/contributors.yml
index e70d9af..dcddb10 100644
--- a/_data/contributors.yml
+++ b/_data/contributors.yml
@@ -49,4 +49,7 @@
 - name: Neal Richardson
   apacheId: npr # Not a real apacheId
   githubId: nealrichardson
+- name: Antoine Pitrou
+  apacheId: apitrou
+  githubId: pitrou
 # End contributors.yml
diff --git a/_posts/2020-04-01-fuzzing-arrow-ipc.md b/_posts/2020-04-01-fuzzing-arrow-ipc.md
new file mode 100644
index 0000000..b094e1a
--- /dev/null
+++ b/_posts/2020-04-01-fuzzing-arrow-ipc.md
@@ -0,0 +1,89 @@
+---
+layout: post
+title: "Fuzzing the Arrow C++ IPC implementation"
+description: "We have set up continuous fuzzing for the Arrow C++ IPC reader.
+This helped us find and correct several issues where missing input validation
+would lead to crashes or undefined behaviour."
+date: "2020-04-01 00:00:00 +0100"
+author: apitrou
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+Apache Arrow aims to allow fast and seamless data interchange between
+heterogeneous runtimes and environments.  Whether using the columnar
+[IPC stream protocol](https://arrow.apache.org/docs/format/Columnar.html),
+the [Flight](https://arrow.apache.org/docs/format/Flight.html) RPC layer,
+the Feather file format, the
+[Plasma](https://arrow.apache.org/docs/python/plasma.html) shared object
+store, or any application-specific data distribution mechanism, Arrow IPC
+implementations may try to decode data from untrusted input.  Reporting an
+error in that case is fine, but Arrow shouldn't crash or trigger undefined
+behaviour while reading such data.
+
+To validate the robustness of the Arrow C++ IPC reader (which also underlies
+the Python, C/GLib, R and Ruby bindings), we
+[successfully submitted](https://github.com/google/oss-fuzz/pull/3233)
+the Arrow project to OSS-Fuzz, a continuous fuzzing initiative for critical
+open source projects, provided by Google.
+
+## What is being fuzzed
+
+As of this writing, the `RecordBatchStreamReader` and `RecordBatchFileReader`
+C++ classes are being fuzzed by feeding them data generated by the fuzzer.
+
+When a record batch is successfully read by one of those classes, the
+fuzzing setup then validates it using `RecordBatch::ValidateFull`.  This
+method can either succeed or fail, but it shouldn't crash.
+
+By ensuring that reading a record batch from IPC, then validating it, always
+shows deterministic behaviour, we hope to make it relatively safe to ingest
+Arrow IPC data coming from untrusted sources.
+
+(Of course, security-critical applications should still use cryptographic
+ means of authentication and integrity control -- for example, by enabling
+ TLS with the Flight RPC protocol.)
+
+## How we help the fuzzer find problems
+
+Fuzzing is a brute-force process that tries to devise invalid data to
+exercise an implementation's response.  By default, the fuzzer does not know
+anything about the data representation expected by the program under test.
+Fuzzing can therefore be extremely inefficient, testing tons of uninteresting
+variations while missing critical ones.
+
+To help guide the fuzzing process, we added a seed corpus of valid Arrow IPC
+files with various data types.  By starting from this data and mutating it to
+find invalid variations, OSS-Fuzz was able to find tens of issues with data
+validation.  All of them have been fixed.  As of this writing, no new issue
+has been found in the IPC layer since March 4th, 2020.
+
+## What comes next
+
+Of course, we still monitor OSS-Fuzz for any new problem that could be found
+in the C++ IPC implementation.  Such problems might appear, for example, when
+adding features to the Arrow [IPC format](https://arrow.apache.org/docs/format/Columnar.html).
+
+We have started fuzzing the Parquet C++ implementation.  Several issues have
+been found and fixed, but more are still coming.  We hope to stabilize the
+situation in the next month or two.
+
+The tensor and sparse tensor IPC read paths are not being exercised yet.
+They will be once a motivated core developer wants to own the topic.