Posted to commits@arrow.apache.org by we...@apache.org on 2020/04/07 18:14:03 UTC

[arrow-site] branch master updated (92577ce -> 277255a)

This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git.


    omit 92577ce  Merge pull request #49 from pitrou/ARROW-7847-ipc-fuzz-post
    omit f8e50b7  Set filename to exact date
    omit eaf58e4  Address review comment.
    omit 34db0b9  ARROW-7847: [Website] Add blog post about fuzzing the IPC layer
     new 277255a  ARROW-7847: [Website] Add blog post about fuzzing the IPC layer

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (92577ce)
            \
             N -- N -- N   refs/heads/master (277255a)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:


[arrow-site] 01/01: ARROW-7847: [Website] Add blog post about fuzzing the IPC layer

Posted by we...@apache.org.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git

commit 277255ada3d80ccb23053bee80229d7cfabf555a
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Tue Apr 7 13:12:37 2020 -0500

    ARROW-7847: [Website] Add blog post about fuzzing the IPC layer
---
 _data/contributors.yml                 |  3 ++
 _posts/2020-04-01-fuzzing-arrow-ipc.md | 89 ++++++++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+)

diff --git a/_data/contributors.yml b/_data/contributors.yml
index e70d9af..dcddb10 100644
--- a/_data/contributors.yml
+++ b/_data/contributors.yml
@@ -49,4 +49,7 @@
 - name: Neal Richardson
   apacheId: npr # Not a real apacheId
   githubId: nealrichardson
+- name: Antoine Pitrou
+  apacheId: apitrou
+  githubId: pitrou
 # End contributors.yml
diff --git a/_posts/2020-04-01-fuzzing-arrow-ipc.md b/_posts/2020-04-01-fuzzing-arrow-ipc.md
new file mode 100644
index 0000000..b094e1a
--- /dev/null
+++ b/_posts/2020-04-01-fuzzing-arrow-ipc.md
@@ -0,0 +1,89 @@
+---
+layout: post
+title: "Fuzzing the Arrow C++ IPC implementation"
+description: "We have set up continuous fuzzing for the Arrow C++ IPC reader.
+This helped us find and correct several issues where missing input validation
+would lead to crashes or undefined behaviour."
+date: "2020-04-01 00:00:00 +0100"
+author: apitrou
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+Apache Arrow aims to allow fast and seamless data interchange between
+heterogeneous runtimes and environments.  Whether using the columnar
+[IPC stream protocol](https://arrow.apache.org/docs/format/Columnar.html),
+the [Flight](https://arrow.apache.org/docs/format/Flight.html) RPC layer,
+the Feather file format, the
+[Plasma](https://arrow.apache.org/docs/python/plasma.html) shared object
+store, or any application-specific data distribution mechanism, Arrow IPC
+implementations may try to decode data from untrusted input.  While it is
+acceptable to report an error in that case, Arrow shouldn't crash or engage
+in risky behaviour while reading such data.
+
+To validate the robustness of the Arrow C++ IPC reader (which also underlies
+the Python, C/GLib, R and Ruby bindings), we
+[successfully submitted](https://github.com/google/oss-fuzz/pull/3233)
+the Arrow project to OSS-Fuzz, a continuous fuzzing initiative for critical
+open source projects, provided by Google.
+
+## What is being fuzzed
+
+As of this writing, the `RecordBatchStreamReader` and `RecordBatchFileReader`
+C++ classes are being fuzzed by feeding them data generated by the fuzzer.
+
+When a record batch is successfully read by one of those classes, the
+fuzzing setup then validates it using `RecordBatch::ValidateFull`.  This
+method can either succeed or fail, but it shouldn't crash.
+
+By ensuring that reading a record batch from IPC, then validating it, always
+shows deterministic behaviour, we hope to make it relatively safe to ingest
+Arrow IPC data coming from untrusted sources.
+
+(Of course, it is still recommended for security-critical applications
+ to use cryptographic means of authentication and integrity control, for
+ example by enabling TLS with the Flight RPC protocol.)
+
+## How we help the fuzzer find problems
+
+Fuzzing is a brute-force process that tries to devise invalid data to
+exercise an implementation's response.  By default, the fuzzer does not know
+anything about the data representation expected by the program under test.
+Fuzzing can therefore be extremely inefficient, testing tons of uninteresting
+variations while missing critical ones.
+
+To help guide the fuzzing process, we added a seed corpus of valid Arrow IPC
+files with various data types.  By starting from this data and mutating it to
+find invalid variations, OSS-Fuzz was able to find dozens of issues with data
+validation.  All of them have been fixed.  As of this writing, no new issue
+has been found in the IPC layer since March 4th, 2020.
+
+## What comes next
+
+Of course, we still monitor OSS-Fuzz for any new problem that could be found
+in the C++ IPC implementation.  Such problems might, for example, appear when adding
+features to the Arrow [IPC format](https://arrow.apache.org/docs/format/Columnar.html).
+
+We have started fuzzing the Parquet C++ implementation.  Several issues have
+been found and fixed, but more are still coming.  We hope to stabilize the
+situation in the next month or two.
+
+The tensor and sparse tensor IPC read paths are not being exercised yet.
+They will be once a motivated core developer wants to own the topic.
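
The read-then-validate harness and the seed-corpus mutation described in the post above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the actual Arrow fuzz target: the toy format, `ToyBatch`, `ReadToyBatch`, `ValidateToyBatch`, `MutateSeed` and `FuzzFromSeed` are all hypothetical names made up for this sketch. Only the `LLVMFuzzerTestOneInput` signature is the standard libFuzzer entry point. The toy reader stands in for the IPC reader, and the toy validator plays the role of the full validation step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical toy format standing in for an IPC message:
// byte 0 holds the number of values N, followed by exactly N payload bytes.
struct ToyBatch {
  std::vector<uint8_t> values;
};

// Reading must reject malformed input with an error and never read out of bounds.
bool ReadToyBatch(const uint8_t* data, size_t size, ToyBatch* out) {
  if (size < 1) return false;       // missing header byte
  const size_t n = data[0];
  if (size - 1 != n) return false;  // declared length must match the payload
  out->values.assign(data + 1, data + 1 + n);
  return true;
}

// Validation may succeed or fail, but it must not crash.
// The (made-up) invariant here: values are in non-decreasing order.
bool ValidateToyBatch(const ToyBatch& batch) {
  for (size_t i = 1; i < batch.values.size(); ++i) {
    if (batch.values[i - 1] > batch.values[i]) return false;
  }
  return true;
}

// libFuzzer-style entry point: read, then validate. Clean failures are ignored;
// only crashes or sanitizer reports would count as findings.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  ToyBatch batch;
  if (ReadToyBatch(data, size, &batch)) {
    (void)ValidateToyBatch(batch);
  }
  return 0;
}

// Simplified stand-in for the fuzzer's mutation step: copy a valid seed and
// overwrite one randomly chosen byte.
std::vector<uint8_t> MutateSeed(const std::vector<uint8_t>& seed,
                                std::mt19937& rng) {
  std::vector<uint8_t> mutated = seed;
  if (mutated.empty()) return mutated;
  std::uniform_int_distribution<size_t> pos(0, mutated.size() - 1);
  std::uniform_int_distribution<int> byte(0, 255);
  mutated[pos(rng)] = static_cast<uint8_t>(byte(rng));
  return mutated;
}

// Feeds `iterations` mutations of one seed to the harness and returns how
// many inputs were executed.
int FuzzFromSeed(const std::vector<uint8_t>& seed, int iterations,
                 unsigned rng_seed) {
  std::mt19937 rng(rng_seed);
  int executed = 0;
  for (int i = 0; i < iterations; ++i) {
    const std::vector<uint8_t> input = MutateSeed(seed, rng);
    LLVMFuzzerTestOneInput(input.data(), input.size());
    ++executed;
  }
  return executed;
}
```

Starting from a valid seed such as `{3, 1, 2, 3}` means most mutated inputs still get past the length check, so the validation code is exercised far more often than it would be with purely random bytes. A real fuzzer additionally uses coverage feedback to decide which mutations to keep, which this sketch does not attempt.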