Posted to issues@arrow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/11/01 02:55:02 UTC

[jira] [Commented] (ARROW-1455) [Python] Add Dockerfile for validating Dask integration outside of usual CI

    [ https://issues.apache.org/jira/browse/ARROW-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16233596#comment-16233596 ] 

ASF GitHub Bot commented on ARROW-1455:
---------------------------------------

wesm closed pull request #1249: ARROW-1455 [Python] Add Dockerfile for validating Dask integration
URL: https://github.com/apache/arrow/pull/1249
 
 
   

This PR was merged from a forked repository. Because GitHub hides the
original diff after a merge, it is reproduced below for the sake of
provenance:

diff --git a/dev/dask_integration.sh b/dev/dask_integration.sh
new file mode 100755
index 000000000..d344328b6
--- /dev/null
+++ b/dev/dask_integration.sh
@@ -0,0 +1,21 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Pass the service name to run_docker_compose.sh, which validates the
+# environment and runs the service.
+exec "$(dirname ${BASH_SOURCE})"/run_docker_compose.sh dask_integration
diff --git a/dev/dask_integration/Dockerfile b/dev/dask_integration/Dockerfile
new file mode 100644
index 000000000..f72ef8ca0
--- /dev/null
+++ b/dev/dask_integration/Dockerfile
@@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM ubuntu:14.04
+ADD . /apache-arrow
+WORKDIR /apache-arrow
+# Basic OS utilities
+RUN apt-get update && apt-get install -y \
+        wget \
+        git \
+        gcc \
+        g++
+# This will install conda in /home/ubuntu/miniconda
+RUN wget -O /tmp/miniconda.sh \
+    https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
+    bash /tmp/miniconda.sh -b -p /home/ubuntu/miniconda && \
+    rm /tmp/miniconda.sh
+# Create Conda environment
+ENV PATH="/home/ubuntu/miniconda/bin:${PATH}"
+RUN conda create -y -q -n test-environment \
+    python=3.6
+# Install dependencies
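+# Note: without -n test-environment, these install into the base environment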
+RUN conda install -c conda-forge \
+    numpy \
+    pandas \
+    bcolz \
+    blosc \
+    bokeh \
+    boto3 \
+    chest \
+    cloudpickle \
+    coverage \
+    cytoolz \
+    distributed \
+    graphviz \
+    h5py \
+    ipython \
+    partd \
+    psutil \
+    "pytest<=3.1.1" \
+    scikit-image \
+    scikit-learn \
+    scipy \
+    sqlalchemy \
+    toolz
+# Install PyTables from the defaults channel for now
+RUN conda install pytables
+
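+# Install development versions from Git so that integration breakage
+# in the Dask ecosystem surfaces before the next releases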
+RUN pip install -q git+https://github.com/dask/partd --upgrade --no-deps
+RUN pip install -q git+https://github.com/dask/zict --upgrade --no-deps
+RUN pip install -q git+https://github.com/dask/distributed --upgrade --no-deps
+RUN pip install -q git+https://github.com/mrocklin/sparse --upgrade --no-deps
+RUN pip install -q git+https://github.com/dask/s3fs --upgrade --no-deps
+
+RUN conda install -q -c conda-forge numba cython
+RUN pip install -q git+https://github.com/dask/fastparquet
+
+RUN pip install -q \
+    cachey \
+    graphviz \
+    moto \
+    pyarrow \
+    --upgrade --no-deps
+
+RUN pip install -q \
+    cityhash \
+    flake8 \
+    mmh3 \
+    pandas_datareader \
+    pytest-xdist \
+    xxhash \
+    pycodestyle
+
+CMD arrow/dev/dask_integration/dask_integration.sh
+
diff --git a/dev/dask_integration/dask_integration.sh b/dev/dask_integration/dask_integration.sh
new file mode 100755
index 000000000..f5a24e462
--- /dev/null
+++ b/dev/dask_integration/dask_integration.sh
@@ -0,0 +1,49 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Set up environment and working directory
+cd /apache-arrow
+
+export ARROW_BUILD_TYPE=release
+export ARROW_HOME=$(pwd)/dist
+export PARQUET_HOME=$(pwd)/dist
+CONDA_BASE=/home/ubuntu/miniconda
+export LD_LIBRARY_PATH=$(pwd)/dist/lib:${CONDA_BASE}/lib:${LD_LIBRARY_PATH}
+
+# Allow for --user Python installation inside Docker
+export HOME=$(pwd)
+
+# Clean up and clone the Dask master branch from GitHub
+rm -rf dask .local
+export GIT_COMMITTER_NAME="Nobody"
+export GIT_COMMITTER_EMAIL="nobody@nowhere.com"
+git clone https://github.com/dask/dask.git
+pushd dask
+pip install --user -e .[complete]
+# Verify the integrity of the installed Dask dataframe code
+py.test dask/dataframe/tests/test_dataframe.py
+popd
+
+# Run the integration test
+pushd arrow/python/testing
+py.test dask_tests
+popd
+
+pushd dask/dask/dataframe/io
+py.test tests/test_parquet.py
+popd
diff --git a/dev/docker-compose.yml b/dev/docker-compose.yml
index 7bd2cd441..4b9014894 100644
--- a/dev/docker-compose.yml
+++ b/dev/docker-compose.yml
@@ -28,3 +28,8 @@ services:
     - "4000:4000"
     volumes:
      - ../..:/apache-arrow
+  dask_integration:
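+    # Built from dev/dask_integration/Dockerfile; invoke via dev/dask_integration.sh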
+    build:
+      context: dask_integration
+    volumes:
+     - ../..:/apache-arrow
diff --git a/dev/run_docker_compose.sh b/dev/run_docker_compose.sh
index f46879ed1..681a3a75f 100755
--- a/dev/run_docker_compose.sh
+++ b/dev/run_docker_compose.sh
@@ -37,4 +37,4 @@ fi
 
 GID=$(id -g ${USERNAME})
 docker-compose -f arrow/dev/docker-compose.yml run \
-               -u "${UID}:${GID}" "${1}"
+               --rm -u "${UID}:${GID}" "${1}"
diff --git a/python/testing/README.md b/python/testing/README.md
index 07970a231..0ebeec4a1 100644
--- a/python/testing/README.md
+++ b/python/testing/README.md
@@ -23,4 +23,26 @@
 
 ```shell
 ./test_hdfs.sh
-```
\ No newline at end of file
+```
+
+## Testing Dask integration
+
+Initial integration testing with Dask has been Dockerized. To invoke
+the test, run the following command from the `arrow` root directory:
+
+```shell
+bash dev/dask_integration.sh
+```
+
+This script creates a `dask` directory at the same level as
+`arrow`, clones the Dask project from GitHub into it, and performs a
+Python `--user` install. The Docker container uses the parent
+directory of `arrow` as `$HOME`, so Python installs `dask` into a
+`.local` directory there.
+
+The output of the Docker session contains the results of the Dask
+dataframe tests, followed by the single integration test that
+currently exists for Arrow. That test creates a set of CSV files and
+then reads them in parallel into a Dask dataframe; the code for it
+resides in the `dask_tests` directory, and its core is sketched below.
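+
+Condensed from the test itself, the round trip looks roughly like
+this (the file glob is illustrative):
+
+```python
+import dask.dataframe as dd
+import pyarrow as pa
+
+# Read a glob of CSV files in parallel into one Dask dataframe
+df = dd.read_csv('data-*.csv')
+# Materialize the dataframe and convert it to an Arrow table
+table = pa.Table.from_pandas(df.compute())
+```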
diff --git a/python/testing/dask_tests/test_dask_integration.py b/python/testing/dask_tests/test_dask_integration.py
new file mode 100644
index 000000000..e67834878
--- /dev/null
+++ b/python/testing/dask_tests/test_dask_integration.py
@@ -0,0 +1,51 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
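+"""Integration smoke test: read a set of CSV files in parallel with
+Dask and compare a column mean after converting the same data to an
+Arrow table."""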
+from datetime import date, timedelta
+import csv
+from random import randint
+import dask.dataframe as dd
+import pyarrow as pa
+
+def make_datafiles(tmpdir, prefix='data', num_files=20):
+    rowcount = 5000
+    fieldnames = ['date', 'temperature', 'dewpoint']
+    start_date = date(1900, 1, 1)
+    for i in range(num_files):
+        filename = '{0}/{1}-{2}.csv'.format(tmpdir, prefix, i)
+        with open(filename, 'w') as outcsv:
+            writer = csv.DictWriter(outcsv, fieldnames)
+            writer.writeheader()
+            the_date = start_date
+            for _ in range(rowcount):
+                temperature = randint(-10, 35)
+                dewpoint = temperature - randint(0, 10)
+                writer.writerow({'date': the_date, 'temperature': temperature,
+                                 'dewpoint': dewpoint})
+                the_date += timedelta(days=1)
+
+def test_dask_file_read(tmpdir):
+    prefix = 'data'
+    make_datafiles(tmpdir, prefix)
+    # Read all datafiles in parallel
+    datafiles = '{0}/{1}-*.csv'.format(tmpdir, prefix)
+    dask_df = dd.read_csv(datafiles)
+    # Convert Dask dataframe to Arrow table
+    table = pa.Table.from_pandas(dask_df.compute())
+    # Second column (1) is temperature
+    dask_temp = int(1000 * dask_df['temperature'].mean().compute())
+    arrow_temp = int(1000 * table[1].to_pandas().mean())
+    assert dask_temp == arrow_temp


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> [Python] Add Dockerfile for validating Dask integration outside of usual CI
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-1455
>                 URL: https://issues.apache.org/jira/browse/ARROW-1455
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Heimir Thor Sverrisson
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> Introducing the Dask stack into Arrow's CI might be a bit heavyweight at the moment, but we can add a testing setup in https://github.com/apache/arrow/tree/master/python/testing so that this can be validated on an ad hoc basis in a reproducible way.
> see also ARROW-1417



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)