Posted to user@drill.apache.org by Stefán Baxter <st...@activitystream.com> on 2015/07/29 17:49:25 UTC

The incomplete saga of Drill, Tachyon and S3 (Three Amigos, - the analytics edition)

Hi,

I have been trying to get Drill to work with Tachyon (
http://tachyon-project.org/index.html) using S3 as deep storage (what
Tachyon calls the Under File System).

The whole idea is that each Drillbit (node) has its own multi-tiered local
storage (MEM, SSD + HDD) and uses that to cache Parquet files which are
stored in S3.
This should minimize S3 traffic and latency and maximize performance, as
Tachyon handles eviction of unused files and moves hot files between
tiers.

In theory this sounds good (to me at least) and in practice it is almost
working.

I would like to share the steps we have taken to get this running so that
others can follow them, and I hope someone here can assist us with what we
hope is the last leg.

*Steps taken:*

   1. *Have Drill 1.1.x running*
      - according to their simple guide (
      https://drill.apache.org/docs/starting-drill-on-linux-and-mac-os-x/)

   2. *Have Tachyon running*
      1. Latest release:
      https://github.com/amplab/tachyon/releases/download/v0.7.0/tachyon-0.7.0-bin.tar.gz
      2. Configure and run a local instance according to their simple guide (
      http://tachyon-project.org/Running-Tachyon-Locally.html)
      - requires Java 7 to run (JSP pages will not render correctly using
      Java 8)
      3. Make sure to run the tests (they should leave some test files in
      your Tachyon file system)
      4. Have it running on localhost (bin/tachyon-start.sh local, for
      this example)

   3. *Configure S3 Underlying FS for Tachyon*
      1. Configure S3 according to this guide (
      http://tachyon-project.org/Setup-UFS.html)
      2. Add "export TACHYON_UNDERFS_ADDRESS=s3n://<bucket-name>"
      to conf/tachyon-env.sh
      3. Add "-Dfs.s3n.awsAccessKeyId=<your-key>" to the export
      TACHYON_JAVA_OPTS section of the same file: conf/tachyon-env.sh
      4. Add "-Dfs.s3n.awsSecretAccessKey=<your-secret>" to the export
      TACHYON_JAVA_OPTS section of the same file: conf/tachyon-env.sh

   4. *Add Tachyon client and jets3t client (jars) to Drill*
      1. cp <tachyon-root>/clients/client/target/tachyon-client-0.7.0.jar
      <drill-root>/jars/3rdparty/
      2. get the jets3t download (
      http://bitbucket.org/jmurty/jets3t/downloads/jets3t-0.9.3.zip)
      3. unzip it and cp jets3t-0.9.3/jars/jets3t-0.9.3.jar
      to <drill-root>/jars/3rdparty/

   5. *Allow Drill to load the jets3t jar*
      1. Edit <drill-root>/bin/hadoop-excludes.txt
      2. Remove the jets3t line from the file

   6. *Configure S3 access for jets3t in Drill (used by the Tachyon
   driver)*
      1. Edit <drill-root>/conf/drill-env.sh
      2. Add -Dfs.s3n.awsAccessKeyId=<your-key> to the "export
      DRILL_JAVA_OPTS=" line
      3. Add -Dfs.s3n.awsSecretAccessKey=<your-secret> to the "export
      DRILL_JAVA_OPTS=" line
      - I have no idea why the Tachyon client needs both a native Tachyon
      client-master/worker connection and an S3 connection

   7. *Configure a new storage for Drill using the Drill admin
   (localhost:8047)*
      1. Create a new storage named "ts3" (for example)
      2. Use the following config for it:
      {
        "type": "file",
        "enabled": true,
        "connection": "tachyon://127.0.0.1:19998/",
        "workspaces": {
          "root": { "location": "/", "writable": true, "defaultInputFormat": null }
        },
        "formats": {
          "psv": { "type": "text", "extensions": ["tbl"], "delimiter": "|" },
          "csv": { "type": "text", "extensions": ["csv"], "delimiter": "," },
          "tsv": { "type": "text", "extensions": ["tsv"], "delimiter": "\t" },
          "parquet": { "type": "parquet" },
          "json": { "type": "json" },
          "avro": { "type": "avro" }
        }
      }
      3. Notice the "tachyon://127.0.0.1:19998/" connection string in the
      config.
      - It's the glue between Drill and Tachyon
      4. Run Drillbit + local client/sqlline (see drill documentation)

   8. *Make sure Drill is communicating with Tachyon*
      1. Type "use ts3.root;" in the Drill sqlline/client
      2. Type "show files;" in the Drill sqlline/client
      3. Should show the test files directory generated earlier:

      +----------------------+--------------+---------+---------+--------+--------+--------------+--------------------------+--------------------------+
      |         name         | isDirectory  | isFile  | length  | owner  | group  | permissions  |        accessTime        |     modificationTime     |
      +----------------------+--------------+---------+---------+--------+--------+--------------+--------------------------+--------------------------+
      | default_tests_files  | true         | false   | 0       |        |        | rwxrwxrwx    | 2015-07-29 15:08:13.782  | 2015-07-29 15:08:13.782  |
      +----------------------+--------------+---------+---------+--------+--------+--------------+--------------------------+--------------------------+
      4. Do the partial-success dance!
      - Drill is now talking to the local Tachyon file system

   9. *Create a table on the Tachyon file system*
      1. run: "CREATE TABLE ts3.root.`/test` AS SELECT * FROM
      dfs.tmp.`/some-file.json`;"
      2. have it not work:
      Error: SYSTEM ERROR: IllegalArgumentException: No Under File System
      Factory found for:
      s3n://streamanalytics/tmp/tachyon/workers/1438179000001/48
      Fragment 0:0
      [Error Id: e4201119-1805-44b7-8088-3fc1c898f388 on localhost:31010]
      (state=,code=0)
      3. Do the goddammit-frustration dance and then help me solve this one!
      - the empty parquet file is created in Tachyon and can be listed with
      "show files"
      - nothing is created in S3 (other than the tmp files created by
      Tachyon when formatting/setting up)

   10. *Verify that everything is saved to S3*
      - pending

   11. *Verify that Drillbits see material from every Tachyon node*
      - pending

   12. *Configure Tachyon to be multi-tiered*
      - pending
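
For reference, the environment settings from steps 3 and 6 can be collected
in one place. This is only a sketch: the placeholders are the ones used in
the steps, the helper variables BUCKET, AWS_KEY, AWS_SECRET and S3N_OPTS are
made up for illustration, and appending to the existing *_JAVA_OPTS values
(rather than editing the stock definitions in place) is an assumption.

```shell
# Sketch of the settings from steps 3 and 6.
# Placeholders are kept as-is; replace them with real values before use.
BUCKET="<bucket-name>"
AWS_KEY="<your-key>"
AWS_SECRET="<your-secret>"

# The same two s3n flags are needed on the Tachyon side and on the Drill side.
# S3N_OPTS is a hypothetical helper variable, not something either tool reads.
S3N_OPTS="-Dfs.s3n.awsAccessKeyId=${AWS_KEY} -Dfs.s3n.awsSecretAccessKey=${AWS_SECRET}"

# --- conf/tachyon-env.sh (step 3) ---
export TACHYON_UNDERFS_ADDRESS="s3n://${BUCKET}"
# Assumption: this line comes after the stock TACHYON_JAVA_OPTS definition.
export TACHYON_JAVA_OPTS="${TACHYON_JAVA_OPTS:-} ${S3N_OPTS}"

# --- conf/drill-env.sh (step 6) ---
# Assumption: this line comes after the stock DRILL_JAVA_OPTS definition.
export DRILL_JAVA_OPTS="${DRILL_JAVA_OPTS:-} ${S3N_OPTS}"
```

Keeping the flags in a single variable is just a way to avoid the two files
drifting apart; the two halves still have to live in their respective files.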

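The jar handling in steps 4 and 5 could be scripted roughly as follows. The
function name install_tachyon_client and its arguments are hypothetical, and
the sketch assumes that removing every line mentioning jets3t from
bin/hadoop-excludes.txt is equivalent to the manual edit in step 5.

```shell
# install_tachyon_client DRILL_ROOT TACHYON_CLIENT_JAR JETS3T_JAR
# Copies the two jars into Drill's 3rdparty directory (step 4) and drops the
# jets3t entry from bin/hadoop-excludes.txt (step 5).
install_tachyon_client() {
  drill_root=$1
  client_jar=$2
  jets3t_jar=$3
  excludes="$drill_root/bin/hadoop-excludes.txt"

  # Step 4: put both jars on Drill's classpath.
  cp "$client_jar" "$jets3t_jar" "$drill_root/jars/3rdparty/" || return 1

  # Step 5: keep a backup, then rewrite the file without the jets3t line.
  cp "$excludes" "$excludes.bak" || return 1
  # grep exits with 1 when it prints no lines at all; only 2 is a real error.
  grep -v 'jets3t' "$excludes.bak" > "$excludes" || [ $? -eq 1 ]
}

# Example call, using the placeholders from the steps above:
#   install_tachyon_client "<drill-root>" \
#     "<tachyon-root>/clients/client/target/tachyon-client-0.7.0.jar" \
#     "jets3t-0.9.3/jars/jets3t-0.9.3.jar"
```

The backup copy (hadoop-excludes.txt.bak) makes it easy to undo step 5 if
Drill misbehaves after the change.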

So, there we almost have it! :)

All input and ideas are welcome! (If someone is doing this already, then
please step forward and share.)

Regards,
 -Stefan

Re: The incomplete saga of Drill, Tachyon and S3 (Three Amigos, - the analytics edition)

Posted by Stefán Baxter <st...@activitystream.com>.
Hi Calvin,

This actually did the trick and we have it up and running now :).

One thing took me by complete surprise: the directory structure is not
reflected in S3. Instead, Tachyon puts everything into one folder, in
numbered files.

That makes me question two design decisions for Tachyon:

   - Make the client do worker/server stuff (connect directly to S3 in this
   case)
      - makes deployment harder and makes separation of responsibilities
      unclear (to say the least)

   - Store files in a proprietary Tachyon structure in the underlying
   filesystem (S3)
      - "/analytics/processed/tripcreator/events2/0_0_0.parquet" is stored
      like "/<bucket>/tmp/tachyon/data/115" (hope this is not permanent and
      that the file is moved)
      - If permanent, it hinders use by any clients other than Tachyon
Please comment on the second point here and thank you for addressing the
first point in your email.

Regards,
 -Stefan



On Wed, Jul 29, 2015 at 5:50 PM, Calvin Jia <ji...@gmail.com> wrote:

> Hi,
>
> I think the issue is in step 4, could you try adding
> the tachyon-underfs-s3 (0.7.0) jar as well as changing the jets3t version
> to 0.8.1 (this is the version Tachyon uses, does Drill require 0.9.3?).
>
> However, I think there may be other issues with that since Tachyon client
> may rely on other jars that are not available. One way around this is to
> compile Tachyon and use the tachyon-client-0.7.0-jar-with-dependencies
> (generated in tachyon/clients/client/target). But the first fix is probably
> worth a try since it shouldn't take much time.
>
> I think you hit on a very good point when you ask why does the Tachyon
> client require a connection to S3 and not just Tachyon. The current design
> for the client has under file system data operations (like writing
> s3n://streamanalytics/tmp/tachyon/workers/1438179000001/48) handled by
> the client to prevent a bottleneck at the worker. It's arguable that the
> Tachyon client should just delegate the work to the server so we can avoid
> having issues like this, but that will require some redesigning of the
> client.
>
> Hope this helps,
> Calvin
>
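
Calvin's first suggestion (add the tachyon-underfs-s3 jar and switch jets3t
from 0.9.3 to 0.8.1) might look like this as a script. The function name
swap_tachyon_jars is hypothetical, and the jar file names in the example
call are assumed from the versions mentioned in the thread; where the
underfs jar ends up in a Tachyon build is not stated here, so its path is
left as a parameter.

```shell
# swap_tachyon_jars JAR_DIR UNDERFS_JAR JETS3T_JAR
# Removes the jets3t 0.9.3 jar copied in step 4 and installs the
# tachyon-underfs-s3 jar together with the jets3t version Tachyon uses.
swap_tachyon_jars() {
  jar_dir=$1
  underfs_jar=$2
  jets3t_jar=$3

  # Refuse to touch anything unless both replacement jars are present.
  for f in "$underfs_jar" "$jets3t_jar"; do
    [ -f "$f" ] || { echo "missing jar: $f" >&2; return 1; }
  done

  # Keep a single jets3t version on the classpath.
  rm -f "$jar_dir/jets3t-0.9.3.jar"
  cp "$underfs_jar" "$jets3t_jar" "$jar_dir/"
}

# Example call (file names are assumptions):
#   swap_tachyon_jars "<drill-root>/jars/3rdparty" \
#     "/path/to/tachyon-underfs-s3-0.7.0.jar" "/path/to/jets3t-0.8.1.jar"
```

Restart the Drillbit afterwards so the new jars are picked up.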