Posted to user@drill.apache.org by Kumiko Yada <Ku...@ds-iq.com> on 2015/12/15 00:20:15 UTC

Drill Azure Blob Storage Plugin

Is there an Azure Blob Storage Plugin for Apache Drill?  I'm looking for a solution that can be done without configuring Hadoop to access Azure blob (http://hadoop.apache.org/docs/r2.7.0/hadoop-azure/index.html).

Thanks
Kumiko

Re: Drill Azure Blob Storage Plugin

Posted by Tomer Shiran <ts...@dremio.com>.
Follow these steps:

   - Add hadoop-azure-2.7.1.jar and azure-storage-2.0.0.jar to the
   classpath. For example, copy these JARs into <drill>/jars/3rdparty. Note
   that these JARs are available in a Hadoop tarball (you don't actually need
   Hadoop) or directly online:

      1. http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.1/hadoop-azure-2.7.1.jar
      2. http://central.maven.org/maven2/com/microsoft/azure/azure-storage/2.0.0/azure-storage-2.0.0.jar


   - Add the following XML snippet to <drill>/conf/core-site.xml:

<property>
<name>fs.azure.account.key.YOUR_ACCOUNT.blob.core.windows.net</name>
<value>YOUR AZURE ACCESS KEY</value>
</property>
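
For anyone starting from an empty conf directory: Hadoop-style configuration files wrap their properties in a <configuration> root element, so the property above has to sit inside one. Below is a minimal Python sketch that renders such a file. The function name render_core_site is made up for illustration and is not part of Drill or Hadoop, and the account and key values are placeholders you would replace with your own.

```python
import xml.etree.ElementTree as ET


def render_core_site(account: str, access_key: str) -> str:
    """Return core-site.xml text holding one Azure storage account key.

    Illustrative helper only (not part of Drill or Hadoop); `account` and
    `access_key` are placeholders supplied by the caller.
    """
    # Hadoop-style config files use <configuration> as the root element.
    root = ET.Element("configuration")
    prop = ET.SubElement(root, "property")

    name = ET.SubElement(prop, "name")
    name.text = f"fs.azure.account.key.{account}.blob.core.windows.net"

    value = ET.SubElement(prop, "value")
    value.text = access_key

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Write the result to <drill>/conf/core-site.xml by hand or via redirection.
    print(render_core_site("YOUR_ACCOUNT", "YOUR AZURE ACCESS KEY"))
```

The output is a single unindented line, which is still valid XML. If the file already exists, it is probably safer to paste the property in by hand than to overwrite the file.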

   - Create a new datastore (i.e., storage plugin) in Drill called "azure"
   (or whatever you choose). For the configuration, copy the JSON
   configuration from the default "dfs" plugin, but replace the connection
   string file:/// with
   wasb://YOUR_CONTAINER@YOUR_ACCOUNT.blob.core.windows.net/
   The configuration would look something like this:

{
  "type": "file",
  "enabled": true,
  "connection": "wasb://drill@tshiran.blob.core.windows.net/",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}
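
If you would rather derive this configuration than edit it by hand, the "copy dfs, swap the connection string" step can be sketched in a few lines of Python. The helper names below are hypothetical, and the dfs dictionary in the example is a trimmed stand-in for whatever your own "dfs" plugin contains.

```python
import copy
import json


def wasb_connection(container: str, account: str) -> str:
    """Build the wasb:// connection string for a container and account."""
    return f"wasb://{container}@{account}.blob.core.windows.net/"


def azure_plugin_from_dfs(dfs_config: dict, container: str, account: str) -> dict:
    """Return a copy of the dfs plugin config pointing at Azure Blob Storage.

    Only the "connection" value changes; workspaces and formats are kept
    exactly as they were in the dfs configuration.
    """
    config = copy.deepcopy(dfs_config)  # leave the caller's dict untouched
    config["connection"] = wasb_connection(container, account)
    return config


if __name__ == "__main__":
    # Trimmed stand-in for the default "dfs" plugin configuration.
    dfs = {
        "type": "file",
        "enabled": True,
        "connection": "file:///",
        "workspaces": {
            "root": {"location": "/", "writable": False, "defaultInputFormat": None}
        },
        "formats": {"csv": {"type": "text", "extensions": ["csv"], "delimiter": ","}},
    }
    azure = azure_plugin_from_dfs(dfs, "YOUR_CONTAINER", "YOUR_ACCOUNT")
    print(json.dumps(azure, indent=2))
```

The printed JSON can then be pasted into the configuration box for the new storage plugin.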

   - From the Drill CLI, you could run a query like this:

SELECT COUNT(*) FROM azure.root.`sfpd2014.csv`;

   - Or:

USE azure;
SELECT COUNT(*) FROM `sfpd2014.csv`;
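
The same query can also be submitted over Drill's REST interface instead of the CLI. The sketch below only builds the HTTP request and does not send it. It assumes a Drillbit whose web server listens on localhost at the default port 8047 and a storage plugin named "azure" as above; the function name build_query_request is made up for this example.

```python
import json
import urllib.request


def build_query_request(
    sql: str, base_url: str = "http://localhost:8047"
) -> urllib.request.Request:
    """Build (but do not send) a POST to Drill's /query.json endpoint.

    `base_url` is an assumption: adjust the host and port to wherever
    your Drillbit's web server is actually listening.
    """
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_query_request("SELECT COUNT(*) FROM azure.root.`sfpd2014.csv`")
    print(req.get_method(), req.full_url)
    # To actually run the query (requires a running Drillbit):
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```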


On Mon, Dec 14, 2015 at 3:20 PM, Kumiko Yada <Ku...@ds-iq.com> wrote:

> Is there an Azure Blob Storage Plugin for Apache Drill?  I'm looking for a
> solution that can be done without configuring Hadoop to access Azure blob (
> http://hadoop.apache.org/docs/r2.7.0/hadoop-azure/index.html).
>
> Thanks
> Kumiko
>