Posted to dev@drill.apache.org by Derek Rabindran <dr...@gmail.com> on 2015/02/06 07:42:32 UTC

Apache Drill and S3

Hi,

My use case involves using Drill in combination with S3.  I have a few
questions:

1) Is it possible to decrypt the files before processing?  My files are
client-side encrypted.  I'm able to provide the master key; however, I'm
not sure at which level this should be configured.

2) What is Hadoop's role when using Drill with S3?  Can you outline what's
actually happening when we execute a Drill query on files residing in S3?

3) Will this work for both S3 and S3n?

Thanks

Re: Apache Drill and S3

Posted by David Tucker <dt...@maprtech.com>.
Yes, the s3 and s3n implementations work just fine with Drill. There's an excellent blog post on the Apache Drill site about enabling S3: http://drill.apache.org/blog/2014/12/09/running-sql-queries-on-amazon-s3/ .  To Steven's point, there is a property defining the local file system to which objects from S3 may be temporarily staged; by default, that is /tmp.  Should you need additional space for large object transfers, change the fs.s3.buffer.dir property along with your S3 credentials.
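For reference, the relevant core-site.xml entries might look something like the fragment below. This is only a sketch: the bucket credentials are placeholders, and the exact credential property names differ between the s3 and s3n schemes and across Hadoop versions, so check the documentation for your version.

```xml
<!-- Illustrative core-site.xml fragment (placeholder values).
     fs.s3n.* properties apply to the s3n:// scheme; use fs.s3.*
     equivalents for the s3:// block-based scheme. -->
<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <property>
    <!-- Local staging directory for large object transfers -->
    <name>fs.s3.buffer.dir</name>
    <value>/data/s3-staging</value>
  </property>
</configuration>
```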

Because the s3 storage plug-in is so similar to the basic file-system plug-in, I usually add a bucket to my Drill deployments by default.   The JSON example below can be used to create a plug-in specification for the AMPLab benchmark data (though I do default it to disabled, just in case).

{
  "name" : "s3amplab",
  "config" : {
    "type" : "file",
    "enabled" : false,
    "connection" : "s3n://big-data-benchmark",
    "workspaces" : {
      "root" : {
        "location" : "/",
        "writable" : false,
        "defaultInputFormat" : null
      }
    },
    "formats" : {
      "psv" : {
        "type" : "text",
        "extensions" : [ "tbl" ],
        "delimiter" : "|"
      },
      "csv" : {
        "type" : "text",
        "extensions" : [ "csv" ],
        "delimiter" : ","
      },
      "tsv" : {
        "type" : "text",
        "extensions" : [ "tsv" ],
        "delimiter" : "\t"
      },
      "parquet" : {
        "type" : "parquet"
      },
      "json" : {
        "type" : "json"
      }
    }
  }
}


Just load it into your Drill storage configuration with
	curl -X POST -H "Content-Type: application/json" --upload-file ${S3_JSON_PLUGIN_FILE} \
          http://${DRILL_SERVER}:8047/storage/s3amplab.json
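Once the plug-in is registered and enabled, a query against the bucket would look something like the sketch below. The table path is purely illustrative -- the actual file layout inside the big-data-benchmark bucket may differ -- but it shows how the workspace ("root") and format ("psv", matching the .tbl extension) from the plug-in definition come together in a query.

```sql
-- Hypothetical query: "s3amplab" is the plug-in name, "root" the
-- workspace, and the .tbl file is read via the pipe-delimited
-- "psv" format defined above.
SELECT * FROM s3amplab.root.`/rankings/part-0.tbl` LIMIT 10;
```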

Happy drilling !

-- David


Re: Apache Drill and S3

Posted by Steven Phillips <sp...@maprtech.com>.
I don't really know anything about Hadoop encryption, so I will not address
question 1.

2) The "filesystem" storage in Drill uses the Hadoop FileSystem API. The
filesystem type is configured as part of the storage plugin configuration,
in the "connection" field.

When executing a query against any "filesystem" storage, Drill uses the
getBlockLocations() method of the FileSystem API to get a list of blocks
along with the locations of each block. It uses this information to assign
fragments to the drillbits. Within each fragment, the FileSystem API is
used to read the data from the filesystem.

I'm not sure how the getBlockLocations() method is implemented for the S3
filesystem, but I believe it splits the file based on some configuration
property for block size. I am not sure what locations are returned for the
block locations.

3) I haven't tried this, but if there is a filesystem implementation for s3
and s3n, then they should both work with Drill.
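For illustration, the splitting-and-assignment idea described above can be sketched as follows. This is a toy model, not Drill's actual scheduler: the function names are hypothetical, and real fragment assignment also weighs data locality, which (as noted) is unclear for S3.

```python
# Toy sketch: split a file into block-sized byte ranges, then assign
# the resulting splits round-robin to a set of drillbits. Illustrative
# only -- Drill's real parallelizer is considerably more involved.

def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs covering the whole file."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

def assign_fragments(splits, drillbits):
    """Round-robin assignment of splits to drillbits."""
    assignment = {bit: [] for bit in drillbits}
    for i, split in enumerate(splits):
        assignment[drillbits[i % len(drillbits)]].append(split)
    return assignment

splits = split_into_blocks(file_size=300, block_size=128)
print(splits)  # [(0, 128), (128, 128), (256, 44)]
print(assign_fragments(splits, ["drillbit-1", "drillbit-2"]))
```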




-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Apache Drill and S3

Posted by Venkata Sowrirajan <vs...@maprtech.com>.
1. I am not really sure whether you can do this or not. However, if you
can decrypt the files on the fly using the S3 APIs with your master key,
then I think it should be possible.

2. Hadoop has no role when using Drill with S3. Drill uses the S3
FileSystem APIs, with the access key and secret key from core-site.xml, to
access the files stored in S3 buckets.

3. I think it will work for both S3 and S3n.

Regards

Venkat
MapR Technologies, Inc.
