Posted to user@drill.apache.org by Paul Mogren <PM...@commercehub.com> on 2015/06/18 17:24:42 UTC

Re: To EMRFS or not to EMRFS?

Following up. Ted gave sound advice regarding reading S3 vs HDFS, but
didn't address EMRFS specifically. Here is what I have learned.

EMRFS is an HDFS emulation layer over S3 storage. What it provides is a
way to get the data consistency expected by clients of HDFS, but not
provided directly by S3, by transparently doing things like retries of
reads and tracking which clients recently wrote which data. I see no
reason to believe that the performance of Drill over EMRFS would be better
than Drill directly reading from S3, unless maybe in the context of large
objects, if there is something different about how they can be split for
concurrent readers - I have not investigated this.

I don't believe Drill writes to shared storage except by user request or
configuration. Perhaps EMRFS is helpful to users of Drill features that
write data, like CTAS or spill-to-DFS.
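
For context, the kind of write I mean, as a Drill CTAS sketch (the plugin and
workspace names here are hypothetical; the target workspace must be writable):

    -- reads from an S3-backed storage plugin named s3,
    -- writes the result as files into the dfs.tmp workspace
    CREATE TABLE dfs.tmp.`orders_copy` AS
    SELECT * FROM s3.`root`.`orders.csv`;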


Other potential reasons that some might like to use EMRFS include its
support for IAM Role security in EC2, and transparent client-side
encryption of S3 objects. It seems the former is finally coming to Jets3t,
the library used by Drill to access S3, so hopefully that will soon be an
option even without requiring EMRFS:
https://bitbucket.org/jmurty/jets3t/issue/163/provide-support-for-aws-iam-instance-roles

-Paul



On 5/26/15, 2:15 PM, "Paul Mogren" <PM...@commercehub.com> wrote:

>Thank you. This kind of summary advice is helpful for getting started.
>
>
>
>
>On 5/22/15, 6:37 PM, "Ted Dunning" <te...@gmail.com> wrote:
>
>>The variation will have less to do with Drill (which can read all these
>>options such as EMR resident MapR FS or HDFS or persistent MapR FS or
>>HDFS
>>or S3).
>>
>>The biggest differences will have to do with whether your clusters
>>providing storage are permanent or ephemeral.  If they are ephemeral, you
>>can host the distributed file system on EBS based volumes so that you
>>will
>>have an ephemeral but restartable cluster.
>>
>>So the costs in run time will have to do with startup or restart times
>>and
>>the time it takes to pour the data into any new distributed file system.
>>If you host permanently in S3 and have Drill read directly from there,
>>you
>>have no permanent storage cost for the input data, but will probably have
>>slower reads.  With a permanent cluster hosting the data, you will have
>>higher costs, but likely also higher performance.  Copying data from S3
>>to
>>a distributed file system is probably not a great idea since you pay
>>roughly the same cost during copy as you would have paid just querying
>>directly from S3.
>>
>>Exactly how these trade-offs pan out requires some careful thought and
>>considerable knowledge of your workload.
>>
>>
>>
>>On Fri, May 22, 2015 at 3:22 PM, Paul Mogren <PM...@commercehub.com>
>>wrote:
>>
>>> > When running Drill in AWS EMR, can anyone advise as to the advantages
>>> >and disadvantages of having Drill access S3 via EMRFS vs. directly?
>>>
>>> Also, a third option: an actual HDFS not backed by S3
>>>
>>>
>


Re: To EMRFS or not to EMRFS?

Posted by Paul Mogren <PM...@commercehub.com>.
If you go down that route, it may be better to ditch Jets3t in favor of
the AWS SDK, as Presto and much of the community have.

https://github.com/facebook/presto/pull/991/files
https://github.com/facebook/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HdfsConfigurationUpdater.java
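
The big draw of the AWS SDK here is that it resolves IAM role credentials from
the instance metadata service on its own. A minimal sketch against the v1 SDK
(the bucket name is hypothetical):

    import com.amazonaws.auth.InstanceProfileCredentialsProvider;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class IamRoleS3List {
        public static void main(String[] args) {
            // No access keys anywhere: the SDK fetches temporary credentials
            // for the instance's IAM role from the EC2 metadata service.
            AmazonS3 s3 = new AmazonS3Client(new InstanceProfileCredentialsProvider());
            for (S3ObjectSummary obj : s3.listObjects("my-bucket").getObjectSummaries()) {
                System.out.println(obj.getKey());
            }
        }
    }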


On 6/18/15, 4:08 PM, "Paul Mogren" <PM...@commercehub.com> wrote:

>Maybe another way to go is to copy Jets3tNativeFileSystemStore and any
>necessary dependent classes to another name, modify, and register it under
>a different URL scheme (not s3n)
>
>
>
>
>On 6/18/15, 3:54 PM, "Paul Mogren" <PM...@commercehub.com> wrote:
>
>>Thanks.
>>
>>
>>I tried to follow up on the upcoming Jets3t support for IAM roles.
>>Dropping in a snapshot build of Jets3t with Drill was not enough. I was
>>going to try patching Drill to take advantage, but I found that Drill
>>inherits its ability to treat S3 as a DFS from hadoop-common’s
>>org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore, which
>>delegates
>>some initialization to org.apache.hadoop.fs.s3.S3Credentials, which
>>explicitly throws IllegalArgumentException if the accessKey and/or
>>secretAccessKey is not provided (stack trace below). So we’re looking at
>>waiting for Jets3t making a release, having it picked up by Hadoop, patch
>>Hadoop to allow use of IAM role, Hadoop making a release, and having that
>>available in your stack (e.g. AWS offering it in EMR). That’s going to
>>take quite some time.  One could probably do the patching oneself and
>>slip
>>it into the classpath, but it makes more sense to try EMRFS.
>>
>>
>>2015-06-18 18:12:28,997 [qtp1208140271-50] ERROR
>>o.a.d.e.server.rest.QueryResources - Query from Web UI Failed
>>[... stack trace snipped; the full trace is in the original message below ...]
>>
>>
>>
>>
>>
>>
>>
>>
>>On 6/18/15, 11:28 AM, "Ted Dunning" <te...@gmail.com> wrote:
>>
>>>On Thu, Jun 18, 2015 at 8:24 AM, Paul Mogren <PM...@commercehub.com>
>>>wrote:
>>>
>>>> Following up. Ted gave sound advice regarding reading S3 vs HDFS, but
>>>> didn't address EMRFS specifically. Here is what I have learned.
>>>>
>>>
>>>Great summary.  Very useful when people help by feeding back what they
>>>have
>>>learned.
>>
>


Re: To EMRFS or not to EMRFS?

Posted by Paul Mogren <PM...@commercehub.com>.
Maybe another way to go is to copy Jets3tNativeFileSystemStore and any
necessary dependent classes to another name, modify, and register it under
a different URL scheme (not s3n).
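
The registration itself is just a core-site.xml entry keyed by the new scheme,
following Hadoop's fs.<scheme>.impl convention. A sketch (the s3x scheme and
the class name are made up):

    <property>
      <!-- route s3x:// paths to the copied/modified FileSystem class -->
      <name>fs.s3x.impl</name>
      <value>org.example.fs.s3x.IamNativeS3FileSystem</value>
    </property>

After that, paths like s3x://bucket/path would resolve through the new class
while s3n:// keeps its stock behavior.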




On 6/18/15, 3:54 PM, "Paul Mogren" <PM...@commercehub.com> wrote:

>Thanks.
>
>
>I tried to follow up on the upcoming Jets3t support for IAM roles.
>Dropping in a snapshot build of Jets3t with Drill was not enough. I was
>going to try patching Drill to take advantage, but I found that Drill
>inherits its ability to treat S3 as a DFS from hadoop-common’s
>org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore, which delegates
>some initialization to org.apache.hadoop.fs.s3.S3Credentials, which
>explicitly throws IllegalArgumentException if the accessKey and/or
>secretAccessKey is not provided (stack trace below). So we’re looking at
>waiting for Jets3t making a release, having it picked up by Hadoop, patch
>Hadoop to allow use of IAM role, Hadoop making a release, and having that
>available in your stack (e.g. AWS offering it in EMR). That’s going to
>take quite some time.  One could probably do the patching oneself and slip
>it into the classpath, but it makes more sense to try EMRFS.
>
>
>2015-06-18 18:12:28,997 [qtp1208140271-50] ERROR
>o.a.d.e.server.rest.QueryResources - Query from Web UI Failed
>[... stack trace snipped; the full trace is in the original message below ...]
>
>
>
>
>
>
>
>
>On 6/18/15, 11:28 AM, "Ted Dunning" <te...@gmail.com> wrote:
>
>>On Thu, Jun 18, 2015 at 8:24 AM, Paul Mogren <PM...@commercehub.com>
>>wrote:
>>
>>> Following up. Ted gave sound advice regarding reading S3 vs HDFS, but
>>> didn't address EMRFS specifically. Here is what I have learned.
>>>
>>
>>Great summary.  Very useful when people help by feeding back what they
>>have
>>learned.
>


Re: To EMRFS or not to EMRFS?

Posted by Paul Mogren <PM...@commercehub.com>.
Thanks.


I tried to follow up on the upcoming Jets3t support for IAM roles.
Dropping a snapshot build of Jets3t into Drill was not enough. I was going
to try patching Drill to take advantage of it, but I found that Drill
inherits its ability to treat S3 as a DFS from hadoop-common’s
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore, which delegates
some initialization to org.apache.hadoop.fs.s3.S3Credentials, which
explicitly throws IllegalArgumentException if the accessKey and/or
secretAccessKey is not provided (stack trace below). So we’re looking at
waiting for Jets3t to make a release, having it picked up by Hadoop,
patching Hadoop to allow use of an IAM role, waiting for Hadoop to make a
release, and having all of that reach your stack (e.g. AWS offering it in
EMR). That’s going to take quite some time. One could probably do the
patching oneself and slip it into the classpath, but it makes more sense
to try EMRFS.
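
In the meantime, the only thing S3Credentials will accept is a static key
pair, e.g. in core-site.xml (the property names come straight from the
exception below; the values are placeholders):

    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>YOUR_SECRET_ACCESS_KEY</value>
    </property>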


2015-06-18 18:12:28,997 [qtp1208140271-50] ERROR o.a.d.e.server.rest.QueryResources - Query from Web UI Failed
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

[Error Id: 32c0357e-b8df-48f0-98ae-f200d48dd0f5 on ip-172-24-7-103.ec2.internal:31010]

  (org.apache.drill.exec.work.foreman.ForemanException) Unexpected exception during fragment initialization: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
    org.apache.drill.exec.work.foreman.Foreman.run():251
    java.util.concurrent.ThreadPoolExecutor.runWorker():1145
    java.util.concurrent.ThreadPoolExecutor$Worker.run():615
    java.lang.Thread.run():745
  Caused By (java.lang.IllegalArgumentException) AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
    org.apache.hadoop.fs.s3.S3Credentials.initialize():70
    org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize():73
    sun.reflect.NativeMethodAccessorImpl.invoke0():-2
    sun.reflect.NativeMethodAccessorImpl.invoke():57
    sun.reflect.DelegatingMethodAccessorImpl.invoke():43
    java.lang.reflect.Method.invoke():606
    org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod():190
    org.apache.hadoop.io.retry.RetryInvocationHandler.invoke():103
    org.apache.hadoop.fs.s3native.$Proxy65.initialize():-1
    org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize():272
    org.apache.hadoop.fs.FileSystem.createFileSystem():2397
    org.apache.hadoop.fs.FileSystem.access$200():89
    org.apache.hadoop.fs.FileSystem$Cache.getInternal():2431
    org.apache.hadoop.fs.FileSystem$Cache.get():2413
    org.apache.hadoop.fs.FileSystem.get():368
    org.apache.hadoop.fs.FileSystem.get():167
    org.apache.drill.exec.store.dfs.DrillFileSystem.<init>():85
    org.apache.drill.exec.util.ImpersonationUtil$1.run():156
    org.apache.drill.exec.util.ImpersonationUtil$1.run():153
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():415
    org.apache.hadoop.security.UserGroupInformation.doAs():1556
    org.apache.drill.exec.util.ImpersonationUtil.createFileSystem():153
    org.apache.drill.exec.util.ImpersonationUtil.createFileSystem():145
    org.apache.drill.exec.util.ImpersonationUtil.createFileSystem():133
    org.apache.drill.exec.store.dfs.WorkspaceSchemaFactory$WorkspaceSchema.<init>():125
    org.apache.drill.exec.store.dfs.WorkspaceSchemaFactory.createSchema():114
    org.apache.drill.exec.store.dfs.FileSystemSchemaFactory$FileSystemSchema.<init>():77
    org.apache.drill.exec.store.dfs.FileSystemSchemaFactory.registerSchemas():64
    org.apache.drill.exec.store.dfs.FileSystemPlugin.registerSchemas():131
    org.apache.drill.exec.store.StoragePluginRegistry$DrillSchemaFactory.registerSchemas():330
    org.apache.drill.exec.ops.QueryContext.getRootSchema():158
    org.apache.drill.exec.ops.QueryContext.getRootSchema():147
    org.apache.drill.exec.ops.QueryContext.getRootSchema():135
    org.apache.drill.exec.ops.QueryContext.getNewDefaultSchema():121
    org.apache.drill.exec.planner.sql.DrillSqlWorker.<init>():90
    org.apache.drill.exec.work.foreman.Foreman.runSQL():900
    org.apache.drill.exec.work.foreman.Foreman.run():240
    java.util.concurrent.ThreadPoolExecutor.runWorker():1145
    java.util.concurrent.ThreadPoolExecutor$Worker.run():615
    java.lang.Thread.run():745

        at org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:118) ~[drill-java-exec-1.0.0-rebuffed.jar:1.0.0]
        at org.apache.drill.exec.rpc.user.UserClient.handleReponse(UserClient.java:111) ~[drill-java-exec-1.0.0-rebuffed.jar:1.0.0]
        at org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:47) ~[drill-java-exec-1.0.0-rebuffed.jar:1.0.0]
        at org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:32) ~[drill-java-exec-1.0.0-rebuffed.jar:1.0.0]
        at org.apache.drill.exec.rpc.RpcBus.handle(RpcBus.java:61) ~[drill-java-exec-1.0.0-rebuffed.jar:1.0.0]
        at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:233) ~[drill-java-exec-1.0.0-rebuffed.jar:1.0.0]
        at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:205) ~[drill-java-exec-1.0.0-rebuffed.jar:1.0.0]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89) ~[netty-codec-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) ~[netty-transport-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) ~[netty-transport-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254) ~[netty-handler-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) ~[netty-transport-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) ~[netty-transport-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[netty-codec-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) ~[netty-transport-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) ~[netty-transport-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242) ~[netty-codec-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) ~[netty-transport-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) ~[netty-transport-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) ~[netty-transport-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) ~[netty-transport-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) ~[netty-transport-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847) ~[netty-transport-4.0.27.Final.jar:4.0.27.Final]
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:618) ~[netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:329) ~[netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:250) ~[netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ~[netty-common-4.0.27.Final.jar:4.0.27.Final]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_71]








On 6/18/15, 11:28 AM, "Ted Dunning" <te...@gmail.com> wrote:

>On Thu, Jun 18, 2015 at 8:24 AM, Paul Mogren <PM...@commercehub.com>
>wrote:
>
>> Following up. Ted gave sound advice regarding reading S3 vs HDFS, but
>> didn't address EMRFS specifically. Here is what I have learned.
>>
>
>Great summary.  Very useful when people help by feeding back what they
>have
>learned.


Re: To EMRFS or not to EMRFS?

Posted by Alexander Zarei <al...@gmail.com>.
What we learned through our research/experiments doing performance testing
for Drill ODBC is that you get the best throughput when solid-state-drive
EC2 instances such as m3.xlarge are used to form the HDFS.
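
For example, standing up such a cluster might look like this (a sketch; the
AMI version, name, and instance count are illustrative, not recommendations):

    # four m3.xlarge nodes, SSD instance storage backing HDFS
    aws emr create-cluster --name "drill-hdfs-test" \
        --ami-version 3.8.0 \
        --instance-type m3.xlarge \
        --instance-count 4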

Re: To EMRFS or not to EMRFS?

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Jun 18, 2015 at 8:24 AM, Paul Mogren <PM...@commercehub.com>
wrote:

> Following up. Ted gave sound advice regarding reading S3 vs HDFS, but
> didn't address EMRFS specifically. Here is what I have learned.
>

Great summary.  Very useful when people help by feeding back what they have
learned.

Re: To EMRFS or not to EMRFS?

Posted by Paul Mogren <PM...@commercehub.com>.
"When consistent view is enabled, Amazon EMR also has better performance
when listing Amazon S3 prefixes with over 10,000 objects. In fact, we’ve
seen a 5x increase in list performance on prefixes with over 1 million
objects. This speed-up is due to using the EMRFS metadata, which is
required for consistent view, to make listing large numbers of objects
more efficient."


https://blogs.aws.amazon.com/bigdata/post/Tx1WL4KR7SE37YY/Ensuring-Consistency-When-Using-Amazon-S3-and-Amazon-Elastic-MapReduce-for-ETL-W




I’m also realizing that EMRFS may be necessary for Drill to find data that
was written very recently to S3 by another process. The other process also
has to write via EMRFS, not directly to S3, in order to get that benefit.
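
For example (bucket and path hypothetical), a producer on the cluster writes
through EMRFS simply by addressing S3 with the s3:// scheme, which EMR maps
to EMRFS:

    hadoop fs -put /tmp/new-orders.csv s3://my-bucket/incoming/

That way the new object is recorded in the EMRFS metadata, so readers with
consistent view enabled can find it immediately.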




On 6/18/15, 11:24 AM, "Paul Mogren" <PM...@commercehub.com> wrote:

>Following up. Ted gave sound advice regarding reading S3 vs HDFS, but
>didn't address EMRFS specifically. Here is what I have learned.
>
>EMRFS is an HDFS emulation layer over S3 storage. What it provides is a
>way to get the data consistency expected by clients of HDFS, but not
>provided directly by S3, by transparently doing things like retries of
>reads and tracking which clients recently wrote which data. I see no
>reason to believe that the performance of Drill over EMRFS would be better
>than Drill directly reading from S3, unless maybe in the context of large
>objects, if there is something different about how they can be split for
>concurrent readers - I have not investigated this.
>
>I don't believe Drill writes to shared storage except by user request or
>configuration. Perhaps EMRFS is helpful to users of Drill features that
>write data, like CTAS or spill-to-DFS.
>
>
>Other potential reasons that some might like to use EMRFS include its
>support for IAM Role security in EC2, and transparent client-side
>encryption of S3 objects. It seems the former is finally coming to Jets3t,
>the library used by Drill to access S3, so hopefully that will soon be an
>option even without requiring EMRFS:
>https://bitbucket.org/jmurty/jets3t/issue/163/provide-support-for-aws-iam-instance-roles
>
>-Paul
>
>
>
>On 5/26/15, 2:15 PM, "Paul Mogren" <PM...@commercehub.com> wrote:
>
>>Thank you. This kind of summary advice is helpful for getting started.
>>
>>
>>
>>
>>On 5/22/15, 6:37 PM, "Ted Dunning" <te...@gmail.com> wrote:
>>
>>>The variation will have less to do with Drill (which can read all these
>>>options such as EMR resident MapR FS or HDFS or persistent MapR FS or
>>>HDFS
>>>or S3).
>>>
>>>The biggest differences will have to do with whether your clusters
>>>providing storage are permanent or ephemeral.  If they are ephemeral,
>>>you
>>>can host the distributed file system on EBS based volumes so that you
>>>will
>>>have an ephemeral but restartable cluster.
>>>
>>>So the costs in run time will have to do with startup or restart times
>>>and
>>>the time it takes to pour the data into any new distributed file system.
>>>If you host permanently in S3 and have Drill read directly from there,
>>>you
>>>have no permanent storage cost for the input data, but will probably
>>>have
>>>slower reads.  With a permanent cluster hosting the data, you will have
>>>higher costs, but likely also higher performance.  Copying data from S3
>>>to
>>>a distributed file system is probably not a great idea since you pay
>>>roughly the same cost during copy as you would have paid just querying
>>>directly from S3.
>>>
>>>Exactly how these trade-offs pan out requires some careful thought and
>>>considerable knowledge of your workload.
>>>
>>>
>>>
>>>On Fri, May 22, 2015 at 3:22 PM, Paul Mogren <PM...@commercehub.com>
>>>wrote:
>>>
>>>> > When running Drill in AWS EMR, can anyone advise as to the
>>>>advantages
>>>> >and disadvantages of having Drill access S3 via EMRFS vs. directly?
>>>>
>>>> Also, a third option: an actual HDFS not backed by S3
>>>>
>>>>
>>
>