Posted to user@drill.apache.org by Iver Walkoe <re...@eyecue.com> on 2014/11/01 18:15:35 UTC

Re: Drill on Amazon?

Hi Ted!

Thank you for your notes; your timing is perfect as I just now (finally) 
have enough time to start working on this in earnest.

I have started collecting some info - there's not much on Drill per se, 
but there are what seem to be analogous efforts in using Spark and R on EMR.

Apologies for the raw links, but I would appreciate any and all input as 
to whether a similar approach might work with Drill.

1.) The upshot is to use a bootstrap script to install software at 
cluster startup. In another project, I'd had success using R in Hadoop 
streaming (i.e., sans RHadoop), so I was thinking this might be a way to 
get the Drillbits onto the slave (core/task in AWS) nodes in the 
cluster. One would then ssh to the master and subsequently connect to 
the core nodes.
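
For what it's worth, here's a minimal sketch of the kind of bootstrap 
action I have in mind. The Drill version, download URL, and ZooKeeper 
hostname are all placeholders I made up, not tested values:

```shell
#!/bin/bash
# Hypothetical EMR bootstrap action: install and start a Drillbit on
# each node. Version, mirror URL, and ZK host below are placeholders.
set -e
DRILL_VERSION=0.6.0-incubating
cd /home/hadoop
wget http://archive.apache.org/dist/incubator/drill/drill-${DRILL_VERSION}/apache-drill-${DRILL_VERSION}.tar.gz
tar -xzf apache-drill-${DRILL_VERSION}.tar.gz

# Point Drill at the cluster's ZooKeeper quorum instead of the default
# multicast discovery (which, per Ted's note, Amazon disables):
cat >> apache-drill-${DRILL_VERSION}/conf/drill-override.conf <<'EOF'
drill.exec.zk.connect: "master-node:2181"
EOF

# Start the Drillbit daemon on this node:
apache-drill-${DRILL_VERSION}/bin/drillbit.sh start
```

(As I understand it, EMR runs bootstrap actions on every node as the 
hadoop user, so the paths above would need checking.)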

Installing Apache Spark on an Amazon EMR Cluster
http://blogs.aws.amazon.com/bigdata/post/Tx15AY5C50K70RV/Installing-Apache-Spark-on-an-Amazon-EMR-Cluster

...this next entry illustrates getting RHadoop onto the core nodes:

Statistical Analysis with Open-Source R and RStudio on Amazon EMR
http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR

Does that make sense?

2.) I noticed that Amazon is now offering MapR instances in EMR; would 
there be any advantage to using one of these instead of the "stock" 
Amazon instances?

http://aws.amazon.com/elasticmapreduce/mapr/

3.) While I've only been reading and not (yet) directly experimenting, 
I'm not sure I understand your note on UDP multicast. Is there a link to 
the Drill wiki or documentation where this is explained (i.e., why this 
is pertinent)?

At the risk of redundancy, here's my slightly-refined basic 
(hypothetical) plan:

i.) Create an EMR cluster with Drillbits installed on the 
core/task/slave nodes via a bootstrap script

ii.) Load data onto S3 and create a Hive external table: have the data 
partitioned into "folders" (i.e., S3 prefixes) and associate Hive 
partitions with this layout. This would only be used for creating the 
framework in the Hive metastore that Drill could leverage.
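
To make step ii.) concrete, here's a sketch of the Hive DDL I imagine. 
The table name, columns, bucket, and partition scheme are all made up 
for illustration; the partition column is what maps onto the S3 
"folder" prefixes:

```sql
-- Hypothetical external table over partitioned data on S3.
-- Table name, columns, and bucket/prefix are placeholders.
CREATE EXTERNAL TABLE events (
  user_id STRING,
  event_time STRING,
  payload STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/events/';

-- Register each "folder" (S3 prefix) as a partition in the metastore:
ALTER TABLE events ADD PARTITION (dt='2014-10-01')
  LOCATION 's3://my-bucket/events/dt=2014-10-01/';
```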

iii.) Connect to the master node via SSH and then to a slave node (I 
guess via JDBC) to query Hive, à la:

https://cwiki.apache.org/confluence/display/DRILL/Querying+Hive
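
In case it helps, here's roughly what I picture step iii.) looking like 
from a shell. The hostnames and ZooKeeper address are placeholders, and 
I'm assuming the default name of the hive storage plugin:

```shell
# From my workstation: ssh to the EMR master node (placeholder hostname).
ssh hadoop@ec2-master-node.compute.amazonaws.com

# On the cluster: open Drill's sqlline shell, with the JDBC URL pointing
# at the ZooKeeper quorum the Drillbits registered with (placeholder host):
bin/sqlline -u jdbc:drill:zk=master-node:2181

# Then query the Hive external table through Drill's hive storage plugin:
# 0: jdbc:drill:zk=master-node:2181> SELECT * FROM hive.events LIMIT 10;
```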

Does this seem to make sense?

:-)

Again, thank you for the notes!

Best,

Iver



On 10/31/2014 5:56 PM, Ted Dunning wrote:
> Iver,
>
> Didn't see if you got an answer.
>
> Yes... you would just start drill bits on each node separately.  You might
> have some troubles because Amazon disables UDP multicast by default.  That
> issue should be resolved soon.
>
>
>
> On Sun, Oct 19, 2014 at 4:30 PM, Iver Walkoe <re...@eyecue.com> wrote:
>
>> Hello!
>>
>> This is my first query to the group - at present, I'm an inexperienced
>> Drill user though looking to change that.
>>
>> I am pretty familiar with AWS - though not as much at the config level -
>> and can make my way around Hadoop.
>>
>> That being said, and noting I'm going to be following up with Amazon
>> people on this as well, I thought I'd post a question here just in case
>> there were some readily available resources.
>>
>> I'm looking to investigate the possibility of using Drill with Hive on an
>> EMR instance pointed toward an external table on S3. That is, I'd be
>> looking to use Hive to create the metadata for an external table on S3 and
>> have Drill leverage this.
>>
>> In particular, I am pretty clueless as to how one would get Drill
>> installed on the slave nodes on an EMR instance. Don't know if it's
>> possible, in fact (hoping it is). It would seem that getting Drill (Bits)
>> on the slave nodes and then being able to communicate with a Drill Bit on
>> such a node is the task at hand.
>>
>> Any and all suggestions are greatly appreciated!
>>
>> Thanks!
>>
>> Iver
>>
>>