Posted to user@drill.apache.org by "N.Venkata Naga Ravi" <nv...@hotmail.com> on 2014/05/15 18:41:08 UTC

Drill with Spark

Hi,

I started exploring Drill, and it looks like a very interesting tool. Can somebody explain how Drill compares with Apache Spark and Storm?
Do we still need Apache Spark alongside Drill in the big data stack, or can Drill directly replace Spark?

Thanks,
Ravi

Help : Cannot start Drill

Posted by Amit Matety <ma...@yahoo.com>.
 Hi,

I am following the instructions to compile and run Drill locally on a Mac. I was able to successfully compile the source code cloned from Git, but when I run the command below, I see the exception shown. (It looks like someone else hit a similar issue before, quoted next, but it got no responses.)
 
Can't Start Drill.
Hi, I am new to Drill. I compiled Drill following https://cwiki.apache.org/confluence/display/DRILL/Compiling+Drill+from+source
and when I start Drill there are some problems. Can you help me?

./bin/sqlline -u jdbc:drill:schema=parquet-local -n admin -p admin

11:52:33.074 [main] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:2181) and timeout (5000) / elapsed (43593)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at com.netflix.curator.ConnectionState.getZooKeeper(ConnectionState.java:94) ~[curator-client-1.1.9.jar:na]
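
For what it's worth, the Curator error above just means that nothing is
answering on localhost:2181. Two suggestions, hedged since the exact cause
may differ: check whether a ZooKeeper server is actually running, or bypass
ZooKeeper entirely by starting Drill in embedded mode with the zk=local
connection string:

# A healthy ZooKeeper replies "imok" to the four-letter ruok command.
echo ruok | nc localhost 2181

# Or skip ZooKeeper and run an embedded Drillbit instead:
./bin/sqlline -u jdbc:drill:zk=local -n admin -p admin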

Regards,
Amit

Re: Drill with Spark

Posted by Ted Dunning <te...@gmail.com>.
On Sat, May 17, 2014 at 12:27 AM, Amit Matety <ma...@yahoo.com> wrote:

> Does Drill support joins to in-memory dimension tables, unlike Druid? Does
> it have any limitation on the number of records it can fetch, etc.?
>

Yes (Drill supports such joins).  No (there is no fixed limit on the number of records it can fetch).

Re: Drill with Spark

Posted by Amit Matety <ma...@yahoo.com>.
Thanks Neeraja. I will check out the link provided. 

Sent from my iPhone


Re: Drill with Spark

Posted by Neeraja Rentachintala <nr...@maprtech.com>.
In addition to what the others said, below are a few more points (answered in
an email thread some time back).


-----------
- Drill provides ANSI SQL. This means that all BI/analytics and SQL
tools can work as-is with Drill over JDBC/ODBC. Druid provides REST APIs
as its query layer; I am not sure Druid has a SQL layer at all (I don't see
one in their docs).

- Query flexibility is high with Drill. For example, Druid supports groupBy-
style queries but doesn't support JOINs. Drill supports all the key
analytic functionality, such as JOINs, aggregations, sorts, filters, and a
wide variety of functions to operate on data, which makes it suitable for a
broader set of use cases.

- Drill supports queries natively on Hadoop data formats (JSON, Parquet, and
text, as well as all Hive file formats). You don't need to load or copy the
data into a specific format in order to query it.

- Drill can run queries directly on self-describing data such as JSON,
Parquet, and HBase without defining schema overlays in Hive. You can take a
look at the "Apache Drill in 10 Minutes" doc below to get started with Drill
and some of these capabilities; a small JDBC sketch follows the link.
https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+in+10+Minutes
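
To make the JDBC point concrete, here is a minimal sketch, assuming the Drill
JDBC driver jar is on the classpath and a Drillbit is reachable (zk=local
starts an embedded one). The file path and column names below are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillJdbcSketch {
    public static void main(String[] args) throws Exception {
        // zk=local runs an embedded Drillbit; point zk= at your ZooKeeper
        // quorum to query a distributed cluster instead.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             // Query a raw JSON file in place; no schema was ever defined.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, age FROM dfs.`/tmp/people.json` LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + ": " + rs.getInt("age"));
            }
        }
    }
}

BI tools that speak JDBC/ODBC issue the same kind of queries under the hood,
which is the point of the ANSI SQL surface.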

Re: Drill with Spark

Posted by Timothy Chen <tn...@gmail.com>.
Druid, just like Redshift, requires an extra ETL step to import the data before you can query it, which slows down the freshness of your queryable data.

Obviously there are pros and cons to each design, but Drill also tries to optimize as much as possible with the metadata available; down the road it will be able to gain enough stats after a scan, or perhaps even run an extra compute-stats pass like Impala does.

Tim

Sent from my iPhone


Re: Drill with Spark

Posted by Amit Matety <ma...@yahoo.com>.
Regarding the comparison: how does Drill compare to Druid, which is also an in-memory warehouse? Does Drill support joins to in-memory dimension tables, unlike Druid? Does it have any limitation on the number of records it can fetch, etc.?

Regards,
Amit


Re: Drill with Spark

Posted by Jason Altekruse <al...@gmail.com>.
Ted covered the most important points. I just want to add a few
clarifications.

While the code for Drill so far is written in pure Java, there is no
specific requirement that all of Drill run in Java. Part of the motivation
for the in-memory record representation we chose, making it columnar and
storing it in Java's native ByteBuffers, was to enable integration with
native code compiled from C/C++ to run some of our operators. ByteBuffers
are part of the official Java API, but their use is not generally
recommended: they allow memory operations that you do not find in typical
Java data types and structures, but they require you to manage your own memory.
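
As a rough illustration of what direct ByteBuffers buy you (a toy sketch, not
Drill's actual value-vector code; the row count and layout are invented): an
off-heap, fixed-width integer column whose bytes can later be handed to
native code without copying.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class OffHeapIntColumn {
    private static final int INT_WIDTH = 4;  // bytes per 32-bit value

    public static void main(String[] args) {
        int rows = 1000;
        // allocateDirect places the bytes outside the GC-managed heap,
        // which is what lets JNI later read them in place.
        ByteBuffer column = ByteBuffer.allocateDirect(rows * INT_WIDTH)
                                      .order(ByteOrder.nativeOrder());
        for (int i = 0; i < rows; i++) {
            column.putInt(i * INT_WIDTH, i * 2);  // absolute put: row i's value
        }
        System.out.println(column.getInt(42 * INT_WIDTH));  // prints 84
    }
}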

One important use case for us is the ability to pass them through the Java
Native Interface without doing a copy. While it is still inefficient to jump
from Java to C for every record, we should be able to define a clean
interface that takes a batch of records (around 1000) into a C context in a
single jump; after the C code finishes processing them, the single jump back
into the Java context should complete just as quickly as the jump in the
other direction.
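
On the Java side, that boundary could look roughly like this (a hypothetical
binding; the library and method names are invented). The C implementation
would call JNI's GetDirectBufferAddress on the buffer to work on the bytes in
place, so nothing is copied at the crossing:

import java.nio.ByteBuffer;

public class NativeBatchOp {
    static {
        System.loadLibrary("drillnativeop");  // hypothetical native library
    }

    // One JNI crossing per batch (~1000 records), not one per record.
    // Returns the number of records that survive the native filter.
    public static native int filterBatch(ByteBuffer batch, int recordCount);
}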

With this in mind, any language you could pass data to from C would be
compatible. While we likely will not support a wide array of plugin
languages soon, it should be possible for people to plug in a variety of
existing codebases to add data-processing functionality to Drill.

-Jason Altekruse


Re: Drill with Spark

Posted by Ted Dunning <te...@gmail.com>.
Drill is a very different tool from Spark, or even from Spark SQL (aka
Shark).

There is some overlap, but there are important differences.  For instance:

- Drill supports weakly typed SQL (see the example after this list).

- Drill has a very clever way to pass data from one processor to another.
This allows very efficient processing.

- Drill generates code in response to the query and to the observed data.
This is a big deal, since it allows high speed with dynamic types.

- Drill supports full ANSI SQL, not HiveQL.

- Spark supports programming in Scala.

- Spark ties distributed data objects to objects in a language like Java or
Scala, rather than using a columnar form.  This makes generic user-written
code easier, but is less efficient.
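
To make the weak-typing point concrete, here is a hedged example (the file
path and field names are made up): in Drill you can run

  SELECT t.name, t.score FROM dfs.`/tmp/events.json` t WHERE t.score > 10

even though no schema for events.json was ever declared anywhere; the types
are discovered from the data, and the matching execution code is generated
at run time.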
