Posted to user@drill.apache.org by Samuel Marks <sa...@gmail.com> on 2015/01/30 12:26:29 UTC

Which [open-source] SQL engine atop Hadoop?

Since Hadoop <https://hive.apache.org> came out, there have been various
commercial and/or open-source attempts to expose some compatibility with SQL
<http://drill.apache.org>. Obviously by posting here I am not expecting an
unbiased answer.

I am seeking an SQL-on-Hadoop offering which provides low-latency querying
and supports the most common CRUD <https://spark.apache.org> operations,
including [the basics!] statements along these lines: CREATE TABLE, INSERT
INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
Transactional support would also be nice, but is not a must-have.
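
For concreteness, the baseline behaviour listed above can be sketched with
Python's built-in sqlite3 module standing in for the traditional RDBMS. This is
only an illustration under that assumption: the table `t`, its columns and the
helper `crud_roundtrip` are hypothetical, and none of the Hadoop engines listed
below is involved.

```python
import sqlite3


def crud_roundtrip():
    """Run the six statements named above against an in-memory SQLite database.

    SQLite is only a stand-in for a traditional RDBMS; the table `t` and its
    columns are made up for illustration.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, c1 INTEGER)")
    cur.executemany("INSERT INTO t (id, c1) VALUES (?, ?)", [(1, 10), (2, 20)])
    cur.execute("UPDATE t SET c1 = 2 WHERE id = 1")
    cur.execute("DELETE FROM t WHERE id = 2")
    # Read back whatever survived the update and the delete.
    rows = cur.execute("SELECT * FROM t").fetchall()
    cur.execute("DROP TABLE t")
    conn.commit()
    conn.close()
    return rows
```

Any engine meant as a full RDBMS replacement would need to accept an
equivalent sequence of statements and return the same surviving row.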

Essentially I want a full replacement for the more traditional RDBMS, one
which can scale from 1 node to a serious Hadoop cluster.

Python is my language of choice for interfacing; there does seem to
be a Python JDBC wrapper <https://spark.apache.org/sql>.

Here is what I've found thus far:

   - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
   thanks to the Stinger initiative)
   - Apache Drill <http://drill.apache.org> (ANSI SQL support)
   - Apache Spark <https://spark.apache.org> (Spark SQL
   <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
   <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
   or Parquet <http://parquet.io/>)
   - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
   <http://hbase.apache.org>, lacks full transaction
   <http://en.wikipedia.org/wiki/Database_transaction> support, relational
   operators <http://en.wikipedia.org/wiki/Relational_operators> and some
   built-in functions)
   - Cloudera Impala
   <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
   (significant HiveQL support, some SQL language support, no support for
   indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
   amongst others)
   - Presto <https://github.com/facebook/presto> from Facebook (can query
   Hive, Cassandra <http://cassandra.apache.org>, relational DBs, etc.
   Doesn't seem to be designed for low-latency responses across small
   clusters, or to support UPDATE operations. It is optimized for data
   warehousing or analytics¹
   <http://prestodb.io/docs/current/overview/use-cases.html>)
   - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
   community edition <https://www.mapr.com/products/hadoop-download> (seems
   to be a packaging of Hive, HP Vertica
   <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
   Drill and a native ODBC wrapper
   <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
   - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
   interface and multi-dimensional analysis [OLAP
   <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
   supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
   Hive and HBase, and seems targeted at very large data-sets, though it
   maintains low query latency)
   - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance
   with JDBC <http://en.wikipedia.org/wiki/JDBC> driver support [benchmarks
   against Hive and Impala
   <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
   ])
   - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
   Lingual <http://docs.cascading.org/lingual/1.0/>²
   <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
   JDBC Drivers, a SQL command shell, and a catalog manager for publishing
   files [or any resource] as schemas and tables.")

Which—from this list or elsewhere—would you recommend, and why?
Thanks for all suggestions,

Samuel Marks
http://linkedin.com/in/samuelmarks

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Andrew Brust <an...@bluebadgeinsights.com>.

It's designed for data discovery and analytical use, not CRUD/operational applications.

----- Reply message -----
From: "Samuel Marks" <sa...@gmail.com>
To: "user@drill.apache.org" <us...@drill.apache.org>
Subject: Which [open-source] SQL engine atop Hadoop?
Date: Fri, Jan 30, 2015 8:59 PM


Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.

Thanks Andrew and Tomer,

I am quite used to MongoDB, and thus wouldn't mind going the extra step of
removing the base schema altogether.

Field validation can always be done in other layers. Without model
consistency, though, I'd need some sort of transaction support to enforce the
various constraints I require, such as primary, foreign and other unique
keys.
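
To make the trade-off concrete, here is a minimal sketch of key checks done in
an application layer over a schema-less store. Everything in it (`TinyStore`,
`ConstraintError`, the `users`/`posts` collections) is hypothetical and
in-memory; it is not an API of Drill, MongoDB or any other system mentioned in
this thread.

```python
class ConstraintError(Exception):
    """Raised when a write would break a uniqueness or reference rule."""


class TinyStore:
    """Toy schema-less store with key checks done in application code."""

    def __init__(self):
        self.users = {}      # user_id -> document
        self.emails = set()  # values that must stay unique
        self.posts = {}      # post_id -> document

    def add_user(self, user_id, email):
        # Unique-key checks: both the id and the email must be unused.
        if user_id in self.users:
            raise ConstraintError("duplicate user id: %r" % (user_id,))
        if email in self.emails:
            raise ConstraintError("duplicate email: %r" % (email,))
        self.users[user_id] = {"email": email}
        self.emails.add(email)

    def add_post(self, post_id, user_id, body):
        if post_id in self.posts:
            raise ConstraintError("duplicate post id: %r" % (post_id,))
        # Foreign-key check: the referenced user must already exist.
        if user_id not in self.users:
            raise ConstraintError("unknown user id: %r" % (user_id,))
        self.posts[post_id] = {"user_id": user_id, "body": body}
```

Each method checks first and writes second. In a single process that is
enough, but with concurrent writers another write could land between the check
and the write, which is where transaction support (or some other form of
atomicity) becomes necessary.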

Would you still say that Drill isn't a suitable replacement for
MongoDB/MySQL/Postgres, in my use-case?

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks
On 31 Jan 2015 10:57, "Tomer Shiran" <ts...@gmail.com> wrote:

> Yes, Drill is currently focused on querying data as opposed to inserting or
> updating. While most of the systems you listed take a traditional approach
> to SQL in which a DBA must create and manage schemas, Drill is designed for
> Hadoop and NoSQL databases. In these systems, most of the data is usually
> self-describing (JSON, Parquet, etc.) and sometimes even schema-less (as in
> JSON, HBase, MongoDB) so it doesn't make sense to require schemas to be
> created and managed manually, and data to be transformed before it can be
> queried.  Drill's unique architecture makes it unique in its ability to
> enable self-service data exploration where agility is essential.

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Tomer Shiran <ts...@gmail.com>.

Yes, Drill is currently focused on querying data as opposed to inserting or
updating. While most of the systems you listed take a traditional approach
to SQL in which a DBA must create and manage schemas, Drill is designed for
Hadoop and NoSQL databases. In these systems, most of the data is usually
self-describing (JSON, Parquet, etc.) and sometimes even schema-less (as in
JSON, HBase, MongoDB) so it doesn't make sense to require schemas to be
created and managed manually, and data to be transformed before it can be
queried.  Drill's architecture makes it unique in its ability to
enable self-service data exploration where agility is essential.
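
The schema-on-read idea described above can be illustrated in a few lines of
plain Python. This is only a sketch of the concept, not of how Drill is
implemented; the function `query_json_lines` and its parameters are
hypothetical.

```python
import json


def query_json_lines(lines, where, columns):
    """Filter and project newline-delimited JSON without a declared schema.

    Field names are discovered per record at read time; a column that a
    record lacks simply comes back as None.
    """
    results = []
    for line in lines:
        record = json.loads(line)
        if where(record):
            results.append({col: record.get(col) for col in columns})
    return results
```

For example, given the two records `{"name": "a", "age": 30}` and
`{"name": "b", "city": "X"}`, filtering on `age > 18` and projecting `name`
and `age` returns only the first record, even though the two records do not
share the same fields and no table was ever created.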

On Fri, Jan 30, 2015 at 10:10 AM, Andrew Brust <
andrew.brust@bluebadgeinsights.com> wrote:

> Not sure Drill -- or any of the other SQL-on-Hadoop engines -- are truly
> well-suited to CRUD.  They excel at the "R" -- the "CUD" is not their forte.
>
> -----Original Message-----
> From: Samuel Marks [mailto:samuelmarks@gmail.com]
> Sent: Friday, January 30, 2015 8:50 AM
> To: user@drill.apache.org
> Subject: Re: Which [open-source] SQL engine atop Hadoop?
>
> Dear Jacques,
>
> Seeing the support for 03 SQL syntax, nested objects, and schema-free SQL
> in Apache Drill is quite impressive, not to mention the useful ODBC
> interface alongside the expected JDBC one. Additionally on the scalability
> side your documentation claims: "Scales from a single laptop to a 1000-node
> cluster".
>
> You mention that this entire topic is subjective. I suppose with
> insufficient information about my use-case, you may just be right.
>
> Without giving away my full use-case—FYI: I will be open-sourcing what I'm
> building—I will tell you a little bit about the components.
>
> The generic components would just include CRUD, and basic related queries
> (such as propagated updates utilising joins).
>
> More interesting is on the analytics side, wherein I'll be executing a
> variety of Machine Learning, information filtering (recommender systems,
> internal search engine most with some element of Natural Language
> Processing), time series sequence matching and related tasks. Some of these
> require near-realtime responses, whereas others can be delayed
> significantly.
>
> I posted something similar to this on StackOverflow, it was very quickly
> removed. Haven't tried LinkedIn or Quora, probably worth a shot. Worried
> about speaking to enterprise sales people, as they're being paid to push
> their own offering (and I doubt they have extensive benchmarks across all
> their competitors).
>
> Thanks for your continuing advice,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> On Sat, Jan 31, 2015 at 12:22 AM, Jacques Nadeau <ja...@apache.org>
> wrote:
>
> > Samuel,
> >
> > You've come and asked your question on the Apache Drill group so of
> > course the answer is Apache Drill is best for everything, right?
> >
> > The reality is that each tool has a set of strengths and weaknesses
> > for each particular use case. An Apache user support mailing list is
> > definitely NOT the place to have this discussion.  You're really
> > asking for technology selection advice and this entire topic is very
> > subjective. The people in any one community would never do full
> > justice to all the options. As such I suggest you use another forum
> > such as Quora or LinkedIn to get advice.
> > (There is also a helpful article on Gigaom that just came out
> > yesterday and all sorts of friendly sales people at companies like
> > MapR and IBM who love giving this kind of advice.)
> >
> > What we can do here is tell you how Drill can solve or not solve your
> > different use cases and help you work through those.  If you want to go
> > into more detail on those, we'd be happy to help.
> >
> > Thanks again for the interest. Sorry if this seems abrupt but these
> > threads generally aren't productive and tend to be very divisive.
> >
> > Welcome to the community :)
> >
> > Jacques

RE: Which [open-source] SQL engine atop Hadoop?

Posted by Andrew Brust <an...@bluebadgeinsights.com>.

Not sure Drill -- or any of the other SQL-on-Hadoop engines -- is truly well-suited to CRUD.  They excel at the "R" -- the "CUD" is not their forte.

-----Original Message-----
From: Samuel Marks [mailto:samuelmarks@gmail.com] 
Sent: Friday, January 30, 2015 8:50 AM
To: user@drill.apache.org
Subject: Re: Which [open-souce] SQL engine atop Hadoop?

Dear Jacques,

Seeing the support for SQL:2003 syntax, nested objects, and schema-free SQL in Apache Drill is quite impressive, not to mention the useful ODBC interface alongside the expected JDBC one. Additionally, on the scalability side, your documentation claims: "Scales from a single laptop to a 1000-node cluster".

You mention that this entire topic is subjective. I suppose with insufficient information about my use-case, you may just be right.

Without giving away my full use-case—FYI: I will be open-sourcing what I'm building—I will tell you a little bit about the components.

The generic components would just include CRUD, and basic related queries (such as propagated updates utilising joins).

More interesting is the analytics side, wherein I'll be executing a variety of Machine Learning, information filtering (recommender systems, internal search engine, most with some element of Natural Language Processing), time series sequence matching and related tasks. Some of these require near-realtime responses, whereas others can be delayed significantly.

I posted something similar to this on StackOverflow, but it was very quickly removed. Haven't tried LinkedIn or Quora, probably worth a shot. Worried about speaking to enterprise sales people, as they're being paid to push their own offering (and I doubt they have extensive benchmarks across all their competitors).

Thanks for your continuing advice,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Sat, Jan 31, 2015 at 12:22 AM, Jacques Nadeau <ja...@apache.org> wrote:

> Samuel,
>
> You've come and asked your question on the Apache Drill group so of 
> course the answer is Apache Drill is best for everything, right?
>
> The reality is that each tool has a set of strengths and weaknesses 
> for each particular use case. An Apache user support mailing list is 
> definitely NOT the place to have this discussion.  You're really 
> asking for technology selection advice and this entire topic is very 
> subjective. The people in any one community would never do full 
> justice to all the options. As such I suggest you use another forum such as Quora or LinkedIn to get advice.
> (There is also a helpful article on Gigaom that just came out 
> yesterday and all sorts of friendly sales people at companies like 
> MapR and IBM who love giving this kind of advice.)
>
> What we can do here is tell you how Drill can solve or not solve your 
> different use cases and help you work through those.  If you want to go
> into more detail on those, we'd be happy to help.
>
> Thanks again for the interest. Sorry if this seems abrupt but these 
> threads generally aren't productive and tend to be very divisive.
>
> Welcome to the community :)
>
> Jacques
> On Jan 30, 2015 3:28 AM, "Samuel Marks" <sa...@gmail.com> wrote:
>
> > Since Hadoop <https://hive.apache.org> came out, there have been
> > various commercial and/or open-source attempts to expose some
> > compatibility with SQL <http://drill.apache.org>. Obviously by
> > posting here I am not expecting an unbiased answer.
> >
> > Seeking an SQL-on-Hadoop offering which provides: low-latency querying,
> > and supports the most common CRUD <https://spark.apache.org>, including
> > [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT *
> > FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
> > Transactional support would be nice also, but is not a must-have.
> >
> > Essentially I want a full replacement for the more traditional 
> > RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
> >
> > Python is my language of choice for interfacing, however there does
> > seem to be a Python JDBC wrapper <https://spark.apache.org/sql>.
> >
> > Here is what I've found thus far:
> >
> >    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
> >    thanks to the Stinger initiative)
> >    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
> >    - Apache Spark <https://spark.apache.org> (Spark SQL
> >    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
> >    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
> >    or Parquet <http://parquet.io/>)
> >    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
> >    <http://hbase.apache.org>, lacks full transaction
> >    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
> >    operators <http://en.wikipedia.org/wiki/Relational_operators> and some
> >    built-in functions)
> >    - Cloudera Impala
> >    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
> >    (significant HiveQL support, some SQL language support, no support for
> >    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
> >    amongst others)
> >    - Presto <https://github.com/facebook/presto> from Facebook (can query
> >    Hive, Cassandra <http://cassandra.apache.org>, relational DBs &etc.
> >    Doesn't seem to be designed for low-latency responses across small
> >    clusters, or support UPDATE operations. It is optimized for data
> >    warehousing or analytics¹
> >    <http://prestodb.io/docs/current/overview/use-cases.html>)
> >    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
> >    community edition <https://www.mapr.com/products/hadoop-download>
> >    (seems to be a packaging of Hive, HP Vertica
> >    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
> >    Drill and a native ODBC wrapper
> >    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
> >    - Apache Kylin <http://www.kylin.io> from Ebay (provides an SQL
> >    interface and multi-dimensional analysis [OLAP
> >    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
> >    supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
> >    Hive and HBase; and seems targeted at very large data-sets though
> >    maintains low query latency)
> >    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
> >    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
> >    support [benchmarks against Hive and Impala
> >    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
> >    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
> >    Lingual <http://docs.cascading.org/lingual/1.0/>²
> >    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
> >    JDBC Drivers, a SQL command shell, and a catalog manager for publishing
> >    files [or any resource] as schemas and tables.")
> >
> > Which—from this list or elsewhere—would you recommend, and why?
> > Thanks for all suggestions,
> >
> > Samuel Marks
> > http://linkedin.com/in/samuelmarks
> >
>

Re: Which [open-souce] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.

Dear Jacques,

Seeing the support for SQL:2003 syntax, nested objects, and schema-free SQL
in Apache Drill is quite impressive, not to mention the useful ODBC
interface alongside the expected JDBC one. Additionally, on the scalability
side, your documentation claims: "Scales from a single laptop to a
1000-node cluster".

You mention that this entire topic is subjective. I suppose with
insufficient information about my use-case, you may just be right.

Without giving away my full use-case—FYI: I will be open-sourcing what I'm
building—I will tell you a little bit about the components.

The generic components would just include CRUD, and basic related queries
(such as propagated updates utilising joins).

More interesting is the analytics side, wherein I'll be executing a
variety of Machine Learning, information filtering (recommender systems,
internal search engine, most with some element of Natural Language
Processing), time series sequence matching and related tasks. Some of these
require near-realtime responses, whereas others can be delayed
significantly.

I posted something similar to this on StackOverflow, but it was very
quickly removed. Haven't tried LinkedIn or Quora, probably worth a shot.
Worried about speaking to enterprise sales people, as they're being paid to
push their own offering (and I doubt they have extensive benchmarks across
all their competitors).

Thanks for your continuing advice,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Sat, Jan 31, 2015 at 12:22 AM, Jacques Nadeau <ja...@apache.org> wrote:

> Samuel,
>
> You've come and asked your question on the Apache Drill group so of course
> the answer is Apache Drill is best for everything, right?
>
> The reality is that each tool has a set of strengths and weaknesses for
> each particular use case. An Apache user support mailing list is definitely
> NOT the place to have this discussion.  You're really asking for technology
> selection advice and this entire topic is very subjective. The people in
> any one community would never do full justice to all the options. As such I
> suggest you use another forum such as Quora or LinkedIn to get advice.
> (There is also a helpful article on Gigaom that just came out yesterday and
> all sorts of friendly sales people at companies like MapR and IBM who love
> giving this kind of advice.)
>
> What we can do here is tell you how Drill can solve or not solve your
> different use cases and help you work through those.  If you want to go
> into more detail on those, we'd be happy to help.
>
> Thanks again for the interest. Sorry if this seems abrupt but these threads
> generally aren't productive and tend to be very divisive.
>
> Welcome to the community :)
>
> Jacques
> On Jan 30, 2015 3:28 AM, "Samuel Marks" <sa...@gmail.com> wrote:
>
> > Since Hadoop <https://hive.apache.org> came out, there have been various
> > commercial and/or open-source attempts to expose some compatibility with
> > SQL <http://drill.apache.org>. Obviously by posting here I am not
> > expecting an unbiased answer.
> >
> > Seeking an SQL-on-Hadoop offering which provides: low-latency querying,
> > and supports the most common CRUD <https://spark.apache.org>, including
> > [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT *
> > FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
> > Transactional support would be nice also, but is not a must-have.
> >
> > Essentially I want a full replacement for the more traditional RDBMS, one
> > which can scale from 1 node to a serious Hadoop cluster.
> >
> > Python is my language of choice for interfacing, however there does
> > seem to be a Python JDBC wrapper <https://spark.apache.org/sql>.
> >
> > Here is what I've found thus far:
> >
> >    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
> >    thanks to the Stinger initiative)
> >    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
> >    - Apache Spark <https://spark.apache.org> (Spark SQL
> >    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
> >    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
> >    or Parquet <http://parquet.io/>)
> >    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
> >    <http://hbase.apache.org>, lacks full transaction
> >    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
> >    operators <http://en.wikipedia.org/wiki/Relational_operators> and some
> >    built-in functions)
> >    - Cloudera Impala
> >    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
> >    (significant HiveQL support, some SQL language support, no support for
> >    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
> >    amongst others)
> >    - Presto <https://github.com/facebook/presto> from Facebook (can query
> >    Hive, Cassandra <http://cassandra.apache.org>, relational DBs &etc.
> >    Doesn't seem to be designed for low-latency responses across small
> >    clusters, or support UPDATE operations. It is optimized for data
> >    warehousing or analytics¹
> >    <http://prestodb.io/docs/current/overview/use-cases.html>)
> >    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
> >    community edition <https://www.mapr.com/products/hadoop-download>
> >    (seems to be a packaging of Hive, HP Vertica
> >    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
> >    Drill and a native ODBC wrapper
> >    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
> >    - Apache Kylin <http://www.kylin.io> from Ebay (provides an SQL
> >    interface and multi-dimensional analysis [OLAP
> >    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
> >    supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
> >    Hive and HBase; and seems targeted at very large data-sets though
> >    maintains low query latency)
> >    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
> >    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
> >    support [benchmarks against Hive and Impala
> >    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
> >    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
> >    Lingual <http://docs.cascading.org/lingual/1.0/>²
> >    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
> >    JDBC Drivers, a SQL command shell, and a catalog manager for publishing
> >    files [or any resource] as schemas and tables.")
> >
> > Which—from this list or elsewhere—would you recommend, and why?
> > Thanks for all suggestions,
> >
> > Samuel Marks
> > http://linkedin.com/in/samuelmarks
> >
>

Re: Which [open-souce] SQL engine atop Hadoop?

Posted by Jacques Nadeau <ja...@apache.org>.

Samuel,

You've come and asked your question on the Apache Drill group so of course
the answer is Apache Drill is best for everything, right?

The reality is that each tool has a set of strengths and weaknesses for
each particular use case. An Apache user support mailing list is definitely
NOT the place to have this discussion.  You're really asking for technology
selection advice and this entire topic is very subjective. The people in
any one community would never do full justice to all the options. As such I
suggest you use another forum such as Quora or LinkedIn to get advice.
(There is also a helpful article on Gigaom that just came out yesterday and
all sorts of friendly sales people at companies like MapR and IBM who love
giving this kind of advice.)

What we can do here is tell you how Drill can solve or not solve your
different use cases and help you work through those.  If you want to go into
more detail on those, we'd be happy to help.

Thanks again for the interest. Sorry if this seems abrupt but these threads
generally aren't productive and tend to be very divisive.

Welcome to the community :)

Jacques
On Jan 30, 2015 3:28 AM, "Samuel Marks" <sa...@gmail.com> wrote:

> Since Hadoop <https://hive.apache.org> came out, there have been various
> commercial and/or open-source attempts to expose some compatibility with
> SQL <http://drill.apache.org>. Obviously by posting here I am not expecting
> an unbiased answer.
>
> Seeking an SQL-on-Hadoop offering which provides: low-latency querying, and
> supports the most common CRUD <https://spark.apache.org>, including [the
> basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM,
> UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional
> support would be nice also, but is not a must-have.
>
> Essentially I want a full replacement for the more traditional RDBMS, one
> which can scale from 1 node to a serious Hadoop cluster.
>
> Python is my language of choice for interfacing, however there does seem to
> be a Python JDBC wrapper <https://spark.apache.org/sql>.
>
> Here is what I've found thus far:
>
>    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
>    thanks to the Stinger initiative)
>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>    - Apache Spark <https://spark.apache.org> (Spark SQL
>    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>    or Parquet <http://parquet.io/>)
>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
>    <http://hbase.apache.org>, lacks full transaction
>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>    operators <http://en.wikipedia.org/wiki/Relational_operators> and some
>    built-in functions)
>    - Cloudera Impala
>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>    (significant HiveQL support, some SQL language support, no support for
>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>    amongst others)
>    - Presto <https://github.com/facebook/presto> from Facebook (can query
>    Hive, Cassandra <http://cassandra.apache.org>, relational DBs &etc.
>    Doesn't seem to be designed for low-latency responses across small
>    clusters, or support UPDATE operations. It is optimized for data
>    warehousing or analytics¹
>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>    community edition <https://www.mapr.com/products/hadoop-download>
>    (seems to be a packaging of Hive, HP Vertica
>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>    Drill and a native ODBC wrapper
>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>    - Apache Kylin <http://www.kylin.io> from Ebay (provides an SQL
>    interface and multi-dimensional analysis [OLAP
>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
>    supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
>    Hive and HBase; and seems targeted at very large data-sets though
>    maintains low query latency)
>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>    support [benchmarks against Hive and Impala
>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
>    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
>    JDBC Drivers, a SQL command shell, and a catalog manager for publishing
>    files [or any resource] as schemas and tables.")
>
> Which—from this list or elsewhere—would you recommend, and why?
> Thanks for all suggestions,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>