Posted to mapreduce-user@hadoop.apache.org by Samuel Marks <sa...@gmail.com> on 2015/01/26 15:19:52 UTC
Which [open-source] SQL engine atop Hadoop?
Since Hadoop came out, there have been various commercial and/or open-source
attempts to expose some compatibility with SQL.
I am seeking one which is good for low-latency querying and supports the
most common CRUD operations, including [the basics!] along these lines:
CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE,
DELETE FROM, and DROP TABLE.
I will be using it from Python; there does seem to be a Python JDBC
wrapper, however. Additionally, it needs to be scalable for big and small
data (starting on a single-node "cluster").
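As a rough sketch of the requirement above: assuming the Python JDBC wrapper exposes a DB-API 2.0 (PEP 249) connection, the six statement shapes can be tried against each candidate engine with a small smoke test. Everything named here is an assumption for illustration: the table t, the column c1 and the crud_smoke_test helper are hypothetical, and sqlite3 is only a local stand-in for a connection to a real engine.

```python
import sqlite3

# The six statement shapes named in the question. The table and column
# names (t, c1) are hypothetical placeholders.
CRUD_STATEMENTS = [
    ("CREATE TABLE", "CREATE TABLE t (c1 INTEGER)"),
    ("INSERT INTO", "INSERT INTO t (c1) VALUES (1)"),
    ("SELECT", "SELECT * FROM t"),
    ("UPDATE", "UPDATE t SET c1 = 2 WHERE c1 = 1"),
    ("DELETE FROM", "DELETE FROM t WHERE c1 = 2"),
    ("DROP TABLE", "DROP TABLE t"),
]


def crud_smoke_test(conn):
    """Run each statement on a DB-API 2.0 connection; map label -> True/False."""
    results = {}
    cursor = conn.cursor()
    for label, sql in CRUD_STATEMENTS:
        try:
            cursor.execute(sql)
            if label == "SELECT":
                cursor.fetchall()  # drain the result set before the next statement
            results[label] = True
        except Exception:  # engines raise different error types for unsupported SQL
            results[label] = False
    cursor.close()
    return results


if __name__ == "__main__":
    # sqlite3 is only a stand-in; swap in a connection to the engine under test.
    connection = sqlite3.connect(":memory:")
    for label, ok in crud_smoke_test(connection).items():
        print(f"{label}: {'supported' if ok else 'not supported'}")
    connection.close()
```

An engine that rejects UPDATE or DELETE would show up as False for those labels, which makes the "queries only" caveats in the list below easy to check per engine.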
Here is what I've found thus far:
- Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
thanks to the Stinger initiative)
- Apache Drill <http://drill.apache.org> (ANSI SQL support)
- Apache Spark <https://spark.apache.org> (Spark SQL
<https://spark.apache.org/sql>, queries only, add data via Hive, RDD
<https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
or Parquet <http://parquet.io/>)
- Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
<http://hbase.apache.org>, lacks full transaction
<http://en.wikipedia.org/wiki/Database_transaction> support, relational
operators <http://en.wikipedia.org/wiki/Relational_operators> and some
built-in functions)
- Presto <https://github.com/facebook/presto> from Facebook (can query
Hive, Cassandra <http://cassandra.apache.org>, relational DBs, etc.
Doesn't seem to be designed for low-latency responses across small
clusters, or to support UPDATE operations. It is optimized for data
warehousing or analytics¹
<http://prestodb.io/docs/current/overview/use-cases.html>)
- SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
community edition <https://www.mapr.com/products/hadoop-download> (seems
to be a packaging of Hive, HP Vertica
<http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
Drill and a native ODBC wrapper
<http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
- Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
interface and multi-dimensional analysis [OLAP
<http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
Hive and HBase, and seems targeted at very large data-sets, though it
maintains low query latency)
- Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance
with JDBC <http://en.wikipedia.org/wiki/JDBC> driver support [benchmarks
against Hive and Impala
<http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
])
- Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
Lingual <http://docs.cascading.org/lingual/1.0/>²
<http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
JDBC Drivers, a SQL command shell, and a catalog manager for publishing
files [or any resource] as schemas and tables.")
Which—from this list or elsewhere—would you recommend, and why?
Thanks for all suggestions,
Samuel Marks
http://linkedin.com/in/samuelmarks
Re: Which [open-source] SQL engine atop Hadoop?
Posted by Samuel Marks <sa...@gmail.com>.
Hey cool, just found this one: http://trafodion.apache.org/
Samuel Marks
http://linkedin.com/in/samuelmarks
On Thu, Feb 5, 2015 at 8:39 PM, Azuryy Yu <az...@gmail.com> wrote:
> please look at:
> http://mail-archives.apache.org/mod_mbox/tajo-user/201502.mbox/browser
>
>
>
> On Tue, Jan 27, 2015 at 5:13 PM, Daniel Haviv <da...@gmail.com>
> wrote:
>
>> Can you elaborate on why you prefer Tajo?
>>
>> Daniel
>>
>> On 27 Jan 2015, at 10:35, Azuryy Yu <az...@gmail.com> wrote:
>>
>> You have listed almost all of the open-source MPP real-time
>> SQL-on-Hadoop engines.
>>
>> I prefer Tajo, which recently released version 0.9.0 and is still a
>> work in progress towards 1.0.
>>
Re: Which [open-source] SQL engine atop Hadoop?
Posted by Azuryy Yu <az...@gmail.com>.
please look at:
http://mail-archives.apache.org/mod_mbox/tajo-user/201502.mbox/browser
On Tue, Jan 27, 2015 at 5:13 PM, Daniel Haviv <da...@gmail.com> wrote:
> Can you elaborate on why you prefer Tajo?
>
> Daniel
>
> On 27 Jan 2015, at 10:35, Azuryy Yu <az...@gmail.com> wrote:
>
> You have listed almost all of the open-source MPP real-time
> SQL-on-Hadoop engines.
>
> I prefer Tajo, which recently released version 0.9.0 and is still a
> work in progress towards 1.0.
Re: Which [open-source] SQL engine atop Hadoop?
Posted by Daniel Haviv <da...@gmail.com>.
Can you elaborate on why you prefer Tajo?
Daniel
> On 27 Jan 2015, at 10:35, Azuryy Yu <az...@gmail.com> wrote:
>
> You have listed almost all of the open-source MPP real-time SQL-on-Hadoop engines.
>
> I prefer Tajo, which recently released version 0.9.0 and is still a work in progress towards 1.0.
Re: Which [open-souce] SQL engine atop Hadoop?
Posted by Daniel Haviv <da...@gmail.com>.
Can you elaborate on why you prefer Tajo?
Daniel
> On 27 בינו׳ 2015, at 10:35, Azuryy Yu <az...@gmail.com> wrote:
>
> You almost list all open sourced MPP real time SQL-ON-Hadoop.
>
> I prefer Tajo, which was relased by 0.9.0 recently, and still working in progress for 1.0
>
>
>> On Mon, Jan 26, 2015 at 10:19 PM, Samuel Marks <sa...@gmail.com> wrote:
>> Since Hadoop came out, there have been various commercial and/or open-source attempts to expose some compatibility with SQL.
>>
>> I am seeking one which is good for low-latency querying, and supports the most common CRUD, including [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>>
>> I will be utilising them from Python, however there does seem to be a Python JDBC wrapper. Additionally it needs to be scalable for big and small data (starting on a single-node "cluster").
>>
>> Here is what I've found thus far:
>>
>> Apache Hive (SQL-like, with interactive SQL thanks to the Stinger initiative)
>> Apache Drill (ANSI SQL support)
>> Apache Spark (Spark SQL, queries only, add data via Hive, RDD or Paraquet)
>> Apache Phoenix (built atop Apache HBase, lacks full transaction support, relational operators and some built-in functions)
>> Presto from Facebook (can query Hive, Cassandra, relational DBs &etc. Doesn't seem to be designed for low-latency responses across small clusters, or support UPDATE operations. It is optimized for data warehousing or analytics¹)
>> SQL-Hadoop via MapR community edition (seems to be a packaging of Hive, HP Vertica, SparkSQL, Drill and a native ODBC wrapper)
>> Apache Kylin from Ebay (provides an SQL interface and multi-dimensional analysis [OLAP], "… offers ANSI SQL on Hadoop and supports most ANSI SQL query functions". It depends on HDFS, MapReduce, Hive and HBase; and seems targeted at very large data-sets though maintains low query latency)
>> Apache Tajo (ANSI/ISO SQL standard compliance with JDBC driver support [benchmarks against Hive and Impala])
>> Cascading's Lingual² ("Lingual provides JDBC Drivers, a SQL command shell, and a catalog manager for publishing files [or any resource] as schemas and tables.")
>> Which—from this list or elsewhere—would you recommend, and why?
>>
>> Thanks for all suggestions,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>
Re: Which [open-souce] SQL engine atop Hadoop?
Posted by Daniel Haviv <da...@gmail.com>.
Can you elaborate on why you prefer Tajo?
Daniel
> On 27 בינו׳ 2015, at 10:35, Azuryy Yu <az...@gmail.com> wrote:
>
> You almost list all open sourced MPP real time SQL-ON-Hadoop.
>
> I prefer Tajo, which was relased by 0.9.0 recently, and still working in progress for 1.0
>
>
>> On Mon, Jan 26, 2015 at 10:19 PM, Samuel Marks <sa...@gmail.com> wrote:
>> Since Hadoop came out, there have been various commercial and/or open-source attempts to expose some compatibility with SQL.
>>
>> I am seeking one which is good for low-latency querying, and supports the most common CRUD, including [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>>
>> I will be utilising them from Python, however there does seem to be a Python JDBC wrapper. Additionally it needs to be scalable for big and small data (starting on a single-node "cluster").
>>
>> Here is what I've found thus far:
>>
>> Apache Hive (SQL-like, with interactive SQL thanks to the Stinger initiative)
>> Apache Drill (ANSI SQL support)
>> Apache Spark (Spark SQL, queries only, add data via Hive, RDD or Paraquet)
>> Apache Phoenix (built atop Apache HBase, lacks full transaction support, relational operators and some built-in functions)
>> Presto from Facebook (can query Hive, Cassandra, relational DBs &etc. Doesn't seem to be designed for low-latency responses across small clusters, or support UPDATE operations. It is optimized for data warehousing or analytics¹)
>> SQL-Hadoop via MapR community edition (seems to be a packaging of Hive, HP Vertica, SparkSQL, Drill and a native ODBC wrapper)
>> Apache Kylin from Ebay (provides an SQL interface and multi-dimensional analysis [OLAP], "… offers ANSI SQL on Hadoop and supports most ANSI SQL query functions". It depends on HDFS, MapReduce, Hive and HBase; and seems targeted at very large data-sets though maintains low query latency)
>> Apache Tajo (ANSI/ISO SQL standard compliance with JDBC driver support [benchmarks against Hive and Impala])
>> Cascading's Lingual² ("Lingual provides JDBC Drivers, a SQL command shell, and a catalog manager for publishing files [or any resource] as schemas and tables.")
>> Which—from this list or elsewhere—would you recommend, and why?
>>
>> Thanks for all suggestions,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>
Re: Which [open-souce] SQL engine atop Hadoop?
Posted by Daniel Haviv <da...@gmail.com>.
Can you elaborate on why you prefer Tajo?
Daniel
> On 27 בינו׳ 2015, at 10:35, Azuryy Yu <az...@gmail.com> wrote:
>
> You almost list all open sourced MPP real time SQL-ON-Hadoop.
>
> I prefer Tajo, which was relased by 0.9.0 recently, and still working in progress for 1.0
>
>
>> On Mon, Jan 26, 2015 at 10:19 PM, Samuel Marks <sa...@gmail.com> wrote:
>> Since Hadoop came out, there have been various commercial and/or open-source attempts to expose some compatibility with SQL.
>>
>> I am seeking one which is good for low-latency querying, and supports the most common CRUD, including [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>>
>> I will be utilising them from Python, however there does seem to be a Python JDBC wrapper. Additionally it needs to be scalable for big and small data (starting on a single-node "cluster").
>>
>> Here is what I've found thus far:
>>
>> Apache Hive (SQL-like, with interactive SQL thanks to the Stinger initiative)
>> Apache Drill (ANSI SQL support)
>> Apache Spark (Spark SQL, queries only, add data via Hive, RDD or Paraquet)
>> Apache Phoenix (built atop Apache HBase, lacks full transaction support, relational operators and some built-in functions)
>> Presto from Facebook (can query Hive, Cassandra, relational DBs, etc. Doesn't seem to be designed for low-latency responses across small clusters, or support UPDATE operations. It is optimized for data warehousing or analytics¹)
>> SQL-Hadoop via MapR community edition (seems to be a packaging of Hive, HP Vertica, SparkSQL, Drill and a native ODBC wrapper)
>> Apache Kylin from eBay (provides an SQL interface and multi-dimensional analysis [OLAP], "… offers ANSI SQL on Hadoop and supports most ANSI SQL query functions". It depends on HDFS, MapReduce, Hive and HBase; and seems targeted at very large data-sets though maintains low query latency)
>> Apache Tajo (ANSI/ISO SQL standard compliance with JDBC driver support [benchmarks against Hive and Impala])
>> Cascading's Lingual² ("Lingual provides JDBC Drivers, a SQL command shell, and a catalog manager for publishing files [or any resource] as schemas and tables.")
>> Which—from this list or elsewhere—would you recommend, and why?
>>
>> Thanks for all suggestions,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>
Re: Which [open-souce] SQL engine atop Hadoop?
Posted by Azuryy Yu <az...@gmail.com>.
You have listed almost all of the open-source MPP real-time SQL-on-Hadoop engines.
I prefer Tajo, which recently released version 0.9.0 and is still a work in
progress toward 1.0.
On Mon, Jan 26, 2015 at 10:19 PM, Samuel Marks <sa...@gmail.com>
wrote:
> Since Hadoop came out, there have been various
> commercial and/or open-source attempts to expose some compatibility with
> SQL.
>
> I am seeking one which is good for low-latency querying, and supports the
> most common CRUD, including [the basics!]
> along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table
> SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>
> I will be utilising them from Python, however there does seem to be a Python
> JDBC wrapper. Additionally it needs to be
> scalable for big and small data (starting on a single-node "cluster").
>
> Here is what I've found thus far:
>
> - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
> SQL thanks to the Stinger initiative)
> - Apache Drill <http://drill.apache.org> (ANSI SQL support)
> - Apache Spark <https://spark.apache.org> (Spark SQL
> <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
> <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
> or Parquet <http://parquet.io/>)
> - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
> <http://hbase.apache.org>, lacks full transaction
> <http://en.wikipedia.org/wiki/Database_transaction> support, relational
> operators <http://en.wikipedia.org/wiki/Relational_operators> and some
> built-in functions)
> - Presto <https://github.com/facebook/presto> from Facebook (can query
> Hive, Cassandra <http://cassandra.apache.org>, relational DBs, etc.
> Doesn't seem to be designed for low-latency responses across small
> clusters, or support UPDATE operations. It is optimized for data
> warehousing or analytics¹
> <http://prestodb.io/docs/current/overview/use-cases.html>)
> - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
> community edition <https://www.mapr.com/products/hadoop-download>
> (seems to be a packaging of Hive, HP Vertica
> <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
> Drill and a native ODBC wrapper
> <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
> - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
> interface and multi-dimensional analysis [OLAP
> <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
> supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
> Hive and HBase; and seems targeted at very large data-sets though maintains
> low query latency)
> - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
> compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
> support [benchmarks against Hive and Impala
> <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
> ])
> - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
> Lingual <http://docs.cascading.org/lingual/1.0/>²
> <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
> provides JDBC Drivers, a SQL command shell, and a catalog manager for
> publishing files [or any resource] as schemas and tables.")
>
> Which—from this list or elsewhere—would you recommend, and why?
> Thanks for all suggestions,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
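For anyone comparing the engines in this thread against the CRUD checklist in the original question, here is a minimal sketch of how that check might be scripted from Python. It assumes the engine can be reached through some DB-API 2.0 connection. The table name t1, the helper probe_crud and the use of the standard-library sqlite3 module as a stand-in are assumptions made purely for illustration and do not come from the thread. SQLite is not a Hadoop engine; it is used only so the script runs anywhere.

```python
import sqlite3

# The six statement types named in the question, in the order they must run.
# The table and column names are hypothetical.
CRUD_CHECKLIST = [
    ("CREATE TABLE", "CREATE TABLE t1 (id INTEGER, c1 INTEGER)"),
    ("INSERT INTO", "INSERT INTO t1 VALUES (1, 1)"),
    ("SELECT * FROM", "SELECT * FROM t1"),
    ("UPDATE ... SET", "UPDATE t1 SET c1 = 2 WHERE id = 1"),
    ("DELETE FROM", "DELETE FROM t1 WHERE id = 1"),
    ("DROP TABLE", "DROP TABLE t1"),
]


def probe_crud(conn):
    """Run each checklist statement on a DB-API 2.0 connection.

    Returns a dict mapping the statement label to True (accepted)
    or False (the driver raised an exception).
    """
    results = {}
    cursor = conn.cursor()
    for label, statement in CRUD_CHECKLIST:
        try:
            cursor.execute(statement)
            # Only fetch when the statement produced a result set.
            if cursor.description is not None:
                cursor.fetchall()
            results[label] = True
        except Exception:
            # DB-API error classes are driver-specific, so catch broadly.
            results[label] = False
    return results


if __name__ == "__main__":
    # Stand-in connection; a real test would use the candidate engine's driver.
    connection = sqlite3.connect(":memory:")
    for label, ok in probe_crud(connection).items():
        print(f"{label:15} {'ok' if ok else 'NOT SUPPORTED'}")
```

The intended use would be to swap the sqlite3 connection for one obtained from a driver for the engine under evaluation. The exact DDL types and which statements are accepted will vary by engine, so the statement strings may need adjusting.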