Posted to user@hive.apache.org by Samuel Marks <sa...@gmail.com> on 2015/01/30 12:26:31 UTC

Which [open-source] SQL engine atop Hadoop?

Since Hadoop <https://hadoop.apache.org> came out, there have been various
commercial and/or open-source attempts to expose some compatibility with SQL
<http://en.wikipedia.org/wiki/SQL>. Obviously by posting here I am not expecting an
unbiased answer.

Seeking an SQL-on-Hadoop offering which provides low-latency querying and
supports the most common CRUD operations
<http://en.wikipedia.org/wiki/Create,_read,_update_and_delete>, including [the
basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE
Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional support
would be nice also, but is not a must-have.

Essentially I want a full replacement for the more traditional RDBMS, one
which can scale from 1 node to a serious Hadoop cluster.

Python is my language of choice for interfacing; however, there does seem to
be a Python JDBC wrapper (e.g. JayDeBeApi
<https://pypi.python.org/pypi/JayDeBeApi>).
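
For instance (an untested sketch, assuming the JayDeBeApi package plus a JDBC
driver jar for whichever engine is chosen; driver class, URL and paths below
are placeholders), the basics above could be driven from Python like so:

    import jaydebeapi

    # Hypothetical HiveServer2 endpoint; swap in the driver/URL of the
    # engine chosen (Phoenix, Tajo, Lingual etc. all ship JDBC drivers).
    conn = jaydebeapi.connect(
        'org.apache.hive.jdbc.HiveDriver',
        'jdbc:hive2://localhost:10000/default',
        [],
        '/path/to/hive-jdbc-standalone.jar')
    curs = conn.cursor()
    curs.execute("CREATE TABLE t (c1 INT, c2 VARCHAR(10))")
    curs.execute("INSERT INTO t VALUES (1, 'a')")  # Phoenix spells this UPSERT
    curs.execute("SELECT * FROM t")
    print(curs.fetchall())
    curs.execute("UPDATE t SET c1 = 2 WHERE c2 = 'a'")  # only where supported
    curs.execute("DROP TABLE t")
    conn.close()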

Here is what I've found thus far:

   - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
   thanks to the Stinger initiative)
   - Apache Drill <http://drill.apache.org> (ANSI SQL support)
   - Apache Spark <https://spark.apache.org> (Spark SQL
   <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
   <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
   or Parquet <http://parquet.io/>)
   - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
   <http://hbase.apache.org>, lacks full transaction
   <http://en.wikipedia.org/wiki/Database_transaction> support, relational
   operators <http://en.wikipedia.org/wiki/Relational_operators> and some
   built-in functions)
   - Cloudera Impala
   <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
   (significant HiveQL support, some SQL language support, no support for
   indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
   amongst others)
   - Presto <https://github.com/facebook/presto> from Facebook (can query
   Hive, Cassandra <http://cassandra.apache.org>, relational DBs &etc.
   Doesn't seem to be designed for low-latency responses across small
   clusters, or support UPDATE operations. It is optimized for data
   warehousing or analytics¹
   <http://prestodb.io/docs/current/overview/use-cases.html>)
   - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
   community edition <https://www.mapr.com/products/hadoop-download> (seems
   to be a packaging of Hive, HP Vertica
   <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
   Drill and a native ODBC wrapper
   <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
   - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
   interface and multi-dimensional analysis [OLAP
   <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
   supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
   Hive and HBase; and seems targeted at very large data-sets though maintains
   low query latency)
   - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance
   with JDBC <http://en.wikipedia.org/wiki/JDBC> driver support [benchmarks
   against Hive and Impala
   <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
   ])
   - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
   Lingual <http://docs.cascading.org/lingual/1.0/>²
   <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
   JDBC Drivers, a SQL command shell, and a catalog manager for publishing
   files [or any resource] as schemas and tables.")

Which—from this list or elsewhere—would you recommend, and why?
Thanks for all suggestions,

Samuel Marks
http://linkedin.com/in/samuelmarks

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Koert Kuipers <ko...@tresata.com>.
Spark-SQL is read-only yes, in the sense that it does not support mutation
but only transformation to a new dataset that you store separately.

i am not aware of many systems that support mutation. systems that support
mutation will not use HDFS as the datastore. so something like Phoenix
(backed by HBase) will be needed for that.
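
to illustrate (rough sketch, spark 1.2-era python api; table and path names
are made up): instead of an UPDATE you select the corrected rows into a new
dataset and save that somewhere else.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="transform-not-mutate")
    hc = HiveContext(sc)  # picks table definitions up from the hive metastore

    # instead of: UPDATE t SET c1 = 2 WHERE c2 = 'a'
    fixed = hc.sql(
        "SELECT CASE WHEN c2 = 'a' THEN 2 ELSE c1 END AS c1, c2 FROM t")

    # ...store the transformed rows separately as a new dataset
    fixed.saveAsParquetFile("/data/t_fixed.parquet")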

On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <ko...@tresata.com> wrote:

> yes you can run whatever you like with the data in hdfs. keep in mind that
> hive makes this general access pattern just a little harder, since hive has
> a tendency to store data and metadata separately, with the metadata in a
> special metadata store (not on hdfs), and its not as easy for all systems
> to access hive metadata.
>
> i am not familiar at all with tajo or drill.
>
> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <sa...@gmail.com>
> wrote:
>
>> Thanks for the advice
>>
>> Koert: when everything is in the same essential data-store (HDFS), can't
>> I just run whatever complex tools in whichever paradigm I like?
>>
>> E.g.: GraphX, Mahout &etc.
>>
>> Also, what about Tajo or Drill?
>>
>> Best,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>>
>> PS: Spark-SQL is read-only IIRC, right?
>> On 31 Jan 2015 03:39, "Koert Kuipers" <ko...@tresata.com> wrote:
>>
>>> since you require high-powered analytics, and i assume you want to stay
>>> sane while doing so, you require the ability to "drop out of sql" when
>>> needed. so spark-sql and lingual would be my choices.
>>>
>>> low latency indicates phoenix or spark-sql to me.
>>>
>>> so i would say spark-sql
>>>
>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <sa...@gmail.com>
>>> wrote:
>>>
>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does open-source
>>>> a lot of software <http://www.pivotal.io/oss>, I don't believe they
>>>> open source Pivotal HD: HAWQ.
>>>>
>>>> So that doesn't meet my requirements. I should note that the project I
>>>> am building will also be open-source, which heightens the importance of
>>>> having all components also being open-source.
>>>>
>>>> Cheers,
>>>>
>>>> Samuel Marks
>>>> http://linkedin.com/in/samuelmarks
>>>>
>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>> siddharth.tiwari@live.com> wrote:
>>>>
>>>>> Have you looked at HAWQ from Pivotal ?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <sa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Since Hadoop <https://hadoop.apache.org> came out, there have been
>>>>> various commercial and/or open-source attempts to expose some compatibility
>>>>> with SQL <http://en.wikipedia.org/wiki/SQL>. Obviously by posting here I am
>>>>> not expecting an unbiased answer.
>>>>>
>>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency
>>>>> querying, and supports the most common CRUD <http://en.wikipedia.org/wiki/Create,_read,_update_and_delete>,
>>>>> including [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT
>>>>> * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>>>>> Transactional support would be nice also, but is not a must-have.
>>>>>
>>>>> Essentially I want a full replacement for the more traditional RDBMS,
>>>>> one which can scale from 1 node to a serious Hadoop cluster.
>>>>>
>>>>> Python is my language of choice for interfacing, however there does
>>>>> seem to be a Python JDBC wrapper (e.g. JayDeBeApi <https://pypi.python.org/pypi/JayDeBeApi>).
>>>>>
>>>>> Here is what I've found thus far:
>>>>>
>>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with
>>>>>    interactive SQL thanks to the Stinger initiative)
>>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>    <https://spark.apache.org/sql>, queries only, add data via Hive,
>>>>>    RDD
>>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>    or Parquet <http://parquet.io/>)
>>>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>>    HBase <http://hbase.apache.org>, lacks full transaction
>>>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>>    operators <http://en.wikipedia.org/wiki/Relational_operators> and
>>>>>    some built-in functions)
>>>>>    - Cloudera Impala
>>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>    (significant HiveQL support, some SQL language support, no support for
>>>>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>>>>    amongst others)
>>>>>    - Presto <https://github.com/facebook/presto> from Facebook (can
>>>>>    query Hive, Cassandra <http://cassandra.apache.org>, relational
>>>>>    DBs &etc. Doesn't seem to be designed for low-latency responses across
>>>>>    small clusters, or support UPDATE operations. It is optimized for
>>>>>    data warehousing or analytics¹
>>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>>>>    community edition <https://www.mapr.com/products/hadoop-download>
>>>>>    (seems to be a packaging of Hive, HP Vertica
>>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>,
>>>>>    SparkSQL, Drill and a native ODBC wrapper
>>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>>>>    interface and multi-dimensional analysis [OLAP
>>>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop
>>>>>    and supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
>>>>>    Hive and HBase; and seems targeted at very large data-sets though maintains
>>>>>    low query latency)
>>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>>>>    support [benchmarks against Hive and Impala
>>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>>>    ])
>>>>>    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>>>>    provides JDBC Drivers, a SQL command shell, and a catalog manager for
>>>>>    publishing files [or any resource] as schemas and tables.")
>>>>>
>>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>>> Thanks for all suggestions,
>>>>>
>>>>> Samuel Marks
>>>>> http://linkedin.com/in/samuelmarks
>>>>>
>>>>>
>>>>
>>>
>

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Devopam Mittra <de...@gmail.com>.
hi Samuel,
Apologies for the delay in response, as well as for overlooking the Presto
mention in your initial post.
#IMHO :
Presto is lightweight, easy to install and configure.
It does not support "UPDATE" .. hmm, i don't need updates in Big Data
analytics where i can have a temp / intermediate table, which will be faster
as well (by the way, I don't know how many others provide true UPDATE
capabilities).
I am happy with Hive itself, and don't need Presto for my ad-hoc analytics
since the overhead of MR job kick-off timing is not overwhelming compared
to the total query execution time.
Presto is good for me when I need to run parameterized and fixed queries
from a dashboard directly on my HDP cluster, as it reduces my screen-staring
time.
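
For example, a fixed dashboard query can be fired from Python in a few lines
(rough sketch, assuming the PyHive package; the host, port and the orders
table are placeholders):

    from pyhive import presto

    conn = presto.connect(host='presto-host', port=8080,
                          catalog='hive', schema='default')
    curs = conn.cursor()
    curs.execute("SELECT region, count(*) AS orders FROM orders "
                 "WHERE order_date = %(d)s GROUP BY region",
                 {'d': '2015-02-03'})   # parameter bound client-side
    print(curs.fetchall())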

Hope you find it helpful in your decision making.

regards
Devopam



On Tue, Feb 3, 2015 at 1:57 PM, Samuel Marks <sa...@gmail.com> wrote:

> Thanks Devopam,
>
> In my initial post I did mention Presto, with this review:
> " can query Hive, Cassandra <http://cassandra.apache.org/>, relational
> DBs &etc. Doesn't seem to be designed for low-latency responses across
> small clusters, or support UPDATE operations. It is optimized for data
> warehousing or analytics¹
> <http://prestodb.io/docs/current/overview/use-cases.html>"
>
> Your thoughts?
>
> Best,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
> On 03/02/2015 6:06 pm, "Devopam Mittra" <de...@gmail.com> wrote:
>
>> hi Samuel,
>> You may wish to evaluate Presto (https://prestodb.io/) , which has an
>> added advantage of being faster than conventional Hive due to no MR jobs
>> being fired.
>> It has a dependency on Hive metastore though , through which it derives
>> the mechanism to execute the queries directly on source files.
>> The only flip side I found was the absence of complex SQL syntax that
>> means creating a lot of intermediate tables for slightly complicated
>> calculations (and imho, all calculations become complex sooner than we
>> intend them to)
>>
>> regards
>> Devopam
>>
>> On Tue, Feb 3, 2015 at 10:30 AM, Samuel Marks <sa...@gmail.com>
>> wrote:
>>
>>> Alexander: So would you recommend using Phoenix for all but those kinds
>>> of queries, and switching to Hive+Tez for the rest? - Is that feasible?
>>>
>>> Checking their documentation, it looks like it just might be:
>>> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
>>>
>>> There is some early work on a Hive + Phoenix integration on GitHub:
>>> https://github.com/nmaillard/Phoenix-Hive
>>>
>>> Saurabh: I am sure there are a variety of very good non open-source
>>> products on the market :) - However in this thread I am only looking at
>>> open-source options. Additionally I am planning on open-sourcing this
>>> project I am building using these tools, so it makes even more sense that
>>> the entire toolset and their dependencies are also open-source.
>>>
>>> Best,
>>>
>>> Samuel Marks
>>> http://linkedin.com/in/samuelmarks
>>>
>>> On Tue, Feb 3, 2015 at 2:33 PM, Saurabh B <sa...@gmail.com>
>>> wrote:
>>>
>>>> This is not open source but we are using Vertica and it works very
>>>> nicely for us. There is a 1TB community edition but above that it costs
>>>> money.
>>>> It has really advanced SQL (analytical functions, etc), works like an
>>>> RDBMS, has R/Java/C++ SDK and scales nicely. There is a similar option of
>>>> Redshift available but Vertica has more features (pattern matching
>>>> functions, etc).
>>>>
>>>> Again, not open source so I would be interested to know what you end up
>>>> going with and what your experience is.
>>>>
>>>> On Mon, Feb 2, 2015 at 12:08 AM, Samuel Marks <sa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Well what I am seeking is a Big Data database that can work with Small
>>>>> Data also. I.e.: scaleable from one node to vast clusters; whilst
>>>>> maintaining relatively low latency throughout.
>>>>>
>>>>> Which fit into this category?
>>>>>
>>>>> Samuel Marks
>>>>> http://linkedin.com/in/samuelmarks
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>>
>


-- 
Devopam Mittra
Life and Relations are not binary

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.
Thanks Devopam,

In my initial post I did mention Presto, with this review:
" can query Hive, Cassandra <http://cassandra.apache.org/>, relational DBs
&etc. Doesn't seem to be designed for low-latency responses across small
clusters, or support UPDATE operations. It is optimized for data
warehousing or analytics¹
<http://prestodb.io/docs/current/overview/use-cases.html>"

Your thoughts?

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks
On 03/02/2015 6:06 pm, "Devopam Mittra" <de...@gmail.com> wrote:

> hi Samuel,
> You may wish to evaluate Presto (https://prestodb.io/) , which has an
> added advantage of being faster than conventional Hive due to no MR jobs
> being fired.
> It has a dependency on Hive metastore though , through which it derives
> the mechanism to execute the queries directly on source files.
> The only flip side I found was the absence of complex SQL syntax that
> means creating a lot of intermediate tables for slightly complicated
> calculations (and imho, all calculations become complex sooner than we
> intend them to)
>
> regards
> Devopam
>
> On Tue, Feb 3, 2015 at 10:30 AM, Samuel Marks <sa...@gmail.com>
> wrote:
>
>> Alexander: So would you recommend using Phoenix for all but those kinds of
>> queries, and switching to Hive+Tez for the rest? - Is that feasible?
>>
>> Checking their documentation, it looks like it just might be:
>> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
>>
>> There is some early work on a Hive + Phoenix integration on GitHub:
>> https://github.com/nmaillard/Phoenix-Hive
>>
>> Saurabh: I am sure there are a variety of very good non open-source
>> products on the market :) - However in this thread I am only looking at
>> open-source options. Additionally I am planning on open-sourcing this
>> project I am building using these tools, so it makes even more sense that
>> the entire toolset and their dependencies are also open-source.
>>
>> Best,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>>
>> On Tue, Feb 3, 2015 at 2:33 PM, Saurabh B <sa...@gmail.com>
>> wrote:
>>
>>> This is not open source but we are using Vertica and it works very
>>> nicely for us. There is a 1TB community edition but above that it costs
>>> money.
>>> It has really advanced SQL (analytical functions, etc), works like an
>>> RDBMS, has R/Java/C++ SDK and scales nicely. There is a similar option of
>>> Redshift available but Vertica has more features (pattern matching
>>> functions, etc).
>>>
>>> Again, not open source so I would be interested to know what you end up
>>> going with and what your experience is.
>>>
>>> On Mon, Feb 2, 2015 at 12:08 AM, Samuel Marks <sa...@gmail.com>
>>> wrote:
>>>
>>>> Well what I am seeking is a Big Data database that can work with Small
>>>> Data also. I.e.: scaleable from one node to vast clusters; whilst
>>>> maintaining relatively low latency throughout.
>>>>
>>>> Which fit into this category?
>>>>
>>>> Samuel Marks
>>>> http://linkedin.com/in/samuelmarks
>>>>
>>>
>>>
>>
>
>
> --
> Devopam Mittra
> Life and Relations are not binary
>

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Devopam Mittra <de...@gmail.com>.
hi Samuel,
You may wish to evaluate Presto (https://prestodb.io/), which has the added
advantage of being faster than conventional Hive due to no MR jobs being
fired.
It has a dependency on the Hive metastore though, through which it derives
the mechanism to execute queries directly on source files.
The only flip side I found was the absence of complex SQL syntax, which means
creating a lot of intermediate tables for slightly complicated calculations
(and imho, all calculations become complex sooner than we intend them to).
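
To make the intermediate-table point concrete: the usual workaround for a
missing UPDATE is a CREATE TABLE AS over the corrected rows (rough sketch,
again assuming the PyHive package; the table names are made up):

    from pyhive import presto

    curs = presto.connect(host='presto-host', port=8080,
                          catalog='hive', schema='default').cursor()
    curs.execute(
        "CREATE TABLE orders_fixed AS "
        "SELECT order_id, "
        "       CASE WHEN status = 'PAID ' THEN 'PAID' ELSE status END "
        "         AS status "
        "FROM orders")
    curs.fetchall()   # drain the result so the statement completes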

regards
Devopam

On Tue, Feb 3, 2015 at 10:30 AM, Samuel Marks <sa...@gmail.com> wrote:

> Alexander: So would you recommend using Phoenix for all but those kinds of
> queries, and switching to Hive+Tez for the rest? - Is that feasible?
>
> Checking their documentation, it looks like it just might be:
> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
>
> There is some early work on a Hive + Phoenix integration on GitHub:
> https://github.com/nmaillard/Phoenix-Hive
>
> Saurabh: I am sure there are a variety of very good non open-source
> products on the market :) - However in this thread I am only looking at
> open-source options. Additionally I am planning on open-sourcing this
> project I am building using these tools, so it makes even more sense that
> the entire toolset and their dependencies are also open-source.
>
> Best,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> On Tue, Feb 3, 2015 at 2:33 PM, Saurabh B <sa...@gmail.com>
> wrote:
>
>> This is not open source but we are using Vertica and it works very nicely
>> for us. There is a 1TB community edition but above that it costs money.
>> It has really advanced SQL (analytical functions, etc), works like an
>> RDBMS, has R/Java/C++ SDK and scales nicely. There is a similar option of
>> Redshift available but Vertica has more features (pattern matching
>> functions, etc).
>>
>> Again, not open source so I would be interested to know what you end up
>> going with and what your experience is.
>>
>> On Mon, Feb 2, 2015 at 12:08 AM, Samuel Marks <sa...@gmail.com>
>> wrote:
>>
>>> Well what I am seeking is a Big Data database that can work with Small
>>> Data also. I.e.: scaleable from one node to vast clusters; whilst
>>> maintaining relatively low latency throughout.
>>>
>>> Which fit into this category?
>>>
>>> Samuel Marks
>>> http://linkedin.com/in/samuelmarks
>>>
>>
>>
>


-- 
Devopam Mittra
Life and Relations are not binary

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.
Alexander: So would you recommend using Phoenix for all but those kinds of
queries, and switching to Hive+Tez for the rest? - Is that feasible?

Checking their documentation, it looks like it just might be:
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

There is some early work on a Hive + Phoenix integration on GitHub:
https://github.com/nmaillard/Phoenix-Hive
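
Per that wiki page, the integration comes down to a storage-handler DDL,
something like the following (untested sketch issued via PyHive against
HiveServer2; the host and the table/column names are placeholders):

    from pyhive import hive

    curs = hive.connect('hiveserver2-host').cursor()
    curs.execute("""
        CREATE TABLE hbase_orders (key INT, status STRING)
        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:status")
        TBLPROPERTIES ("hbase.table.name" = "orders")
    """)
    # Hive queries on hbase_orders now go to the underlying HBase table,
    # which Phoenix could also reach through its own schema mapping.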

Saurabh: I am sure there are a variety of very good non open-source
products on the market :) - However in this thread I am only looking at
open-source options. Additionally I am planning on open-sourcing this
project I am building using these tools, so it makes even more sense that
the entire toolset and their dependencies are also open-source.

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Tue, Feb 3, 2015 at 2:33 PM, Saurabh B <sa...@gmail.com> wrote:

> This is not open source but we are using Vertica and it works very nicely
> for us. There is a 1TB community edition but above that it costs money.
> It has really advanced SQL (analytical functions, etc), works like an
> RDBMS, has R/Java/C++ SDK and scales nicely. There is a similar option of
> Redshift available but Vertica has more features (pattern matching
> functions, etc).
>
> Again, not open source so I would be interested to know what you end up
> going with and what your experience is.
>
> On Mon, Feb 2, 2015 at 12:08 AM, Samuel Marks <sa...@gmail.com>
> wrote:
>
>> Well what I am seeking is a Big Data database that can work with Small
>> Data also. I.e.: scaleable from one node to vast clusters; whilst
>> maintaining relatively low latency throughout.
>>
>> Which fit into this category?
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>>
>
>

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Saurabh B <sa...@gmail.com>.
This is not open source but we are using Vertica and it works very nicely
for us. There is a 1TB community edition but above that it costs money.
It has really advanced SQL (analytical functions, etc), works like an
RDBMS, has R/Java/C++ SDK and scales nicely. There is a similar option of
Redshift available but Vertica has more features (pattern matching
functions, etc).

Again, not open source so I would be interested to know what you end up
going with and what your experience is.

On Mon, Feb 2, 2015 at 12:08 AM, Samuel Marks <sa...@gmail.com> wrote:

> Well what I am seeking is a Big Data database that can work with Small
> Data also. I.e.: scaleable from one node to vast clusters; whilst
> maintaining relatively low latency throughout.
>
> Which fit into this category?
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Alexander Pivovarov <ap...@gmail.com>.
Apache Phoenix is super fast for queries which filter data by the table key
(sketch below),
- sub-second latency
- has a good jdbc driver

but has limitations
- no full outer join support
- inner and left outer joins use one computer's memory, so it cannot join a
huge table to a huge table
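
A rough sketch of such a key-filtered query from Python over the jdbc driver
(via JayDeBeApi; the zookeeper quorum, jar path and the events table are all
placeholders, untested):

    import jaydebeapi

    conn = jaydebeapi.connect(
        'org.apache.phoenix.jdbc.PhoenixDriver',
        'jdbc:phoenix:zookeeper-host:2181',
        [],
        '/path/to/phoenix-client.jar')
    curs = conn.cursor()
    curs.execute("UPSERT INTO events (id, payload) VALUES (42, 'hello')")
    conn.commit()  # phoenix does not autocommit by default
    curs.execute("SELECT payload FROM events WHERE id = 42")  # key filter
    print(curs.fetchall())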


On Mon, Feb 2, 2015 at 6:59 PM, Alexander Pivovarov <ap...@gmail.com>
wrote:

> I like Tez engine for hive (aka Stinger initiative)
>
> - faster than MR engine. especially for complex queries with lots of
> nested sub-queries
> - stable
> - min latency is 5-7 sec  (0 sec for select count(*) ...)
> - capable of processing huge datasets (not limited by RAM, unlike Spark)
>
>
> On Mon, Feb 2, 2015 at 6:00 PM, Samuel Marks <sa...@gmail.com>
> wrote:
>
>> Maybe you're right, and what I should be doing is throwing in connectors
>> so that data from regular databases is pushed into HDFS at regular
>> intervals, wherein my "fancier" analytics can be run across larger
>> data-sets.
>>
>> However, I don't want to decide straightaway, for example, Phoenix +
>> Spark may be just the combination I am looking for.
>>
>> Best,
>>
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>>
>> On Mon, Feb 2, 2015 at 5:14 PM, Jörn Franke <jo...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I think you have to think first about your functional and non-functional
>>> requirements. You can scale "normal" SQL databases as well (cf CERN or
>>> Facebook). There are different types of databases for different purposes -
>>> there is no one fits it all. At the moment, we are a few years away from a
>>> one-fits-it-all database that leverages AI etc to automatically scale,
>>> optimize etc processing, storage and network.  Until then you will have to
>>> do the math depending on your requirements.
>>> Once you make them more precise, we will be able to help you more.
>>>
>>> Cheers
>>> On 2 Feb 2015 06:08, "Samuel Marks" <sa...@gmail.com> wrote:
>>>
>>> Well what I am seeking is a Big Data database that can work with Small
>>> Data also. I.e.: scaleable from one node to vast clusters; whilst
>>> maintaining relatively low latency throughout.
>>>
>>> Which fit into this category?
>>>
>>> Samuel Marks
>>> http://linkedin.com/in/samuelmarks
>>>
>>>
>>
>

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Alexander Pivovarov <ap...@gmail.com>.
I like Tez engine for hive (aka Stinger initiative)

- faster than MR engine. especially for complex queries with lots of nested
sub-queries
- stable
- min latency is 5-7 sec  (0 sec for select count(*) ...)
- capable of processing huge datasets (not limited by RAM, unlike Spark)
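
switching engines is a one-line session setting (sketch via PyHive; assumes
Hive 0.13+ with Tez installed, and the host and table are placeholders):

    from pyhive import hive

    curs = hive.connect('hiveserver2-host').cursor()
    curs.execute("SET hive.execution.engine=tez")  # 'mr' switches back
    curs.execute("SELECT count(*) FROM orders")
    print(curs.fetchone())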


On Mon, Feb 2, 2015 at 6:00 PM, Samuel Marks <sa...@gmail.com> wrote:

> Maybe you're right, and what I should be doing is throwing in connectors
> so that data from regular databases is pushed into HDFS at regular
> intervals, wherein my "fancier" analytics can be run across larger
> data-sets.
>
> However, I don't want to decide straightaway, for example, Phoenix + Spark
> may be just the combination I am looking for.
>
> Best,
>
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> On Mon, Feb 2, 2015 at 5:14 PM, Jörn Franke <jo...@gmail.com> wrote:
>
>> Hello,
>>
>> I think you have to think first about your functional and non-functional
>> requirements. You can scale "normal" SQL databases as well (cf CERN or
>> Facebook). There are different types of databases for different purposes -
>> there is no one fits it all. At the moment, we are a few years away from a
>> one-fits-it-all database that leverages AI etc to automatically scale,
>> optimize etc processing, storage and network.  Until then you will have to
>> do the math depending on your requirements.
>> Once you make them more precise, we will be able to help you more.
>>
>> Cheers
>> On 2 Feb 2015 06:08, "Samuel Marks" <sa...@gmail.com> wrote:
>>
>> Well what I am seeking is a Big Data database that can work with Small
>> Data also. I.e.: scaleable from one node to vast clusters; whilst
>> maintaining relatively low latency throughout.
>>
>> Which fit into this category?
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>>
>>
>

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.
Maybe you're right, and what I should be doing is throwing in connectors so
that data from regular databases is pushed into HDFS at regular intervals,
wherein my "fancier" analytics can be run across larger data-sets.
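
For instance, Apache Sqoop could do those periodic pulls; a rough sketch
(the connection string, credentials, table and target dir are placeholders):

    import subprocess

    # cron (or similar) would invoke this on the chosen interval
    subprocess.check_call([
        "sqoop", "import",
        "--connect", "jdbc:mysql://db-host/sales",
        "--username", "etl",
        "--table", "orders",
        "--target-dir", "/data/sales/orders",
        "--incremental", "append",
        "--check-column", "order_id",
        "--last-value", "0",
    ])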

However, I don't want to decide straightaway, for example, Phoenix + Spark
may be just the combination I am looking for.

Best,


Samuel Marks
http://linkedin.com/in/samuelmarks

On Mon, Feb 2, 2015 at 5:14 PM, Jörn Franke <jo...@gmail.com> wrote:

> Hello,
>
> I think you have to think first about your functional and non-functional
> requirements. You can scale "normal" SQL databases as well (cf CERN or
> Facebook). There are different types of databases for different purposes -
> there is no one fits it all. At the moment, we are a few years away from a
> one-fits-it-all database that leverages AI etc to automatically scale,
> optimize etc processing, storage and network.  Until then you will have to
> do the math depending on your requirements.
> Once you make them more precise, we will be able to help you more.
>
> Cheers
> On 2 Feb 2015 06:08, "Samuel Marks" <sa...@gmail.com> wrote:
>
> Well what I am seeking is a Big Data database that can work with Small
> Data also. I.e.: scaleable from one node to vast clusters; whilst
> maintaining relatively low latency throughout.
>
> Which fit into this category?
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
>

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Jörn Franke <jo...@gmail.com>.
Hello,

I think you have to think first about your functional and non-functional
requirements. You can scale "normal" SQL databases as well (cf CERN or
Facebook). There are different types of databases for different purposes -
there is no one fits it all. At the moment, we are a few years away from a
one-fits-it-all database that leverages AI etc to automatically scale,
optimize etc processing, storage and network.  Until then you will have to
do the math depending on your requirements.
Once you make them more precise, we will be able to help you more.

Cheers
On 2 Feb 2015 06:08, "Samuel Marks" <sa...@gmail.com> wrote:

Well what I am seeking is a Big Data database that can work with Small Data
also. I.e.: scaleable from one node to vast clusters; whilst maintaining
relatively low latency throughout.

Which fit into this category?

Samuel Marks
http://linkedin.com/in/samuelmarks

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.
Well what I am seeking is a Big Data database that can work with Small Data
also. I.e.: scaleable from one node to vast clusters; whilst maintaining
relatively low latency throughout.

Which fit into this category?

Samuel Marks
http://linkedin.com/in/samuelmarks

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Koert Kuipers <ko...@tresata.com>.
i would not exclude spark sql unless you really need something mutable in
which case lingual wont work either

On Sat, Jan 31, 2015 at 8:56 PM, Samuel Marks <sa...@gmail.com> wrote:

> Interesting discussion. It looks like the HBase metastore can also be
> configured to use HDFS HA (ex. tutorial
> <http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_hag_hdfs_ha_cdh_components_config.html>
> ).
>
> To get back on topic though, the primary contenders now are: Phoenix,
> Lingual and perhaps Tajo or Drill?
>
> Best,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> On Sun, Feb 1, 2015 at 9:38 AM, Edward Capriolo <ed...@gmail.com>
> wrote:
>
>> "is the metastore thrift definition stable across hive versions?" I would
> say yes. Like many APIs, the core eventually solidifies. No one is saying
> it will never ever change, but basically there are things like "database"
>> and "table" and they have properties like "name". I have some basic scripts
>> that look for table names matching patterns or summarize disk usage by
>> owner. I have not had to touch them very much. Usually if they do change it
>> is something small and if you tie the commit to a jira you can figure out
>> what and why.
>>
>> On Sat, Jan 31, 2015 at 3:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> seems the metastore thrift service supports SASL. thats great. so if i
>>> understand it correctly all i need is the metastore thrift definition to
>>> query the metastore.
>>> is the metastore thrift definition stable across hive versions? if so,
>>> then i can build my app once without worrying about the hive version
>>> deployed. in that case i admit its not as bad as i thought. lets see!
>>>
>>> On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> oh sorry edward, i misread your post. seems we agree that "SQL
>>>> constructs inside hive" are not for other systems.
>>>>
>>>> On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> edward,
>>>>> i would not call "SQL constructs inside hive" accessible for other
>>>>> systems. its inside hive after all
>>>>>
>>>>> it is true that i can contact the metastore in java using
>>>>> HiveMetaStoreClient, but then i need to bring in a whole slew of
>>>>> dependencies (the minimum seems to be hive-metastore, hive-common,
>>>>> hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial and
>>>>> error). these jars need to be "provided" and added to the classpath on the
>>>>> cluster, unless someone is willing to build versions of an application for
>>>>> every hive version out there. and even when you do all this you can only
>>>>> pray its going to be compatible with the next hive version, since backwards
>>>>> compatibility is... well lets just say lacking. the attitude seems to be
>>>>> that hive does not have a java api, so there is nothing that needs to be
>>>>> stable.
>>>>>
>>>>> you are right i could go the pure thrift road. i havent tried that
>>>>> yet. that might just be the best option. but how easy is it to do this with
>>>>> a secure hadoop/hive ecosystem? now i need to handle kerberos myself and
>>>>> somehow pass tokens into thrift i assume?
>>>>>
>>>>> contrast all of this with an avro file on hadoop with metadata baked
>>>>> in, and i think its safe to say hive metadata is not easily accessible.
>>>>>
>>>>> i will take a look at your book. i hope it has an example of using
>>>>> thrift on a secure cluster to contact hive metastore (without using the
>>>>> HiveMetaStoreClient), that would be awesome.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <
>>>>> edlinuxguru@gmail.com> wrote:
>>>>>
>>>>>> "with the metadata in a special metadata store (not on hdfs), and its
>>>>>> not as easy for all systems to access hive metadata." I disagree.
>>>>>>
>>>>>> Hive's metadata is not only accessible through SQL constructs like
>>>>>> "describe table"; the entire meta-store is also actually a thrift
>>>>>> service so you have programmatic access to determine things like what
>>>>>> columns are in a table etc. Thrift creates RPC clients for almost every
>>>>>> major language.
>>>>>>
>>>>>> In the programming hive book
>>>>>> http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
>>>>>> there are even examples where I show how to iterate all the tables inside
>>>>>> the database from a java client.
>>>>>>
>>>>>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <ko...@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> yes you can run whatever you like with the data in hdfs. keep in
>>>>>>> mind that hive makes this general access pattern just a little harder,
>>>>>>> since hive has a tendency to store data and metadata separately, with the
>>>>>>> metadata in a special metadata store (not on hdfs), and its not as easy for
>>>>>>> all systems to access hive metadata.
>>>>>>>
>>>>>>> i am not familiar at all with tajo or drill.
>>>>>>>
>>>>>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <samuelmarks@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Thanks for the advice
>>>>>>>>
>>>>>>>> Koert: when everything is in the same essential data-store (HDFS),
>>>>>>>> can't I just run whatever complex tools in whichever paradigm I like?
>>>>>>>>
>>>>>>>> E.g.: GraphX, Mahout &etc.
>>>>>>>>
>>>>>>>> Also, what about Tajo or Drill?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Samuel Marks
>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>
>>>>>>>> PS: Spark-SQL is read-only IIRC, right?
>>>>>>>> On 31 Jan 2015 03:39, "Koert Kuipers" <ko...@tresata.com> wrote:
>>>>>>>>
>>>>>>>>> since you require high-powered analytics, and i assume you want to
>>>>>>>>> stay sane while doing so, you require the ability to "drop out of sql" when
>>>>>>>>> needed. so spark-sql and lingual would be my choices.
>>>>>>>>>
>>>>>>>>> low latency indicates phoenix or spark-sql to me.
>>>>>>>>>
>>>>>>>>> so i would say spark-sql
>>>>>>>>>
>>>>>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <
>>>>>>>>> samuelmarks@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>>>>>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does open-source
>>>>>>>>>> a lot of software <http://www.pivotal.io/oss>, I don't believe
>>>>>>>>>> they open source Pivotal HD: HAWQ.
>>>>>>>>>>
>>>>>>>>>> So that doesn't meet my requirements. I should note that the
>>>>>>>>>> project I am building will also be open-source, which heightens the
>>>>>>>>>> importance of having all components also being open-source.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Samuel Marks
>>>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>>>>>>>> siddharth.tiwari@live.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Have you looked at HAWQ from Pivotal ?
>>>>>>>>>>>
>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>
>>>>>>>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <sa...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Since Hadoop <https://hadoop.apache.org> came out, there have
>>>>>>>>>>> been various commercial and/or open-source attempts to expose some
>>>>>>>>>>> compatibility with SQL <http://en.wikipedia.org/wiki/SQL>. Obviously by
>>>>>>>>>>> posting here I am not expecting an unbiased answer.
>>>>>>>>>>>
>>>>>>>>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency
>>>>>>>>>>> querying, and supports the most common CRUD
>>>>>>>>>>> <http://en.wikipedia.org/wiki/Create,_read,_update_and_delete>, including [the basics!] along these
>>>>>>>>>>> lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table
>>>>>>>>>>> SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional
>>>>>>>>>>> support would be nice also, but is not a must-have.
>>>>>>>>>>>
>>>>>>>>>>> Essentially I want a full replacement for the more traditional
>>>>>>>>>>> RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
>>>>>>>>>>>
>>>>>>>>>>> Python is my language of choice for interfacing, however there
>>>>>>>>>>> does seem to be a Python JDBC wrapper
>>>>>>>>>>> (e.g. JayDeBeApi <https://pypi.python.org/pypi/JayDeBeApi>).
>>>>>>>>>>>
>>>>>>>>>>> Here is what I've found thus far:
>>>>>>>>>>>
>>>>>>>>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with
>>>>>>>>>>>    interactive SQL thanks to the Stinger initiative)
>>>>>>>>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>>>>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>>>>>>>    <https://spark.apache.org/sql>, queries only, add data via
>>>>>>>>>>>    Hive, RDD
>>>>>>>>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>>>>>>>    or Parquet <http://parquet.io/>)
>>>>>>>>>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>>>>>>>>    HBase <http://hbase.apache.org>, lacks full transaction
>>>>>>>>>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>>>>>>>>    operators <http://en.wikipedia.org/wiki/Relational_operators>
>>>>>>>>>>>    and some built-in functions)
>>>>>>>>>>>    - Cloudera Impala
>>>>>>>>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>>>>>>>    (significant HiveQL support, some SQL language support, no support for
>>>>>>>>>>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>>>>>>>>>>    amongst others)
>>>>>>>>>>>    - Presto <https://github.com/facebook/presto> from Facebook
>>>>>>>>>>>    (can query Hive, Cassandra <http://cassandra.apache.org>,
>>>>>>>>>>>    relational DBs &etc. Doesn't seem to be designed for low-latency responses
>>>>>>>>>>>    across small clusters, or support UPDATE operations. It is
>>>>>>>>>>>    optimized for data warehousing or analytics¹
>>>>>>>>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>>>>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop>
>>>>>>>>>>>    via MapR community edition
>>>>>>>>>>>    <https://www.mapr.com/products/hadoop-download> (seems to be
>>>>>>>>>>>    a packaging of Hive, HP Vertica
>>>>>>>>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>,
>>>>>>>>>>>    SparkSQL, Drill and a native ODBC wrapper
>>>>>>>>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>>>>>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an
>>>>>>>>>>>    SQL interface and multi-dimensional analysis [OLAP
>>>>>>>>>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on
>>>>>>>>>>>    Hadoop and supports most ANSI SQL query functions". It depends on HDFS,
>>>>>>>>>>>    MapReduce, Hive and HBase; and seems targeted at very large data-sets
>>>>>>>>>>>    though maintains low query latency)
>>>>>>>>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL
>>>>>>>>>>>    standard compliance with JDBC
>>>>>>>>>>>    <http://en.wikipedia.org/wiki/JDBC> driver support [benchmarks
>>>>>>>>>>>    against Hive and Impala
>>>>>>>>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>>>>>>>>>    ])
>>>>>>>>>>>    - Cascading
>>>>>>>>>>>    <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>>>>>>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>>>>>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support>
>>>>>>>>>>>    ("Lingual provides JDBC Drivers, a SQL command shell, and a catalog manager
>>>>>>>>>>>    for publishing files [or any resource] as schemas and tables.")
>>>>>>>>>>>
>>>>>>>>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>>>>>>>>> Thanks for all suggestions,
>>>>>>>>>>>
>>>>>>>>>>> Samuel Marks
>>>>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.
Interesting discussion. It looks like the HBase metastore can also be
configured to use HDFS HA (ex. tutorial
<http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_hag_hdfs_ha_cdh_components_config.html>
).

To get back on topic though, the primary contenders now are: Phoenix,
Lingual and perhaps Tajo or Drill?

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Sun, Feb 1, 2015 at 9:38 AM, Edward Capriolo <ed...@gmail.com>
wrote:

> "is the metastore thrift definition stable across hive versions?" I would
> say yes. Like many APIs, the core eventually solidifies. No one is saying
> it will never ever change, but basically there are things like "database"
> and "table" and they have properties like "name". I have some basic scripts
> that look for table names matching patterns or summarize disk usage by
> owner. I have not had to touch them very much. Usually if they do change it
> is something small and if you tie the commit to a jira you can figure out
> what and why.
>
> On Sat, Jan 31, 2015 at 3:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> seems the metastore thrift service supports SASL. thats great. so if i
>> understand it correctly all i need is the metastore thrift definition to
>> query the metastore.
>> is the metastore thrift definition stable across hive versions? if so,
>> then i can build my app once without worrying about the hive version
>> deployed. in that case i admit its not as bad as i thought. lets see!
>>
>> On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> oh sorry edward, i misread your post. seems we agree that "SQL constructs
>>> inside hive" are not for other systems.
>>>
>>> On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> edward,
>>>> i would not call "SQL constructs inside hive" accessible for other
>>>> systems. its inside hive after all
>>>>
>>>> it is true that i can contact the metastore in java using
>>>> HiveMetaStoreClient, but then i need to bring in a whole slew of
>>>> dependencies (the minimum seems to be hive-metastore, hive-common,
>>>> hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial and
>>>> error). these jars need to be "provided" and added to the classpath on the
>>>> cluster, unless someone is willing to build versions of an application for
>>>> every hive version out there. and even when you do all this you can only
>>>> pray its going to be compatible with the next hive version, since backwards
>>>> compatibility is... well lets just say lacking. the attitude seems to be
>>>> that hive does not have a java api, so there is nothing that needs to be
>>>> stable.
>>>>
>>>> you are right i could go the pure thrift road. i havent tried that yet.
>>>> that might just be the best option. but how easy is it to do this with a
>>>> secure hadoop/hive ecosystem? now i need to handle kerberos myself and
>>>> somehow pass tokens into thrift i assume?
>>>>
>>>> contrast all of this with an avro file on hadoop with metadata baked
>>>> in, and i think its safe to say hive metadata is not easily accessible.
>>>>
>>>> i will take a look at your book. i hope it has an example of using
>>>> thrift on a secure cluster to contact hive metastore (without using the
>>>> HiveMetaStoreClient), that would be awesome.
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <edlinuxguru@gmail.com
>>>> > wrote:
>>>>
>>>>> "with the metadata in a special metadata store (not on hdfs), and its
>>>>> not as easy for all systems to access hive metadata." I disagree.
>>>>>
>>>>> Hive's metadata is not only accessible through SQL constructs like
>>>>> "describe table"; the entire meta-store is also actually a thrift
>>>>> service so you have programmatic access to determine things like what
>>>>> columns are in a table etc. Thrift creates RPC clients for almost every
>>>>> major language.
>>>>>
>>>>> In the programming hive book
>>>>> http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
>>>>> there are even examples where I show how to iterate all the tables inside
>>>>> the database from a java client.
>>>>>
>>>>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <ko...@tresata.com>
>>>>> wrote:
>>>>>
>>>>>> yes you can run whatever you like with the data in hdfs. keep in mind
>>>>>> that hive makes this general access pattern just a little harder, since
>>>>>> hive has a tendency to store data and metadata separately, with the
>>>>>> metadata in a special metadata store (not on hdfs), and its not as easy for
>>>>>> all systems to access hive metadata.
>>>>>>
>>>>>> i am not familiar at all with tajo or drill.
>>>>>>
>>>>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <sa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the advice
>>>>>>>
>>>>>>> Koert: when everything is in the same essential data-store (HDFS),
>>>>>>> can't I just run whatever complex tools in whichever paradigm I like?
>>>>>>>
>>>>>>> E.g.: GraphX, Mahout &etc.
>>>>>>>
>>>>>>> Also, what about Tajo or Drill?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Samuel Marks
>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>
>>>>>>> PS: Spark-SQL is read-only IIRC, right?
>>>>>>> On 31 Jan 2015 03:39, "Koert Kuipers" <ko...@tresata.com> wrote:
>>>>>>>
>>>>>>>> since you require high-powered analytics, and i assume you want to
>>>>>>>> stay sane while doing so, you require the ability to "drop out of sql" when
>>>>>>>> needed. so spark-sql and lingual would be my choices.
>>>>>>>>
>>>>>>>> low latency indicates phoenix or spark-sql to me.
>>>>>>>>
>>>>>>>> so i would say spark-sql
>>>>>>>>
>>>>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <
>>>>>>>> samuelmarks@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>>>>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does open-source
>>>>>>>>> a lot of software <http://www.pivotal.io/oss>, I don't believe
>>>>>>>>> they open source Pivotal HD: HAWQ.
>>>>>>>>>
>>>>>>>>> So that doesn't meet my requirements. I should note that the
>>>>>>>>> project I am building will also be open-source, which heightens the
>>>>>>>>> importance of having all components also being open-source.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Samuel Marks
>>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>>
>>>>>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>>>>>>> siddharth.tiwari@live.com> wrote:
>>>>>>>>>
>>>>>>>>>> Have you looked at HAWQ from Pivotal ?
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <sa...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Since Hadoop <https://hadoop.apache.org> came out, there have been
>>>>>>>>>> various commercial and/or open-source attempts to expose some compatibility
>>>>>>>>>> with SQL <http://en.wikipedia.org/wiki/SQL>. Obviously by posting here I
>>>>>>>>>> am not expecting an unbiased answer.
>>>>>>>>>>
>>>>>>>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency
>>>>>>>>>> querying, and supports the most common CRUD
>>>>>>>>>> <http://en.wikipedia.org/wiki/Create,_read,_update_and_delete>, including [the basics!] along these
>>>>>>>>>> lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table
>>>>>>>>>> SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional
>>>>>>>>>> support would be nice also, but is not a must-have.
>>>>>>>>>>
>>>>>>>>>> Essentially I want a full replacement for the more traditional
>>>>>>>>>> RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
>>>>>>>>>>
>>>>>>>>>> Python is my language of choice for interfacing, however there
>>>>>>>>>> does seem to be a Python JDBC wrapper
>>>>>>>>>> (e.g. JayDeBeApi <https://pypi.python.org/pypi/JayDeBeApi>).
>>>>>>>>>>
>>>>>>>>>> Here is what I've found thus far:
>>>>>>>>>>
>>>>>>>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with
>>>>>>>>>>    interactive SQL thanks to the Stinger initiative)
>>>>>>>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>>>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>>>>>>    <https://spark.apache.org/sql>, queries only, add data via
>>>>>>>>>>    Hive, RDD
>>>>>>>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>>>>>>    or Parquet <http://parquet.io/>)
>>>>>>>>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>>>>>>>    HBase <http://hbase.apache.org>, lacks full transaction
>>>>>>>>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>>>>>>>    operators <http://en.wikipedia.org/wiki/Relational_operators>
>>>>>>>>>>    and some built-in functions)
>>>>>>>>>>    - Cloudera Impala
>>>>>>>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>>>>>>    (significant HiveQL support, some SQL language support, no support for
>>>>>>>>>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>>>>>>>>>    amongst others)
>>>>>>>>>>    - Presto <https://github.com/facebook/presto> from Facebook
>>>>>>>>>>    (can query Hive, Cassandra <http://cassandra.apache.org>,
>>>>>>>>>>    relational DBs &etc. Doesn't seem to be designed for low-latency responses
>>>>>>>>>>    across small clusters, or support UPDATE operations. It is
>>>>>>>>>>    optimized for data warehousing or analytics¹
>>>>>>>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>>>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>>>>>>>>>    community edition
>>>>>>>>>>    <https://www.mapr.com/products/hadoop-download> (seems to be
>>>>>>>>>>    a packaging of Hive, HP Vertica
>>>>>>>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>,
>>>>>>>>>>    SparkSQL, Drill and a native ODBC wrapper
>>>>>>>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>>>>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an
>>>>>>>>>>    SQL interface and multi-dimensional analysis [OLAP
>>>>>>>>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on
>>>>>>>>>>    Hadoop and supports most ANSI SQL query functions". It depends on HDFS,
>>>>>>>>>>    MapReduce, Hive and HBase; and seems targeted at very large data-sets
>>>>>>>>>>    though maintains low query latency)
>>>>>>>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>>>>>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC>
>>>>>>>>>>    driver support [benchmarks against Hive and Impala
>>>>>>>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>>>>>>>>    ])
>>>>>>>>>>    - Cascading
>>>>>>>>>>    <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>>>>>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>>>>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support>
>>>>>>>>>>    ("Lingual provides JDBC Drivers, a SQL command shell, and a catalog manager
>>>>>>>>>>    for publishing files [or any resource] as schemas and tables.")
>>>>>>>>>>
>>>>>>>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>>>>>>>> Thanks for all suggestions,
>>>>>>>>>>
>>>>>>>>>> Samuel Marks
>>>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Edward Capriolo <ed...@gmail.com>.
"is the metastore thrift definition stable across hive versions?" I would
say yes. Like many APIs, the core eventually solidifies. No one is saying
it will never ever change, but basically there are things like "database"
and "table" and they have properties like "name". I have some basic scripts
that look for table names matching patterns or summarize disk usage by
owner. I have not had to touch them very much. Usually if they do change it
is something small and if you tie the commit to a jira you can figure out
what and why.
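
For what it's worth, the pure-thrift route Koert mentions looks roughly like
this in Python, with bindings generated from hive_metastore.thrift
(thrift --gen py). Untested sketch: host, port and table names are
placeholders, and a kerberized cluster would additionally need a SASL
transport.

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hive_metastore import ThriftHiveMetastore  # generated module

    sock = TSocket.TSocket('metastore-host', 9083)
    transport = TTransport.TBufferedTransport(sock)
    client = ThriftHiveMetastore.Client(
        TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()
    try:
        print(client.get_all_tables('default'))    # just the names
        t = client.get_table('default', 'orders')  # full table metadata
        print([c.name for c in t.sd.cols])         # column names
    finally:
        transport.close()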

On Sat, Jan 31, 2015 at 3:02 PM, Koert Kuipers <ko...@tresata.com> wrote:

> seems the metastore thrift service supports SASL. thats great. so if i
> understand it correctly all i need is the metastore thrift definition to
> query the metastore.
> is the metastore thrift definition stable across hive versions? if so,
> then i can build my app once without worrying about the hive version
> deployed. in that case i admit its not as bad as i thought. lets see!
>
> On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> oh sorry edward, i misread your post. seems we agree that "SQL constructs
>> inside hive" are not for other systems.
>>
>> On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> edward,
>>> i would not call "SQL constructs inside hive" accessible for other
>>> systems. its inside hive after all
>>>
>>> it is true that i can contact the metastore in java using
>>> HiveMetaStoreClient, but then i need to bring in a whole slew of
>>> dependencies (the minimum seems to be hive-metastore, hive-common,
>>> hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial and
>>> error). these jars need to be "provided" and added to the classpath on the
>>> cluster, unless someone is willing to build versions of an application for
>>> every hive version out there. and even when you do all this you can only
>>> pray its going to be compatible with the next hive version, since backwards
>>> compatibility is... well lets just say lacking. the attitude seems to be
>>> that hive does not have a java api, so there is nothing that needs to be
>>> stable.
>>>
>>> you are right i could go the pure thrift road. i havent tried that yet.
>>> that might just be the best option. but how easy is it to do this with a
>>> secure hadoop/hive ecosystem? now i need to handle kerberos myself and
>>> somehow pass tokens into thrift i assume?
>>>
>>> contrast all of this with an avro file on hadoop with metadata baked in,
>>> and i think its safe to say hive metadata is not easily accessible.
>>>
>>> i will take a look at your book. i hope it has an example of using
>>> thrift on a secure cluster to contact hive metastore (without using the
>>> HiveMetaStoreClient), that would be awesome.
>>>
>>>
>>>
>>>
>>> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <ed...@gmail.com>
>>> wrote:
>>>
>>>> "with the metadata in a special metadata store (not on hdfs), and its
>>>> not as easy for all systems to access hive metadata." I disagree.
>>>>
>>>> Hives metadata is not only accessible through the SQL constructs like
>>>> "describe table". But the entire meta-store also is actually a thrift
>>>> service so you have programmatic access to determine things like what
>>>> columns are in a table etc. Thrift creates RPC clients for almost every
>>>> major language.
>>>>
>>>> In the programming hive book
>>>> http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
>>>> there is even examples where I show how to iterate all the tables inside
>>>> the database from a java client.
>>>>
>>>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> yes you can run whatever you like with the data in hdfs. keep in mind
>>>>> that hive makes this general access pattern just a little harder, since
>>>>> hive has a tendency to store data and metadata separately, with the
>>>>> metadata in a special metadata store (not on hdfs), and its not as easy for
>>>>> all systems to access hive metadata.
>>>>>
>>>>> i am not familiar at all with tajo or drill.
>>>>>
>>>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <sa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the advice
>>>>>>
>>>>>> Koert: when everything is in the same essential data-store (HDFS),
>>>>>> can't I just run whatever complex tools I'm whichever paradigm they like?
>>>>>>
>>>>>> E.g.: GraphX, Mahout &etc.
>>>>>>
>>>>>> Also, what about Tajo or Drill?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Samuel Marks
>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>
>>>>>> PS: Spark-SQL is read-only IIRC, right?
>>>>>> On 31 Jan 2015 03:39, "Koert Kuipers" <ko...@tresata.com> wrote:
>>>>>>
>>>>>>> since you require high-powered analytics, and i assume you want to
>>>>>>> stay sane while doing so, you require the ability to "drop out of sql" when
>>>>>>> needed. so spark-sql and lingual would be my choices.
>>>>>>>
>>>>>>> low latency indicates phoenix or spark-sql to me.
>>>>>>>
>>>>>>> so i would say spark-sql
>>>>>>>
>>>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <samuelmarks@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>>>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does open-source
>>>>>>>> a lot of software <http://www.pivotal.io/oss>, I don't believe
>>>>>>>> they open source Pivotal HD: HAWQ.
>>>>>>>>
>>>>>>>> So that doesn't meet my requirements. I should note that the
>>>>>>>> project I am building will also be open-source, which heightens the
>>>>>>>> importance of having all components also being open-source.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Samuel Marks
>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>
>>>>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>>>>>> siddharth.tiwari@live.com> wrote:
>>>>>>>>
>>>>>>>>> Have you looked at HAWQ from Pivotal ?
>>>>>>>>>
>>>>>>>>> Sent from my iPhone
>>>>>>>>>
>>>>>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <sa...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Since Hadoop <https://hive.apache.org> came out, there have been
>>>>>>>>> various commercial and/or open-source attempts to expose some compatibility
>>>>>>>>> with SQL <http://drill.apache.org>. Obviously by posting here I
>>>>>>>>> am not expecting an unbiased answer.
>>>>>>>>>
>>>>>>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency
>>>>>>>>> querying, and supports the most common CRUD
>>>>>>>>> <https://spark.apache.org>, including [the basics!] along these
>>>>>>>>> lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table SET
>>>>>>>>> C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional support
>>>>>>>>> would be nice also, but is not a must-have.
>>>>>>>>>
>>>>>>>>> Essentially I want a full replacement for the more traditional
>>>>>>>>> RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
>>>>>>>>>
>>>>>>>>> Python is my language of choice for interfacing, however there
>>>>>>>>> does seem to be a Python JDBC wrapper
>>>>>>>>> <https://spark.apache.org/sql>.
>>>>>>>>>
>>>>>>>>> Here is what I've found thus far:
>>>>>>>>>
>>>>>>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with
>>>>>>>>>    interactive SQL thanks to the Stinger initiative)
>>>>>>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>>>>>    <https://spark.apache.org/sql>, queries only, add data via
>>>>>>>>>    Hive, RDD
>>>>>>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>>>>>    or Paraquet <http://parquet.io/>)
>>>>>>>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>>>>>>    HBase <http://hbase.apache.org>, lacks full transaction
>>>>>>>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>>>>>>    operators <http://en.wikipedia.org/wiki/Relational_operators>
>>>>>>>>>    and some built-in functions)
>>>>>>>>>    - Cloudera Impala
>>>>>>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>>>>>    (significant HiveQL support, some SQL language support, no support for
>>>>>>>>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>>>>>>>>    amongst others)
>>>>>>>>>    - Presto <https://github.com/facebook/presto> from Facebook
>>>>>>>>>    (can query Hive, Cassandra <http://cassandra.apache.org>,
>>>>>>>>>    relational DBs &etc. Doesn't seem to be designed for low-latency responses
>>>>>>>>>    across small clusters, or support UPDATE operations. It is
>>>>>>>>>    optimized for data warehousing or analytics¹
>>>>>>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>>>>>>>>    community edition
>>>>>>>>>    <https://www.mapr.com/products/hadoop-download> (seems to be a
>>>>>>>>>    packaging of Hive, HP Vertica
>>>>>>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>,
>>>>>>>>>    SparkSQL, Drill and a native ODBC wrapper
>>>>>>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>>>>>    - Apache Kylin <http://www.kylin.io> from Ebay (provides an
>>>>>>>>>    SQL interface and multi-dimensional analysis [OLAP
>>>>>>>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on
>>>>>>>>>    Hadoop and supports most ANSI SQL query functions". It depends on HDFS,
>>>>>>>>>    MapReduce, Hive and HBase; and seems targeted at very large data-sets
>>>>>>>>>    though maintains low query latency)
>>>>>>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>>>>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC>
>>>>>>>>>    driver support [benchmarks against Hive and Impala
>>>>>>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>>>>>>>    ])
>>>>>>>>>    - Cascading
>>>>>>>>>    <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>>>>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>>>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>>>>>>>>    provides JDBC Drivers, a SQL command shell, and a catalog manager for
>>>>>>>>>    publishing files [or any resource] as schemas and tables.")
>>>>>>>>>
>>>>>>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>>>>>>> Thanks for all suggestions,
>>>>>>>>>
>>>>>>>>> Samuel Marks
>>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Which [open-souce] SQL engine atop Hadoop?

Posted by Koert Kuipers <ko...@tresata.com>.
seems the metastore thrift service supports SASL. thats great. so if i
understand it correctly all i need is the metastore thrift definition to
query the metastore.
is the metastore thrift definition stable across hive versions? if so, then
i can build my app once without worrying about the hive version deployed.
in that case i admit its not as bad as i thought. lets see!
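
For reference, a minimal sketch of that pure-thrift approach against an
unsecured metastore (host is a placeholder; a kerberized cluster would
additionally need the transport wrapped, e.g. in a SASL transport, which is
the open question in this thread):

    // Raw client generated from hive_metastore.thrift; no HiveMetaStoreClient,
    // no SASL/kerberos handling. 9083 is the default metastore port.
    import org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class RawMetastoreClient {
      public static void main(String[] args) throws Exception {
        TTransport transport = new TSocket("metastore-host", 9083);
        transport.open();
        try {
          ThriftHiveMetastore.Client client =
              new ThriftHiveMetastore.Client(new TBinaryProtocol(transport));
          for (String db : client.get_all_databases()) {
            for (String table : client.get_all_tables(db)) {
              System.out.println(db + "." + table);
            }
          }
        } finally {
          transport.close();
        }
      }
    }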

Re: Which [open-souce] SQL engine atop Hadoop?

Posted by Koert Kuipers <ko...@tresata.com>.
oh sorry edward, i misread your post. seems we agree that "SQL constructs
inside hive" are not for other systems.

Re: Which [open-souce] SQL engine atop Hadoop?

Posted by Edward Capriolo <ed...@gmail.com>.
1: "SQL constructs inside hive" <--use jdbc driver "describe table" read
result set
2: "use thrift"
3: web hcat
https://cwiki.apache.org/confluence/display/Hive/WebHCat+InstallWebHCat#WebHCatInstallWebHCat-WebHCatInstalledwithHive
4: Just go the mysql db that backs the metastore and query directly

That gives you 4 ways to get at hive's meta data.
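
To make 1 and 4 concrete, rough sketches of both (connection strings,
credentials and the table name are placeholders; TBLS and DBS are tables in
the schema backing the metastore, and the hive and mysql jdbc drivers are
assumed to be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class MetastoreTwoWays {
      public static void main(String[] args) throws Exception {
        // Option 1: DESCRIBE over the HiveServer2 jdbc driver.
        try (Connection hive = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver2-host:10000/default");
             Statement st = hive.createStatement();
             ResultSet rs = st.executeQuery("DESCRIBE page_view")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
          }
        }
        // Option 4: query the mysql db that backs the metastore directly.
        try (Connection mysql = DriverManager.getConnection(
                 "jdbc:mysql://metastore-db-host/metastore", "hive", "secret");
             Statement st = mysql.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT d.NAME, t.TBL_NAME, t.OWNER"
                 + " FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + "." + rs.getString(2)
                + "\towner=" + rs.getString(3));
          }
        }
      }
    }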

>> "since backwards compatibility is... well lets just say lacking"
Welcome to open source software. Or all software in general really.

All I am getting at was there is 4 ways right there to get at the metadata.

>>"but how easy is it to do this with a secure hadoop/hive ecosystem? now i
need to handle kerberos myself and somehow pass tokens into thrift i
assume?"
Frankly I do not give a crud about the "secure bla bla" but I have seen
several tickets on thrift/sasl so I assume someone does.

My only point was hive seems to give 4 ways to get at the metadata, which
is better than, say, mysql or vertica, which only really give you the option
to do #1 over jdbc.

Hive actually works with avro formats, where it can read the schema from the
data (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe), so that
other than pointing your "table" at a folder, the metadata is magic. Which
is what you are basically describing.
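
For instance, a sketch of such a table created over jdbc (table name, paths
and schema url are placeholders; with no column list in the DDL, hive takes
the columns from the avro schema):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AvroBackedTable {
      public static void main(String[] args) throws Exception {
        try (Connection hive = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver2-host:10000/default");
             Statement st = hive.createStatement()) {
          // External table over a folder of avro files; the column list
          // comes from the referenced avro schema, not from this statement.
          st.execute(
              "CREATE EXTERNAL TABLE page_view"
              + " ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'"
              + " STORED AS INPUTFORMAT"
              + " 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'"
              + " OUTPUTFORMAT"
              + " 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'"
              + " LOCATION '/data/page_view'"
              + " TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/page_view.avsc')");
        }
      }
    }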

So again it depends on your definition of easily accessible. But the fact
that I have a thrift API which I can use to walk through the tables in a
database makes it more accessible than many other databases I am aware of.

Re: Which [open-souce] SQL engine atop Hadoop?

Posted by Koert Kuipers <ko...@tresata.com>.
edward,
i would not call "SQL constructs inside hive" accessible for other systems.
its inside hive after all

it is true that i can contact the metastore in java using
HiveMetaStoreClient, but then i need to bring in a whole slew of
dependencies (the minimum seems to be hive-metastore, hive-common,
hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial and
error). these jars need to be "provided" and added to the classpath on the
cluster, unless someone is willing to build versions of an application for
every hive version out there. and even when you do all this you can only
pray its going to be compatible with the next hive version, since backwards
compatibility is... well lets just say lacking. the attitude seems to be
that hive does not have a java api, so there is nothing that needs to be
stable.

you are right i could go the pure thrift road. i havent tried that yet.
that might just be the best option. but how easy is it to do this with a
secure hadoop/hive ecosystem? now i need to handle kerberos myself and
somehow pass tokens into thrift i assume?

contrast all of this with an avro file on hadoop with metadata baked in,
and i think its safe to say hive metadata is not easily accessible.

i will take a look at your book. i hope it has an example of using thrift
on a secure cluster to contact hive metastore (without using the
HiveMetaStoreClient), that would be awesome.

Re: Which [open-souce] SQL engine atop Hadoop?

Posted by Edward Capriolo <ed...@gmail.com>.
"with the metadata in a special metadata store (not on hdfs), and its not
as easy for all systems to access hive metadata." I disagree.

Hive's metadata is not only accessible through SQL constructs like
"describe table". The entire metastore is also actually a thrift
service, so you have programmatic access to determine things like what
columns are in a table, etc. Thrift creates RPC clients for almost every
major language.

In the Programming Hive book
http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
there are even examples where I show how to iterate all the tables inside
the database from a java client.

Re: Which [open-souce] SQL engine atop Hadoop?

Posted by Koert Kuipers <ko...@tresata.com>.
yes you can run whatever you like with the data in hdfs. keep in mind that
hive makes this general access pattern just a little harder, since hive has
a tendency to store data and metadata separately, with the metadata in a
special metadata store (not on hdfs), and its not as easy for all systems
to access hive metadata.

i am not familiar at all with tajo or drill.

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.
Thanks for the advice

Koert: when everything is in the same essential data-store (HDFS), can't I
just run whatever complex tools in whichever paradigm I like?

E.g.: GraphX, Mahout &etc.

Also, what about Tajo or Drill?

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks

PS: Spark-SQL is read-only IIRC, right?

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Koert Kuipers <ko...@tresata.com>.
since you require high-powered analytics, and i assume you want to stay
sane while doing so, you need the ability to "drop out of sql" when needed.
so spark-sql and lingual would be my choices.

low latency points to phoenix or spark-sql for me.

so i would say spark-sql.
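
for example, something along these lines (a rough sketch against the spark 1.x
python api, assuming a spark build with hive support; table and column names
are made up):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName='drop-out-of-sql')
    sqlContext = HiveContext(sc)  # picks up table definitions from the hive metastore

    # start in sql for the easy relational part...
    rows = sqlContext.sql("SELECT user_id, score FROM events WHERE score > 10")

    # ...then drop out into plain rdd operations when sql gets awkward
    top = (rows.map(lambda r: (r.user_id, r.score))
               .reduceByKey(max)
               .takeOrdered(10, key=lambda kv: -kv[1]))
    print(top)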

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.
HAWQ is pretty nifty thanks to its full SQL compliance (ANSI 92) and its JDBC
and ODBC interfaces. However, although Pivotal does open-source a lot of
software <http://www.pivotal.io/oss>, I don't believe they open-source
Pivotal HD: HAWQ.
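
For what it's worth, since HAWQ descends from Greenplum and speaks the
PostgreSQL wire protocol, a standard Python driver should be able to talk to
it; a hedged sketch (host, database and credentials are placeholders):

    import psycopg2

    # connect to the hawq master as if it were a postgres server
    conn = psycopg2.connect(host='hawq-master', dbname='analytics',
                            user='gpadmin', password='secret')
    cur = conn.cursor()
    cur.execute('SELECT version()')  # should report its greenplum/postgres lineage
    print(cur.fetchone())
    cur.close()
    conn.close()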

So that doesn't meet my requirements. I should note that the project I am
building will also be open-source, which heightens the importance of every
component also being open-source.

Cheers,

Samuel Marks
http://linkedin.com/in/samuelmarks

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Siddharth Tiwari <si...@live.com>.
Have you looked at HAWQ from Pivotal?

Sent from my iPhone

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.
Hi Uli,

My use-case is two-fold: generic CRUD and "high-powered" analytics.

There are various offerings that will push data back to HDFS at regular
intervals; even Apache Sqoop <http://sqoop.apache.org> can do that.
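
For instance, a naive sketch of a recurring Sqoop import driven from Python
(the JDBC URL, table name and interval are hypothetical; a real deployment
would use Sqoop's incremental options plus cron or Oozie for scheduling):

    import subprocess
    import time

    while True:
        # one-shot import of an rdbms table into a fresh hdfs directory
        subprocess.check_call([
            'sqoop', 'import',
            '--connect', 'jdbc:mysql://db-host/sales',
            '--table', 'orders',
            '--target-dir', '/data/orders/%d' % int(time.time()),
        ])
        time.sleep(3600)  # hourly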

However, I was thinking it'd be better to keep everything in the Hadoop (or
cache-atop-Hadoop) space, to reduce levels of indirection and to ease, e.g.,
explanation, debugging, tracing, profiling and orchestration.

The generic components would just include CRUD and basic related queries
(such as propagated updates utilising joins).

More interesting is the analytics side, wherein I'll be executing a variety
of Machine Learning, Natural Language Processing, recommender, time-series
sequence-matching and related tasks. Some of these require near-realtime
responses, whereas others can be delayed significantly.

I haven't actually looked at Splice Machine. It's just Apache Derby + Apache
HBase married together cleanly, right? It doesn't seem to be open-source,
though… definitely an interesting project.

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Fri, Jan 30, 2015 at 10:54 PM, Uli Bethke <ul...@sonra.io> wrote:

>  What exactly is your use case? Analytics or OLTP?
> Have you looked at Splice Machine? If your use case is OLTP, have you
> looked at NewSQL offerings (outside Hadoop)?
> Cheers
> uli
>
> --
> ___________________________
> Uli Bethke
> Co-founder Sonra
> p: +353 86 32 83 040
> w: www.sonra.io
> l: linkedin.com/in/ulibethke
> t: twitter.com/ubethke
>
>