Posted to user@hive.apache.org by Samuel Marks <sa...@gmail.com> on 2015/02/01 02:56:06 UTC

Re: Which [open-source] SQL engine atop Hadoop?

Interesting discussion. It looks like the HBase metastore can also be
configured to use HDFS HA (ex. tutorial
<http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_hag_hdfs_ha_cdh_components_config.html>
).

To get back on topic though, the primary contenders now are: Phoenix,
Lingual and perhaps Tajo or Drill?

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Sun, Feb 1, 2015 at 9:38 AM, Edward Capriolo <ed...@gmail.com>
wrote:

> "is the metastore thrift definition stable across hive versions?" I would
> say yes. Like many APIs, the core eventually solidifies. No one is saying
> it will never ever change, but basically there are things like "database"
> and "table", and they have properties like "name". I have some basic scripts
> that look for table names matching patterns or summarize disk usage by
> owner. I have not had to touch them very much. Usually if they do change it
> is something small, and if you tie the commit to a JIRA you can figure out
> what and why.
>
> On Sat, Jan 31, 2015 at 3:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> seems the metastore thrift service supports SASL. that's great. so if i
>> understand it correctly, all i need is the metastore thrift definition to
>> query the metastore.
>> is the metastore thrift definition stable across hive versions? if so,
>> then i can build my app once without worrying about the hive version
>> deployed. in that case i admit it's not as bad as i thought. let's see!
>>
>> On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> oh sorry edward, i misread your post. seems we agree that "SQL constructs
>>> inside hive" are not for other systems.
>>>
>>> On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> edward,
>>>> i would not call "SQL constructs inside hive" accessible to other
>>>> systems. it's inside hive, after all.

>>>> it is true that i can contact the metastore in java using
>>>> HiveMetaStoreClient, but then i need to bring in a whole slew of
>>>> dependencies (the minimum seems to be hive-metastore, hive-common,
>>>> hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial and
>>>> error). these jars need to be "provided" and added to the classpath on the
>>>> cluster, unless someone is willing to build versions of an application for
>>>> every hive version out there. and even when you do all this you can only
>>>> pray it's going to be compatible with the next hive version, since backwards
>>>> compatibility is... well, let's just say lacking. the attitude seems to be
>>>> that hive does not have a java api, so there is nothing that needs to be
>>>> stable.
>>>>
>>>> you are right, i could go the pure thrift road. i haven't tried that yet.
>>>> that might just be the best option. but how easy is it to do this with a
>>>> secure hadoop/hive ecosystem? now i need to handle kerberos myself and
>>>> somehow pass tokens into thrift, i assume?
>>>>
>>>> contrast all of this with an avro file on hadoop with metadata baked
>>>> in, and i think it's safe to say hive metadata is not easily accessible.
>>>>
>>>> i will take a look at your book. i hope it has an example of using
>>>> thrift on a secure cluster to contact the hive metastore (without using
>>>> HiveMetaStoreClient); that would be awesome.
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <edlinuxguru@gmail.com
>>>> > wrote:
>>>>
>>>>> "with the metadata in a special metadata store (not on hdfs), and its
>>>>> not as easy for all systems to access hive metadata." I disagree.
>>>>>
>>>>> Hive's metadata is not only accessible through SQL constructs like
>>>>> "describe table": the entire metastore is also a thrift service, so you
>>>>> have programmatic access to determine things like which columns are in a
>>>>> table, etc. Thrift generates RPC clients for almost every major language.
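The SQL-level route mentioned above can be tried from any Hive shell; a minimal sketch (the table name is purely hypothetical):

```sql
-- inspect metastore contents through HiveQL itself, no thrift client needed
SHOW TABLES;
DESCRIBE EXTENDED my_table;  -- my_table is a hypothetical table name
```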
>>>>>
>>>>> In the Programming Hive book
>>>>> http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
>>>>> there are even examples where I show how to iterate over all the tables
>>>>> inside the database from a java client.
>>>>>
>>>>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <ko...@tresata.com>
>>>>> wrote:
>>>>>
>>>>>> yes, you can run whatever you like with the data in hdfs. keep in mind
>>>>>> that hive makes this general access pattern just a little harder, since
>>>>>> hive has a tendency to store data and metadata separately, with the
>>>>>> metadata in a special metadata store (not on hdfs), and it's not as easy
>>>>>> for all systems to access hive metadata.
>>>>>>
>>>>>> i am not familiar at all with tajo or drill.
>>>>>>
>>>>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <sa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the advice
>>>>>>>
>>>>>>> Koert: when everything is in the same essential data-store (HDFS),
>>>>>>> can't I just run whatever complex tools I like, in whichever paradigm
>>>>>>> they use?
>>>>>>>
>>>>>>> E.g.: GraphX, Mahout &etc.
>>>>>>>
>>>>>>> Also, what about Tajo or Drill?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Samuel Marks
>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>
>>>>>>> PS: Spark-SQL is read-only IIRC, right?
>>>>>>> On 31 Jan 2015 03:39, "Koert Kuipers" <ko...@tresata.com> wrote:
>>>>>>>
>>>>>>>> since you require high-powered analytics, and i assume you want to
>>>>>>>> stay sane while doing so, you require the ability to "drop out of sql" when
>>>>>>>> needed. so spark-sql and lingual would be my choices.
>>>>>>>>
>>>>>>>> low latency indicates phoenix or spark-sql to me.
>>>>>>>>
>>>>>>>> so i would say spark-sql
>>>>>>>>
>>>>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <
>>>>>>>> samuelmarks@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>>>>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does open-source
>>>>>>>>> a lot of software <http://www.pivotal.io/oss>, I don't believe
>>>>>>>>> they open source Pivotal HD: HAWQ.
>>>>>>>>>
>>>>>>>>> So that doesn't meet my requirements. I should note that the
>>>>>>>>> project I am building will also be open-source, which heightens the
>>>>>>>>> importance of having all components also being open-source.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Samuel Marks
>>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>>
>>>>>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>>>>>>> siddharth.tiwari@live.com> wrote:
>>>>>>>>>
>>>>>>>>>> Have you looked at HAWQ from Pivotal ?
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <sa...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Since Hadoop <https://hadoop.apache.org> came out, there have been
>>>>>>>>>> various commercial and/or open-source attempts to expose some
>>>>>>>>>> compatibility with SQL. Obviously by posting here I am not expecting
>>>>>>>>>> an unbiased answer.
>>>>>>>>>>
>>>>>>>>>> Seeking an SQL-on-Hadoop offering which provides low-latency
>>>>>>>>>> querying and supports the most common CRUD operations, including
>>>>>>>>>> [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT
>>>>>>>>>> * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>>>>>>>>>> Transactional support would be nice also, but is not a must-have.
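Spelled out in full, the basic statement set requested above amounts to the following (table and column names are purely illustrative, and not every engine discussed in this thread supports all six):

```sql
CREATE TABLE t (c1 INT, c2 STRING);
INSERT INTO t VALUES (1, 'a');
SELECT * FROM t;
UPDATE t SET c1 = 2 WHERE c2 = 'a';
DELETE FROM t WHERE c1 = 2;
DROP TABLE t;
```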
>>>>>>>>>>
>>>>>>>>>> Essentially I want a full replacement for the more traditional
>>>>>>>>>> RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
>>>>>>>>>>
>>>>>>>>>> Python is my language of choice for interfacing; however, there
>>>>>>>>>> does seem to be a Python JDBC wrapper available.
>>>>>>>>>>
>>>>>>>>>> Here is what I've found thus far:
>>>>>>>>>>
>>>>>>>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with
>>>>>>>>>>    interactive SQL thanks to the Stinger initiative)
>>>>>>>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>>>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>>>>>>    <https://spark.apache.org/sql>, queries only, add data via
>>>>>>>>>>    Hive, RDD
>>>>>>>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>>>>>>    or Parquet <http://parquet.io/>)
>>>>>>>>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>>>>>>>    HBase <http://hbase.apache.org>, lacks full transaction
>>>>>>>>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>>>>>>>    operators <http://en.wikipedia.org/wiki/Relational_operators>
>>>>>>>>>>    and some built-in functions)
>>>>>>>>>>    - Cloudera Impala
>>>>>>>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>>>>>>    (significant HiveQL support, some SQL language support, no support for
>>>>>>>>>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>>>>>>>>>    amongst others)
>>>>>>>>>>    - Presto <https://github.com/facebook/presto> from Facebook
>>>>>>>>>>    (can query Hive, Cassandra <http://cassandra.apache.org>,
>>>>>>>>>>    relational DBs &etc. Doesn't seem to be designed for low-latency responses
>>>>>>>>>>    across small clusters, or support UPDATE operations. It is
>>>>>>>>>>    optimized for data warehousing or analytics¹
>>>>>>>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>>>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>>>>>>>>>    community edition
>>>>>>>>>>    <https://www.mapr.com/products/hadoop-download> (seems to be
>>>>>>>>>>    a packaging of Hive, HP Vertica
>>>>>>>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>,
>>>>>>>>>>    SparkSQL, Drill and a native ODBC wrapper
>>>>>>>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>>>>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an
>>>>>>>>>>    SQL interface and multi-dimensional analysis [OLAP
>>>>>>>>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on
>>>>>>>>>>    Hadoop and supports most ANSI SQL query functions". It depends on HDFS,
>>>>>>>>>>    MapReduce, Hive and HBase; and seems targeted at very large data-sets
>>>>>>>>>>    though maintains low query latency)
>>>>>>>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>>>>>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC>
>>>>>>>>>>    driver support [benchmarks against Hive and Impala
>>>>>>>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>>>>>>>>    ])
>>>>>>>>>>    - Cascading
>>>>>>>>>>    <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>>>>>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>>>>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support>
>>>>>>>>>>    ("Lingual provides JDBC Drivers, a SQL command shell, and a catalog manager
>>>>>>>>>>    for publishing files [or any resource] as schemas and tables.")
>>>>>>>>>>
>>>>>>>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>>>>>>>> Thanks for all suggestions,
>>>>>>>>>>
>>>>>>>>>> Samuel Marks
>>>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Devopam Mittra <de...@gmail.com>.
hi Samuel,
Apologies for the delay in response, as well as for overlooking the Presto
mention in your initial post.
#IMHO :
Presto is lightweight, and easy to install and configure.
It does not support "UPDATE" .. hmm, but i don't need updates in Big Data
analytics, where i can use a temp / intermediate table, which will be faster
as well (by the way, I don't know how many others provide true UPDATE
capabilities).
I am happy with Hive itself, and don't need Presto for my ad-hoc analytics,
since the overhead of MR job kick-off timing is not overwhelming compared
to the total query execution time.
Presto is good for me when I need to run parameterized and fixed queries
from a dashboard directly on my HDP cluster, as it reduces my screen-staring
time.
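The temp/intermediate-table workaround described above can be sketched as a CREATE TABLE AS SELECT: rather than updating rows in place, materialize the corrected result into a new table (table and column names here are hypothetical, and the exact syntax an engine accepts may differ):

```sql
-- instead of UPDATE, write the corrected rows into an intermediate table
CREATE TABLE events_fixed AS
SELECT id,
       CASE WHEN status = 'unknwn' THEN 'unknown' ELSE status END AS status
FROM events;
```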

Hope you find it helpful in your decision making.

regards
Devopam





-- 
Devopam Mittra
Life and Relations are not binary

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.
Thanks Devopam,

In my initial post I did mention Presto, with this review:
" can query Hive, Cassandra <http://cassandra.apache.org/>, relational DBs
&etc. Doesn't seem to be designed for low-latency responses across small
clusters, or support UPDATE operations. It is optimized for data
warehousing or analytics¹
<http://prestodb.io/docs/current/overview/use-cases.html>"

Your thoughts?

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Devopam Mittra <de...@gmail.com>.
hi Samuel,
You may wish to evaluate Presto (https://prestodb.io/), which has the added
advantage of being faster than conventional Hive, since no MR jobs are
fired.
It has a dependency on the Hive metastore though, through which it derives
the mechanism to execute queries directly on source files.
The only flip side I found was the absence of complex SQL syntax, which means
creating a lot of intermediate tables for slightly complicated calculations
(and imho, all calculations become complex sooner than we intend them to).

regards
Devopam



-- 
Devopam Mittra
Life and Relations are not binary

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.
Alexander: So would you recommend using Phoenix for all but those kinds of
queries, and switching to Hive+Tez for the rest? - Is that feasible?

Checking their documentation, it looks like it just might be:
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

There is some early work on a Hive + Phoenix integration on GitHub:
https://github.com/nmaillard/Phoenix-Hive
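The HBaseIntegration wiki page linked above centres on Hive's HBase storage handler; roughly, a Hive table is mapped onto an HBase table like this (table, column family and property values follow the wiki's illustrative example and are not specific to my project):

```sql
-- a Hive table backed by an HBase table, via the HBase storage handler
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
```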

Saurabh: I am sure there are a variety of very good non open-source
products on the market :) - However in this thread I am only looking at
open-source options. Additionally I am planning on open-sourcing this
project I am building using these tools, so it makes even more sense that
the entire toolset and their dependencies are also open-source.

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks


Re: Which [open-source] SQL engine atop Hadoop?

Posted by Saurabh B <sa...@gmail.com>.
This is not open source but we are using Vertica and it works very nicely
for us. There is a 1TB community edition but above that it costs money.
It has really advanced SQL (analytical functions, etc), works like an
RDBMS, has R/Java/C++ SDK and scales nicely. There is a similar option of
Redshift available but Vertica has more features (pattern matching
functions, etc).

Again, not open source so I would be interested to know what you end up
going with and what your experience is.


Re: Which [open-source] SQL engine atop Hadoop?

Posted by Alexander Pivovarov <ap...@gmail.com>.
Apache Phoenix is super fast for queries which filter data by table key:
- sub-second latency
- has a good jdbc driver

but it has limitations:
- no full outer join support
- inner and left outer joins use a single machine's memory, so it cannot
join a huge table to another huge table
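A sketch of the kind of key-filtered query Phoenix answers quickly: the leading columns of the primary key constrain the scan, so the filter below becomes a range scan rather than a full table scan (the schema is hypothetical and the DDL is only approximate Phoenix syntax):

```sql
-- the primary key leads with (host, event_time), so filtering on them is fast
CREATE TABLE metrics (
    host       VARCHAR NOT NULL,
    event_time DATE    NOT NULL,
    value      DECIMAL
    CONSTRAINT pk PRIMARY KEY (host, event_time)
);

SELECT * FROM metrics
WHERE host = 'web01' AND event_time > TO_DATE('2015-01-01');
```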



Re: Which [open-source] SQL engine atop Hadoop?

Posted by Alexander Pivovarov <ap...@gmail.com>.
I like the Tez engine for hive (aka the Stinger initiative)

- faster than the MR engine, especially for complex queries with lots of
nested sub-queries
- stable
- min latency is 5-7 sec (0 sec for select count(*) ...)
- capable of processing huge datasets (not limited by RAM, as Spark is)
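Switching Hive onto Tez is a per-session setting via the standard hive.execution.engine property (the table name below is hypothetical):

```sql
-- run this query on Tez instead of MapReduce (Hive 0.13+)
SET hive.execution.engine=tez;
SELECT count(*) FROM my_table;
```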



Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.
Maybe you're right, and what I should be doing is throwing in connectors so
that data from regular databases is pushed into HDFS at regular intervals,
where my "fancier" analytics can be run across larger data-sets.

However, I don't want to decide straightaway, for example, Phoenix + Spark
may be just the combination I am looking for.

Best,


Samuel Marks
http://linkedin.com/in/samuelmarks


Re: Which [open-source] SQL engine atop Hadoop?

Posted by Jörn Franke <jo...@gmail.com>.
Hallo,

I think you first have to think about your functional and non-functional
requirements. You can scale "normal" SQL databases as well (cf. CERN or
Facebook). There are different types of databases for different purposes -
there is no one-size-fits-all. At the moment, we are a few years away from a
one-size-fits-all database that leverages AI etc. to automatically scale and
optimize processing, storage and network. Until then you will have to
do the math depending on your requirements.
Once you make them more precise, we will be able to help you more.

Cheers

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Samuel Marks <sa...@gmail.com>.
Well, what I am seeking is a Big Data database that can work with Small Data
also, i.e. scalable from one node to vast clusters, whilst maintaining
relatively low latency throughout.

Which fit into this category?

Samuel Marks
http://linkedin.com/in/samuelmarks

Re: Which [open-source] SQL engine atop Hadoop?

Posted by Koert Kuipers <ko...@tresata.com>.
i would not exclude spark sql unless you really need something mutable, in
which case lingual won't work either

On Sat, Jan 31, 2015 at 8:56 PM, Samuel Marks <sa...@gmail.com> wrote:

> Interesting discussion. It looks like the HBase metastore can also be
> configured to use HDFS HA (ex. tutorial
> <http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_hag_hdfs_ha_cdh_components_config.html>
> ).
>
> To get back on topic though, the primary contenders now are: Phoenix,
> Lingual and perhaps Tajo or Drill?
>
> Best,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> On Sun, Feb 1, 2015 at 9:38 AM, Edward Capriolo <ed...@gmail.com>
> wrote:
>
>> "is the metastore thrift definition stable across hive versions?" I would
>> say yes. Like many APIs, the core eventually solidifies. No one is saying
>> it will never ever change, but basically there are things like "database"
>> and "table", and they have properties like "name". I have some basic
>> scripts that look for table names matching patterns or summarize disk
>> usage by owner. I have not had to touch them very much. Usually if they do
>> change it is something small, and if you tie the commit to a JIRA you can
>> figure out what and why.
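The kind of maintenance script Edward describes reduces to simple filtering
and aggregation once the table metadata is in hand. A minimal pure-Python
sketch (the table records below are hypothetical; a real script would fetch
them from the metastore via its thrift client):

```python
import re
from collections import defaultdict


def filter_tables(tables, pattern):
    """Return the names of tables whose name matches a regex pattern."""
    rx = re.compile(pattern)
    return [t["name"] for t in tables if rx.search(t["name"])]


def disk_usage_by_owner(tables):
    """Summarize total size in bytes per table owner."""
    usage = defaultdict(int)
    for t in tables:
        usage[t["owner"]] += t["size_bytes"]
    return dict(usage)


# Hypothetical metadata, shaped like what a metastore query might return:
tables = [
    {"name": "clicks_2015_01", "owner": "koert", "size_bytes": 1024},
    {"name": "clicks_2015_02", "owner": "koert", "size_bytes": 2048},
    {"name": "users", "owner": "edward", "size_bytes": 512},
]

print(filter_tables(tables, r"^clicks_"))  # table-name pattern match
print(disk_usage_by_owner(tables))         # per-owner disk usage
```

Because the logic is decoupled from how the metadata is obtained, such
scripts survive metastore API churn as long as "table", "owner" and size
remain retrievable.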
>>
>> On Sat, Jan 31, 2015 at 3:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> seems the metastore thrift service supports SASL. that's great. so if i
>>> understand it correctly, all i need is the metastore thrift definition to
>>> query the metastore.
>>> is the metastore thrift definition stable across hive versions? if so,
>>> then i can build my app once without worrying about the hive version
>>> deployed. in that case i admit it's not as bad as i thought. let's see!
>>>
>>> On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> oh sorry edward, i misread your post. seems we agree that "SQL
>>>> constructs inside hive" are not for other systems.
>>>>
>>>> On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> edward,
>>>>> i would not call "SQL constructs inside hive" accessible to other
>>>>> systems. it's inside hive, after all.
>>>>>
>>>>> it is true that i can contact the metastore in java using
>>>>> HiveMetaStoreClient, but then i need to bring in a whole slew of
>>>>> dependencies (the minimum seems to be hive-metastore, hive-common,
>>>>> hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial
>>>>> and error). these jars need to be "provided" and added to the classpath
>>>>> on the cluster, unless someone is willing to build versions of an
>>>>> application for every hive version out there. and even when you do all
>>>>> this, you can only pray it's going to be compatible with the next hive
>>>>> version, since backwards compatibility is... well, let's just say
>>>>> lacking. the attitude seems to be that hive does not have a java api,
>>>>> so there is nothing that needs to be stable.
>>>>>
>>>>> you are right that i could go the pure thrift road. i haven't tried
>>>>> that yet. that might just be the best option. but how easy is it to do
>>>>> this with a secure hadoop/hive ecosystem? now i need to handle kerberos
>>>>> myself and somehow pass tokens into thrift, i assume?
>>>>>
>>>>> contrast all of this with an avro file on hadoop with metadata baked
>>>>> in, and i think it's safe to say hive metadata is not easily accessible.
>>>>>
>>>>> i will take a look at your book. i hope it has an example of using
>>>>> thrift on a secure cluster to contact the hive metastore (without using
>>>>> HiveMetaStoreClient), that would be awesome.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <
>>>>> edlinuxguru@gmail.com> wrote:
>>>>>
>>>>>> "with the metadata in a special metadata store (not on hdfs), and its
>>>>>> not as easy for all systems to access hive metadata." I disagree.
>>>>>>
>>>>>> Hive's metadata is not only accessible through SQL constructs like
>>>>>> "describe table": the entire metastore is actually a thrift service,
>>>>>> so you have programmatic access to determine things like what columns
>>>>>> are in a table, etc. Thrift generates RPC clients for almost every
>>>>>> major language.
>>>>>>
>>>>>> In the Programming Hive book
>>>>>> http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
>>>>>> there are even examples where I show how to iterate over all the
>>>>>> tables in a database from a Java client.
>>>>>>
>>>>>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <ko...@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> yes, you can run whatever you like with the data in hdfs. keep in
>>>>>>> mind that hive makes this general access pattern just a little
>>>>>>> harder, since hive has a tendency to store data and metadata
>>>>>>> separately, with the metadata in a special metadata store (not on
>>>>>>> hdfs), and it's not as easy for all systems to access hive metadata.
>>>>>>>
>>>>>>> i am not familiar at all with tajo or drill.
>>>>>>>
>>>>>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <samuelmarks@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Thanks for the advice
>>>>>>>>
>>>>>>>> Koert: when everything is in the same essential data-store (HDFS),
>>>>>>>> can't I just run whatever complex tools, in whichever paradigm, I
>>>>>>>> like?
>>>>>>>>
>>>>>>>> E.g.: GraphX, Mahout &etc.
>>>>>>>>
>>>>>>>> Also, what about Tajo or Drill?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Samuel Marks
>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>
>>>>>>>> PS: Spark-SQL is read-only IIRC, right?
>>>>>>>> On 31 Jan 2015 03:39, "Koert Kuipers" <ko...@tresata.com> wrote:
>>>>>>>>
>>>>>>>>> since you require high-powered analytics, and i assume you want to
>>>>>>>>> stay sane while doing so, you require the ability to "drop out of sql" when
>>>>>>>>> needed. so spark-sql and lingual would be my choices.
>>>>>>>>>
>>>>>>>>> low latency indicates phoenix or spark-sql to me.
>>>>>>>>>
>>>>>>>>> so i would say spark-sql
>>>>>>>>>
>>>>>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <
>>>>>>>>> samuelmarks@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>>>>>>>> its exposing both JDBC and ODBC interfaces. However, although
>>>>>>>>>> Pivotal does open-source a lot of software
>>>>>>>>>> <http://www.pivotal.io/oss>, I don't believe they open-source
>>>>>>>>>> Pivotal HD: HAWQ.
>>>>>>>>>>
>>>>>>>>>> So that doesn't meet my requirements. I should note that the
>>>>>>>>>> project I am building will also be open-source, which heightens the
>>>>>>>>>> importance of having all components also being open-source.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Samuel Marks
>>>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>>>>>>>> siddharth.tiwari@live.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Have you looked at HAWQ from Pivotal ?
>>>>>>>>>>>
>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>
>>>>>>>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <sa...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Since Hadoop <http://hadoop.apache.org> came out, there have
>>>>>>>>>>> been various commercial and/or open-source attempts to expose
>>>>>>>>>>> some compatibility with SQL. Obviously by posting here I am not
>>>>>>>>>>> expecting an unbiased answer.
>>>>>>>>>>>
>>>>>>>>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency
>>>>>>>>>>> querying, and supports the most common CRUD
>>>>>>>>>>> operations, including [the basics!] along these
>>>>>>>>>>> lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table
>>>>>>>>>>> SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional
>>>>>>>>>>> support would be nice also, but is not a must-have.
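The CRUD statements listed above can be illustrated end-to-end. A minimal
sketch using Python's built-in sqlite3, purely as a stand-in engine to show
the statement shapes (any SQL-on-Hadoop engine meeting the requirement
would need to accept roughly the same SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine, just to run the SQL
cur = conn.cursor()

cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, c1 INTEGER)")
cur.execute("INSERT INTO t (c1) VALUES (1)")
cur.execute("UPDATE t SET c1 = 2 WHERE c1 = 1")
rows = cur.execute("SELECT * FROM t").fetchall()
print(rows)  # [(1, 2)]
cur.execute("DELETE FROM t WHERE c1 = 2")
cur.execute("DROP TABLE t")
conn.close()
```

The UPDATE and DELETE lines are exactly the operations several of the
engines discussed in this thread (Spark SQL, Impala at the time) do not
support, which is why the checklist matters.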
>>>>>>>>>>>
>>>>>>>>>>> Essentially I want a full replacement for the more traditional
>>>>>>>>>>> RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
>>>>>>>>>>>
>>>>>>>>>>> Python is my language of choice for interfacing; however, there
>>>>>>>>>>> does seem to be a Python JDBC wrapper.
>>>>>>>>>>>
>>>>>>>>>>> Here is what I've found thus far:
>>>>>>>>>>>
>>>>>>>>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with
>>>>>>>>>>>    interactive SQL thanks to the Stinger initiative)
>>>>>>>>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>>>>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>>>>>>>    <https://spark.apache.org/sql>, queries only, add data via
>>>>>>>>>>>    Hive, RDD
>>>>>>>>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>>>>>>>    or Parquet <http://parquet.io/>)
>>>>>>>>>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>>>>>>>>    HBase <http://hbase.apache.org>, lacks full transaction
>>>>>>>>>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>>>>>>>>    operators <http://en.wikipedia.org/wiki/Relational_operators>
>>>>>>>>>>>    and some built-in functions)
>>>>>>>>>>>    - Cloudera Impala
>>>>>>>>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>>>>>>>    (significant HiveQL support, some SQL language support, no support for
>>>>>>>>>>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>>>>>>>>>>    amongst others)
>>>>>>>>>>>    - Presto <https://github.com/facebook/presto> from Facebook
>>>>>>>>>>>    (can query Hive, Cassandra <http://cassandra.apache.org>,
>>>>>>>>>>>    relational DBs &etc. Doesn't seem to be designed for low-latency responses
>>>>>>>>>>>    across small clusters, or support UPDATE operations. It is
>>>>>>>>>>>    optimized for data warehousing or analytics¹
>>>>>>>>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>>>>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop>
>>>>>>>>>>>    via MapR community edition
>>>>>>>>>>>    <https://www.mapr.com/products/hadoop-download> (seems to be
>>>>>>>>>>>    a packaging of Hive, HP Vertica
>>>>>>>>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>,
>>>>>>>>>>>    SparkSQL, Drill and a native ODBC wrapper
>>>>>>>>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>>>>>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an
>>>>>>>>>>>    SQL interface and multi-dimensional analysis [OLAP
>>>>>>>>>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on
>>>>>>>>>>>    Hadoop and supports most ANSI SQL query functions". It depends on HDFS,
>>>>>>>>>>>    MapReduce, Hive and HBase; and seems targeted at very large
>>>>>>>>>>>    data-sets, though it maintains low query latency)
>>>>>>>>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL
>>>>>>>>>>>    standard compliance with JDBC
>>>>>>>>>>>    <http://en.wikipedia.org/wiki/JDBC> driver support [benchmarks
>>>>>>>>>>>    against Hive and Impala
>>>>>>>>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>>>>>>>>>    ])
>>>>>>>>>>>    - Cascading
>>>>>>>>>>>    <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>>>>>>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>>>>>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support>
>>>>>>>>>>>    ("Lingual provides JDBC Drivers, a SQL command shell, and a catalog manager
>>>>>>>>>>>    for publishing files [or any resource] as schemas and tables.")
>>>>>>>>>>>
>>>>>>>>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>>>>>>>>> Thanks for all suggestions,
>>>>>>>>>>>
>>>>>>>>>>> Samuel Marks
>>>>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>