Posted to user@spark.apache.org by spierki <fl...@crisalid.com> on 2015/07/06 09:23:41 UTC

Spark SQL queries hive table, real time ?

Hello,

I'm wondering about the performance of using Spark SQL with Hive
to do real-time analytics.
I know that Hive was created for batch processing, and that Spark is used to
run fast queries.

But will using Spark SQL with Hive allow me to run real-time queries? Or will it
just make queries faster, but not real time?
Should I use another data warehouse, like HBase?

Thanks in advance for your time and consideration,
Florian



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-queries-hive-table-real-time-tp23642.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

RE: Spark SQL queries hive table, real time ?

Posted by Mohammed Guller <mo...@glassbeam.com>.

Hi Florian,
It depends on a number of factors. How much data are you querying? Where is the data stored (HDD, SSD or DRAM)? What is the file format (Parquet or CSV)?

In theory, it is possible to use Spark SQL for real-time queries, but the cost increases as the data size grows. If you can store all of your data in memory, then you should be able to query it in real time ☺ At the other extreme, if Spark SQL has to read a terabyte of data from spinning disk, there is no way it can respond in real time. To be fair, no software can read a terabyte of data from HDD in real time. Simple laws of physics. You will either have to spread the reads over a large number of disks and read them in parallel, or index the data so that your queries don’t have to read a terabyte of data from disk.
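
To put rough numbers on the disk argument, here is a back-of-the-envelope sketch in Python. The 150 MB/s sequential throughput per disk is an assumed figure for illustration, not a measurement, and `scan_seconds` is a hypothetical helper name. The calculation also ignores seek time, network and CPU cost, so real scans would be slower.

```python
def scan_seconds(data_bytes, mb_per_second_per_disk, disks=1):
    """Lower bound on the time to read data_bytes spread evenly over `disks`."""
    if disks < 1 or mb_per_second_per_disk <= 0:
        raise ValueError("need at least one disk and a positive throughput")
    bytes_per_second = mb_per_second_per_disk * 1_000_000 * disks
    return data_bytes / bytes_per_second


TERABYTE = 10**12          # decimal terabyte, in bytes
ASSUMED_HDD_MB_S = 150     # assumption: sequential read rate of one spinning disk

single_disk = scan_seconds(TERABYTE, ASSUMED_HDD_MB_S)
hundred_disks = scan_seconds(TERABYTE, ASSUMED_HDD_MB_S, disks=100)

print(f"1 disk:    {single_disk / 60:.0f} minutes")
print(f"100 disks: {hundred_disks:.0f} seconds")
```

Under these assumptions a single disk needs close to two hours for a full scan, and even 100 disks reading in parallel need about a minute, which is why indexing or keeping the data in memory matters for interactive latency.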

Hope that helps.

Mohammed

From: Denny Lee [mailto:denny.g.lee@gmail.com]
Sent: Monday, July 6, 2015 4:21 AM
To: spierki; user@spark.apache.org
Subject: Re: Spark SQL queries hive table, real time ?

Within the context of your question, Spark SQL utilizing the Hive context is primarily about very fast queries.  If you want to use real-time queries, I would utilize Spark Streaming.  A couple of great resources on this topic include Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization<http://www.slideshare.net/tathadas/guest-lecture-on-spark-streaming-in-standford> and Recipes for Running Spark Streaming Applications in Production<https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/> (from the recent Spark Summit 2015)

HTH!



Re: Spark SQL queries hive table, real time ?

Posted by Denny Lee <de...@gmail.com>.

Within the context of your question, Spark SQL utilizing the Hive context
is primarily about very fast queries.  If you want to use real-time
queries, I would utilize Spark Streaming.  A couple of great resources on
this topic include Guest Lecture on Spark Streaming in Stanford CME 323:
Distributed Algorithms and Optimization
<http://www.slideshare.net/tathadas/guest-lecture-on-spark-streaming-in-standford>
and Recipes for Running Spark Streaming Applications in Production
<https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/>
(from the recent Spark Summit 2015)
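
As a rough illustration of the difference, Spark Streaming processes incoming data in small time-sliced batches and keeps results up to date as each batch arrives, rather than re-scanning a table per query. The sketch below imitates that idea in plain Python. It does not use the Spark API, and the function names and sample events are made up for the example.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, key) events into fixed-width time buckets.

    Intervals that contain no events are simply skipped in this sketch.
    """
    buckets = defaultdict(list)
    for timestamp, key in events:
        buckets[int(timestamp // batch_seconds)].append(key)
    return [buckets[i] for i in sorted(buckets)]


def running_counts(events, batch_seconds):
    """Return the cumulative per-key counts after each micro-batch."""
    totals = Counter()
    snapshots = []
    for batch in micro_batches(events, batch_seconds):
        totals.update(batch)            # only the new batch is processed
        snapshots.append(dict(totals))  # result available right after the batch
    return snapshots


# Hypothetical events: (arrival time in seconds, key)
events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "a"), (2.7, "b")]
snapshots = running_counts(events, batch_seconds=1)
for i, snapshot in enumerate(snapshots):
    print(f"after batch {i}: {snapshot}")
```

Each snapshot only costs the work of the newest batch, which is what keeps latency low compared with running a full query over the stored data every time.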

HTH!

Re: Spark SQL queries hive table, real time ?

Posted by Jörn Franke <jo...@gmail.com>.

Hive on Tez has recently (as of 1.2.0) become much faster (if you use the ORC
format), so it will be sufficient for most use cases.
Alternatively, you could also use Spark SQL (if you have the memory) or
Apache Phoenix. The latter currently has somewhat less SQL
support and requires full access to all nodes of the cluster. However,
access is rather fast. You can use it for storing and retrieving
pre-aggregated values, and use Spark or Hive for any other queries.
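
A minimal sketch of the pre-aggregation idea, in plain Python: the dictionary stands in for a keyed store such as a Phoenix table, and none of this is the Phoenix API. The row layout, the sample data and the names `pre_aggregate` and `lookup` are assumptions made up for the example.

```python
from collections import defaultdict


def pre_aggregate(rows):
    """Collapse raw (day, product, amount) rows into one total per (day, product)."""
    totals = defaultdict(float)
    for day, product, amount in rows:
        totals[(day, product)] += amount
    return dict(totals)


# Hypothetical raw data; in practice a batch job (Spark or Hive) would compute this.
raw_rows = [
    ("2015-07-05", "widget", 10.0),
    ("2015-07-05", "widget", 5.0),
    ("2015-07-05", "gadget", 7.5),
    ("2015-07-06", "widget", 2.5),
]
daily_sales = pre_aggregate(raw_rows)


def lookup(day, product):
    """Serve a query from the pre-aggregated totals without scanning raw rows."""
    return daily_sales.get((day, product), 0.0)


print(lookup("2015-07-05", "widget"))
```

The expensive scan happens once, ahead of time, and each interactive query becomes a key lookup. Queries that do not match the pre-computed keys still have to go to Spark or Hive.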
