Posted to mapreduce-user@hadoop.apache.org by Azuryy Yu <az...@gmail.com> on 2015/03/05 07:30:10 UTC

Re: Need advice about OLAP on Hadoop

Hi VK,

I have a similar requirement. We need a real-time data analysis platform.

Actually, you don't need to pay much attention to Spark or Apache Drill
themselves, because the data for OLAP cubes is calculated before the
cube is built.

You just need to consider two questions:

1) How do you calculate the data for the cube quickly?
2) Which storage would be more efficient for cube access?

For the first question, Spark SQL, Impala, and Apache Tajo are all good
candidates; they are all good at ad-hoc queries, but Impala and Apache
Tajo would be better choices than Spark.
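For illustration, here is a minimal Spark SQL sketch of that
pre-computation step. The "sales" table and its columns are made up, and
the SparkSession API shown postdates the HiveContext of the Spark 1.x
era, so treat it as a sketch rather than a recipe.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("CubePrecompute")
    .enableHiveSupport()  // lets spark.sql() see Hive tables
    .getOrCreate()

  // GROUP BY ... WITH CUBE emits one row per combination of the
  // grouping columns (including grand totals), which is exactly the
  // pre-aggregated data an OLAP cube serves.
  val cubeData = spark.sql("""
    SELECT region, product, year, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region, product, year WITH CUBE
  """)

  // Persist the aggregates so the cube layer only has to do lookups.
  cubeData.write.mode("overwrite").saveAsTable("sales_cube")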

For the second question, there is no excellent open source project that
supports both cube building and cube storage; we are mostly inclined to
buy Oracle Exalytics.

P.S. We have our own Hadoop cluster and an Apache Tajo cluster.
We calculate some general data on the Hadoop cluster using MapReduce,
then run ad-hoc queries on the Apache Tajo cluster, then pull the data
into Oracle Exalytics and build the cube there.



On Fri, Feb 27, 2015 at 1:19 PM, Vikram Kone <vi...@gmail.com> wrote:

> Hi,
> I'm a newbie when it comes to Spark and the Hadoop ecosystem in general.
> Our team has been predominantly a Microsoft shop that uses the MS stack
> for most of its BI needs. So we are talking SQL Server for storing
> relational data and SQL Server Analysis Services (SSAS) for building
> MOLAP cubes for sub-second query analysis.
> Lately, we have been seeing degradation in our cube query response times
> as our data sizes grew considerably over the past year. We are talking
> fact tables in the 10-100 billion row range and a few dimensions in the
> 10-100 million row range. We tried vertically scaling up our SSAS server,
> but queries still take a few minutes. In light of this, I was entrusted
> with the task of figuring out an open source solution that would scale to
> our current and future data analysis needs.
> I looked at a bunch of open source tools like Apache Drill, Druid,
> AtScale, Spark, Storm, Kylin, etc., and settled on exploring Spark as the
> first step, given its recent rise in popularity and the growing ecosystem
> around it. Since we are also interested in doing deep data analysis like
> machine learning and graph algorithms on top of our data, Spark seems to
> be a good solution.
> I would like to build out a POC for our MOLAP cubes using Spark with
> HDFS/Hive as the data source and see how it scales for our
> queries/measures in real time with real data.
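> As a rough sketch of the POC shape (the Hive tables "fact_sales" and
> "dim_product" and their columns are hypothetical), Spark's DataFrame
> cube operator could pre-aggregate a fact/dimension join:
>
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.functions.sum
>
>   val spark = SparkSession.builder()
>     .appName("MolapPoc")
>     .enableHiveSupport()  // read the fact/dim tables from Hive
>     .getOrCreate()
>
>   // cube() emits one aggregate row per combination of the listed
>   // dimensions, i.e. the same grain a MOLAP cube stores.
>   val cube = spark.table("fact_sales")
>     .join(spark.table("dim_product"), "product_id")
>     .cube("region", "category", "year")
>     .agg(sum("amount").as("total_amount"))
>
>   cube.write.mode("overwrite").saveAsTable("poc_cube")
>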
> Roughly, these are our team's requirements:
> 1. We should be able to create facts, dimensions, and measures from our
> data sets in a straightforward way.
> 2. Cubes should be queryable from Excel and Tableau.
> 3. It should scale out easily by adding new nodes as data grows.
> 4. It should need minimal maintenance and be highly stable for
> production-level workloads.
> 5. Sub-second query latencies for COUNT DISTINCT measures (since the
> majority of our expensive measures are of this type). We are OK with
> approximate distinct counts for better performance; see the sketch
> after this list.
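> For requirement 5, a hedged sketch of what Spark offers (column names
> invented, SparkSession reused from the sketch above):
> approx_count_distinct computes a HyperLogLog++ estimate instead of an
> exact distinct count, trading bounded error for speed.
>
>   import org.apache.spark.sql.functions.approx_count_distinct
>
>   // The second argument is the maximum relative standard deviation
>   // allowed for the estimate (the default is 0.05).
>   val distinctUsers = spark.table("fact_sales")
>     .cube("region", "year")
>     .agg(approx_count_distinct("user_id", 0.01).as("approx_users"))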
>
> So given these requirements, is Spark the right solution to replace our
> on-premise MOLAP cubes? Or should we look at Apache Drill, or something
> else like MongoDB or Cassandra?
> Are there any tutorials or documentation on how to build cubes using
> Spark? Is that even possible, or even necessary? As long as our users
> can pivot/slice and dice the measures quickly from client tools by
> dragging and dropping dimensions into rows/columns without needing to
> join to the fact table, we are OK with however the data is laid out. It
> doesn't have to be a cube; it can be a flat file in HDFS for all we
> care. I would love to chat with someone who has successfully done this
> kind of migration from OLAP cubes to Spark in their team or company.
>
> That's it for now. Looking forward to a great discussion.
>
> P.S. We have decided on using Azure HDInsight as our managed Hadoop
> system in the cloud.
>