Posted to user@hive.apache.org by "Bogala, Chandra Reddy" <Ch...@gs.com> on 2014/06/06 03:01:21 UTC

performance and cluster size required

Hi,
  I get a 300 MB compressed file (structured CSV data) in a spool directory every 3 minutes from a collector, and I have around 6 collectors. I move the data from the spool directory to an HDFS directory and add it as a Hive partition for every 15 minutes of data. Then I run different aggregation queries and post the results to HBase and Mongo. So each query reads around 9 GB of compressed data. For this volume, I need to evaluate how many cluster nodes are required to finish all the aggregation queries in time (within the 15-minute partition window). What is the best way to evaluate this?
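
For reference, the partition-add step can be expressed in HiveQL roughly as below; the table name, partition columns, and paths are hypothetical, a minimal sketch assuming the table is partitioned per 15-minute window:

    -- Hypothetical 'events' table, partitioned by date and 15-minute window.
    -- The compressed CSV files are first moved from the spool directory into
    -- the partition path (e.g. with hdfs dfs -mv), then registered with Hive:
    ALTER TABLE events
      ADD IF NOT EXISTS PARTITION (dt='2014-06-06', win='0300')
      LOCATION '/data/events/dt=2014-06-06/win=0300';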

  Also, is there any way I can post the aggregated data to both Mongo and HBase (i.e. post the same query result to multiple tables, instead of running the same query multiple times and inserting into only a single table at a time)?

Thanks,
Chandra

Re: performance and cluster size required

Posted by Nitin Pawar <ni...@gmail.com>.
On the first part of your question (what the cluster size should be), it is
totally dependent on:
1) what type of queries you are performing
2) what type of cluster you have got, i.e. whether it is shared or dedicated
to you only
3) the compressed file format, which drives query performance depending on
whether the compression type is splittable or not
4) the capacity of each node (compute, memory and storage)
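
A quick sanity check on the input rate, using only the numbers from your mail: 300 MB every 3 minutes from each of 6 collectors is 1.8 GB per 3 minutes, i.e. 9 GB of compressed data per 15-minute window, which matches your 9 GB figure. A rough extrapolation from there, assuming roughly 3:1 compression and 128 MB splits (both guesses):

    9 GB compressed x ~3   ~= 27 GB uncompressed per window
    27 GB / 128 MB splits  ~= ~216 map tasks per full scan

So each aggregation query needs enough map capacity to get through a couple of hundred splits well inside 15 minutes. In practice the most reliable evaluation is to benchmark your actual queries on a small cluster against one 15-minute partition and scale the node count from the measured runtime.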


On the second part: as per my understanding, there is no way to write data
to multiple targets using a single query, so you have two options (see the
sketch below):
1) run the query once, save the output to a staging location, then write it
to the two targets
2) run the query twice with different targets
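
To make option 1 concrete, here is a minimal HiveQL sketch. All table and column names are hypothetical, and it assumes hbase_agg and mongo_agg have already been created as external Hive tables mapped to HBase and MongoDB through their respective storage handlers (org.apache.hadoop.hive.hbase.HBaseStorageHandler and the mongo-hadoop connector's storage handler):

    -- Run the expensive aggregation exactly once into a staging table.
    CREATE TABLE agg_staging AS
    SELECT key, count(*) AS cnt, sum(bytes) AS total_bytes
    FROM events
    WHERE dt = '2014-06-06' AND win = '0300'
    GROUP BY key;

    -- Fan the cached result out to both targets; each insert is a cheap
    -- scan of the small staging table rather than a re-aggregation.
    INSERT INTO TABLE hbase_agg SELECT * FROM agg_staging;
    INSERT INTO TABLE mongo_agg SELECT * FROM agg_staging;

This way the 9 GB scan and the group-by run once, and only the much smaller aggregated result is written twice.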





-- 
Nitin Pawar