Posted to user@spark.apache.org by "Balaraju.Kagidala Kagidala" <ba...@gmail.com> on 2016/01/07 05:47:18 UTC

Need Help in Spark Hive Data Processing

Hi,

I am a new Spark user. I am trying to use Spark DataFrames to process a large
amount of Hive data.

I have a 5-node Spark cluster, each node with 30 GB of memory. I want to process
a Hive table holding 450 GB of data using DataFrames. Fetching a single row from
the Hive table takes 36 minutes. Please suggest what is wrong here; any help is
appreciated.


Thanks
Bala

Re: Need Help in Spark Hive Data Processing

Posted by Jörn Franke <jo...@gmail.com>.

You need the table in an efficient format, such as ORC or Parquet. Have the table sorted appropriately (hint: sort on the most discriminating column in the WHERE clause). Do not use SAN or virtualization for the worker nodes.
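
To illustrate why sorting matters: ORC and Parquet keep min/max statistics per stripe or row group, and a reader can skip any chunk whose range cannot contain the value being filtered on. The sketch below is plain Python, not Spark or ORC code. The chunk size, the column of unique ids, and the helper names are all hypothetical; it only counts how many chunks a point lookup would have to open with and without sorting.

```python
import random


def chunk_stats(values, chunk_size):
    """Split values into fixed-size chunks and record (min, max) per chunk.

    This loosely mimics the per-stripe / per-row-group column statistics
    that columnar formats store; it is an illustration, not a real reader.
    """
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return [(min(c), max(c)) for c in chunks]


def chunks_to_read(stats, key):
    """Count chunks whose [min, max] range could contain key."""
    return sum(1 for lo, hi in stats if lo <= key <= hi)


random.seed(0)
ids = list(range(100_000))      # hypothetical unique filter column
shuffled = ids[:]
random.shuffle(shuffled)        # the table as written, in no particular order

unsorted_stats = chunk_stats(shuffled, 1_000)
sorted_stats = chunk_stats(sorted(shuffled), 1_000)

key = 4242                      # the single row being looked up
unsorted_reads = chunks_to_read(unsorted_stats, key)
sorted_reads = chunks_to_read(sorted_stats, key)
print("chunks opened, unsorted:", unsorted_reads)
print("chunks opened, sorted:  ", sorted_reads)
```

With the data sorted on the filter column, the chunk ranges do not overlap, so only one chunk can match. With unsorted data almost every chunk's range spans the key, and the reader cannot skip anything.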

Can you please post your query?

I always recommend avoiding single-row updates where possible. They are very inefficient for analytics scenarios; to some extent this is also true in the traditional database world (depending on the use case, of course).



Re: Need Help in Spark Hive Data Processing

Posted by Jeff Zhang <zj...@gmail.com>.

It depends on how you fetch the single row. Is your query complex?



-- 
Best Regards

Jeff Zhang