Posted to user@spark.apache.org by Malligarjunan S <ma...@aerifymedia.com> on 2014/07/15 20:09:52 UTC

Spark Performance Benchmark

Hello All,

I am a newbie to Apache Spark, and I would like to know what performance I
can expect from it.

My current requirement is as follows:
I have a few files in two S3 buckets.
Each file has a minimum of 1 million records, and the data are
tab-separated.
I have to compare a few columns and filter the records.
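
For context, this is roughly how I plan to read the files in Spark (a
minimal sketch only, assuming Spark 1.x; the s3n:// paths are
placeholders for my real buckets):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("S3TsvFilter")
val sc = new SparkContext(conf)

// Each line becomes an array of tab-separated column values.
val table1 = sc.textFile("s3n://bucket1/input/").map(_.split("\t"))
val table2 = sc.textFile("s3n://bucket2/input/").map(_.split("\t"))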

Right now I am using Hive, and it is taking more than 2 days to filter the
records.
Please find the hive query below

INSERT OVERWRITE TABLE cnv_algo3
SELECT * FROM table1 t1 JOIN table2 t2
WHERE unix_timestamp(t2.time, 'yyyy-MM-dd HH:mm:ss,SSS') >
      unix_timestamp(t1.time, 'yyyy-MM-dd HH:mm:ss,SSS')
  AND compare(t1.column1, t1.column2, t2.column1, t2.column4);

Here compare is a UDF.
Assume table1 has 20 million records and table2 has 5 million records.
Let me know how much time Spark would take to filter the records on a
standard configuration.
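
For reference, this is the translation I have in mind for Spark's RDD
API (a rough sketch only; parseTime and compare are stand-ins for my
real timestamp parsing and UDF logic, and the column indices are
guesses for my actual schema):

import java.text.SimpleDateFormat

// Parses the timestamp format from the Hive query.
def parseTime(s: String): Long =
  new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS").parse(s).getTime

// Stand-in for the real logic inside the Hive compare() UDF.
def compare(a: String, b: String, c: String, d: String): Boolean =
  a == c && b == d

// Direct translation of the theta-join: a full cross product,
// filtered by the two conditions. Column indices are guesses.
val result = table1.cartesian(table2).filter { case (t1, t2) =>
  parseTime(t2(0)) > parseTime(t1(0)) &&
  compare(t1(1), t1(2), t2(1), t2(4))
}

result.map { case (t1, t2) => (t1 ++ t2).mkString("\t") }
      .saveAsTextFile("s3n://bucket1/output/")

One thing I noticed while sketching this: a cartesian of 20 million x
5 million rows is about 10^14 pairs, which no cluster will get through
quickly. If compare is really an equality test on some columns, keying
both RDDs on those columns and using join instead of cartesian would
avoid the full cross product; I suppose the same caveat applies to my
Hive query above.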

It is pretty urgent for us to make a decision on moving the project to
Spark, so please help me. I highly appreciate your help.


I am planning to launch the cluster with --instance-type m1.xlarge
--instance-count 3.

Thanks and Regards,
Malligarjunan S.