Posted to user@spark.apache.org by Ravisankar Mani <rr...@gmail.com> on 2015/07/10 12:49:39 UTC

Spark performance

Hi everyone,

I am planning to move from MS SQL Server to Spark. I am using around 50,000
to 100,000 (1 lakh) records.
Spark's performance is slow compared to MS SQL Server.

Which is the better system (Spark or SQL Server) for storing and retrieving
around 50,000 to 100,000 records?

regards,
Ravi

Re: Spark performance

Posted by Jörn Franke <jo...@gmail.com>.
What is your business case for the move?

On Fri, Jul 10, 2015 at 12:49, Ravisankar Mani <rr...@gmail.com> wrote:

> Hi everyone,
>
> I am planning to move from MS SQL Server to Spark. I am using around 50,000
> to 100,000 (1 lakh) records.
> Spark's performance is slow compared to MS SQL Server.
>
> Which is the better system (Spark or SQL Server) for storing and retrieving
> around 50,000 to 100,000 records?
>
> regards,
> Ravi
>
>

Re: Spark performance

Posted by Jörn Franke <jo...@gmail.com>.
On Sat, Jul 11, 2015 at 14:53, Roman Sokolov <ol...@gmail.com> wrote:

> Hello. Had the same question. What if I need to store 4-6 TB and do
> queries? Can't find any clue in the documentation.
> On 11.07.2015 03:28, "Mohammed Guller" <mo...@glassbeam.com> wrote:
>
>>  Hi Ravi,
>>
>> First, neither Spark nor Spark SQL is a database. Both are compute
>> engines, which need to be paired with a storage system. Second, they are
>> designed for processing large distributed datasets. If you have only
>> 100,000 records or even a million records, you don’t need Spark. An RDBMS
>> will perform much better for that volume of data.
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* Ravisankar Mani [mailto:rravimr@gmail.com]
>> *Sent:* Friday, July 10, 2015 3:50 AM
>> *To:* user@spark.apache.org
>> *Subject:* Spark performance
>>
>>
>>
>> Hi everyone,
>>
>> I am planning to move from MS SQL Server to Spark. I am using around
>> 50,000 to 100,000 (1 lakh) records.
>>
>> Spark's performance is slow compared to MS SQL Server.
>>
>> Which is the better system (Spark or SQL Server) for storing and
>> retrieving around 50,000 to 100,000 records?
>>
>> regards,
>>
>> Ravi
>>
>>
>>
>

RE: Spark performance

Posted by Mohammed Guller <mo...@glassbeam.com>.
Good points, Michael.

The underlying assumption in my statement is that cost is an issue. If cost is not an issue and the only requirement is to query structured data, then there are several databases such as Teradata, Exadata, and Vertica that can handle 4-6 TB of data and outperform Spark.

Mohammed

From: Michael Segel [mailto:msegel_hadoop@hotmail.com]
Sent: Sunday, July 12, 2015 6:59 AM
To: Mohammed Guller
Cc: David Mitchell; Roman Sokolov; user; Ravisankar Mani
Subject: Re: Spark performance

Not necessarily.

It depends on the use case and what you intend to do with the data.

4-6 TB will easily fit on an SMP box and can be efficiently searched by an RDBMS.
Again, it depends on what you want to do and how you want to do it.

Informix’s IDS engine, with its extensibility, could still outperform Spark in some use cases, given proper use of indexes and the right amount of parallelism.

There is a lot of crossover… now, had you said 100TB+ of unstructured data… things might be different.

Please understand that what would make Spark more compelling is the TCO of the solution when compared to SMP boxes and software licensing.

It’s not that I disagree with your statements: moving from MS SQL Server or any small RDBMS to Spark doesn’t make a whole lot of sense.
Just wanted to add that the decision isn’t as cut and dried as some think.

On Jul 11, 2015, at 8:47 AM, Mohammed Guller <mo...@glassbeam.com> wrote:

Hi Roman,
Yes, Spark SQL will be a better solution than a standard RDBMS for querying 4-6 TB of data. You can pair Spark SQL with HDFS+Parquet to build a powerful analytics solution.

Mohammed

From: David Mitchell [mailto:jdavidmitchell@gmail.com]
Sent: Saturday, July 11, 2015 7:10 AM
To: Roman Sokolov
Cc: Mohammed Guller; user; Ravisankar Mani
Subject: Re: Spark performance

You can certainly query over 4 TB of data with Spark.  However, you will get an answer in minutes or hours, not in milliseconds or seconds.  OLTP databases are used for web applications, and typically return responses in milliseconds.  Analytic databases tend to operate on large data sets, and return responses in seconds, minutes or hours.  When running batch jobs over large data sets, Spark can be a replacement for analytic databases like Greenplum or Netezza.



On Sat, Jul 11, 2015 at 8:53 AM, Roman Sokolov <ol...@gmail.com> wrote:
Hello. Had the same question. What if I need to store 4-6 TB and do queries? Can't find any clue in the documentation.
On 11.07.2015 03:28, "Mohammed Guller" <mo...@glassbeam.com> wrote:
Hi Ravi,
First, neither Spark nor Spark SQL is a database. Both are compute engines, which need to be paired with a storage system. Second, they are designed for processing large distributed datasets. If you have only 100,000 records or even a million records, you don’t need Spark. An RDBMS will perform much better for that volume of data.

Mohammed

From: Ravisankar Mani [mailto:rravimr@gmail.com]
Sent: Friday, July 10, 2015 3:50 AM
To: user@spark.apache.org
Subject: Spark performance

Hi everyone,
I am planning to move from MS SQL Server to Spark. I am using around 50,000 to 100,000 (1 lakh) records.
Spark's performance is slow compared to MS SQL Server.

Which is the better system (Spark or SQL Server) for storing and retrieving around 50,000 to 100,000 records?
regards,
Ravi






Re: Spark performance

Posted by Michael Segel <ms...@hotmail.com>.
Not necessarily.

It depends on the use case and what you intend to do with the data. 

4-6 TB will easily fit on an SMP box and can be efficiently searched by an RDBMS. 
Again, it depends on what you want to do and how you want to do it.

Informix’s IDS engine, with its extensibility, could still outperform Spark in some use cases, given proper use of indexes and the right amount of parallelism.

There is a lot of crossover… now, had you said 100TB+ of unstructured data… things might be different.

Please understand that what would make Spark more compelling is the TCO of the solution when compared to SMP boxes and software licensing.

It’s not that I disagree with your statements: moving from MS SQL Server or any small RDBMS to Spark doesn’t make a whole lot of sense.
Just wanted to add that the decision isn’t as cut and dried as some think.

> On Jul 11, 2015, at 8:47 AM, Mohammed Guller <mo...@glassbeam.com> wrote:
> 
> Hi Roman,
> Yes, Spark SQL will be a better solution than a standard RDBMS for querying 4-6 TB of data. You can pair Spark SQL with HDFS+Parquet to build a powerful analytics solution.
>  
> Mohammed
>  
> From: David Mitchell [mailto:jdavidmitchell@gmail.com] 
> Sent: Saturday, July 11, 2015 7:10 AM
> To: Roman Sokolov
> Cc: Mohammed Guller; user; Ravisankar Mani
> Subject: Re: Spark performance
>  
> You can certainly query over 4 TB of data with Spark.  However, you will get an answer in minutes or hours, not in milliseconds or seconds.  OLTP databases are used for web applications, and typically return responses in milliseconds.  Analytic databases tend to operate on large data sets, and return responses in seconds, minutes or hours.  When running batch jobs over large data sets, Spark can be a replacement for analytic databases like Greenplum or Netezza.  
>  
>  
>  
> On Sat, Jul 11, 2015 at 8:53 AM, Roman Sokolov <oleamm@gmail.com> wrote:
> Hello. Had the same question. What if I need to store 4-6 TB and do queries? Can't find any clue in the documentation.
>
> On 11.07.2015 03:28, "Mohammed Guller" <mohammed@glassbeam.com> wrote:
> Hi Ravi,
> First, neither Spark nor Spark SQL is a database. Both are compute engines, which need to be paired with a storage system. Second, they are designed for processing large distributed datasets. If you have only 100,000 records or even a million records, you don’t need Spark. An RDBMS will perform much better for that volume of data.
>  
> Mohammed
>  
> From: Ravisankar Mani [mailto:rravimr@gmail.com]
> Sent: Friday, July 10, 2015 3:50 AM
> To: user@spark.apache.org
> Subject: Spark performance
>  
> Hi everyone,
> 
> I am planning to move from MS SQL Server to Spark. I am using around 50,000 to 100,000 (1 lakh) records.
> Spark's performance is slow compared to MS SQL Server.
>
> Which is the better system (Spark or SQL Server) for storing and retrieving around 50,000 to 100,000 records?
> 
> regards,
> Ravi
>  
> 
> 
>  



RE: Spark performance

Posted by Mohammed Guller <mo...@glassbeam.com>.
Hi Roman,
Yes, Spark SQL will be a better solution than a standard RDBMS for querying 4-6 TB of data. You can pair Spark SQL with HDFS+Parquet to build a powerful analytics solution.
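
For illustration, here is a minimal sketch of that pairing in Scala. It assumes a Spark 1.4-era spark-shell with its built-in sqlContext; the HDFS path, table name, and column names are made-up placeholders, not anything from this thread:

    // Load Parquet files from HDFS into a DataFrame and expose them to SQL.
    val events = sqlContext.read.parquet("hdfs:///data/events")
    events.registerTempTable("events")

    // Parquet's columnar layout lets Spark SQL scan only the columns it needs.
    sqlContext.sql(
      "SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type"
    ).show()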

Mohammed

From: David Mitchell [mailto:jdavidmitchell@gmail.com]
Sent: Saturday, July 11, 2015 7:10 AM
To: Roman Sokolov
Cc: Mohammed Guller; user; Ravisankar Mani
Subject: Re: Spark performance

You can certainly query over 4 TB of data with Spark.  However, you will get an answer in minutes or hours, not in milliseconds or seconds.  OLTP databases are used for web applications, and typically return responses in milliseconds.  Analytic databases tend to operate on large data sets, and return responses in seconds, minutes or hours.  When running batch jobs over large data sets, Spark can be a replacement for analytic databases like Greenplum or Netezza.



On Sat, Jul 11, 2015 at 8:53 AM, Roman Sokolov <ol...@gmail.com> wrote:

Hello. Had the same question. What if I need to store 4-6 TB and do queries? Can't find any clue in the documentation.
On 11.07.2015 03:28, "Mohammed Guller" <mo...@glassbeam.com> wrote:
Hi Ravi,
First, neither Spark nor Spark SQL is a database. Both are compute engines, which need to be paired with a storage system. Second, they are designed for processing large distributed datasets. If you have only 100,000 records or even a million records, you don’t need Spark. An RDBMS will perform much better for that volume of data.

Mohammed

From: Ravisankar Mani [mailto:rravimr@gmail.com]
Sent: Friday, July 10, 2015 3:50 AM
To: user@spark.apache.org
Subject: Spark performance

Hi everyone,
I am planning to move from MS SQL Server to Spark. I am using around 50,000 to 100,000 (1 lakh) records.
Spark's performance is slow compared to MS SQL Server.

Which is the better system (Spark or SQL Server) for storing and retrieving around 50,000 to 100,000 records?
regards,
Ravi





Re: Spark performance

Posted by David Mitchell <jd...@gmail.com>.
You can certainly query over 4 TB of data with Spark.  However, you will
get an answer in minutes or hours, not in milliseconds or seconds.  OLTP
databases are used for web applications, and typically return responses in
milliseconds.  Analytic databases tend to operate on large data sets, and
return responses in seconds, minutes or hours.  When running batch jobs
over large data sets, Spark can be a replacement for analytic databases
like Greenplum or Netezza.
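
As a sketch of that batch style of processing, here is a hypothetical standalone Spark job (the class name, paths, and schema are invented for illustration; assumes Spark 1.4-era APIs and submission via spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // A long-running batch job: scan a large dataset, write back an aggregate.
    object DailyRollup {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("DailyRollup"))
        val sqlContext = new SQLContext(sc)

        sqlContext.read.parquet("hdfs:///warehouse/clickstream")
          .registerTempTable("clickstream")

        // Expect minutes, not milliseconds, for a multi-TB scan.
        sqlContext.sql("SELECT day, COUNT(*) AS hits FROM clickstream GROUP BY day")
          .write.parquet("hdfs:///warehouse/daily_rollup")

        sc.stop()
      }
    }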



On Sat, Jul 11, 2015 at 8:53 AM, Roman Sokolov <ol...@gmail.com> wrote:

> Hello. Had the same question. What if I need to store 4-6 TB and do
> queries? Can't find any clue in the documentation.
> On 11.07.2015 03:28, "Mohammed Guller" <mo...@glassbeam.com> wrote:
>
>>  Hi Ravi,
>>
>> First, neither Spark nor Spark SQL is a database. Both are compute
>> engines, which need to be paired with a storage system. Second, they are
>> designed for processing large distributed datasets. If you have only
>> 100,000 records or even a million records, you don’t need Spark. An RDBMS
>> will perform much better for that volume of data.
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* Ravisankar Mani [mailto:rravimr@gmail.com]
>> *Sent:* Friday, July 10, 2015 3:50 AM
>> *To:* user@spark.apache.org
>> *Subject:* Spark performance
>>
>>
>>
>> Hi everyone,
>>
>> I am planning to move from MS SQL Server to Spark. I am using around
>> 50,000 to 100,000 (1 lakh) records.
>>
>> Spark's performance is slow compared to MS SQL Server.
>>
>> Which is the better system (Spark or SQL Server) for storing and
>> retrieving around 50,000 to 100,000 records?
>>
>> regards,
>>
>> Ravi
>>
>>
>>
>



RE: Spark performance

Posted by Roman Sokolov <ol...@gmail.com>.
Hello. Had the same question. What if I need to store 4-6 TB and do
queries? Can't find any clue in the documentation.
On 11.07.2015 03:28, "Mohammed Guller" <mo...@glassbeam.com> wrote:

>  Hi Ravi,
>
> First, neither Spark nor Spark SQL is a database. Both are compute
> engines, which need to be paired with a storage system. Second, they are
> designed for processing large distributed datasets. If you have only
> 100,000 records or even a million records, you don’t need Spark. An RDBMS
> will perform much better for that volume of data.
>
>
>
> Mohammed
>
>
>
> *From:* Ravisankar Mani [mailto:rravimr@gmail.com]
> *Sent:* Friday, July 10, 2015 3:50 AM
> *To:* user@spark.apache.org
> *Subject:* Spark performance
>
>
>
> Hi everyone,
>
> I am planning to move from MS SQL Server to Spark. I am using around 50,000
> to 100,000 (1 lakh) records.
>
> Spark's performance is slow compared to MS SQL Server.
>
> Which is the better system (Spark or SQL Server) for storing and retrieving
> around 50,000 to 100,000 records?
>
> regards,
>
> Ravi
>
>
>

Re: Spark performance

Posted by sa...@gmail.com.
Ravi


Spark (or, for that matter, big data solutions like Hive) is suited for large analytical loads, where “scaling up” starts to pale in comparison to “scaling out” with regard to performance, versatility (types of data), and cost. Without going into the details of MS SQL Server's architecture, there is an inflection point in terms of cost (licensing), performance, and maintainability where an open-source commodity platform starts to become viable, albeit sometimes at the expense of slower performance. With 1 million records, I am not sure you are reaching that point, so a Spark cluster is hard to justify. So why are you planning to move away from MS SQL Server, with Spark as the destination platform?


You said “Spark performance” is slow compared to MS SQL Server. What kind of load are you running, and what kind of queries are you performing? There may be startup costs associated with the map side of the queries.


If you're testing to understand Spark, can you post what you are currently doing (queries, table structures, compression and storage optimizations)? That way, we could suggest optimizations, not to compare with MS SQL Server but to improve the Spark side of things.
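
For instance, the storage-side settings below are the kind of detail that helps; this is only a sketch under assumed names (the codec choice, partition column, and paths are illustrative, using Spark 1.4-era APIs):

    // Write Snappy-compressed Parquet, partitioned on a common filter column.
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
    sqlContext.read.json("hdfs:///raw/orders")
      .write
      .partitionBy("order_date")  // lets Spark prune partitions on date filters
      .parquet("hdfs:///warehouse/orders")

    // Cache a hot table in memory if it is queried repeatedly.
    sqlContext.read.parquet("hdfs:///warehouse/orders").registerTempTable("orders")
    sqlContext.cacheTable("orders")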


Again, to quote someone who answered earlier in the thread: what is your “use case”?


-Santosh

From: Jörn Franke
Sent: Saturday, July 11, 2015 8:20 PM
To: Mohammed Guller, Ravisankar Mani, user@spark.apache.org

Honestly, you are addressing this the wrong way: you do not seem to have a business case for changing, so why do you want to switch?

On Sat, Jul 11, 2015 at 3:28, Mohammed Guller <mo...@glassbeam.com> wrote:

Hi Ravi,

First, neither Spark nor Spark SQL is a database. Both are compute engines, which need to be paired with a storage system. Second, they are designed for processing large distributed datasets. If you have only 100,000 records or even a million records, you don’t need Spark. An RDBMS will perform much better for that volume of data.

Mohammed

From: Ravisankar Mani [mailto:rravimr@gmail.com]
Sent: Friday, July 10, 2015 3:50 AM
To: user@spark.apache.org
Subject: Spark performance
Hi everyone,


I am planning to move from MS SQL Server to Spark. I am using around 50,000 to 100,000 (1 lakh) records.
Spark's performance is slow compared to MS SQL Server.

Which is the better system (Spark or SQL Server) for storing and retrieving around 50,000 to 100,000 records?

regards,
Ravi

Re: Spark performance

Posted by Jörn Franke <jo...@gmail.com>.
Honestly, you are addressing this the wrong way: you do not seem to have a
business case for changing, so why do you want to switch?

On Sat, Jul 11, 2015 at 3:28, Mohammed Guller <mo...@glassbeam.com>
wrote:

>  Hi Ravi,
>
> First, neither Spark nor Spark SQL is a database. Both are compute
> engines, which need to be paired with a storage system. Second, they are
> designed for processing large distributed datasets. If you have only
> 100,000 records or even a million records, you don’t need Spark. An RDBMS
> will perform much better for that volume of data.
>
>
>
> Mohammed
>
>
>
> *From:* Ravisankar Mani [mailto:rravimr@gmail.com]
> *Sent:* Friday, July 10, 2015 3:50 AM
> *To:* user@spark.apache.org
> *Subject:* Spark performance
>
>
>
> Hi everyone,
>
> I am planning to move from MS SQL Server to Spark. I am using around 50,000
> to 100,000 (1 lakh) records.
>
> Spark's performance is slow compared to MS SQL Server.
>
> Which is the better system (Spark or SQL Server) for storing and retrieving
> around 50,000 to 100,000 records?
>
> regards,
>
> Ravi
>
>
>

RE: Spark performance

Posted by Mohammed Guller <mo...@glassbeam.com>.
Hi Ravi,
First, neither Spark nor Spark SQL is a database. Both are compute engines, which need to be paired with a storage system. Second, they are designed for processing large distributed datasets. If you have only 100,000 records or even a million records, you don’t need Spark. An RDBMS will perform much better for that volume of data.
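
To make the "compute engine plus storage system" point concrete, here is a minimal sketch of Spark SQL querying an existing RDBMS over JDBC (all connection details are made-up placeholders; assumes a sqlContext and the vendor's JDBC driver on the classpath):

    // Spark supplies the compute; the data stays in the external store.
    val orders = sqlContext.read.format("jdbc").options(Map(
      "url"      -> "jdbc:sqlserver://dbhost:1433;databaseName=sales",
      "dbtable"  -> "dbo.orders",
      "user"     -> "spark_reader",
      "password" -> "secret"
    )).load()

    orders.registerTempTable("orders")
    sqlContext.sql("SELECT COUNT(*) FROM orders").show()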

Mohammed

From: Ravisankar Mani [mailto:rravimr@gmail.com]
Sent: Friday, July 10, 2015 3:50 AM
To: user@spark.apache.org
Subject: Spark performance

Hi everyone,
I am planning to move from MS SQL Server to Spark. I am using around 50,000 to 100,000 (1 lakh) records.
Spark's performance is slow compared to MS SQL Server.

Which is the better system (Spark or SQL Server) for storing and retrieving around 50,000 to 100,000 records?
regards,
Ravi