Posted to user@spark.apache.org by Aliaksei Tsyvunchyk <at...@exadel.com> on 2015/10/20 23:29:22 UTC

Whether Spark is appropriate for our use case.

Hello all community members,

I need the opinion of people who have used Spark before and can share their experience, to help me choose a technical approach.
I have a project in the Proof of Concept phase, where we are evaluating whether Spark fits our use case.
Here is a brief description of the task.
We need to process a large amount of raw data to calculate ratings. The source data is textual: plain text lines representing different types of input records (we call them type 20, type 24, type 26, type 33, etc.).
To perform the calculations we have to join the different types of raw data: event records (the actual user actions), user description records (the person performing an action), and sometimes userGroup records (users grouped by some criteria).
All ratings are calculated on a daily basis, and our dataset can be partitioned by date (except, probably, the reference data).


So we tried the most obvious implementation: we parse the text files, store the data in Parquet format, and use Spark SQL to query the data and perform the calculations.
Experimenting with Spark SQL, I've noticed that query time grows proportionally with data size. Based on this I assume that Spark SQL performs a full scan of the records while servicing my SQL queries.
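
For illustration, a minimal sketch of the ingest step described above, assuming a Spark 1.5-era API (SparkContext + SQLContext); the tab delimiter, field names and paths are placeholders, not our actual format:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical line layout: the real "type 20/24/26/33" formats are not shown
// here, so the delimiter and field names are assumptions for illustration only.
case class EventRecord(eventDate: String, userId: Long, recordType: Int, payload: String)

object IngestEvents {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ingest-events"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Parse the raw text lines into typed records.
    val events = sc.textFile("hdfs:///raw/events/*.txt")
      .map(_.split('\t'))
      .filter(_.length >= 4)
      .map(f => EventRecord(f(0), f(1).toLong, f(2).toInt, f(3)))
      .toDF()

    // Partition the Parquet output by date, so a daily query reads only one directory.
    events.write
      .partitionBy("eventDate")
      .parquet("hdfs:///warehouse/events_parquet")

    sc.stop()
  }
}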

So here are the questions I’m trying to answer:
1.  Is the Parquet format appropriate for storing our data so that it can be queried efficiently? Or would it be more suitable to use some DB as storage, which could filter the data efficiently before it reaches the Spark processing engine?
2.  For now we assume that the joins we do for the calculations are slowing down execution. As an alternative we are considering denormalizing the data and joining it during the parsing phase, but this increases the data volume Spark has to handle (due to the duplicates we would get). Is that a valid approach? Would it be better to create two RDDs from the Parquet files, filter them, and then join them without involving Spark SQL (a rough sketch of the filter-then-join alternative follows this list)? Or are joins in Spark SQL fine, and should we look for the performance bottleneck elsewhere?
3.  Should we look more closely at Cloudera Impala? As far as I know it works over the same Parquet files, and I wonder whether it gives better performance for querying the data.
4.  90% of the results we need could be pre-calculated, since they do not change once a day of data has been loaded. So I think it makes sense to keep this pre-calculated data in some DB that gives the best performance when querying by key. Right now I am considering Cassandra for this purpose because of its scalability and performance. Could somebody suggest other options we should consider?
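
To make question 2 concrete, here is a rough sketch of the filter-then-join alternative using the DataFrame API instead of a raw SQL string; it assumes the date-partitioned layout from the earlier sketch, and the paths and column names (eventDate, userId) are hypothetical:

import org.apache.spark.sql.SQLContext

def dailyJoin(sqlContext: SQLContext, day: String) = {
  // Filter before joining, so only one date partition of events is scanned.
  val events = sqlContext.read.parquet("hdfs:///warehouse/events_parquet")
    .filter(s"eventDate = '$day'")
  val users = sqlContext.read.parquet("hdfs:///warehouse/users_parquet")

  // Join the already-narrowed DataFrames; groupBy/count stands in for the
  // real rating calculation.
  events.join(users, "userId")
    .groupBy("userId")
    .count()
}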

Thanks in advance.
Any opinion will be helpful and greatly appreciated.


Re: Whether Spark is appropriate for our use case.

Posted by Adrian Tanase <at...@adobe.com>.
Can you share your approximate data size? All of these should be valid use cases for Spark; I'm wondering if you are providing enough resources.

Also, do you have any expectations in terms of performance? What does "slow down" mean?

For this use case I would personally favor Parquet over a DB, and SQL/DataFrames over plain Spark RDDs, as you get benefits such as predicate pushdown.
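
For example, given an existing sqlContext and the hypothetical date-partitioned layout sketched earlier in the thread, you can check whether a filter is actually pushed down by looking at the physical plan:

val df = sqlContext.read.parquet("hdfs:///warehouse/events_parquet")

df.filter("eventDate = '2015-10-20' AND recordType = 24")
  .select("userId", "payload")
  .explain(true)   // the physical plan shows the Parquet scan and, in most versions, which filters were pushed down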

Sent from my iPhone



Re: Whether Spark is appropriate for our use case.

Posted by Igor Berman <ig...@gmail.com>.
1. If you join by some specific field (e.g. user id or account id or whatever),
you may try to partition the Parquet files by this field; the join will then
be more efficient.
2. Look at the Spark metrics to see the performance of the particular join:
how many partitions there are, what the shuffle size is, and so on. In general,
tune for shuffle performance (e.g. the shuffle memory fraction). A sketch of
both points follows below.
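
A minimal sketch of both points, assuming a Spark 1.5-era SQLContext; the
paths and the accountId join key are hypothetical stand-ins:

import org.apache.spark.sql.SQLContext

def prepareForJoins(sqlContext: SQLContext): Unit = {
  // Point 2: parallelism of SQL joins/aggregations; the default is 200 partitions.
  // The shuffle memory fraction mentioned above corresponds to
  // spark.shuffle.memoryFraction (pre-1.6), set on the SparkConf at application start.
  sqlContext.setConf("spark.sql.shuffle.partitions", "400")

  // Point 1: rewrite one side of the join partitioned by the join key, so queries
  // that filter on that key read only the matching directories. Only sensible
  // when the key's cardinality is modest.
  val users = sqlContext.read.parquet("hdfs:///warehouse/users_parquet")
  users.write
    .partitionBy("accountId")
    .parquet("hdfs:///warehouse/users_by_account")
}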
