Posted to user@spark.apache.org by Allan Richards <al...@gmail.com> on 2017/03/05 20:49:28 UTC

Spark Beginner: Correct approach for use case

Hi,

I am looking to use Spark to help execute queries against a reasonably
large dataset (1 billion rows). I'm a bit lost with all the different
libraries / add ons to Spark, and am looking for some direction as to what
I should look at / what may be helpful.

A couple of relevant points:
 - The dataset doesn't change over time.
 - There are a small number of applications (or queries I guess, but it's
more complicated than a single SQL query) that I want to run against it,
but the parameters to those queries will change all the time.
 - There is a logical grouping of the data per customer, which will
generally consist of 1-5000 rows.

I want each query to run as fast as possible (less than a second or two).
So ideally I want to keep all the records in memory, but distributed over
the different nodes in the cluster. Does this mean sharing a SparkContext
between queries, or is this where HDFS comes in, or is there something else
that would be better suited?

Or is there another overall approach I should look into for executing
queries in "real time" against a dataset this size?

Thanks,
Allan.

Re: Spark Beginner: Correct approach for use case

Posted by Allan Richards <al...@gmail.com>.
Thanks for the feedback everyone. We've had a look at different SQL-based
solutions, and have got good performance out of them, but some of the
reports we make can't be generated with a single bit of SQL. This is just
an investigation to see if Spark is a viable alternative.

I've got another question (I also asked on stack overflow
<http://stackoverflow.com/questions/42661350/spark-jobserver-very-large-task-size>).
Basically I'm seeing (proportionally) large task deserialisation times, and
am wondering why. I'm using jobserver and reusing an existing context and
RDD, so I believe all the data should be cached on the executors already. I
would have thought the serialised task just contains the query to execute
(the jar should also have been pushed across already?), the partition id
and the RDD id, so it should be very lightweight?
An example of timings:
Scheduler delay: 7ms
Task deserialization time: 19ms
Executor computing time: 4ms
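
For reference, a sketch of the kind of query I mean (plain Spark,
made-up names), alongside the closure-capture pattern that I understand
is the usual cause of large serialised tasks and which I've tried to
rule out:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object TaskSizeSketch {
  // Lightweight: the closure only captures one Long, so the serialised
  // task should be little more than the partition id and RDD id.
  def lightQuery(rows: RDD[(Long, Double)], customerId: Long): Long =
    rows.filter { case (id, _) => id == customerId }.count()

  // Heavyweight: the closure captures the whole lookup map, so every
  // task ships a copy of it - one way task sizes can quietly grow.
  def heavyQuery(rows: RDD[(Long, Double)],
                 lookup: Map[Long, String]): Long =
    rows.filter { case (id, _) => lookup.contains(id) }.count()

  // The usual fix for the second case is a broadcast variable.
  def broadcastQuery(sc: SparkContext,
                     rows: RDD[(Long, Double)],
                     lookup: Map[Long, String]): Long = {
    val bc = sc.broadcast(lookup)
    rows.filter { case (id, _) => bc.value.contains(id) }.count()
  }
}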

Thanks,
Allan.

On Mon, Mar 6, 2017 at 6:05 PM, Jörn Franke <jo...@gmail.com> wrote:

> I agree with the others that a dedicated NoSQL datastore can make sense.
> You should look at the lambda architecture paradigm. Keep in mind that more
> memory does not necessarily mean more performance; what matters is the
> right data structure for the queries of your users. Additionally, if your
> queries are executed over the whole dataset and you want answer times
> within 2 seconds, you should look at databases that do aggregations on
> samples of the data (cf. https://jornfranke.wordpress.com/2015/06/28/big-data-what-is-next-oltp-olap-predictive-analytics-sampling-and-probabilistic-databases).
> E.g. Hive has had a tablesample functionality for a long time.
>
> On 5 Mar 2017, at 21:49, Allan Richards <al...@gmail.com> wrote:
>
> Hi,
>
> I am looking to use Spark to help execute queries against a reasonably
> large dataset (1 billion rows). I'm a bit lost with all the different
> libraries / add ons to Spark, and am looking for some direction as to what
> I should look at / what may be helpful.
>
> A couple of relevant points:
>  - The dataset doesn't change over time.
>  - There are a small number of applications (or queries I guess, but it's
> more complicated than a single SQL query) that I want to run against it,
> but the parameters to those queries will change all the time.
>  - There is a logical grouping of the data per customer, which will
> generally consist of 1-5000 rows.
>
> I want each query to run as fast as possible (less than a second or two).
> So ideally I want to keep all the records in memory, but distributed over
> the different nodes in the cluster. Does this mean sharing a SparkContext
> between queries, or is this where HDFS comes in, or is there something else
> that would be better suited?
>
> Or is there another overall approach I should look into for executing
> queries in "real time" against a dataset this size?
>
> Thanks,
> Allan.
>
>

Re: Spark Beginner: Correct approach for use case

Posted by Jörn Franke <jo...@gmail.com>.
I agree with the others that a dedicated NoSQL datastore can make sense. You should look at the lambda architecture paradigm. Keep in mind that more memory does not necessarily mean more performance; what matters is the right data structure for the queries of your users. Additionally, if your queries are executed over the whole dataset and you want answer times within 2 seconds, you should look at databases that do aggregations on samples of the data (cf. https://jornfranke.wordpress.com/2015/06/28/big-data-what-is-next-oltp-olap-predictive-analytics-sampling-and-probabilistic-databases). E.g. Hive has had a tablesample functionality for a long time.
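
As a rough illustration only (Spark DataFrame API, invented paths and
column names), the sampling idea looks like this: aggregate over a small
sample instead of the full billion rows and accept an approximate answer
in exchange for speed.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object SamplingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sampling-sketch")
      .getOrCreate()

    val rows = spark.read.parquet("hdfs:///data/rows.parquet")

    // Aggregate over a 1% sample; Hive's TABLESAMPLE is the same idea
    // expressed at the table level.
    val approx = rows
      .sample(withReplacement = false, fraction = 0.01)
      .groupBy("customer_id")
      .agg(avg("value").as("approx_avg_value"))

    approx.show()
    spark.stop()
  }
}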

> On 5 Mar 2017, at 21:49, Allan Richards <al...@gmail.com> wrote:
> 
> Hi,
> 
> I am looking to use Spark to help execute queries against a reasonably large dataset (1 billion rows). I'm a bit lost with all the different libraries / add ons to Spark, and am looking for some direction as to what I should look at / what may be helpful.
> 
> A couple of relevant points:
>  - The dataset doesn't change over time. 
>  - There are a small number of applications (or queries I guess, but it's more complicated than a single SQL query) that I want to run against it, but the parameters to those queries will change all the time.
>  - There is a logical grouping of the data per customer, which will generally consist of 1-5000 rows.
> 
> I want each query to run as fast as possible (less than a second or two). So ideally I want to keep all the records in memory, but distributed over the different nodes in the cluster. Does this mean sharing a SparkContext between queries, or is this where HDFS comes in, or is there something else that would be better suited?
> 
> Or is there another overall approach I should look into for executing queries in "real time" against a dataset this size?
> 
> Thanks,
> Allan.

Re: Spark Beginner: Correct approach for use case

Posted by ayan guha <gu...@gmail.com>.
Any specific reason to choose Spark? It sounds like you have a
Write-Once-Read-Many-Times dataset, which is logically partitioned across
customers, sitting in some data store. And essentially you are looking for
a fast way to access it, and most likely you will use the same partition
key for querying the data. This is more of a database/NoSQL kind of use
case than Spark (which is more of a distributed processing engine, I reckon).
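
Just to illustrate the partition-key point (plain Spark, paths and column
names made up): even without a dedicated datastore, laying the data out by
the customer key means a keyed query only touches a small slice of it.

import org.apache.spark.sql.SparkSession

object PartitionByCustomerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("layout-sketch")
      .getOrCreate()

    // One-off: write the static dataset partitioned by the customer key.
    spark.read.parquet("hdfs:///data/raw")
      .write
      .partitionBy("customer_id")
      .parquet("hdfs:///data/by_customer")

    // Later: a filter on customer_id only reads that customer's
    // partition directory instead of scanning all billion rows.
    spark.read.parquet("hdfs:///data/by_customer")
      .filter("customer_id = 42")
      .show()

    spark.stop()
  }
}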

On Mon, Mar 6, 2017 at 11:56 AM, Subhash Sriram <su...@gmail.com>
wrote:

> Hi Allan,
>
> Where is the data stored right now? If it's in a relational database, and
> you are using Spark with Hadoop, I feel like it would make sense to import
> the data into HDFS, just because it would be faster to access the data.
> You could use Sqoop to do that.
>
> In terms of having a long running Spark context, you could look into the
> Spark job server:
>
> https://github.com/spark-jobserver/spark-jobserver/blob/master/README.md
>
> It would allow you to cache all the data in memory and then accept queries
> via REST API calls. You would have to refresh your cache as the data
> changes of course, but it sounds like that is not very often.
>
> In terms of running the queries themselves, I would think you could use
> Spark SQL and the DataFrame/DataSet API, which is built into Spark. You
> will have to think about the best way to partition your data, depending on
> the queries themselves.
>
> Here is a link to the Spark SQL docs:
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> I hope that helps, and I'm sure other folks will have some helpful advice
> as well.
>
> Thanks,
> Subhash
>
> Sent from my iPhone
>
> On Mar 5, 2017, at 3:49 PM, Allan Richards <al...@gmail.com>
> wrote:
>
> Hi,
>
> I am looking to use Spark to help execute queries against a reasonably
> large dataset (1 billion rows). I'm a bit lost with all the different
> libraries / add ons to Spark, and am looking for some direction as to what
> I should look at / what may be helpful.
>
> A couple of relevant points:
>  - The dataset doesn't change over time.
>  - There are a small number of applications (or queries I guess, but it's
> more complicated than a single SQL query) that I want to run against it,
> but the parameters to those queries will change all the time.
>  - There is a logical grouping of the data per customer, which will
> generally consist of 1-5000 rows.
>
> I want each query to run as fast as possible (less than a second or two).
> So ideally I want to keep all the records in memory, but distributed over
> the different nodes in the cluster. Does this mean sharing a SparkContext
> between queries, or is this where HDFS comes in, or is there something else
> that would be better suited?
>
> Or is there another overall approach I should look into for executing
> queries in "real time" against a dataset this size?
>
> Thanks,
> Allan.
>
>


-- 
Best Regards,
Ayan Guha

Re: Spark Beginner: Correct approach for use case

Posted by Subhash Sriram <su...@gmail.com>.
Hi Allan,

Where is the data stored right now? If it's in a relational database, and you are using Spark with Hadoop, I feel like it would make sense to import the data into HDFS, just because it would be faster to access the data. You could use Sqoop to do that.

In terms of having a long running Spark context, you could look into the Spark job server:

https://github.com/spark-jobserver/spark-jobserver/blob/master/README.md

It would allow you to cache all the data in memory and then accept queries via REST API calls. You would have to refresh your cache as the data changes of course, but it sounds like that is not very often.

In terms of running the queries themselves, I would think you could use Spark SQL and the DataFrame/DataSet API, which is built into Spark. You will have to think about the best way to partition your data, depending on the queries themselves.
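
For example (just a sketch, with a made-up schema and paths), you could
cache the table once in the long-running context and then answer each
incoming request with a parameterized Spark SQL query:

import org.apache.spark.sql.{DataFrame, SparkSession}

object CachedSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cached-sql-sketch")
      .getOrCreate()

    // Load once from HDFS and keep the table in memory across queries.
    spark.read.parquet("hdfs:///data/by_customer")
      .createOrReplaceTempView("rows")
    spark.catalog.cacheTable("rows")
    spark.table("rows").count() // materialize the cache up front

    // Each REST call would then map to something like this.
    def report(customerId: Long): DataFrame =
      spark.sql(s"SELECT * FROM rows WHERE customer_id = $customerId")

    report(42L).show()
    spark.stop()
  }
}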

Here is a link to the Spark SQL docs:

http://spark.apache.org/docs/latest/sql-programming-guide.html

I hope that helps, and I'm sure other folks will have some helpful advice as well.

Thanks,
Subhash 

Sent from my iPhone

> On Mar 5, 2017, at 3:49 PM, Allan Richards <al...@gmail.com> wrote:
> 
> Hi,
> 
> I am looking to use Spark to help execute queries against a reasonably large dataset (1 billion rows). I'm a bit lost with all the different libraries / add ons to Spark, and am looking for some direction as to what I should look at / what may be helpful.
> 
> A couple of relevant points:
>  - The dataset doesn't change over time. 
>  - There are a small number of applications (or queries I guess, but it's more complicated than a single SQL query) that I want to run against it, but the parameters to those queries will change all the time.
>  - There is a logical grouping of the data per customer, which will generally consist of 1-5000 rows.
> 
> I want each query to run as fast as possible (less than a second or two). So ideally I want to keep all the records in memory, but distributed over the different nodes in the cluster. Does this mean sharing a SparkContext between queries, or is this where HDFS comes in, or is there something else that would be better suited?
> 
> Or is there another overall approach I should look into for executing queries in "real time" against a dataset this size?
> 
> Thanks,
> Allan.