Posted to user@spark.apache.org by onmstester onmstester <on...@zoho.com> on 2018/05/15 06:32:46 UTC

spark sql StackOverflow

Hi, 



I need to run some queries on a huge amount of input records. The input rate is 100K records/second.

A record is like (key1, key2, value) and the application should report occurrences of key1 == something && key2 == somethingElse.

The problem is that there are too many filters in my query: more than 3 thousand (key1, key2) pairs should be filtered.

I was simply putting 1 million records in a temp table each time and running an SQL query using Spark SQL on the temp table:

select * from mytemptable where (key1 == something && key2 == somethingElse) or (key1 == someOtherthing && key2 == someAnotherThing) or ... (3 thousand ors!!!)

And I encounter a StackOverflowError at ATNConfigSet.java line 178.



So I have two options, IMHO:

1. Put all the (key1, key2) filter pairs in another temp table and do a join between the two temp tables (see the sketch after this list), or

2. Use Spark Streaming, which I'm not familiar with, and I don't know whether it could handle 3K filters.
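
For example, option 1 could look roughly like this (just a sketch in Scala; the filterpairs table name and the sample data are made up for illustration):

import org.apache.spark.sql.SparkSession

object FilterPairJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("filter-pair-join").getOrCreate()
    import spark.implicits._

    // Incoming (key1, key2, value) records; in the real job these come from the input files.
    val records = Seq(("a", "x", 1L), ("b", "y", 2L)).toDF("key1", "key2", "value")
    records.createOrReplaceTempView("mytemptable")

    // The ~3000 (key1, key2) pairs to match; in the real job loaded from the database.
    val pairs = Seq(("a", "x"), ("c", "z")).toDF("key1", "key2")
    pairs.createOrReplaceTempView("filterpairs")

    // A join replaces the 3000-way OR, so the SQL text stays small.
    val matches = spark.sql(
      """SELECT t.key1, t.key2, COUNT(*) AS occurrences
        |FROM mytemptable t
        |JOIN filterpairs f ON t.key1 = f.key1 AND t.key2 = f.key2
        |GROUP BY t.key1, t.key2""".stripMargin)

    matches.show()
  }
}

Since the pairs table is tiny compared to the records, Spark should broadcast it, so the join becomes a cheap hash lookup per record.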



Which way do you suggest? What is the best solution for my problem, performance-wise?



Thanks in advance



Re: spark sql StackOverflow

Posted by Jörn Franke <jo...@gmail.com>.
3000 filters don't look reasonable. This is very difficult to test and verify, as well as impossible to maintain.
Could it be that your filters are really another table that you should join with?
The example is a bit artificial, which makes the underlying business case hard to understand. Can you provide a more realistic example?

Maybe a Bloom filter or something similar could make sense for you? Basically, you want to know whether the key pair is in a given set of pairs?
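
For example, something along these lines (a rough sketch only; it broadcasts the pair set as an exact hash set rather than a real Bloom filter, and the records/pairs names are purely illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object PairMembership {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pair-membership").getOrCreate()
    import spark.implicits._

    // Illustrative input; in practice loaded from your files or datastore.
    val records = Seq(("a", "x", 1L), ("b", "y", 2L)).toDF("key1", "key2", "value")
    val pairs: Set[(String, String)] = Set(("a", "x"), ("c", "z"))

    // Ship the ~3000 pairs to every executor once; a Bloom filter would be a
    // more compact, approximate variant of the same membership test.
    val broadcastPairs = spark.sparkContext.broadcast(pairs)
    val inPairSet = udf((k1: String, k2: String) => broadcastPairs.value.contains((k1, k2)))

    records.filter(inPairSet($"key1", $"key2"))
           .groupBy("key1", "key2")
           .count()
           .show()
  }
}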

> On 15. May 2018, at 11:48, Alessandro Solimando <al...@gmail.com> wrote:
> 
> From the information you provided I would tackle this as a batch problem, because this way you have access to more sophisticated techniques and you have more flexibility (maybe HDFS and a SparkJob, but also think about a datastore offering good indexes for the kind of data types and values you have for your keys, and benefit from filter push-downs).
> 
> I personally use streaming only when real-time ingestion is needed.
> 
> Hth,
> Alessandro
> 
>> On 15 May 2018 at 09:11, onmstester onmstester <on...@zoho.com> wrote:
>> 
>> How many distinct key1 (resp. key2) values do you have? Are these values reasonably stable over time?
>> 
>> Less than 10 thousand, and these filters would change every 2-3 days. They would be written to and loaded from a database.
>> 
>> Are these records ingested in real-time or are they loaded from a datastore?
>> 
>> The records would be loaded from some text files that are copied into a directory over and over.
>> 
>> Are you suggesting that I don't need to use Spark Streaming?
>> 
>> 
>> 
>> ---- On Tue, 15 May 2018 11:26:42 +0430 Alessandro Solimando <al...@gmail.com> wrote ----
>> 
>> Hi,
>> I am not familiar with ATNConfigSet, but some thoughts that might help.
>> 
>> How many distinct key1 (resp. key2) values do you have? Are these values reasonably stable over time?
>> 
>> Are these records ingested in real-time or are they loaded from a datastore?
>> 
>> In the latter case the DB might be able to efficiently perform the filtering, especially if equipped with a proper index over key1/key2 (or a composite one).
>> 
>> In that case the filter push-down could be very effective (I didn't get whether you just need to count or do something more with the matching records).
>> 
>> Alternatively, you could try to group by (key1,key2), and then filter (it again depends on the kind of output you have in mind).
>> 
>> If the datastore/stream is distributed and supports partitioning, you could partition your records by either key1 or key2 (or key1+key2), so they are already "separated" and can be consumed more efficiently (e.g., the groupby could then be local to a single partition).
>> 
>> Best regards,
>> Alessandro
>> 
>> On 15 May 2018 at 08:32, onmstester onmstester <on...@zoho.com> wrote:
>> 
>> 
>> Hi, 
>> 
>> I need to run some queries on a huge amount of input records. The input rate is 100K records/second.
>> A record is like (key1, key2, value) and the application should report occurrences of key1 == something && key2 == somethingElse.
>> The problem is that there are too many filters in my query: more than 3 thousand (key1, key2) pairs should be filtered.
>> I was simply putting 1 million records in a temp table each time and running an SQL query using Spark SQL on the temp table:
>> select * from mytemptable where (key1 == something && key2 == somethingElse) or (key1 == someOtherthing && key2 == someAnotherThing) or ... (3 thousand ors!!!)
>> And I encounter a StackOverflowError at ATNConfigSet.java line 178.
>> 
>> So I have two options, IMHO:
>> 1. Put all the (key1, key2) filter pairs in another temp table and do a join between the two temp tables, or
>> 2. Use Spark Streaming, which I'm not familiar with, and I don't know whether it could handle 3K filters.
>> 
>> Which way do you suggest? What is the best solution for my problem, performance-wise?
>> 
>> Thanks in advance
>> 
>> 
> 

Re: spark sql StackOverflow

Posted by Alessandro Solimando <al...@gmail.com>.
From the information you provided I would tackle this as a batch problem,
because this way you have access to more sophisticated techniques and you
have more flexibility (maybe HDFS and a SparkJob, but also think about a
datastore offering good indexes for the kind of data types and values you
have for your keys, and benefit from filter push-downs).
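
Just to illustrate the push-down point, a hypothetical Scala snippet (the Parquet path and the column names are made up):

import org.apache.spark.sql.SparkSession

object PushDownExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown-example").getOrCreate()
    import spark.implicits._

    // With a columnar source such as Parquet, simple predicates are pushed
    // down to the reader instead of being applied after a full scan.
    val records = spark.read.parquet("/data/records")   // illustrative path
    val filtered = records.filter($"key1" === "something" && $"key2" === "somethingElse")

    // The pushed predicates show up as "PushedFilters" in the physical plan.
    filtered.explain(true)
  }
}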

I personally use streaming only when real-time ingestion is needed.

Hth,
Alessandro

On 15 May 2018 at 09:11, onmstester onmstester <on...@zoho.com> wrote:

>
> How many distinct key1 (resp. key2) values do you have? Are these values
> reasonably stable over time?
>
> Less than 10 thousand, and these filters would change every 2-3 days. They
> would be written to and loaded from a database.
>
> Are these records ingested in real-time or are they loaded from a
> datastore?
>
>
> The records would be loaded from some text files that are copied into a
> directory over and over.
>
> Are you suggesting that I don't need to use Spark Streaming?
>
>
>
> ---- On Tue, 15 May 2018 11:26:42 +0430 Alessandro Solimando
> <al...@gmail.com> wrote ----
>
> Hi,
> I am not familiar with ATNConfigSet, but some thoughts that might help.
>
> How many distinct key1 (resp. key2) values do you have? Are these values
> reasonably stable over time?
>
> Are these records ingested in real-time or are they loaded from a
> datastore?
>
> In the latter case the DB might be able to efficiently perform the
> filtering, especially if equipped with a proper index over key1/key2 (or a
> composite one).
>
> In that case the filter push-down could be very effective (I didn't get
> whether you just need to count or do something more with the matching records).
>
> Alternatively, you could try to group by (key1,key2), and then filter (it
> again depends on the kind of output you have in mind).
>
> If the datastore/stream is distributed and supports partitioning, you
> could partition your records by either key1 or key2 (or key1+key2), so they
> are already "separated" and can be consumed more efficiently (e.g., the
> groupby could then be local to a single partition).
>
> Best regards,
> Alessandro
>
> On 15 May 2018 at 08:32, onmstester onmstester <on...@zoho.com>
> wrote:
>
>
> Hi,
>
> I need to run some queries on a huge amount of input records. The input
> rate is 100K records/second.
> A record is like (key1, key2, value) and the application should report
> occurrences of key1 == something && key2 == somethingElse.
> The problem is that there are too many filters in my query: more than 3
> thousand (key1, key2) pairs should be filtered.
> I was simply putting 1 million records in a temp table each time and
> running an SQL query using Spark SQL on the temp table:
> select * from mytemptable where (key1 == something && key2 ==
> somethingElse) or (key1 == someOtherthing && key2 == someAnotherThing) or
> ... (3 thousand ors!!!)
> And I encounter a StackOverflowError at ATNConfigSet.java line 178.
>
> So I have two options, IMHO:
> 1. Put all the (key1, key2) filter pairs in another temp table and do a
> join between the two temp tables, or
> 2. Use Spark Streaming, which I'm not familiar with, and I don't know
> whether it could handle 3K filters.
>
> Which way do you suggest? What is the best solution for my problem,
> performance-wise?
>
> Thanks in advance
>
>
>
>

Re: spark sql StackOverflow

Posted by Alessandro Solimando <al...@gmail.com>.
Hi,
I am not familiar with ATNConfigSet, but some thoughts that might help.

How many distinct key1 (resp. key2) values do you have? Are these values
reasonably stable over time?

Are these records ingested in real-time or are they loaded from a datastore?

In the latter case the DB might be able to efficiently perform the
filtering, especially if equipped with a proper index over key1/key2 (or a
composite one).

In that case the filter push-down could be very effective (I didn't get
whether you just need to count or do something more with the matching records).

Alternatively, you could try to group by (key1,key2), and then filter (it
again depends on the kind of output you have in mind).

If the datastore/stream is distributed and supports partitioning, you could
partition your records by either key1 or key2 (or key1+key2), so they are
already "separated" and can be consumed more efficiently (e.g., the groupby
could then be local to a single partition).
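
As a rough sketch of the group-by idea combined with partitioning (Scala; the data is illustrative):

import org.apache.spark.sql.SparkSession

object GroupByPartitioned {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("groupby-partitioned").getOrCreate()
    import spark.implicits._

    val records = Seq(("a", "x", 1L), ("b", "y", 2L)).toDF("key1", "key2", "value")

    // Hash-partition by the keys so every (key1, key2) group is co-located;
    // the aggregation can then run locally within each partition without a
    // second shuffle.
    val counts = records
      .repartition($"key1", $"key2")
      .groupBy("key1", "key2")
      .count()

    counts.show()
  }
}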

Best regards,
Alessandro

On 15 May 2018 at 08:32, onmstester onmstester <on...@zoho.com> wrote:

> Hi,
>
> I need to run some queries on a huge amount of input records. The input
> rate is 100K records/second.
> A record is like (key1, key2, value) and the application should report
> occurrences of key1 == something && key2 == somethingElse.
> The problem is that there are too many filters in my query: more than 3
> thousand (key1, key2) pairs should be filtered.
> I was simply putting 1 million records in a temp table each time and
> running an SQL query using Spark SQL on the temp table:
> select * from mytemptable where (key1 == something && key2 ==
> somethingElse) or (key1 == someOtherthing && key2 == someAnotherThing) or
> ... (3 thousand ors!!!)
> And I encounter a StackOverflowError at ATNConfigSet.java line 178.
>
> So I have two options, IMHO:
> 1. Put all the (key1, key2) filter pairs in another temp table and do a
> join between the two temp tables, or
> 2. Use Spark Streaming, which I'm not familiar with, and I don't know
> whether it could handle 3K filters.
>
> Which way do you suggest? What is the best solution for my problem,
> performance-wise?
>
> Thanks in advance
>
>