Posted to user@spark.apache.org by Arnaud Bailly <ar...@gmail.com> on 2016/07/07 10:18:09 UTC

Multiple aggregations over streaming dataframes

Hello,

I understand multiple aggregations over streaming dataframes are not
currently supported in Spark 2.0. Is there a workaround? Off the top of
my head I could think of a two-stage approach (sketched below):
 - first query writes output to disk/memory using "complete" mode
 - second query reads from this output
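
Roughly, a minimal Scala sketch of what I have in mind (the socket source,
port and table name are only placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("two-stage-agg").getOrCreate()
    import spark.implicits._

    // Stage 1: streaming aggregation, materialised via the memory sink
    // in "complete" mode.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val wordCounts = lines
      .select(explode(split($"value", " ")).as("word"))
      .groupBy($"word")
      .count()

    val stage1 = wordCounts.writeStream
      .outputMode("complete")
      .format("memory")
      .queryName("word_counts")   // exposed as a queryable in-memory table
      .start()

    // Stage 2: a second aggregation over the materialised output,
    // re-run (as a batch query) whenever fresh results are needed.
    val countsOfCounts = spark.table("word_counts")
      .groupBy($"count")
      .agg(count($"word").as("num_words"))
    countsOfCounts.show()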

Does this make sense?

Furthermore, I would like to understand what technical hurdles are
preventing Spark SQL from implementing multiple aggregations right now?

Thanks,
-- 
Arnaud Bailly

twitter: abailly
skype: arnaud-bailly
linkedin: http://fr.linkedin.com/in/arnaudbailly/

Re: Multiple aggregations over streaming dataframes

Posted by Arnaud Bailly <ar...@gmail.com>.
Thanks for your answers. I know Kafka's model, but I would rather avoid
having to set up both Spark and Kafka to handle my use case. I wonder
if it might be possible to handle that using Spark's standard streams?

-- 
Arnaud Bailly

twitter: abailly
skype: arnaud-bailly
linkedin: http://fr.linkedin.com/in/arnaudbailly/

On Fri, Jul 8, 2016 at 12:00 AM, Andy Davidson <
Andy@santacruzintegration.com> wrote:

> Kafka has an interesting model that might be applicable.
>
> You can think of Kafka as enabling a queue system. Writers are called
> producers, and readers are called consumers. The server is called a broker.
> A “topic” is like a named queue.
>
> Producers are independent. They can write to a “topic” at will. Consumers
> (i.e. your nested aggregates) need to be independent of each other and the
> broker. The broker receives data from producers and stores it using memory
> and disk. Consumers read from the broker and maintain the cursor. Because
> the client maintains the cursor, one consumer cannot impact other producers
> and consumers.
>
> I would think the tricky part for Spark would be to know when the data can
> be deleted. In the Kafka world each topic is allowed to define a TTL SLA,
> i.e. the consumer must read the data within a limited window of time.
>
> Andy
>
> From: Michael Armbrust <mi...@databricks.com>
> Date: Thursday, July 7, 2016 at 2:31 PM
> To: Arnaud Bailly <ar...@gmail.com>
> Cc: Sivakumaran S <si...@me.com>, "user @spark" <
> user@spark.apache.org>
> Subject: Re: Multiple aggregations over streaming dataframes
>
> We are planning to address this issue in the future.
>
> At a high level, we'll have to add a delta mode so that updates can be
> communicated from one operator to the next.
>
> On Thu, Jul 7, 2016 at 8:59 AM, Arnaud Bailly <ar...@gmail.com>
> wrote:
>
>> Indeed. But nested aggregation does not work with Structured Streaming,
>> that's the point. I would like to know if there is a workaround, or what's
>> the plan regarding this feature, which seems quite useful to me. If the
>> implementation is not overly complex and it is just a matter of manpower,
>> I am fine with devoting some time to it.
>>
>>
>>
>> --
>> Arnaud Bailly
>>
>> twitter: abailly
>> skype: arnaud-bailly
>> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>>
>> On Thu, Jul 7, 2016 at 2:17 PM, Sivakumaran S <si...@me.com>
>> wrote:
>>
>>> Arnaud,
>>>
>>> You could aggregate the first table and then merge it with the second
>>> table (assuming that they are similarly structured) and then carry out the
>>> second aggregation. Unless the data is very large, I don’t see why you
>>> should persist it to disk. IMO, nested aggregation is more elegant and
>>> readable than a complex single stage.
>>>
>>> Regards,
>>>
>>> Sivakumaran
>>>
>>>
>>>
>>> On 07-Jul-2016, at 1:06 PM, Arnaud Bailly <ar...@gmail.com>
>>> wrote:
>>>
>>> It's aggregation at multiple levels in a query: first do some
>>> aggregation on one table, then join with another table and do a second
>>> aggregation. I could probably rewrite the query in such a way that it does
>>> aggregation in one pass but that would obfuscate the purpose of the various
>>> stages.
>>> On 7 Jul 2016 at 12:55, "Sivakumaran S" <si...@me.com> wrote:
>>>
>>>> Hi Arnaud,
>>>>
>>>> Sorry for the doubt, but what exactly is multiple aggregation? What is
>>>> the use case?
>>>>
>>>> Regards,
>>>>
>>>> Sivakumaran
>>>>
>>>>
>>>> On 07-Jul-2016, at 11:18 AM, Arnaud Bailly <ar...@gmail.com>
>>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I understand multiple aggregations over streaming dataframes are not
>>>> currently supported in Spark 2.0. Is there a workaround? Off the top of
>>>> my head I could think of a two-stage approach:
>>>>  - first query writes output to disk/memory using "complete" mode
>>>>  - second query reads from this output
>>>>
>>>> Does this make sense?
>>>>
>>>> Furthermore, I would like to understand what technical hurdles are
>>>> preventing Spark SQL from implementing multiple aggregations right
>>>> now?
>>>>
>>>> Thanks,
>>>> --
>>>> Arnaud Bailly
>>>>
>>>> twitter: abailly
>>>> skype: arnaud-bailly
>>>> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>>>>
>>>>
>>>>
>>>
>>
>

Re: Multiple aggregations over streaming dataframes

Posted by Andy Davidson <An...@SantaCruzIntegration.com>.
Kafka has an interesting model that might be applicable.

You can think of Kafka as enabling a queue system. Writers are called
producers, and readers are called consumers. The server is called a broker.
A “topic” is like a named queue.

Producers are independent. They can write to a “topic” at will. Consumers
(i.e. your nested aggregates) need to be independent of each other and the
broker. The broker receives data from producers and stores it using memory
and disk. Consumers read from the broker and maintain the cursor. Because
the client maintains the cursor, one consumer cannot impact other producers
and consumers.

I would think the tricky part for Spark would be to know when the data can
be deleted. In the Kafka world each topic is allowed to define a TTL SLA,
i.e. the consumer must read the data within a limited window of time.
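
For concreteness, a minimal Scala sketch of that model using the plain Kafka
client (the broker address, topic name and group id below are made up; the
topic's retention.ms config is what plays the TTL role):

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.collection.JavaConverters._

    // Producer side: writes to a named topic, independently of any reader.
    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "localhost:9092")
    producerProps.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](producerProps)
    producer.send(new ProducerRecord[String, String]("events", "someKey", "someValue"))
    producer.close()

    // Consumer side: each consumer group keeps its own cursor (offset),
    // so one reader cannot affect producers or other readers.
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "localhost:9092")
    consumerProps.put("group.id", "aggregate-1")
    consumerProps.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](consumerProps)
    consumer.subscribe(Collections.singletonList("events"))
    // Records older than the topic's retention window may already be gone.
    val records = consumer.poll(1000)
    records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
    consumer.close()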

Andy

From:  Michael Armbrust <mi...@databricks.com>
Date:  Thursday, July 7, 2016 at 2:31 PM
To:  Arnaud Bailly <ar...@gmail.com>
Cc:  Sivakumaran S <si...@me.com>, "user @spark"
<us...@spark.apache.org>
Subject:  Re: Multiple aggregations over streaming dataframes

> We are planning to address this issue in the future.
> 
> At a high level, we'll have to add a delta mode so that updates can be
> communicated from one operator to the next.
> 
> On Thu, Jul 7, 2016 at 8:59 AM, Arnaud Bailly <ar...@gmail.com> wrote:
>> Indeed. But nested aggregation does not work with Structured Streaming,
>> that's the point. I would like to know if there is a workaround, or what's the
>> plan regarding this feature, which seems quite useful to me. If the
>> implementation is not overly complex and it is just a matter of manpower, I
>> am fine with devoting some time to it.
>> 
>> 
>> 
>> -- 
>> Arnaud Bailly
>> 
>> twitter: abailly
>> skype: arnaud-bailly
>> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>> 
>> On Thu, Jul 7, 2016 at 2:17 PM, Sivakumaran S <si...@me.com> wrote:
>>> Arnaud,
>>> 
>>> You could aggregate the first table and then merge it with the second table
>>> (assuming that they are similarly structured) and then carry out the second
>>> aggregation. Unless the data is very large, I don’t see why you should
>>> persist it to disk. IMO, nested aggregation is more elegant and readable
>>> than a complex single stage.
>>> 
>>> Regards,
>>> 
>>> Sivakumaran
>>> 
>>> 
>>> 
>>>> On 07-Jul-2016, at 1:06 PM, Arnaud Bailly <ar...@gmail.com> wrote:
>>>> 
>>>> It's aggregation at multiple levels in a query: first do some aggregation
>>>> on one table, then join with another table and do a second aggregation. I
>>>> could probably rewrite the query in such a way that it does aggregation in
>>>> one pass but that would obfuscate the purpose of the various stages.
>>>> 
>>>> On 7 Jul 2016 at 12:55, "Sivakumaran S" <si...@me.com> wrote:
>>>>> Hi Arnaud,
>>>>> 
>>>>> Sorry for the doubt, but what exactly is multiple aggregation? What is the
>>>>> use case?
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Sivakumaran
>>>>> 
>>>>> 
>>>>>> On 07-Jul-2016, at 11:18 AM, Arnaud Bailly <ar...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> I understand multiple aggregations over streaming dataframes are not
>>>>>> currently supported in Spark 2.0. Is there a workaround? Off the top
>>>>>> of my head I could think of a two-stage approach:
>>>>>>  - first query writes output to disk/memory using "complete" mode
>>>>>>  - second query reads from this output
>>>>>> 
>>>>>> Does this make sense?
>>>>>> 
>>>>>> Furthermore, I would like to understand what technical hurdles are
>>>>>> preventing Spark SQL from implementing multiple aggregations
>>>>>> right now?
>>>>>> 
>>>>>> Thanks,
>>>>>> -- 
>>>>>> Arnaud Bailly
>>>>>> 
>>>>>> twitter: abailly
>>>>>> skype: arnaud-bailly
>>>>>> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>>>>> 
>>> 
>> 
> 



Re: Multiple aggregations over streaming dataframes

Posted by Michael Armbrust <mi...@databricks.com>.
We are planning to address this issue in the future.

At a high level, we'll have to add a delta mode so that updates can be
communicated from one operator to the next.

On Thu, Jul 7, 2016 at 8:59 AM, Arnaud Bailly <ar...@gmail.com>
wrote:

> Indeed. But nested aggregation does not work with Structured Streaming,
> that's the point. I would like to know if there is a workaround, or what's
> the plan regarding this feature, which seems quite useful to me. If the
> implementation is not overly complex and it is just a matter of manpower,
> I am fine with devoting some time to it.
>
>
>
> --
> Arnaud Bailly
>
> twitter: abailly
> skype: arnaud-bailly
> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>
> On Thu, Jul 7, 2016 at 2:17 PM, Sivakumaran S <si...@me.com> wrote:
>
>> Arnaud,
>>
>> You could aggregate the first table and then merge it with the second
>> table (assuming that they are similarly structured) and then carry out the
>> second aggregation. Unless the data is very large, I don’t see why you
>> should persist it to disk. IMO, nested aggregation is more elegant and
>> readable than a complex single stage.
>>
>> Regards,
>>
>> Sivakumaran
>>
>>
>>
>> On 07-Jul-2016, at 1:06 PM, Arnaud Bailly <ar...@gmail.com> wrote:
>>
>> It's aggregation at multiple levels in a query: first do some aggregation
>> on one table, then join with another table and do a second aggregation. I
>> could probably rewrite the query in such a way that it does aggregation in
>> one pass but that would obfuscate the purpose of the various stages.
>> On 7 Jul 2016 at 12:55, "Sivakumaran S" <si...@me.com> wrote:
>>
>>> Hi Arnaud,
>>>
>>> Sorry for the doubt, but what exactly is multiple aggregation? What is
>>> the use case?
>>>
>>> Regards,
>>>
>>> Sivakumaran
>>>
>>>
>>> On 07-Jul-2016, at 11:18 AM, Arnaud Bailly <ar...@gmail.com>
>>> wrote:
>>>
>>> Hello,
>>>
>>> I understand multiple aggregations over streaming dataframes are not
>>> currently supported in Spark 2.0. Is there a workaround? Off the top of
>>> my head I could think of a two-stage approach:
>>>  - first query writes output to disk/memory using "complete" mode
>>>  - second query reads from this output
>>>
>>> Does this make sense?
>>>
>>> Furthermore, I would like to understand what technical hurdles are
>>> preventing Spark SQL from implementing multiple aggregations right
>>> now?
>>>
>>> Thanks,
>>> --
>>> Arnaud Bailly
>>>
>>> twitter: abailly
>>> skype: arnaud-bailly
>>> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>>>
>>>
>>>
>>
>

Re: Multiple aggregations over streaming dataframes

Posted by Arnaud Bailly <ar...@gmail.com>.
Indeed. But nested aggregation does not work with Structured Streaming,
that's the point. I would like to know if there is a workaround, or what's
the plan regarding this feature, which seems quite useful to me. If the
implementation is not overly complex and it is just a matter of manpower,
I am fine with devoting some time to it.



-- 
Arnaud Bailly

twitter: abailly
skype: arnaud-bailly
linkedin: http://fr.linkedin.com/in/arnaudbailly/

On Thu, Jul 7, 2016 at 2:17 PM, Sivakumaran S <si...@me.com> wrote:

> Arnaud,
>
> You could aggregate the first table and then merge it with the second
> table (assuming that they are similarly structured) and then carry out the
> second aggregation. Unless the data is very large, I don’t see why you
> should persist it to disk. IMO, nested aggregation is more elegant and
> readable than a complex single stage.
>
> Regards,
>
> Sivakumaran
>
>
>
> On 07-Jul-2016, at 1:06 PM, Arnaud Bailly <ar...@gmail.com> wrote:
>
> It's aggregation at multiple levels in a query: first do some aggregation
> on one table, then join with another table and do a second aggregation. I
> could probably rewrite the query in such a way that it does aggregation in
> one pass but that would obfuscate the purpose of the various stages.
> On 7 Jul 2016 at 12:55, "Sivakumaran S" <si...@me.com> wrote:
>
>> Hi Arnaud,
>>
>> Sorry for the doubt, but what exactly is multiple aggregation? What is
>> the use case?
>>
>> Regards,
>>
>> Sivakumaran
>>
>>
>> On 07-Jul-2016, at 11:18 AM, Arnaud Bailly <ar...@gmail.com>
>> wrote:
>>
>> Hello,
>>
>> I understand multiple aggregations over streaming dataframes are not
>> currently supported in Spark 2.0. Is there a workaround? Off the top of
>> my head I could think of a two-stage approach:
>>  - first query writes output to disk/memory using "complete" mode
>>  - second query reads from this output
>>
>> Does this make sense?
>>
>> Furthermore, I would like to understand what technical hurdles are
>> preventing Spark SQL from implementing multiple aggregations right
>> now?
>>
>> Thanks,
>> --
>> Arnaud Bailly
>>
>> twitter: abailly
>> skype: arnaud-bailly
>> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>>
>>
>>
>

Re: Multiple aggregations over streaming dataframes

Posted by Sivakumaran S <si...@me.com>.
Arnaud,

You could aggregate the first table and then merge it with the second table (assuming that they are similarly structured) and then carry out the second aggregation. Unless the data is very large, I don’t see why you should persist it to disk. IMO, nested aggregation is more elegant and readable than a complex single stage.
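
Something along these lines, as a rough Scala sketch (the column names,
schema and the union are only an illustration of what I mean by merging):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // df1 and df2 are assumed to share the same shape: (region, amount).
    def twoStepAggregation(df1: DataFrame, df2: DataFrame): DataFrame = {
      // First aggregation on the first table.
      val agg1 = df1.groupBy("region").agg(sum("amount").as("amount"))

      // Merge it with the (similarly structured) second table...
      val merged = agg1.union(df2.select("region", "amount"))

      // ...and carry out the second aggregation on the merged result.
      merged.groupBy("region").agg(sum("amount").as("total_amount"))
    }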

Regards,

Sivakumaran


> On 07-Jul-2016, at 1:06 PM, Arnaud Bailly <ar...@gmail.com> wrote:
> 
> It's aggregation at multiple levels in a query: first do some aggregation on one table, then join with another table and do a second aggregation. I could probably rewrite the query in such a way that it does aggregation in one pass but that would obfuscate the purpose of the various stages.
> 
> On 7 Jul 2016 at 12:55, "Sivakumaran S" <siva.kumaran@me.com> wrote:
> Hi Arnaud,
> 
> Sorry for the doubt, but what exactly is multiple aggregation? What is the use case?
> 
> Regards,
> 
> Sivakumaran
> 
> 
>> On 07-Jul-2016, at 11:18 AM, Arnaud Bailly <arnaud.oqube@gmail.com> wrote:
>> 
>> Hello,
>> 
>> I understand multiple aggregations over streaming dataframes are not currently supported in Spark 2.0. Is there a workaround? Off the top of my head I could think of a two-stage approach:
>>  - first query writes output to disk/memory using "complete" mode
>>  - second query reads from this output
>> 
>> Does this make sense?
>> 
>> Furthermore, I would like to understand what technical hurdles are preventing Spark SQL from implementing multiple aggregations right now?
>> 
>> Thanks,
>> -- 
>> Arnaud Bailly
>> 
>> twitter: abailly
>> skype: arnaud-bailly
>> linkedin: http://fr.linkedin.com/in/arnaudbailly/


Re: Multiple aggregations over streaming dataframes

Posted by Arnaud Bailly <ar...@gmail.com>.
It's aggregation at multiple levels in a query: first do some aggregation
on one table, then join with another table and do a second aggregation. I
could probably rewrite the query in such a way that it does aggregation in
one pass but that would obfuscate the purpose of the various stages.
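
In DataFrame terms, the shape of the query is roughly the following Scala
sketch (table and column names are made up; only the structure matters):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    def multiLevelAggregation(events: DataFrame, users: DataFrame): DataFrame = {
      // First aggregation on one table...
      val perUser = events.groupBy("userId").agg(count(lit(1)).as("nbEvents"))

      // ...then a join with another table...
      val joined = perUser.join(users, Seq("userId"))

      // ...and a second aggregation on top, which is what Structured
      // Streaming currently rejects when the inputs are streaming DataFrames.
      joined.groupBy("country").agg(avg("nbEvents").as("avgEventsPerUser"))
    }
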
On 7 Jul 2016 at 12:55, "Sivakumaran S" <si...@me.com> wrote:

> Hi Arnaud,
>
> Sorry for the doubt, but what exactly is multiple aggregation? What is the
> use case?
>
> Regards,
>
> Sivakumaran
>
>
> On 07-Jul-2016, at 11:18 AM, Arnaud Bailly <ar...@gmail.com> wrote:
>
> Hello,
>
> I understand multiple aggregations over streaming dataframes are not
> currently supported in Spark 2.0. Is there a workaround? Off the top of
> my head I could think of a two-stage approach:
>  - first query writes output to disk/memory using "complete" mode
>  - second query reads from this output
>
> Does this make sense?
>
> Furthermore, I would like to understand what technical hurdles are
> preventing Spark SQL from implementing multiple aggregations right
> now?
>
> Thanks,
> --
> Arnaud Bailly
>
> twitter: abailly
> skype: arnaud-bailly
> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>
>
>

Re: Multiple aggregations over streaming dataframes

Posted by Sivakumaran S <si...@me.com>.
Hi Arnaud,

Sorry for the doubt, but what exactly is multiple aggregation? What is the use case?

Regards,

Sivakumaran


> On 07-Jul-2016, at 11:18 AM, Arnaud Bailly <ar...@gmail.com> wrote:
> 
> Hello,
> 
> I understand multiple aggregations over streaming dataframes are not currently supported in Spark 2.0. Is there a workaround? Off the top of my head I could think of a two-stage approach:
>  - first query writes output to disk/memory using "complete" mode
>  - second query reads from this output
> 
> Does this make sense?
> 
> Furthermore, I would like to understand what technical hurdles are preventing Spark SQL from implementing multiple aggregations right now?
> 
> Thanks,
> -- 
> Arnaud Bailly
> 
> twitter: abailly
> skype: arnaud-bailly
> linkedin: http://fr.linkedin.com/in/arnaudbailly/