Posted to dev@streampipes.apache.org by Philipp Zehnder <ze...@apache.org> on 2020/01/16 22:14:57 UTC

[DISCUSS] Processor to synchronize two data streams

Hi all,

I’d like to discuss how we could design a processor to merge two data streams. 
We have already had several versions of this component in the past, but none of them was completely satisfactory.

I would suggest two different processors for two common use cases:
The first one appends a label (e.g., for machine learning) to the data stream. The processor has two inputs: one with the sensor events (potentially at a high frequency) and one with the label information (usually at a much lower event frequency than the sensor events). The processor enriches the sensor stream with the selected properties of the label stream.
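To illustrate the idea, here is a rough, hypothetical Java sketch (all class and method names are made up, this is not a proposed API): the processor simply remembers the latest label event and copies its properties into every sensor event.

import java.util.HashMap;
import java.util.Map;

public class LabelEnricherSketch {

    // Latest label event; volatile so sensor callbacks see updates.
    private volatile Map<String, Object> currentLabel = new HashMap<>();

    // Low-frequency label stream: just remember the most recent label.
    public void onLabelEvent(Map<String, Object> labelEvent) {
        currentLabel = new HashMap<>(labelEvent);
    }

    // High-frequency sensor stream: emit the sensor event plus the
    // label properties (label fields win on name collisions here).
    public Map<String, Object> onSensorEvent(Map<String, Object> sensorEvent) {
        Map<String, Object> enriched = new HashMap<>(sensorEvent);
        enriched.putAll(currentLabel);
        return enriched;
    }
}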

The second processor merges two streams by their timestamps. This could be implemented with Flink, but since it is a common use case, I think we also need a lightweight solution in Java. What do you think?
Here are a couple of things we need to keep in mind when designing the component:
* How do we deal with late-arriving events?
* How big must the buffer (state) for the data streams be to synchronize the events, e.g., when there is a large delay in one of the streams? (See the sketch below.)
* Can we assume that the events within one stream are in order?
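
To make the buffer question a bit more concrete, here is a rough, hypothetical Java sketch of the merge (again, all names are made up). Under the assumption that each stream is in order, the buffers only need to grow as large as the delay between the two streams:

import java.util.ArrayDeque;
import java.util.Deque;

public class MergeByTimestampSketch {

    // Illustrative event stand-in: a timestamp plus an opaque payload.
    static class Event {
        final long timestamp;
        final String payload;

        Event(long timestamp, String payload) {
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    private final Deque<Event> bufferA = new ArrayDeque<>();
    private final Deque<Event> bufferB = new ArrayDeque<>();
    private final long toleranceMs; // max timestamp distance that still counts as a match

    public MergeByTimestampSketch(long toleranceMs) {
        this.toleranceMs = toleranceMs;
    }

    public void onEventA(Event e) { bufferA.addLast(e); tryEmit(); }
    public void onEventB(Event e) { bufferB.addLast(e); tryEmit(); }

    // Emits a merged pair whenever the head timestamps are close enough.
    // Otherwise the older head is dropped: since its stream is in order,
    // it can never find a partner anymore.
    private void tryEmit() {
        while (!bufferA.isEmpty() && !bufferB.isEmpty()) {
            Event a = bufferA.peekFirst();
            Event b = bufferB.peekFirst();
            if (Math.abs(a.timestamp - b.timestamp) <= toleranceMs) {
                emit(bufferA.pollFirst(), bufferB.pollFirst());
            } else if (a.timestamp < b.timestamp) {
                bufferA.pollFirst();
            } else {
                bufferB.pollFirst();
            }
        }
    }

    private void emit(Event a, Event b) {
        System.out.println("merged @" + a.timestamp + ": " + a.payload + " + " + b.payload);
    }
}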

Do you have any other ideas about what we need to consider?

Cheers,
Philipp


Re: [DISCUSS] Processor to synchronize two data streams

Posted by Philipp Zehnder <ze...@apache.org>.
Hi Dominik,

thank you, now it works.
I already fixed and committed it.

Philipp



RE: [DISCUSS] Processor to synchronize two data streams

Posted by Dominik Riemer <ri...@apache.org>.
Hi Philipp,

regarding your last question, this should work out of the box:
- copy the additional image into the resources folder of the pipeline element (where documentation.md and icon.png are located)
- register it in the controller, e.g., .withAssets(Assets.DOCUMENTATION, Assets.ICON, "test.png") (see the sketch below)
- reference it in the documentation markdown, e.g. <img src="test.png"/>
- reinstall the element
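
For the second step, the registration could look roughly like this in the element's declareModel() (a sketch only; the rest of the processor definition is elided, and package names may differ between StreamPipes versions):

import org.apache.streampipes.model.graph.DataProcessorDescription;
import org.apache.streampipes.sdk.builder.ProcessingElementBuilder;
import org.apache.streampipes.sdk.utils.Assets;

@Override
public DataProcessorDescription declareModel() {
    return ProcessingElementBuilder
            .create("org.apache.streampipes.processors.example", "Example",
                    "Example description")
            // documentation.md, icon.png and the additional test.png from
            // the resources folder are shipped as assets of this element
            .withAssets(Assets.DOCUMENTATION, Assets.ICON, "test.png")
            // ... stream requirements and output strategy omitted ...
            .build();
}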

I've just tested it with the new Eclipse Ditto sink and it works fine.

Dominik




Re: [DISCUSS] Processor to synchronize two data streams

Posted by Patrick Wiener <wi...@apache.org>.
Great news :) I’m eager to try it out.

Patrick




Re: [DISCUSS] Processor to synchronize two data streams

Posted by Philipp Zehnder <ze...@apache.org>.
Hi,

I have finished the component for merging two data streams by timestamp. 
I'm not sure my solution is the most efficient, so don't hesitate to improve it or give me feedback on how to calculate the merge more efficiently ;)

During development, a question came up: is it possible to add images to the documentation.md file in addition to the icon?
I added another image that describes how the merge is performed on the streams, but the image is not displayed in StreamPipes.

I think it would be useful to include images in the documentation to explain the components. Does anyone know how we can include images? Is that feature currently missing, or did I just use the wrong URL in the Markdown?

Philipp



Re: [DISCUSS] Processor to synchronize two data streams

Posted by Patrick Wiener <wi...@apache.org>.
Hi Philipp,

Yes. E.g., Flink has good built-in features for dealing with late-arriving, out-of-order events.

Other than that, the assumption of using dimension properties such as machine or sensor IDs as partition keys is a good one: it ensures that the StreamPipes connect adapters (one per machine) only publish their sensor data to a dedicated Kafka partition of a certain topic. Since Kafka guarantees correct order per partition, this could be sufficient as a starting point to ensure that consumers only receive in-order events.
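
As a sketch of that idea with the plain Kafka producer API (topic, key, and payload here are made up), publishing with the dimension property as the record key makes all events of one machine land on the same partition, where Kafka preserves order:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedPublishSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Dimension property (machine ID) as the record key: all events
            // of this machine go to the same partition of the topic.
            producer.send(new ProducerRecord<>("sensor-events", "machine-42",
                    "{\"timestamp\": 1579255405000, \"value\": 3.14}"));
        }
    }
}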

Patrick




Re: [DISCUSS] Processor to synchronize two data streams

Posted by Philipp Zehnder <ze...@apache.org>.
Hi Patrick,

Ok, I think the first processor should be straightforward to implement. I will create an issue for it and then start implementing.

Regarding your concerns about the ordering of events within a topic: I think we will also need a component that can deal with out-of-order events in the future, but I would suggest building such a component with a framework that already has a built-in solution for that (e.g., Flink).
For this component, I thought it would be sufficient to assume that the events are in order, since we will use dimension properties for partitioning events in Kafka, which means the events should arrive in order.
Do you think we need an additional mechanism in the component to ensure the ordering of events, or is the Kafka guarantee sufficient for the moment?
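
Just for reference, this is the kind of built-in support I mean, sketched with Flink's bounded out-of-orderness extractor (Flink 1.x API; the event type is made up):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class OutOfOrderHandlingSketch {

    // Illustrative event type with an event-time timestamp in milliseconds.
    public static class SensorEvent {
        public long timestamp;
        public double value;
    }

    // Tolerates events arriving up to 5 seconds out of order; anything
    // later is treated as late by downstream event-time operators.
    public static DataStream<SensorEvent> withWatermarks(DataStream<SensorEvent> stream) {
        return stream.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<SensorEvent>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(SensorEvent event) {
                        return event.timestamp;
                    }
                });
    }
}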

Philipp





Re: [DISCUSS] Processor to synchronize two data streams

Posted by Patrick Wiener <wi...@apache.org>.
Hi Philipp,

this sounds like a reasonable topic that should be investigated further.

I especially see the benefit of having a lightweight solution in Java that can potentially run right where the data originates. I come across a lot of cases where control messages, in addition to sensor events, should be merged into one unified data stream to enable downstream analytics tasks (in robotics, this includes ROS messages).

Your initial thoughts on designing such a processor seem valid. However, I was wondering about your third point. I guess it depends on where the data is coming from. Take Kafka, for instance: AFAIK, Kafka only guarantees ordering per partition within a topic. Any initial ideas on how to deal with that?

Patrick
