Posted to users@nifi.apache.org by "Greene (US), Geoffrey N" <ge...@boeing.com> on 2021/02/24 21:59:22 UTC

some questions about splits

I'm having some trouble with multiple splits/merges.  Here's the idea:


Big data
    |
Split 1
    |
Save fragment.* attributes into split1.fragment.*
    |
Split 2
    |
Save fragment.* attributes into split2.fragment.*
    |
(More processing)
    |
Split 3
    |
Save fragment.* attributes into split3.fragment.*
    |
(Other stuff)
    |
Restore split3.fragment.* attributes to fragment.*
    |
Merge 3, using defragment strategy
    |
Restore split2.fragment.* attributes to fragment.*
    |
Merge 2, using defragment strategy
    |
Restore split1.fragment.* attributes to fragment.*
    |
Merge 1, using defragment strategy

Am I thinking about this correctly?  It seems that sometimes NiFi is unable to merge some of the split data (errors like "there are 50 fragments, but we only found one").  Is it possible that I need to do some prioritization in the queues? I have noticed that flow files do back up and the queues fill as the flow runs (several of the splits need to perform REST calls and processing, which can take time).  Maybe the issue is that one fragment "slips" through before the others have been processed far enough.  Is there an approved way to do this?
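To check my own bookkeeping, here is the save/restore/defragment logic above sketched in plain Python. This is not NiFi code, just a standalone model of what the UpdateAttribute and MergeContent (Defragment) steps are supposed to do with the fragment.* attributes:

```python
import uuid

def split(flowfile, n):
    """Mimic a Split* processor: emit n children carrying the standard
    fragment.* attributes (identifier, index, count)."""
    frag_id = str(uuid.uuid4())
    return [
        {**flowfile,
         "fragment.identifier": frag_id,
         "fragment.index": i,
         "fragment.count": n}
        for i in range(n)
    ]

def save(flowfile, prefix):
    """Mimic UpdateAttribute: copy fragment.* to <prefix>.fragment.*."""
    ff = dict(flowfile)
    for key in ("fragment.identifier", "fragment.index", "fragment.count"):
        ff[prefix + "." + key] = ff[key]
    return ff

def restore(flowfile, prefix):
    """Copy <prefix>.fragment.* back to fragment.* before merging."""
    ff = dict(flowfile)
    for key in ("fragment.identifier", "fragment.index", "fragment.count"):
        ff[key] = ff[prefix + "." + key]
    return ff

def merge_defragment(flowfiles):
    """Mimic MergeContent's Defragment strategy: bin by fragment.identifier
    and merge a bin only when all fragment.count pieces are present."""
    bins = {}
    for ff in flowfiles:
        bins.setdefault(ff["fragment.identifier"], []).append(ff)
    merged = []
    for group in bins.values():
        if len(group) != group[0]["fragment.count"]:
            raise ValueError("incomplete bin: expected %d fragments, found %d"
                             % (group[0]["fragment.count"], len(group)))
        group.sort(key=lambda ff: ff["fragment.index"])
        merged.append(group[0])  # stand-in for the merged flow file
    return merged

# Two levels of split/merge, as in the flow above.
parents = split({"name": "big-data"}, 3)
children = []
for p in parents:
    for c in split(save(p, "split1"), 2):
        children.append(save(c, "split2"))

# Restore the inner attributes, merge level 2, then restore and merge level 1.
level1 = merge_defragment([restore(ff, "split2") for ff in children])
top = merge_defragment([restore(ff, "split1") for ff in level1])
```

The round trip only works because each save happens before the next split overwrites fragment.*, and each restore happens before the corresponding merge.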

Thanks for the help!



Re: some questions about splits

Posted by Matt Burgess <ma...@apache.org>.
Geoffrey,

There's a really good blog post by the man himself [1] :) I highly recommend the
official blog in general; it has lots of great posts, and many are record-oriented
[2].

Regards,
Matt

[1] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
[2] https://blogs.apache.org/nifi/


RE: some questions about splits

Posted by "Greene (US), Geoffrey N" <ge...@boeing.com>.
Thank you for the fast response, Mark.

Hrm, record processing does sound useful.

Are there any good blogs / documentation on this?  I’d really like to learn more.  I’ve been doing mostly text processing, as you’ve observed.

My use case is something like this:

1) Use the server API to get the list of sensors.

2) Use the server API to get the list of jobs.

3) For each job, get the count of frames per job.  There are up to 6 sets of frames per job, depending on which set you query for.

4) Frames can only be queried 50 at a time, so for each batch of 50, get the set of timestamps of the frames.

5) For each timestamp, query for all sensor values at that time.  These have to be queried one at a time, because of the way the API works – one sensor value can only be given for one time.

6) Glue all the data associated with that job (frame times, sensor readings, etc.) together into one big JSON.  (There are more steps after that.)
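The fan-out in steps 4–6 looks roughly like this in plain Python (the fetch_* callables are hypothetical stand-ins for the server API calls, not real endpoints):

```python
import json

def assemble_job(job_id, fetch_timestamps, fetch_sensor_value,
                 sensors, frame_count):
    """Sketch of steps 4-6: page through frames 50 at a time, query each
    (sensor, timestamp) pair individually, then glue into one JSON."""
    timestamps = []
    for offset in range(0, frame_count, 50):       # step 4: 50-frame pages
        timestamps.extend(fetch_timestamps(job_id, offset, limit=50))
    readings = [
        {"time": ts, "sensor": s, "value": fetch_sensor_value(s, ts)}
        for ts in timestamps
        for s in sensors                           # step 5: one call per pair
    ]
    # step 6: one big JSON document per job
    return json.dumps({"job": job_id, "frames": timestamps,
                       "readings": readings})
```

The nested loop makes the cost visible: frames × sensors individual REST calls per job, which is exactly the kind of per-item fan-out that backs up the queues.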


Geoffrey Greene
Associate Technical Fellow/Senior Software Ninjaneer
(703) 414 2421
The Boeing Company



Re: some questions about splits

Posted by Mark Payne <ma...@hotmail.com>.
Geoffrey,

At a high level, if you’re splitting multiple times and then trying to re-assemble everything, then yes I think your thought process is correct. But you’ve no doubt seen how complex and cumbersome this approach can be. It can also result in extremely poor performance. So much so that when I began creating a series of YouTube videos on NiFi Anti-Patterns, the first anti-pattern that I covered was the splitting and re-merging of data [1].

Generally, this should be an absolute last resort, and Record-oriented processors should be used instead of splitting the data up and re-merging it. If you need to perform REST calls, you could do that with LookupRecord, and either use the RESTLookupService or if that doesn’t fit the bill exactly you could actually use the ScriptedLookupService and write a small script in Groovy or Python that would perform the REST call for you and return the results. Or perhaps the ScriptedTransformRecord would be more appropriate - hard to tell without knowing the exact use case.
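For example, the body of such a scripted lookup might look roughly like this in plain Python (a standalone sketch only – the endpoint path and response shape are invented, and in NiFi the real ScriptedLookupService wraps this kind of logic in the LookupService interface rather than a bare function):

```python
import json
import urllib.request

def rest_lookup(coordinates, base_url, opener=urllib.request.urlopen):
    """Build a URL from the record's lookup coordinates, call the REST
    endpoint, and return the fields to merge back into the record.
    'opener' is injectable so the HTTP call can be stubbed in tests."""
    url = "%s/sensors/%s?at=%s" % (base_url,
                                   coordinates["sensor_id"],
                                   coordinates["timestamp"])
    with opener(url) as resp:
        body = json.load(resp)
    return {"value": body.get("value")}
```

LookupRecord would then call this once per record and write the returned fields into the record, so the flow file never has to be split at all.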

Obviously, your mileage may vary, but switching the data flow to use record-oriented processors, if possible, would typically yield a flow that is much simpler and yield throughput that is at least an order of magnitude better.

But if for whatever reason you do end up being stuck with the split/merge approach, the key would likely be to consider backpressure heavily. If you have backpressure set to 10,000 FlowFiles (the default) and you’re trying to merge together data that comes from many different upstream splits, you can certainly end up in a situation like this, where you don’t have all of the data from a given ‘split’ queued up for MergeContent.
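A toy model of that failure mode: when fragments from many parents interleave, the queue feeding the merge can fill with partial bins before any single bin is complete, at which point backpressure stops upstream processors and nothing ever merges. (This is a deliberately simplified model, not actual MergeContent behavior; the numbers are made up.)

```python
from collections import defaultdict

def simulate_merge(arrivals, queue_limit):
    """Toy model of Defragment merging behind a bounded queue. 'arrivals'
    is a stream of (fragment_id, index, count) tuples; a bin merges (and
    frees its queue slots) only once all 'count' pieces are queued.
    Returns (merged_ids, stuck): stuck is True if the queue filled up
    before any bin could complete."""
    queue, bins, merged = [], defaultdict(list), []
    for frag in arrivals:
        if len(queue) >= queue_limit:
            return merged, True          # backpressure: upstream blocks
        queue.append(frag)
        fid, _, count = frag
        bins[fid].append(frag)
        if len(bins[fid]) == count:      # bin complete: merge, free slots
            merged.append(fid)
            for f in bins.pop(fid):
                queue.remove(f)
    return merged, False
```

With 4 parents of 3 fragments each arriving round-robin, a queue limit of 4 deadlocks with zero merges, while a limit large enough to hold every in-flight bin merges everything, which is why raising backpressure thresholds (or reducing interleaving) on the queue before MergeContent matters.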

Hope this helps!
-Mark

[1] https://www.youtube.com/watch?v=RjWstt7nRVY

