You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@nifi.apache.org by GitBox <gi...@apache.org> on 2020/11/03 13:42:38 UTC

[GitHub] [nifi] anaylor opened a new pull request #4641: NIFI-6394

anaylor opened a new pull request #4641:
URL: https://github.com/apache/nifi/pull/4641


   #### Description of PR
   
   Development has stopped on this [PR](https://github.com/apache/nifi/pull/4221) so I picked up the changes and addressed the comments. This PR removes the hardcoded limit for viewing files in the flow file list queue. Adds a "View All" button to load all available flow files for a given connection, then has an option for exporting those files to CSV. The files could then be sorted/explored in another data analysis tool.
   
   ![image](https://user-images.githubusercontent.com/61026014/97916880-9ec88600-1d21-11eb-86fa-d812b31fcaea.png)
   
   _Enables Flowfile list queue CSV Export
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced
        in the commit message?
   
   - [x] Does your PR title start with **NIFI-XXXX** where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
   
   - [x] Has your PR been rebased against the latest commit within the target branch (typically `main`)?
   
   - [x] Is your initial contribution a single, squashed commit? _Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not `squash` or use `--force` when pushing to allow for clean monitoring of changes._
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via `mvn -Pcontrib-check clean install` at the root `nifi` folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] Have you verified that the full build is successful on JDK 8?
   - [ ] Have you verified that the full build is successful on JDK 11?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE` file, including the main `LICENSE` file under `nifi-assembly`?
   - [ ] If applicable, have you updated the `NOTICE` file, including the main `NOTICE` file found under `nifi-assembly`?
   - [ ] If adding new Properties, have you added `.displayName` in addition to .name (programmatic access) for each of the new properties?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] MikeThomsen commented on pull request #4641: NIFI-6394

Posted by GitBox <gi...@apache.org>.
MikeThomsen commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-720837003


   Have you tested this with a queue that has, say, 100k-1M flowfiles in it? That's not a best practice, but not uncommon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] taftster edited a comment on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
taftster edited a comment on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-726162471


   [edits to fix my memory loss]
   
   Right, agreed. Which is why I think, for this PR at least, we need to stay within some sort of reasonable constraint on the number of flowfiles you can expect to retrieve, both through the UI and the REST endpoint.
   
   In the original PR (4221), I hard coded the UI limit at 100K allowing the REST API to be "unlimited" theoretically.  It might make sense to consider letting more flowfiles be viewed in the HTML table.  But ultimately, expecting the entirety of the queue be displayed here is probably a mistake (for the many reasons discussed).
   
   Maybe we can do something "small" in this PR to get it landed on master.  And then consider the broader implications in another issue?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] taftster commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
taftster commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-726156711


   So one problem that I was concerned about in the original PR was letting a large number of rows into the HTML table that displays the flowfiles.  e.g. a "View All" with a 1MM rows flowfiles in the queue would result in a 1MM row HTML table.
   
   This might come from an old-school understanding, but my experience has been a large HTML table would not be very nice on the browser.  Has anyone considered this concern at all?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] anaylor commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
anaylor commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-726189727


   @joewitt Looking at your comments on  [4221](https://github.com/apache/nifi/pull/4221). The limit of 100 is safe but not particularly useful for examining a queue of flow files. 
   
   @taftster I agree, maybe limiting to 20k as this is the number in `nifi.queue.swap.threshold` to start out? 
   
   I have tested with up to 200k without any issues, getting to 1M you have to add some heap but the UI seems to not have any issues. I am sure there are many unseen pieces whirring away. Are there other aspects of the system I can look at to determine if this is working within tolerances?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] anaylor commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
anaylor commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-726070577


   Good call, I missed the second "View All" on the bottom
   
   As for the limitation of the View All, I included a tooltip message that the number of messages is limited by `nifi.queue.swap.threshold`. There is another hard coded limitation in the js of 200k. It currently just defaults to the first 100 when you List Queue, then View All will return up to 200k, this implementation is incorrect, View All should actually View All. I will change the functionality


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] anaylor commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
anaylor commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-728278173


   @joewitt Ok I see where you are coming from. I will take a deeper dive into the mechanisms of how active and swap queues are managed to see if there are ramifications to pulling off more then 100 FF. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] joewitt commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
joewitt commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-731359274


   Well off the cuff I suppose some property in nifi.properties which limits the amount of data returned by a listing a call.  But we really need to try to avoid adding more nifi.properties.  We already have too many knobs.
   
   Another option is another endpoint/api call which explicitly creates a stream of results and never has the whole set in memory.  It is meant purely for the client to get a big listing of what is there.  Not sure exactly how that would play out given we have to merge responses from all the nodes in the cluster but seems like it should be feasible.  This return value wouldn't be meant for the browser... but more like downloading a file.  Kind of awkward for the user.
   
   I'm not quite sure.  It needs a good bit of thought.  I just want to make sure we dont open up more avenues for heap exhaustion a user can trigger.  The flow file attribute thing led to the swapping concept.  Which helped a ton.  But flows we see in live production usage are *huge* now and even the swapping defaults seem too high now.
   
   Can you talk a bit more about how from a user point of view getting all that information helps you?  Can you talk about the use case?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] taftster commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
taftster commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-726162471


   Right, agreed. Which is why I think, for this PR at least, we need to stay within some sort of reasonable constraint on the number of flowfiles you can expect to retrieve, both through the UI and the REST endpoint.
   
   In the original PR (4221), I hard coded the UI limit at 10K (to match the default size of the backpressure limit).  It might make sense to consider letting more flowfiles be viewed in the HTML table.  But ultimately, expecting the entirety of the queue be displayed here is probably a mistake (for the many reasons discussed).
   
   Maybe we can do something "small" in this PR to get it landed on master.  And then consider the broader implications in another issue?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] anaylor commented on pull request #4641: NIFI-6394

Posted by GitBox <gi...@apache.org>.
anaylor commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-720714814






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] joewitt commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
joewitt commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-726155889


   My concern here is similar in spirt to the statements/assumptions being made on https://github.com/apache/nifi/pull/4221 and its associated PR....
   
   Changing this requires a full understanding of all implications.  Being able to access information beyond what is swapped out *cannot* result in swapped out flowfiles being swapped in or it would defeat the point of swapping which is all about memory blowout prevention.  An alternative solution also could mean we now have a new access vector for swapped out flowfiles so we can look at their information to respond with this 'view all'.  
   
   I totally understand the spirit here but the solution might well not be to actually return everything as there are very good scalability reasons why these things dont occur.  We could update the UI to say 'View All Active' (not swapped) and/or we can update the documentation to make the limitations of the convenience functions more clear.  There is/was never any intent to provide complete access to the queues.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] markobean commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
markobean commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-724788240


   There is a 'View All' option at the top of the listing and at the bottom, next to the Refresh button. I believe the one at the bottom should be removed.
   An extreme nit-pick, but for consistency: the message reads "Displaying X of Y". The "X" value does not have a comma whereas "Y" value does.
   
   More significant issue: I tried 'View All' on a queue with 1,000,200 FF's. It only listed 19,200 (and corresponding CSV only contained those entries as well.)
   Re-running the test with 262,000 files in the queue, 'View All' topped out at 19,000. 
   Interesting to note that my first test initially filled the queue with 200 files, then piled on another 1M. Not sure if that is the reason for the "extra" 200 in the 19,200 limit as compared to the second test that limited at 19,000. 
   Either way, this limitation should be addressed.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] anaylor commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
anaylor commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-732376543


   Our use case would be for tracking specific files throughout the system. Sometimes we get asked about where a specific file is on our system, with a queue limit of 100 we don't have visibility into where a specific file is. We want to provide users a capability to download a list of files currently residing in a queue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] taftster edited a comment on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
taftster edited a comment on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-726162471


   Right, agreed. Which is why I think, for this PR at least, we need to stay within some sort of reasonable constraint on the number of flowfiles you can expect to retrieve, both through the UI and the REST endpoint.
   
   In the original PR (4221), I hard coded the UI limit at 10K (to match the default size of the backpressure limit) allowing the REST API to be "unlimited" theoretically.  It might make sense to consider letting more flowfiles be viewed in the HTML table.  But ultimately, expecting the entirety of the queue be displayed here is probably a mistake (for the many reasons discussed).
   
   Maybe we can do something "small" in this PR to get it landed on master.  And then consider the broader implications in another issue?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] joewitt commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
joewitt commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-731336221


   It is true that the user has to initiate this process either way.  The current mechanism results in no more than 100 items being returned.  That was chosen because flowfile's can have arbitrarily large attributes (and while it is an anti pattern many users do this).  Making it larger such that it matches the swap threshold (which wasn't designed to care about this case) means you must ensure the implications for memory changes are considered.  My understanding of how this works does not align to your statement of it being negligible.  My understanding is that if there are 20,000 flowfiles or lets say 100,000 flowfiles in there and they're lets say '1KB each' in terms of text to represent them we're looking at *at least* 16-80 or so MB of heap.  For many common cases it could be much worse.
   
   If you're telling me my understanding is not accurate then fair enough - this needs to be further reviewed.  If it is accurate then we need to rethink how this gets controlled better.  Scenarios that cause memory exhaustion are among the most sensitive because they create bad outcomes and a lot of user confusion.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] anaylor commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
anaylor commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-731356318


   I did not realize flowfiles could have arbitrarily large attributes, I only tested with default flow file attributes. If you had arbitrarily large attributes there would be a huge difference between 100 and 20k FF summaries, or even 1 and 100. For 20k FF with default attributes its about 4.6M of text, which you would have to override the default backpressure control of connections and request all flow file summaries. 
   
   Your understanding is correct. How would you suggest to better control this? Would this be a check on the backend to not exceed some type of thresh hold? Or a warning on the front end? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] joewitt commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
joewitt commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-726158728


   ...not designed to do this for arbitrarily large queues. (is what I meant to say)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] github-actions[bot] closed pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #4641:
URL: https://github.com/apache/nifi/pull/4641


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] joewitt commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
joewitt commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-726194460


   Safe for all cases and useful for simply examining the top of the queue/getting a sense of prioritization.  Which is what this really designed to do to be clear.
   
   You want to expand that to a much larger set of data.  I'm saying - this is great.  Love the idea.  But the approach to ensuring it remains safe in all cases is still critical.  That is the point I'm making.  Just make sure the safety is *ensured*.  We assessed the level of effort to do this well/safely was quite high.  I want to make sure whatever we do solve those concerns and so far I'm not seeing that level of discussion which is why I finally commented.
   
   Testing up to 200K or larger numbers is good.  But more than testing we must understand the underlying design/approach and what it means for memory.  Off the top of my head I dont recall.  


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] github-actions[bot] commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-819929640


   We're marking this PR as stale due to lack of updates in the past few months. If after another couple of weeks the stale label has not been removed this PR will be closed. This stale marker and eventual auto close does not indicate a judgement of the PR just lack of reviewer bandwidth and helps us keep the PR queue more manageable.  If you would like this PR re-opened you can do so and a committer can remove the stale tag.  Or you can open a new PR.  Try to help review other PRs to increase PR review bandwidth which in turn helps yours.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] joewitt commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
joewitt commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-726158412


   I agree that is a major concern.  Memory usage/speed/locks involved/etc.. all throughout trying to accomplish this must be well considered.  The current mechanism I am confident was not designed to do this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nifi] anaylor commented on pull request #4641: NIFI-6394 - Ability to export contents of a List Queue

Posted by GitBox <gi...@apache.org>.
anaylor commented on pull request #4641:
URL: https://github.com/apache/nifi/pull/4641#issuecomment-731329752


   @joewitt @markobean @taftster 
   
   It does not seem like any swap events are kicked off directly from requesting a listing of the active queue. When a listing is created (for 100 or 20k) all flow files in the active queue are sorted and the maxResults are returned. It is my opinion that the additional memory needed to add the additional FF summaries would be negligible as the FF summaries are a predictable size. Furthermore, the default behavior is preserved, the user has to take the additional step to click View Active to return that larger list. I think this is reasonable for a max of 20k FF summaries. I am open to suggestions and comment  


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org