Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2019/07/25 03:44:29 UTC

[GitHub] [airflow] potiuk edited a comment on issue #5615: [AIRFLOW-5035] Remove multiprocessing.Manager in-favour of Pipes

URL: https://github.com/apache/airflow/pull/5615#issuecomment-514883328
 
 
   Idea: maybe we can simply write a small performance test and benchmark before/after. I am currently setting up (it already works) a Kubernetes cluster in GKE with Prometheus so that we can observe performance, and we could have an automated performance test that we run against the scheduler.
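   To illustrate what "observing performance" could look like, here is a minimal sketch of pulling a scheduler CPU metric out of such a Prometheus setup. The server address and the pod-name pattern below are assumptions for illustration, not our actual deployment:

```python
import json
import urllib.parse
import urllib.request

# Assumed Prometheus address - replace with the real in-cluster service.
PROMETHEUS_URL = "http://prometheus.example:9090"


def build_query_url(base, promql):
    # Prometheus instant-query endpoint: /api/v1/query?query=<PromQL>
    return base + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def scheduler_cpu(base=PROMETHEUS_URL):
    # cAdvisor CPU counter, filtered to pods named like the scheduler
    # (the pod name pattern is a guess for illustration).
    promql = ('rate(container_cpu_usage_seconds_total'
              '{pod=~"airflow-scheduler.*"}[5m])')
    with urllib.request.urlopen(build_query_url(base, promql)) as resp:
        return json.load(resp)["data"]["result"]
```

   Sampling that query before and after a scheduler run would give us the CPU side of the comparison for free, without instrumenting Airflow itself.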
   
   Running such a performance test before/after the change would clearly show whether there is a regression. That will also be useful if we want to improve the performance in .5 - we will at least have a baseline and a way to measure it; otherwise we are in the dark. Later on we could fully automate it and run it either with every PR or regularly (daily). This is what another team of ours at Polidea is doing with the Apache Beam project pretty much full time (including generating artificial data sources that simulate the actual data you can expect in production). I can easily tap into their experience there.
   
   I can also reach out to the Composer team - we work closely together - and ask them if they already have something similar.
   
   I do not think it should be that difficult to have such a test. I believe it just boils down to generating a lot of bigger and smaller DAGs (should be easy - we could use our experience from https://github.com/GoogleCloudPlatform/oozie-to-airflow where we generate Airflow DAGs from Oozie workflows), then starting a scheduler in a controlled setup (number of processes etc.) and measuring the time it takes to process them, plus gathering some additional metrics on CPU/memory use. I guess for now we are only talking about the performance of processing DAGs with the scheduler?
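   A rough sketch of the "generate DAGs, then time the parsing" part. The counts, the DAG template, and the helper names are made up for illustration, and the timing function assumes Airflow is installed:

```python
import os
import tempfile
import textwrap
import time

# Illustrative parameters - not a proposal for the real test harness.
NUM_DAGS = 50
TASKS_PER_DAG = 10

DAG_TEMPLATE = textwrap.dedent("""\
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG("perf_dag_{i}", start_date=datetime(2019, 1, 1),
              schedule_interval=None)
    tasks = [DummyOperator(task_id="task_%d" % t, dag=dag)
             for t in range({tasks})]
    for up, down in zip(tasks, tasks[1:]):
        up >> down
""")


def generate_dags(folder):
    """Write NUM_DAGS synthetic DAG files into `folder`."""
    for i in range(NUM_DAGS):
        path = os.path.join(folder, "perf_dag_%d.py" % i)
        with open(path, "w") as f:
            f.write(DAG_TEMPLATE.format(i=i, tasks=TASKS_PER_DAG))


def time_dag_parsing(folder):
    """Time how long DagBag takes to parse the folder (needs Airflow)."""
    from airflow.models import DagBag
    start = time.time()
    bag = DagBag(dag_folder=folder, include_examples=False)
    return time.time() - start, len(bag.dags)


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    generate_dags(folder)
    print("generated %d DAG files in %s" % (NUM_DAGS, folder))
```

   The real test would of course run the scheduler itself against the folder rather than just a DagBag, but even a parse-time number like this would already give us a before/after signal.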
   
   I come more from the infrastructure/operator side of Airflow and have little experience with "production use" of Airflow, but I would love it if someone (@KevinYang21 ?) wrote down a simple "specification" for such a test - one resembling actual use - so that we could better assess how difficult it would be to automate it.
   
   I could work together with both of you on automating it - possibly early next week. We could even release an RC and have a vote for testing, making the performance test results an input to the voting as well, potentially vetoing the release.
   
   And that could be a good start to automating more such tests measuring the performance of other operations within Airflow. Once we have one such test set up, with all the setup overhead, scaffolding, and setUp/tearDown working, it should be rather easy to add new tests like that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services