Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/01/28 03:27:34 UTC

[GitHub] [airflow] mik-laj opened a new issue #13941: The PgBouncer configuration is not described in the documentation

mik-laj opened a new issue #13941:
URL: https://github.com/apache/airflow/issues/13941


   Hello,
   
   Currently the documentation does not mention `pgbouncer` in any way, even though it is recommended in a production environment with a PostgreSQL database. It would be great if we could describe why we need PgBouncer and the recommended configuration.
   
   Best regards,
   Kamil Breguła


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal commented on issue #13941: The PgBouncer configuration is not described in the documentation

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #13941:
URL: https://github.com/apache/airflow/issues/13941#issuecomment-938658794


   I believe this is covered by https://github.com/apache/airflow/pull/18399


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] vikramkoka commented on issue #13941: The PgBouncer configuration is not described in the documentation

Posted by GitBox <gi...@apache.org>.
vikramkoka commented on issue #13941:
URL: https://github.com/apache/airflow/issues/13941#issuecomment-769877454


   @mik-laj I like this, but I am a little confused by this issue.
   I originally thought you were talking about "PgBouncer Setup and Configuration", but the update you gave a little while ago is around "Database connection sizing". 
   I would have thought those are two separate, but related documents. Am I missing something? Or, are you just articulating the need without necessarily an opinion about how to address it? 





[GitHub] [airflow] mik-laj commented on issue #13941: The PgBouncer configuration is not described in the documentation

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #13941:
URL: https://github.com/apache/airflow/issues/13941#issuecomment-769893476


   @vikramkoka The last message is a fragment of a conversation with one user to whom I explained why they need PgBouncer in Airflow. It seems to me that, in addition to the how ("PgBouncer Setup and Configuration"), the Airflow documentation should also explain why we need it and when we don't, so that the user can make an informed decision about whether to use PgBouncer. If we explain to the user how many connections each component uses, they can decide that they do not need PgBouncer at all, since their database can handle enough open connections.
   
   >  It would be great if we could describe **why we need PgBouncer**, the recommended configuration, and ways to tune this component.
   
   This is especially helpful when you are using cloud solutions and trying to optimize costs. If you know the size of a database instance, you can estimate how many open connections it can have, and then tune the rest of the components to make the most efficient use of that resource, e.g. configure limits for autoscaling and others.
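
   The sizing reasoning above could be sketched roughly as follows. This is only an illustration: the function names, the `reserved` parameter, and the default of 2 connections per task (taken from the unverified description elsewhere in this thread) are assumptions, not anything Airflow or PgBouncer provides.

```python
def fits_without_pooler(max_connections, estimated_connections, reserved=0):
    """Return True if the estimated demand fits in the database's connection limit.

    `reserved` is a hypothetical allowance for connections kept free for
    administrators, monitoring, and similar non-Airflow clients.
    """
    return estimated_connections <= max_connections - reserved


def max_concurrent_tasks(max_connections, fixed_connections,
                         connections_per_task=2, reserved=0):
    """Return how many tasks can run at once within the connection budget.

    `fixed_connections` covers components whose usage does not grow with the
    number of tasks (scheduler, webserver). The default of 2 connections per
    task is an assumption based on the unverified figures in this thread.
    """
    budget = max_connections - reserved - fixed_connections
    return max(budget // connections_per_task, 0)


if __name__ == "__main__":
    # Example: a database that allows 100 connections, 3 of them kept free.
    print(fits_without_pooler(100, 41, reserved=3))
    print(max_concurrent_tasks(100, fixed_connections=9, reserved=3))
```

   If the estimated demand does not fit, that is the case where a pooler such as PgBouncer, or lower concurrency limits, becomes worth considering.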
   
   I haven't added it to the documentation yet, because I still have to verify all this information and explain how to configure PGBouncer. 





[GitHub] [airflow] eladkal closed issue #13941: The PgBouncer configuration is not described in the documentation

Posted by GitBox <gi...@apache.org>.
eladkal closed issue #13941:
URL: https://github.com/apache/airflow/issues/13941


   





[GitHub] [airflow] alexmc-ms commented on issue #13941: The PgBouncer configuration is not described in the documentation

Posted by GitBox <gi...@apache.org>.
alexmc-ms commented on issue #13941:
URL: https://github.com/apache/airflow/issues/13941#issuecomment-774873677


   PgBouncer is critical for cloud setups using Azure Database for PostgreSQL. For almost a year, my team was using 1.10.9 in a production environment running on Azure SQL Server. In order to upgrade to 1.10.14, we needed to move to PostgreSQL, as support for SQL Server has degraded to the point of being unusable. The default connection pooling to Azure SQL Server worked fine. The default connection pooling to Azure Database for PostgreSQL was terrible, resulting in a 16 CPU database using 85% CPU at all times. We added PgBouncer and CPU usage immediately dropped to 1%.
   
   Steps for setting this up in Azure are here: https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/steps-to-install-and-setup-pgbouncer-connection-pooling-proxy/ba-p/730555.





[GitHub] [airflow] mik-laj commented on issue #13941: The PgBouncer configuration is not described in the documentation

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #13941:
URL: https://github.com/apache/airflow/issues/13941#issuecomment-769636502


   The following description seems to me to be part of this documentation, but we should verify it.
   
   > We ensure isolation at the process level, and each process opens a new connection, so these components have many open connections. The new processes also allow us to circumvent GIL limitations, i.e. the problems with multi-thread handling in Python.
   >
   > - **Scheduler** processes files in a loop. For each file, we create a new process. The number of files processed simultaneously is controlled by `scheduler.max_threads` (Airflow 1.10) or `scheduler.parsing_processes` (Airflow 2.0). We recommend setting this option to CPU count - 1. Additionally, the main scheduler loop has an open connection as well. Managing the processing of files takes place in a separate process/loop, which creates another connection. This means we already have `[parsing_processes] + 2` open connections at the same time.
   > - The main **webserver** process creates many gunicorn workers. The number of processes is controlled by the webserver.gunicorn options. In Airflow 1.10, each worker opened 2 connections to the database, but in Airflow 2.0, I fixed this and now each process opens only one connection. By default, we start 4 workers.
   > - **Worker** processes handle multiple tasks, and for each task, three processes and 2 connections are created. The number of tasks per worker is configurable by the `core.parallelism` option.
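
   As a rough sketch, the counts described above can be added up like this. The function and parameter names are hypothetical, the figures assume Airflow 2.0 behaviour, and they come straight from the still-unverified description above, so treat the result as an estimate only.

```python
def estimate_connections(parsing_processes, gunicorn_workers,
                         tasks_per_worker, worker_count=1):
    """Estimate open database connections per Airflow component.

    Assumptions (from the unverified description in this thread):
      - scheduler: one connection per parsing process, plus 2 for the main
        loop and the file-processing manager
      - webserver: one connection per gunicorn worker (Airflow 2.0)
      - workers: 2 connections per running task
    """
    scheduler = parsing_processes + 2
    webserver = gunicorn_workers * 1
    workers = worker_count * tasks_per_worker * 2
    return {
        "scheduler": scheduler,
        "webserver": webserver,
        "workers": workers,
        "total": scheduler + webserver + workers,
    }


if __name__ == "__main__":
    # Example: 3 parsing processes, the default 4 gunicorn workers,
    # and a single worker running up to 16 tasks at once.
    print(estimate_connections(parsing_processes=3, gunicorn_workers=4,
                               tasks_per_worker=16))
```

   Comparing the total against the connection limit of the database instance gives a first indication of whether a pooler such as PgBouncer is needed.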

