You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/03/29 14:44:05 UTC

[GitHub] [airflow] tirkarthi opened a new pull request #22588: Print configuration on scheduler startup.

tirkarthi opened a new pull request #22588:
URL: https://github.com/apache/airflow/pull/22588


   This PR prints the configuration used during scheduler startup. The sensitive values are masked so that they are not captured by mistake in log files.
   
   closes: #22141
   related: #22141
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1084984832


   I'd say that one qualifies as PR of the month ;) (pity the newsletter is out).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] tirkarthi commented on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
tirkarthi commented on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1084924730


   Thanks @jedcunningham and @potiuk for the feedback. I have modified the PR so that non-default configuration is displayed along with source in [] as below indicating whether it's from environment variable / airflow.cfg / cmd. For signal SIGUSR1 in non-daemon mode I have added a handler that prints all configuration including default. I have added tests to ensure "default" is filtered by default :)
   
   I have used print since `airflow config list` also used print. Let me know if this needs to be changed to logger and using `log.info`. I have also not added syntax highlighting. Let me know your feedback on the format so that I can modify accordingly.
   
   
   ```
   $ airflow scheduler
     ____________       _____________
    ____    |__( )_________  __/__  /________      __
   ____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
   ___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
    _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
   [core]
   dags_folder = < hidden > [env var]
   hostname_callable = socket.getfqdn [airflow.cfg]
   default_timezone = utc [airflow.cfg]
   executor = < hidden > [env var]
   sql_alchemy_conn = < hidden > [env var]
   sql_engine_encoding = utf-8 [airflow.cfg]
   sql_alchemy_pool_enabled = True [airflow.cfg]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] tirkarthi commented on a change in pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
tirkarthi commented on a change in pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#discussion_r841159917



##########
File path: tests/cli/commands/test_scheduler_command.py
##########
@@ -87,3 +91,18 @@ def test_graceful_shutdown(self, executor, mock_process, mock_scheduler_job):
                 scheduler_command.scheduler(args)
             finally:
                 mock_process().terminate.assert_called()
+
+    @mock.patch.dict(os.environ, {'AIRFLOW__CORE__PARALLELISM': '35'}, clear=True)
+    @mock.patch("airflow.cli.commands.scheduler_command.SchedulerJob")
+    @mock.patch("airflow.cli.commands.scheduler_command.Process")
+    def test_scheduler_print_config(self, mock_process, mock_scheduler_job):
+        args = self.parser.parse_args(['scheduler'])
+        with conf_vars({("core", "sql_alchemy_conn_cmd"): 'echo hello'}):
+            with contextlib.redirect_stdout(io.StringIO()) as temp_stdout:
+                scheduler_command.scheduler(args)
+                output = temp_stdout.getvalue()
+
+                assert "sql_alchemy_conn = < hidden > [cmd]" in output
+                assert "max_active_tasks_per_dag = 16 [airflow.cfg]" in output
+                assert "parallelism = < hidden > [env var]" in output

Review comment:
       Thanks @potiuk , I would like to have this as a separate issue since it might need more discussion on what and how to filter out items.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1085505751


   One more last comment since there is also static check to fix (I promise it's the last one :). 
   
   It would be great to write a paragraph or two about this behaviour (including the SIGUSR1 option) in https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html . This is so tht we do not need to explain it to the users, but just send them the link. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] tirkarthi edited a comment on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
tirkarthi edited a comment on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1084924730


   Thanks @jedcunningham and @potiuk for the feedback. I have modified the PR so that non-default configuration is displayed along with source in [] as below indicating whether it's from environment variable / airflow.cfg / cmd. For signal SIGUSR1 in non-daemon mode I have added a handler that prints all configuration including default. I have added tests to ensure "default" is filtered by default :)
   
   I have used print since `airflow config list` also used print. Let me know if this needs to be changed to logger and using `log.info`. I have also not added syntax highlighting. Let me know your feedback on the format so that I can modify accordingly.
   
   
   ```ini
   $ airflow scheduler
     ____________       _____________
    ____    |__( )_________  __/__  /________      __
   ____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
   ___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
    _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
   [core]
   dags_folder = < hidden > [env var]
   hostname_callable = socket.getfqdn [airflow.cfg]
   default_timezone = utc [airflow.cfg]
   executor = < hidden > [env var]
   sql_alchemy_conn = < hidden > [env var]
   sql_engine_encoding = utf-8 [airflow.cfg]
   sql_alchemy_pool_enabled = True [airflow.cfg]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#discussion_r841056341



##########
File path: tests/cli/commands/test_scheduler_command.py
##########
@@ -87,3 +91,18 @@ def test_graceful_shutdown(self, executor, mock_process, mock_scheduler_job):
                 scheduler_command.scheduler(args)
             finally:
                 mock_process().terminate.assert_called()
+
+    @mock.patch.dict(os.environ, {'AIRFLOW__CORE__PARALLELISM': '35'}, clear=True)
+    @mock.patch("airflow.cli.commands.scheduler_command.SchedulerJob")
+    @mock.patch("airflow.cli.commands.scheduler_command.Process")
+    def test_scheduler_print_config(self, mock_process, mock_scheduler_job):
+        args = self.parser.parse_args(['scheduler'])
+        with conf_vars({("core", "sql_alchemy_conn_cmd"): 'echo hello'}):
+            with contextlib.redirect_stdout(io.StringIO()) as temp_stdout:
+                scheduler_command.scheduler(args)
+                output = temp_stdout.getvalue()
+
+                assert "sql_alchemy_conn = < hidden > [cmd]" in output
+                assert "max_active_tasks_per_dag = 16 [airflow.cfg]" in output
+                assert "parallelism = < hidden > [env var]" in output

Review comment:
       Ough. I did not know `display_sensitive` is "so sensitive" :) .
   
   I thin what we could do though is to implement a new flag in the dict `less_sensitive_config` :) (?) and do few things there:
   * hide anything coming from secret (I think there is flag that just shows path rather than value for that already and it is on by default) 
   * We already have the list of sensitive configs - we could hide them by default:
   ```
       sensitive_config_values = {
           ('core', 'sql_alchemy_conn'),
           ('core', 'fernet_key'),
           ('celery', 'broker_url'),
           ('celery', 'flower_basic_auth'),
           ('celery', 'result_backend'),
           ('atlas', 'password'),
           ('smtp', 'smtp_password'),
           ('webserver', 'secret_key'),
       }
   ```
   * we can pass config through "redact()" - and we could extend the semantics of `sensitive_var_conn_names` to also cover key names of sensitive data (or we could even add another config `sensitive_configuration_names` where user could specify which of the config they want to hide). 
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1084533783


   Another option is to "query" or send signal to runing webserver/scheduler/triggerer (traditionally SIGUSR1 is used for similar feature) to know what configuration it sees, but I think dumping it to the logs at startup is far easier for diagnostic purposes (if it is written only once at startup).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#discussion_r841222132



##########
File path: tests/cli/commands/test_scheduler_command.py
##########
@@ -87,3 +91,18 @@ def test_graceful_shutdown(self, executor, mock_process, mock_scheduler_job):
                 scheduler_command.scheduler(args)
             finally:
                 mock_process().terminate.assert_called()
+
+    @mock.patch.dict(os.environ, {'AIRFLOW__CORE__PARALLELISM': '35'}, clear=True)
+    @mock.patch("airflow.cli.commands.scheduler_command.SchedulerJob")
+    @mock.patch("airflow.cli.commands.scheduler_command.Process")
+    def test_scheduler_print_config(self, mock_process, mock_scheduler_job):
+        args = self.parser.parse_args(['scheduler'])
+        with conf_vars({("core", "sql_alchemy_conn_cmd"): 'echo hello'}):
+            with contextlib.redirect_stdout(io.StringIO()) as temp_stdout:
+                scheduler_command.scheduler(args)
+                output = temp_stdout.getvalue()
+
+                assert "sql_alchemy_conn = < hidden > [cmd]" in output
+                assert "max_active_tasks_per_dag = 16 [airflow.cfg]" in output
+                assert "parallelism = < hidden > [env var]" in output

Review comment:
       > Thanks @potiuk , I would like to have this as a separate issue since it might need more discussion on what and how to filter out items.
   
   Sure. Works for me :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] tirkarthi commented on a change in pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
tirkarthi commented on a change in pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#discussion_r840379605



##########
File path: airflow/utils/cli.py
##########
@@ -328,3 +344,32 @@ def _wrapper(*args, **kwargs):
                     logging.disable(logging.NOTSET)
 
     return cast(T, _wrapper)
+
+
+def get_config_with_source(include_default=False):
+    """
+    Return configuration along with source for each option.
+    """
+    config_dict = conf.as_dict(display_source=True)
+
+    with io.StringIO() as buf, redirect_stdout(buf):
+        for section, options in config_dict.items():
+            if not include_default:
+                options = {
+                    key: (value, source) for key, (value, source) in options.items() if source != "default"
+                }
+
+            # Print the section only when there are options after filtering
+            if options:
+                print(f"[{section}]")
+                for key, (value, source) in options.items():
+                    if not include_default and source == "default":
+                        continue

Review comment:
       Thanks, will remove it. It was leftover code after comprehension.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1084984832


   I'd say that one clarifies for PR of the month ;) (pity the newsletter is out).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1085771152


   The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on a change in pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on a change in pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#discussion_r840956304



##########
File path: tests/cli/commands/test_scheduler_command.py
##########
@@ -87,3 +91,18 @@ def test_graceful_shutdown(self, executor, mock_process, mock_scheduler_job):
                 scheduler_command.scheduler(args)
             finally:
                 mock_process().terminate.assert_called()
+
+    @mock.patch.dict(os.environ, {'AIRFLOW__CORE__PARALLELISM': '35'}, clear=True)
+    @mock.patch("airflow.cli.commands.scheduler_command.SchedulerJob")
+    @mock.patch("airflow.cli.commands.scheduler_command.Process")
+    def test_scheduler_print_config(self, mock_process, mock_scheduler_job):
+        args = self.parser.parse_args(['scheduler'])
+        with conf_vars({("core", "sql_alchemy_conn_cmd"): 'echo hello'}):
+            with contextlib.redirect_stdout(io.StringIO()) as temp_stdout:
+                scheduler_command.scheduler(args)
+                output = temp_stdout.getvalue()
+
+                assert "sql_alchemy_conn = < hidden > [cmd]" in output
+                assert "max_active_tasks_per_dag = 16 [airflow.cfg]" in output
+                assert "parallelism = < hidden > [env var]" in output

Review comment:
       That was what I was fearing it might be. I guess let's not worry about it now for now. This does make this less comprehensive than I was initially imaging as it's not uncommon to use env vars for non-sensitive config as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#discussion_r841056341



##########
File path: tests/cli/commands/test_scheduler_command.py
##########
@@ -87,3 +91,18 @@ def test_graceful_shutdown(self, executor, mock_process, mock_scheduler_job):
                 scheduler_command.scheduler(args)
             finally:
                 mock_process().terminate.assert_called()
+
+    @mock.patch.dict(os.environ, {'AIRFLOW__CORE__PARALLELISM': '35'}, clear=True)
+    @mock.patch("airflow.cli.commands.scheduler_command.SchedulerJob")
+    @mock.patch("airflow.cli.commands.scheduler_command.Process")
+    def test_scheduler_print_config(self, mock_process, mock_scheduler_job):
+        args = self.parser.parse_args(['scheduler'])
+        with conf_vars({("core", "sql_alchemy_conn_cmd"): 'echo hello'}):
+            with contextlib.redirect_stdout(io.StringIO()) as temp_stdout:
+                scheduler_command.scheduler(args)
+                output = temp_stdout.getvalue()
+
+                assert "sql_alchemy_conn = < hidden > [cmd]" in output
+                assert "max_active_tasks_per_dag = 16 [airflow.cfg]" in output
+                assert "parallelism = < hidden > [env var]" in output

Review comment:
       Ough. I did not know `display_sensitive` is "so sensitive" :) .
   
   I thin what we could do though is to implement a new flag in the dict `guess_sensitive_data` (?) and do few things there:
   * hide anything coming from secret (I think there is flag that just shows path rather than value for that already and it is on by default) 
   * We already have the list of sensitive configs - we could hide them by default:
   ```
       sensitive_config_values = {
           ('core', 'sql_alchemy_conn'),
           ('core', 'fernet_key'),
           ('celery', 'broker_url'),
           ('celery', 'flower_basic_auth'),
           ('celery', 'result_backend'),
           ('atlas', 'password'),
           ('smtp', 'smtp_password'),
           ('webserver', 'secret_key'),
       }
   ```
   * we can pass config through "redact()" - and we could extend the semantics of `sensitive_var_conn_names` to also cover key names of sensitive data (or we could even add another config `sensitive_configuration_names` where user could specify which of the config they want to hide). 
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] tirkarthi commented on a change in pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
tirkarthi commented on a change in pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#discussion_r839921445



##########
File path: airflow/utils/cli.py
##########
@@ -341,19 +346,30 @@ def _wrapper(*args, **kwargs):
     return cast(T, _wrapper)
 
 
-def get_config_with_source(include_default=False):
+def get_config_with_source(include_default=False, include_colors=True):
     """
     Return configuration along with source for each option.
     """
-    parser = AirflowConfigParser(strict=False, interpolation=None)
     config_dict = conf.as_dict(display_source=True)
 
     with io.StringIO() as buf, redirect_stdout(buf):
         for section, options in config_dict.items():
-            print(f"[{section}]")
-            for key, (value, source) in options.items():
-                if not include_default and source == "default":
-                    continue
-                print(f"{key} = {value} [{source}]")
-            print()
-        return buf.getvalue()
+            if not include_default:
+                options = {
+                    key: (value, source) for key, (value, source) in options.items() if source != "default"
+                }
+
+            # Print the section only when there are options after filtering
+            if options:
+                print(f"[{section}]")
+                for key, (value, source) in options.items():
+                    if not include_default and source == "default":
+                        continue
+                    print(f"{key} = {value} [{source}]")
+                print()
+        code = buf.getvalue()
+
+        if include_colors:
+            code = pygments.highlight(code=code, formatter=get_terminal_formatter(), lexer=IniLexer())

Review comment:
       Thanks, makes sense. Used `is_terminal_support_colors` for highlighting. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on a change in pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on a change in pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#discussion_r840958346



##########
File path: docs/apache-airflow/howto/set-config.rst
##########
@@ -126,6 +126,10 @@ the example below.
     work as expected. A good example for that is :ref:`secret_key<config:webserver__secret_key>` which
     should be same on the Webserver and Worker to allow Webserver to fetch logs from Worker.
 
+    During startup Airflow scheduler prints configuration options whose values are different from default
+    values along with source from which the value was loaded i.e. airflow.cfg, environment variable, etc.

Review comment:
       ```suggestion
       During startup the scheduler prints configuration options whose values differ from the default
       values along with the source from which the value was loaded (i.e. airflow.cfg, environment variable, etc).
   ```
   
   nit




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1083153602


   Couple things: I think we should only show non-default config (I'm not sure if this is already the case, but I don't think it is) and should we do this for other components also?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1084929413


   Oh wow. That looks cool. I think adding syntax highliting/rich colors would be fantastic in the print() but then also logging it to the log (without syntax) migh tbe useful. The first one wiell be great for casual users for testing and seeing that their specifc values are used, the second would be for diagnosticts (sometimes stuiff printed on stdout might be silently swallowed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#discussion_r839910787



##########
File path: airflow/utils/cli.py
##########
@@ -341,19 +346,30 @@ def _wrapper(*args, **kwargs):
     return cast(T, _wrapper)
 
 
-def get_config_with_source(include_default=False):
+def get_config_with_source(include_default=False, include_colors=True):
     """
     Return configuration along with source for each option.
     """
-    parser = AirflowConfigParser(strict=False, interpolation=None)
     config_dict = conf.as_dict(display_source=True)
 
     with io.StringIO() as buf, redirect_stdout(buf):
         for section, options in config_dict.items():
-            print(f"[{section}]")
-            for key, (value, source) in options.items():
-                if not include_default and source == "default":
-                    continue
-                print(f"{key} = {value} [{source}]")
-            print()
-        return buf.getvalue()
+            if not include_default:
+                options = {
+                    key: (value, source) for key, (value, source) in options.items() if source != "default"
+                }
+
+            # Print the section only when there are options after filtering
+            if options:
+                print(f"[{section}]")
+                for key, (value, source) in options.items():
+                    if not include_default and source == "default":
+                        continue
+                    print(f"{key} = {value} [{source}]")
+                print()
+        code = buf.getvalue()
+
+        if include_colors:
+            code = pygments.highlight(code=code, formatter=get_terminal_formatter(), lexer=IniLexer())

Review comment:
       one small thing, sorry - i think we should check if we are in terminal or not and only apply color if we are 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on a change in pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on a change in pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#discussion_r839949884



##########
File path: tests/cli/commands/test_scheduler_command.py
##########
@@ -87,3 +91,18 @@ def test_graceful_shutdown(self, executor, mock_process, mock_scheduler_job):
                 scheduler_command.scheduler(args)
             finally:
                 mock_process().terminate.assert_called()
+
+    @mock.patch.dict(os.environ, {'AIRFLOW__CORE__PARALLELISM': '35'}, clear=True)
+    @mock.patch("airflow.cli.commands.scheduler_command.SchedulerJob")
+    @mock.patch("airflow.cli.commands.scheduler_command.Process")
+    def test_scheduler_print_config(self, mock_process, mock_scheduler_job):
+        args = self.parser.parse_args(['scheduler'])
+        with conf_vars({("core", "sql_alchemy_conn_cmd"): 'echo hello'}):
+            with contextlib.redirect_stdout(io.StringIO()) as temp_stdout:
+                scheduler_command.scheduler(args)
+                output = temp_stdout.getvalue()
+
+                assert "sql_alchemy_conn = < hidden > [cmd]" in output
+                assert "max_active_tasks_per_dag = 16 [airflow.cfg]" in output
+                assert "parallelism = < hidden > [env var]" in output

Review comment:
       Hmm, why is parallelism hidden 🤔

##########
File path: airflow/utils/cli.py
##########
@@ -328,3 +344,32 @@ def _wrapper(*args, **kwargs):
                     logging.disable(logging.NOTSET)
 
     return cast(T, _wrapper)
+
+
+def get_config_with_source(include_default=False):
+    """
+    Return configuration along with source for each option.
+    """
+    config_dict = conf.as_dict(display_source=True)
+
+    with io.StringIO() as buf, redirect_stdout(buf):
+        for section, options in config_dict.items():
+            if not include_default:
+                options = {
+                    key: (value, source) for key, (value, source) in options.items() if source != "default"
+                }
+
+            # Print the section only when there are options after filtering
+            if options:
+                print(f"[{section}]")
+                for key, (value, source) in options.items():
+                    if not include_default and source == "default":
+                        continue

Review comment:
       ```suggestion
   ```
   
   We've already filtered these out, no?

##########
File path: airflow/utils/cli.py
##########
@@ -328,3 +344,32 @@ def _wrapper(*args, **kwargs):
                     logging.disable(logging.NOTSET)
 
     return cast(T, _wrapper)
+
+
+def get_config_with_source(include_default=False):

Review comment:
       ```suggestion
   def get_config_with_source(include_default: bool = False) -> str:
   ```
   
   nit: let's add typehints




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] tirkarthi commented on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
tirkarthi commented on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1085676046


   Thanks @jedcunningham I have fixed comments except for the <hidden> value for env, cmd. Thanks @potiuk, there is a note in the docs about configuration being different across webserver and worker can cause issue. I have added a note there since I felt it's appropriate. Feel free to suggest other wording if it can be explained better.
   
   > Use the same configuration across all the Airflow components. While each component does not require all, some configurations need to be same otherwise they would not work as expected. A good example for that is [secret_key](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#config-webserver-secret-key) which should be same on the Webserver and Worker to allow Webserver to fetch logs from Worker.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1084530946


   > Couple things: I think we should only show non-default config (I'm not sure if this is already the case, but I don't think it is) and should we do this for other components also?
   
   I fully agree we should have it for all components. However I think showing complete config is a bit better. Ideally it should also show where the config came from =- default, config file, variable, secret. And this might need a bit modification of the conf code. But I tink it is crucial part of this change - knowing where each configuration variable came from would make it infintely easier to diagnose problems of people who - for example - have systemd and do not realize their `AIRFLOW___` variable is not set  when their server is started but they have it set when the login interactively and their .bashrc is actually sourced then.
   
   Not seeing the config values  currently is the major source of problems (I've seen multiple issues about it) for people who do not realize that `airflow info` can actually produce different set of configuration than whan their webserver or scheduler observes.
   
   Such misconfigurations are next to impossible to diagnose currently, they rely mostly on guessing.
   
   Note that gunicorn makes it a bit difficult for webserver because it will print logs every time gunicorn restarts the process which makes it rather annoying - so this information shoudl only be printed in the main process of webserver. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1084530946


   > Couple things: I think we should only show non-default config (I'm not sure if this is already the case, but I don't think it is) and should we do this for other components also?
   
   I fully agree we should have it for all components. However I think showing complete config is a bit better. Ideally it should also show where the config came from =- default, config file, variable, secret. And this might need a bit modification of the conf code. But I tink it is crucial part of this change - knowing wheree each configuration variable came from would make it infintely easier to diagnose problems of people who - for example - have systemd and do not realize their `AIRFLOW___` variable is not set  when their server is started but they have it set when the login interactively and their .bashrc is actually sourced then.
   
   Not seeing the config values  currently is the major source of problems (I've seen multiple issues about it) for people who do not realize that `airflow info` can actually produce different set of configuration than whan their webserver or scheduler sees.
   
   Such misconfigurations are next to impossible to diagnose currently, they rely mostly on guessing.
   
   Note that gunicorn makes it a bit difficult for webserver because it will print logs every time gunicorn restarts the process which makes it rather annoying - so this information shoudl only be printed in the main process of webserver. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] tirkarthi commented on a change in pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
tirkarthi commented on a change in pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#discussion_r840379038



##########
File path: tests/cli/commands/test_scheduler_command.py
##########
@@ -87,3 +91,18 @@ def test_graceful_shutdown(self, executor, mock_process, mock_scheduler_job):
                 scheduler_command.scheduler(args)
             finally:
                 mock_process().terminate.assert_called()
+
+    @mock.patch.dict(os.environ, {'AIRFLOW__CORE__PARALLELISM': '35'}, clear=True)
+    @mock.patch("airflow.cli.commands.scheduler_command.SchedulerJob")
+    @mock.patch("airflow.cli.commands.scheduler_command.Process")
+    def test_scheduler_print_config(self, mock_process, mock_scheduler_job):
+        args = self.parser.parse_args(['scheduler'])
+        with conf_vars({("core", "sql_alchemy_conn_cmd"): 'echo hello'}):
+            with contextlib.redirect_stdout(io.StringIO()) as temp_stdout:
+                scheduler_command.scheduler(args)
+                output = temp_stdout.getvalue()
+
+                assert "sql_alchemy_conn = < hidden > [cmd]" in output
+                assert "max_active_tasks_per_dag = 16 [airflow.cfg]" in output
+                assert "parallelism = < hidden > [env var]" in output

Review comment:
       I have `display_sensitive=False` (default value is False) which hides all values from env, vars and command. I don't want to pass display_sensitive=True since that might dump sensitive values to logs in case of starting as daemon or systemd. I guess we can try filtering for sensitive values in the for loop similar to how "default" values are filtered out so that other values from env/cmd are shown. Thoughts?
   
   https://github.com/apache/airflow/blob/dc75f5d8768c8a42df29c86beb519de282539e1f/airflow/configuration.py#L717-L719




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1084754475


   Also showing the source makes perfect sense to me, so +1 on that.
   
   I still think we should only show the non-defaults, however. Way less noise, and makes my typical first question of "what config have you changed" much easier to answer 🤷‍♂️. Maybe we send the non-defaults to logs, and let SIGURS1 dump it all? Best of both worlds?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#discussion_r841056341



##########
File path: tests/cli/commands/test_scheduler_command.py
##########
@@ -87,3 +91,18 @@ def test_graceful_shutdown(self, executor, mock_process, mock_scheduler_job):
                 scheduler_command.scheduler(args)
             finally:
                 mock_process().terminate.assert_called()
+
+    @mock.patch.dict(os.environ, {'AIRFLOW__CORE__PARALLELISM': '35'}, clear=True)
+    @mock.patch("airflow.cli.commands.scheduler_command.SchedulerJob")
+    @mock.patch("airflow.cli.commands.scheduler_command.Process")
+    def test_scheduler_print_config(self, mock_process, mock_scheduler_job):
+        args = self.parser.parse_args(['scheduler'])
+        with conf_vars({("core", "sql_alchemy_conn_cmd"): 'echo hello'}):
+            with contextlib.redirect_stdout(io.StringIO()) as temp_stdout:
+                scheduler_command.scheduler(args)
+                output = temp_stdout.getvalue()
+
+                assert "sql_alchemy_conn = < hidden > [cmd]" in output
+                assert "max_active_tasks_per_dag = 16 [airflow.cfg]" in output
+                assert "parallelism = < hidden > [env var]" in output

Review comment:
       Ough. I did not know `display_sensitive` is "so sensitive" :) .
   
   I thin what we could do though is to implement a new flag in the dict `guess_sensitive_data` (?) and do few things there:
   * hide anything coming from secret (I think there is flag that just shows path rather than value for that already and it is on by default) 
   * We already have the list of sensitive configs:
   ```
       sensitive_config_values = {
           ('core', 'sql_alchemy_conn'),
           ('core', 'fernet_key'),
           ('celery', 'broker_url'),
           ('celery', 'flower_basic_auth'),
           ('celery', 'result_backend'),
           ('atlas', 'password'),
           ('smtp', 'smtp_password'),
           ('webserver', 'secret_key'),
       }
   ```
   * we can pass config through "redact()" - and we could extend the semantics of `sensitive_var_conn_names` to also cover key names of sensitive data (or we could even add another config `sensitive_configuration_names` where user could specify which of the config they want to hide). 
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1084530946


   > Couple things: I think we should only show non-default config (I'm not sure if this is already the case, but I don't think it is) and should we do this for other components also?
   
   I fully agree we should have it for all components. However I think showing complete config is a bit better. Ideally it should also show where the config came from =- default, config file, variable, secret. And this might need a bit modification of the conf code. But I tink it is crucial part of this change - knowing where each configuration variable came from would make it infintely easier to diagnose problems of people who - for example - have systemd and do not realize their `AIRFLOW___` variable is not set  when their server is started but they have it set when the login interactively and their .bashrc is actually sourced then.
   
   Not seeing the config values  currently is the major source of problems (I've seen multiple issues about it) for people who do not realize that `airflow info` can actually produce different set of configuration than whan their webserver or scheduler sees.
   
   Such misconfigurations are next to impossible to diagnose currently, they rely mostly on guessing.
   
   Note that gunicorn makes it a bit difficult for webserver because it will print logs every time gunicorn restarts the process which makes it rather annoying - so this information shoudl only be printed in the main process of webserver. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1084773199


   > Also showing the source makes perfect sense to me, so +1 on that.
   > 
   > I still think we should only show the non-defaults, however. Way less noise, and makes my typical first question of "what config have you changed" much easier to answer man_shrugging. Maybe we send the non-defaults to logs, and let SIGURS1 dump it all? Best of both worlds?
   
   Yeah. less noise for sure. And yeah. You are right it shows "changes" - I like it more :). SIGUSR1 woudl be a bonus I think ;)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #22588: Print configuration on scheduler startup.

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22588:
URL: https://github.com/apache/airflow/pull/22588#issuecomment-1084983610


   NICE! @jedcunningham - WDYT? I think this one is cool.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org