Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/09/01 09:08:18 UTC

[GitHub] [airflow] thejens opened a new pull request #17946: Add robots.txt and X-Robots-Tag header

thejens opened a new pull request #17946:
URL: https://github.com/apache/airflow/pull/17946


   Added a robots.txt file and an X-Robots-Tag header to limit some of the damage caused when Airflow is accidentally exposed to the public internet.
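As a rough illustration only (not the PR's actual implementation), an `X-Robots-Tag` header can be attached to every response in a Flask app with an `after_request` hook:

```python
from flask import Flask

app = Flask(__name__)


@app.after_request
def add_robots_header(response):
    # Ask compliant crawlers not to index this page or follow its links.
    response.headers["X-Robots-Tag"] = "noindex, nofollow"
    return response


@app.route("/")
def index():
    # Stand-in endpoint; in Airflow the header would cover the real UI views.
    return "placeholder"
```

Note that, like robots.txt itself, this only deters well-behaved crawlers; it is not an access control mechanism.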


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on a change in pull request #17946: Add robots.txt and X-Robots-Tag header

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #17946:
URL: https://github.com/apache/airflow/pull/17946#discussion_r700004826



##########
File path: airflow/www/views.py
##########
@@ -2924,6 +2925,16 @@ def tree_data(self):
         # avoid spaces to reduce payload size
         return htmlsafe_json_dumps(tree_data, separators=(',', ':'))
 
+    @expose('/robots.txt')
+    @action_logging
+    def robots(self):
+        """
+        Returns a robots.txt file for blocking certain search engine crawlers. This mitigates some
+        of the risk associated with exposing Airflow to the public internet, however it does not
+        address the real security risks associated with such a deployment.
+        """
+        return send_from_directory(current_app.static_folder, 'robots.txt')

Review comment:
       For such a simple file, it’s probably easier to simply embed `robots.txt` in the view function instead of using `send_from_directory` (which introduces filesystem overhead).







[GitHub] [airflow] uranusjr merged pull request #17946: Add robots.txt and X-Robots-Tag header

Posted by GitBox <gi...@apache.org>.
uranusjr merged pull request #17946:
URL: https://github.com/apache/airflow/pull/17946


   





[GitHub] [airflow] potiuk edited a comment on pull request #17946: Add robots.txt and X-Robots-Tag header

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #17946:
URL: https://github.com/apache/airflow/pull/17946#issuecomment-909463384


   Cool. Thanks for that! The easiest way to add tests is to add a separate test module for the `/robots.txt` endpoint, plus a check in the home view test that the header is returned: https://github.com/apache/airflow/blob/main/tests/www/views/test_views_home.py
   
   The setup is rather easy. For those tests you can set up a local virtualenv (https://github.com/apache/airflow/blob/main/LOCAL_VIRTUALENV.rst) - a classic Python virtualenv - and run `pytest` there (make sure to initialize the DB first, or run the tests once with the custom ``--with-db-init`` flag).
   
   
   You can also set up Breeze (https://github.com/apache/airflow/blob/main/BREEZE.rst) - a more complete, docker-compose-based environment that is an exact replica of what happens in CI. This is as simple as running `./breeze`; the script should guide you, and eventually you will be dropped into a bash interpreter inside the Breeze container with everything ready to run pytest with the tests you want (your sources are mounted in the container, so you can edit them locally and run them in the container immediately). I also recommend installing pre-commit (`pre-commit install`) in your repo, so that all static checks run at commit time.
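A minimal sketch of such a check, using a hypothetical stand-in Flask app rather than Airflow's real app factory (the actual tests live under `tests/www/views/`), might look like:

```python
from flask import Flask, Response


def create_app():
    # Stand-in app; Airflow's tests build the app via its own factory.
    app = Flask(__name__)

    @app.route("/robots.txt")
    def robots():
        # Blanket disallow policy served from an in-memory string.
        return Response("User-agent: *\nDisallow: /\n", mimetype="text/plain")

    @app.after_request
    def add_header(response):
        response.headers["X-Robots-Tag"] = "noindex, nofollow"
        return response

    return app


def test_robots_endpoint():
    client = create_app().test_client()
    resp = client.get("/robots.txt")
    assert resp.status_code == 200
    assert b"Disallow: /" in resp.data
    assert resp.headers["X-Robots-Tag"] == "noindex, nofollow"
```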





[GitHub] [airflow] potiuk commented on a change in pull request #17946: Add robots.txt and X-Robots-Tag header

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #17946:
URL: https://github.com/apache/airflow/pull/17946#discussion_r700131310



##########
File path: airflow/www/views.py
##########
@@ -2924,6 +2925,16 @@ def tree_data(self):
         # avoid spaces to reduce payload size
         return htmlsafe_json_dumps(tree_data, separators=(',', ':'))
 
+    @expose('/robots.txt')
+    @action_logging
+    def robots(self):
+        """
+        Returns a robots.txt file for blocking certain search engine crawlers. This mitigates some
+        of the risk associated with exposing Airflow to the public internet, however it does not
+        address the real security risks associated with such a deployment.
+        """
+        return send_from_directory(current_app.static_folder, 'robots.txt')

Review comment:
       I agree with @thejens too. I'd definitely look for a robots.txt file.







[GitHub] [airflow] thejens commented on a change in pull request #17946: Add robots.txt and X-Robots-Tag header

Posted by GitBox <gi...@apache.org>.
thejens commented on a change in pull request #17946:
URL: https://github.com/apache/airflow/pull/17946#discussion_r700014174



##########
File path: airflow/www/views.py
##########
@@ -2924,6 +2925,16 @@ def tree_data(self):
         # avoid spaces to reduce payload size
         return htmlsafe_json_dumps(tree_data, separators=(',', ':'))
 
+    @expose('/robots.txt')
+    @action_logging
+    def robots(self):
+        """
+        Returns a robots.txt file for blocking certain search engine crawlers. This mitigates some
+        of the risk associated with exposing Airflow to the public internet, however it does not
+        address the real security risks associated with such a deployment.
+        """
+        return send_from_directory(current_app.static_folder, 'robots.txt')

Review comment:
       Easy enough to change - I just thought it would be easier to maintain as a physical file. I wouldn't expect much traffic/load on this endpoint anyway.







[GitHub] [airflow] ryanahamilton commented on a change in pull request #17946: Add robots.txt and X-Robots-Tag header

Posted by GitBox <gi...@apache.org>.
ryanahamilton commented on a change in pull request #17946:
URL: https://github.com/apache/airflow/pull/17946#discussion_r700082216



##########
File path: airflow/www/views.py
##########
@@ -2924,6 +2925,16 @@ def tree_data(self):
         # avoid spaces to reduce payload size
         return htmlsafe_json_dumps(tree_data, separators=(',', ':'))
 
+    @expose('/robots.txt')
+    @action_logging
+    def robots(self):
+        """
+        Returns a robots.txt file for blocking certain search engine crawlers. This mitigates some
+        of the risk associated with exposing Airflow to the public internet, however it does not
+        address the real security risks associated with such a deployment.
+        """
+        return send_from_directory(current_app.static_folder, 'robots.txt')

Review comment:
       I would agree with @thejens on this. The most common place anyone (not only those familiar with Flask) would expect to find this file is in a static assets directory. 
















[GitHub] [airflow] thejens commented on a change in pull request #17946: Add robots.txt and X-Robots-Tag header

Posted by GitBox <gi...@apache.org>.
thejens commented on a change in pull request #17946:
URL: https://github.com/apache/airflow/pull/17946#discussion_r700069825



##########
File path: airflow/www/views.py
##########
@@ -2924,6 +2925,16 @@ def tree_data(self):
         # avoid spaces to reduce payload size
         return htmlsafe_json_dumps(tree_data, separators=(',', ':'))
 
+    @expose('/robots.txt')
+    @action_logging
+    def robots(self):
+        """
+        Returns a robots.txt file for blocking certain search engine crawlers. This mitigates some
+        of the risk associated with exposing Airflow to the public internet, however it does not
+        address the real security risks associated with such a deployment.
+        """
+        return send_from_directory(current_app.static_folder, 'robots.txt')

Review comment:
       Your opinion on the topic is way stronger than mine, so I will amend the PR. For someone not familiar with Flask, though, searching the filesystem for a file named "robots.txt" ought to be the easiest way to find and alter that file's content.










[GitHub] [airflow] thejens commented on a change in pull request #17946: Add robots.txt and X-Robots-Tag header

Posted by GitBox <gi...@apache.org>.
thejens commented on a change in pull request #17946:
URL: https://github.com/apache/airflow/pull/17946#discussion_r700135750



##########
File path: airflow/www/views.py
##########
@@ -2924,6 +2925,16 @@ def tree_data(self):
         # avoid spaces to reduce payload size
         return htmlsafe_json_dumps(tree_data, separators=(',', ':'))
 
+    @expose('/robots.txt')
+    @action_logging
+    def robots(self):
+        """
+        Returns a robots.txt file for blocking certain search engine crawlers. This mitigates some
+        of the risk associated with exposing Airflow to the public internet, however it does not
+        address the real security risks associated with such a deployment.
+        """
+        return send_from_directory(current_app.static_folder, 'robots.txt')

Review comment:
       Right, changing it back again :) - just waiting for the pre-commit hook to finish.










[GitHub] [airflow] thejens commented on pull request #17946: Add robots.txt and X-Robots-Tag header

Posted by GitBox <gi...@apache.org>.
thejens commented on pull request #17946:
URL: https://github.com/apache/airflow/pull/17946#issuecomment-910066904


   @potiuk added tests.





[GitHub] [airflow] uranusjr commented on a change in pull request #17946: Add robots.txt and X-Robots-Tag header

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #17946:
URL: https://github.com/apache/airflow/pull/17946#discussion_r700023924



##########
File path: airflow/www/views.py
##########
@@ -2924,6 +2925,16 @@ def tree_data(self):
         # avoid spaces to reduce payload size
         return htmlsafe_json_dumps(tree_data, separators=(',', ':'))
 
+    @expose('/robots.txt')
+    @action_logging
+    def robots(self):
+        """
+        Returns a robots.txt file for blocking certain search engine crawlers. This mitigates some
+        of the risk associated with exposing Airflow to the public internet, however it does not
+        address the real security risks associated with such a deployment.
+        """
+        return send_from_directory(current_app.static_folder, 'robots.txt')

Review comment:
       The physical file is easy to maintain, but not easy to find for people not intrinsically familiar with Flask. And the file is a line-delimited plain text file, so the advantage over a multi-line string
   
   ```python
   """\
   User-agent: *
   Disallow: /
   """
   ```
   
   is marginal at best.
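Concretely, a sketch of that inline-string alternative (illustrative only, not the code that was merged) could look like:

```python
from flask import Response

# Blanket policy asking compliant crawlers to stay away from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def robots():
    # Serve the policy from an in-memory string instead of a static file,
    # avoiding the filesystem lookup that send_from_directory performs.
    return Response(ROBOTS_TXT, mimetype="text/plain")
```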







[GitHub] [airflow] github-actions[bot] commented on pull request #17946: Add robots.txt and X-Robots-Tag header

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #17946:
URL: https://github.com/apache/airflow/pull/17946#issuecomment-910209722


   The PR is likely OK to be merged with just a subset of tests for the default Python and database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.





[GitHub] [airflow] thejens commented on pull request #17946: Add robots.txt and X-Robots-Tag header

Posted by GitBox <gi...@apache.org>.
thejens commented on pull request #17946:
URL: https://github.com/apache/airflow/pull/17946#issuecomment-909338532


   ... could use some help with where and how to add tests for this, as I didn't find the setup intuitive




