You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/12/10 06:07:43 UTC

[GitHub] [airflow] blag opened a new issue, #28280: Inconsistent handling of newlines in tasks

blag opened a new issue, #28280:
URL: https://github.com/apache/airflow/issues/28280

   ### Apache Airflow version
   
   main (development)
   
   ### What happened
   
   Variables with newlines in them are not always reproduced exactly as they are pasted into the web UI. Depending on how they are rendered, Unix newlines (`\n`) are silently converted to Windows newlines (`\r\n`) when rendered.
   
   Additionally, when users try to specify `\r` CR/"carriage return" in Python code, that gets rendered as a `\n` LF/"line feed" character.
   
   Tested on both astro and Airflow with Breeze.
   
   ### What you think should happen instead
   
   Unix newlines (`\n`) should always be rendered as Unix newlines (`\n`); Windows newlines (`\r\n`) should always be rendered as Windows newlines.
   
   ### How to reproduce
   
   Steps to reproduce:
   
   1. Spin up Airflow with Breeze (`breeze start-airflow`) - I tested with the PostgreSQL database backend
   2. Install `dos2unix` within the Airflow container: `apt update; apt install --yes dos2unix`
   3. Create a Variable in the Airflow UI named `newlines`, and ensure that its content contains newlines. Note that the non-newline content doesn't really matter. I used:
      ```
      Line 1
      Line 2
      Line 3
      ```
   4. Run this dag:
      ```python
      from datetime import datetime
      
      from airflow import DAG
      from airflow.operators.bash import BashOperator
      from airflow.operators.smooth import SmoothOperator
      
      
      with DAG("test_newlines", schedule=None) as dag:
          test_bash_operator_multiline_env_var = BashOperator(
              dag=dag,
              task_id="test_bash_operator_multiline_env_var",
              start_date=datetime.now(),
              env={
                  "MULTILINE_ENV_VAR": """Line 1
      Line 2
      Line 3"""
              },
              append_env=True,
              bash_command='''
      if [[ "$(echo $MULTILINE_ENV_VAR)" != "$(echo $MULTILINE_ENV_VAR | dos2unix)" ]]; then
          echo >&2 "Multiline environment variable contains newlines incorrectly converted to Windows CRLF"
          exit 1
      fi''',
          )
      
          test_bash_operator_heredoc_contains_newlines = BashOperator(
              dag=dag,
              task_id="test_bash_operator_heredoc_contains_newlines",
              start_date=datetime.now(),
              bash_command="""
      diff <(
      cat <<EOF
      Line 1
      Line 2
      Line 3
      EOF
      ) <(
      cat | dos2unix <<EOF
      Line 1
      Line 2
      Line 3
      EOF
      ) || {
          echo >&2 "Bash heredoc contains newlines incorrectly converted to Windows CRLF"
          exit 1
      }
      """,
          )
      
          test_bash_operator_env_var_from_variable_jinja_interpolation = BashOperator(
              dag=dag,
              task_id="test_bash_operator_env_var_from_variable_jinja_interpolation",
              start_date=datetime.now(),
              env={
                  "ENV_VAR_AIRFLOW_VARIABLE_WITH_NEWLINES": "{{ var.value['newlines'] }}",
              },
              append_env=True,
              bash_command='''
      diff <(echo "$ENV_VAR_AIRFLOW_VARIABLE_WITH_NEWLINES") <(echo "$ENV_VAR_AIRFLOW_VARIABLE_WITH_NEWLINES" | dos2unix) || {
          echo >&2 "Environment variable contains newlines incorrectly converted to Windows CRLF"
          exit 1
      }
      ''',
          )
      
          test_bash_operator_from_variable_jinja_interpolation = BashOperator(
              dag=dag,
              task_id="test_bash_operator_from_variable_jinja_interpolation",
              start_date=datetime.now(),
              bash_command='''
      diff <(echo "{{ var.value['newlines'] }}") <(echo "{{ var.value['newlines'] }}" | dos2unix) || {
          echo >&2 "Jinja interpolated string contains newlines incorrectly converted to Windows CRLF"
          exit 1
      }
      ''',
          )
      
          test_bash_operator_backslash_n_not_equals_backslash_r = BashOperator(
              dag=dag,
              task_id="test_bash_operator_backslash_n_not_equals_backslash_r",
              start_date=datetime.now(),
              bash_command='''
      if [[ "\r" == "\n" ]]; then
          echo >&2 "Backslash-r has been incorrectly converted into backslash-n"
          exit 1
      fi''',
          )
      
          passing = SmoothOperator(dag=dag, task_id="passing", start_date=datetime.now())
          failing = SmoothOperator(dag=dag, task_id="failing", start_date=datetime.now())
      
          [
              test_bash_operator_multiline_env_var,
              test_bash_operator_heredoc_contains_newlines,
          ] >> passing
          [
              test_bash_operator_env_var_from_variable_jinja_interpolation,
              test_bash_operator_from_variable_jinja_interpolation,
              test_bash_operator_backslash_n_not_equals_backslash_r,
          ] >> failing
      ```
   
   Warning: The `test_bash_operator_heredoc_contains_newlines` may not actually ever complete in Breeze. It does complete successfully when using `astro`. That might be another bug.
   
   ### Operating System
   
   macOS 11.6.8 "Big Sur"
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   Local breeze environment
   Local `astro` environment
   
   ### Anything else
   
   Every time.
   
   I'm still trying to narrow down on a root cause so I don't have a PR yet. But somebody else might have a better idea of where to look and how to fix it, so don't let me stop you.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1359916847

   Good catch.
   
   
   It is because this is what Jinja2 does by default - it replaces any new line it detects in the template (not in the values) with the value of newline_sequence passed as parameter of the Template (which is `\n` by default). This is nothing specific to Airflow - it's just how Jinja2 works.
   
   You can test it yourself:
   
   When you run this In a regular terminal this one (same with no newline_sequence):
   ```
   import jinja2
   
   jinjatemplate = jinja2.Template('Hello \r{{ name }}!', newline_sequence="\n")
   print(jinjatemplate.render(name="World"))
   ```
   
   You will get:
   
   ```
   Hello 
   World!
   ```
   
   Where this one:
   
   ```
   import jinja2
   
   jinjatemplate = jinja2.Template('Hello \r{{ name }}!', newline_sequence="\r")
   print(jinjatemplate.render(name="World"))
   ```
   
   Will print just:
   
   ```
   World!
   ```
   
   The thing is that Jinja2 treats the template as multi-line text file.  It reads the temple line by line and later joins the line using "newline_sequence" as separator. I think this has a lot to do with the ways how some formatting of jinja template and whitespace is done. For example it would be rather difficult to achieve whitespace manipulation that Jinja allows you to do with whitespace control: https://jinja.palletsprojects.com/en/3.1.x/templates/#whitespace-control - so JINJA authors actually opted to define what is the "canonical newline" they always use in the rendered template. Which makes a lot of sense - imagine the complexity if you have a template that mixes both \r and 'n and Jinja adding/removing those - with some level of consistency.
   
   Additionally, the `test_bash_operator_heredoc_contains_newlines` does not complete successfuly for you because it has a bug (in the task definition). 
   
   Proper command should be:
   
   ```
          bash_command="""
   diff <(
   cat <<EOF
   Line 1
   Line 2
   Line 3
   EOF
   ) <(
   cat <<EOF | dos2unix   # <- Here EOF was in a wrong place
   Line 1
   Line 2
   Line 3
   EOF
   ) || {
       echo >&2 "Bash heredoc contains newlines incorrectly converted to Windows CRLF"
       exit 1
   }
   """,
   ```
   
   The problem was that heredoc was pased as input to dos2unix command, not to the `cat` command and the cat command simply read from the stdin. Apparently astro sdk somehow replaces the stdin when it runs airflow tasks and cat simply gets empty string or smth like that.
   
   Solution:
   
   I am not sure if we want to do anything with it. This seems to be standard Jinja2 behaviour, and even if it kinda unexpected, I think this is the behaviour our users already likely rely on and changing it would be backwards-incompatible. Maybe we could specify newline_sequence somewhere in DAG definition (same as render_template_as_native_object?) 
   
   In any case, It would be great however if this behaviour is documented.
   
   WDYT?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1366427691

   Unfortunately we would technically break backward compatibility if we do this, but I don’t think there’s a good way around it. Maybe alternatively we should introduce a config to normalise when a variable is _read_ (and still store them as-is without normalisation) so people can enable it if needed. (It’s probably not very relevant for most use cases.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1366424092

   HTML and CSS are AWESOME pair. They never case to Amase me. 
   
   
   TiL (@uranusjr  thanks for finding this gem).
   
   Yes, we could canonicalise it. But that would have to be done in backwards incompatibke way if we do - this time we would have to do it in a global way per installation and add some migration to handle all the cases (adding a checkbox of whether to converting would be i guess utterly confusing because almost every one of our users entering the value would have no idea what we are asking them to)
   
   My question is - should we rally do something about it or is it mostly academic issue? Do you think @blag  this is a real issue that hits our users (what cases?) or is it just the question of treating those behaviours as status quo and maybe just documenting it better ? 
   
   I personally think that cr/lf confusions has been long enough in many tool that they are handling the confusion well. For example Python when reading string with 'text' parametr will treat the textas containing Universal newlines (that was the deprecated name of this parameter) and in vast majority of case it will **just work** (this is mostly why we have not see this raised as a problem so far).
   
   WDYT? Is it worth the hassle of fixing and handling backwards compatibilities @blag? Would you take on such a task if we think it is ? (I think it would have to be done really carefully as some users who actually saw it posaibly apllied already some workarounds to account for that and weight break those workarounds).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] blag closed issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
blag closed issue #28280: Inconsistent handling of newlines in tasks
URL: https://github.com/apache/airflow/issues/28280


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1366846230

   Yep. Let's keep it as the kind of issue that we can send anyone to if they stumble upon similar case :).
   
   We can even start counting.
   
    Number of people experiencing the less-than-stellar handling of newlines in both browser and Jinja: :one: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1362549266

   > Perhaps a better solution would be to allos _per-field_ specification. One simple way would be to accept passing a `jinja2.Template` to a templated field directly, e.g.
   > 
   > ```python
   > BashOperator(
   >     bash_command=jinja2.Template("...", newline_sequence="\r"),
   > )
   > ```
   
   I like it, but we have to be careful about environment there and replace it. The default Template environment is "ad_hoc" one but we are using by default sanitized (and slightly customized) "Sandboxed" environment or "Native" - depending on the 'render_template_as_native_object' so we should make sure the environment from that template is not use directly I think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1362758189

   Or add template_newline_sequence simply ? I think opening it up for "any" parameters of Template is kinda dangerous.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1362495682

   I agree, while treating line endings verbatim is a technically a good solution, it would confuse most users too much to be worthwhile. Note that _Python_ also does the same when it reads files for the same reason: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
   
   Perhaps a better solution would be to allos _per-field_ specification. One simple way would be to accept passing a `jinja2.Template` to a templated field directly, e.g.
   
   ```python
   BashOperator(
       bash_command=jinja2.Template("...", newline_sequence="\r"),
   )
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1366520133

   > Unfortunately we would technically break backward compatibility if we do this, but I don’t think there’s a good way around it. Maybe alternatively we should introduce a config to normalise when a variable is _read_ (and still store them as-is without normalisation) so people can enable it if needed. (It’s probably not very relevant for most use cases.)
   
   Yep. The 'read' time normalisation is essentially what 'text=True` does in Python stdlib's read. I guess indeed following this approach and adding 'get(var, text: bool) for variable would be the best way to implement it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1362752376

   Maybe a cleaner? alternative would be to allow for task-level templating options, such as
   
   ```python
   BashOperator(
       template_options={"newline_sequence: "\r"},
   )
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] blag commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
blag commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1366839468

   TIL. What an odd choice.
   
   While there are many tools that will happily deal with all of the different forms of newlines, sadly some tools that remain in popular usage still do not gracefully handle unexpected newline characters.
   
   I ran into this issue when I was "going off the beaten path" a bit - I'm storing an SSH key in an Airflow variable and writing it to a file before passing it to `ssh` (the SSH operator does not fit this particular use case). Given that, I think it really boils down to how strange of use cases we want to directly support.
   
   Allowing users to normalize variables on read would solve the problem, but simply piping values through `dos2unix` solves the problem as well, and I would expect that practically every tool that doesn't handle non-normalized newlines would be run directly via the BashOperator.
   
   As inconsistent as all of this is, unless there's a more typical use case for normalizing variable text, I don't think we need to change anything in Airflow itself. And since the issue is far removed from Airflow as well, I don't even know if Airflow would be an appropriate place to document this.
   
   Thank you both for your attention, but since there's nothing to really fix, I'm going to close this issue. It can be reopened or duplicated if somebody else runs into this issue and has a more typical use case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] blag commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
blag commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1366352922

   Thank you both for the bug fix and the explanations. That explains the behavior of tasks that succeed and why `test_bash_operator_backslash_n_not_equals_backslash_r` fails, but I don't understand how that explains `test_bash_operator_env_var_from_variable_jinja_interpolation` or `test_bash_operator_from_variable_jinja_interpolation`. Both of those have their Unix newlines replaced with Windows newlines, yet Windows is _not_ involved in any part of my environment. It would make more sense to me that the rendered templates would be rendered with newlines appropriate for the platform they're being rendered on (Linux). But I'm happy to be wrong and learn more about this.
   
   I've also noticed that variables set via the web UI are the ones that have their newlines translated to CRLF:
   
   ```python
   from datetime import datetime
   
   from airflow import DAG
   from airflow.operators.bash import BashOperator
   from airflow.operators.smooth import SmoothOperator
   
   
   with DAG("test_newlines", schedule=None) as dag:
       test_bash_operator_env_var_from_variable_webui_jinja_interpolation = BashOperator(
           dag=dag,
           task_id="test_bash_operator_env_var_from_variable_webui_jinja_interpolation",
           start_date=datetime.now(),
           env={
               "ENV_VAR_AIRFLOW_VARIABLE_WITH_NEWLINES": "{{ var.value['newlines'] }}",
           },
           append_env=True,
           bash_command='''
   diff <(echo "$ENV_VAR_AIRFLOW_VARIABLE_WITH_NEWLINES") <(echo "$ENV_VAR_AIRFLOW_VARIABLE_WITH_NEWLINES" | dos2unix) || {
       echo >&2 "Environment variable contains newlines incorrectly converted to Windows CRLF"
       exit 1
   }
   ''',
       )
   
       test_bash_operator_from_variable_webui_jinja_interpolation = BashOperator(
           dag=dag,
           task_id="test_bash_operator_from_variable_webui_jinja_interpolation",
           start_date=datetime.now(),
           bash_command='''
   diff <(echo "{{ var.value['newlines'] }}") <(echo "{{ var.value['newlines'] }}" | dos2unix) || {
       echo >&2 "Jinja interpolated string contains newlines incorrectly converted to Windows CRLF"
       exit 1
   }
   ''',
       )
   
       [
           test_bash_operator_env_var_from_variable_webui_jinja_interpolation,
           test_bash_operator_from_variable_webui_jinja_interpolation,
       ] >> failing
   ```
   
   Is it expected that those tasks would fail? Or should those pass?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1362763550

   I’m thinking it may be a good idea to expose a bit more, and have them nested inside a TypedDict. Being able to customise `block_start_string` and `block_end_string` feels useful, for example.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1362765044

   Yes. If we limit to only some parameters, that might make sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on issue #28280: Inconsistent handling of newlines in tasks

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #28280:
URL: https://github.com/apache/airflow/issues/28280#issuecomment-1366365711

   > I've also noticed that variables set via the web UI are the ones that have their newlines translated to CRLF
   
   Hmm this is a common HTML gotcha; [multi-line user-entered text in the front end is always transmitted through the wire with CRLF line endings](https://stackoverflow.com/questions/14217101/what-character-represents-a-new-line-in-a-text-area). I think we probably should simply canonicalise this into LF before the value is stored into the database.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org