Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/09/26 13:20:57 UTC

[GitHub] [airflow] harryplumer opened a new pull request, #26676: Fix null strings bug in SqlToS3Operator in non parquet formats

harryplumer opened a new pull request, #26676:
URL: https://github.com/apache/airflow/pull/26676

   After https://github.com/apache/airflow/pull/25083 was merged, null strings started appearing as the literal text "None" in the exported file whenever SqlToS3Operator writes CSV output. This causes unintended behavior in most use cases that read the CSV, including loading it into a database.
   
   This PR restores the original handling of null strings and adds unit tests to enforce it going forward.
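
To make the symptom concrete, here is a minimal sketch that is not taken from the operator's code. It assumes only the standard library `csv` module in place of pandas, and `rows_to_csv` is a hypothetical helper. It shows how casting values to strings before writing turns a null into the text "None", while leaving the null untouched produces an empty field.

```python
import csv
import io


def rows_to_csv(rows):
    """Hypothetical helper: serialize rows to a CSV string with the stdlib writer."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = [("a", 1), ("b", 2), (None, 3)]

# Nulls left as None are written as empty fields.
preserved = rows_to_csv(rows)

# Casting every value to str first turns None into the literal text "None".
stringified = rows_to_csv([tuple(str(value) for value in row) for row in rows])

print(preserved)
print(stringified)
```

A downstream loader would read the empty field as a missing value, whereas the stringified variant is indistinguishable from a real four-character string.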
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on a diff in pull request #26676: Fix null strings bug in SqlToS3Operator in non parquet formats

Posted by GitBox <gi...@apache.org>.
eladkal commented on code in PR #26676:
URL: https://github.com/apache/airflow/pull/26676#discussion_r984327443


##########
tests/providers/amazon/aws/transfers/test_sql_to_s3.py:
##########
@@ -145,16 +146,23 @@ def test_execute_json(self, mock_s3_hook, temp_mock):
                 replace=True,
             )
 
-    def test_fix_dtypes(self):
+    @parameterized.expand(
+        [
+            ("with-csv", {"file_format": "csv", "null_string_result": None}),
+            ("with-parquet", {"file_format": "parquet", "null_string_result": "None"}),

Review Comment:
   The operator also accepts the json format; it would be good to add that case as well.





[GitHub] [airflow] eladkal commented on a diff in pull request #26676: Fix null strings bug in SqlToS3Operator in non parquet formats

Posted by GitBox <gi...@apache.org>.
eladkal commented on code in PR #26676:
URL: https://github.com/apache/airflow/pull/26676#discussion_r982320888


##########
tests/providers/amazon/aws/transfers/test_sql_to_s3.py:
##########
@@ -145,16 +145,30 @@ def test_execute_json(self, mock_s3_hook, temp_mock):
                 replace=True,
             )
 
-    def test_fix_dtypes(self):
+    def test_fix_dtypes_csv(self):
         op = SqlToS3Operator(
             query="query",
             s3_bucket="s3_bucket",
             s3_key="s3_key",
             task_id="task_id",
             sql_conn_id="mysql_conn_id",
         )
-        dirty_df = pd.DataFrame({"strings": ["a", "b", "c"], "ints": [1, 2, None]})
-        op._fix_dtypes(df=dirty_df)
+        dirty_df = pd.DataFrame({"strings": ["a", "b", None], "ints": [1, 2, None]})
+        op._fix_dtypes(df=dirty_df, file_format="csv")
+        assert dirty_df["strings"].values[2] is None
+        assert dirty_df["ints"].dtype.kind == "i"
+
+    def test_fix_dtypes_parquet(self):

Review Comment:
   You can parameterize it rather than creating a new test
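
As a rough illustration of this suggestion, the sketch below runs one test body over several cases instead of duplicating the test. It uses the standard library's `unittest` `subTest` as a stand-in for `parameterized.expand`, and `normalize_null` is a hypothetical helper rather than the operator's `_fix_dtypes`. The expected values mirror the csv and parquet cases discussed in this PR.

```python
import unittest


def normalize_null(value, file_format):
    """Hypothetical stand-in: stringify nulls only for parquet output."""
    if value is None and file_format == "parquet":
        return "None"
    return value


class TestNormalizeNull(unittest.TestCase):
    def test_null_handling(self):
        # Each tuple is (file_format, expected result for a null string).
        cases = [
            ("csv", None),
            ("parquet", "None"),
        ]
        for file_format, expected in cases:
            # subTest reports each case separately if it fails.
            with self.subTest(file_format=file_format):
                self.assertEqual(normalize_null(None, file_format), expected)
```

Adding another format then only requires one more entry in `cases`.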





[GitHub] [airflow] harryplumer commented on a diff in pull request #26676: Fix null strings bug in SqlToS3Operator in non parquet formats

Posted by GitBox <gi...@apache.org>.
harryplumer commented on code in PR #26676:
URL: https://github.com/apache/airflow/pull/26676#discussion_r983718439


##########
tests/providers/amazon/aws/transfers/test_sql_to_s3.py:
##########
@@ -145,16 +145,30 @@ def test_execute_json(self, mock_s3_hook, temp_mock):
                 replace=True,
             )
 
-    def test_fix_dtypes(self):
+    def test_fix_dtypes_csv(self):
         op = SqlToS3Operator(
             query="query",
             s3_bucket="s3_bucket",
             s3_key="s3_key",
             task_id="task_id",
             sql_conn_id="mysql_conn_id",
         )
-        dirty_df = pd.DataFrame({"strings": ["a", "b", "c"], "ints": [1, 2, None]})
-        op._fix_dtypes(df=dirty_df)
+        dirty_df = pd.DataFrame({"strings": ["a", "b", None], "ints": [1, 2, None]})
+        op._fix_dtypes(df=dirty_df, file_format="csv")
+        assert dirty_df["strings"].values[2] is None
+        assert dirty_df["ints"].dtype.kind == "i"
+
+    def test_fix_dtypes_parquet(self):

Review Comment:
   Good callout. I fixed that up, if you want to take another look!





[GitHub] [airflow] eladkal commented on a diff in pull request #26676: Fix null strings bug in SqlToS3Operator in non parquet formats

Posted by GitBox <gi...@apache.org>.
eladkal commented on code in PR #26676:
URL: https://github.com/apache/airflow/pull/26676#discussion_r982322782


##########
tests/providers/amazon/aws/transfers/test_sql_to_s3.py:
##########
@@ -145,16 +145,30 @@ def test_execute_json(self, mock_s3_hook, temp_mock):
                 replace=True,
             )
 
-    def test_fix_dtypes(self):
+    def test_fix_dtypes_csv(self):
         op = SqlToS3Operator(
             query="query",
             s3_bucket="s3_bucket",
             s3_key="s3_key",
             task_id="task_id",
             sql_conn_id="mysql_conn_id",
         )
-        dirty_df = pd.DataFrame({"strings": ["a", "b", "c"], "ints": [1, 2, None]})
-        op._fix_dtypes(df=dirty_df)
+        dirty_df = pd.DataFrame({"strings": ["a", "b", None], "ints": [1, 2, None]})
+        op._fix_dtypes(df=dirty_df, file_format="csv")
+        assert dirty_df["strings"].values[2] is None
+        assert dirty_df["ints"].dtype.kind == "i"
+
+    def test_fix_dtypes_parquet(self):

Review Comment:
   example: https://github.com/apache/airflow/blob/9e06c99f6102d0227c6e7b20b258d628c2bc6d5c/tests/sensors/test_weekday_sensor.py#L56-L78





[GitHub] [airflow] potiuk merged pull request #26676: Fix null strings bug in SqlToS3Operator in non parquet formats

Posted by GitBox <gi...@apache.org>.
potiuk merged PR #26676:
URL: https://github.com/apache/airflow/pull/26676

