Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/07/10 05:13:54 UTC

[GitHub] [airflow] SasanAhmadi opened a new issue #16919: error when using mysql_to_s3 (TypeError: cannot safely cast non-equivalent object to int64)

SasanAhmadi opened a new issue #16919:
URL: https://github.com/apache/airflow/issues/16919


   
   
   **Apache Airflow version**:
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: aws
   
   **What happened**:
   
   
   **What you expected to happen**: When reading data with `mysql_to_s3`, the following exception is raised:
   ```
   [2021-07-10 03:24:04,051] {{mysql_to_s3.py:120}} INFO - Data from MySQL obtained
   [2021-07-10 03:24:04,137] {{taskinstance.py:1482}} ERROR - Task failed with exception
   Traceback (most recent call last):
     File "/usr/local/lib64/python3.7/site-packages/pandas/core/arrays/integer.py", line 155, in safe_cast
       return values.astype(dtype, casting="safe", copy=copy)
   TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
       self._prepare_and_execute_task_with_callbacks(context, task)
     File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
       result = self._execute_task(context, task_copy)
     File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
       result = task_copy.execute(context=context)
     File "/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/transfers/mysql_to_s3.py", line 122, in execute
       self._fix_int_dtypes(data_df)
     File "/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/transfers/mysql_to_s3.py", line 114, in _fix_int_dtypes
       df[col] = df[col].astype(pd.Int64Dtype())
     File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 5877, in astype
       new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
     File "/usr/local/lib64/python3.7/site-packages/pandas/core/internals/managers.py", line 631, in astype
       return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
     File "/usr/local/lib64/python3.7/site-packages/pandas/core/internals/managers.py", line 427, in apply
       applied = getattr(b, f)(**kwargs)
     File "/usr/local/lib64/python3.7/site-packages/pandas/core/internals/blocks.py", line 673, in astype
       values = astype_nansafe(vals1d, dtype, copy=True)
     File "/usr/local/lib64/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1019, in astype_nansafe
       return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
     File "/usr/local/lib64/python3.7/site-packages/pandas/core/arrays/integer.py", line 363, in _from_sequence
       return integer_array(scalars, dtype=dtype, copy=copy)
     File "/usr/local/lib64/python3.7/site-packages/pandas/core/arrays/integer.py", line 143, in integer_array
       values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
     File "/usr/local/lib64/python3.7/site-packages/pandas/core/arrays/integer.py", line 258, in coerce_to_array
       values = safe_cast(values, dtype, copy=False)
     File "/usr/local/lib64/python3.7/site-packages/pandas/core/arrays/integer.py", line 164, in safe_cast
       ) from err
   TypeError: cannot safely cast non-equivalent object to int64
   ```
   
   **How to reproduce it**:
   Create a table like the following in a MySQL database, then use `mysql_to_s3` to load the data from this table into S3:
   
   ```
   create table test_data(id int, some_decimal decimal(10, 2));
   
   insert into test_data (id, some_decimal) values (1, 99999999.99), (2, null);
   ```
   
   **Anything else we need to know**:
   
   The following code is the problem: it looks for `float` in the column's dtype name and then casts the column to ```pd.Int64Dtype()``` instead of ```pd.Float64Dtype()```. Because the array can contain genuine floating-point values, the cast fails with the "cannot safely cast" exception shown above.
   
   ```
   def _fix_int_dtypes(self, df: pd.DataFrame) -> None:
       """Mutate DataFrame to set dtypes for int columns containing NaN values."""
       for col in df:
           if "float" in df[col].dtype.name and df[col].hasnans:
               # inspect values to determine if dtype of non-null values is int or float
               notna_series = df[col].dropna().values
               if np.isclose(notna_series, notna_series.astype(int)).all():
                   # set to dtype that retains integers and supports NaNs
                   df[col] = np.where(df[col].isnull(), None, df[col])
                   df[col] = df[col].astype(pd.Int64Dtype())
   
   Moreover, I don't see why ```isclose``` is used to decide whether the values can be cast to integer when casting to ```Float64Dtype``` is an option.
   ```isclose``` misrepresents the data here because it is an approximate comparison, not an exact equality check, so it cannot reliably tell whether a column holds floats or ints. That approximate check is the root cause of the exception above.
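
As a minimal sketch of why the approximate check passes for the reproduction value (the variable names below are illustrative and do not come from the provider code): with its default tolerances, `np.isclose` accepts any difference up to `atol + rtol * |b|`, which is roughly 1000 for a value of this magnitude, so 99999999.99 is treated as "close" to 99999999.

```python
import numpy as np

# Value from the reproduction table: decimal(10, 2) read back as a float.
values = np.array([99999999.99])
as_int = values.astype(int)  # truncates to 99999999

# np.isclose tests |a - b| <= atol + rtol * |b|, with rtol=1e-05 and atol=1e-08 by default.
tolerance = 1e-08 + 1e-05 * abs(as_int[0])  # roughly 1000 at this magnitude

looks_like_int = bool(np.isclose(values, as_int).all())  # approximate check passes
is_exactly_int = bool(np.equal(values, as_int).all())    # exact check fails
```

The approximate check therefore sends the column down the `Int64Dtype` branch even though its values are not integers, which is where the safe cast fails.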


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal commented on issue #16919: error when using mysql_to_s3 (TypeError: cannot safely cast non-equivalent object to int64)

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #16919:
URL: https://github.com/apache/airflow/issues/16919#issuecomment-928199673


   @SasanAhmadi are you working on this issue?





[GitHub] [airflow] potiuk closed issue #16919: error when using mysql_to_s3 (TypeError: cannot safely cast non-equivalent object to int64)

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #16919:
URL: https://github.com/apache/airflow/issues/16919


   





[GitHub] [airflow] boring-cyborg[bot] commented on issue #16919: error when using mysql_to_s3 (TypeError: cannot safely cast non-equivalent object to int64)

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #16919:
URL: https://github.com/apache/airflow/issues/16919#issuecomment-877568946


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] potiuk commented on issue #16919: error when using mysql_to_s3 (TypeError: cannot safely cast non-equivalent object to int64)

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #16919:
URL: https://github.com/apache/airflow/issues/16919#issuecomment-877618475


   I assigned it to you @SasanAhmadi 





[GitHub] [airflow] SasanAhmadi commented on issue #16919: error when using mysql_to_s3 (TypeError: cannot safely cast non-equivalent object to int64)

Posted by GitBox <gi...@apache.org>.
SasanAhmadi commented on issue #16919:
URL: https://github.com/apache/airflow/issues/16919#issuecomment-937901875


   Yes, I am working on it





[GitHub] [airflow] eladkal commented on issue #16919: error when using mysql_to_s3 (TypeError: cannot safely cast non-equivalent object to int64)

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #16919:
URL: https://github.com/apache/airflow/issues/16919#issuecomment-1006065078


   hi @SasanAhmadi any update on this issue?





[GitHub] [airflow] SasanAhmadi commented on issue #16919: error when using mysql_to_s3 (TypeError: cannot safely cast non-equivalent object to int64)

Posted by GitBox <gi...@apache.org>.
SasanAhmadi commented on issue #16919:
URL: https://github.com/apache/airflow/issues/16919#issuecomment-877572458


   I am proposing the following change to the ```_fix_int_dtypes``` method:
   ```
   def _fix_int_dtypes(self, df: pd.DataFrame) -> None:
       """Mutate DataFrame to set dtypes for int columns containing NaN values."""
       for col in df:
           if "float" in df[col].dtype.name and df[col].hasnans:
               # inspect values to determine if dtype of non-null values is int or float
               notna_series = df[col].dropna().values
               if np.equal(notna_series, notna_series.astype(int)).all():
                   # set to dtype that retains integers and supports NaNs
                   df[col] = np.where(df[col].isnull(), None, df[col])
                   df[col] = df[col].astype(pd.Int64Dtype())
               elif np.isclose(notna_series, notna_series.astype(int)).all():
                   # values are near-integers but not exactly integral: keep them as nullable floats
                   df[col] = np.where(df[col].isnull(), None, df[col])
                   df[col] = df[col].astype(pd.Float64Dtype())
   ```
   This way it correctly checks whether the values are integers or floating-point numbers and then casts to the matching type.
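
The exact-equality idea can be tried outside Airflow with a sketch like the one below. This is not the provider's implementation: `fix_int_dtypes` is a hypothetical standalone helper, the DataFrame is made up for illustration, and the sketch leaves non-integral columns as plain `float64` instead of casting them to `Float64Dtype` as the proposal does. It also assumes a pandas version that can cast a float column containing NaN directly to `Int64`.

```python
import numpy as np
import pandas as pd


def fix_int_dtypes(df: pd.DataFrame) -> None:
    """Cast float columns with NaNs to Int64 only when every non-null value is integral."""
    for col in df:
        if "float" in df[col].dtype.name and df[col].hasnans:
            notna = df[col].dropna().values
            # Exact comparison: 99999999.99 != 99999999, so such a column is left alone.
            if np.equal(notna, notna.astype(int)).all():
                df[col] = df[col].astype(pd.Int64Dtype())


# Illustrative frame, not the actual MySQL result set.
df = pd.DataFrame(
    {
        "nullable_id": [1.0, np.nan],
        "some_decimal": [99999999.99, np.nan],
    }
)
fix_int_dtypes(df)
id_dtype = str(df["nullable_id"].dtype)
decimal_dtype = str(df["some_decimal"].dtype)
```

With the exact check, only the column whose non-null values are all whole numbers is converted to the nullable integer dtype, and the decimal column no longer reaches the failing cast.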








[GitHub] [airflow] SasanAhmadi commented on issue #16919: error when using mysql_to_s3 (TypeError: cannot safely cast non-equivalent object to int64)

Posted by GitBox <gi...@apache.org>.
SasanAhmadi commented on issue #16919:
URL: https://github.com/apache/airflow/issues/16919#issuecomment-1028237083


   Hi @eladkal, I've created a PR for review on this issue. 





[GitHub] [airflow] SasanAhmadi commented on issue #16919: error when using mysql_to_s3 (TypeError: cannot safely cast non-equivalent object to int64)

Posted by GitBox <gi...@apache.org>.
SasanAhmadi commented on issue #16919:
URL: https://github.com/apache/airflow/issues/16919#issuecomment-1013424780


   Hi Elad, I just got confirmation from my manager at Skillz to wrap this up within the next week, and I have been given dedicated time at work to finish this contribution. I am sorry it is taking longer than anticipated.








[GitHub] [airflow] SasanAhmadi edited a comment on issue #16919: error when using mysql_to_s3 (TypeError: cannot safely cast non-equivalent object to int64)

Posted by GitBox <gi...@apache.org>.
SasanAhmadi edited a comment on issue #16919:
URL: https://github.com/apache/airflow/issues/16919#issuecomment-1013424780


   Hi Elad, I just got confirmation from my company, Skillz, to wrap this up within the next week, and I have been given dedicated time at work to finish this contribution. I am sorry it is taking longer than anticipated.

