Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/11/01 22:55:36 UTC

[GitHub] [airflow] pohek321 opened a new issue #19357: DatabricksHook method get_run_state returns error

pohek321 opened a new issue #19357:
URL: https://github.com/apache/airflow/issues/19357


   ### Apache Airflow Provider(s)
   
   databricks
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-databricks==1!2.0.2
   
   
   ### Apache Airflow version
   
   2.1.4 (latest released)
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Deployment
   
   Docker-Compose
   
   ### Deployment details
   
   Running local deployment using Astronomer CLI
   
   ### What happened
   
   When calling the `get_run_state` method from the `DatabricksHook`, I get the following error:
   
   `TypeError: Object of type RunState is not JSON serializable`
   
   I think this is due to [the method returning a RunState custom class](https://github.com/apache/airflow/blob/main/airflow/providers/databricks/hooks/databricks.py#L275) as opposed to a `str` like the rest of the methods in the Databricks hook (e.g. `get_job_id`, `get_run_page_url`, etc.)
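The failure can be reproduced outside Airflow with the standard `json` module. The sketch below uses a stand-in `RunState` class (an assumption that mirrors only the three constructor arguments, not the provider's actual class) to show that a plain object cannot be JSON-encoded while its string attributes can.

```python
import json


class RunState:
    """Stand-in for the provider's RunState; hypothetical, for illustration only."""

    def __init__(self, life_cycle_state, result_state, state_message):
        self.life_cycle_state = life_cycle_state
        self.result_state = result_state
        self.state_message = state_message


def try_serialize(value):
    """Return the JSON text for value, or the TypeError message if encoding fails."""
    try:
        return json.dumps(value)
    except TypeError as exc:
        return f"TypeError: {exc}"


state = RunState("TERMINATED", "SUCCESS", "")

# A plain object has no JSON representation, so encoding it fails.
print(try_serialize(state))
# A plain string attribute encodes without any problem.
print(try_serialize(state.life_cycle_state))
```

The first call reports the same `TypeError` as in the task log, which suggests the problem lies in serializing the return value rather than in the API call itself.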
   
   ### What you expected to happen
   
   I expected the `get_run_state` method to simply return the `result_state` or `state_message` [variables](https://github.com/apache/airflow/blob/main/airflow/providers/databricks/hooks/databricks.py#L287-L288) instead of the `RunState` class.
   
   ### How to reproduce
   
   Create a dag that references a databricks deployment and use this task to see the error:
   
   ```
       from airflow.operators.python import PythonOperator
       from airflow.providers.databricks.hooks.databricks import DatabricksHook

       run_id = "<insert run id from databricks ui here>"

       def get_run_state(run_id: str):
           # The returned RunState is pushed to XCom, which triggers the error
           hook = DatabricksHook()
           return hook.get_run_state(run_id=run_id)

       python_get_run_state = PythonOperator(
           task_id="python_get_run_state",
           python_callable=get_run_state,
           op_kwargs={
               "run_id": str(run_id)
           }
       )
   ```
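One possible workaround on the task side, sketched here under assumptions, is to convert the returned object into a plain dict before returning it from the callable. The attribute names `life_cycle_state`, `result_state`, and `state_message` are assumed from the `RunState` constructor arguments quoted in this thread, and `run_state_as_dict` is a hypothetical helper, not part of the provider. A fake hook stands in for `DatabricksHook` so the sketch runs without Airflow.

```python
import json
from types import SimpleNamespace


def run_state_as_dict(hook, run_id):
    """Fetch the run state and return only JSON-friendly values."""
    state = hook.get_run_state(run_id)
    return {
        "life_cycle_state": state.life_cycle_state,
        "result_state": state.result_state,
        "state_message": state.state_message,
    }


# Fake hook and state so the sketch runs without Airflow or a Databricks workspace.
fake_state = SimpleNamespace(
    life_cycle_state="RUNNING", result_state=None, state_message="In run"
)
fake_hook = SimpleNamespace(get_run_state=lambda run_id: fake_state)

result = run_state_as_dict(fake_hook, "42")
# The dict serializes cleanly, so it could be returned from a task callable.
print(json.dumps(result))
```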
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] alexott commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
alexott commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-969966007


   Yes, that's why I'm asking how you're planning to use it. It looks like it's just for reporting. I'll look into how best to implement it, because other people could be interested in just the `life_cycle_state` part, for example to decide whether a pipeline is in the `FAILED` state.





[GitHub] [airflow] alexott commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
alexott commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-969323649


   State encodes multiple things: it could be running, failed, or successful. But, for example, a failed run may have different reasons for failure. Are you interested in just the state (running/failed/successful) or in more detailed information?





[GitHub] [airflow] alexott commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
alexott commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-958054979


   I can look only over the weekend





[GitHub] [airflow] boring-cyborg[bot] commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-956811100


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] pohek321 commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
pohek321 commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-969333739


   Yeah, I think anything would be better than it returning an error message. I was thinking just a running/failed/successful with the details appended to it.





[GitHub] [airflow] alexott commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
alexott commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-969352325


   It’s a good suggestion, but maybe we can just have a separate method for it








[GitHub] [airflow] mik-laj commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-956995208


   @alexott @robertsaxby Can you look at it?








[GitHub] [airflow] pohek321 commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
pohek321 commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-969320087


   @alexott sorry for the late response. I'm not sure I understand your question. I'd expect to get the current state of a Databricks job run. In my mind, the task should just return a string that is parsed from the JSON response. That's what the other methods do. Maybe just return the `state_message` variable defined on line 414 instead of `RunState`? Maybe I'm not understanding what the purpose of `RunState` is?








[GitHub] [airflow] alexott edited a comment on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
alexott edited a comment on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-962443619


   IMHO, the fix would be to rework `json.py` to support custom serializations, besides the "standard" ones.  We already have a [custom one for K8S](https://github.com/apache/airflow/pull/11952/files#diff-7a15a4f7b8fea7f7be350eb3809d8abfcca40419d2da36750d2d9453c7fab732L55), but it's a hack from my point of view.  A better solution would be to allow classes to provide their own implementation if necessary, but that would be a bigger task.
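To illustrate the idea of letting classes provide their own serialization, here is a minimal sketch built on the standard library's `json.JSONEncoder`. It is not Airflow's `json.py`: the `DelegatingEncoder` class, the `to_json` method name, and the stand-in `RunState` are all hypothetical and chosen only for illustration.

```python
import json


class DelegatingEncoder(json.JSONEncoder):
    """Hypothetical encoder that asks objects for their own JSON-friendly form."""

    def default(self, o):
        # Use the object's to_json() method when it provides one.
        to_json = getattr(o, "to_json", None)
        if callable(to_json):
            return to_json()
        # Otherwise fall back to the standard behaviour, which raises TypeError.
        return super().default(o)


class RunState:
    """Stand-in class that supplies its own serialization (illustrative only)."""

    def __init__(self, life_cycle_state, result_state, state_message):
        self.life_cycle_state = life_cycle_state
        self.result_state = result_state
        self.state_message = state_message

    def to_json(self):
        return {
            "life_cycle_state": self.life_cycle_state,
            "result_state": self.result_state,
            "state_message": self.state_message,
        }


encoded = json.dumps(
    RunState("TERMINATED", "FAILED", "Task failed"), cls=DelegatingEncoder
)
print(encoded)
```

With this shape, the encoder needs no knowledge of individual classes; each class opts in by defining the method.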








[GitHub] [airflow] potiuk closed issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #19357:
URL: https://github.com/apache/airflow/issues/19357


   











[GitHub] [airflow] pohek321 commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
pohek321 commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-969351273


   @alexott I was able to resolve the issue by adding this to my Airflow environment variables:
   
   `AIRFLOW__CORE__ENABLE_XCOM_PICKLING=TRUE`
   
   However, this isn't very intuitive for users calling the method from this hook. May be worthwhile to make a comment like the following:
   
   `Any Airflow tasks that call the get_run_state method will result in failure unless you have enabled xcom pickling. This can be done using the following environment variable: AIRFLOW__CORE__ENABLE_XCOM_PICKLING=TRUE`
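Airflow reads configuration overrides from environment variables named `AIRFLOW__{SECTION}__{KEY}`, with double underscores around the section name. The small helper below is a hypothetical illustration of that naming pattern (it is not an Airflow function) and can be handy for checking the spelling of such a variable.

```python
def airflow_env_var(section, key):
    """Build the environment variable name for an Airflow config option."""
    # Double underscores separate the prefix, the section, and the key.
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# The option discussed here lives in the [core] section.
print(airflow_env_var("core", "enable_xcom_pickling"))
```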





[GitHub] [airflow] alexott commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
alexott commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-962456788


   @pohek321 But what kind of information are you interested in? Because there is a lifecycle state, and there is a resulting state ([doc](https://docs.databricks.com/dev-tools/api/2.0/jobs.html#jobsrunstate)) - which of them do you need in your Python task?
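To make the distinction concrete, the sketch below collapses the two fields into a coarse running/failed/successful label. The state names are taken from the Databricks Jobs API 2.0 documentation as best understood here, and the mapping itself (for example, treating `PENDING` as still running and `SKIPPED` as not successful) is an assumption for illustration, not the hook's behaviour. `summarize_run` is a hypothetical helper.

```python
# Life cycle states after which the run will not change any more (assumed set).
TERMINAL_LIFE_CYCLE_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def summarize_run(life_cycle_state, result_state=None):
    """Collapse life cycle state and result state into one coarse label."""
    # result_state is only meaningful once the run has reached a terminal state.
    if life_cycle_state not in TERMINAL_LIFE_CYCLE_STATES:
        return "running"
    if result_state == "SUCCESS":
        return "successful"
    # FAILED, TIMEDOUT, CANCELED, or a terminal state without a result.
    return "failed"


print(summarize_run("RUNNING"))
print(summarize_run("TERMINATED", "SUCCESS"))
print(summarize_run("TERMINATED", "FAILED"))
```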








[GitHub] [airflow] alexott commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
alexott commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-969339271


   It’s a bit complicated from my point of view: JSON serialization is a bit hardcoded in Airflow, so it could be a big change to make it extensible. On the other hand, we can add a method that returns a string instead of an object.





[GitHub] [airflow] pohek321 commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
pohek321 commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-969336475


   Just realized that the run state is returned in the Airflow logs, but the task is failing no matter what. In the logs, I do see this message:
   
   `ERROR - Could not serialize the XCom value into JSON. If you are using pickle instead of JSON for XCom, then you need to enable pickle support for XCom in your airflow config.`
   
   and the actual failure message
   
   `TypeError: Object of type RunState is not JSON serializable`
   
   Could this be why the task is failing?





[GitHub] [airflow] pohek321 commented on issue #19357: DatabricksHook method get_run_state returns error

Posted by GitBox <gi...@apache.org>.
pohek321 commented on issue #19357:
URL: https://github.com/apache/airflow/issues/19357#issuecomment-969396487


   @alexott what about adding something like this to the `databricks.py` where the hook lives:
   
       """
       Please note that any Airflow tasks that call the `get_run_state` method will result in failure unless you have enabled
       XCom pickling. This can be done using the following environment variable: AIRFLOW__CORE__ENABLE_XCOM_PICKLING=TRUE
   
       If you do not want to enable xcom pickling then use the `get_run_state_str` method
       """
       def get_run_state(self, run_id: str) -> RunState:
           """
           Retrieves run state of the run.
   
           :param run_id: id of the run
           :return: state of the run
           """
           json = {'run_id': run_id}
           response = self._do_api_call(GET_RUN_ENDPOINT, json)
           state = response['state']
           life_cycle_state = state['life_cycle_state']
           # result_state may not be in the state if not terminal
           result_state = state.get('result_state', None)
           state_message = state['state_message']
           return RunState(life_cycle_state, result_state, state_message)
   
       def get_run_state_str(self, run_id: str) -> str:
           """
           Retrieves run state of the run.
   
           :param run_id: id of the run
           :return: state of the run
           """
           json = {'run_id': run_id}
           response = self._do_api_call(GET_RUN_ENDPOINT, json)
           state = response['state']
           life_cycle_state = state['life_cycle_state']
           state_message = state['state_message']
           result = life_cycle_state + ' - ' + state_message
           return result
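The string-building part of this proposal can be tried without a Databricks connection. The sketch below is a hypothetical standalone helper, `format_run_state`, that works on a response dict shaped like the one used in the code above; including `result_state` when present and tolerating a missing `state_message` are added assumptions, not part of the proposal.

```python
def format_run_state(response):
    """Build a one-line summary from a runs/get style response dict."""
    state = response["state"]
    parts = [state["life_cycle_state"]]
    # result_state is absent until the run reaches a terminal state.
    result_state = state.get("result_state")
    if result_state:
        parts.append(result_state)
    summary = "/".join(parts)
    message = state.get("state_message", "")
    # Only append the separator when there is a message to show.
    return f"{summary} - {message}" if message else summary


finished = {"state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS", "state_message": ""}}
in_progress = {"state": {"life_cycle_state": "RUNNING", "state_message": "In run"}}

print(format_run_state(finished))
print(format_run_state(in_progress))
```

Because the result is a plain `str`, it can be returned from a task callable without enabling XCom pickling.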

