Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/11/23 11:07:56 UTC

[GitHub] [airflow] ibeauvais opened a new issue #12560: Initialization of Dataproc hook in operator constructor

ibeauvais opened a new issue #12560:
URL: https://github.com/apache/airflow/issues/12560


   **Apache Airflow version**: 1.10.10
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`): v1.15.12-gke.20
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: Google Cloud Platform (Composer 1.12.4 )
   - **OS** (e.g. from /etc/os-release):
   - **Kernel** (e.g. `uname -a`):
   - **Install tools**:
   - **Others**:
   
   **What happened**:
   In an environment with many Dataproc (Spark) tasks, we are seeing significant performance issues.
   After investigation, they seem related to the problem below:
   For all Dataproc operators, hooks are initialized in the constructor instead of in the execute method. Hook initialization adds significant overhead because it accesses the Airflow database (get_connection).
   The operator's constructor is executed for each task by the scheduler and the workers, which degrades performance when there are many Dataproc tasks.
   
   A similar problem was already fixed in the past for other GCP operators: #5893
   
   The code that leads to the issue is in dataproc_operator.py; all operators inherit from DataprocOperationBaseOperator:
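   ```
   # Imports as used by the contrib module in Airflow 1.10.x
   from airflow.contrib.hooks.gcp_dataproc_hook import DataProcHook
   from airflow.models import BaseOperator
   from airflow.utils.decorators import apply_defaults


   class DataprocOperationBaseOperator(BaseOperator):
       """The base class for operators that poll on a Dataproc Operation."""
       @apply_defaults
       def __init__(self,
                    project_id,
                    region='global',
                    gcp_conn_id='google_cloud_default',
                    delegate_to=None,
                    *args,
                    **kwargs):
           super(DataprocOperationBaseOperator, self).__init__(*args, **kwargs)
           self.gcp_conn_id = gcp_conn_id
           self.delegate_to = delegate_to
           self.project_id = project_id
           self.region = region
           # Problem: the hook is built in the constructor, so it is created
           # every time the operator is instantiated, not only when it runs.
           self.hook = DataProcHook(
               gcp_conn_id=self.gcp_conn_id,
               delegate_to=self.delegate_to,
               api_version='v1beta2'
           )
   ```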
   
   **What you expected to happen**:
   The Dataproc hook should be initialized in the execute method.
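
A minimal sketch of the expected behaviour, assuming nothing beyond plain Python: `FakeDataprocHook` and `LazyHookOperator` are hypothetical stand-ins (not Airflow classes) that only illustrate deferring hook creation from the constructor to `execute`. The counter shows that building many operators, as happens at DAG parsing time, creates no hooks.

```python
class FakeDataprocHook:
    """Hypothetical stand-in for DataProcHook; counts how often it is built."""

    instances_created = 0

    def __init__(self, gcp_conn_id, delegate_to=None):
        # In Airflow, this is roughly where get_connection would hit the database.
        FakeDataprocHook.instances_created += 1
        self.gcp_conn_id = gcp_conn_id
        self.delegate_to = delegate_to

    def submit(self, job):
        # Placeholder for a real API call.
        return "submitted:" + job


class LazyHookOperator:
    """Hypothetical operator that defers hook creation to execute()."""

    def __init__(self, job, gcp_conn_id="google_cloud_default", delegate_to=None):
        # Store plain configuration only: no hook, no database access.
        self.job = job
        self.gcp_conn_id = gcp_conn_id
        self.delegate_to = delegate_to

    def execute(self, context=None):
        # The hook is created only when the task actually runs.
        hook = FakeDataprocHook(
            gcp_conn_id=self.gcp_conn_id,
            delegate_to=self.delegate_to,
        )
        return hook.submit(self.job)


# Simulate DAG parsing: many operators are constructed, none executed.
operators = [LazyHookOperator(job="spark-%d" % i) for i in range(100)]
hooks_after_parsing = FakeDataprocHook.instances_created

# Simulate a worker running a single task.
result = operators[0].execute()
hooks_after_one_run = FakeDataprocHook.instances_created
```

With this layout the cost of building a hook is paid once per executed task rather than once per constructed operator.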
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on issue #12560: Initialization of Dataproc hook in operator constructor

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #12560:
URL: https://github.com/apache/airflow/issues/12560#issuecomment-732105502


   I would also suggest migrating to 1.10.12 and using the provider packages with the new operators, which are the supported ones and will be required for Airflow 2.0.
   http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow/latest/backport-providers.html?highlight=providers





[GitHub] [airflow] turbaszek commented on issue #12560: Initialization of Dataproc hook in operator constructor

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #12560:
URL: https://github.com/apache/airflow/issues/12560#issuecomment-732104416


   That's already done in new operators. We don't use `DataprocOperationBaseOperator` anymore. See:
   https://github.com/apache/airflow/blob/c133df806247cfde87ba461ed22ee924d8d31fd3/airflow/providers/google/cloud/operators/dataproc.py#L389
   https://github.com/apache/airflow/blob/c133df806247cfde87ba461ed22ee924d8d31fd3/airflow/providers/google/cloud/operators/dataproc.py#L1747
   
   





[GitHub] [airflow] turbaszek closed issue #12560: Initialization of Dataproc hook in operator constructor

Posted by GitBox <gi...@apache.org>.
turbaszek closed issue #12560:
URL: https://github.com/apache/airflow/issues/12560


   





[GitHub] [airflow] ibeauvais commented on issue #12560: Initialization of Dataproc hook in operator constructor

Posted by GitBox <gi...@apache.org>.
ibeauvais commented on issue #12560:
URL: https://github.com/apache/airflow/issues/12560#issuecomment-732760420


   Thanks for this very helpful information.





[GitHub] [airflow] boring-cyborg[bot] commented on issue #12560: Initialization of Dataproc hook in operator constructor

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #12560:
URL: https://github.com/apache/airflow/issues/12560#issuecomment-732091979


   Thanks for opening your first issue here! Be sure to follow the issue template!
   

