Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/05/20 20:39:14 UTC

[GitHub] [airflow] snazzyfox opened a new issue #8933: pyhive is installed without Hive dependencies

snazzyfox opened a new issue #8933:
URL: https://github.com/apache/airflow/issues/8933


   **Apache Airflow version**: 1.10.10
   
   (appears to also affect master)
   
   **What happened**:
   
   When Airflow is installed with Hive support via `apache-airflow[hive]`, running a query with `HiveServer2Hook` raises the following exception:
   ```
   ...
     File "/usr/local/lib/python3.7/site-packages/airflow/hooks/hive_hooks.py", line 828, in get_conn
       database=schema or db.schema or 'default')
     File "/usr/local/lib/python3.7/site-packages/pyhive/hive.py", line 94, in connect
       return Connection(*args, **kwargs)
     File "/usr/local/lib/python3.7/site-packages/pyhive/hive.py", line 152, in __init__
       import sasl
   ModuleNotFoundError: No module named 'sasl'
   ```
   
   **What you expected to happen**:
   
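   The query should run without an import error: installing `apache-airflow[hive]` should bring in everything `HiveServer2Hook` needs.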
   
   **How to reproduce it**:
   
   Any minimal DAG that uses `HiveServer2Hook` triggers the error. A working Hive cluster is not required, because the import fails before any connection is attempted.
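
The missing dependency can also be confirmed without Airflow or a cluster. This is a minimal sketch, not part of the original report: `sasl` comes from the traceback above, while `thrift` and `thrift_sasl` are assumed to be the other modules the Hive connection path needs, and the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# `sasl` is the module named in the traceback; the remaining names are
# assumptions about what pyhive's Hive connection path imports.
HIVE_MODULES = ("pyhive", "sasl", "thrift", "thrift_sasl")


def missing_modules(names=HIVE_MODULES):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

Running this in an environment created with only `apache-airflow[hive]` should list `sasl` among the missing modules if the report holds.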
   
   **Probable Reason**:
   
   For `pyhive` to work with Hive, it should be installed as `pyhive[hive]`. The hive extra brings in the `sasl` package.
   
   This is not caught in testing, because tests run with all the dependencies installed, and the packages required by [`pyhive[hive]`](https://github.com/dropbox/PyHive/blob/master/setup.py#L47) happen to also be required by [`kerberos`](https://github.com/apache/airflow/blob/master/setup.py#L313).
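
To see which packages an extra pulls in, the installed package metadata can be inspected. The sketch below is an illustration under assumptions: it relies on `importlib.metadata` (Python 3.8+, so newer than the 3.7 interpreter in the traceback), the helper name `requirements_for_extra` is hypothetical, and it only does simple string matching on the `extra == "..."` marker rather than full marker evaluation.

```python
import re
from importlib import metadata  # standard library from Python 3.8


def requirements_for_extra(requires, extra):
    """Return the requirement strings that are gated on the given extra."""
    pattern = re.compile(r"""extra\s*==\s*['"]{}['"]""".format(re.escape(extra)))
    selected = []
    for req in requires or []:
        # Split "name>=version; marker" into its requirement and marker parts.
        name, _, marker = req.partition(";")
        if pattern.search(marker):
            selected.append(name.strip())
    return selected


if __name__ == "__main__":
    try:
        declared = metadata.requires("pyhive")
    except metadata.PackageNotFoundError:
        declared = None  # pyhive is not installed in this environment
    print(requirements_for_extra(declared, "hive"))
```

Comparing this output with what is actually installed shows which packages only arrive through the extra.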
   
   I know this will be a much bigger conversation, but maybe it's worth it to consider testing operators _only_ with the dependencies they're supposed to rely on?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk closed issue #8933: pyhive is installed without Hive dependencies

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #8933:
URL: https://github.com/apache/airflow/issues/8933


   





[GitHub] [airflow] snazzyfox commented on issue #8933: pyhive is installed without Hive dependencies

Posted by GitBox <gi...@apache.org>.
snazzyfox commented on issue #8933:
URL: https://github.com/apache/airflow/issues/8933#issuecomment-632345025


   Yep, but `pyhive` is still needed for Hive. Since Airflow installs it without Hive support, it isn't actually usable unless the packages for Hive support are installed separately.





[GitHub] [airflow] eladkal commented on issue #8933: pyhive is installed without Hive dependencies

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #8933:
URL: https://github.com/apache/airflow/issues/8933#issuecomment-632059927


   Airflow doesn't need `pyhive` for Presto.
   https://github.com/apache/airflow/pull/6822 removed that dependency. Presto now uses `prestodb`.





[GitHub] [airflow] snazzyfox commented on issue #8933: pyhive is installed without Hive dependencies

Posted by GitBox <gi...@apache.org>.
snazzyfox commented on issue #8933:
URL: https://github.com/apache/airflow/issues/8933#issuecomment-636354348


   Glad to hear that system tests will be coming! We should definitely still add it to the dependencies, because that would actually fix the problem. System tests would only help catch the issue, which we already know about in this case :)





[GitHub] [airflow] ashb commented on issue #8933: pyhive is installed without Hive dependencies

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #8933:
URL: https://github.com/apache/airflow/issues/8933#issuecomment-631762509


   Wait, pyhive has an extra called hive....? :exploding_head: 
   
   > I know this will be a much bigger conversation, but maybe it's worth it to consider testing operators only with the dependencies they're supposed to rely on?
   
   That would be amazing, yes, but would make our test suite take days to run to completion.
   
   Still, might be something to think about.





[GitHub] [airflow] snazzyfox commented on issue #8933: pyhive is installed without Hive dependencies

Posted by GitBox <gi...@apache.org>.
snazzyfox commented on issue #8933:
URL: https://github.com/apache/airflow/issues/8933#issuecomment-631786295


   Yep, pyhive supports both hive and presto, so depending on the user's needs either `pyhive[hive]` or `pyhive[presto]` is installed.
   
   Since Airflow installs plain `pyhive`, it doesn't include the underlying libraries for either. Maybe just changing the dependency to `pyhive[hive]` would work for now?
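
The suggested change could look roughly like the fragment below. It is a hypothetical excerpt written in the style of a `setup.py` extras definition, not Airflow's actual file: the variable names are illustrative and version pins are deliberately omitted.

```python
# Hypothetical extras definition; names and the lack of version pins
# are illustrative assumptions, not Airflow's real setup.py.
hive = [
    "pyhive[hive]",  # request pyhive's own "hive" extra so `sasl` is installed
]

EXTRAS_REQUIRE = {
    "hive": hive,
}
```

With this shape, `pip install "apache-airflow[hive]"` would resolve pyhive together with the packages its `hive` extra declares.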





[GitHub] [airflow] potiuk commented on issue #8933: pyhive is installed without Hive dependencies

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #8933:
URL: https://github.com/apache/airflow/issues/8933#issuecomment-636322845


   > Wait, pyhive has an extra called hive....? 
   > 
   > That would be amazing, yes, but would make our test suite take days to run to completion.
   > 
   > Still, might be something to think about.
   
   This is already planned as part of AIP-4 (system tests), likely combined with AIP-8 (splitting Airflow 2.0 into separate providers, a follow-up to AIP-21) and AIP-26 (production image).
   
   That's one of the reasons why we refactored and moved everything into separate backport providers with clear dependencies for each of them. This will allow us to turn them into "airflow-providers". Once we set up automated system tests, we will be able to run them separately for each provider: for Hive, for example, by installing only the Hive provider (together with its dependencies) and running the tests in a production image that contains only the dependencies Hive needs.
   
   This is a mid-term goal I want to achieve (still this year, if possible).
   
   
   BTW, @snazzyfox @eladkal: do you think we need to add pyhive to the current set of dependencies for hive? Should we add it?

