
Posted to dev@zeppelin.apache.org by GitBox <gi...@apache.org> on 2022/07/23 04:37:00 UTC

[GitHub] [zeppelin] zkytech opened a new pull request, #4423: [ZEPPELIN-5781]Add cross datasource query support to Spark SQL interpreter

zkytech opened a new pull request, #4423:
URL: https://github.com/apache/zeppelin/pull/4423

   ### What is this PR for?
   Add **cross datasource query** support to the Spark SQL interpreter. Cross datasource queries are currently supported for these datasources:
   
   - hive
   - jdbc
   - mongodb
   
   #### How to Use
   1. Declare a cross datasource table in the format `interpreterName.databaseName.tableName`.
   2. `interpreterName` must exist in the Zeppelin interpreter configuration.
   3. For a JDBC datasource, the JDBC driver jars must be included in the dependencies.
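   
   As an illustration of the naming rule above, here is a minimal sketch of splitting a three-part identifier into its components. The class `CrossSourceTableName` and its `parse` method are hypothetical names chosen for this example and are not taken from the PR. The sketch assumes plain dot-separated names without quoting or escaping.
   
   ```java
   import java.util.Optional;
   
   /** Hypothetical helper: splits "interpreterName.databaseName.tableName". */
   final class CrossSourceTableName {
     final String interpreterName;
     final String databaseName;
     final String tableName;
   
     private CrossSourceTableName(String interpreterName, String databaseName, String tableName) {
       this.interpreterName = interpreterName;
       this.databaseName = databaseName;
       this.tableName = tableName;
     }
   
     /** Returns empty unless the identifier has exactly three non-empty dot-separated parts. */
     static Optional<CrossSourceTableName> parse(String identifier) {
       // limit -1 keeps trailing empty parts, so "a.b." is rejected below
       String[] parts = identifier.trim().split("\\.", -1);
       if (parts.length != 3) {
         return Optional.empty();
       }
       for (String part : parts) {
         if (part.isEmpty()) {
           return Optional.empty();
         }
       }
       return Optional.of(new CrossSourceTableName(parts[0], parts[1], parts[2]));
     }
   
     public static void main(String[] args) {
       CrossSourceTableName name = parse("mysql.another_userinfo_db.region_info").get();
       System.out.println(name.interpreterName + " / " + name.databaseName + " / " + name.tableName);
     }
   }
   ```
   
   In this sketch a two-part name such as `hive_db.user_info` yields an empty result, so it could be left to Spark's normal table resolution.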
   
   ### What type of PR is it?
   Feature
   
   
   ### Need Help
   #### 1. Is there a better way to load all interpreter settings?
   
   This is currently implemented by reading the interpreter settings list inside the zengine module and passing this list to the Spark SQL interpreter. So this pull request touches 2 modules:
   
   1. `zeppelin-zengine`: read and pass all interpreter settings to Spark SQL interpreter
   2. `spark-interpreter`: add Spark SQL cross datasource query
   
   
   When Spark is launched in `local` or `yarn-client` mode, it is easy to load the interpreter settings list inside `spark-interpreter`, and no change to `zeppelin-zengine` is needed. But when the Spark interpreter is launched in `yarn-cluster` mode, `interpreter.json` does not exist on the yarn-cluster driver node, so the interpreter settings cannot be read there. So I changed `zeppelin-zengine` to read all interpreter settings and pass them to the Spark SQL interpreter, which works in `yarn-cluster` mode.
   
   I do not think changing zengine is a good idea. Is there a better way to get all interpreter settings in `yarn-cluster` mode without changing `zengine`?
   
   #### 2. How to distinguish between user and role in `option.owners` field of interpreter setting?
   
   I cannot distinguish users from roles inside the `option.owners` field of an interpreter setting. The datasource authorization check is implemented with this code:
   ```java
   HashSet<String> usersAndRoles = new HashSet<>(authenticationInfo.getUsersAndRoles());
   HashSet<String> owners = new HashSet<>(iSetting.option.owners);
   // An empty owners list means all users can access the interpreter
   if (!owners.isEmpty()) {
     // Keep only the owners that match one of the caller's users or roles
     owners.retainAll(usersAndRoles);
     if (owners.isEmpty()) {
       // no user or role matched
       throw new InvalidCredentialsException(String.format("user %s does not have privilege to access interpreter %s", authenticationInfo.getUser(), interpreterId));
     }
   }
   ```
   
   If there is any security concern, how can I make a better authorization check?
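   
   For reference, the rule the check is meant to enforce (an empty owners list admits everyone, otherwise at least one of the caller's users or roles must be an owner) can be sketched as a standalone predicate. `OwnerCheck` and `canAccess` are hypothetical names for this example and are not part of Zeppelin.
   
   ```java
   import java.util.Collections;
   import java.util.Set;
   
   /** Hypothetical standalone version of the owner check. */
   final class OwnerCheck {
   
     /**
      * Returns true when owners is empty (open to everyone) or when owners
      * shares at least one entry with the caller's users and roles.
      */
     static boolean canAccess(Set<String> owners, Set<String> usersAndRoles) {
       return owners.isEmpty() || !Collections.disjoint(owners, usersAndRoles);
     }
   
     public static void main(String[] args) {
       // The caller holds the "admin" role, which is listed as an owner
       System.out.println(canAccess(Set.of("admin"), Set.of("alice", "admin")));
       // The caller matches none of the owners
       System.out.println(canAccess(Set.of("admin"), Set.of("bob")));
     }
   }
   ```
   
   This sketch does not answer the question above: because users and roles share one set, a user whose name equals a role listed in the owners would still pass.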
   
   ### What is the Jira issue?
   [ZEPPELIN-5781]
   
   ### How should this be tested?
   1. Make sure the Spark SQL interpreter (`%sql`) works.
   2. Configure a jdbc / mongodb interpreter with the name `interpreter-nameX`.
   3. Test a jdbc / mongodb query in `%sql`:
   ```sql
   %sql
   select * from interpreter-nameX.databaseName.tableName;
   ```
   
   ### Screenshots (if appropriate)
   
   ![image](https://user-images.githubusercontent.com/30063898/180590511-231d7a69-1be4-4157-9cc9-13dfc04d4654.png)
   
   
   ### Questions:
   * Do the license files need an update? no
   * Are there breaking changes for older versions? yes
   * Does this need documentation? yes
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@zeppelin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [zeppelin] zkytech commented on pull request #4423: [ZEPPELIN-5781]Add cross datasource query support to Spark SQL interpreter

Posted by GitBox <gi...@apache.org>.

zkytech commented on PR #4423:
URL: https://github.com/apache/zeppelin/pull/4423#issuecomment-1211460515

   > @zkytech Do you have any plan for the next step?
   
   I have not found a suitable way to get the interpreter settings or to make access control work as expected. You can close this PR. If I make any progress on this, I will create a new PR.



[GitHub] [zeppelin] zjffdu commented on pull request #4423: [ZEPPELIN-5781]Add cross datasource query support to Spark SQL interpreter

Posted by GitBox <gi...@apache.org>.

zjffdu commented on PR #4423:
URL: https://github.com/apache/zeppelin/pull/4423#issuecomment-1193134322

   I think it is fine to only support Spark 3.x for this feature. Spark 3.x has been out for more than 2 years.



[GitHub] [zeppelin] zkytech commented on pull request #4423: [ZEPPELIN-5781]Add cross datasource query support to Spark SQL interpreter

Posted by GitBox <gi...@apache.org>.

zkytech commented on PR #4423:
URL: https://github.com/apache/zeppelin/pull/4423#issuecomment-1274275116

   I have not found a suitable way to get the interpreter settings or to make access control work as expected. You can close this PR. If I make any progress on this, I will create a new PR.
   
   On August 10, 2022, at 9:53 AM, Jeff Zhang <***@***.***> wrote:
   
   > @zkytech Do you have any plan for the next step?
   
   
   



[GitHub] [zeppelin] jongyoul commented on pull request #4423: [ZEPPELIN-5781]Add cross datasource query support to Spark SQL interpreter

Posted by GitBox <gi...@apache.org>.

jongyoul commented on PR #4423:
URL: https://github.com/apache/zeppelin/pull/4423#issuecomment-1193072604

   Hello, thank you for the contribution. By the way, I have a question. IIUC, you can already use several data sources with Spark. Could you please explain the benefit of this approach? It looks like it changes some code, and I personally believe it could make the code hard to maintain.



[GitHub] [zeppelin] zjffdu closed pull request #4423: [ZEPPELIN-5781]Add cross datasource query support to Spark SQL interpreter

Posted by GitBox <gi...@apache.org>.

zjffdu closed pull request #4423: [ZEPPELIN-5781]Add cross datasource query support to Spark SQL interpreter
URL: https://github.com/apache/zeppelin/pull/4423



[GitHub] [zeppelin] zjffdu commented on pull request #4423: [ZEPPELIN-5781]Add cross datasource query support to Spark SQL interpreter

Posted by GitBox <gi...@apache.org>.

zjffdu commented on PR #4423:
URL: https://github.com/apache/zeppelin/pull/4423#issuecomment-1210060899

   @zkytech Do you have any plan for the next step?



[GitHub] [zeppelin] zkytech commented on pull request #4423: [ZEPPELIN-5781]Add cross datasource query support to Spark SQL interpreter

Posted by GitBox <gi...@apache.org>.

zkytech commented on PR #4423:
URL: https://github.com/apache/zeppelin/pull/4423#issuecomment-1193126351

   > Thanks @zkytech for the contribution, I agree that this requirement makes sense, but I also have concerns about the implementation. IIUC, using the Spark catalog would be a much neater solution.
   
   Thanks for your suggestion. I have looked into Spark's multiple catalog support. It's a perfect solution for Spark 3.x, but it does not work with Spark 2.x.
   I will try to make an implementation with catalogs for Spark 3.x.
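   
   To make the catalog idea concrete, here is a minimal sketch of the Spark configuration entries that would register a JDBC-backed catalog. It assumes Spark's built-in `JDBCTableCatalog` (available, as far as I know, in Spark 3.1 and later) and its `url` and `driver` options. The helper class `JdbcCatalogConf`, the catalog name `mysql`, and the host `db-host` are hypothetical and only for illustration. This is not what the PR implements.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   
   /** Hypothetical helper that builds Spark conf entries for one JDBC catalog. */
   final class JdbcCatalogConf {
     // Assumption: Spark's built-in JDBC catalog implementation
     static final String JDBC_CATALOG_CLASS =
         "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog";
   
     /** Returns the spark.sql.catalog.<name> entries in insertion order. */
     static Map<String, String> forCatalog(String catalogName, String jdbcUrl, String driverClass) {
       String prefix = "spark.sql.catalog." + catalogName;
       Map<String, String> conf = new LinkedHashMap<>();
       conf.put(prefix, JDBC_CATALOG_CLASS);
       conf.put(prefix + ".url", jdbcUrl);
       conf.put(prefix + ".driver", driverClass);
       return conf;
     }
   
     public static void main(String[] args) {
       // Hypothetical connection details; replace with real ones
       Map<String, String> conf =
           forCatalog("mysql", "jdbc:mysql://db-host:3306", "com.mysql.cj.jdbc.Driver");
       conf.forEach((key, value) -> System.out.println(key + "=" + value));
     }
   }
   ```
   
   With entries like these set on the Spark interpreter, a table could be addressed as `mysql.databaseName.tableName`, which matches the three-part naming proposed in this PR.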
   



[GitHub] [zeppelin] zkytech commented on pull request #4423: [ZEPPELIN-5781]Add cross datasource query support to Spark SQL interpreter

Posted by GitBox <gi...@apache.org>.

zkytech commented on PR #4423:
URL: https://github.com/apache/zeppelin/pull/4423#issuecomment-1193082009

   To make it easier to join across different databases. For example, consider a join between `mongodb`, `hive`, and `mysql`, assuming that:
   
   - hive has a `user_info` table with fields: `user_name`, `occupation_code`, `region_code`
   - mongo has an `occupation_info` table with fields: `occupation_code`, `occupation_name`
   - mysql has a `region_info` table with fields: `region_code`, `region_name`
   
   I want to get data with these fields: `user_name`, `occupation_name`, `region_name`.
   Without cross datasource query, I need to write Spark Scala code to load the mongo/mysql tables into Spark DataFrames, and it is tedious to do this every time.
   
   With cross datasource query, I can easily pull these fields from mongodb and mysql with a single SQL query.
   
   ```sql
   
   select
      t1.user_name, t2.occupation_name, t3.region_name
   from
      hive_db.user_info as t1
   left join
      mongodb.user_db.occupation_info as t2
   on
      t1.occupation_code = t2.occupation_code
   left join
      mysql.another_userinfo_db.region_info as t3
   on
     t1.region_code = t3.region_code
   ``` 
   This is a lightweight replacement for Presto that is easier to use for Zeppelin users.
   
   
   



[GitHub] [zeppelin] zjffdu commented on pull request #4423: [ZEPPELIN-5781]Add cross datasource query support to Spark SQL interpreter

Posted by GitBox <gi...@apache.org>.

zjffdu commented on PR #4423:
URL: https://github.com/apache/zeppelin/pull/4423#issuecomment-1193083729

   Thanks @zkytech for the contribution, I agree that this requirement makes sense, but I also have concerns about the implementation. IIUC, using the Spark catalog would be a much neater solution.

