Posted to user@spark.apache.org by "second_comet@yahoo.com.INVALID" <se...@yahoo.com.INVALID> on 2022/10/20 08:31:44 UTC

pyspark connect to spark thrift server port

Currently my pyspark code is able to connect to the Hive metastore at port 9083. However, using this approach I can't put in place any security mechanism such as LDAP or SQL authentication control. Is there any way to connect from pyspark to the Spark Thrift Server on port 10000 without exposing the Hive metastore URL to pyspark? I would like to authenticate users before allowing them to execute Spark SQL, and a user should only be allowed to query the databases and tables they have access to.



Thank you,
comet
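
(Illustrative sketch, not from the thread: one way to reach the Spark 
Thrift Server / HiveServer2 port 10000 with a username and password 
checked by the server, instead of attaching the metastore on port 9083 
directly, is a Thrift client such as PyHive. PyHive, the host, the 
credentials, and the table names below are assumptions/placeholders, 
and the server would need to be configured for LDAP authentication, 
e.g. hive.server2.authentication=LDAP.)

    # Sketch only: query the Thrift server on port 10000 with an
    # LDAP-checked username/password instead of pointing pyspark at the
    # metastore on port 9083. Host, credentials, and table are placeholders.
    from pyhive import hive

    conn = hive.Connection(
        host="thrift-server.example.com",   # placeholder
        port=10000,
        username="alice",                   # authenticated by the server
        password="secret",
        auth="LDAP",                        # server must be set up for LDAP
        database="default",
    )
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM some_db.some_table LIMIT 10")
    for row in cursor.fetchall():
        print(row)
    cursor.close()
    conn.close()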

Re: pyspark connect to spark thrift server port

Posted by Artemis User <ar...@dtechspace.com>.
I think there is some confusion here between the metastore and the 
actual Hive database.  Spark (as well as Apache Hive) requires two 
databases for Hive DB operations.  The metastore is used for storing 
metadata only (e.g., schema info), whereas the actual Hive database, 
accessible through the Thrift server, is used by applications.  The 
reason Hive keeps its metadata in a separate server is to support 
distributed database operations.

My previous message referred to how to secure the metastore database, 
not the actual Hive tables.  It looks like you are trying to secure 
access to Hive, not to the metastore (the metastore isn't used by 
general users), and your current configuration wasn't set up with the 
right user access control.  Hive actually supports a role-based access 
model just like other RDBMSs.  You may refer to the Hive admin guide 
for more details 
(https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization). 
You can use beeline, or SQL scripts run via beeline, to set user 
privileges and roles.
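
(Illustrative sketch, not from the thread: the kind of role and 
privilege statements described in the SQL Standard Based Authorization 
guide linked above. The post suggests running them through beeline; 
this sketch sends the same statements from Python over a 
HiveServer2/Thrift connection instead, assuming PyHive and placeholder 
role, user, database, and table names.)

    # Sketch only: define a role, assign it to a user, and grant it
    # SELECT on one table, per the linked guide. Assumes the connecting
    # user is an admin and that SQL Standard Based Authorization is
    # enabled on the server; all names are placeholders.
    from pyhive import hive

    admin = hive.Connection(host="thrift-server.example.com", port=10000,
                            username="admin_user", password="secret",
                            auth="LDAP")
    cur = admin.cursor()
    cur.execute("SET ROLE ADMIN")                   # activate the admin role
    cur.execute("CREATE ROLE analyst")              # define a role
    cur.execute("GRANT analyst TO USER alice")      # assign the role to a user
    cur.execute("USE sales")
    cur.execute("GRANT SELECT ON TABLE orders TO ROLE analyst")  # table privilege
    cur.close()
    admin.close()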

On 10/21/22 1:27 AM, second_comet@yahoo.com.INVALID wrote:
>
> Hello Artemis,
> Understood.  If I gave the Hive metastore URI to anyone to connect 
> using pyspark, port 9083 would be open to anyone without any 
> authentication feature.  The only way pyspark is able to connect to 
> Hive is through port 9083 and not through port 10000.
> On Friday, October 21, 2022 at 04:06:38 AM GMT+8, Artemis User 
> <ar...@dtechspace.com> wrote:
>
>
> By default, Spark uses Apache Derby (running in embedded mode with 
> store content defined in local files) for hosting the Hive metastore.  
> You can externalize the metastore on a JDBC-compliant database (e.g., 
> PostgreSQL) and use the authentication provided by that database.  The 
> JDBC configuration should be defined in a hive-site.xml file in the 
> Spark conf directory.  Please see the metastore admin guide for more 
> details, including an init script for setting up your metastore 
> (https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+3.0+Administration).
>
>
> On 10/20/22 4:31 AM, second_comet@yahoo.com.INVALID wrote:
> Currently my pyspark code is able to connect to the Hive metastore at 
> port 9083.  However, using this approach I can't put in place any 
> security mechanism such as LDAP or SQL authentication control.  Is 
> there any way to connect from pyspark to the Spark Thrift Server on 
> port 10000 without exposing the Hive metastore URL to pyspark?  I 
> would like to authenticate users before allowing them to execute Spark 
> SQL, and a user should only be allowed to query the databases and 
> tables they have access to.
>
>
>
> Thank you,
> comet
>

Re: pyspark connect to spark thrift server port

Posted by "second_comet@yahoo.com.INVALID" <se...@yahoo.com.INVALID>.
 
Hello Artemis,
    Understood.  If I gave the Hive metastore URI to anyone to connect 
using pyspark, port 9083 would be open to anyone without any 
authentication feature.  The only way pyspark is able to connect to 
Hive is through port 9083 and not through port 10000.

On Friday, October 21, 2022 at 04:06:38 AM GMT+8, Artemis User 
<ar...@dtechspace.com> wrote:

> By default, Spark uses Apache Derby (running in embedded mode with 
> store content defined in local files) for hosting the Hive metastore.  
> You can externalize the metastore on a JDBC-compliant database (e.g., 
> PostgreSQL) and use the authentication provided by that database.  The 
> JDBC configuration should be defined in a hive-site.xml file in the 
> Spark conf directory.  Please see the metastore admin guide for more 
> details, including an init script for setting up your metastore 
> (https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+3.0+Administration).
>
> On 10/20/22 4:31 AM, second_comet@yahoo.com.INVALID wrote:
> Currently my pyspark code is able to connect to the Hive metastore at 
> port 9083.  However, using this approach I can't put in place any 
> security mechanism such as LDAP or SQL authentication control.  Is 
> there any way to connect from pyspark to the Spark Thrift Server on 
> port 10000 without exposing the Hive metastore URL to pyspark?  I 
> would like to authenticate users before allowing them to execute Spark 
> SQL, and a user should only be allowed to query the databases and 
> tables they have access to.
>
> Thank you,
> comet

Re: pyspark connect to spark thrift server port

Posted by Artemis User <ar...@dtechspace.com>.
By default, Spark uses Apache Derby (running in embedded mode with store 
content defined in local files) for hosting the Hive metastore.  You can 
externalize the metastore on a JDBC-compliant database (e.g., 
PostgreSQL) and use the authentication provided by that database.  The 
JDBC configuration should be defined in a hive-site.xml file in the 
Spark conf directory.  Please see the metastore admin guide for more 
details, including an init script for setting up your metastore 
(https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+3.0+Administration).
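
(Illustrative sketch, not from the thread: externalizing the metastore 
on PostgreSQL as described above. The standard javax.jdo.option.* 
connection properties would normally go into a hive-site.xml file in 
the Spark conf directory; here the same properties are passed 
programmatically through spark.hadoop.* so the sketch stays in pyspark. 
Host, database name, and credentials are placeholders, and the 
PostgreSQL JDBC driver has to be on the classpath.)

    # Sketch only: point the Hive metastore at an external PostgreSQL
    # database instead of the default embedded Derby, with credentials
    # enforced by that database. All connection details are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("external-metastore-sketch")
        .enableHiveSupport()
        .config("spark.hadoop.javax.jdo.option.ConnectionURL",
                "jdbc:postgresql://db-host.example.com:5432/metastore")
        .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
                "org.postgresql.Driver")
        .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
        .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "secret")
        .getOrCreate()
    )
    spark.sql("SHOW DATABASES").show()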


On 10/20/22 4:31 AM, second_comet@yahoo.com.INVALID wrote:
> Currently my pyspark code is able to connect to the Hive metastore at 
> port 9083.  However, using this approach I can't put in place any 
> security mechanism such as LDAP or SQL authentication control.  Is 
> there any way to connect from pyspark to the Spark Thrift Server on 
> port 10000 without exposing the Hive metastore URL to pyspark?  I 
> would like to authenticate users before allowing them to execute Spark 
> SQL, and a user should only be allowed to query the databases and 
> tables they have access to.
>
>
>
> Thank you,
> comet