Posted to dev@spark.apache.org by JackyLee <qc...@163.com> on 2020/03/23 11:20:10 UTC

[DISCUSS] Supporting hive on DataSourceV2

Hi devs,
I'd like to start a discussion about supporting Hive on DataSourceV2. We're
now working on a project that uses DataSourceV2 to provide multi-source
support; it works very well with data lake solutions, but it does not yet
support HiveTable.

There are three reasons why we need to support Hive on DataSourceV2.
1. Hive itself is one of Spark's data sources.
2. HiveTable is essentially a FileTable with its own input and output
formats, so it fits naturally alongside FileTable.
3. HiveTable should be stateless, so that users can freely read or write
Hive in batch or micro-batch mode.

We implemented stateless Hive support on DataSourceV1; it lets users write
to Hive in streaming or batch mode, and it is widely used in our company.
Recently, we have been working on supporting Hive on DataSourceV2; multiple
Hive catalogs and DDL commands are already supported.
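
For context on what multiple-catalog support looks like from a user's point of view, DataSourceV2 catalogs in Spark 3.x are registered through configuration; the catalog name and implementation class below are hypothetical:

```
# spark-defaults.conf: map a catalog name to a (hypothetical) implementation class
spark.sql.catalog.hive_v2=com.example.hive.HiveCatalog
```

Once registered, the catalog name can qualify table identifiers in SQL (the database and table names here are made up):

```sql
-- read a Hive table through the hive_v2 catalog
SELECT * FROM hive_v2.warehouse.events;
```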

Looking forward to more discussions on this.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [DISCUSS] Supporting hive on DataSourceV2

Posted by JackyLee <qc...@163.com>.
Hi Blue,

I have created a JIRA for supporting Hive on DataSourceV2; we can associate
specific modules with it:
https://issues.apache.org/jira/browse/SPARK-31241

Could you share a Google Doc of the current design, so that we can discuss
and improve it in detail here?





Re: [DISCUSS] Supporting hive on DataSourceV2

Posted by JackyLee <qc...@163.com>.
Glad to hear that you have already built support for this; it is just what
we are doing. The exceptions you mentioned don't conflict with Hive support;
we can easily make them compatible.

>Do you have an idea about where the connector should be developed? I don’t
think it makes sense for it to be part of Spark. That would keep complexity
in the main project and require updating Hive versions slowly. Using a
separate project would mean less code in Spark specific to one source, and 
could more easily support multiple Hive versions. Maybe we should create a
project for catalog plug-ins?

AFAICT, it is necessary to create a new project, since users need to be able
to build their own connector according to their needs. In our implementation
of Hive on DataSourceV2, we put the basic partition API and commands in the
main project, and a default HiveCatalog and HiveConnector in an external
project. Users can use our project as-is, or implement their own
HiveConnector. Maybe this is a good way to structure the support.
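
To make the proposed split concrete, here is a minimal, self-contained Java sketch. It is deliberately not Spark's real API; every name below is invented for illustration. The plug-in interface and reflective loader stand in for what would live in the main project, while HiveCatalog stands in for the default implementation in the external project, replaceable by a user's own connector.

```java
import java.util.HashMap;
import java.util.Map;

// Core-side contract: what the main project would define. Loosely modeled on
// the idea of a catalog plug-in; all names here are hypothetical.
interface CatalogPlugin {
    void initialize(String name, Map<String, String> options);
    String name();
}

// Core-side loader: instantiates whatever implementation class the
// configuration names, so connectors can live outside the main project.
final class Catalogs {
    static CatalogPlugin load(String name, Map<String, String> conf) {
        String className = conf.get("sql.catalog." + name);
        if (className == null) {
            throw new IllegalArgumentException("No catalog registered as: " + name);
        }
        try {
            CatalogPlugin plugin = (CatalogPlugin)
                Class.forName(className).getDeclaredConstructor().newInstance();
            plugin.initialize(name, conf);
            return plugin;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Cannot instantiate catalog: " + className, e);
        }
    }
}

// "External project" side: a default implementation users could replace
// with their own connector by pointing the configuration at another class.
class HiveCatalog implements CatalogPlugin {
    private String catalogName;

    @Override
    public void initialize(String name, Map<String, String> options) {
        this.catalogName = name;
    }

    @Override
    public String name() {
        return catalogName;
    }
}
```

Because the loader only depends on the interface and a class name from configuration, the default HiveCatalog and any user-written replacement stay out of the main project entirely.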

Looking forward to your patch; we can cooperate in this area.





Re: [DISCUSS] Supporting hive on DataSourceV2

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Jacky,

We’ve internally released support for Hive tables (and Spark FileFormat
tables) using DataSourceV2 so that we can switch between catalogs; sounds
like that’s what you are planning to build as well. It would be great to
work with the broader community on a Hive connector.

I will get a branch of our connectors published so that you can take a
look. I think it should be fairly close to what you’re talking about
building, with a few exceptions:

   - Our implementation always uses our S3 committers, but it should be
   easy to change this
   - It supports per-partition formats, like Hive

Do you have an idea about where the connector should be developed? I don’t
think it makes sense for it to be part of Spark. That would keep complexity
in the main project and require updating Hive versions slowly. Using a
separate project would mean less code in Spark specific to one source, and
could more easily support multiple Hive versions. Maybe we should create a
project for catalog plug-ins?
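
If the connector did live in a separate project, pulling it in could look something like the following spark-submit configuration sketch (the package coordinates, class name, and catalog name are all hypothetical):

```
spark-submit \
  --packages com.example:spark-hive-connector_2.12:0.1.0 \
  --conf spark.sql.catalog.hive_v2=com.example.hive.HiveCatalog \
  my-app.jar
```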

rb

On Mon, Mar 23, 2020 at 4:20 AM JackyLee <qc...@163.com> wrote:

> Hi devs,
> I’d like to start a discussion about Supporting Hive on DatasourceV2. We’re
> now working on a project using DataSourceV2 to provide multiple source
> support and it works with the data lake solution very well, yet it does not
> yet support HiveTable.
>
> There are 3 reasons why we need to support Hive on DataSourceV2.
> 1. Hive itself is one of Spark data sources.
> 2. HiveTable is essentially a FileTable with its own input and output
> formats, it works fine with FileTable.
> 3. HiveTable should be stateless, and users can freely read or write Hive
> using batch or microbatch.
>
> We implemented stateless Hive on DataSourceV1, it supports user to write
> into Hive on streaming or batch and it has widely used in our company.
> Recently, we are trying to support Hive on DataSourceV2, Multiple Hive
> Catalog and DDL Commands have already been supported.
>
> Looking forward to more discussions on this.
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix