Posted to dev@spark.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2018/11/01 18:50:50 UTC
Re: DataSourceV2 hangouts sync
Thanks to everyone who attended the sync! We had some good discussions.
Here are my notes for anyone who missed it or couldn’t join the live
stream. If anyone wants to add to this, please send additional thoughts or
corrections.
*Attendees:*
- Ryan Blue - Netflix - Using v2 to integrate Iceberg with Spark. SQL,
DDL/schema evolution, delete support, and hidden partitioning working in
Netflix’s branch.
- John Zhuge - Netflix - Working on multi-catalog support.
- Felix Cheung - Uber - Interested in integrating Uber data sources. An
external catalog would be useful.
- Reynold Xin - Databricks - Working more on SQL and sources
- Arun M - Hortonworks - Interested in streaming, unified continuous and
micro-batch modes
- Dale Richardson - Private developer - Interested in non-FS based
partitions
- Dongjoon Hyun - Hortonworks - Looking at ORC support in v2,
transactional processing on the write side, data lineage
- Gengliang Wang - Databricks - ORC with the v2 API
- Hyukjin Kwon - Hortonworks - DSv2 API implementation for Hive
warehouse, LLAP
- Kevin Yu - IBM - Design for v2
- Matt Cheah - Palantir - Interested in integrating a custom data store
- Thomas D’Silva - Salesforce - Interested in a v2 Phoenix connector
- Wenchen Fan - Databricks - Proposed DSv2
- Xiao Li - Databricks - SparkSQL, reviews v2 patches
- Yuanjian Li - Interested in continuous streaming, new catalog API
*Goals for this sync:*
- Build consensus around the context that affects roadmap planning
- Set priorities and some reasonable milestones
- In the future, we’ll open things up to more general technical
discussions, but we will be more effective if we are aligned first.
*Conclusions:*
- Current blocker is Wenchen’s API update. Please review the refactor
proposal
<https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit#heading=h.l22hv3trducc>
and PR #22547 <https://github.com/apache/spark/pull/22547>
- Catalog support: this is a high priority blocker for SQL and real use
of DSv2
- Please review the TableCatalog SPIP proposal
<https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>
and the implementation, PR #21306
<https://github.com/apache/spark/pull/21306>
- A proposal for incrementally introducing catalogs is in PR #21978
<https://github.com/apache/spark/pull/21978>
- Generic and specific catalog support should use the same API
- Replacing the current global catalog will be done in parts with
more specific APIs like TableCatalog and FunctionCatalog
- Behavior compatibility:
- V2 will have well-defined behavior, primarily implemented by Spark
to ensure consistency across sources (e.g. CTAS)
- Users of V1 sources should not see behavior changes when sources are
updated to use V2.
- Reconciling these concerns is difficult. File-based sources may
need to implement compatibility hacks, like checking
spark.sql.sources.partitionOverwriteMode
- Explicit overwrite is preferred to automatic partition overwrite.
This mechanism could be used to translate some behaviors of INSERT
OVERWRITE ... PARTITION for V2 sources.
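As a rough illustration of the “replace the catalog in parts” idea above,
here is a minimal Java sketch of what a table-only catalog interface might
look like. The interface and method names here are simplified stand-ins
invented for this example, not the actual proposal; see the TableCatalog
SPIP and PR #21306 for the real API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

public class CatalogSketch {

    // A table is whatever a catalog hands back to the rest of the engine.
    interface Table {
        String name();
        Map<String, String> properties();
    }

    // One narrow interface per concern: tables here; functions would get
    // a separate FunctionCatalog, path-based tables a PathCatalog, etc.
    interface TableCatalog {
        Table loadTable(String name);
        Table createTable(String name, Map<String, String> properties);
        boolean dropTable(String name);
    }

    // Minimal in-memory implementation, just enough to show the contract.
    static class InMemoryTableCatalog implements TableCatalog {
        private final Map<String, Table> tables = new HashMap<>();

        @Override
        public Table loadTable(String name) {
            Table t = tables.get(name);
            if (t == null) {
                throw new NoSuchElementException("Table not found: " + name);
            }
            return t;
        }

        @Override
        public Table createTable(String name, Map<String, String> properties) {
            Table t = new Table() {
                public String name() { return name; }
                public Map<String, String> properties() { return properties; }
            };
            tables.put(name, t);
            return t;
        }

        @Override
        public boolean dropTable(String name) {
            return tables.remove(name) != null;
        }
    }

    public static void main(String[] args) {
        TableCatalog catalog = new InMemoryTableCatalog();
        Map<String, String> props = new HashMap<>();
        props.put("provider", "parquet");
        catalog.createTable("logs", props);
        System.out.println(catalog.loadTable("logs").name()); // logs
        System.out.println(catalog.dropTable("logs"));        // true
    }
}
```

The point of the split is that a source only implements the narrow
interfaces it actually supports, rather than the whole global catalog.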
*Context that affects roadmap planning:*
This is a *summary* of the long discussion, not quotes. It may not be in
the right order, but I think it captures the highlights.
- The community adopted the SPIP to standardize logical plans
<https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>,
and this requires a catalog API for sources
<https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>.
With multi-catalog support also coming soon, it makes sense to incorporate
this in the design and planning.
- Wenchen mentioned that there are two types of catalogs. First, generic
catalogs that can track tables with configurable implementations (like the
current catalog, which can track Parquet, JDBC, JSON, etc. tables). Second,
there are specific catalogs that expose a certain type of table (like a
JDBC catalog that exposes all of the tables in a relational DB, or a
Cassandra catalog).
- Ryan: we should be able to use the same catalog API for both of
these use cases.
- Reynold: Databricks is interested in a catalog API, and it should
replace the existing API. Replacing the existing API is difficult because
there are many concerns, like tracking functions. The current API is
complicated and may be difficult to replace.
- Ryan: Past discussions have suggested replacing the current catalog
API in pieces, like the proposed TableCatalog API for named tables, a
FunctionCatalog API to track UDFs, and a PathCatalog for path-based
tables. I’ve also proposed a way to maintain behavior by keeping the
existing code paths separate from v2 code paths using
CatalogTableIdentifier. (PR #21978
<https://github.com/apache/spark/pull/21978>)
- Reynold: How would path-based tables work? Most Databricks users
don’t use a table catalog.
- Ryan: SQL execution should be based on a Table abstraction (where a
table is a source of data and could be a stream). Different catalog
types, like path-based or name-based, would provide table instances
that are used by the rest of Spark.
- Reynold: Are catalogs only needed for CTAS?
- Ryan: No, v2 should support full DDL operations. Netflix has
create/drop/alter implemented and has extended DDL to full schema
evolution, including nested types. I see no reason not to add this
upstream.
- Reynold: The problem is that there are a lot of difficult design
challenges for a replacement catalog API.
- Ryan: What needs to be passed for the table catalog API is
well-defined because it is SQL DDL.
- We’ve already agreed that Wenchen’s API update is the next blocker, so
please review the refactor proposal
<https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit#heading=h.l22hv3trducc>
and PR #22547 <https://github.com/apache/spark/pull/22547>
- By standardizing behavior in V2, we get consistent and reliable behavior
across sources. But because V1 behavior is not consistent, we can’t simply
move everything to V2.
- Example consistency problems: SaveMode.Overwrite may drop and recreate
a table, may drop all rows and insert, or may overwrite portions of a
table. Path-based tables don’t use an existence check, so all writes are
validated using CTAS rules and not the insert rules.
- Reynold: We should be able to mimic what the file-based sources do
(e.g. parquet, orc)
- Matt: We are interested in RTAS, file source, and metastore
- Ryan: Keeping the same behavior as the file sources is not possible
because that behavior differs based on Spark properties, like
spark.sql.sources.partitionOverwriteMode, introduced by SPARK-20236
<https://issues.apache.org/jira/browse/SPARK-20236>
- Felix and Reynold: We can’t change the behavior of file-based sources.
- Ryan: Agreed, but we also need to deliver well-defined behavior
without requiring a variety of modes that sources are required to
implement for V2.
- Ryan: The main problem is INSERT OVERWRITE ... PARTITION. This query
implicitly overwrites data; overwrites should be explicit. We have built
INSERT OVERWRITE table WHERE delete-expr SELECT ... to make the data that
is replaced explicit. Spark could use explicit overwrites to implement
some overwrite behaviors, like static partition overwrites.
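To illustrate the explicit-overwrite idea above, here is a hypothetical
Java sketch that models a static partition overwrite as an explicit
delete (drop every row matching the partition spec) followed by an
append. The row model and method names are invented for the example; the
real mechanism would operate on Spark plans and source APIs, not lists of
maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class OverwriteSketch {

    // A "row" here is just a map of column name to value; partition
    // columns are ordinary entries in the map.
    static List<Map<String, String>> overwritePartition(
            List<Map<String, String>> table,
            Map<String, String> staticPartition,    // e.g. {day=2018-11-01}
            List<Map<String, String>> newRows) {
        // Step 1: make the implicit overwrite explicit as a delete filter:
        // keep only rows that do NOT match every static partition value.
        List<Map<String, String>> result = new ArrayList<>(table.stream()
            .filter(row -> !staticPartition.entrySet().stream()
                .allMatch(e -> e.getValue().equals(row.get(e.getKey()))))
            .collect(Collectors.toList()));
        // Step 2: append the replacement rows.
        result.addAll(newRows);
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> table = new ArrayList<>();
        table.add(Map.of("day", "2018-10-31", "id", "1"));
        table.add(Map.of("day", "2018-11-01", "id", "2"));

        List<Map<String, String>> result = overwritePartition(
            table,
            Map.of("day", "2018-11-01"),                      // partition to overwrite
            List.of(Map.of("day", "2018-11-01", "id", "3"))); // replacement rows

        // The 2018-10-31 row survives; the old 2018-11-01 row is replaced.
        System.out.println(result.size()); // 2
    }
}
```

The same delete-then-append shape also covers the WHERE delete-expr form:
the filter just becomes an arbitrary predicate instead of an exact match
on static partition values.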
*Next sync will include more specific technical topics, please send
suggestions!*
On Thu, Oct 25, 2018 at 1:09 PM Ryan Blue <rb...@netflix.com> wrote:
Hi everyone,
>
> There's been some great discussion for DataSourceV2 in the last few
> months, but it has been difficult to resolve some of the discussions and I
> don't think that we have a very clear roadmap for getting the work done.
>
> To coordinate better as a community, I'd like to start a regular sync-up
> over google hangouts. We use this in the Parquet community to have more
> effective community discussions about thorny technical issues and to get
> aligned on an overall roadmap. It is really helpful in that community and I
> think it would help us get DSv2 done more quickly.
>
> Here's how it works: people join the hangout, we go around the list to
> gather topics, have about an hour-long discussion, and then send a summary
> of the discussion to the dev list for anyone that couldn't participate.
> That way we can move topics along, but we keep the broader community in the
> loop as well for further discussion on the mailing list.
>
> I'll volunteer to set up the sync and send invites to anyone that wants to
> attend. If you're interested, please reply with the email address you'd
> like to put on the invite list (if there's a way to do this without
> specific invites, let me know). Also for the first sync, please note what
> times would work for you so we can try to account for people in different
> time zones.
>
> For the first one, I was thinking some day next week (time TBD by those
> interested) and starting off with a general roadmap discussion before
> diving into specific technical topics.
>
> Thanks,
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
--
Ryan Blue
Software Engineer
Netflix