Posted to dev@spark.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2018/11/01 18:50:50 UTC

Re: DataSourceV2 hangouts sync

Thanks to everyone that attended the sync! We had some good discussions.
Here are my notes for anyone that missed it or couldn’t join the live
stream. If anyone wants to add to this, please send additional thoughts or
corrections.

*Attendees:*

   - Ryan Blue - Netflix - Using v2 to integrate Iceberg with Spark. SQL,
   DDL/schema evolution, delete support, and hidden partitioning working in
   Netflix’s branch.
   - John Zhuge - Netflix - Working on multi-catalog support.
   - Felix Cheung - Uber - Interested in integrating Uber data sources.
   External catalog would be useful
   - Reynold Xin - Databricks - Working more on SQL and sources
   - Arun M - Hortonworks - Interested in streaming, unified continuous and
   micro-batch modes
   - Dale Richardson - Private developer - Interested in non-FS based
   partitions
   - Dongjoon Hyun - Hortonworks - Looking at ORC support in v2,
   transactional processing on the write side, data lineage
   - Gengliang Wang - Databricks - ORC with v2 API
   - Hyukjin Kwon - Hortonworks - DSv2 API implementation for Hive
   warehouse, LLAP
   - Kevin Yu - IBM - Design for v2
   - Matt Cheah - Palantir - Interested in integrating a custom data store
   - Thomas D’Silva - Salesforce - Interested in a v2 Phoenix connector
   - Wenchen Fan - Databricks - Proposed DSv2
   - Xiao Li - Databricks - Spark SQL, reviews v2 patches
   - Yuanjian Li - Interested in continuous streaming, new catalog API

*Goals for this sync:*

   - Build consensus around the context that affects roadmap planning
   - Set priorities and some reasonable milestones
   - In the future, we’ll open things up to more general technical
   discussions, but we will be more effective if we are aligned first.

*Conclusions:*

   - Current blocker is Wenchen’s API update. Please review the refactor
   proposal
   <https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit#heading=h.l22hv3trducc>
   and PR #22547 <https://github.com/apache/spark/pull/22547>
   - Catalog support: this is a high priority blocker for SQL and real use
   of DSv2
      - Please review the TableCatalog SPIP proposal
      <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>
      and the implementation, PR #21306
      <https://github.com/apache/spark/pull/21306>
      - A proposal for incrementally introducing catalogs is in PR #21978
      <https://github.com/apache/spark/pull/21978>
      - Generic and specific catalog support should use the same API
      - Replacing the current global catalog will be done in parts with
       more specific APIs like TableCatalog and FunctionCatalog (a rough
       sketch of a table catalog interface follows this list)
   - Behavior compatibility:
      - V2 will have well-defined behavior, primarily implemented by Spark
      to ensure consistency across sources (e.g. CTAS)
      - Uses of V1 sources should not see behavior changes when sources are
      updated to use V2.
      - Reconciling these concerns is difficult. File-based sources may
      need to implement compatibility hacks, like checking
      spark.sql.sources.partitionOverwriteMode
      - Explicit overwrite is preferred to automatic partition overwrite.
      This mechanism could be used to translate some behaviors of INSERT
      OVERWRITE ... PARTITION for V2 sources.
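
To make the catalog items above more concrete, here is a rough Scala sketch of
what a table-only catalog interface could look like. The trait, method names,
and TableChange shapes below are simplified illustrations of the idea, not the
proposed API itself; the SPIP and PR #21306 linked above have the actual
definitions.

    import org.apache.spark.sql.types.StructType

    // Simplified stand-in for a Table abstraction: a named or path-based
    // source of data (possibly a stream) with a schema, which the rest of
    // Spark plans against regardless of which catalog produced it.
    trait Table {
      def name: String
      def schema: StructType
      def partitioning: Seq[String]
    }

    // Illustrative schema changes that ALTER TABLE statements would be
    // translated into; the real proposal defines these in more detail.
    sealed trait TableChange
    case class AddColumn(name: String, dataType: String) extends TableChange
    case class RenameColumn(name: String, newName: String) extends TableChange
    case class DropColumn(name: String) extends TableChange

    // Sketch of a table-only catalog covering the DDL operations discussed
    // below (create/load/alter/drop). A generic catalog (like the current
    // global one) and a specific catalog (JDBC, Cassandra, etc.) could both
    // implement this same interface; functions and path-based tables would
    // be handled by separate, parallel APIs such as a FunctionCatalog.
    trait TableCatalog {
      def loadTable(ident: String): Table
      def createTable(
          ident: String,
          schema: StructType,
          partitioning: Seq[String],
          properties: Map[String, String]): Table
      def alterTable(ident: String, changes: Seq[TableChange]): Table
      def dropTable(ident: String): Boolean
    }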

*Context that affects roadmap planning:*

This is a *summary* of the long discussion, not quotes. It may not be in
the right order, but I think it captures the highlights.

   - The community adopted the SPIP to standardize logical plans
   <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>
   and this requires a catalog API for sources
   <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>.
   With multi-catalog support also coming soon, it makes sense to incorporate
   this in the design and planning.
   - Wenchen mentioned that there are two types of catalogs. First, generic
      catalogs that can track tables with configurable implementations (like
      the current catalog that can track Parquet, JDBC, JSON, etc. tables).
      Second, there are specific catalogs that expose a certain type of table
      (like a JDBC catalog that exposes all of the tables in a relational DB
      or a Cassandra catalog).
      - Ryan: we should be able to use the same catalog API for both of
      these use cases.
      - Reynold: Databricks is interested in a catalog API and it should
      replace the existing API. Replacing the existing API is difficult because
      there are many concerns, like tracking functions. The current API is
      complicated and may be difficult to replace.
      - Ryan: Past discussions have suggested replacing the current catalog
      API in pieces, like the proposed TableCatalog API for named tables, a
      FunctionCatalog API to track UDFs, and a PathCatalog for path-based
      tables. I’ve also proposed a way to maintain behavior by keeping the
      existing code paths separate from v2 code paths using
      CatalogTableIdentifier. (PR #21978
      <https://github.com/apache/spark/pull/21978>)
      - Reynold: How would path-based tables work? Most Databricks users
      don’t use a table catalog.
      - Ryan: SQL execution should be based on a Table abstraction (where a
      table is a source of data and could be a stream). Different catalog
      types, like path-based or name-based, would provide table instances
      that are used by the rest of Spark.
      - Reynold: Are catalogs only needed for CTAS?
      - Ryan: No, v2 should support full DDL operations. Netflix has
      create/drop/alter implemented and has extended DDL to full schema
      evolution, including nested types. I see no reason not to add this
      upstream.
      - Reynold: The problem is that there are a lot of difficult design
      challenges for a replacement catalog API.
      - Ryan: What needs to be passed for the table catalog API is
      well-defined because it is SQL DDL.
   - We’ve already agreed that Wenchen’s API update is the next blocker, so
   please review the refactor proposal
   <https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit#heading=h.l22hv3trducc>
   and PR #22547 <https://github.com/apache/spark/pull/22547>
   - By standardizing behavior in V2, we get consistent and reliable behavior
   across sources. But because V1 behavior is not consistent, we can’t simply
   move everything to V2.
   - Example consistency problems: SaveMode.Overwrite may drop and recreate
      a table, may drop all rows and insert, or may overwrite portions of a
      table. Path-based tables don’t use an existence check, so all writes are
      validated using CTAS rules and not the insert rules.
      - Reynold: We should be able to mimic what the file-based sources do
      (e.g. parquet, orc)
      - Matt: We are interested in RTAS, file source, and metastore
      - Ryan: Keeping the same behavior as the file sources is not possible
      because that behavior differs based on Spark properties, like
      spark.sql.sources.partitionOverwriteMode introduced by SPARK-20236
      <https://issues.apache.org/jira/browse/SPARK-20236>
      - Felix and Reynold: We can’t change the behavior of file-based
      sources
      - Ryan: Agreed, but we also need to deliver well-defined behavior
      without requiring a variety of modes that sources are required to
      implement for V2.
      - Ryan: Main problem is INSERT OVERWRITE ... PARTITION. This query
      implicitly overwrites data. Overwrites should be explicit. We have built
      INSERT OVERWRITE table WHERE delete-expr SELECT ... to make the data
      that is replaced explicit. Spark could use explicit overwrites to
      implement some overwrite behaviors, like static partition overwrites.
      (A rough sketch of the difference follows this list.)
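
To illustrate the overwrite problem, here is a rough sketch of the implicit
behavior under discussion. The logs/staged_logs tables and their columns are
made up for the example, and the explicit WHERE form in the final comment is
the syntax from Netflix’s branch described above, not something available in
upstream Spark.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Assuming `logs` is a partitioned data source table: with the default
    // spark.sql.sources.partitionOverwriteMode=static, an INSERT OVERWRITE
    // with no static partition values clears the entire table before writing;
    // with "dynamic" (SPARK-20236), only the partitions produced by the query
    // are replaced. The same statement behaves differently per session config.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    spark.sql("""
      INSERT OVERWRITE TABLE logs PARTITION (day)
      SELECT message, level, day FROM staged_logs
    """)

    // The explicit form makes the replaced data part of the query itself
    // rather than a session-level mode, e.g.:
    //   INSERT OVERWRITE logs WHERE day = '2018-11-01'
    //   SELECT message, level, day FROM staged_logs

This is the sense in which Spark could translate INSERT OVERWRITE ... PARTITION
into an explicit, well-defined overwrite for V2 sources.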

*Next sync will include more specific technical topics; please send
suggestions!*

On Thu, Oct 25, 2018 at 1:09 PM Ryan Blue <rb...@netflix.com> wrote:

> Hi everyone,
>
> There's been some great discussion for DataSourceV2 in the last few
> months, but it has been difficult to resolve some of the discussions and I
> don't think that we have a very clear roadmap for getting the work done.
>
> To coordinate better as a community, I'd like to start a regular sync-up
> over google hangouts. We use this in the Parquet community to have more
> effective community discussions about thorny technical issues and to get
> aligned on an overall roadmap. It is really helpful in that community and I
> think it would help us get DSv2 done more quickly.
>
> Here's how it works: people join the hangout, we go around the list to
> gather topics, have about an hour-long discussion, and then send a summary
> of the discussion to the dev list for anyone that couldn't participate.
> That way we can move topics along, but we keep the broader community in the
> loop as well for further discussion on the mailing list.
>
> I'll volunteer to set up the sync and send invites to anyone that wants to
> attend. If you're interested, please reply with the email address you'd
> like to put on the invite list (if there's a way to do this without
> specific invites, let me know). Also for the first sync, please note what
> times would work for you so we can try to account for people in different
> time zones.
>
> For the first one, I was thinking some day next week (time TBD by those
> interested) and starting off with a general roadmap discussion before
> diving into specific technical topics.
>
> Thanks,
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix