Posted to dev@spark.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2019/05/06 17:04:18 UTC
DataSourceV2 community sync notes - 1 May 2019
Here are my notes for the latest DSv2 community sync. As usual, if you have
comments or corrections, please reply. If you’d like to be invited to the
next sync, email me directly. Everyone is welcome to attend.
*Attendees*:
Ryan Blue
John Zhuge
Andrew Long
Bruce Robbins
Dilip Biswal
Gengliang Wang
Kevin Yu
Michael Artz
Russel Spitzer
Yifei Huang
Zhilmil Dhillon
*Topics*:
Introductions
Open pull requests
V2 organization
Bucketing and sort order from v2 sources
*Discussion*:
- Introductions: we stopped doing introductions when we had a large
group. Attendance has gone down from the first few syncs, so we decided to
resume.
- V2 organization / PR #24416: https://github.com/apache/spark/pull/24416
- Ryan: There’s an open PR to move v2 into catalyst
- Andrew: What is the distinction between catalyst and sql? How do we
know what goes where?
- Ryan: IIUC, the catalyst module is supposed to be a stand-alone
query planner that doesn’t get into concrete physical plans. The catalyst
package is the private implementation. Anything that is generic catalyst,
including APIs like DataType, should be in the catalyst module. Anything
public, like an API, should not be in the catalyst package.
- Ryan: v2 meets those requirements now and I don’t have a strong
opinion on organization. We just need to choose one.
- No one had a strong opinion, so we tabled this. In #24416 or shortly
after, we should decide on an organization and do the move all at once.
- Next steps: someone with an opinion on organization should suggest
a structure.
- TableCatalog API / PR #24246:
https://github.com/apache/spark/pull/24246
- Ryan: Wenchen’s last comment was that he was waiting for the tests
to pass, and they are passing now. Maybe this will be merged soon?
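For readers following along, the TableCatalog proposal in #24246 can be pictured roughly as below. This is a simplified, hypothetical sketch, not the interface in the PR: the real proposal is a Java API with richer types (StructType schemas, partitioning transforms, table properties, TableChange for alters), and the names and the string-map schema here are illustrative only.

```scala
// Hypothetical, simplified sketch of a TableCatalog-style plugin API.
// The actual PR #24246 interface is richer; names here are illustrative.
case class Identifier(namespace: Seq[String], name: String)

trait Table {
  def name: String
  def schema: Map[String, String] // column name -> type name, simplified
}

case class SimpleTable(name: String, schema: Map[String, String]) extends Table

trait TableCatalog {
  def loadTable(ident: Identifier): Option[Table]
  def createTable(ident: Identifier, schema: Map[String, String]): Table
  def dropTable(ident: Identifier): Boolean
}

// Toy in-memory implementation, just to show how a catalog plugin behaves.
class InMemoryCatalog extends TableCatalog {
  private val tables = scala.collection.mutable.Map.empty[Identifier, Table]

  def loadTable(ident: Identifier): Option[Table] = tables.get(ident)

  def createTable(ident: Identifier, schema: Map[String, String]): Table = {
    val t = SimpleTable(ident.name, schema)
    tables(ident) = t
    t
  }

  def dropTable(ident: Identifier): Boolean = tables.remove(ident).isDefined
}
```

The point of the plugin shape is that Spark resolves an identifier against a configured catalog implementation instead of a single built-in catalog, which is what makes multi-catalog support possible.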
- Bucketing and sort order from v2 sources
- Andrew: interested in using the existing data layout and sort order
to avoid expensive shuffle and sort steps in joins
- Ryan: in v2, bucketing is unified with other partitioning
functions. I plan to build a way for Spark to get partition function
implementations from a source, so it can use that function to prepare
the other side of a join. From there, I have been thinking about a way
to check compatibility between functions, so we could validate that
table A has the same bucketing as table B.
- Dilip: is bucketing Hive-specific?
- Russel: Cassandra also buckets
- Matt: what is the difference between bucketing and other partition
functions for this?
- Ryan: probably no difference. If you’re partitioning by hour, you
could probably use that, too.
- Dilip: how can Spark compare functions?
- Andrew: should we introduce a standard function? Switching would be
easy for their use case.
- Ryan: it is difficult to introduce a standard because so much data
already exists in tables. I think it is easier to support multiple
functions.
- Russel: Cassandra uses dynamic bucketing, which wouldn’t be able to
use a standard.
- Dilip: sources could push down joins
- Russel: that’s a harder problem
- Andrew: does anyone else limit bucket size?
- Ryan: we don’t, because we assume the sort can spill. That is
probably a good optimization for later.
- Matt: what are the follow-up items for this?
- Andrew: will look into the current state of bucketing in Spark
- Ryan: it would be great if someone thought about what the
FunctionCatalog interface will look like
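Ryan’s plan above, getting a partition function from a source and checking whether two tables’ layouts are compatible, could be sketched roughly as follows. This is a speculative illustration, not an actual Spark API; every name here (PartitionFunction, HashBucketFunction, PartitionSpec, compatible) is made up for the example.

```scala
// Speculative sketch of the bucketing discussion above: a source exposes
// its partition (bucket) function so Spark could apply the same function
// to the other side of a join instead of shuffling both sides.
// None of these names are real Spark APIs.
trait PartitionFunction[T] {
  def name: String          // identifies the function implementation
  def numPartitions: Int
  def apply(value: T): Int  // partition id for a join key value
}

// A hash-based bucketing function as one concrete implementation.
class HashBucketFunction(val numPartitions: Int) extends PartitionFunction[String] {
  val name = "hash-bucket"
  def apply(value: String): Int = Math.floorMod(value.hashCode, numPartitions)
}

// A compatibility check could amount to comparing the function identity
// and its parameters, rather than requiring one standard function.
case class PartitionSpec(functionName: String, numBuckets: Int, columns: Seq[String])

def compatible(a: PartitionSpec, b: PartitionSpec): Boolean =
  a.functionName == b.functionName &&
    a.numBuckets == b.numBuckets &&
    a.columns.length == b.columns.length
```

Comparing specs rather than mandating one standard matches the point made in the discussion: existing tables (Hive, Cassandra, and others) already use different functions, so Spark only needs to know when two layouts agree.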
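As a starting point for the follow-up Ryan asks for, a FunctionCatalog interface might take a shape like the one below. This is purely speculative, since no such interface existed at the time of these notes; the names (FunctionIdentifier, UnboundFunction, loadFunction, listFunctions) are invented for the sketch, paired with a toy in-memory implementation.

```scala
// Purely speculative sketch of what a FunctionCatalog interface might
// look like; none of these names are real Spark APIs.
case class FunctionIdentifier(namespace: Seq[String], name: String)

trait UnboundFunction {
  def name: String
}

trait FunctionCatalog {
  def loadFunction(ident: FunctionIdentifier): Option[UnboundFunction]
  def listFunctions(namespace: Seq[String]): Seq[FunctionIdentifier]
}

// Toy in-memory implementation, for illustration only.
class InMemoryFunctions extends FunctionCatalog {
  private val fns =
    scala.collection.mutable.Map.empty[FunctionIdentifier, UnboundFunction]

  def register(ident: FunctionIdentifier, fn: UnboundFunction): Unit =
    fns(ident) = fn

  def loadFunction(ident: FunctionIdentifier): Option[UnboundFunction] =
    fns.get(ident)

  def listFunctions(namespace: Seq[String]): Seq[FunctionIdentifier] =
    fns.keys.filter(_.namespace == namespace).toSeq
}
```

A catalog-scoped lookup like this would let a source hand Spark the same bucketing function its tables use, which is the piece the join optimization above depends on.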
--
Ryan Blue
Software Engineer
Netflix