Posted to dev@spark.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2019/04/04 23:21:19 UTC
DataSourceV2 sync 3 April 2019
Hi everyone, here are my notes from last night’s DSv2 sync.
As usual, please reply with comments or corrections.
Sorry for the lack of notes for the last meeting. I lost them when my
laptop battery died.
rb
*Topics*:
- Open pull requests
   - PR #24233: Remove SaveMode (https://github.com/apache/spark/pull/24233)
   - PR #24117: Add public transform API (https://github.com/apache/spark/pull/24117)
   - PR #24246: Add TableCatalog API (https://github.com/apache/spark/pull/24246)
- Add a hack to special-case file sources?
- Partition transforms and tables
*Discussion*:
- PR #24233: Remove SaveMode
   - Ryan: This PR has a hack that passes SaveMode directly to file
   sources. Because the behavior of v1 is unknown, this should not be
   included in a release.
   - Wenchen: This is needed in order to use v2 file sources. The intent
   is to migrate the file source to provide a reference implementation.
   Without the hack, v2 sources can’t be used (by default) because we know
   it [v2] is broken [i.e., can’t run CTAS].
   - Ryan: v2 isn’t broken. The only problem is path-based tables,
   because we don’t know how they should behave [Gengliang is looking into
   this]. Catalog tables aren’t broken, so we should focus on getting that
   support done.
- Russel/Matt: is this needed for current tests to pass?
- Wenchen: no, tests run using v1.
- Russel: +1 for not including the hack.
   - Ryan: Also, tests for the new plans will require an in-memory source
   that will be a cleaner reference implementation.
- PR #24117 parser rules & code organization:
   - Ryan: Reynold suggests using expression instead of transform to get
   better error messages. By handling expressions, the AstBuilder can
   throw better error messages than the generated parser code. The parser
   generates messages like “expected identifier”, but the AstBuilder can
   do better, like “expression is not a transformation”.
   - Ryan: This is a good idea, but I’d like to avoid blocking this work
   and do this in a follow-up if the community agrees. This PR is blocking
   the table catalog, which is blocking the new table resolution rules,
   new logical plans, etc. Delaying just to improve the error message
   isn’t a good use of time.
- Wenchen: agreed to update the parser errors later
- Matt: how long will it take to fix it?
- Ryan: even if it only takes a few hours to build and test, this
will reasonably push the PR out by another few days
- Wenchen and Dilip pointed out other rules that are matched by
expression, like sub-queries, so this isn’t a simple fix.
- Ryan: along the same lines, moving classes in this PR will cause
conflicts and delay. We are also moving classes into catalyst. Can we fix
the package organization when it is more convenient?
- Wenchen: agreed to update organization later
   - Consensus was to move forward without fixing organization or
   parsing, to unblock dependent changes like #24246.
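[Editor's note: a minimal, self-contained sketch of the error-message idea discussed above. This is hypothetical code, not Spark's actual AstBuilder or grammar; the class, method names, and the set of known transforms are invented for illustration. It shows the pattern under discussion: let the grammar accept any expression, then validate in the builder, where a domain-specific message is possible.]

```java
import java.util.Set;

public class TransformCheck {
    // Hypothetical set of accepted transform names; the real set is
    // defined by the transform API in PR #24117, not this sketch.
    private static final Set<String> KNOWN =
        Set.of("identity", "bucket", "years", "months", "days", "hours");

    /** Returns the transform name, or throws with a targeted message. */
    public static String validateTransform(String parsedExpression) {
        int paren = parsedExpression.indexOf('(');
        String name =
            (paren < 0 ? parsedExpression : parsedExpression.substring(0, paren)).trim();
        if (!KNOWN.contains(name)) {
            // A generated parser can only say something like
            // "expected identifier"; the builder can say what is
            // actually wrong with the input:
            throw new IllegalArgumentException(
                "expression is not a transformation: " + parsedExpression);
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(validateTransform("bucket(16, id)")); // prints "bucket"
        try {
            validateTransform("id + 1");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```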
- PR #24117: transforms
- Wenchen: Why not base Expression on the internal Expression class?
For example, add references to Expression, not Transform.
   - Ryan: The intent is to be as simple as possible to start with.
   Literal doesn’t use references, and we can always promote the method to
   Expression later, but not the other way around.
- Wenchen: What values are wrapped by Literal? InternalRow?
- Ryan: the internal representations.
- Matt: We would need to be careful about changing the internal
representation then.
   - Ryan: It is already difficult to change an internal representation,
   even without this. And these APIs already expose InternalRow, so we
   aren’t creating new problems.
- Matt: Yes, this isn’t a new problem.
   - Ryan: I’ll follow up on the PR with documentation for the
   representations used by InternalRow.
   - Ryan: there is some confusion on the PR about what transforms are.
   Reynold asked what dateHour returns. Part of the problem is that
   “dateHour” is a confusing name; it should be “hourly” or “hours”. The
   name includes “date” to signal that it isn’t an hour in [0, 23]; it is
   an hour-granularity partition transform.
- Ryan: these transforms are not concrete functions. There are many
functions that can partition timestamps into hour or day granularity
ranges. The concrete transform function doesn’t matter for configuring
tables.
- Ryan: For example, the “bucket” function is conceptually the same,
but has a different concrete implementation in Spark, Hive, Iceberg, and
Cassandra tables.
- Russel: commented on the craziness of Cassandra’s bucket transform
   - Ryan: Spark needs to work with all of these functions. When Spark
   needs a concrete implementation to run a bucket join, it should look up
   the “bucket” function from the table’s catalog using the
   FunctionCatalog interface. That way it can prepare data for the other
   side of a bucketed join to work with bucketed data from any source.
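[Editor's note: a minimal, self-contained sketch of the transform model discussed above. This is not the actual API from PR #24117; the class and method names here are invented for illustration. The point it captures is that a transform is just a name plus arguments with no concrete implementation, so Spark, Hive, Iceberg, and Cassandra can each bind their own “bucket” or “hours” function, and a Literal wraps a value in an internal representation rather than referencing columns.]

```java
import java.util.List;

public class Transforms {
    interface Expression {}

    /** A literal wraps a value in an internal representation
     *  (for a timestamp, e.g., a long rather than a java.sql.Timestamp). */
    record Literal(Object value) implements Expression {}

    /** A reference to a named column. */
    record NamedReference(String name) implements Expression {}

    /** A transform: a function name plus arguments, no implementation.
     *  The concrete function is looked up later from the table's catalog. */
    record Transform(String name, List<Expression> arguments) implements Expression {}

    /** hours(ts): an hour-granularity partition transform,
     *  not an hour-of-day in [0, 23]. */
    static Transform hours(String column) {
        return new Transform("hours", List.of(new NamedReference(column)));
    }

    /** bucket(n, col): hash-partition col into n buckets; each source
     *  (Spark, Hive, Iceberg, Cassandra) supplies its own hash function. */
    static Transform bucket(int numBuckets, String column) {
        return new Transform("bucket",
            List.of(new Literal(numBuckets), new NamedReference(column)));
    }

    public static void main(String[] args) {
        // Both values are plain descriptions (name + args), not functions.
        System.out.println(hours("ts"));
        System.out.println(bucket(16, "id"));
    }
}
```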
- PR #24246: Add TableCatalog API
   - Ryan: This PR is based on #24117, but please start reviewing now to
   make this go faster.
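[Editor's note: a condensed, hypothetical sketch of what a table catalog surface looks like, for reviewers new to the proposal; consult PR #24246 for the real interface, which this does not reproduce. It also sketches the kind of in-memory reference implementation mentioned earlier for testing the new plans. Names and signatures here are invented.]

```java
import java.util.HashMap;
import java.util.Map;

public class CatalogSketch {
    interface Table {
        String name();
        Map<String, String> properties();
    }

    // Catalog operations (load/create/drop) resolved through a named
    // catalog, instead of SaveMode-driven behavior.
    interface TableCatalog {
        Table loadTable(String ident);
        Table createTable(String ident, Map<String, String> properties);
        boolean dropTable(String ident);
    }

    /** A simple in-memory catalog, useful as a test reference. */
    static class InMemoryCatalog implements TableCatalog {
        private final Map<String, Table> tables = new HashMap<>();

        public Table loadTable(String ident) {
            Table t = tables.get(ident);
            if (t == null) {
                throw new IllegalArgumentException("Table not found: " + ident);
            }
            return t;
        }

        public Table createTable(String ident, Map<String, String> properties) {
            Table t = new Table() {
                public String name() { return ident; }
                public Map<String, String> properties() { return properties; }
            };
            tables.put(ident, t);
            return t;
        }

        public boolean dropTable(String ident) {
            return tables.remove(ident) != null;
        }
    }

    public static void main(String[] args) {
        TableCatalog catalog = new InMemoryCatalog();
        catalog.createTable("db.t", Map.of("provider", "parquet"));
        System.out.println(catalog.loadTable("db.t").name()); // prints "db.t"
        System.out.println(catalog.dropTable("db.t"));        // prints "true"
    }
}
```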
*Attendees*:
Ryan Blue
John Zhuge
Russel Spitzer
Gengliang Wang
Yuanjian Li
Matt Cheah
Yifei Huang
Felix Cheung
Dilip Biswal
Wenchen Fan
--
Ryan Blue
Software Engineer
Netflix