Posted to dev@spark.apache.org by Josh Rosen <ro...@gmail.com> on 2019/07/14 21:05:12 UTC

Spark SQL upgrade / migration guide: discoverability and content organization

I'd like to discuss the Spark SQL migration / upgrade guides in the Spark
documentation: these are valuable resources and I think we could increase
that value by making these docs easier to discover and by adding a bit more
structure to the existing content.

For folks who aren't familiar with these docs: the Spark docs have a "SQL
Migration Guide" which lists the deprecations and changes of behavior in
each release:

   - Latest published version:
   https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
   - Master branch version (will become 3.0):
   https://github.com/apache/spark/blob/master/docs/sql-migration-guide-upgrade.md

A lot of community work went into crafting this doc and I really appreciate
those efforts.

This doc is a little hard to find, though, because it's not consistently
linked from release notes pages: the 2.4.0 page links it under "Changes of
Behavior" (
https://spark.apache.org/releases/spark-release-2-4-0.html#changes-of-behavior)
but subsequent maintenance releases do not link to it (
https://spark.apache.org/releases/spark-release-2-4-1.html). It's also not
very cross-linked from the rest of the Spark docs (e.g. the Overview doc,
doc drop-down menus, etc).

I'm also concerned that the doc may be overwhelming to end users (as
opposed to Spark developers):

   - *Entries aren't grouped by component*, so users need to read the
   entire document to spot changes relevant to their use of Spark (for
   example, PySpark changes are not grouped together).
   - *Entries aren't ordered by size / risk of change,* e.g. performance
   impact vs. loud behavior change (stopping with an explicit exception) vs.
   silent behavior changes (e.g. changing default rounding behavior). If we
   assume limited reader attention then it may be important to prioritize the
   order in which we list entries, putting the highest-expected-impact /
   lowest-organic-discoverability changes first.
   - *We don't link JIRAs*, forcing users to do their own archaeology to
   learn more about a specific change.

The existing ML migration guide addresses some of these issues, so maybe we
can emulate it in the SQL guide:
https://spark.apache.org/docs/latest/ml-guide.html#migration-guide
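To make that concrete, here is a rough, purely hypothetical sketch of what a
restructured entry could look like in the guide's Markdown source (the
component heading, severity label, and JIRA id below are placeholders, not
actual guide content):

    ## Upgrading from Spark SQL 2.4 to 3.0

    ### PySpark

    - [Silent behavior change] Since Spark 3.0, <what changed, the old vs.
      new behavior, and the config that restores the old behavior, if any>.
      See SPARK-XXXXX.

Something along those lines would let a PySpark-only user jump straight to
their section, judge severity at a glance, and follow the JIRA for details.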

I think that documentation clarity is especially important with Spark 3.0
around the corner: many folks will seek out this information when they
upgrade, so improving this guide can be a high-leverage, high-impact
activity.

What do folks think? Does anyone have examples from other projects which do
a notably good job of crafting release notes / migration guides? I'd be
glad to help with pre-release editing after we decide on a structure and
style.

Cheers,
Josh

Re: Spark SQL upgrade / migration guide: discoverability and content organization

Posted by Jungtaek Lim <ka...@gmail.com>.
As one of the contributors to Structured Streaming, I would vote for having
a migration guide doc for Structured Streaming as well, once we decide on a
standard format for the migration guide.

In Spark 3.0.0 there are some breaking changes even in the SS area - one
example is SPARK-28199, for which Sean took care of leaving a release note.
A migration guide would better help users going from 2.4.x to 3.0.x, since
a release note is bound only to 3.0.0.

-Jungtaek Lim (HeartSaVioR)

-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior

Re: Spark SQL upgrade / migration guide: discoverability and content organization

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you, Josh and Xiao. That sounds great.

Do you think we can have some parts of that improvement in the `2.4.4`
documentation first, since that is the very next release?

Bests,
Dongjoon.

Re: Spark SQL upgrade / migration guide: discoverability and content organization

Posted by Xiao Li <li...@databricks.com>.
Yeah, Josh! All these ideas sound good to me. All the top commercial
database products have very detailed guides/documents about version
upgrading, and you can easily find them.

Currently, only the SQL and ML modules have migration or upgrade guides.
Since the Spark 2.3 release, we have strictly required PR authors to
document all behavior changes in the SQL component. I would suggest doing
the same in the other modules, for example Spark Core and Structured
Streaming. Any objection?

Cheers,

Xiao


