Posted to dev@spark.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2019/09/06 22:28:00 UTC

DSv2 sync - 4 September 2019

Here are my notes from the latest sync. Feel free to reply with
clarifications if I’ve missed anything.

*Attendees*:

Ryan Blue
John Zhuge
Russell Spitzer
Matt Cheah
Gengliang Wang
Priyanka Gomatam
Holden Karau

*Topics*:

   - DataFrameWriterV2 insert vs append (recap)
   - ANSI and strict modes for inserting casts
   - Separating identifier resolution from table lookup
   - Open PRs
      - SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
      - DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
      - TableProvider API update -
      https://github.com/apache/spark/pull/25651
      - UPDATE - https://github.com/apache/spark/pull/25626

*Discussion*:

   - DataFrameWriterV2 insert vs append: recapped the agreement from the
   last sync
   - ANSI and strict modes for inserting casts:
      - Russell: Failure modes are important. ANSI behavior is to fail at
      runtime, not at analysis time. If a cast is allowed but doesn’t throw
      an exception at runtime, then it can’t be considered ANSI behavior.
      - Gengliang: ANSI adds the cast
      - Matt: Sounds like there are two conflicting views of the world. Is
      the default ANSI behavior to insert a cast that may produce NULL, or
      to fail at runtime?
      - Ryan: So analysis and runtime behaviors can’t be separate?
      - Matt: Analysis behavior is influenced by behavior at runtime. Maybe
      the vote should cover both?
      - Russell: (linked to the standard) There are three steps: if the
      value is numeric and of the same type, use the data value. If the
      value can be rounded or truncated, round or truncate. Otherwise, throw
      an exception that the value can’t be cast. These are runtime
      requirements.
      - Ryan: Another consideration is that we can make Spark more
      permissive, but can’t make Spark more strict in future releases.
      - Matt: v1 silently corrupts data
      - Russell: ANSI is fine, as long as the runtime matches (is ANSI).
      Don’t tell people it’s ANSI and not do ANSI completely.
      - Gengliang: people are concerned about long-running jobs failing at
      the end
      - Ryan: That’s okay because they can change the defaults: use strict
      analysis-time validation, or allow casts to produce NULL values.
      - Matt: As long as this is well documented, it should be fine
      - Ryan: Can we run tests to find out what exactly the behavior is?
      - Gengliang: Suggested sqlfiddle.com
      - Russell ran tests in MySQL and Postgres. Both threw runtime
      failures. (A sketch of the behaviors under discussion follows these
      notes.)
      - Matt: Let’s move on, but add the runtime behavior to the VOTE
   - Identifier resolution and table lookup
      - Ryan: Recent changes merged identifier resolution and table lookup
      because identifiers owned by the session catalog need to be loaded to
      find out whether to use v1 or v2 plans. I think these steps should be
      separated so that identifier resolution happens independently;
      handling both tasks at the same time over-complicates the analyzer.
   - SHOW NAMESPACES - Ready for final review
   - DataFrameWriterV2:
      - Ryan: Tests failed after passing on the PR. Anyone know why that
      would happen?
      - Gengliang: Tests failed in the Maven build
      - Holden: PR validation runs the SBT tests (so Maven-only failures
      aren’t caught before merge)
   - TableProvider API update: skipped because Wenchen didn’t make it
   - UPDATE support PR
      - Ryan: There is a PR to add a SQL UPDATE command, but it delegates
      entirely to the data source, which seems strange.
      - Matt: What is Spark’s purpose here? Why would Spark parse a SQL
      statement only to pass it entirely to another engine?
      - Ryan: It does make sense to do this. If Spark eventually supports
      MERGE INTO and other row-level operations, then it makes sense to push
      down the operation to some sources, like JDBC. I just find it backward
      to add the pushdown API before adding an implementation that handles
      this inside Spark; pushdown is usually an optimization. (A rough
      sketch of what such a delegation hook could look like follows these
      notes.)
      - Russell: Would this be safe? Spark retries lots of operations.
      - Ryan: I think it would be safe because Spark won’t retry top-level
      operations and this is a single method call. Nothing would get retried.
      - Ryan: I’ll ask what the PR author’s use case is. Maybe that would
      help clarify why this is a good idea.
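
For reference, here is a minimal Scala sketch of the three insert behaviors
under discussion (permissive cast, ANSI runtime failure, strict analysis
failure). It assumes the store assignment policy setting that later surfaced
in Spark 3.x as spark.sql.storeAssignmentPolicy with values LEGACY, ANSI, and
STRICT; treat the config name, values, and exact behavior as assumptions for
illustration, not as anything this sync decided.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("insert-cast-sketch").getOrCreate()

    // A plain parquet table with a single INT column in the session catalog.
    spark.sql("CREATE TABLE target (id INT) USING parquet")

    // 3000000000 does not fit in an INT, so the analyzer must insert a cast.
    // Run one policy at a time to compare the outcomes.

    // LEGACY (roughly the 2.x behavior): the cast silently wraps the value,
    // the "v1 silently corrupts data" case from the notes.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
    spark.sql("INSERT INTO target VALUES (3000000000)")

    // ANSI: the cast is still added at analysis time, but it throws an
    // overflow error while the job runs, matching the MySQL/Postgres tests.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
    spark.sql("INSERT INTO target VALUES (3000000000)")

    // STRICT: the bigint-to-int assignment is rejected at analysis time,
    // before any tasks are launched.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "STRICT")
    spark.sql("INSERT INTO target VALUES (3000000000)")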
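
The UPDATE item above talks about delegating the whole operation to the
source. The trait below is purely hypothetical: it is not the API from the
PR and nothing by this name exists in Spark. It only illustrates the shape
of such a delegation hook and why a single call keeps retries out of the
picture.

    import org.apache.spark.sql.sources.Filter

    // Hypothetical only: a source that can execute an UPDATE itself (for
    // example a JDBC database) could expose a single entry point like this.
    // Spark would parse and resolve the statement, then make exactly one
    // call; since top-level operations are not retried, nothing here would
    // run twice.
    trait SupportsDelegatedUpdate {
      /**
       * Apply the assignments (column name -> SQL expression text) to every
       * row that matches all of the pushed-down filters.
       */
      def updateWhere(assignments: Map[String, String], filters: Array[Filter]): Unit
    }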

-- 
Ryan Blue
Software Engineer
Netflix

Re: DSv2 sync - 4 September 2019

Posted by Nicholas Chammas <ni...@gmail.com>.
Ah yes, on rereading the original email I see that the sync discussion was
different. Thanks for the clarification! I’ll file a JIRA about PERMISSIVE.

Re: DSv2 sync - 4 September 2019

Posted by Wenchen Fan <cl...@gmail.com>.
Hi Nicholas,

You are talking about a different thing. The PERMISSIVE mode is the failure
mode for reading text-based data sources (JSON, CSV, etc.). It's not the
general failure mode for Spark table insertion.

I agree with you that the PERMISSIVE mode is hard to use. Feel free to open
a JIRA ticket if you have some better ideas.

Thanks,
Wenchen

Re: DSv2 sync - 4 September 2019

Posted by Nicholas Chammas <ni...@gmail.com>.
A quick question about failure modes, as a casual observer of the DSv2
effort:

I was considering filing a JIRA ticket about enhancing the DataFrameReader
to include the failure *reason* in addition to the corrupt record when the
mode is PERMISSIVE. So if you are loading a CSV, for example, and a value
cannot be automatically cast to the type you specify in the schema, you'll
get the corrupt record in the column configured by
columnNameOfCorruptRecord, but you'll also get some detail about what
exactly made the record corrupt, perhaps in a new column specified by
something like columnNameOfCorruptReason.

Is this an enhancement that would be possible in DSv2?
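
For context, here is a short Scala sketch of how PERMISSIVE mode and
columnNameOfCorruptRecord behave today. The path and column names are
placeholders, and columnNameOfCorruptReason is only the idea proposed above,
not an existing option.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("permissive-sketch").getOrCreate()

    // The corrupt-record column must be declared in the schema to be filled in.
    val schema = new StructType()
      .add("id", IntegerType)
      .add("amount", DoubleType)
      .add("_corrupt_record", StringType)

    val df = spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("/path/to/input.csv")   // placeholder path

    // Malformed rows keep the raw line in _corrupt_record and get NULLs in
    // the typed columns, but nothing records *why* the row was rejected --
    // that is the gap a columnNameOfCorruptReason-style option would fill.
    df.filter("_corrupt_record IS NOT NULL").show(truncate = false)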
