Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2016/05/25 20:20:21 UTC

Re: feedback on dataset api explode

Created JIRA ticket: https://issues.apache.org/jira/browse/SPARK-15533

@Koert - Please keep API feedback coming. One thing - in the future, can
you send API feedback to the dev@ list instead of user@?



On Wed, May 25, 2016 at 1:05 PM, Cheng Lian <li...@databricks.com> wrote:

> Agree, since they can be easily replaced by .flatMap (to do the explosion) and
> .select (to rename output columns)
>
> Cheng
>
>
> On 5/25/16 12:30 PM, Reynold Xin wrote:
>
> Based on this discussion I'm thinking we should deprecate the two explode
> functions.
>
>> On Wednesday, May 25, 2016, Koert Kuipers <koert@tresata.com> wrote:
>
>> wenchen,
>> that definition of explode seems identical to flatMap, so you don't need
>> it either?
>>
>> michael,
>> i didn't know about the column expression version of explode, that makes
>> sense. i will experiment with that instead.
>>
>> On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan <we...@databricks.com>
>> wrote:
>>
>>> I think we only need this version:  `def explode[B : Encoder](f: A
>>> => TraversableOnce[B]): Dataset[B]`
>>>
>>> For untyped one, `df.select(explode($"arrayCol").as("item"))` should be
>>> the best choice.
>>>
>>> On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust <
>>> michael@databricks.com> wrote:
>>>
>>>> These APIs predate Datasets / encoders, so that is why they are Row
>>>> instead of objects.  We should probably rethink that.
>>>>
>>>> Honestly, I usually end up using the column expression version of
>>>> explode now that it exists (i.e. explode($"arrayCol").as("Item")).  It
>>>> would be great to understand more why you are using these instead.
>>>>
>>>> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> we currently have 2 explode definitions in Dataset:
>>>>>
>>>>>  def explode[A <: Product : TypeTag](input: Column*)(f: Row =>
>>>>> TraversableOnce[A]): DataFrame
>>>>>
>>>>>  def explode[A, B : TypeTag](inputColumn: String, outputColumn:
>>>>> String)(f: A => TraversableOnce[B]): DataFrame
>>>>>
>>>>> 1) the separation of the functions into their own argument lists is
>>>>> nice, but unfortunately scala's type inference doesn't handle this well,
>>>>> meaning that the generic types always have to be explicitly provided. i
>>>>> assume this was done to allow the "input" to be a varargs in the first
>>>>> method, and then kept the same in the second for reasons of symmetry.
>>>>>
>>>>> 2) i am surprised the first definition returns a DataFrame. this seems
>>>>> to suggest DataFrame usage (so DataFrame to DataFrame), but there is no way
>>>>> to specify the output column names, which limits its usability for
>>>>> DataFrames. i frequently end up using the first definition for DataFrames
>>>>> anyhow because of the need to return more than 1 column (and the data has
>>>>> columns unknown at compile time that i need to carry along making flatMap
>>>>> on Dataset clumsy/unusable), but relying on the output columns being called
>>>>> _1 and _2 and renaming them afterwards seems like an anti-pattern.
>>>>>
>>>>> 3) using Row objects isn't very pretty. why not f: A =>
>>>>> TraversableOnce[B] or something like that for the first definition? how
>>>>> about:
>>>>>  def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output:
>>>>> Seq[Column])(f: A => TraversableOnce[B]): DataFrame
>>>>>
>>>>> best,
>>>>> koert
>>>>>
>>>>
>>>>
>>>
>>
>
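
For reference, a minimal sketch of the two explode definitions quoted above, written against a toy DataFrame. The SparkSession setup, the schema, and the column names (id, words, word, length) are illustrative assumptions, not part of the thread; the sketch assumes a Spark version (roughly the 2.0 line) in which both SparkSession and these Dataset explode methods exist.

    import org.apache.spark.sql.{DataFrame, Row, SparkSession}

    // Hypothetical toy data; names and schema are assumptions for illustration.
    val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
    import spark.implicits._
    val df: DataFrame = Seq((1, "a b c"), (2, "d e")).toDF("id", "words")

    // First definition: Column* input and f: Row => TraversableOnce[A].
    // The type argument is given explicitly (point 1 above), and the new columns
    // come back as _1 and _2, so they are renamed afterwards (point 2 above).
    val exploded: DataFrame = df.explode[(String, Int)]($"words") { row: Row =>
      row.getAs[String]("words").split(" ").toSeq.map(w => (w, w.length))
    }
    val renamed = exploded
      .withColumnRenamed("_1", "word")
      .withColumnRenamed("_2", "length")

    // Second definition: one input column, one output column, f: A => TraversableOnce[B].
    val words: DataFrame = df.explode[String, String]("words", "word")(_.split(" ").toSeq)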

Re: feedback on dataset api explode

Posted by Koert Kuipers <ko...@tresata.com>.
oh yes, this was by accident, it should have gone to dev

On Wed, May 25, 2016 at 4:20 PM, Reynold Xin <rx...@databricks.com> wrote:

> Created JIRA ticket: https://issues.apache.org/jira/browse/SPARK-15533
>
> @Koert - Please keep API feedback coming. One thing - in the future, can
> you send API feedback to the dev@ list instead of user@?
>
>
>
> On Wed, May 25, 2016 at 1:05 PM, Cheng Lian <li...@databricks.com> wrote:
>
>> Agree, since they can be easily replaced by .flatMap (to do the explosion)
>> and .select (to rename output columns)
>>
>> Cheng
>>
>>
>> On 5/25/16 12:30 PM, Reynold Xin wrote:
>>
>> Based on this discussion I'm thinking we should deprecate the two explode
>> functions.
>>
>> On Wednesday, May 25, 2016, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> wenchen,
>>> that definition of explode seems identical to flatMap, so you don't need
>>> it either?
>>>
>>> michael,
>>> i didn't know about the column expression version of explode, that makes
>>> sense. i will experiment with that instead.
>>>
>>> On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan <we...@databricks.com>
>>> wrote:
>>>
>>>> I think we only need this version:  `def explode[B : Encoder](f: A
>>>> => TraversableOnce[B]): Dataset[B]`
>>>>
>>>> For untyped one, `df.select(explode($"arrayCol").as("item"))` should be
>>>> the best choice.
>>>>
>>>> On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>>
>>>>> These APIs predate Datasets / encoders, so that is why they are Row
>>>>> instead of objects.  We should probably rethink that.
>>>>>
>>>>> Honestly, I usually end up using the column expression version of
>>>>> explode now that it exists (i.e. explode($"arrayCol").as("Item")).
>>>>> It would be great to understand more why you are using these instead.
>>>>>
>>>>> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers <ko...@tresata.com>
>>>>> wrote:
>>>>>
>>>>>> we currently have 2 explode definitions in Dataset:
>>>>>>
>>>>>>  def explode[A <: Product : TypeTag](input: Column*)(f: Row =>
>>>>>> TraversableOnce[A]): DataFrame
>>>>>>
>>>>>>  def explode[A, B : TypeTag](inputColumn: String, outputColumn:
>>>>>> String)(f: A => TraversableOnce[B]): DataFrame
>>>>>>
>>>>>> 1) the separation of the functions into their own argument lists is
>>>>>> nice, but unfortunately scala's type inference doesn't handle this well,
>>>>>> meaning that the generic types always have to be explicitly provided. i
>>>>>> assume this was done to allow the "input" to be a varargs in the first
>>>>>> method, and then kept the same in the second for reasons of symmetry.
>>>>>>
>>>>>> 2) i am surprised the first definition returns a DataFrame. this
>>>>>> seems to suggest DataFrame usage (so DataFrame to DataFrame), but there is
>>>>>> no way to specify the output column names, which limits its usability for
>>>>>> DataFrames. i frequently end up using the first definition for DataFrames
>>>>>> anyhow because of the need to return more than 1 column (and the data has
>>>>>> columns unknown at compile time that i need to carry along making flatMap
>>>>>> on Dataset clumsy/unusable), but relying on the output columns being called
>>>>>> _1 and _2 and renaming them afterwards seems like an anti-pattern.
>>>>>>
>>>>>> 3) using Row objects isn't very pretty. why not f: A =>
>>>>>> TraversableOnce[B] or something like that for the first definition? how
>>>>>> about:
>>>>>>  def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output:
>>>>>> Seq[Column])(f: A => TraversableOnce[B]): DataFrame
>>>>>>
>>>>>> best,
>>>>>> koert
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
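
For reference, a minimal sketch of the replacements discussed above: the untyped column-expression explode from org.apache.spark.sql.functions with .as to name the output column, and a typed flatMap followed by a rename. The SparkSession setup, the schema, and the column names (id, arrayCol, item) are illustrative assumptions, not part of the thread.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.explode

    val spark = SparkSession.builder().master("local[*]").appName("explode-replacement").getOrCreate()
    import spark.implicits._

    // Hypothetical DataFrame with an array column to be exploded.
    val df = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "arrayCol")

    // Untyped: the column-expression version, with .as naming the output column.
    val untyped = df.select($"id", explode($"arrayCol").as("item"))

    // Typed: flatMap on a Dataset does the explosion; toDF renames the output columns.
    val typed = df.as[(Int, Seq[String])]
      .flatMap { case (id, items) => items.map(item => (id, item)) }
      .toDF("id", "item")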
