
Posted to user@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2017/02/02 15:39:26 UTC

frustration with field names in Dataset

since a dataset is a typed object you ideally don't have to think about
field names.

however there are operations on Dataset that require you to provide a
Column, like for example joinWith (and joinWith returns a strongly typed
Dataset, not DataFrame). once you have to provide a Column you are back to
thinking in field names, and worrying about duplicate field names, which is
something that can easily happen in a Dataset without you realizing it.
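One way duplicate field names can arise, as a minimal sketch: Spark's documentation says that `as[U]` maps columns by ordinal when `U` is a tuple, so a tuple-typed Dataset can keep two columns with the same name. The object name `DuplicateFieldNames`, the method `demo`, and the column name `id` are made up for illustration; the by-name lookup is wrapped in `Try` because it is expected (not guaranteed here) to be rejected as ambiguous.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import scala.util.Try

object DuplicateFieldNames {
  // Returns the Dataset's column names and whether a by-name lookup failed.
  def demo(spark: SparkSession): (Seq[String], Boolean) = {
    import spark.implicits._

    // Tuple encoders bind columns by position, so both columns may be called "id".
    val ds: Dataset[(Int, Int)] =
      Seq((1, 10), (2, 20)).toDF("id", "id").as[(Int, Int)]

    // The typed view still works, but asking for a Column by name
    // cannot tell the two "id" columns apart.
    val lookupFailed = Try(ds("id")).isFailure

    (ds.columns.toSeq, lookupFailed)
  }
}
```

Typed operations such as `map` would still see `(Int, Int)` values; only the Column-based operations run into the name clash.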

so under the hood Dataset has unique identifiers for every column, as in
dataset.queryExecution.logical.output, but these are expressions
(attributes) that i cannot turn back into columns since the constructors
for this are private in spark.
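The unique identifiers mentioned above can be inspected roughly as follows. This is a sketch under assumptions: it reads `queryExecution.analyzed.output` rather than `logical.output` so that every attribute is resolved, and it relies on Catalyst classes that are developer-facing rather than a stable public API. `ColumnIdentifiers` and `describe` are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession

object ColumnIdentifiers {
  // Returns (name, expression id) for each output attribute of the analyzed plan.
  def describe(spark: SparkSession): Seq[(String, Long)] = {
    import spark.implicits._

    // Two columns that share the name "id".
    val df = Seq((1, 10)).toDF("id", "id")

    // Each attribute carries its own exprId even when the names collide.
    df.queryExecution.analyzed.output.map(a => (a.name, a.exprId.id))
  }
}
```

The ids tell the two columns apart, but there is no public way shown here to turn such an attribute back into a `Column`, which is the gap the message describes.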

so.... how about having Dataset.apply(i: Int): Column to allow me to pick
columns by position without having to worry about (duplicate) field names?
then i could do something like:

dataset.joinWith(otherDataset, dataset(0) === otherDataset(0), joinType)
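Until something like `Dataset.apply(i: Int)` exists, a similar effect can be approximated for tuple-typed Datasets with public API only. The sketch below renames the columns positionally with `toDF`, re-applies the tuple type with `as` (which maps tuples by ordinal), and then joins on the now-unique names. `PositionalJoin`, `joinOnFirst`, and the column names are invented for this illustration; it is a workaround, not the proposed API.

```scala
import org.apache.spark.sql.{Dataset, Encoder}

object PositionalJoin {
  // Joins two pair Datasets on their first column, ignoring the original field names.
  def joinOnFirst[K, A, B](
      left: Dataset[(K, A)],
      right: Dataset[(K, B)],
      joinType: String)(
      implicit le: Encoder[(K, A)],
      re: Encoder[(K, B)]): Dataset[((K, A), (K, B))] = {
    // toDF assigns names by position, so any duplicate names are overwritten.
    val l = left.toDF("left_key", "left_value").as[(K, A)]
    val r = right.toDF("right_key", "right_value").as[(K, B)]

    // The join condition can now refer to unambiguous names.
    l.joinWith(r, l("left_key") === r("right_key"), joinType)
  }
}
```

This only covers tuples: for case classes `as` resolves fields by name, so renaming the columns would break the mapping.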

Re: frustration with field names in Dataset

Posted by Koert Kuipers <ko...@tresata.com>.

great, it's an easy fix. i will create a jira and a pull request

On Thu, Feb 2, 2017 at 2:13 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> That might be reasonable.  At least I can't think of any problems with
> doing that.

Re: frustration with field names in Dataset

Posted by Michael Armbrust <mi...@databricks.com>.

That might be reasonable.  At least I can't think of any problems with
doing that.


Re: frustration with field names in Dataset

Posted by Koert Kuipers <ko...@tresata.com>.

another example is if i have a Dataset[(K, V)] and i want to repartition it
by the key K. repartition requires a Column which means i am suddenly back
to worrying about duplicate field names. i would like to be able to say:

dataset.repartition(dataset(0))
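With today's public API, repartitioning a `Dataset[(K, V)]` by its key without trusting the existing field names might look like the sketch below. It assigns fresh names by position with `toDF`, restores the tuple type with `as` (tuples are mapped by ordinal), and repartitions on the renamed key. `KeyRepartition`, `repartitionByKey`, and the names `key` and `value` are assumptions made for the example.

```scala
import org.apache.spark.sql.{Dataset, Encoder}

object KeyRepartition {
  // Repartitions a pair Dataset by its first column, whatever that column is called.
  def repartitionByKey[K, V](ds: Dataset[(K, V)])(
      implicit enc: Encoder[(K, V)]): Dataset[(K, V)] = {
    // Positional rename: the first column becomes "key", the second "value".
    val renamed = ds.toDF("key", "value").as[(K, V)]

    // repartition takes Columns; "key" is now guaranteed to be unique.
    renamed.repartition(renamed("key"))
  }
}
```

Note that the returned Dataset carries the new column names, which is a side effect the proposed `dataset(0)` would avoid.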
