Posted to user@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2016/06/07 19:30:42 UTC

setting column names on dataset

for some operators on Dataset, like joinWith, one needs to use an
expression which means referring to columns by name.

how can i set the column names for a Dataset before doing a joinWith?

currently i am aware of:
df.toDF("k", "v").as[(K, V)]

but that seems inefficient/anti-pattern? i shouldn't have to go to a
DataFrame and back to set the column names?

or if this is the only way to set names, and column names really shouldn't
be used in Datasets, can i perhaps refer to the columns by their position?

thanks, koert
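[Editor's note: on the positional idea above, a Dataset typed as a tuple does name its columns `_1` and `_2`, so positional-style references like `$"_1"` work after `.as[(K, V)]`. A Spark-free sketch of the round-trip the question describes, using plain Scala collections (the case class `KV` is a hypothetical stand-in for the row type, not anything from Spark):]

```scala
// Plain-Scala analogue of df.toDF("k", "v").as[(K, V)]:
// a row type with named fields is viewed as a (key, value) tuple,
// after which the fields are reached positionally as _1 / _2.
case class KV(k: String, v: Int) // hypothetical row type

val rows = Seq(KV("a", 1), KV("b", 2))

// the ".as[(K, V)]" step: forget the field names, keep the positions
val pairs: Seq[(String, Int)] = rows.map(r => (r.k, r.v))

// positional access, like $"_1" / $"_2" on a tuple-typed Dataset
val firstKey = pairs.head._1
val firstVal = pairs.head._2
```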

Re: setting column names on dataset

Posted by Koert Kuipers <ko...@tresata.com>.
That's neat
On Jun 7, 2016 4:34 PM, "Jacek Laskowski" <ja...@japila.pl> wrote:

> Hi,
>
> What about this?
>
> scala> final case class Person(name: String, age: Int)
> warning: there was one unchecked warning; re-run with -unchecked for
> details
> defined class Person
>
> scala> val ds = Seq(Person("foo", 42), Person("bar", 24)).toDS
> ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
>
> scala> ds.as("a").joinWith(ds.as("b"), $"a.name" === $"b.name").show(false)
> +--------+--------+
> |_1      |_2      |
> +--------+--------+
> |[foo,42]|[foo,42]|
> |[bar,24]|[bar,24]|
> +--------+--------+
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>

Re: setting column names on dataset

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi,

What about this?

scala> final case class Person(name: String, age: Int)
warning: there was one unchecked warning; re-run with -unchecked for details
defined class Person

scala> val ds = Seq(Person("foo", 42), Person("bar", 24)).toDS
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

scala> ds.as("a").joinWith(ds.as("b"), $"a.name" === $"b.name").show(false)
+--------+--------+
|_1      |_2      |
+--------+--------+
|[foo,42]|[foo,42]|
|[bar,24]|[bar,24]|
+--------+--------+
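[Editor's note: the shape of the result above can be mimicked outside Spark with a for-comprehension over two collections: joinWith pairs up matching rows as tuples rather than flattening them into one row, which is why the columns come out named _1 and _2. A plain-Scala sketch (an analogue of the semantics, not the Spark API; no cluster needed):]

```scala
case class Person(name: String, age: Int)

val ds = Seq(Person("foo", 42), Person("bar", 24))

// joinWith analogue: inner join of ds with itself on name,
// yielding (left, right) tuples instead of a flattened row
val joined: Seq[(Person, Person)] =
  for {
    a <- ds
    b <- ds
    if a.name == b.name
  } yield (a, b)
```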

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org