Posted to user@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2016/06/07 19:30:42 UTC
setting column names on dataset
For some operators on Dataset, like joinWith, one needs to use an
expression, which means referring to columns by name.
How can I set the column names for a Dataset before doing a joinWith?
Currently I am aware of:
df.toDF("k", "v").as[(K, V)]
but that seems inefficient and like an anti-pattern; I shouldn't have to go
to a DataFrame and back just to set the column names.
Or, if this is the only way to set names and column names really shouldn't
be used in Datasets, can I perhaps refer to the columns by their position?
thanks, koert
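Concretely, a spark-shell sketch of the round-trip plus the positional alternative I have in mind (the sample data and the names k and v are just illustrative):

```scala
// Assumes a spark-shell session (spark.implicits._ is pre-imported there).
// A Dataset of pairs gets the default column names _1 and _2.
val ds = Seq(("a", 1), ("b", 2)).toDS()

// The DataFrame round-trip: rename the columns, then recover the typed Dataset.
val named = ds.toDF("k", "v").as[(String, Int)]

// Referring to the tuple columns via their default positional names
// avoids choosing new names at all:
val joined = ds.joinWith(named, ds("_1") === named("k"))
```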
Re: setting column names on dataset
Posted by Koert Kuipers <ko...@tresata.com>.
That's neat
On Jun 7, 2016 4:34 PM, "Jacek Laskowski" <ja...@japila.pl> wrote:
Re: setting column names on dataset
Posted by Jacek Laskowski <ja...@japila.pl>.
Hi,
What about this?
scala> final case class Person(name: String, age: Int)
warning: there was one unchecked warning; re-run with -unchecked for details
defined class Person
scala> val ds = Seq(Person("foo", 42), Person("bar", 24)).toDS
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
scala> ds.as("a").joinWith(ds.as("b"), $"a.name" === $"b.name").show(false)
+--------+--------+
|_1 |_2 |
+--------+--------+
|[foo,42]|[foo,42]|
|[bar,24]|[bar,24]|
+--------+--------+
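The tuple result of joinWith can then be mapped back into a typed Dataset; a sketch continuing the session above (Pair is an illustrative name, not something from the API):

```scala
// joinWith returns Dataset[(Person, Person)]; map flattens it into a case class.
case class Pair(left: Person, right: Person)

val joined = ds.as("a").joinWith(ds.as("b"), $"a.name" === $"b.name")
val pairs  = joined.map { case (l, r) => Pair(l, r) }
```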
Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org