You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Ilya Karpov <i....@cleverdata.ru> on 2015/08/24 11:21:40 UTC
Joining using mulitimap or array
Hi, guys
I'm confused about joining columns in SparkSQL and need your advice.
I want to join 2 datasets of profiles. Each profile has name and array of attributes(age, gender, email etc).
There can be mutliple instances of attribute with the same name, e.g. profile has 2 emails - so 2 attributes with name = 'email' in
array. Now I want to join 2 datasets using 'email' attribute. I cant find the way to do it :(
The code is below. Now result of join is empty, while I expect to see 1 row with all Alice emails.
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
case class Attribute(name: String, value: String, weight: Float)
case class Profile(name: String, attributes: Seq[Attribute])
object SparkJoinArrayColumn {
def main(args: Array[String]) {
val sc: SparkContext = new SparkContext(new SparkConf().setMaster("local").setAppName(getClass.getSimpleName))
val sqlContext: SQLContext = new SQLContext(sc)
import sqlContext.implicits._
val a: DataFrame = sc.parallelize(Seq(
Profile("Alice", Seq(Attribute("email", "alice@mail.com", 1.0f), Attribute("email", "a.jones@mail.com", 1.0f)))
)).toDF.as("a")
val b: DataFrame = sc.parallelize(Seq(
Profile("Alice", Seq(Attribute("email", "alice@mail.com", 1.0f), Attribute("age", "29", 0.2f)))
)).toDF.as("b")
a.where($"a.attributes.name" === "email")
.join(
b.where($"b.attributes.name" === "email"),
$"a.attributes.value" === $"b.attributes.value"
)
.show()
}
}
Thanks forward!
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: Joining using mulitimap or array
Posted by Hemant Bhanawat <he...@gmail.com>.
In your example, a.attributes.name is a list and is not a string . Run this
to find it out :
a.select($"a.attributes.name").show()
On Mon, Aug 24, 2015 at 2:51 PM, Ilya Karpov <i....@cleverdata.ru> wrote:
> Hi, guys
> I'm confused about joining columns in SparkSQL and need your advice.
> I want to join 2 datasets of profiles. Each profile has name and array of
> attributes(age, gender, email etc).
> There can be mutliple instances of attribute with the same name, e.g.
> profile has 2 emails - so 2 attributes with name = 'email' in
> array. Now I want to join 2 datasets using 'email' attribute. I cant find
> the way to do it :(
>
> The code is below. Now result of join is empty, while I expect to see 1
> row with all Alice emails.
>
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.{SparkConf, SparkContext}
>
> case class Attribute(name: String, value: String, weight: Float)
> case class Profile(name: String, attributes: Seq[Attribute])
>
> object SparkJoinArrayColumn {
> def main(args: Array[String]) {
> val sc: SparkContext = new SparkContext(new
> SparkConf().setMaster("local").setAppName(getClass.getSimpleName))
> val sqlContext: SQLContext = new SQLContext(sc)
>
> import sqlContext.implicits._
>
> val a: DataFrame = sc.parallelize(Seq(
> Profile("Alice", Seq(Attribute("email", "alice@mail.com", 1.0f),
> Attribute("email", "a.jones@mail.com", 1.0f)))
> )).toDF.as("a")
>
> val b: DataFrame = sc.parallelize(Seq(
> Profile("Alice", Seq(Attribute("email", "alice@mail.com", 1.0f),
> Attribute("age", "29", 0.2f)))
> )).toDF.as("b")
>
>
> a.where($"a.attributes.name" === "email")
> .join(
> b.where($"b.attributes.name" === "email"),
> $"a.attributes.value" === $"b.attributes.value"
> )
> .show()
> }
> }
>
> Thanks forward!
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>
Re: Joining using mulitimap or array
Posted by Ilya Karpov <i....@cleverdata.ru>.
Thanks,
but I think this is not the case of multiple spark contexts (never the less I tried your suggestion - didn’t worked). The problem is join to datasets using array items value: attribute.value in my case. Has anyone ideas?
> 24 авг. 2015 г., в 15:01, satish chandra j <js...@gmail.com> написал(а):
>
> Hi,
> If you join logic is correct, it seems to be a similar issue which i faced recently
>
> Can you try by SparkContext(conf).set("spark.driver.allowMultipleContexts","true")
>
> Regards,
> Satish Chandra
>
> On Mon, Aug 24, 2015 at 2:51 PM, Ilya Karpov <i.karpov@cleverdata.ru <ma...@cleverdata.ru>> wrote:
> Hi, guys
> I'm confused about joining columns in SparkSQL and need your advice.
> I want to join 2 datasets of profiles. Each profile has name and array of attributes(age, gender, email etc).
> There can be mutliple instances of attribute with the same name, e.g. profile has 2 emails - so 2 attributes with name = 'email' in
> array. Now I want to join 2 datasets using 'email' attribute. I cant find the way to do it :(
>
> The code is below. Now result of join is empty, while I expect to see 1 row with all Alice emails.
>
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.{SparkConf, SparkContext}
>
> case class Attribute(name: String, value: String, weight: Float)
> case class Profile(name: String, attributes: Seq[Attribute])
>
> object SparkJoinArrayColumn {
> def main(args: Array[String]) {
> val sc: SparkContext = new SparkContext(new SparkConf().setMaster("local").setAppName(getClass.getSimpleName))
> val sqlContext: SQLContext = new SQLContext(sc)
>
> import sqlContext.implicits._
>
> val a: DataFrame = sc.parallelize(Seq(
> Profile("Alice", Seq(Attribute("email", "alice@mail.com <ma...@mail.com>", 1.0f), Attribute("email", "a.jones@mail.com <ma...@mail.com>", 1.0f)))
> )).toDF.as("a")
>
> val b: DataFrame = sc.parallelize(Seq(
> Profile("Alice", Seq(Attribute("email", "alice@mail.com <ma...@mail.com>", 1.0f), Attribute("age", "29", 0.2f)))
> )).toDF.as("b")
>
>
> a.where($"a.attributes.name <http://a.attributes.name/>" === "email")
> .join(
> b.where($"b.attributes.name <http://b.attributes.name/>" === "email"),
> $"a.attributes.value" === $"b.attributes.value"
> )
> .show()
> }
> }
>
> Thanks forward!
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
> For additional commands, e-mail: user-help@spark.apache.org <ma...@spark.apache.org>
>
>
Re: Joining using mulitimap or array
Posted by satish chandra j <js...@gmail.com>.
Hi,
If you join logic is correct, it seems to be a similar issue which i faced
recently
Can you try by
*SparkContext(conf).set("spark.driver.allowMultipleContexts","true")*
Regards,
Satish Chandra
On Mon, Aug 24, 2015 at 2:51 PM, Ilya Karpov <i....@cleverdata.ru> wrote:
> Hi, guys
> I'm confused about joining columns in SparkSQL and need your advice.
> I want to join 2 datasets of profiles. Each profile has name and array of
> attributes(age, gender, email etc).
> There can be mutliple instances of attribute with the same name, e.g.
> profile has 2 emails - so 2 attributes with name = 'email' in
> array. Now I want to join 2 datasets using 'email' attribute. I cant find
> the way to do it :(
>
> The code is below. Now result of join is empty, while I expect to see 1
> row with all Alice emails.
>
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.{SparkConf, SparkContext}
>
> case class Attribute(name: String, value: String, weight: Float)
> case class Profile(name: String, attributes: Seq[Attribute])
>
> object SparkJoinArrayColumn {
> def main(args: Array[String]) {
> val sc: SparkContext = new SparkContext(new
> SparkConf().setMaster("local").setAppName(getClass.getSimpleName))
> val sqlContext: SQLContext = new SQLContext(sc)
>
> import sqlContext.implicits._
>
> val a: DataFrame = sc.parallelize(Seq(
> Profile("Alice", Seq(Attribute("email", "alice@mail.com", 1.0f),
> Attribute("email", "a.jones@mail.com", 1.0f)))
> )).toDF.as("a")
>
> val b: DataFrame = sc.parallelize(Seq(
> Profile("Alice", Seq(Attribute("email", "alice@mail.com", 1.0f),
> Attribute("age", "29", 0.2f)))
> )).toDF.as("b")
>
>
> a.where($"a.attributes.name" === "email")
> .join(
> b.where($"b.attributes.name" === "email"),
> $"a.attributes.value" === $"b.attributes.value"
> )
> .show()
> }
> }
>
> Thanks forward!
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>