Posted to user@spark.apache.org by Luis Guerra <lu...@gmail.com> on 2014/07/17 10:15:08 UTC

class after join

Hi all,

I am a newbie Spark user with many doubts, so sorry if this is a "silly"
question.

I am dealing with tabular data formatted as text files, so when I first
load the data, my code is like this:

case class data_class(
   V1: String,
   V2: String,
   V3: String,
   V4: String,
   V5: String,
   V6: String,
   V7: String)

val data = sc.textFile(data_path)
  .map { x =>
    // Appending " " stops split from dropping trailing empty fields
    val fields = (x + " ").split("\t")
    data_class(fields(0).trim, fields(1).trim, fields(2).trim, fields(3).trim,
      fields(4).trim, fields(5).trim, fields(6).trim)
  }

I am doing this because I would like to access each field by its
name (V1...V7). Is there any other way of doing this?

Related to this question, if I have data with more than 22 variables,
I am restricted to using a regular class instead of a case class. However,
this kind of solution has many restrictions, mainly related to getter
methods. Is there any other way of doing this?

Finally, one of my main problems comes after operations that combine
different datasets. For instance, if I have two datasets (data1 and
data2) and I want to join them as:

val data3 = data1.keyBy(_.V1).leftOuterJoin(data2.keyBy(_.V1))

Then I have to post-process data3 to obtain a new class that contains
the variables from data1 and also the variables from data2. As data3 is
(key, (data1, data2)), do I have to create a new, different class with all
the attributes from data1 and data2? This is rather annoying when there
are many attributes.

Thanks in advance,

Best

Re: class after join

Posted by Michael Armbrust <mi...@databricks.com>.

If you intern the strings it will be more efficient, but still significantly
more expensive than the class-based approach.
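
As a rough illustration of the interning idea (not code from this thread;
InternKeys and internKeys are hypothetical names), the field-name keys of a
Map[String, String] row can be passed through String.intern() so that rows
refer to one shared String per field name rather than each holding its own
copy:

```scala
object InternKeys {
  // Replace each key with its canonical interned instance, so that
  // rows share one String object per field name.
  def internKeys(row: Map[String, String]): Map[String, String] =
    row.map { case (k, v) => (k.intern(), v) }
}
```

This only addresses the duplicated key strings. Each row still carries the
overhead of a Map, which is presumably why the class-based approach remains
cheaper.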

** VERY EXPERIMENTAL **
We are working with EPFL on a lightweight syntax for naming the results of
Spark transformations in Scala (and are going to make it interoperate with
SQL).  Sparse details here: https://github.com/scala-records/scala-records

Stay tuned for more...

Michael


On Thu, Jul 17, 2014 at 4:47 AM, Luis Guerra <lu...@gmail.com> wrote:

> Thank you for your fast reply.
>
> We are considering this Map[String, String] solution, but there are some
> details that we have not worked out yet. What would happen if we have
> different data types for different fields? Also, with this solution, we
> have to repeat the field names for every "row" that we have. Is this efficient?
>
> Regarding the solution with composition, the key would be repeated in the
> new class, whereas it is only necessary once after the join, isn't it?

Re: class after join

Posted by Luis Guerra <lu...@gmail.com>.

Thank you for your fast reply.

We are considering this Map[String, String] solution, but there are some
details that we have not worked out yet. What would happen if we have
different data types for different fields? Also, with this solution, we
have to repeat the field names for every "row" that we have. Is this efficient?

Regarding the solution with composition, the key would be repeated in the
new class, whereas it is only necessary once after the join, isn't it?


On Thu, Jul 17, 2014 at 10:25 AM, Sean Owen <so...@cloudera.com> wrote:

> If what you have is a large number of named strings, why not use a
> Map[String,String] to represent them? If you're approaching a class
> with >22 String fields anyway, it probably makes more sense. You lose
> a bit of compile-time checking, but gain flexibility.
>
> Also, merging two Maps to make a new one is pretty simple, compared to
> making many of these value classes.
>
> (Although, if you otherwise needed a class that represented "all of
> the things in class A and class B", this could be done easily with
> composition, a class with an A and a B inside.)

Re: class after join

Posted by Sean Owen <so...@cloudera.com>.

If what you have is a large number of named strings, why not use a
Map[String,String] to represent them? If you're approaching a class
with >22 String fields anyway, it probably makes more sense. You lose
a bit of compile-time checking, but gain flexibility.
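
A minimal sketch of what this could look like, assuming tab-separated input
as in the original post. RowMaps, fieldNames and parseLine are hypothetical
names used only for illustration; the column names V1...V7 come from the
question.

```scala
object RowMaps {
  // Column names in file order (V1 ... V7, as in the original post).
  val fieldNames: Seq[String] = (1 to 7).map(i => s"V$i")

  // Parse one tab-separated line into a Map keyed by column name.
  // The limit -1 makes split keep trailing empty fields.
  def parseLine(line: String): Map[String, String] =
    fieldNames.zip(line.split("\t", -1).map(_.trim)).toMap
}
```

With Spark this would presumably be used as
sc.textFile(data_path).map(RowMaps.parseLine), and a field is then read as
row("V3"). A misspelled field name fails at runtime rather than at compile
time, which is the loss of checking mentioned above.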

Also, merging two Maps to make a new one is pretty simple, compared to
making many of these value classes.
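
For instance, a sketch of the merge step, assuming both sides of the join are
Map[String, String] rows. MergeRows and merge are hypothetical names; the
right side is an Option because leftOuterJoin yields (key, (left, Option[right])).

```scala
object MergeRows {
  type Row = Map[String, String]

  // Combine the two sides of a left outer join into a single row.
  // On a name clash the right-hand value wins, because ++ keeps the later entry.
  def merge(left: Row, right: Option[Row]): Row =
    left ++ right.getOrElse(Map.empty[String, String])
}
```

After the join this could be applied with something like
data3.mapValues { case (l, r) => MergeRows.merge(l, r) }. Since both rows
contain the join key under the same name, it appears only once in the result.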

(Although, if you otherwise needed a class that represented "all of
the things in class A and class B", this could be done easily with
composition, a class with an A and a B inside.)
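
A sketch of the composition idea, with hypothetical stand-in classes A and B
(two fields each, to keep it short) and a hypothetical wrapper AB:

```scala
object Composition {
  // Hypothetical stand-ins for the two record types being joined.
  case class A(V1: String, V2: String)
  case class B(V1: String, W2: String)

  // One wrapper instead of a flat class repeating every attribute.
  // b is an Option because a left outer join may find no match.
  case class AB(a: A, b: Option[B])

  // Turn one element of the joined data, (key, (a, maybeB)), into an AB.
  def fromJoin(entry: (String, (A, Option[B]))): AB = entry match {
    case (_, (a, b)) => AB(a, b)
  }
}
```

Fields are then reached as ab.a.V2 or ab.b.map(_.W2), so no attribute has to
be declared a second time.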
