Posted to user@spark.apache.org by Shay Seng <sh...@1618labs.com> on 2013/11/15 19:10:49 UTC

DataFrame RDDs

Hi,

Is there some way to get R-style data.frame structures into RDDs? I've
been using RDD[Seq[_]], but this is getting quite error-prone, and the code
gets pretty hard to read, especially after a few joins, maps, etc.

Rather than access columns by index, I would prefer to access them by name.
e.g. instead of writing:
myrdd.map(l => Seq(l(0), l(1), l(4), l(9)))
I would prefer to write
myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))

Also joins are particularly irritating. Currently I have to first construct
a pair:
somePairRdd.join(myrdd.map(l => ((l(1), l(2)), (l(0), l(1), l(2), l(3)))))
Now I have to unzip away the join key and remap the values into a Seq.

Instead I would rather write
someDataFrame.join(myrdd, l => (l.entryTime, l.exitTime))
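(Editor's aside: a minimal plain-Scala sketch of the named-access style wished for above; the Trip class, its fields, and the column indices are hypothetical, echoing the examples in this mail.)

```scala
// Hypothetical case class mirroring the columns picked out above.
case class Trip(id: Long, entryTime: Long, exitTime: Long, cost: Double)

object TripSketch {
  // Parse one Seq-style row (indices 0, 1, 4, 9, as in the example above).
  def parse(l: Seq[String]): Trip =
    Trip(l(0).toLong, l(1).toLong, l(4).toLong, l(9).toDouble)

  def main(args: Array[String]): Unit = {
    val row = Seq("1", "100", "-", "-", "200", "-", "-", "-", "-", "3.5")
    val t = parse(row) // with Spark this would be myrdd.map(parse)
    // Named access instead of positional access:
    println(s"${t.id} ${t.entryTime} ${t.exitTime} ${t.cost}")
    // Keying for a join by name rather than by index:
    val keyed = ((t.entryTime, t.exitTime), t)
    assert(keyed._1 == (100L, 200L))
  }
}
```

The case class buys compile-time field names but fixes the schema per class, so it doesn't give the dynamic, per-dataset columns an R frame has.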


The question is this:
(1) I started writing a DataFrameRDD class that kept track of the column
names and column values, plus some optional attributes common to the entire
dataframe. However, I got a little muddled when trying to figure out what
happens when a DataFrameRDD is chained with other operations and gets
transformed into other types of RDDs. The value part of the RDD is obvious,
but I didn't know the best way to pass on the "column and attribute"
portions of the DataFrame class.

I googled around for documentation on how to write custom RDDs, but only
found a pptx slide deck with very vague info. Is there a better source of
info on how to write RDDs?

(2) Even better than info on how to write RDDs, has anyone written an RDD
that functions as a DataFrame? :-)

tks
shay

Re: DataFrame RDDs

Posted by andy petrella <an...@gmail.com>.
Exactly, we could actually use Dynamic with Records.

Still thinking out loud:
The catch is that with Dynamic we would lose the type -- the
implementation is up to us and makes use of Any for parameters, of course.
Maybe we could use shapeless records as the delegated value for the
selectDynamic (and so on) implementation...

However, we might run into problems because of the Any in the
signatures, since the compiler loses the type... most probably we'd need a
ClassTag or something similar... not sure if it'll work -- with clean
code at least (°_-)
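(Editor's aside: a tiny sketch of the type-loss concern above, in plain Scala with a hypothetical Row class. The field access compiles, but its static type is Any, so using the value needs a cast or extra type evidence such as a ClassTag.)

```scala
import scala.language.dynamics

// Hypothetical Row: selectDynamic delegates to an untyped Map,
// so every field comes back statically typed as Any.
class Row(fields: Map[String, Any]) extends Dynamic {
  def selectDynamic(name: String): Any = fields(name)
}

object RowSketch {
  def main(args: Array[String]): Unit = {
    val r = new Row(Map("entryTime" -> 100L, "cost" -> 3.5))
    val t = r.entryTime                    // compiles, but static type is Any
    // val later = t + 1                   // would not compile: Any has no + for Int
    val later = t.asInstanceOf[Long] + 1L  // the compiler lost the type; cast needed
    println(later)
  }
}
```

A typed record as the delegated value (the shapeless idea above) would be exactly what removes that cast.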



Andy Petrella
Belgium (Liège)


IT Consultant for NextLab <http://nextlab.be/> sprl (co-founder)
Engaged Citizen Coder for WAJUG <http://wajug.be/> (co-founder)
Author of Learning Play! Framework 2 <http://www.packtpub.com/learning-play-framework-2/book>

Mobile: +32 495 99 11 04
Mails:

   - andy.petrella@nextlab.be
   - andy.petrella@gmail.com

Socials:

   - Twitter: https://twitter.com/#!/noootsab
   - LinkedIn: http://be.linkedin.com/in/andypetrella
   - Blogger: http://ska-la.blogspot.com/
   - GitHub:  https://github.com/andypetrella
   - Masterbranch: https://masterbranch.com/andy.petrella




Re: DataFrame RDDs

Posted by Matei Zaharia <ma...@gmail.com>.
Interesting idea — in Scala you can also use the Dynamic type (http://hacking-scala.org/post/49051516694/introduction-to-type-dynamic) to allow dynamic properties. It has the same potential pitfalls as string names, but with nicer syntax.

Matei
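(Editor's aside: an illustrative sketch of the Dynamic idea, not code from the thread. row.entryTime desugars to row.selectDynamic("entryTime"), so a misspelled column fails only at runtime, the pitfall noted above, but the call-site syntax is clean. DynRow is hypothetical.)

```scala
import scala.language.dynamics

// Hypothetical DynRow: field access by name via Scala's Dynamic trait.
class DynRow(fields: Map[String, Any]) extends Dynamic {
  def selectDynamic(name: String): Any =
    fields.getOrElse(name, sys.error(s"no such column: $name"))
}

object DynRowSketch {
  def main(args: Array[String]): Unit = {
    val row = new DynRow(Map("id" -> 7, "entryTime" -> 100L))
    println(row.entryTime)  // sugar for row.selectDynamic("entryTime")
    // row.entryTiem would compile fine and fail only at runtime:
    // the string-name pitfall, with nicer syntax.
  }
}
```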



Re: DataFrame RDDs

Posted by andy petrella <an...@gmail.com>.
FYI,
I asked Miles and, unsurprisingly, Alois Cochard has done some work exactly
in that direction (I know he's a big fan of Records).

Check this out: https://twitter.com/aloiscochard/status/402745443450232832

So undoubtedly there is room for a neat DataFrame in Scala, but the path
to it is rather tough ;-)

Cheers

andy


Re: DataFrame RDDs

Posted by andy petrella <an...@gmail.com>.
indeed the Scala version could be a blocker (I'm not sure why it needs
2.11; maybe Miles uses quasiquotes...)

Andy



Re: DataFrame RDDs

Posted by Anwar Rizal <an...@gmail.com>.
I had that in mind too when Miles Sabin presented Shapeless at Scala.IO
Paris last month.

If anybody would like to experiment with shapeless in Spark to create
something like an R data frame or an Incanter dataset, I would be happy to
see it and eventually help.

My feeling, however, is that shapeless moves fast (e.g., in my
understanding, the latest shapeless requires Scala 2.11), and that may be a
problem.

Re: DataFrame RDDs

Posted by andy petrella <an...@gmail.com>.
Maybe I'm wrong, but this use case could be a good fit for Shapeless
<https://github.com/milessabin/shapeless> records.

Shapeless records are, so to say, like Lisp's records but typed! In that
sense, they're closer to Haskell's record notation, but imho less
powerful, since access is based on a String (the field name) for
Shapeless, where Haskell uses pure functions!

Anyway, this documentation
<https://github.com/milessabin/shapeless/wiki/Feature-overview%3a-shapeless-2.0.0#extensible-records>
is self-explanatory and shows how we (maybe) could use them to
simulate an R frame.

Thinking out loud: when reading a csv file, for instance, what would be
needed are
 * a Read[T] for each column,
 * folding the list of columns by "reading" each one and prepending the
result (combined with the name via ->>) to an HList

The gain would be that we should recover one helpful feature of R's frame,
which is:
  R     :: frame$newCol = frame$post - frame$pre                            // adds a column to the frame
  Shpls :: frame2 = frame + ("newCol" ->> (frame("post") - frame("pre")))   // a type-safe "difference" between Ints, for instance
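(Editor's aside: for concreteness, a sketch of the comparison above using shapeless 2.x extensible records. Untested here: it needs the shapeless dependency on the classpath, and the exact import paths vary a bit across shapeless versions.)

```scala
import shapeless._
import syntax.singleton._  // provides ->> for building named fields
import record._            // provides record selection and extension

object FrameSketch {
  def main(args: Array[String]): Unit = {
    // A "row" as an extensible record: an HList of name ->> value fields.
    val frame = ("pre" ->> 4) :: ("post" ->> 10) :: HNil

    // R     : frame$newCol = frame$post - frame$pre
    // Shpls : extend the record; the subtraction is checked as Int - Int.
    val frame2 = frame + ("newCol" ->> (frame("post") - frame("pre")))

    val diff: Int = frame2("newCol")  // statically typed as Int
    println(diff)
  }
}
```

Unlike the Dynamic approach discussed elsewhere in the thread, a misspelled field name here fails at compile time rather than at runtime.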

Of course, we're not recovering R's frame as is, because we're simply
dealing with rows one by one, where a frame deals with the full table
-- but in the case of Spark it would make no sense to mimic that, since
we use RDDs for that :-D.

I haven't experimented with this yet, but it'd be fun to try; don't know
if anyone is interested ^^

Cheers

andy



Re: DataFrame RDDs

Posted by Christopher Nguyen <ct...@adatao.com>.
Sure, Shay. Let's connect offline.

Sent while mobile. Pls excuse typos etc.
On Nov 16, 2013 2:27 AM, "Shay Seng" <sh...@1618labs.com> wrote:

> Nice, any possibility of sharing this code in advance?
>
>
> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen <ct...@adatao.com>wrote:
>
>> Shay, we've done this at Adatao, specifically a big data frame in RDD
>> representation and subsetting/projections/data mining/machine learning
>> algorithms on that in-memory table structure.
>>
>> We're planning to harmonize that with the MLBase work in the near future.
>> Just a matter of prioritization on limited resources. If there's enough
>> interest we'll accelerate that.
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Nov 16, 2013 1:11 AM, "Shay Seng" <sh...@1618labs.com> wrote:
>>
>>> [original message snipped; quoted in full earlier in the thread]
>>>
>>
>

Re: DataFrame RDDs

Posted by Shay Seng <sh...@1618labs.com>.
Nice, any possibility of sharing this code in advance?


On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen <ct...@adatao.com> wrote:

> Shay, we've done this at Adatao, specifically a big data frame in RDD
> representation and subsetting/projections/data mining/machine learning
> algorithms on that in-memory table structure.
>
> We're planning to harmonize that with the MLBase work in the near future.
> Just a matter of prioritization on limited resources. If there's enough
> interest we'll accelerate that.
>
> Sent while mobile. Pls excuse typos etc.
> On Nov 16, 2013 1:11 AM, "Shay Seng" <sh...@1618labs.com> wrote:
>
>> [original message snipped; quoted in full earlier in the thread]
>>
>

Re: DataFrame RDDs

Posted by Christopher Nguyen <ct...@adatao.com>.
Shay, we've done this at Adatao, specifically a big data frame in RDD
representation and subsetting/projections/data mining/machine learning
algorithms on that in-memory table structure.

We're planning to harmonize that with the MLBase work in the near future.
Just a matter of prioritization on limited resources. If there's enough
interest we'll accelerate that.

Sent while mobile. Pls excuse typos etc.
On Nov 16, 2013 1:11 AM, "Shay Seng" <sh...@1618labs.com> wrote:

> [original message snipped; quoted in full earlier in the thread]
>
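[Editor's note] Question (1) in the thread — how to carry the "column and attribute" portions along when a dataframe RDD is chained through further transformations — can be sketched as a wrapper that keeps the schema next to the data. Rough sketch only: plain Scala collections stand in for an RDD, and all names are illustrative.

```scala
// The schema (column names) travels with the data. The key design point:
// row-preserving operations pass the schema through automatically, while
// shape-changing operations force the caller to declare the new schema --
// which is exactly the information lost when chaining into plain RDDs.
case class Frame[T](columns: Seq[String], rows: Seq[T]) {
  // Row-preserving transform: schema unchanged.
  def filterRows(p: T => Boolean): Frame[T] =
    Frame(columns, rows.filter(p))

  // Shape-changing transform: caller supplies the new column names.
  def select[U](newColumns: Seq[String])(f: T => U): Frame[U] =
    Frame(newColumns, rows.map(f))
}

val f = Frame(Seq("id", "entryTime", "exitTime", "cost"),
              Seq((1, 10, 20, 2.5), (2, 15, 30, 4.0)))

val g = f.select(Seq("id", "duration")) { case (id, in, out, _) => (id, out - in) }
```

In an RDD setting the same split applies: the schema is small driver-side metadata, so only the rows need to live in the distributed collection.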