Posted to issues@spark.apache.org by "Alok Singh (JIRA)" <ji...@apache.org> on 2016/08/04 21:25:20 UTC

[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

    [ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408513#comment-15408513 ] 

Alok Singh commented on SPARK-16611:
------------------------------------

Hi [~shivaram]

 Thanks for the reply.

1) To illustrate what I meant by the broadcast variable issue, please refer to the following example:

    randomMatBr <- broadcast(sc, randomMat)

    worker <- function(r) { list(r[[1]] + 1) }   # R lists are 1-indexed
    o1 <- dapply(df, worker, out_sch)            # case 1

    o2 <- lapply(df, worker)                     # case 2

    useBroadcast <- function(x) { sum(value(randomMatBr) * x) }
    o3 <- lapply(toRDD(df), useBroadcast)        # case 3

 Notes:
  - The user intends to use case 3, so he created the broadcast variable. But he also wants to compute o1 or o2 (for other use cases), and in cases 1 and 2 he knows he will never touch the broadcast data. In case 1, the framework will nevertheless ship every element in ls(broadcastArr) to each node; in case 2, it will not.

2) If there is a single way of getting the RDD from a DataFrame, i.e. toRDD as you suggested, that would be great :)
  But will it also work with pipelined RDDs and DataFrames?

 Here is one example to illustrate the point:

  # custom read.csv
  parseFields <- function(record) {
    Sys.setlocale("LC_ALL", "C")  # necessary for strsplit() to work correctly
    nrecord <- as.character(record)
    parts <- strsplit(nrecord, ",")[[1]]
    list(id = parts[1], title = parts[2], modified = parts[3],
         text = parts[4], username = parts[5])
  }

  pr <- SparkR:::lapply(f, parseFields)  # f is an RDD of text lines
  cache(pr)
  pr  # inspect
  sch <- structType(structField("id", "string"),
                    structField("title", "string"),
                    structField("modified", "string"),
                    structField("text", "string"),
                    structField("username", "string"))
  air_df <- createDataFrame(sqlContext, pr, sch)


  # now we pass air_df's RDD to SystemML
  The current air_df is a pipelined DataFrame; getJRDD returns the proper RDD, but when I used toRDD, my last experiment didn't work properly.
  # Please note that in 2.0 we will have read.csv, but the point is that the user can have any pipelined RDD or DataFrame. Will toRDD also work with a pipelined RDD/DataFrame?
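
 To make the question concrete, here is a minimal sketch of the two access paths I am comparing (toRDD and getJRDD are SparkR-internal functions, so this is illustrative of the current private API, not a supported usage):

  # path A: internal conversion from a DataFrame to an RDD handle
  r1 <- SparkR:::toRDD(air_df)

  # path B: the underlying Java RDD of an existing (pipelined) RDD handle
  jrdd <- SparkR:::getJRDD(pr)

  # the question: does path A behave correctly when air_df was built
  # on top of a pipelined RDD (pr above), the way path B does today?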




Thanks for confirming that we are not removing the RDD API yet and that renaming is the only goal :)

Alok

> Expose several hidden DataFrame/RDD functions
> ---------------------------------------------
>
>                 Key: SPARK-16611
>                 URL: https://issues.apache.org/jira/browse/SPARK-16611
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Oscar D. Lara Yejas
>
> Expose the following functions:
> - lapply or map
> - lapplyPartition or mapPartition
> - flatMap
> - RDD
> - toRDD
> - getJRDD
> - cleanup.jobj
> cc:
> [~javierluraschi] [~jj@rstudio.com] [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org