Posted to dev@toree.apache.org by Marius Vileiniskis <mv...@danskebank.lt.INVALID> on 2017/07/03 10:14:31 UTC

Possible bug

Hi,

We're using the Apache Toree kernel in Jupyter Notebook to communicate with Spark and are experiencing difficulties running some simple code. The code works from the spark-shell, but it fails in Jupyter Notebook. The idea is simply to take several columns from a DataFrame and put them into a LabeledPoint. Whenever the code is run from the Jupyter Notebook it throws an "object not serializable" error. The code is as follows:

import scala.util.Random.{setSeed, nextDouble}
setSeed(1)

case class Record (
    foo: Double, target: Double, x1: Double, x2: Double, x3: Double) extends Serializable

val rows = sc.parallelize(
    (1 to 10).map(_ => Record(
        nextDouble, nextDouble, nextDouble, nextDouble, nextDouble
    ))
)
val df = spark.sqlContext.createDataFrame(rows)
df.registerTempTable("df")

spark.sqlContext.sql("""
  SELECT ROUND(foo, 2) foo,
         ROUND(target, 2) target,
         ROUND(x1, 2) x1,
         ROUND(x2, 2) x2,
         ROUND(x3, 2) x3
  FROM df""").show

// Or if you want to exclude columns
val ignored = List("foo", "target", "x2")
// Map feature names to indices
val featInd = df.columns.diff(ignored).map(df.columns.indexOf(_))

// Get index of target
val targetInd = df.columns.indexOf("target")
val label_data = df.rdd.map(r => org.apache.spark.mllib.regression.LabeledPoint(
    r.getDouble(targetInd), // Get target value
    // Map feature indices to values
    org.apache.spark.mllib.linalg.Vectors.dense(featInd.map(r.getDouble(_)).toArray)
)).take(2).foreach(println)


Tweaking this slightly makes it work fine. Changing the last part:


val label_data = df.rdd.map(r => org.apache.spark.mllib.regression.LabeledPoint(
    r.getDouble(1), // Get target value
    // Map feature indices to values
    org.apache.spark.mllib.linalg.Vectors.dense(r.getDouble(2), r.getDouble(3))
)).take(2).foreach(println)

It seems that it chokes whenever the index passed to getDouble is a variable rather than a hard-coded number. Yet again, this runs perfectly in the spark-shell. Is this a known issue, or should it be reported as a bug?
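For what it's worth, the usual explanation for this class of error is that vals defined in a notebook cell become fields of a wrapper object generated by the kernel, so a closure that reads featInd or targetInd drags that whole (possibly non-serializable) wrapper into the Spark task, while hard-coded literals capture nothing. A common workaround is to copy the vals into local variables inside an enclosing block or function just before the closure, so the closure captures only the serializable copies. Here is a minimal pure-Scala sketch of that mechanism (no Spark required; ReplLine is a hypothetical stand-in for the kernel's line wrapper, not Toree's actual class):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for the REPL's generated line wrapper.
// Note it does NOT extend Serializable, like the real wrapper may not.
class ReplLine {
  val featInd: Array[Int] = Array(2, 3)

  // Reading the field `featInd` inside the lambda captures `this`,
  // so serializing the closure tries to serialize all of ReplLine.
  def badClosure: Array[Double] => Double =
    xs => featInd.map(xs(_)).sum

  // Copying the field into a local val first means the lambda captures
  // only the Array[Int], which serializes fine.
  def goodClosure: Array[Double] => Double = {
    val localInd = featInd
    xs => localInd.map(xs(_)).sum
  }
}

// True if the object survives Java serialization (what Spark does to closures).
def serializable(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj)
    true
  } catch {
    case _: NotSerializableException => false
  }
```

Under this assumption, adding something like { val ti = targetInd; val fi = featInd; df.rdd.map(r => ...) } around the failing snippet, and referencing only ti and fi inside the map, should behave like the hard-coded version.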

It's running with Spark 2.1.1.


Kind regards
Dr. Marius Vileiniškis
Senior Data Scientist

A.Goštauto str. 12A (UNIQ), Vilnius
Danske Group IT Lithuania
Mobile: +37065315325
mvil@danskebank.lt



_______________
Please note that this message may contain confidential information. If you have received this message by mistake, please inform the sender of the mistake by sending a reply, then delete the message from your system without making, distributing or retaining any copies of it. Although we believe that the message and any attachments are free from viruses and other errors that might affect the computer or IT system where it is received and read, the recipient opens the message at his or her own risk. We assume no responsibility for any loss or damage arising from the receipt or use of this message.