Posted to dev@zeppelin.apache.org by "AV (JIRA)" <ji...@apache.org> on 2018/12/28 10:05:00 UTC

[jira] [Created] (ZEPPELIN-3927) Unstable State running Code

AV created ZEPPELIN-3927:
----------------------------

             Summary: Unstable State running Code
                 Key: ZEPPELIN-3927
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3927
             Project: Zeppelin
          Issue Type: Bug
          Components: zeppelin-interpreter
    Affects Versions: 0.9.0
            Reporter: AV


Executing the tutorial notebook code produces weird results using Spark 2.4.0:

> import org.apache.commons.io.IOUtils
> import java.net.URL
> import java.nio.charset.Charset
>
>
> // Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SqlContext)
> // So you don't need create them manually
>
> // Remote Address
> val csvURL = "https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv";
>
> // Parallel processing
> val bankText = sc.parallelize( IOUtils.toString( new URL(csvURL), Charset.forName("UTF-8") ).toString().split("\n") )
>
> case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
>
> val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
>    s => Bank(s(0).toInt, 
>            s(1).replaceAll("\"", ""),
>            s(2).replaceAll("\"", ""),
>            s(3).replaceAll("\"", ""),
>            s(5).replaceAll("\"", "").toInt
>        )
> ).toDF()
>
> bank.registerTempTable("bank")

 

On the first run (after a Spark interpreter restart) everything works fine; the output is:

> warning: there was one deprecation warning; re-run with -deprecation for details

> import sqlContext.implicits._

> import org.apache.commons.io.IOUtils

> import java.net.URL

> import java.nio.charset.Charset

> csvURL: String = https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv

> bankText: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:28

> defined class Bank

> bank: org.apache.spark.sql.DataFrame = [age: int, job: string ... 3 more fields]
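
Side note: the deprecation warning presumably refers to registerTempTable, which Spark 2.x deprecated in favor of createOrReplaceTempView. A minimal change to the tutorial paragraph would be:

> // createOrReplaceTempView is the non-deprecated replacement for
> // registerTempTable in Spark 2.x; same effect, new name.
> bank.createOrReplaceTempView("bank")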

 

After the code has been executed once, any re-run fails:

> warning: there was one deprecation warning; re-run with -deprecation for details

> java.lang.IllegalArgumentException: URI is not absolute

> at java.net.URI.toURL(URI.java:1088)

> at org.apache.hadoop.fs.http.AbstractHttpFileSystem.open(AbstractHttpFileSystem.java:60)

> at org.apache.hadoop.fs.http.HttpsFileSystem.open(HttpsFileSystem.java:23)

> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)

> at org.apache.hadoop.fs.FsUrlConnection.connect(FsUrlConnection.java:50)

> at org.apache.hadoop.fs.FsUrlConnection.getInputStream(FsUrlConnection.java:59)

> at java.net.URL.openStream(URL.java:1045)

> at org.apache.commons.io.IOUtils.toString(IOUtils.java:894) ... 39 elided
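
If I read the stack trace right, Hadoop registers its FsUrlStreamHandlerFactory on java.net.URL once Hadoop filesystem code has run, so after the first job every new URL("https://...") is routed through Hadoop's HttpsFileSystem, whose open() then fails on the URI (this looks like HADOOP-14598). A workaround that seems to bypass the registered factory is to hand the URL constructor the JDK's own https handler explicitly. This is only a sketch, and sun.net.www.* is JDK-internal API, so treat it as a hack:

> import java.net.URL
> import java.nio.charset.Charset
> import org.apache.commons.io.IOUtils
>
> // Force the JDK's built-in https handler so the Hadoop-registered
> // URLStreamHandlerFactory is bypassed on re-runs (internal API, hacky).
> val jdkHandler = new sun.net.www.protocol.https.Handler()
> val csvUrl = new URL(null, "https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv", jdkHandler)
> val bankCsvText = IOUtils.toString(csvUrl, Charset.forName("UTF-8"))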

 

Alongside the deprecation warning, this error also appears:

> <console>:36: error: value toDF is not a member of org.apache.spark.rdd.RDD[Bank]

> possible cause: maybe a semicolon is missing before `value toDF'? ).toDF()
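
The toDF error may be a symptom of the same re-run problem: toDF() on an RDD of case classes is supplied by the SparkSession implicits, so if that import is lost between runs the method disappears. A hedged guess at a workaround is to re-import the implicits in the same paragraph (spark here is the SparkSession that Zeppelin injects alongside sc and sqlContext):

> // Re-import the implicits in the same paragraph so a re-run cannot
> // lose them; bankText and Bank are from the tutorial snippet above.
> import spark.implicits._
>
> val bank = bankText.map(line => line.split(";"))
>   .filter(cols => cols(0) != "\"age\"")
>   .map(cols => Bank(cols(0).toInt,
>                     cols(1).replaceAll("\"", ""),
>                     cols(2).replaceAll("\"", ""),
>                     cols(3).replaceAll("\"", ""),
>                     cols(5).replaceAll("\"", "").toInt))
>   .toDF()

Moving the case class definition into its own paragraph is another commonly suggested fix for this REPL quirk, though I have not confirmed it here.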

 

Any ideas?

 

P.S.: I'm a little curious that there are no other reports of this problem; using the latest stable Spark/Hadoop releases when compiling from source seems natural to me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)