Posted to user@spark.apache.org by Eric Beabes <ma...@gmail.com> on 2021/05/25 15:30:49 UTC
NullPointerException in SparkSession while reading Parquet files on S3
I keep getting the following exception when trying to read a Parquet
file from a path on S3 in Spark/Scala. Note: I am running this on EMR.
java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:142)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:789)
at org.apache.spark.sql.SparkSession.read(SparkSession.scala:656)
Interestingly I can read the path from Spark shell:
scala> val df = spark.read.parquet("s3://my-path/").count
df: Long = 47
I've created the SparkSession as follows:
val sparkConf = new SparkConf().setAppName("My spark app")
val spark = SparkSession.builder.config(sparkConf).enableHiveSupport().getOrCreate()
spark.sparkContext.setLogLevel("WARN")
spark.sparkContext.hadoopConfiguration.set("java.library.path",
"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native")
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
spark.conf.set("spark.speculation", "false")
spark.conf.set("spark.sql.crossJoin.enabled", "true")
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "true")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version",
"2")
spark.sparkContext.hadoopConfiguration.setBoolean("mapreduce.fileoutputcommitter.cleanup.skipped",
true)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key",
System.getenv("AWS_ACCESS_KEY_ID"))
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key",
System.getenv("AWS_SECRET_ACCESS_KEY"))
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint",
"s3.amazonaws.com")
Here's the line where I am getting this exception:
val df1 = spark.read.parquet(pathToRead)
What am I doing wrong? I have also tried without setting the access key
and secret key, with no luck.
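For reference, one pattern that commonly produces this exact NPE at SparkSession.sessionState is referencing the SparkSession from inside an executor closure: the session gets serialized to the executors with null internals, and the first call to spark.read blows up there. This is a hypothetical minimal sketch, not code taken from the reported job, and it is only one possible cause:

```scala
import org.apache.spark.sql.SparkSession

object SessionStateNpeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("npe-sketch").getOrCreate()
    // Hypothetical list of S3 prefixes to read; stands in for pathToRead.
    val paths = spark.sparkContext.parallelize(Seq("s3://my-path/"))

    // Problematic: `spark` is captured in the closure, deserialized on the
    // executors, and spark.read then throws a NullPointerException from
    // SparkSession.sessionState, matching the stack trace above.
    // paths.foreach(p => spark.read.parquet(p).count())

    // Safe: bring the paths back to the driver first, then call
    // spark.read only on the driver.
    paths.collect().foreach(p => spark.read.parquet(p).count())

    spark.stop()
  }
}
```

If the failing spark.read.parquet(pathToRead) call runs inside a map/foreach on an RDD or Dataset, or in code executed on an executor, that would explain why the same read works fine from the driver-side Spark shell.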
Re: NullPointerException in SparkSession while reading Parquet files on S3
Posted by YEONWOO BAEK <ye...@gmail.com>.
unsubscribe
On Wed, May 26, 2021 at 12:31 AM, Eric Beabes <ma...@gmail.com> wrote: