Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/09/23 15:11:05 UTC
[jira] [Resolved] (SPARK-6161) sqlCtx.parquetFile(dataFilePath) throws NPE when using s3, but OK when using local filesystem
[ https://issues.apache.org/jira/browse/SPARK-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-6161.
------------------------------
Resolution: Not A Problem
I think this is maybe a question for user@ first, but also, appears to be an S3 problem.
> sqlCtx.parquetFile(dataFilePath) throws NPE when using s3, but OK when using local filesystem
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-6161
> URL: https://issues.apache.org/jira/browse/SPARK-6161
> Project: Spark
> Issue Type: Question
> Components: Spark Submit
> Affects Versions: 1.2.1
> Environment: MacOSX 10.10, S3
> Reporter: Marshall
>
> Using some examples from Spark Summit 2014 and Spark 1.2.1, we individually converted 15 pipe-separated raw text files (averaging ~100k lines each)
> to Parquet format using the following code:
> JavaSchemaRDD schemaXXXXData = sqlCtx.applySchema(xxxxData, XXXXRecord.class);
> schemaXXXXData.registerTempTable("xxxxdata");
> schemaXXXXData.saveAsParquetFile(output);
> We took the results of each folder and renamed the part file to match the original filename plus .parquet and dropped them all into one directory.
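[Editor's note: the manual rename-and-collect step described above could be sketched as follows. This is an illustrative reconstruction, not the reporter's actual code; directory layout and names are assumptions based on what saveAsParquetFile typically writes (a part-* file plus a _SUCCESS marker).]

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PartRenamer {
    // Move the single part-* file out of one saveAsParquetFile output
    // directory into destDir, renamed to <originalName>.parquet.
    // Marker files such as _SUCCESS are deliberately left behind.
    static Path collectPart(Path outputDir, String originalName, Path destDir)
            throws IOException {
        try (DirectoryStream<Path> parts =
                Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path part : parts) {
                Path target = destDir.resolve(originalName + ".parquet");
                return Files.move(part, target,
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
        throw new IOException("no part file found in " + outputDir);
    }
}
```

Running this once per output folder would produce the flat directory of *.parquet files the reporter describes.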
> We created a java class that we then invoke using a spark-1.2.1/bin/spark-submit command...
> SparkConf sparkConf = new SparkConf().setAppName("XXXXX");
> JavaSparkContext ctx = new JavaSparkContext(sparkConf);
> JavaSQLContext sqlCtx = new JavaSQLContext(ctx);
>
> final String dataFilePath = "/tmp/xxxxprocessor/xxxxsamplefiles_parquet";
> //final String dataFilePath = inputPath;
> // Create a JavaSchemaRDD from the file(s) pointed to by path
> JavaSchemaRDD xxxxData = sqlCtx.parquetFile(dataFilePath);
> GOOD: when we run our spark app locally (specifying dataFilePath as a full filename of ONE specific parquet on local filesystem), all is well... the 'sqlCtx.parquetFile(dataFilePath);' command finds the file and proceeds.
> GOOD: when we run our spark app locally (specifying dataFilePath as the directory that contains all the parquet files), all is well... the 'sqlCtx.parquetFile(dataFilePath);' command rips through each file in the dataFilePath directory and proceeds.
> GOOD: if we do the same thing by uploading ONE of the parquet files to s3, and change our app to use the s3 path (giving it the full filename to ONE parquet file), all is good - code finds the file and proceeds...
> BAD: if we then upload all the parquet files to s3 and specify the s3 directory where all the parquet files are, we get an NPE:
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
> at java.io.BufferedInputStream.close(BufferedInputStream.java:472)
> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:428)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:389)
> at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
> at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:457)
> at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
> at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:65)
> at org.apache.spark.sql.api.java.JavaSQLContext.parquetFile(JavaSQLContext.scala:141)
> at com.aol.ido.spark.sql.XXXXFileIndexParquet.doWork(XXXFileIndexParquet.java:101)
> Wondering why specifying a 'dir' works locally but not in S3...
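[Editor's note: one plausible explanation, not confirmed in this ticket: listing an S3 "directory" through the old s3n:// connector can surface zero-byte directory-marker objects (keys ending in "/" or "_$folder$") alongside the real part files, and Parquet's readFooter then tries to read a footer from such an object, hitting the close() NPE in NativeS3FsInputStream seen in the trace. A local directory listing contains no such markers, which would match the local-vs-S3 difference. A minimal pure-Java sketch of the kind of key filter one could apply before handing paths to parquetFile; the method names and heuristics are illustrative assumptions, not an existing API:]

```java
import java.util.ArrayList;
import java.util.List;

public class S3KeyFilter {
    // Heuristic: drop zero-length objects and s3n-style directory markers
    // ("xxx_$folder$" keys, or keys ending in "/"), and keep only the
    // renamed *.parquet part files the reporter uploaded.
    static boolean looksLikeParquetPart(String key, long size) {
        if (size == 0L) return false;
        if (key.endsWith("/") || key.endsWith("_$folder$")) return false;
        return key.endsWith(".parquet");
    }

    // Given parallel lists of object keys and sizes (as returned by an
    // S3 listing), keep only the keys that look like real Parquet parts.
    static List<String> filterKeys(List<String> keys, List<Long> sizes) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            if (looksLikeParquetPart(keys.get(i), sizes.get(i))) {
                out.add(keys.get(i));
            }
        }
        return out;
    }
}
```

With a filter like this, one could pass the surviving keys to parquetFile individually (or union the results) instead of pointing it at the whole directory.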
> BTW, we have done the above steps using JSON-formatted files and all four scenarios work well.
> // Create a JavaSchemaRDD from the file(s) pointed to by path
> JavaSchemaRDD xxxxData = sqlCtx.jsonFile(dataFilePath);
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org