Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/09/23 15:11:05 UTC

[jira] [Resolved] (SPARK-6161) sqlCtx.parquetFile(dataFilePath) throws NPE when using s3, but OK when using local filesystem

     [ https://issues.apache.org/jira/browse/SPARK-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-6161.
------------------------------
    Resolution: Not A Problem

I think this is maybe a question for user@ first, but it also appears to be an S3 problem.

> sqlCtx.parquetFile(dataFilePath) throws NPE when using s3, but OK when using local filesystem
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6161
>                 URL: https://issues.apache.org/jira/browse/SPARK-6161
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Submit
>    Affects Versions: 1.2.1
>         Environment: MacOSX 10.10, S3
>            Reporter: Marshall
>
> Using some examples from Spark Summit 2014 and Spark 1.2.1, we converted 15 pipe-separated raw text files (averaging about 100k lines each) individually
> to parquet file format using the following code:
>   JavaSchemaRDD schemaXXXXData = sqlCtx.applySchema(xxxxData, XXXXRecord.class);
>   schemaXXXXData.registerTempTable("xxxxdata");
>   schemaXXXXData.saveAsParquetFile(output);
> We took the results from each output folder, renamed each part file to match the original filename plus .parquet, and dropped them all into one directory.
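> (For reference, a minimal self-contained sketch of that conversion step against the Spark 1.2 Java API; the SampleRecord bean, its fields, the pipe-splitting, and the paths below are placeholders standing in for our real XXXXRecord and data, not the actual code:)
>   import org.apache.spark.SparkConf;
>   import org.apache.spark.api.java.JavaRDD;
>   import org.apache.spark.api.java.JavaSparkContext;
>   import org.apache.spark.api.java.function.Function;
>   import org.apache.spark.sql.api.java.JavaSQLContext;
>   import org.apache.spark.sql.api.java.JavaSchemaRDD;
>
>   public class TextToParquet {
>     // Hypothetical JavaBean standing in for XXXXRecord; applySchema reads the schema from its getters.
>     public static class SampleRecord implements java.io.Serializable {
>       private String id;
>       private String value;
>       public String getId() { return id; }
>       public void setId(String id) { this.id = id; }
>       public String getValue() { return value; }
>       public void setValue(String value) { this.value = value; }
>     }
>
>     public static void main(String[] args) {
>       JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("TextToParquet"));
>       JavaSQLContext sqlCtx = new JavaSQLContext(ctx);
>
>       // Read one pipe-separated text file and map each line onto the bean.
>       JavaRDD<SampleRecord> records = ctx.textFile("/tmp/xxxxprocessor/raw/sample.txt")  // placeholder input
>         .map(new Function<String, SampleRecord>() {
>           public SampleRecord call(String line) {
>             String[] parts = line.split("\\|");
>             SampleRecord r = new SampleRecord();
>             r.setId(parts[0]);
>             r.setValue(parts[1]);
>             return r;
>           }
>         });
>
>       // Apply the bean schema and write the result out as a parquet directory.
>       JavaSchemaRDD schemaRecords = sqlCtx.applySchema(records, SampleRecord.class);
>       schemaRecords.registerTempTable("sampledata");
>       schemaRecords.saveAsParquetFile("/tmp/xxxxprocessor/xxxxsamplefiles_parquet/sample.parquet");  // placeholder output
>     }
>   }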
> We created a Java class that we then invoke using a spark-1.2.1/bin/spark-submit command...
>       SparkConf sparkConf = new SparkConf().setAppName("XXXXX");
>       JavaSparkContext ctx = new JavaSparkContext(sparkConf);
>       JavaSQLContext sqlCtx = new JavaSQLContext(ctx);
>        
>       final String dataFilePath = "/tmp/xxxxprocessor/xxxxsamplefiles_parquet";
>       //final String dataFilePath = inputPath;
>       // Create a JavaSchemaRDD from the file(s) pointed to by path
>       JavaSchemaRDD xxxxData = sqlCtx.parquetFile(dataFilePath);
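> (For the S3 runs described below, a minimal sketch of the extra wiring this driver would need; the fs.s3n.* property names assume Hadoop's classic NativeS3FileSystem, and the bucket and credential values are placeholders, not our actual configuration:)
>       // Supply S3 credentials through the Hadoop configuration before reading s3n:// paths.
>       ctx.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<access-key>");       // placeholder
>       ctx.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<secret-key>");   // placeholder
>
>       // Same read as above, but pointed at the S3 directory of parquet files.
>       final String s3Path = "s3n://<bucket>/xxxxsamplefiles_parquet";               // placeholder bucket
>       JavaSchemaRDD s3Data = sqlCtx.parquetFile(s3Path);
>       System.out.println("rows: " + s3Data.count());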
> GOOD: when we run our spark app locally (specifying dataFilePath as the full filename of ONE specific parquet file on the local filesystem), all is well... the 'sqlCtx.parquetFile(dataFilePath);' command finds the file and proceeds.
> GOOD: when we run our spark app locally (specifying dataFilePath as the directory that contains all the parquet files), all is well... the 'sqlCtx.parquetFile(dataFilePath);' command rips through each file in the dataFilePath directory and proceeds.
> GOOD: if we do the same thing by uploading ONE of the parquet files to s3 and changing our app to use the s3 path (giving it the full filename of ONE parquet file), all is good - the code finds the file and proceeds...
> BAD: if we then upload all the parquet files to s3 and specify the s3 directory where all the parquet files are, we get an NPE:
>  Exception in thread "main" java.lang.NullPointerException
>     at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
>     at java.io.BufferedInputStream.close(BufferedInputStream.java:472)
>     at java.io.FilterInputStream.close(FilterInputStream.java:181)
>     at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:428)
>     at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:389)
>     at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
>     at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
>     at scala.Option.map(Option.scala:145)
>     at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:457)
>     at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
>     at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:65)
>     at org.apache.spark.sql.api.java.JavaSQLContext.parquetFile(JavaSQLContext.scala:141)
>     at com.aol.ido.spark.sql.XXXXFileIndexParquet.doWork(XXXFileIndexParquet.java:101)
> Wondering why specifying a 'dir' works locally but not in S3...
> BTW, we have done the above steps using JSON-formatted files, and all four scenarios work well.
>       // Create a JavaSchemaRDD from the file(s) pointed to by path
>       JavaSchemaRDD xxxxData = sqlCtx.jsonFile(dataFilePath);
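> (Spelled out as a fragment of the same driver, for the JSON comparison; the JSON directory path here is a placeholder:)
>       // Same check as above, but loading the JSON copies of the data instead of parquet.
>       final String jsonPath = "/tmp/xxxxprocessor/xxxxsamplefiles_json";   // or "s3n://<bucket>/xxxxsamplefiles_json"
>       JavaSchemaRDD jsonData = sqlCtx.jsonFile(jsonPath);
>       jsonData.registerTempTable("xxxxjsondata");
>       System.out.println("json rows: " + jsonData.count());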



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org