Posted to user@spark.apache.org by Rory Byrne <ro...@gmail.com> on 2016/02/18 13:05:54 UTC

How do I stream in Parquet files using fileStream() and ParquetInputFormat?

Hi,

I'm trying to understand how to stream Parquet files into Spark using
StreamingContext.fileStream[Key, Value, Format]().

I am struggling to understand: a) what should be passed as Key and Value
(assuming ParquetInputFormat — is this the correct format?), and b) how,
if at all, to configure the ParquetInputFormat with a ReadSupport class,
RecordMaterializer, etc.

I have tried setting the read support class to GroupReadSupport (from the
Parquet examples), but I ran into the problem that I must also pass a
Hadoop MapReduce Job, which is expected to be running and attached to a
job tracker.
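
For context, here is roughly the shape of what I have been trying. This is
only a sketch, not something I know to be correct: the input path, the batch
interval, and the choice of GroupReadSupport are my own guesses.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.parquet.example.data.Group
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.parquet.hadoop.example.GroupReadSupport

object ParquetStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("parquet-file-stream")
    val ssc  = new StreamingContext(conf, Seconds(30)) // batch interval is arbitrary

    // ParquetInputFormat extends FileInputFormat[Void, T], so (as far as I
    // can tell) the Key type is Void and the Value type is whatever the
    // ReadSupport materializes — Group, when using GroupReadSupport.
    ssc.sparkContext.hadoopConfiguration.set(
      ParquetInputFormat.READ_SUPPORT_CLASS,
      classOf[GroupReadSupport].getName)

    val stream = ssc.fileStream[Void, Group, ParquetInputFormat[Group]](
      "hdfs:///data/incoming/parquet") // placeholder path

    // Keys are always null for this input format; only the values matter.
    stream.map { case (_, group) => group.toString }.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This at least compiles for me, but it is where I hit the Job / job-tracker
problem described above when the ReadSupport is involved.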

Any help or reading suggestions would be appreciated; I have almost no
Hadoop knowledge, so this low-level use of Hadoop is quite confusing to
me.