Posted to user@spark.apache.org by "☼ R Nair (रविशंकर नायर)" <ra...@gmail.com> on 2018/04/27 12:19:43 UTC
Spark Streaming for more file types
All,
I have the following methods in my Scala code, currently executed on demand:

val files = sc.binaryFiles("file:///imocks/data/ocr/raw")
// The above line picks up all the PDF files
files.map(myconverter(_)).count
myconverter signature:

def myconverter(
    file: (String, org.apache.spark.input.PortableDataStream)
): Unit = {
  // Code to interact with IBM Datamap OCR, which converts the PDF files
  // into text
}
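For anyone who wants to play with the shape of this pipeline without a cluster, here is a minimal plain-Scala sketch. FakeStream, the byte-counting converter body, and countConverted are illustrative stand-ins of mine, not Spark or IBM API; the real code needs a Spark runtime and the OCR dependency.

```scala
// Plain-Scala sketch of the batch pipeline's shape. FakeStream stands in
// for org.apache.spark.input.PortableDataStream, and the converter body is
// a stub: the real one calls the OCR service and returns Unit, but the stub
// returns the byte count so it is observable.
object BatchShapeSketch {
  final case class FakeStream(bytes: Array[Byte])

  // same (path, stream) pair shape as myconverter above
  def myconverter(file: (String, FakeStream)): Int =
    file._2.bytes.length

  // mirrors files.map(myconverter(_)).count
  def countConverted(files: Seq[(String, FakeStream)]): Long =
    files.map(myconverter).size.toLong

  def main(args: Array[String]): Unit = {
    // stand-in for sc.binaryFiles("file:///imocks/data/ocr/raw")
    val files = Seq(
      ("file:///imocks/data/ocr/raw/a.pdf", FakeStream(Array[Byte](1, 2, 3))),
      ("file:///imocks/data/ocr/raw/b.pdf", FakeStream(Array[Byte](4, 5)))
    )
    println(countConverted(files))
  }
}
```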
I want to change the above code to Spark Streaming. Unfortunately, there is no "binaryFiles" method on StreamingContext (one would definitely be a great addition to Spark). The closest I can think of is to write something like this:
// Assuming myconverter is not changed
val dstream = ssc.fileStream[BytesWritable, BytesWritable,
  SequenceFileAsBinaryInputFormat]("file:///imocks/data/ocr/raw")
dstream.map(myconverter(_))
Unfortunately, this does not compile: there are errors showing that the method signature does not match, etc. Can anyone please help me get past this issue? I appreciate your help.
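For what it's worth, the mismatch seems to be this: fileStream[K, V, F] produces a DStream[(K, V)], and SequenceFileAsBinaryInputFormat fixes K = V = BytesWritable, so the stream's elements are (BytesWritable, BytesWritable) pairs, not the (String, PortableDataStream) pairs myconverter expects. (Note also that SequenceFileAsBinaryInputFormat reads Hadoop sequence files, so raw PDFs in that directory would still need to be packed into sequence files, or read via a custom whole-file InputFormat.) Below is a sketch of one way to adapt the element type before the converter; it is plain Scala so it can run anywhere, with the untested Spark wiring shown only in comments. The names adapt and myconverterBytes are mine, not Spark API.

```scala
// Sketch of adapting (BytesWritable, BytesWritable) stream elements to the
// (name, bytes) pair shape the converter wants. The byte arrays stand in
// for the output of BytesWritable.copyBytes.
object StreamAdapterSketch {
  // pure adapter: decode the key bytes as the file name, pass the value
  // bytes through untouched
  def adapt(pair: (Array[Byte], Array[Byte])): (String, Array[Byte]) =
    (new String(pair._1, "UTF-8"), pair._2)

  // byte-oriented converter: same OCR interaction as myconverter, but fed
  // raw bytes instead of a PortableDataStream (illustrative stub)
  def myconverterBytes(file: (String, Array[Byte])): Unit = {
    val (name, bytes) = file
    println(s"would OCR $name (${bytes.length} bytes)")
  }

  // In the streaming job this would become (not compiled here, needs Spark):
  //   val dstream = ssc.fileStream[BytesWritable, BytesWritable,
  //     SequenceFileAsBinaryInputFormat]("file:///imocks/data/ocr/raw")
  //   dstream
  //     .map { case (k, v) => adapt((k.copyBytes, v.copyBytes)) }
  //     .foreachRDD(_.foreach(myconverterBytes))

  def main(args: Array[String]): Unit = {
    val fromStream = ("doc1.pdf".getBytes("UTF-8"), Array[Byte](1, 2, 3))
    myconverterBytes(adapt(fromStream))
  }
}
```

The alternative, of course, is to change myconverter's own parameter type instead of adapting the stream; either way, the element type flowing into the function has to match what the InputFormat actually produces.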
Also, wouldn't it be an excellent idea to have all the methods of SparkContext reusable from StreamingContext as well? That way, it would take no extra effort to change a batch program into a streaming app.
Best,
Passion