Posted to user@spark.apache.org by Joel D <ga...@gmail.com> on 2018/09/28 17:10:52 UTC
Text from pdf spark
I'm trying to extract text from PDF files in HDFS using PDFBox.
However, it throws an error:
"Exception in thread "main" org.apache.spark.SparkException: ...
java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf
(No such file or directory)"
What am I missing? Should I be working with the PortableDataStream instead of
the String part of:
val files: RDD[(String, PortableDataStream)]?
def pdfRead(fileNameFromRDD: (String, PortableDataStream), sparkSession: SparkSession) = {
  val file: File = new File(fileNameFromRDD._1.drop(5))
  val document = PDDocument.load(file) // It throws an error here.
  if (!document.isEncrypted()) {
    val stripper = new PDFTextStripper()
    val text = stripper.getText(document)
    println("Text:" + text)
  }
  document.close()
}

// This is where I call the above PDF-to-text converter method.
val files = sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
files.foreach(println)
files.foreach(f => println(f._1))
files.foreach(fileStream => pdfRead(fileStream, sparkSession))
Thanks.
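
A note on the path in the error above: the sketch below reproduces what the
expression passed to java.io.File in pdfRead evaluates to. It assumes the key
returned by binaryFiles is the full URI "hdfs://nnAlias:8020/tmp/sample.pdf",
which is consistent with the error message but not confirmed in the thread.
PathCheck and localPathSeenByFile are hypothetical names used only for
illustration.

```scala
import java.io.File

object PathCheck {
  // Hypothetical helper mirroring the expression used in pdfRead:
  // drop the first five characters ("hdfs:") and wrap the rest in a File.
  def localPathSeenByFile(uri: String): String =
    new File(uri.drop(5)).getPath

  def main(args: Array[String]): Unit = {
    // Assumed key, taken from the binaryFiles call in the post.
    val uri = "hdfs://nnAlias:8020/tmp/sample.pdf"

    // Only the scheme is removed; the host and port remain in the string.
    println(uri.drop(5))

    // java.io.File treats the remainder as a path on the local file system.
    // On a Unix-like system it also collapses the leading double slash.
    println(localPathSeenByFile(uri))

    // The file lives in HDFS, so a local lookup is not expected to find it.
    println(new File(uri.drop(5)).exists())
  }
}
```

On a Unix-like machine the second line printed is /nnAlias:8020/tmp/sample.pdf,
the same string that appears in the FileNotFoundException. java.io.File only
looks at the local file system of the JVM that runs the code, so the lookup can
fail even when the file exists in HDFS.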
Re: Text from pdf spark
Posted by Joel D <ga...@gmail.com>.
Yes, I can access the file using the CLI.
On Fri, Sep 28, 2018 at 1:24 PM kathleen li <ka...@gmail.com> wrote:
> The error message is “file not found”
> Are you able to use the following command line to access the file as the
> user who submitted the job?
> hdfs dfs -ls /tmp/sample.pdf
Re: Text from pdf spark
Posted by kathleen li <ka...@gmail.com>.
The error message is “file not found”.
Are you able to use the following command line to access the file as the user who submitted the job?
hdfs dfs -ls /tmp/sample.pdf
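
If the file is visible from the command line, one direction worth trying is the
one raised in the original question: read the PDF through the
PortableDataStream instead of building a java.io.File from the path string.
The following is only a sketch under stated assumptions. It assumes PDFBox 2.x,
where PDDocument.load accepts an InputStream and PDFTextStripper lives in
org.apache.pdfbox.text. The object name PdfTextSketch and the application name
are hypothetical. It has not been run against the cluster in this thread.

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

object PdfTextSketch {
  // Reads one PDF through the stream Spark provides rather than java.io.File.
  // Returns the path together with the extracted text ("" if encrypted).
  def pdfRead(entry: (String, PortableDataStream)): (String, String) = {
    val (path, stream) = entry
    val in = stream.open() // DataInputStream opened via the Hadoop file system
    try {
      val document = PDDocument.load(in) // PDFBox 2.x overload taking an InputStream
      try {
        val text =
          if (!document.isEncrypted) new PDFTextStripper().getText(document)
          else ""
        (path, text)
      } finally document.close()
    } finally in.close()
  }

  def main(args: Array[String]): Unit = {
    // "pdf-text-sketch" is an arbitrary application name.
    val spark = SparkSession.builder().appName("pdf-text-sketch").getOrCreate()

    // Same path as in the original post.
    val files = spark.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")

    // Extract on the executors, then bring the text back to the driver to print.
    val texts = files.map(pdfRead)
    texts.collect().foreach { case (path, text) => println(s"$path:\n$text") }

    spark.stop()
  }
}
```

Collecting before printing is a deliberate choice here: println inside
foreach runs on the executors, so on a cluster its output goes to the executor
logs rather than the driver console. For a large number of PDFs, writing the
results out with saveAsTextFile would likely be a better fit than collect.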