Posted to user@spark.apache.org by Joel D <ga...@gmail.com> on 2018/09/28 17:10:52 UTC

Text from pdf spark

I'm trying to extract text from PDF files in HDFS using PDFBox.

However, it throws an error:

"Exception in thread "main" org.apache.spark.SparkException: ...
java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf
(No such file or directory)"

What am I missing? Should I be working with the PortableDataStream part
instead of the String part of each element in:

val files: RDD[(String, PortableDataStream)]?

def pdfRead(fileNameFromRDD: (String, PortableDataStream), sparkSession: SparkSession) = {
  val file: File = new File(fileNameFromRDD._1.drop(5))
  val document = PDDocument.load(file) // It throws an error here.

  if (!document.isEncrypted()) {
    val stripper = new PDFTextStripper()
    val text = stripper.getText(document)
    println("Text:" + text)
  }

  document.close()
}


// This is where I call the above PDF-to-text converter method.
val files = sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
files.foreach(println)
files.foreach(f => println(f._1))
files.foreach(fileStream => pdfRead(fileStream, sparkSession))


Thanks.
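
A note on the path in the error message: it looks like what is left of the HDFS URI after drop(5) removes the "hdfs:" prefix. The sketch below uses only the JDK and assumes the key produced by binaryFiles is the same URI string that was passed in (the object name PathDemo is made up for illustration). It suggests that java.io.File then treats the remainder as a path on the local filesystem of whichever machine runs the code, which would explain "No such file or directory" even when the file exists in HDFS.

```scala
import java.io.File
import java.net.URI

object PathDemo {
  // Assumption: the key yielded by binaryFiles equals the URI from the post.
  val key: String = "hdfs://nnAlias:8020/tmp/sample.pdf"

  // drop(5) removes only the five characters "hdfs:", leaving the
  // authority (host and port) glued to the front of the path.
  val stripped: String = key.drop(5)

  // java.io.File knows nothing about HDFS. It treats the string as a local
  // path; on a Unix-like JVM the doubled leading slash is collapsed, which
  // gives the path shown in the FileNotFoundException.
  val localPath: String = new File(stripped).getPath

  // Parsing the key as a URI keeps scheme, host, port and path separate.
  val uri: URI = new URI(key)

  def main(args: Array[String]): Unit = {
    println(s"after drop(5): $stripped")
    println(s"as a local File: $localPath")
    println(s"scheme=${uri.getScheme} host=${uri.getHost} port=${uri.getPort} path=${uri.getPath}")
  }
}
```

Even the bare path /tmp/sample.pdf would still be looked up on local disk by java.io.File, so trimming more characters would probably not help on its own.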

Re: Text from pdf spark

Posted by Joel D <ga...@gmail.com>.

Yes, I can access the file using the CLI.

On Fri, Sep 28, 2018 at 1:24 PM kathleen li <ka...@gmail.com> wrote:

> The error message is “file not found”.
> Are you able to access the file from the command line as the user who
> submitted the job?
> hdfs dfs -ls /tmp/sample.pdf

Re: Text from pdf spark

Posted by kathleen li <ka...@gmail.com>.

The error message is “file not found”.
Are you able to access the file from the command line as the user who submitted the job?
hdfs dfs -ls /tmp/sample.pdf
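
Since the file is reachable with hdfs dfs but not through java.io.File, one option is the one raised in the original question: read from the PortableDataStream rather than from the String. The following is a minimal sketch, not a tested fix. It assumes PDFBox 2.x (where PDDocument.load accepts an InputStream and PDFTextStripper lives in org.apache.pdfbox.text); the object name PdfTextFromHdfs and the app name are made up for illustration, and the unused SparkSession parameter of pdfRead is dropped.

```scala
import java.io.InputStream

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

object PdfTextFromHdfs {

  // Takes one (path, stream) pair from binaryFiles and returns the path
  // together with the extracted text, or None for an encrypted document.
  def pdfRead(entry: (String, PortableDataStream)): (String, Option[String]) = {
    val (path, stream) = entry
    // open() reads through the Hadoop FileSystem API, so the hdfs:// URI
    // is resolved against the cluster rather than the local disk.
    val in: InputStream = stream.open()
    try {
      val document = PDDocument.load(in) // PDFBox 2.x overload (assumption)
      try {
        val text =
          if (document.isEncrypted()) None
          else Some(new PDFTextStripper().getText(document))
        (path, text)
      } finally {
        document.close()
      }
    } finally {
      in.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pdf-text").getOrCreate()
    val files = spark.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")

    // Extraction runs on the executors; only the resulting text is
    // collected to the driver for printing.
    files.map(pdfRead).collect().foreach {
      case (path, Some(text)) => println(s"$path\nText:$text")
      case (path, None)       => println(s"$path is encrypted, skipped")
    }

    spark.stop()
  }
}
```

Returning the text from pdfRead instead of printing inside foreach keeps the output on the driver; with many or large PDFs, writing the results out with saveAsTextFile would likely be safer than collect().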
