Posted to users@pdfbox.apache.org by James Green <ja...@gmail.com> on 2013/11/14 16:29:26 UTC

Unable to get text from this pdf - why?

This was created by fairly convoluted means, but suffice it to say it should
still work.

https://www.dropbox.com/s/uaq5sqmlf88108p/sample-from-pdf.pdf

I created a document in LibreOffice Writer, exported it as a PDF, then
opened the PDF in Document Viewer (Evince, although Adobe Reader could also
be used). From there it is printed to a Java application via the Windows
PScript DLL. The Java app runs the received PostScript through Ghostscript
to produce a PDF, which is finally imported into PDFBox.

This used to work a few weeks ago, and we are unsure why it no longer does.
Printing an ODT directly from Writer into the Java app works fine.

This is using PDFBox 1.8.2.

Thanks,

James

Re: Unable to get text from this pdf - why?

Posted by James Green <ja...@gmail.com>.
Pretty much confirms our thoughts. Regrettably I don't have Acrobat (only
Reader) here but we did notice the loss of selectable text. Thanks for your
time.


On 14 November 2013 15:42, Gilad Denneboom <gi...@gmail.com> wrote:

> It seems that GS converted the text in the file to graphical elements. You
> can see it in Acrobat if you open the Contents panel, and you can also see
> that the text in the file is not selectable, and therefore can't be
> extracted.
> You'll need to look for a solution in GS. It has nothing to do with how
> PDFBox works, as there's just no text to read in that file.

Re: Unable to get text from this pdf - why?

Posted by Gilad Denneboom <gi...@gmail.com>.
It seems that GS converted the text in the file to graphical elements. You
can see it in Acrobat if you open the Contents panel, and you can also see
that the text in the file is not selectable, and therefore can't be
extracted.
You'll need to look for a solution in GS. It has nothing to do with how
PDFBox works, as there's just no text to read in that file.
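
To illustrate what "no text to read" means at the file level: a PDF page
that carries real text draws it with text-showing operators (Tj, TJ, ' and ")
in its content stream, while outlined text is drawn with path operators such
as m, l, c and f. The sketch below is a rough heuristic in plain Java, not
PDFBox code. It assumes you already have one decoded (uncompressed) content
stream as a string, and the class and method names are hypothetical. It
ignores inline images and Form XObjects, and it cannot tell whether a font
that is still referenced has a usable character mapping, so treat a result
of zero as a hint rather than proof.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: counts text-showing operators in one decoded content stream.
public class TextOperatorCheck {

    // PDF delimiter characters plus whitespace end a regular token.
    static boolean isDelimiter(char c) {
        return Character.isWhitespace(c) || "()<>[]{}/%".indexOf(c) >= 0;
    }

    // Splits the stream into tokens, skipping comments and string operands
    // so that text inside a string is never mistaken for an operator.
    static List<String> tokenize(String stream) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = stream.length();
        while (i < n) {
            char c = stream.charAt(i);
            if (Character.isWhitespace(c) || c == '[' || c == ']') {
                i++;
            } else if (c == '%') {
                // Comment: skip to end of line.
                while (i < n && stream.charAt(i) != '\n' && stream.charAt(i) != '\r') {
                    i++;
                }
            } else if (c == '(') {
                // Literal string: honour nested parentheses and backslash escapes.
                int depth = 0;
                do {
                    char s = stream.charAt(i);
                    if (s == '\\') {
                        i++; // skip the escaped character as well
                    } else if (s == '(') {
                        depth++;
                    } else if (s == ')') {
                        depth--;
                    }
                    i++;
                } while (i < n && depth > 0);
            } else if (c == '<' && !(i + 1 < n && stream.charAt(i + 1) == '<')) {
                // Hex string: skip to the closing '>'.
                while (i < n && stream.charAt(i) != '>') {
                    i++;
                }
                i++;
            } else {
                int start = i;
                while (i < n && !isDelimiter(stream.charAt(i))) {
                    i++;
                }
                if (i == start) {
                    i++; // lone delimiter such as '/', '<' or '>'
                }
                tokens.add(stream.substring(start, i));
            }
        }
        return tokens;
    }

    // Counts the four text-showing operators defined by the PDF specification.
    static int countTextShowingOperators(String stream) {
        int count = 0;
        for (String t : tokenize(stream)) {
            if (t.equals("Tj") || t.equals("TJ") || t.equals("'") || t.equals("\"")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String withText = "BT /F1 12 Tf 72 720 Td (Hello (nested) Tj world) Tj ET";
        String outlined = "0 0 m 10 0 l 10 10 l h f";
        System.out.println(countTextShowingOperators(withText)); // prints 1
        System.out.println(countTextShowingOperators(outlined)); // prints 0
    }
}
```

If every page stream of a file comes back with a count of zero, that is
consistent with the text having been converted to graphics before PDFBox
ever sees it, which would point the investigation at the Ghostscript step.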

