Posted to users@pdfbox.apache.org by Natalia Gómez García <na...@gmail.com> on 2012/09/09 11:13:10 UTC

Problems with Java PDFBox

Hello,

I am a computer science student and I'm using your PDFBox library in Java
to extract text from some PDF files.

In this project, I am having difficulty extracting the text from this
PDF: http://www.escet.urjc.es/alumnos/horarios/GR_Biologia_2012-13.pdf.
Specifically, I cannot extract the text "Semana del 3 al 7 de
Septiembre de 2012".

Why might this be happening? Could you please give me some pointers on how
to extract this text?

The code I'm using right now is the following:
// Load the PDF, extract all of its text, and always close the document.
PDDocument pdfDoc = PDDocument.load(url);
String texto;
try {
    texto = new PDFTextStripper().getText(pdfDoc);
} finally {
    pdfDoc.close();
}

Thanks for your attention
Natalia

Re: Problems with Java PDFBox

Posted by Gilad Denneboom <gi...@gmail.com>.
I believe it's because that text is set in a non-standard font,
"TTE1890348t00", which is only partially embedded in the file. Presumably
the embedded subset carries no usable mapping from its character codes to
Unicode, so a text extractor has nothing to translate them with.
You can see this for yourself by opening the file in Acrobat and copying
that text with the text selection tool: the result is just a string of
unreadable Unicode symbols. The other text in the file uses Arial or other
standard fonts, so it can be extracted without trouble.
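
If you want to detect this situation in your program rather than by eye,
one rough option is to measure how much of the extracted string consists of
characters that are unlikely to be readable text. The sketch below is not
part of PDFBox: the class name GarbledTextCheck and the method
suspiciousRatio are hypothetical, and the choice of which Unicode categories
count as "unreadable" (control, private-use, unassigned, lone surrogates,
and the replacement character U+FFFD) is an assumption that may need tuning
for other files. You would pass it the string returned by getText.

```java
public class GarbledTextCheck {

    /** Returns the fraction of code points that are unlikely to be readable text. */
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        // Walk the string by code point so that surrogate pairs count once.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Assumption: these categories rarely appear in correctly extracted text. */
    static boolean isSuspicious(int cp) {
        // Ordinary line breaks and tabs are control characters but harmless.
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false;
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || type == Character.SURROGATE
                || cp == 0xFFFD; // Unicode replacement character
    }

    public static void main(String[] args) {
        String readable = "Semana del 3 al 7 de Septiembre de 2012";
        // Made-up sample of the kind of output a font without a Unicode mapping can produce.
        String garbled = "\u0001\u0002\uE000\uE001\uFFFD";
        System.out.println(suspiciousRatio(readable)); // prints 0.0
        System.out.println(suspiciousRatio(garbled));  // prints 1.0
    }
}
```

A high ratio for a page or a line suggests that the font behind it cannot
be mapped back to text, in which case the extraction code is not at fault
and a different approach (for example OCR on a rendered page) may be needed.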
