Posted to users@pdfbox.apache.org by Piotr Rychlik <ol...@aster.pl> on 2010/04/06 12:42:25 UTC

Extracting plain text from PDF

Hi,

I have a problem extracting plain text from PDF documents that contain Polish characters.
I am using the following approach to extract the text:
 ......
 File f = new File(fileName);

 // Parse the raw PDF into a COS (low-level) document
 PDFParser parser = new PDFParser(new FileInputStream(f));
 parser.parse();

 // Wrap it in a PDDocument and extract the text
 COSDocument cosDoc = parser.getDocument();
 PDFTextStripper pdfStripper = new PDFTextStripper();
 PDDocument pdDoc = new PDDocument(cosDoc);
 String parsedText = pdfStripper.getText(pdDoc);
 ......

parsedText is then written to a file using UTF-8 encoding.
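
For reference, the write step can be sketched as below. This is only an illustration, not the code from the application: the class name Utf8TextWriter, the temporary file and the sample string are assumptions, and it relies on java.nio.file and StandardCharsets (Java 7 or later). The point is that the charset is passed explicitly instead of relying on the platform default, and that the file is read back to confirm the round trip.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of PDFBox.
final class Utf8TextWriter {

    // Writes the text with an explicit UTF-8 charset rather than the platform default.
    static void write(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the file back as UTF-8 so the round trip can be checked.
    static String read(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("parsedText", ".txt");
        // Stand-in for the string returned by PDFTextStripper.getText(...)
        String parsedText = "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144";
        write(out, parsedText);
        // Reports whether the text survived the write/read round trip unchanged
        System.out.println("round trip ok: " + parsedText.equals(read(out)));
        Files.delete(out);
    }
}
```

If the round trip succeeds for ordinary Polish text, the file-writing step is unlikely to be the source of the replaced characters.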

The above code works fine in most cases: text containing Polish characters is extracted correctly.
However, I have found a strange .pdf file for which the above method does not work. Polish characters are replaced by other characters, e.g. the Polish crossed l (ł) is replaced by %. Is there any way to fix this problem?
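
One way to narrow this down is to look at the code points of the extracted string while it is still in memory, before anything is written to disk. The sketch below is only a diagnostic aid under that assumption; CodePointDump is a hypothetical helper name and is not part of PDFBox. If the dump already shows U+0025 (%) where U+0142 (ł) is expected, the replacement happens during extraction and not while writing the file.

```java
// Hypothetical diagnostic helper, not part of PDFBox.
final class CodePointDump {

    // Formats every code point of the text as U+XXXX, separated by spaces.
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a short fragment of the extracted text
        String fragment = "\u0142%";
        System.out.println(dump(fragment));
    }
}
```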

Regards,
Piotr Rychlik


Re: Extracting plain text from PDF

Posted by Thomas Fischer <fi...@aon.at>.
Hi Piotr,

To extract text from PDF files I use the following command:

java -Xmx256m -classpath /Library/Java/Extensions/ org.apache.pdfbox.ExtractText -encoding UTF-8 "theFile" "newName"

Here the classpath points to the directory of my Java extensions, which includes PDFBox; "theFile" is the full path of the file to convert, and "newName" is the full path of the converted file. (This is part of a more elaborate setup that converts all PDF files in a given directory and saves them in an appropriate place, using one of several PDF converters.)
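
A directory-wide run of the same command could be sketched roughly as follows. This is not the actual setup, only a minimal illustration: the class name BatchExtractText, its helper methods and the three command-line arguments (input directory, output directory, classpath) are assumptions. Only the ExtractText invocation itself is taken from the command above.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the ExtractText command line shown above.
final class BatchExtractText {

    // Builds the ExtractText command for one input/output pair.
    static List<String> commandFor(String classpath, File pdf, File txt) {
        return Arrays.asList(
            "java", "-Xmx256m", "-classpath", classpath,
            "org.apache.pdfbox.ExtractText", "-encoding", "UTF-8",
            pdf.getPath(), txt.getPath());
    }

    // Maps "name.pdf" to "name.txt" in the output directory.
    static File outputFor(File pdf, File outDir) {
        String name = pdf.getName();
        String base = name.substring(0, name.length() - ".pdf".length());
        return new File(outDir, base + ".txt");
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: BatchExtractText <inDir> <outDir> <classpath>");
            return;
        }
        File inDir = new File(args[0]);
        File outDir = new File(args[1]);
        String classpath = args[2];
        // Only files whose name ends in .pdf (case-insensitive) are converted
        File[] pdfs = inDir.listFiles((dir, name) -> name.toLowerCase().endsWith(".pdf"));
        if (pdfs == null) {
            throw new IllegalArgumentException("not a directory: " + inDir);
        }
        for (File pdf : pdfs) {
            List<String> cmd = commandFor(classpath, pdf, outputFor(pdf, outDir));
            // Runs one JVM per file, mirroring the single-file command
            int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
            System.out.println(pdf.getName() + " -> exit code " + exit);
        }
    }
}
```

Starting a separate JVM per file is slow for large directories, but it keeps the sketch identical to the single-file command, so results are directly comparable.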

This usually works nicely with Polish letters. It would be interesting to take a look at your "strange .pdf file for which the above method does not work".

All the best
Thomas
