Posted to users@pdfbox.apache.org by Piotr Rychlik <ol...@aster.pl> on 2010/04/08 22:06:39 UTC

Extracting Polish characters

Hi,

I have a problem extracting plain text from PDF documents that contain Polish characters.
I am using the following approach to extract text:
 ......
 File f = new File(fileName);

 // Parse the raw PDF into a low-level COS document.
 PDFParser parser = new PDFParser(new FileInputStream(f));
 parser.parse();

 // Wrap the COS document and extract its text.
 COSDocument cosDoc = parser.getDocument();
 PDFTextStripper pdfStripper = new PDFTextStripper();
 PDDocument pdDoc = new PDDocument(cosDoc);
 String parsedText = pdfStripper.getText(pdDoc);
 ......

parsedText is then written to a file using UTF-8 encoding.
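
For reference, the write step could look roughly like the sketch below. It is not the code from this post: the class name Utf8TextWriter and the method writeUtf8 are hypothetical, and it assumes a Java version that provides StandardCharsets and try-with-resources. The point is only that the charset is named explicitly, so the platform default encoding cannot affect the output file.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8TextWriter {
    // Hypothetical helper: writes the text to the given path as UTF-8,
    // independent of the platform default charset.
    public static void writeUtf8(String path, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(path), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }
}
```

If the output file is written this way and the characters are still wrong, the substitution most likely happened earlier, during extraction.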

The above code works fine in most cases: text containing Polish characters is extracted correctly.
There are, however, some PDF files for which the above method does not work, and Polish characters are replaced. E.g. the Polish crossed l (ł) is replaced by %. Is there any way to fix this problem?

Regards,
Piotr Rychlik

Re: Extracting Polish characters

Posted by Villu Ruusmann <vi...@gmail.com>.

Hello there,

>
> I have a problem extracting plain text from PDF documents that contain Polish characters.
> I am using the following approach to extract text:
>  ......
>
> The above code works fine in most cases: text containing Polish characters is extracted correctly.
> There are, however, some PDF files for which the above method does not work, and Polish characters are replaced. E.g. the Polish crossed l (ł) is replaced by %. Is there any way to fix this problem?
>

Your code looks fine to me, so that shouldn't be the problem. I
suspect that PDFBox is unable to decode some characters (i.e. the
problematic Polish characters fall outside the common US-ASCII
character set), but we would need a sample PDF document in hand
to conduct a more thorough investigation.

Could you open a JIRA issue and attach a sample PDF document there?
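
It may also help to include in the issue exactly which code points come out of the extraction for a short affected passage. Below is a minimal sketch in plain Java, with no PDFBox calls; the class name CodePointDump and the method describe are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CodePointDump {
    // Hypothetical helper: returns one entry per code point in the text,
    // formatted as "U+XXXX" followed by the Unicode character name.
    public static List<String> describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X %s", cp, Character.getName(cp)))
                .collect(Collectors.toList());
    }
}
```

Running describe on a word that should contain ł shows whether the extracted string really holds U+0025 (the percent sign) at that position, or whether the character is correct in memory and only displayed or written incorrectly later.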


VR