Posted to users@pdfbox.apache.org by kirillkh <ki...@gmail.com> on 2011/06/07 05:59:40 UTC

PDFTextStripper: space characters inside words

Hi,

I've encountered two issues with PDFTextStripper and found (imperfect)
workarounds for both. Could one of the maintainers please take a look at
the issues and at my patch (which is admittedly pretty hackish)?
The patch is based on trunk, but I have only tested it with PDFBox 1.5.0.
https://github.com/kirillkh/pdfbox/commit/9a23c3956a96c276dfc677a0862c6954661b6d6a

1. With the attached document (I hope the mailing list accepts it; if not,
contact me and I'll send it to you directly), I'm seeing spaces interspersed
inside certain words (e.g., in the title on the second page). The document
is in Hebrew (RTL), which may or may not matter.

While I don't know exactly what the code is doing, I got the impression that
the problem is caused by zero-width space characters. It looks like the
document was produced by software that incorrectly specified the width of
every space character as 0 and also inserted such characters at random
places inside the document. (Does that make sense? In any case, that was my
impression.) I assume that a real PDF renderer just ignores such characters,
but PDFTextStripper outputs every one of them as text. I've managed to
modify the code so that these space characters are ignored (see the patch),
but chances are it is not the best solution.
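
To illustrate the idea only (this is not the code from the patch), here is a
minimal sketch of dropping space glyphs whose reported width is 0. The Glyph
class and both method names are hypothetical stand-ins I made up for this
example; they are not PDFBox's TextPosition or any other PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

class ZeroWidthSpaceFilter {
    /** Hypothetical stand-in for a positioned glyph (NOT a PDFBox class). */
    static final class Glyph {
        final String text;   // the character(s) this glyph maps to
        final float width;   // advance width reported by the document
        Glyph(String text, float width) {
            this.text = text;
            this.width = width;
        }
    }

    /** Returns a copy of the list without space glyphs of width 0. */
    static List<Glyph> dropZeroWidthSpaces(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean isSpace = " ".equals(g.text);
            if (isSpace && g.width == 0f) {
                continue; // assumed to be a bogus space, so skip it
            }
            kept.add(g);
        }
        return kept;
    }

    /** Concatenates the glyph texts in list order. */
    static String toText(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

Spaces with a non-zero width are kept, so genuine word gaps survive. Whether
an exact comparison with 0 is enough, or a small tolerance is needed, depends
on the document.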

2. (RTL-specific) After working around the first issue, I ran into another
one. In some cases, the zero-width space characters coincided with word
boundaries; once I removed them, PDFTextStripper fell back to using the
average character width to determine word boundaries. This resulted in
special WordSeparator positions being inserted where the spaces had been.
The problem is that the PDFTextStripper.normalize() method for some reason
splits the text on these word boundaries (instead of on line boundaries) to
perform visual-to-logical reordering. For some lines, this results in the
word order being reversed (the characters inside each word are in the
correct order, but the words themselves are in reverse order).
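
The effect can be reproduced with a toy example. This is a sketch under
strong assumptions: plain string reversal stands in for real visual-to-logical
reordering, Latin letters stand in for Hebrew, and the class and method names
are made up for illustration (none of this is PDFBox code).

```java
class ReorderScope {
    /** Naive stand-in for visual-to-logical reordering of pure RTL text. */
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** Reorders the whole line at once. */
    static String reorderPerLine(String visualLine) {
        return reverse(visualLine);
    }

    /** Splits on word boundaries first, then reorders each chunk separately. */
    static String reorderPerWord(String visualLine) {
        String[] chunks = visualLine.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chunks.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(reverse(chunks[i])); // chunk order itself is unchanged
        }
        return out.toString();
    }
}
```

If the logical text is "abc def", it is stored in visual order as "fed cba".
reorderPerLine("fed cba") gives "abc def", while reorderPerWord("fed cba")
gives "def abc": each word reads correctly, but the words come out in reverse
order.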

I solved this by outputting a space character for every WordSeparator
encountered by normalize(). Again, this worked for me with this document,
but I'm not sure it is the right way to go.
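
Roughly, the idea looks like the sketch below. It is not the patch itself:
Token, joinLine and toLogical are hypothetical names, each glyph is assumed
to be a single character, and plain reversal again stands in for the real
visual-to-logical reordering.

```java
import java.util.List;

class SeparatorAsSpace {
    /** Hypothetical token: either a glyph or an inferred word boundary. */
    static final class Token {
        final String text;           // glyph text; unused for separators
        final boolean wordSeparator; // true for an inferred word boundary
        private Token(String text, boolean wordSeparator) {
            this.text = text;
            this.wordSeparator = wordSeparator;
        }
        static Token glyph(String text) { return new Token(text, false); }
        static Token separator() { return new Token("", true); }
    }

    /** Builds one buffer per line, emitting a space for every separator. */
    static String joinLine(List<Token> tokens) {
        StringBuilder line = new StringBuilder();
        for (Token t : tokens) {
            line.append(t.wordSeparator ? " " : t.text);
        }
        return line.toString();
    }

    /** Reorders the joined line as a whole instead of word by word. */
    static String toLogical(List<Token> visualTokens) {
        return new StringBuilder(joinLine(visualTokens)).reverse().toString();
    }
}
```

Because the separators become ordinary spaces inside a single buffer, the
line is reordered as one unit and the words keep their logical order.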


-Kirill

Re: PDFTextStripper: space characters inside words

Posted by kirillkh <ki...@gmail.com>.
Hi,

I was wondering whether the mail I sent a month ago was received on this
list, since I haven't received any responses. (I guess it's possible it was
not received because it contained an attachment.) The original mail is
quoted below.

Thanks,
-Kirill

2011/6/7 kirillkh <ki...@gmail.com>

> Hi,
>
> I've encountered two issues with PDFTextStripper and discovered (imperfect)
> workarounds for both. Can anyone from the maintainers please take a look at
> the issues and at my patch (which is admittedly pretty hackish)?
> The patch is based off trunk, but I only tested it with PDFBox 1.5.0.
> https://github.com/kirillkh/pdfbox/commit/9a23c3956a96c276dfc677a0862c6954661b6d6a
>
> 1. With the attached document (I hope it will be accepted by the mailing
> list... If not, contact me, and I'll send it to you directly.), I'm seeing
> spaces interspersed inside certain words (e.g., in the second page's title.)
> The document is in Hebrew (RTL), which might or might not matter.
>
> While I don't know what exactly the code is doing, I got the impression
> that the problem is caused by zero-width space characters. Looks like the
> document was produced by software that incorrectly specified the width of
> every space character as 0 and also inserted them at random places inside
> the document. (Does that make any sense?.. In any case, that was my
> impression.) I assume that a real PDF renderer just ignores such characters,
> but PDFTextStripper outputs every such character as text. I've managed to
> modify the code in a way that makes these space characters be ignored (see
> the patch), but chances are it is not the best solution.
>
> 2. (RTL-specific) After working around the main issue, I've encountered
> another one. In some cases, the zero-width space characters coincided with
> word boundaries; since I removed them, PDFTextStripper switched to using the
> average character width to determine word boundaries. This resulted in
> special WordSeparator positions being inserted where spaces were before. The
> problem with that is the PDFTextStripper.normalize() method for some reason
> splits the text on these word boundaries (instead of splitting it on the
> line boundaries) to perform visual-to-logical reordering. For some lines,
> this results in words order being reversed (the characters inside words are
> in the correct order, the words are ordered in reverse).
>
> I solved this by outputting a space character for every WordSeparator
> encountered by normalize(). Again, this worked for me with this document,
> but I'm not sure that is the right way to go.
>
>
> -Kirill
>