Posted to dev@pdfbox.apache.org by "Timo Boehme (JIRA)" <ji...@apache.org> on 2016/02/26 09:47:18 UTC
[jira] [Commented] (PDFBOX-3248) Unwanted spaces in text extraction (2)

    [ https://issues.apache.org/jira/browse/PDFBOX-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168670#comment-15168670 ] 

Timo Boehme commented on PDFBOX-3248:
-------------------------------------

For our text extraction we generally ignore the space characters and decide on word breaks from the actual character spacing. We also split text chunks (e.g. 'lla ' into 'l', 'l', 'a') if a larger character spacing is defined.
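
A minimal sketch of that idea, not PDFBox code: the Glyph class, the extract method, the gapFactor threshold and all coordinates below are invented for illustration. Encoded spaces are dropped, and a space is emitted only when the horizontal gap between two neighbouring glyphs exceeds a fraction of the previous glyph's width.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionBasedSpacing {

    /** Hypothetical stand-in for a positioned glyph; this is not a PDFBox class. */
    static final class Glyph {
        final String text;
        final double x;      // left edge of the glyph
        final double width;  // advance width of the glyph

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Rebuilds a line of text from glyph positions only.
     * gapFactor is an assumed tuning knob: a gap wider than
     * gapFactor * (previous glyph width) counts as a word break.
     */
    static String extract(List<Glyph> glyphs, double gapFactor) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (" ".equals(g.text)) {
                continue; // ignore encoded spaces entirely
            }
            if (previous != null) {
                double gap = g.x - (previous.x + previous.width);
                if (gap > gapFactor * previous.width) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            previous = g;
        }
        return out.toString();
    }

    /** Made-up geometry: spurious spaces inside words, one real gap between them. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("C", 0, 5));
        glyphs.add(new Glyph("a", 5, 5));
        glyphs.add(new Glyph(" ", 10, 1));  // spurious space, overlapped by 'd'
        glyphs.add(new Glyph("d", 10, 5));
        glyphs.add(new Glyph("a", 15, 5));
        glyphs.add(new Glyph(" ", 20, 2));
        glyphs.add(new Glyph("f", 23, 5));  // real gap of 3 units before 'f'
        glyphs.add(new Glyph("r", 28, 5));
        glyphs.add(new Glyph("a", 33, 5));
        glyphs.add(new Glyph("s", 38, 5));
        glyphs.add(new Glyph(" ", 43, 1));  // spurious space, overlapped by 'c'
        glyphs.add(new Glyph("c", 43, 5));
        glyphs.add(new Glyph("o", 48, 5));
        return glyphs;
    }

    public static void main(String[] args) {
        System.out.println(extract(sample(), 0.3)); // prints "Cada frasco"
    }
}
```

A real implementation would also have to handle line changes, rotated text and right-to-left runs, which this sketch leaves out.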

> Unwanted spaces in text extraction (2)
> --------------------------------------
>
>                 Key: PDFBOX-3248
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3248
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.11, 2.0.0
>            Reporter: Tilman Hausherr
>         Attachments: PDFBOX-3248-spaces.pdf
>
>
> The attached file provided by Francisco from the user mailing list has spaces in text extraction regardless of setting spacingTolerance or averageCharTolerance. I was unable to extract "Cada frasco ampolla" which looked straightforward in rendering, but it always appeared as "Ca da fras co ampo lla". Adobe Reader has no such problem.
> The content stream has this:
> {code}
>      6 0 1.058 6 122.0924 312.51 Tm
>      (Ca) Tj
>      /Span << /ActualText (\376\377\000\255) >> BDC
>        ( ) Tj
>      EMC
>      [ (da ) -301 (fras) ] TJ
>      /Span << /ActualText (\376\377\000\255) >> BDC
>        ( ) Tj
>      EMC
>      [ (co ) -301 (ampo) ] TJ
>      /Span << /ActualText (\376\377\000\255) >> BDC
>        ( ) Tj
>      EMC
>      [ (lla ) -301 (con) ] TJ
> {code}
> So there are really spaces there, and we keep them. Adobe is smarter, and ignores them because they are overwritten thanks to the "-301" backwards positioning.
> Would /ActualText help? However, it is always the same value here...
> Would it help to ignore spaces and decide based on positions only, maybe as an option? I added these two lines below the first existing one:
> {code}
>                 String characterValue = position.getUnicode();
>                 // Skip encoded spaces so that word breaks are decided from positions only
>                 if (" ".equals(characterValue))
>                     continue;
> {code}
> The output looks promising:
> {quote}
> F ó r m u l a :
> Cronopen® Balsámico Adultos:
> Cada frasco ampolla contiene: ampicilina (como ampicilina sódica)
> 100 mg; ampicilina (como ampicilina benzatínica) 500 mg.
> Cada ampolla solvente de 5 ml contiene: dipirona 1000 mg; guaife
> nesina 100 mg. Exc.: bisulfito de sodio; agua destilada.
> {quote}
> A complete test run shows many differences; most are harmless or are improvements. Only one test case really fails, hello3.pdf. Original extract is "Hello محمد World.", new extract is "Hello .Worldمحمد".
> More from Francisco:
> {quote}
> As additional information, I've found 2 related posts (about other tools)
> on Stack Overflow:
> http://stackoverflow.com/questions/34579824/itext-how-to-tweak-text-extraction
> http://stackoverflow.com/questions/22671974/itext-reading-pdf-1s-as-up-arrows-error/22688775#22688775
> {quote}
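
For reference, the /ActualText value in the quoted content stream can be decoded by hand: the octal escapes \376\377 are the UTF-16BE byte order mark (0xFE 0xFF), and \000\255 is the single code unit 0x00AD, a soft hyphen. The sketch below shows that decoding; the class and method names are made up for illustration, and the fallback branch uses ISO-8859-1 only as a rough stand-in for PDFDocEncoding.

```java
import java.nio.charset.StandardCharsets;

public class ActualTextDecode {

    /** Decodes a PDF text string; handles the UTF-16BE case marked by a FE FF prefix. */
    static String decodePdfTextString(byte[] bytes) {
        if (bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE
                && (bytes[1] & 0xFF) == 0xFF) {
            // Skip the two BOM bytes and decode the rest as UTF-16BE
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 as an approximation of PDFDocEncoding
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Bytes of (\376\377\000\255), written as Java octal literals
        byte[] raw = {(byte) 0376, (byte) 0377, (byte) 0000, (byte) 0255};
        String text = decodePdfTextString(raw);
        System.out.printf("U+%04X%n", (int) text.charAt(0)); // prints "U+00AD"
    }
}
```

So each of the three /Span blocks replaces its space with the same soft hyphen, which is why the /ActualText value never varies in this stream.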



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
