Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2016/09/15 16:27:20 UTC

[jira] [Resolved] (PDFBOX-3498) Unexpected spaces in text extraction

     [ https://issues.apache.org/jira/browse/PDFBOX-3498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-3498.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

> Unexpected spaces in text extraction
> ------------------------------------
>
>                 Key: PDFBOX-3498
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3498
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.2, 2.0.3
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>             Fix For: 2.0.4, 2.1.0
>
>         Attachments: PDFBOX-3498-Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.pdf
>
>
> In the tests by [~tallison@apache.org], regressions were found with files from Delaware courts; see the reduced file attached.
> The extracted text with 2.0.2 and 2.0.3 is
> {code}
> IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE
> {code}
> in 2.0.1 and 1.8 it was
> {code}
> IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE
> {code}
> The cause is the /W ranges table:
> {code}
> /W [1 1 0 2 3 250 4 10 0 11
> 12 333 13 14 0 15 15 250 16 16
> 333 17 17 250 18 18 277 19 19 0
> 20 23 500 24 35 0 36 36 722 37
> 37 666 38 39 722 40 40 666 41 41
> 610 42 43 777 44 44 389 45 45 0
> 46 46 777 47 47 666 48 48 943 49
> 49 722 50 50 777 51 51 610 52 52
> 0 53 53 722 54 54 556 55 55 666
> 56 57 722 59 59 0 60 60 722 61
> 67 0 68 68 500 69 69 556 70 70
> 443 71 71 556 72 72 443 73 73 333
> 74 74 500 75 75 556 76 76 277 77
> 77 0 78 78 556 79 79 277 80 80
> 833 81 81 556 82 82 500 83 84 556
> 85 85 443 86 86 389 87 87 333 88
> 88 556 89 89 0 90 90 722 91 92
> 500 93 178 0 179 180 500 181 181 0
> 182 182 333 183 751 0 752 752 198 753
> 794 0 795 795 612 796 1126 0 1127 1127
> 125 1129 1129 2000 1130 65534 0]
> {code}
> For text extraction, the width of a space is calculated by taking the average of all widths. However, these files have a lot (over 60000) of widths that are 0, which drags that average down. So I'll just ignore widths <= 0, as is already done in PDFont.
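
To illustrate the effect described above, here is a minimal sketch in Java. It is not the actual PDFBox code: the class and method names are hypothetical, the array is a small excerpt of the /W table from the report, and it assumes that every character code in a "first last width" range counts as one entry in the average.

```java
class SpaceWidthSketch {

    /**
     * Averages the widths of a /W array given as "first last width" triplets.
     * Hypothetical helper for illustration only, not a PDFBox API.
     * Assumption: each code in a range contributes one entry to the average.
     */
    static double averageWidth(int[] w, boolean skipNonPositive) {
        long count = 0;
        double total = 0;
        for (int i = 0; i + 2 < w.length; i += 3) {
            int first = w[i];
            int last = w[i + 1];
            int width = w[i + 2];
            if (skipNonPositive && width <= 0) {
                continue; // ignore zero (or negative) widths
            }
            long codes = (long) last - first + 1; // number of codes in the range
            total += (double) width * codes;
            count += codes;
        }
        return count == 0 ? 0 : total / count;
    }

    public static void main(String[] args) {
        // Excerpt of the reported table: four real glyph widths
        // followed by the large zero-width range at the end.
        int[] w = { 36, 36, 722, 37, 37, 666, 38, 39, 722, 1130, 65534, 0 };

        System.out.println("all widths:      " + averageWidth(w, false));
        System.out.println("widths > 0 only: " + averageWidth(w, true));
    }
}
```

With this excerpt, the average over all entries is far below 1, because the 1130 to 65534 range alone adds more than 64000 zero entries, while the average over positive widths only is 708.0. An estimate that small would presumably make ordinary gaps between glyphs look like word breaks, which fits the extra spaces in the 2.0.2 and 2.0.3 output.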



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)