Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/06/17 22:20:07 UTC

[jira] [Closed] (PDFBOX-1153) Use dictionary lookups to increase text extraction accuracy

     [ https://issues.apache.org/jira/browse/PDFBOX-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-1153.
-------------------------------

    Resolution: Won't Fix

This won't work in practice: there are too many non-dictionary words out there.

> Use dictionary lookups to increase text extraction accuracy
> -----------------------------------------------------------
>
>                 Key: PDFBOX-1153
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1153
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>            Reporter: Jukka Zitting
>
> There are still some cases where the text extraction code incorrectly inserts spaces inside words extracted from a PDF document. We could increase extraction accuracy with an optional dictionary lookup mechanism that checks each extracted word or token against a dictionary of common words. If the lookup fails (and the amount of empty space after the token is small), the token is concatenated with the next one. If that concatenated token matches a word in the dictionary, the intervening space can very likely be dropped.
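
The lookup-and-merge step proposed in the issue could be sketched roughly as follows. This is only an illustration and not PDFBox code: the class DictionaryTokenMerger, the merge method, the gapsAfter list and the maxGap threshold are all hypothetical names, and the sketch assumes the extractor can report the width of the space after each token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of PDFBox.
public class DictionaryTokenMerger {

    /**
     * gapsAfter.get(i) is the assumed width of the space between
     * tokens.get(i) and tokens.get(i + 1), so it holds one entry
     * fewer than tokens.
     */
    public static List<String> merge(List<String> tokens, List<Float> gapsAfter,
                                     Set<String> dictionary, float maxGap) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String token = tokens.get(i);
            boolean hasNext = i + 1 < tokens.size();
            // Only try a merge when the lookup fails and the gap is small.
            if (hasNext
                    && !isWord(token, dictionary)
                    && gapsAfter.get(i) <= maxGap) {
                String joined = token + tokens.get(i + 1);
                if (isWord(joined, dictionary)) {
                    // The joined token is a known word: drop the space.
                    result.add(joined);
                    i += 2;
                    continue;
                }
            }
            // Otherwise keep the token unchanged.
            result.add(token);
            i++;
        }
        return result;
    }

    // Case-insensitive lookup; the dictionary is assumed to be lower-case.
    private static boolean isWord(String s, Set<String> dictionary) {
        return dictionary.contains(s.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("text", "extraction", "is", "hard");
        List<String> tokens = List.of("text", "extrac", "tion", "is", "hard");
        List<Float> gaps = List.of(3.0f, 0.4f, 3.0f, 3.0f);
        // prints [text, extraction, is, hard]
        System.out.println(merge(tokens, gaps, dictionary, 1.0f));
    }
}
```

The sketch joins at most two adjacent tokens and leaves any token alone when neither it nor the joined form is in the dictionary, so names, abbreviations and other non-dictionary words pass through with their spacing unchanged.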



--
This message was sent by Atlassian JIRA
(v6.2#6252)