Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (Jira)" <ji...@apache.org> on 2021/02/01 07:26:00 UTC

[jira] [Resolved] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

     [ https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-5090.
----------------------------------------
    Resolution: Fixed

[~tilman] Thanks for the double check

> Missing text extraction under certain conditions starting with apache pdfbox 2.0.18
> -----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5090
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5090
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22
>         Environment: jdk 1.8, apache pdfbox, fontbox 2.0.18~, windows 10
>            Reporter: sungwon kim
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>              Labels: regression
>             Fix For: 2.0.23, 3.0.0 PDFBox
>
>         Attachments: 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf, 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.txt, 128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, PDFBOX-3442-DirectResources.pdf, PDFBOX-5090_reduced.pdf, textstripper_2.0.17_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, textstripper_2.0.17_独立財政機関をめぐる論点整理_3p_top.PNG, textstripper_2.0.18_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, textstripper_2.0.18_独立財政機関をめぐる論点整理_3p_top.PNG, 独立財政機関をめぐる論点整理.pdf, 独立財政機関をめぐる論点整理_3p_top.PNG
>
>
> When calling PDFTextStripper.getText() with PDFBox 2.0.18 or later, some text is not extracted under certain conditions.
> I suspect that the missing text is related to the font type, the font size, or the width and height of the text.
> I have attached the text extraction results from versions 2.0.17 and 2.0.18, along with the sample files used for the test.
> Code:
>  
> {code:java}
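> PDDocument pdDocument = PDDocument.load(new File(path));
> PDFTextStripper stripper = new PDFTextStripper();
> String text = stripper.getText(pdDocument); // the call whose result is incomplete on 2.0.18+
> pdDocument.close();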
> {code}
> Dependencies:
>  
> {code:xml}
> <properties>
>     <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
> </properties>
> <dependencies>
>     <dependency>
>         <groupId>org.apache.pdfbox</groupId>
>         <artifactId>pdfbox</artifactId>
>         <version>${apache.pdfbox.version}</version>
>     </dependency>
>     <dependency>
>         <groupId>org.apache.pdfbox</groupId>
>         <artifactId>fontbox</artifactId>
>         <version>${apache.pdfbox.version}</version>
>     </dependency>
>     <dependency>
>         <groupId>org.apache.pdfbox</groupId>
>         <artifactId>xmpbox</artifactId>
>         <version>${apache.pdfbox.version}</version>
>     </dependency>
> </dependencies>
> {code}
>  
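
One way to narrow down which text went missing between the attached 2.0.17 and 2.0.18 extraction results is to compare the two outputs line by line. The following is only a minimal sketch under stated assumptions: the class name ExtractionDiff and the method missingLines are hypothetical, the two strings in main are placeholders rather than content from the attached files, and the code uses only the JDK, not any PDFBox API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox.
public class ExtractionDiff {

    // Returns the non-blank lines of `before` (e.g. the 2.0.17 output) that do
    // not occur anywhere in `after` (e.g. the 2.0.18 output), in original order.
    public static List<String> missingLines(String before, String after) {
        Set<String> afterLines = new HashSet<>();
        for (String line : after.split("\\R")) {
            afterLines.add(line.trim());
        }
        List<String> missing = new ArrayList<>();
        for (String line : before.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !afterLines.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder strings standing in for the two saved extraction results.
        String from2017 = "Title\nFirst paragraph\nFootnote text\n";
        String from2018 = "Title\nFirst paragraph\n";
        System.out.println(missingLines(from2017, from2018));
    }
}
```

Comparing whole trimmed lines is a deliberately coarse choice: it flags lines that disappeared entirely, but it will also flag lines that merely changed, so the result is a starting point for inspection rather than a precise diff.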



--
This message was sent by Atlassian Jira
(v8.3.4#803005)