Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (Jira)" <ji...@apache.org> on 2021/02/01 07:26:00 UTC
[jira] [Resolved] (PDFBOX-5090) Missing text extraction under
certain conditions starting with apache pdfbox 2.0.18
[ https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-5090.
----------------------------------------
Resolution: Fixed
[~tilman] Thanks for the double check
> Missing text extraction under certain conditions starting with apache pdfbox 2.0.18
> -----------------------------------------------------------------------------------
>
> Key: PDFBOX-5090
> URL: https://issues.apache.org/jira/browse/PDFBOX-5090
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22
> Environment: jdk 1.8, apache pdfbox, fontbox 2.0.18~, windows 10
> Reporter: sungwon kim
> Assignee: Andreas Lehmkühler
> Priority: Major
> Labels: regression
> Fix For: 2.0.23, 3.0.0 PDFBox
>
> Attachments: 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf, 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.txt, 128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, PDFBOX-3442-DirectResources.pdf, PDFBOX-5090_reduced.pdf, textstripper_2.0.17_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, textstripper_2.0.17_独立財政機関をめぐる論点整理_3p_top.PNG, textstripper_2.0.18_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, textstripper_2.0.18_独立財政機関をめぐる論点整理_3p_top.PNG, 独立財政機関をめぐる論点整理.pdf, 独立財政機関をめぐる論点整理_3p_top.PNG
>
>
> When calling PDFTextStripper.getText() on PDFBox 2.0.18 or later, some text is not extracted under certain conditions.
> The missing text appears to be related to the font type, the font size, or the width and height of the text.
> I have attached the text extraction results of version 2.0.17 and version 2.0.18 and the sample data used for the test.
> code
>
> {code:java}
> try (PDDocument pdDocument = PDDocument.load(new File(path))) {
>     PDFTextStripper stripper = new PDFTextStripper();
>     String text = stripper.getText(pdDocument); // incomplete on 2.0.18 and later
> }
> {code}
> dependencies
>
> {code:xml}
> <properties>
>   <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
> </properties>
> <dependencies>
>   <dependency>
>     <groupId>org.apache.pdfbox</groupId>
>     <artifactId>pdfbox</artifactId>
>     <version>${apache.pdfbox.version}</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.pdfbox</groupId>
>     <artifactId>fontbox</artifactId>
>     <version>${apache.pdfbox.version}</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.pdfbox</groupId>
>     <artifactId>xmpbox</artifactId>
>     <version>${apache.pdfbox.version}</version>
>   </dependency>
> </dependencies>
> {code}
>
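The report compares the text output of 2.0.17 with that of 2.0.18. One way to narrow down which lines went missing is to diff the two saved outputs. The following is only a minimal sketch under stated assumptions: ExtractionDiff and missingLines are hypothetical names, not part of PDFBox, and the two inputs are assumed to be the getText() results saved as UTF-8 text files. It compares trimmed lines only, so text that merely moved to a different line is also reported.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not a PDFBox class): lists lines found in one extraction run but not in another. */
class ExtractionDiff {

    /** Returns the non-blank, trimmed lines of baseline that do not occur in candidate, in baseline order. */
    static List<String> missingLines(String baseline, String candidate) {
        // Collect every trimmed line of the newer output for fast lookup.
        Set<String> present = new HashSet<>();
        for (String line : candidate.split("\\R")) {
            present.add(line.trim());
        }
        // Walk the older output and keep the lines the newer output lacks.
        List<String> missing = new ArrayList<>();
        for (String line : baseline.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !present.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ExtractionDiff <baseline.txt> <candidate.txt>");
            return;
        }
        // Assumption: both files hold getText() output saved as UTF-8.
        String baseline = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String candidate = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        for (String line : missingLines(baseline, candidate)) {
            System.out.println(line);
        }
    }
}
```

Running it with the 2.0.17 output as the first argument and the 2.0.18 output as the second prints the lines that only the older version extracted.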
--
This message was sent by Atlassian Jira
(v8.3.4#803005)