Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2020/02/05 04:35:00 UTC
[jira] [Commented] (PDFBOX-4764) When a PDF has table with blank
entries in the column the stripper just ignores the column and moves to
next field in the column
[ https://issues.apache.org/jira/browse/PDFBOX-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030350#comment-17030350 ]
Tilman Hausherr commented on PDFBOX-4764:
-----------------------------------------
This isn't a bug. This is a text extraction tool and if there is no text, not even blanks, then there is nothing to extract.
PDF isn't like HTML where there is a TABLE syntax. What you, as a human, see as a "table" is just vector graphics.
If you want to extract tables, use a tool built for that, e.g. Tabula. Or follow the ExtractTextByArea example, passing the coordinates of your table cells.
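As a rough illustration of the coordinate-based approach, the sketch below builds one named rectangle per table cell from column and row boundaries. It uses only the JDK; the class name TableCellRegions, the method cellRegions, the "r0c1" naming scheme, and the boundary values are all hypothetical, and real boundaries would have to be measured from the PDF in question.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
class TableCellRegions {

    // Builds one named rectangle per cell from ascending column and row
    // boundaries. Names follow the pattern "r<row>c<column>", zero-based.
    static Map<String, Rectangle2D> cellRegions(double[] colEdges, double[] rowEdges) {
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        for (int r = 0; r + 1 < rowEdges.length; r++) {
            for (int c = 0; c + 1 < colEdges.length; c++) {
                double x = colEdges[c];
                double y = rowEdges[r];
                double width = colEdges[c + 1] - x;
                double height = rowEdges[r + 1] - y;
                regions.put("r" + r + "c" + c,
                        new Rectangle2D.Double(x, y, width, height));
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        // Assumed boundaries for a 2 x 2 table; measure real ones from the PDF.
        double[] colEdges = {50, 150, 300};
        double[] rowEdges = {100, 120, 140};
        cellRegions(colEdges, rowEdges).forEach((name, rect) ->
                System.out.println(name + " -> " + rect));
    }
}
```

Each entry could then be registered with PDFTextStripperByArea.addRegion(name, rect). After extractRegions(page), getTextForRegion(name) returns the text found inside that cell. A blank cell should come back as an empty or whitespace-only string instead of being skipped, so the column position is preserved.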
> When a PDF has table with blank entries in the column the stripper just ignores the column and moves to next field in the column
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-4764
> URL: https://issues.apache.org/jira/browse/PDFBOX-4764
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.8
> Reporter: karthik guns
> Priority: Major
>
> When a PDF has tables whose columns contain empty values, the stripper skips the empty field and moves on to the next column that has a value (a blank field should still be captured).
>
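> // 'document' is assumed to be an already loaded PDDocument.
> // Note: this PDFTextStripperByArea is configured but never used below.
> PDFTextStripperByArea stripper = new PDFTextStripperByArea();
> stripper.setSortByPosition(true);
> // The text actually comes from a plain PDFTextStripper with default settings.
> PDFTextStripper tStripper = new PDFTextStripper();
> String pdfFileInText = tStripper.getText(document);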