Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/06/01 18:04:04 UTC

[jira] [Updated] (PDFBOX-3804) Detect end of paragraphs

     [ https://issues.apache.org/jira/browse/PDFBOX-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3804:
------------------------------------
    Attachment: PDFBOX-3804-noimage.pdf
                PDFBOX-3804-singlespaced.pdf
                PDFBOX-3804-115spaced.pdf

I had another look at your file. It is a scanned image with invisible OCRed text. I have attached page 3 without the image so that the glyphs are now visible. The font "Code 2000" was not read properly in 1.8 and was substituted with another font; this may be the cause.

Another thing I observed is that the spacing between lines varies because the font sizes differ. If you open the file with Adobe Reader and select all, you will see that the space between "The Internet has reshaped" and the line above is smaller than the space between "patterns that might help" and the line above. This may be why PDFBox thinks that "patterns that might help" starts a new paragraph.
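
To illustrate how uneven line gaps can produce a false paragraph break, here is a toy model. It is not PDFBox's actual code: the class, the method, and every number are made up, and the only assumption is that a break is reported when the vertical gap between two lines exceeds some multiple of the line height.

```java
/** Toy model of a gap-based paragraph test; all names and values are hypothetical. */
final class LineGapSketch {

    /**
     * Returns true when the vertical distance between two consecutive
     * baselines is larger than dropThreshold times the line height.
     */
    static boolean looksLikeParagraphBreak(float previousBaselineY,
                                           float currentBaselineY,
                                           float lineHeight,
                                           float dropThreshold) {
        float gap = Math.abs(currentBaselineY - previousBaselineY);
        return gap > dropThreshold * lineHeight;
    }

    public static void main(String[] args) {
        float lineHeight = 10f;   // assumed glyph height
        float threshold = 1.25f;  // assumed multiplier, not a PDFBox default

        // Two lines of the same paragraph, normal gap of 11 units.
        System.out.println(looksLikeParagraphBreak(700f, 689f, lineHeight, threshold));
        // Same paragraph, but OCR reported a slightly different size,
        // so the gap grew to 14 units and crosses the threshold.
        System.out.println(looksLikeParagraphBreak(689f, 675f, lineHeight, threshold));
    }
}
```

With a rule of this kind, a line that merely sits a little lower than its neighbours would be treated as the start of a new paragraph, even though nothing in the text itself changed.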

So your file is not really a good test case...

I tried to create a new file to find out whether it works with "clean" text. It works with single-spaced lines, but not with 1.15-spaced lines.

> Detect end of paragraphs
> ------------------------
>
>                 Key: PDFBOX-3804
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3804
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.6, 2.0.7, 3.0.0
>            Reporter: Alexandre
>              Labels: extraction, paragraph
>         Attachments: example.pdf, PDFBOX-3804-115spaced.pdf, PDFBOX-3804-noimage.pdf, PDFBOX-3804-singlespaced.pdf
>
>
> Hi,
> Extracting text by paragraph is probably the improvement most eagerly awaited by PDFBox users.
> *The current text extraction approach correctly detects the end of lines, but it does not correctly detect the end of paragraphs.*
> *What is a paragraph?* A paragraph is a piece of text that contains one or more sentences. It can start with a tabulation (indent), but this is not mandatory. A paragraph spans one or more lines but contains no carriage return (except the one at its very end). A paragraph can end before the end of a line, but some paragraphs end exactly at the end of a line. If a paragraph ends exactly at the end of a line, no further lines containing words follow it.
> *So the last line of a paragraph ends before reaching the end of the line, unless no lines containing words follow it.* Do you follow me? +An algorithm could use that pattern to detect paragraphs properly.+
> In my opinion, the algorithm should use the following information:
> (*) the +width of the block+ containing the paragraph;
> (*) the precomputed width of the +first word in the next line+.
> The +width of a block+ refers to the width of the area containing the line that holds the character the algorithm is evaluating at any given step.
> The algorithm runs on every character, and when it reaches the +last character of a line+, it precomputes +the first word of the next line+ to get its width.
> If +this word+ fits in the previous line after the +last character+, then the algorithm concludes that this is the end of a paragraph (*case 1*).
> If there is no +next word+, then this is also the end of the paragraph (*case 2*).
> If there is a tabulation before the +next word+, then this is also the end of the paragraph (*case 3*).
> If the +last character+ is far from the end of the block, we automatically conclude that this is the end of a paragraph (*case 4, optional*).
> Cheers,
> A.
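
The heuristic proposed in the issue description (cases 1 to 3) could be sketched roughly as follows. This is only an illustration under stated assumptions: the Line class, the block bounds, the space width and the indent tolerance are all hypothetical and are not PDFBox types or settings. The optional case 4 is left out because it needs an extra "far from the end" cutoff that the description does not specify.

```java
/** Sketch of the proposed end-of-paragraph test; every name here is hypothetical. */
final class ParagraphEndSketch {

    /** Minimal description of one extracted line of text. */
    static final class Line {
        final float left;           // x where the first glyph starts
        final float right;          // x where the last glyph ends
        final float firstWordWidth; // width of the first word on this line

        Line(float left, float right, float firstWordWidth) {
            this.left = left;
            this.right = right;
            this.firstWordWidth = firstWordWidth;
        }
    }

    /**
     * Decides whether "current" is the last line of a paragraph.
     * "next" is the following line in the same block, or null if there is none.
     */
    static boolean endsParagraph(Line current, Line next,
                                 float blockLeft, float blockRight,
                                 float spaceWidth, float indentTolerance) {
        // Case 2: no next word at all.
        if (next == null) {
            return true;
        }
        // Case 3: the next line starts with a tabulation (indent).
        if (next.left - blockLeft > indentTolerance) {
            return true;
        }
        // Case 1: the first word of the next line, plus a separating space,
        // would have fit in the room left at the end of the current line.
        float freeSpace = blockRight - current.right;
        return next.firstWordWidth + spaceWidth <= freeSpace;
    }

    public static void main(String[] args) {
        Line fullLine = new Line(0f, 98f, 30f);
        Line shortLine = new Line(0f, 40f, 30f);
        Line flushNext = new Line(0f, 95f, 30f);
        // Block spans x = 0..100, a space is 3 units wide, indent tolerance is 5.
        System.out.println(endsParagraph(fullLine, flushNext, 0f, 100f, 3f, 5f));
        System.out.println(endsParagraph(shortLine, flushNext, 0f, 100f, 3f, 5f));
    }
}
```

Note that a test like this depends only on horizontal positions, so it would not be affected by the line spacing at all; whether that holds up on real files (justified text, hyphenation, multi-column layouts) would still need to be checked.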


