Posted to dev@pdfbox.apache.org by "Alexandre (JIRA)" <ji...@apache.org> on 2017/05/22 14:51:04 UTC

[jira] [Updated] (PDFBOX-3804) Detect end of paragraphs

     [ https://issues.apache.org/jira/browse/PDFBOX-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexandre updated PDFBOX-3804:
------------------------------
    Description: 
Hi,

Extracting text by paragraph is probably one of the improvements PDFBox users are most looking forward to.


*The current text extraction approach correctly detects ends of lines, but it does not correctly detect ends of paragraphs.* Indeed, a carriage return character is added after every line.

*What is a paragraph?* A paragraph is a block of text that contains one or more sentences. It can start with a tab (an indentation), but this is not mandatory. A paragraph spans one or more lines but contains no carriage return (except the one at its very end). The last line of a paragraph usually stops before the right edge of the block, although some paragraphs end exactly at the edge.

*So, the last line of a paragraph usually ends before reaching the very end of the line.* +We could use that pattern to detect paragraphs properly.+

In my opinion, the algorithm should never add a carriage return except when an end of paragraph is detected.

The algorithm should use the following information:
(*) the width of the block containing the paragraph;
(*) the padding and margins of that block;
(*) the precomputed width of the next word.

The width of a block is the width of the page minus the left and right margins if the PDF has one column, or the width of a single column if the PDF has several columns (two, for example).
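
To make the block width concrete, here is a minimal sketch of that computation. It is only an illustration: the class name BlockWidth, the method columnWidth, and the gutter parameter (the space between columns) are my own assumptions, and it supposes that all columns have the same width. It does not use any PDFBox API.

```java
final class BlockWidth {

    /**
     * Usable text width of one column, in the same unit as the inputs
     * (for example PDF points). Assumes equally wide columns separated
     * by a fixed gutter; both are assumptions of this sketch.
     */
    static float columnWidth(float pageWidth, float leftMargin, float rightMargin,
                             int columns, float gutter) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        // Width left once both page margins are removed.
        float usable = pageWidth - leftMargin - rightMargin;
        // There is one gutter between each pair of adjacent columns.
        float totalGutter = gutter * (columns - 1);
        return (usable - totalGutter) / columns;
    }
}
```

With a single column the gutter term is zero, so the result is simply the page width minus both margins, as described above.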

The algorithm runs over every character, and when it reaches the last character of a line, it precomputes the width of the first word of the next line. If this word would have fit at the end of the previous line, then we conclude that the previous line is the end of a paragraph.
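
The heuristic could be sketched roughly as follows. This is not PDFBox code: ParagraphDetector, Line, endsParagraph and toParagraphs are hypothetical names, and the sketch assumes the line widths, first-word widths and the width of a space have already been measured (in PDFBox they might be gathered from the TextPosition objects that PDFTextStripper hands to writeString, but that wiring is left out). It works per line rather than per character and ignores hyphenated words.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphDetector {

    /** Hypothetical container: one extracted line plus its measured widths. */
    static final class Line {
        final String text;
        final float width;          // rendered width of the whole line
        final float firstWordWidth; // rendered width of the line's first word

        Line(String text, float width, float firstWordWidth) {
            this.text = text;
            this.width = width;
            this.firstWordWidth = firstWordWidth;
        }
    }

    /** True if the next line's first word would have fit at the end of the current line. */
    static boolean endsParagraph(Line current, Line next, float blockWidth, float spaceWidth) {
        if (next == null) {
            return true; // the last line of the block always closes a paragraph
        }
        return current.width + spaceWidth + next.firstWordWidth <= blockWidth;
    }

    /** Joins lines with spaces and closes a paragraph only where endsParagraph holds. */
    static List<String> toParagraphs(List<Line> lines, float blockWidth, float spaceWidth) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            Line line = lines.get(i);
            Line next = (i + 1 < lines.size()) ? lines.get(i + 1) : null;
            if (current.length() > 0) {
                current.append(' '); // a line break inside a paragraph becomes a space
            }
            current.append(line.text);
            if (endsParagraph(line, next, blockWidth, spaceWidth)) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        return paragraphs;
    }
}
```

Note that a paragraph whose last line happens to fill the whole block width is not detected by this test alone; such a case would need an extra signal such as indentation or vertical spacing.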

Cheers,
A.

  was:
Hi,

To extract text by paragraphs is probably the most looking forward improvement asked by PDFBox users.


*The current text extraction approach detects correctly end of lines. But it does not detect correctly end of paragraph.* Indeed, a carriage return character is added after each new line.

*What is a paragraph ?* A paragraph is a text that contain one or several sentences. It can start by a tabulation but this is not mandatory. In a paragraph, there is one or more lines but there is no carriage return (except the one at the very end). A paragraph can end before the very end of a line, but some paragraphs end at the very end.

*So, the last line of a paragraph ends before reaching the very end of the line.* +And we could use that pattern detect properly paragraphs.+ 

In my opinion, the algorithm never adds carriage return except when a end of paragraph is detect.

In my opinion, the algorithm should use the following information:
(*) the width of the block containing the paragraph ;
(*) detect the padding and margin of that block ;
(*) precomputed width of the next word.

The width of a block is either the width of the pdf minus left and right margins if the pdf has one column. Or the width of a block is the width of a column if the pdf has two columns for example.

The algorithm runs on every character and when it reach the last character of a line, it pre computes the next line first word to have it's width. If this word fits in the previous line, then we conclude we have a end of paragraph.

Best,
A.


> Detect end of paragraphs
> ------------------------
>
>                 Key: PDFBOX-3804
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3804
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.6, 3.0.0
>            Reporter: Alexandre
>            Priority: Minor
>              Labels: extraction, paragraph
>         Attachments: example.pdf


