Posted to dev@pdfbox.apache.org by "Mel Martinez (JIRA)" <ji...@apache.org> on 2009/09/08 19:45:58 UTC

[jira] Created: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Improved PDF Text Extraction that notes paragraph boundaries
------------------------------------------------------------

                 Key: PDFBOX-521
                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing
    Affects Versions: 0.8.0-incubator
         Environment: all
            Reporter: Mel Martinez


The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is to ignore paragraph demarcation in the text.  It simply renders each line of text as it discovers it, separating every line with the same line separator.

This makes it difficult to identify paragraph (or even page) starts and stops in the extracted text.  Identifying those boundaries is often necessary for text processing that needs to work with logical 'chunks' of text.  Further, rendering into other formats (such as HTML or XML) is easier when the document is resolved into more discrete logical text chunks.

The request here is for improved text extraction that provides finer-grained instrumentation of the parsing, allowing one to identify / tag paragraph starts and stops.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment: pdftextstripper2.zip

Modified to re-enable use of the writeWordSeparator() and writeCharacters(TextPosition) methods, improving the instrumentation available to subclasses.

I.e., you can now override the discrete character output methods as well as the various sectional boundaries.

This goes towards addressing the concerns of issue PDFBOX-533.

This also fixes a bug where non-RTL text was skipping presentation normalization - now ligatures and special characters are properly processed to replace them with their plain text equivalents (looks much better!).
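
For anyone unfamiliar with presentation normalization, here is a minimal sketch of the general idea using the JDK's java.text.Normalizer.  It is not necessarily how the attached code does it; the LigatureDemo class and toPlainText() method are made-up names for illustration only.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox or the attachment.
final class LigatureDemo {

    // NFKC compatibility normalization replaces presentation forms such as
    // the "fi" ligature (U+FB01) with their plain-text equivalents.
    static String toPlainText(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB01 is the "fi" ligature, U+FB02 is the "fl" ligature.
        System.out.println(toPlainText("\uFB01le of\uFB02ine"));
    }
}
```

NFKC also folds other compatibility characters (full-width forms, superscripts and so on), which may or may not be wanted for a given document.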

Performance seems to be virtually unchanged from the previous version, taking just a hair over 40s to process the 31MB 2006 PDF 1.7 reference doc.

Needs to be tested with RTL text (e.g. Hebrew).  I don't have any such documents with which to test.  If anyone has some and would like to send me an example, please do.  Or test it yourself and post the results here.



[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment:     (was: pdftextstripper2.zip)



[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment: pdftextstripper2.zip

Added the ability to change the default values used for drop space and indent testing using -D system property values.  Examples:

-Dpdftextstripper2.drop=2.8f

-Dpdftextstripper2.indent=4.0f
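
A rough sketch of how float-valued -D options like these can be read with a fallback.  Only the two property names come from the examples above; the ThresholdOptions class, the floatProperty() helper and the fallback numbers are placeholders I made up, not the attachment's actual code or defaults.

```java
// Hypothetical helper; only the two property names come from the message above.
final class ThresholdOptions {

    // Reads a float-valued system property, falling back to the supplied
    // default when the property is absent or is not a valid number.
    static float floatProperty(String name, float fallback) {
        String raw = System.getProperty(name);
        if (raw == null) {
            return fallback;
        }
        try {
            // Float.parseFloat accepts an optional trailing 'f', so "2.8f" works.
            return Float.parseFloat(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // The fallback values are placeholders, not the attachment's defaults.
        float drop = floatProperty("pdftextstripper2.drop", 2.5f);
        float indent = floatProperty("pdftextstripper2.indent", 2.0f);
        System.out.println("drop=" + drop + " indent=" + indent);
    }
}
```

Falling back silently on a malformed value is one possible choice; logging a warning or failing fast would be equally reasonable.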





[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment: pdftextstripper2.zip

Minor tweak to clean up imports and improve instrumentation.



[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment: pdftextstripper2.zip

Fixes a bug in the acceptance of the system property options -Dpdftextstripper2.drop and -Dpdftextstripper2.indent.



[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment: pdftextstripper2.zip

Fixed page start / article start nesting error.



[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment:     (was: pdftextstripper2.zip)



[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment: pdftextstripper2.zip

Sigh ... if I could only overcome my sudden flashes of complete idiocy ...

Loaded the wrong version last time.  This one fixes it.

Adds separate attributes for article start/ends and separate write methods for paragraph start/ends.




[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment: pdftextstripper2.zip

Attached is a proposed solution in the form of a subclass of PDFTextStripper (org.apache.pdfbox.util.PDFTextStripper2) that enhances the text extraction in several ways.

- detects paragraph separation in several ways: vertical drop, horizontal indent, and hanging indents.  The core logic can be changed by overriding the isParagraphSeparation() method in a subclass.
- provides separate attributes for the demarcation used for both paragraph start and paragraph end.
- provides separate attributes for the demarcation of page starts and page ends.
- provides configurable thresholds for both the vertical drop and horizontal indent tests.
- detects most hanging indent cases through regex matching of common list item formats (e.g. bullet items and numbered items).  The patterns used can be extended or changed through subclassing.
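
To make the tests in the list above concrete, here is a simplified, self-contained sketch of the kind of check involved.  It is not the attached code: the ParagraphHeuristics and Line classes, the regex, and the choice of line height as the threshold unit are all assumptions made for illustration.  Only the method name isParagraphSeparation() is taken from the description.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; names and thresholds are assumptions, not the
// attached PDFTextStripper2 code.
final class ParagraphHeuristics {

    // Simplified stand-in for one extracted line of text.
    static final class Line {
        final float x;      // left edge of the line
        final float y;      // vertical position of the line
        final float height; // text height, used as the unit for thresholds
        final String text;

        Line(float x, float y, float height, String text) {
            this.x = x;
            this.y = y;
            this.height = height;
            this.text = text;
        }
    }

    // Common list-item openers: a bullet, "-" or "*", "1." / "1)", "a." / "a)".
    private static final Pattern LIST_ITEM =
            Pattern.compile("\\s*(?:[\\u2022*-]|\\d+[.)]|[A-Za-z][.)])\\s+.*");

    private final float dropThreshold;   // in multiples of line height
    private final float indentThreshold; // in multiples of line height

    ParagraphHeuristics(float dropThreshold, float indentThreshold) {
        this.dropThreshold = dropThreshold;
        this.indentThreshold = indentThreshold;
    }

    boolean isParagraphSeparation(Line previous, Line current) {
        if (previous == null) {
            return true; // the first line always starts a paragraph
        }
        float unit = Math.max(previous.height, current.height);
        // Vertical drop: more space than usual between the two lines.
        if (Math.abs(previous.y - current.y) > dropThreshold * unit) {
            return true;
        }
        // Horizontal indent: the new line starts well to the right.
        if (current.x - previous.x > indentThreshold * unit) {
            return true;
        }
        // Hanging indent: a new list item is recognized by its marker.
        return LIST_ITEM.matcher(current.text).matches();
    }
}
```

With thresholds of 2.5 and 2.0 and 12pt text, a line 14pt below its predecessor at the same left edge is treated as a continuation, while a 50pt drop, a 28pt indent, or a line starting with "2) " is treated as a new paragraph.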

In addition to the PDFTextStripper2 class, the attachments include a utility PositionWrapper class used by the stripper and a new PDFText2HTML2 class based on PDFTextStripper2 that uses the improved demarcations in creating an HTML form of the output.
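
As a rough illustration of how separate start/end demarcations make HTML output straightforward, here is a small stand-alone sketch.  DemarcatedWriter and HtmlWriter are hypothetical stand-ins written for this example; they mimic the idea only and do not reproduce the PDFTextStripper2 or PDFText2HTML2 API.

```java
import java.util.List;

// Hypothetical stand-in for a stripper with configurable start/end markers.
class DemarcatedWriter {
    private String pageStart = "";
    private String pageEnd = "\n";
    private String paragraphStart = "";
    private String paragraphEnd = "\n";

    void setPageMarkers(String start, String end) {
        this.pageStart = start;
        this.pageEnd = end;
    }

    void setParagraphMarkers(String start, String end) {
        this.paragraphStart = start;
        this.paragraphEnd = end;
    }

    // Hook that subclasses may override, e.g. to escape markup characters.
    protected String writeText(String text) {
        return text;
    }

    // Input is a list of pages, each page being a list of paragraph strings.
    String render(List<List<String>> pages) {
        StringBuilder out = new StringBuilder();
        for (List<String> page : pages) {
            out.append(pageStart);
            for (String paragraph : page) {
                out.append(paragraphStart)
                   .append(writeText(paragraph))
                   .append(paragraphEnd);
            }
            out.append(pageEnd);
        }
        return out.toString();
    }
}

// HTML flavor: only the markers and the text hook change.
class HtmlWriter extends DemarcatedWriter {
    HtmlWriter() {
        setPageMarkers("<div class=\"page\">", "</div>\n");
        setParagraphMarkers("<p>", "</p>\n");
    }

    @Override
    protected String writeText(String text) {
        // Escape '&' first so already-escaped entities are not doubled.
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

The point of the design is that the traversal stays in the base class, so a different output format only has to supply different markers and, if needed, a different text hook.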




[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment:     (was: pdftextstripper2.zip)



[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment:     (was: pdftextstripper2.zip)



[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment:     (was: pdftextstripper2.zip)



[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753112#action_12753112 ] 

Mel Martinez commented on PDFBOX-521:
-------------------------------------

Just noticed that the import statements for the attached source need to be trimmed to remove unused imports.  Not important enough to warrant reposting new files.  Just letting folks know.  If using Eclipse, just hit 'ctrl-shift-O' while inside each file and that will automagically clean it up.




[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment:     (was: pdftextstripper2.zip)
