Posted to dev@pdfbox.apache.org by "Ema Panz (JIRA)" <ji...@apache.org> on 2010/06/17 12:38:25 UTC

[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

    [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879740#action_12879740 ] 

Ema Panz commented on PDFBOX-521:
---------------------------------

Hi, thank you for your work on the HTML stripper! I've done some work on the "Guessing Title" part of your code and have ended up with a better method (which may need some further work). My code is here: http://pastebin.com/HNNsk0qJ

What I've implemented:

1. font-size comparisons now use the font size in points (pt)
2. the minimum title font size is calculated from a base font size (maybe guessed from the document body text)
3. the maximum title length is now larger: 180 chars (I have some scientific papers with long titles)
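
To make the three points concrete, here is a minimal standalone sketch of the heuristic. It is not the pastebin code and does not use the PDFBox API: the Line class, the method names, and the 1.2 scale factor are assumptions made only for illustration. Only the 180-char limit and the idea of comparing point sizes against a base size come from the list above. The base size is guessed here as the size that covers the most characters, which is one possible reading of "guessed from the document body text".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TitleGuessSketch {

    /** Hypothetical container: one extracted line and its font size in points. */
    static final class Line {
        final String text;
        final float fontSizePt;

        Line(String text, float fontSizePt) {
            this.text = text;
            this.fontSizePt = fontSizePt;
        }
    }

    /** Maximum title length, as in point 3 above. */
    static final int MAX_TITLE_LENGTH = 180;

    /** Assumed factor: a title line must be at least this much larger than body text. */
    static final float TITLE_SCALE = 1.2f;

    /** Guess the base (body) size as the point size that covers the most characters. */
    static float guessBaseFontSize(List<Line> lines) {
        Map<Float, Integer> charsBySize = new HashMap<>();
        for (Line line : lines) {
            charsBySize.merge(line.fontSizePt, line.text.length(), Integer::sum);
        }
        float best = 0f;
        int bestChars = -1;
        for (Map.Entry<Float, Integer> e : charsBySize.entrySet()) {
            if (e.getValue() > bestChars) {
                best = e.getKey();
                bestChars = e.getValue();
            }
        }
        return best;
    }

    /** Join the first run of consecutive large-font lines; null if none is found. */
    static String guessTitle(List<Line> lines) {
        float minTitleSize = guessBaseFontSize(lines) * TITLE_SCALE;
        StringBuilder title = new StringBuilder();
        for (Line line : lines) {
            String text = line.text.trim();
            if (text.isEmpty() || line.fontSizePt < minTitleSize) {
                if (title.length() > 0) {
                    break; // the run of title lines has ended
                }
                continue; // still looking for the first title line
            }
            int separator = title.length() > 0 ? 1 : 0;
            if (title.length() + separator + text.length() > MAX_TITLE_LENGTH) {
                break; // adding this line would exceed the length limit
            }
            if (separator == 1) {
                title.append(' ');
            }
            title.append(text);
        }
        return title.length() > 0 ? title.toString() : null;
    }
}
```

Calling guessTitle with the lines of the first page, in reading order, returns the joined title or null when no line is large enough. Deriving the threshold from a guessed base size, rather than from a fixed value, keeps the check usable across documents set in different body sizes.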

I know that my code doesn't follow the Apache coding standards... please don't be too hard on me! :)

Cheers

> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is to ignore paragraph demarcation in the text.  It basically just renders each line of text as it discovers it, separating each line equally with the same line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops in the extracted text.  This is often necessary for text processing that needs to work with logical 'chunks' of text.  Further, rendering into other formats (such as HTML or XML) is facilitated by resolving the document into more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete instrumentation of the parsing, allowing one to identify / tag paragraph starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.