Posted to dev@pdfbox.apache.org by "Thilo Planz (JIRA)" <ji...@apache.org> on 2011/01/11 12:25:45 UTC

[jira] Commented: (PDFBOX-213) Text Extraction with Formatting

    [ https://issues.apache.org/jira/browse/PDFBOX-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980062#action_12980062 ] 

Thilo Planz commented on PDFBOX-213:
------------------------------------

It would also be nice if plain-text extraction supported something similar to Xpdf's pdftotext -layout flag, which does a pretty good job of preserving the layout of the text in the PDF using just spaces.
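
To make the idea concrete, here is a rough sketch of how layout could be preserved with nothing but spaces: each piece of text is dropped onto a character grid, with the row taken from its vertical position and the column from its horizontal position divided by an assumed average character width. The Fragment class, the render method and the charWidth/lineHeight parameters are all made up for illustration; this is neither PDFBox API nor how pdftotext is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LayoutSketch {

    /** Hypothetical positioned text run; not a PDFBox type. */
    static final class Fragment {
        final double x;    // left edge, in points
        final double y;    // baseline, in points, measured from the top of the page
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Places each fragment on a character grid: the row comes from y,
     * the column from x divided by an assumed average character width.
     * Rows that contain no text are not emitted.
     */
    static String render(List<Fragment> fragments, double charWidth, double lineHeight) {
        // Group fragments into rows; TreeMap keeps them ordered top to bottom.
        Map<Long, List<Fragment>> rows = new TreeMap<>();
        for (Fragment f : fragments) {
            long row = Math.round(f.y / lineHeight);
            rows.computeIfAbsent(row, k -> new ArrayList<>()).add(f);
        }

        StringBuilder out = new StringBuilder();
        for (List<Fragment> row : rows.values()) {
            // Within a row, process fragments left to right.
            row.sort((a, b) -> Double.compare(a.x, b.x));
            StringBuilder line = new StringBuilder();
            for (Fragment f : row) {
                int column = (int) Math.round(f.x / charWidth);
                // If the previous fragment already reaches the target column,
                // keep a single space so the two do not run together.
                if (line.length() > 0 && line.length() >= column) {
                    line.append(' ');
                }
                // Otherwise pad with spaces up to the target column.
                while (line.length() < column) {
                    line.append(' ');
                }
                line.append(f.text);
            }
            out.append(line).append('\n');
        }
        return out.toString();
    }
}
```

With a charWidth of 6 and a lineHeight of 12, a fragment at x=60 lands in column 10, so two-column content stays aligned across lines. A real implementation would also have to cope with proportional fonts, rotated text and varying font sizes, which this sketch ignores.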



> Text Extraction with Formatting
> -------------------------------
>
>                 Key: PDFBOX-213
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-213
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>            Priority: Minor
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1589018
> Originally submitted by cetinsert on 2006-11-01 17:50.
> Is it possible to extract text from a PDF without
> ignoring the formatting?
> HTML tags might be used for example. I thought the
> PDFText2Html class would do the trick but it does not.
> Thank you for reading.
> [comment on SourceForge]
> Originally sent by rrufai.
> Logged In: YES 
> user_id=1776491
> Originator: NO
> It's sent.
> [comment on SourceForge]
> Originally sent by rrufai.
> Logged In: YES 
> user_id=1776491
> Originator: NO
> What email address should I send it to? 
> [comment on SourceForge]
> Originally sent by cetinsert.
> Logged In: YES 
> user_id=1562185
> Originator: YES
> @ rrufai
> > You might send a compiled 32-bit Windows or Linux binary personally to me.
> > (I'm a user of pdftohtml.)
> I messed things up. This was also PDFBox. Hehe, sorry.
> [comment on SourceForge]
> Originally sent by cetinsert.
> Logged In: YES 
> user_id=1562185
> Originator: YES
> @ rrufai
> What trouble are you having with handling underlines?
> You might send a compiled 32-bit Windows or Linux binary personally to me. (I'm a user of pdftohtml.)
> [comment on SourceForge]
> Originally sent by rrufai.
> Logged In: YES 
> user_id=1776491
> Originator: NO
> Hi Ben,
> I've extended PDFText2Html to handle bold and new lines (with <br> tags). However, I'm having trouble figuring out how to handle underlines.
> Also, I don't know how to post updates. 
> Regards,
> Raimi
> [comment on SourceForge]
> Originally sent by cetinsert.
> Logged In: YES 
> user_id=1562185
> Uhmm... well bold, italic, underlined etc... would be a good
> beginning but my ultimate wish would be something like
> quoted below:
> <?xml version="1.0" encoding="ISO-8859-1"?>
> <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
> <pdf2xml>
> <page number="1" position="absolute" top="0" left="0"
> height="1262" width="892">
>  <fontspec id="0" size="16" family="Times" color="#000000"/>
>  <fontspec id="1" size="16" family="Times" color="#000000"/>
>  <fontspec id="2" size="16" family="Times" color="#000000"/>
> <text top="110" left="106" width="137" height="18"
> font="0"><i>She </i>told <b>me</b>. äµß </text>
> </page>
> </pdf2xml>
> I think I have made a mistake by naming it "Text Extraction
> with Formatting"... I should have put my question under a
> more fitting title, something like "PDF to (HTML/)XML
> Conversion with formatting".
> Thank you very much for your prompt replies. ^_^
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Specifically, are you looking only for bold and italic, or for other things as well?
> [comment on SourceForge]
> Originally sent by cetinsert.
> Logged In: YES 
> user_id=1562185
> That's exactly what I am looking for. But is this not a
> priority issue for the PDFBox package? It would take me
> quite some time to extend the stripper on my own. One of the
> PDFBox developers could probably do it better, I think.
> If you insist that it's a user's issue and PDFBox developers
> would not invest their time in such an extension, could you
> at least tell me whether you have any links to any
> information regarding this matter?
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> HTML tags are not used to format a PDF document. Font information is available, but it can be tricky to get what you
> want from it. You will need to extend PDFTextStripper and override writeCharacters to get formatting such as bold/italic.
> Is that what you are looking for?
> Ben
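
The suggestion quoted above (extend PDFTextStripper and override writeCharacters) leaves open how to decide whether a run of text is bold or italic. One common heuristic is to look for "Bold", "Italic" or "Oblique" in the font name. The sketch below shows only that step, assuming the font name for each run has already been collected inside such an override. The Run class and all method names are hypothetical, not PDFBox API, and fonts whose names do not encode their style will be missed.

```java
import java.util.List;
import java.util.Locale;

public class FontStyleSketch {

    /** Hypothetical text run: a piece of text plus the name of the font it was drawn with. */
    static final class Run {
        final String fontName;
        final String text;

        Run(String fontName, String text) {
            this.fontName = fontName;
            this.text = text;
        }
    }

    /** Heuristic: treat the font as bold if its name contains "bold". */
    static boolean isBold(String fontName) {
        return fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Heuristic: treat the font as italic if its name contains "italic" or "oblique". */
    static boolean isItalic(String fontName) {
        String name = fontName.toLowerCase(Locale.ROOT);
        return name.contains("italic") || name.contains("oblique");
    }

    /** Escapes the characters that would otherwise be read as markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps each run in <b> and/or <i> tags according to its font name. */
    static String toHtml(List<Run> runs) {
        StringBuilder html = new StringBuilder();
        for (Run run : runs) {
            boolean bold = isBold(run.fontName);
            boolean italic = isItalic(run.fontName);
            if (bold) {
                html.append("<b>");
            }
            if (italic) {
                html.append("<i>");
            }
            html.append(escape(run.text));
            if (italic) {
                html.append("</i>");
            }
            if (bold) {
                html.append("</b>");
            }
        }
        return html.toString();
    }
}
```

Feeding it the runs ("Times-Italic", "She "), ("Times-Roman", "told "), ("Times-Bold", "me") and ("Times-Roman", ".") yields <i>She </i>told <b>me</b>. which matches the inline markup in the pdf2xml sample quoted earlier in the thread. Underlines cannot be detected this way, since they are normally drawn as separate graphics rather than encoded in the font.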
