Posted to dev@pdfbox.apache.org by "Navendu Garg (JIRA)" <ji...@apache.org> on 2009/09/17 17:50:57 UTC

[jira] Issue Comment Edited: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756582#action_12756582 ] 

Navendu Garg edited comment on PDFBOX-533 at 9/17/09 8:50 AM:
--------------------------------------------------------------

Use case: While extracting text, I need each character together with its text position information. I also need to keep track of line breaks. The only way I could find to do this was to use writeLineSeparator and writeCharacters.

Currently, I am using the processLineSeparator and writeCharacters methods, which are available in the previous version, to achieve this, and it works fine.
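
A minimal sketch of this use case, for illustration only. It does not use PDFBox itself: Glyph, SketchStripper, LineCollector and PositionedTextDemo are hypothetical stand-ins, and only the hook names writeCharacters and writeLineSeparator come from the discussion above. The hook signatures and the "y changed, so a new line started" rule are assumptions, not the library's actual behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a positioned character (NOT PDFBox's TextPosition).
final class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical stand-in for a text stripper that exposes overridable hooks.
// The hook signatures here are assumed for the sketch.
class SketchStripper {
    // Assumed rule: a noticeable change in y means a new line has started.
    void process(List<Glyph> glyphs) {
        boolean first = true;
        float lastY = 0f;
        for (Glyph g : glyphs) {
            if (!first && Math.abs(g.y - lastY) > 0.5f) {
                writeLineSeparator();
            }
            writeCharacters(g);
            lastY = g.y;
            first = false;
        }
    }

    // Called once per character, with its position.
    protected void writeCharacters(Glyph g) { }

    // Called between two lines.
    protected void writeLineSeparator() { }
}

// Subclass that keeps both the characters and the line structure.
class LineCollector extends SketchStripper {
    private final List<List<Glyph>> lines = new ArrayList<List<Glyph>>();
    private List<Glyph> current = new ArrayList<Glyph>();

    @Override
    protected void writeCharacters(Glyph g) {
        current.add(g);
    }

    @Override
    protected void writeLineSeparator() {
        lines.add(current);
        current = new ArrayList<Glyph>();
    }

    // Flushes the last, unterminated line and returns all collected lines.
    List<List<Glyph>> finish() {
        if (!current.isEmpty()) {
            lines.add(current);
            current = new ArrayList<Glyph>();
        }
        return lines;
    }
}

class PositionedTextDemo {
    static List<List<Glyph>> collect(List<Glyph> glyphs) {
        LineCollector collector = new LineCollector();
        collector.process(glyphs);
        return collector.finish();
    }

    public static void main(String[] args) {
        List<Glyph> page = Arrays.asList(
                new Glyph("H", 10f, 700f),
                new Glyph("i", 16f, 700f),
                new Glyph("!", 10f, 686f));
        // Print each line with the x position of every character.
        for (List<Glyph> line : collect(page)) {
            StringBuilder sb = new StringBuilder();
            for (Glyph g : line) {
                sb.append(g.text).append('@').append(g.x).append(' ');
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```

The point of the sketch is that both hooks must actually be invoked during processing. If the per-character hook is never called, a subclass like LineCollector sees the line breaks but none of the positioned characters.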

I would prefer finer instrumentation, even at the cost of some performance. Finer control allows you to structure the extracted text as per your needs. Most of the PDF libraries out there do not offer this finer level of instrumentation, and that is probably why I chose PDFBox. I have been using PDFBox for a while now on fairly large PDF documents (7-8 MB), and I must say PDFBox runs pretty fast. Still, some benchmark information would be good.

I will take a look at PDFTextStripper2.

thanks

Navendu Garg


      was (Author: navendugarg):
    Use case: While extracting text, I need each character together with its text position information. I also need to keep track of line breaks. The only way I could find to do this was to use writeLineSeparator and writeCharacters.

Currently, I am using processLineSeparator & writeCharacters methods, which are available in the previous version, to achieve this and it works fine. 

Now from an API standpoint, if writeCharacters and writeWordSeparator are not used then they should be deprecated/removed. 

I will take a look at PDFTextStripper2.

thanks

Navendu Garg

  
> PDFTextStripper.writeCharacters is called no where in the class
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-533
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-533
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Navendu Garg
>
> It seems the writeCharacters method is not called anywhere in the PDFTextStripper class. This makes it impossible to handle character TextPosition information together with line separators, because the processLineSeparator method is no longer there and writeLineSeparator is called only when the actual writing happens.
