
Posted to dev@pdfbox.apache.org by "Navendu Garg (JIRA)" <ji...@apache.org> on 2009/09/17 00:01:58 UTC

[jira] Created: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

PDFTextStripper.writeCharacters is called no where in the class
---------------------------------------------------------------

                 Key: PDFBOX-533
                 URL: https://issues.apache.org/jira/browse/PDFBOX-533
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.8.0-incubator
            Reporter: Navendu Garg
             Fix For: 0.8.0-incubator


It seems the writeCharacters method is not called anywhere in the PDFTextStripper class. This makes it impossible to handle the TextPosition of each character, or line separators, because the processLineSeparator method no longer exists and writeLineSeparator is only called when the actual writing happens.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Navendu Garg (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757270#action_12757270 ] 

Navendu Garg commented on PDFBOX-533:
-------------------------------------

System details:
JDK: 1.5_20
RAM: 2GB
Processor: Intel(R) Core Duo (T7300, 2.00 GHz)
PDFBox Version: 0.7.4-dev (this was available briefly on the PDFBox site a long time ago)

I have attached the code that I used to run my test. I ran the test with 256M Heap space and -server option.

Over 10 runs, it took about 27 s on average to convert the PDF to text.


[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Lars Torunski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761553#action_12761553 ] 

Lars Torunski commented on PDFBOX-533:
--------------------------------------

fontbox-0.8.0-incubating.jar and icu4j-3.8.jar are installed, but I'm getting a similar stack trace:

org.apache.pdfbox.exceptions.WrappedIOException
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
        at mycompany....
        at java.lang.Thread.run(Thread.java:534)
Caused by: java.util.NoSuchElementException
        at java.util.AbstractList$Itr.next(AbstractList.java:426)
        at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
        at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
        ... 11 more

PDF file: www.oppenheim.pl/plpl/_download/09_05_11_Archiv.pdf


[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758058#action_12758058 ] 

Mel Martinez commented on PDFBOX-533:
-------------------------------------

Okay, I did some tweaking and I created a version of my PDFTextStripper2 that reinstates use of both writeWordSeparator() and writeCharacters(TextPosition).

The performance is fine - no real penalty at all.

BUT .... I need to figure out how to handle the Right-To-Left normalization before I post it.  That will take me a bit more time.

I skipped that to do my initial tests using LTR text.   

I suspect that the RTL conversion will be a bit tricky - especially given that I don't have any RTL documents to test with.  The problem is that it requires tracking the RTL status up to the moment the string is rendered.  The current stripper code builds up the whole string and then normalizes it as a whole.  To make the writeCharacters() call work, I'll have to detect the RTL nature of the aggregate line of TextPositions and call writeCharacters() for each TextPosition object in the line in the correct order - and THEN the writeCharacters() method (or its override) itself needs to further normalize the text pulled from the TextPosition object before writing to the stream.

That is, it will be the responsibility of any subclass that overrides writeCharacters() to do the normalization before sending the characters to the stream.  Maybe I'll add a utility method to help with that.  Ligatures have to be normalized as well.
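
A minimal sketch of the two steps described above, using only the JDK: detect whether a line contains right-to-left text, then normalize each fragment (including ligatures) before it is written. This is not PDFBox code. The class and method names are hypothetical, each String stands in for the text of one TextPosition, and the sketch assumes fragments arrive in visual left-to-right order so that an RTL line can simply be reversed. Lines that mix directions would need real bidi reordering.

```java
import java.text.Bidi;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; each String stands in for the text of one TextPosition.
class LineNormalizerSketch {

    // True if the combined text of the line contains right-to-left characters.
    static boolean isRightToLeft(List<String> fragments) {
        char[] chars = String.join("", fragments).toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }

    // NFKC expands compatibility characters, e.g. the "fi" ligature U+FB01 becomes "fi".
    static String normalizeFragment(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Returns the normalized fragments in the order a per-fragment writer would receive them.
    // Assumption: input is in visual left-to-right order, so an RTL line is reversed.
    static List<String> inWriteOrder(List<String> fragments) {
        List<String> ordered = new ArrayList<>();
        for (String fragment : fragments) {
            ordered.add(normalizeFragment(fragment));
        }
        if (isRightToLeft(fragments)) {
            Collections.reverse(ordered);
        }
        return ordered;
    }

    public static void main(String[] args) {
        System.out.println(inWriteOrder(Arrays.asList("\uFB01", "le")));
        System.out.println(isRightToLeft(Arrays.asList("\u05D0", "\u05D1")));
    }
}
```

Doing the normalization per fragment keeps the responsibility inside whatever method writes each fragment, which matches the division of labor proposed above.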



[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Navendu Garg (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772551#action_12772551 ] 

Navendu Garg commented on PDFBOX-533:
-------------------------------------

When do you think you can make PDFTextStripper2 part of the main PDFBox project?

Navendu



[jira] Updated: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Navendu Garg (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Navendu Garg updated PDFBOX-533:
--------------------------------

    Attachment: TestPDFTextStripperPerf.java


[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Phil Varner (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796828#action_12796828 ] 

Phil Varner commented on PDFBOX-533:
------------------------------------

Mel, 

I think it should be 

            while(pdfSource.available() > 0 && objIter.hasNext())

instead, so the call to next() returns the correct Integer when next() is called later on.

This worked for me on a doc that threw the same exception.

I didn't see a separate JIRA issue for this. I'll gladly file one and fix it if someone can provide a document that triggers the error (mine is confidential, from a customer).
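
The effect of the extra hasNext() check can be sketched in isolation. The names pdfSource and objIter come from the suggested loop condition; everything else here (the class, the method, the byte-per-entry reading) is a hypothetical stand-in and not the actual PDFXrefStreamParser code. The point is only that an unguarded next() throws NoSuchElementException once the iterator runs out while the stream still has data.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for a parser loop that pairs stream data with object numbers.
class GuardedIterationSketch {

    // Consumes one byte per object number and returns how many pairs were read.
    // With guarded == false the loop mimics the original condition (stream check only).
    static int consume(InputStream pdfSource, Iterator<Integer> objIter, boolean guarded) {
        int consumed = 0;
        try {
            while (pdfSource.available() > 0 && (!guarded || objIter.hasNext())) {
                objIter.next();   // throws NoSuchElementException if the iterator is exhausted
                pdfSource.read(); // stand-in for parsing one xref entry
                consumed++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return consumed;
    }

    public static void main(String[] args) {
        // Five bytes of data but only two object numbers.
        InputStream data = new ByteArrayInputStream(new byte[5]);
        System.out.println(consume(data, Arrays.asList(1, 2).iterator(), true));
    }
}
```

Note that the guarded loop stops silently with data left in the stream, so a real fix might also want to log or report the mismatch.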

 


[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756552#action_12756552 ] 

Mel Martinez commented on PDFBOX-533:
-------------------------------------

Note that writeWordSeparator() is not called either.

I'm not sure whether that is bad or not.  Currently, each line of text is written to the output as a line, not as discrete characters or discrete words.

While changing the code to use either or both of these methods might improve instrumentation, the disadvantage would probably be performance.

What is your exact use-case that you need to solve?  There may be a different solution.

If it helps at all, I've been working on a subclass of PDFTextStripper that is currently posted in Jira issue PDFBOX-521.   Feel free to check that out and provide feedback.  I'll create a link to it from this issue page.


[jira] Updated: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-533:
--------------------------------------

    Fix Version/s:     (was: 0.8.0-incubator)


[jira] Issue Comment Edited: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Navendu Garg (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756582#action_12756582 ] 

Navendu Garg edited comment on PDFBOX-533 at 9/17/09 8:50 AM:
--------------------------------------------------------------

Use case: While extracting text, I need character as well as text position information. I also need to keep track of line breaks. The only way I could figure out to do this was to use writeLineSeparator and writeCharacters. 

Currently, I am using processLineSeparator & writeCharacters methods, which are available in the previous version, to achieve this and it works fine. 

I would prefer finer instrumentation, even at the cost of some performance. Finer control allows you to structure the extracted text as per your needs. Most of the PDF libraries out there do not offer this level of instrumentation, which is probably why I chose PDFBox. I have been using PDFBox for a while now on fairly large PDF documents (7-8 MB). I must say PDFBox runs pretty fast. Still, some benchmark information would be good.

I will take a look at PDFTextStripper2.

thanks

Navendu Garg


      was (Author: navendugarg):
    Use case: While extracting text, I need character as well as text position information. I also need to keep track of line breaks. The only way I could figure out to do this was to use writeLineSeparator and writeCharacters. 

Currently, I am using processLineSeparator & writeCharacters methods, which are available in the previous version, to achieve this and it works fine. 

Now from an API standpoint, if writeCharacters and writeWordSeparator are not used then they should be deprecated/removed. 

I will take a look at PDFTextStripper2.

thanks

Navendu Garg

  

[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Navendu Garg (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758737#action_12758737 ] 

Navendu Garg commented on PDFBOX-533:
-------------------------------------

Mel,

I tried to use PDFTextStripper2. However, it is giving me the following info/error messages:

INFO: unsupported/disabled operation: BDC
Sep 23, 2009 10:35:54 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: g
Sep 23, 2009 10:35:54 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.apache.pdfbox.encoding.EncodingManager.<clinit>(EncodingManager.java:38)
	at org.apache.pdfbox.pdmodel.font.PDFont.getEncoding(PDFont.java:518)
	at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:438)
	at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:343)
	at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:66)
	at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:516)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:229)
	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
	at org.apache.pdfbox.util.TestPDFTextStripperPerf.main(TestPDFTextStripperPerf.java:27)
Caused by: java.lang.NullPointerException
	at java.io.Reader.<init>(Reader.java:61)
	at java.io.InputStreamReader.<init>(InputStreamReader.java:55)
	at org.apache.pdfbox.encoding.Encoding.loadGlyphList(Encoding.java:98)
	at org.apache.pdfbox.encoding.Encoding.<clinit>(Encoding.java:58)
	... 12 more
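
The bottom of this trace shows a NullPointerException raised in Reader.<init> via InputStreamReader.<init>, which is what the JDK does when the stream passed to InputStreamReader is null. The thread does not establish why the stream was null here. One possibility consistent with the trace, stated only as an assumption, is a resource that could not be found on the classpath, since getResourceAsStream returns null in that case. The sketch below uses hypothetical names and is not the PDFBox Encoding code. It shows how a null check turns that situation into a clearer error.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical loader; the class, method and resource names are illustrative only.
class ResourceLoadSketch {

    // Opens a classpath resource, failing with a descriptive message if it is missing.
    static Reader openResource(String name) {
        InputStream in = ResourceLoadSketch.class.getClassLoader().getResourceAsStream(name);
        if (in == null) {
            // Without this check, new InputStreamReader(null) throws a bare NullPointerException.
            throw new IllegalStateException("Resource not found on classpath: " + name);
        }
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        try {
            openResource("no/such/resource.txt");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If a missing resource is indeed the cause, checking that all required resource files are packaged with the classes on the classpath would be a reasonable first step.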



[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757203#action_12757203 ] 

Mel Martinez commented on PDFBOX-533:
-------------------------------------

Yeah, that sounds pretty good.

A lot depends on the particular PDF file. Files can contain arbitrary amounts of text, organized in arbitrarily sized text objects that are stuffed into arbitrary stream containers with various encodings, so it is hard to compare timings without using the same file.

I used the pdf_reference_1-7.pdf file (6th edition, Nov 2006), which is 31,712 KB.  Not quite as big as your file, but a robust test.

I parsed it with the latest PDFTextStripper in 0.8.0 incubating and it took 40 seconds.

I also parsed it with my PDFTextStripper2 (from PDFBOX-521) and that also took 40 seconds.  Virtually no difference.

In each case I ran twice, ignoring the first run in order to get all classes loaded, which shaved 2 seconds off the second run.  Multiple runs might be needed for a full hotspot optimization.  This is on a Xeon E5430 @ 2.66GHz w/4 GB RAM - I allocated 256MB RAM to the JRE for the test.  With less than that it runs very slowly.  More doesn't speed it up much.
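
The measurement approach described above (discard a warm-up run, then time the remaining runs) can be sketched with a small harness. The class and method names are hypothetical, and the task passed in would be whatever extraction call is being compared. This is a crude wall-clock measurement, not a substitute for a proper benchmarking tool.

```java
// Hypothetical timing helper; not part of PDFBox.
class WarmupTimingSketch {

    // Runs the task warmupRuns times untimed, then returns the mean wall-clock
    // time in milliseconds over measuredRuns timed executions.
    static double averageMillis(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // lets class loading and early JIT compilation happen untimed
        }
        long totalNanos = 0;
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / measuredRuns;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for a text extraction call.
        Runnable work = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        };
        System.out.println(averageMillis(work, 1, 10) + " ms on average");
    }
}
```

Running both stripper versions through the same harness on the same file, with the same heap settings, keeps the comparison like for like.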

Could you run the old version (I'm assuming that you mean 0.7.3) on the pdf reference file?  You can download it from 

http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

If I can clear some time this weekend, I'll try to experiment with a subclass that adds use of writeCharacters() to see if I can get that added to my version for you in some way.

I'm not sure what caused the error you encountered.  It looks like it was happening on the load side, which is before the text extraction even occurs.  Try allocating more memory?



[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Navendu Garg (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756825#action_12756825 ] 

Navendu Garg commented on PDFBOX-533:
-------------------------------------

I just extracted text from a 50 MB file using the old version, and it took approximately 20 seconds to convert to text. Please note that I was doing other things in between, so I admit this is a crude benchmark. Unfortunately, PDFBox 0.8.0-incubating crashed on this file with the following error.

org.apache.pdfbox.exceptions.WrappedIOException
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:779)
	at org.apache.pdfbox.util.TestLargePDF.test(TestLargePDF.java:13)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:592)
	at junit.framework.TestCase.runTest(TestCase.java:164)
	at junit.framework.TestCase.runBare(TestCase.java:130)
	at junit.framework.TestResult$1.protect(TestResult.java:106)
	at junit.framework.TestResult.runProtected(TestResult.java:124)
	at junit.framework.TestResult.run(TestResult.java:109)
	at junit.framework.TestCase.run(TestCase.java:120)
	at junit.framework.TestSuite.runTest(TestSuite.java:230)
	at junit.framework.TestSuite.run(TestSuite.java:225)
	at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: java.util.NoSuchElementException
	at java.util.AbstractList$Itr.next(AbstractList.java:427)
	at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
	at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
	... 22 more

Still, I think calling the writeCharacters() method will not affect performance all that much. 


[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756691#action_12756691 ] 

Mel Martinez commented on PDFBOX-533:
-------------------------------------


I can't help with character information - not sure exactly what that entails - but in regards to line breaks, that is largely what drove me to create the PDFTextStripper2 class.  It tries to do a more logical breakdown of textual chunks in the page and provide distinct start / end instrumentation around pages, articles and paragraphs.

It does provide a method, 

protected void isParagraphSeparation(PositionWrapper position, PositionWrapper lastPosition, PositionWrapper lastLineStartPosition);

that is called each time it tests a line start to see if it is the beginning of a new paragraph.  That can be overridden to change the logic.  You'll note from that signature that for that part, at least, you would be able to get to the TextPosition objects (referenced from the PositionWrapper) for at least the current position, the last position and the last known line start position.

If you do actually need to look at the TextPosition object for all text objects as they are written then you'd need to rewrite the writePage() method to change the way it builds a line and writes it out.

Currently, it simply queues the characters up in a 'lineStr' String until it reaches what it feels is a linebreak.  Then it outputs the whole line with a single String write.

To introduce the use of the writeCharacters(..) method in a way that provides you with access to all TextPosition objects, they would have to be queued up in a similar manner and the resulting collection sent to a 'writeLine(....)' method which in turn could invoke the writeCharacters(...) method for each discrete TextPosition object in the line.   That doesn't sound too hard, but I am leery of the performance hit.
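
The queuing idea above can be sketched without any PDFBox dependency. Here Glyph is a hypothetical stand-in for TextPosition, and a change in the y coordinate is used as a deliberately crude line-break test. The real writePage() logic is considerably more involved. The names writeLine, writeCharacters and writeLineSeparator mirror the methods discussed in this thread, but the bodies are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: queue positioned glyphs per line, then write them one by one.
class LineQueueSketch {

    // Stand-in for PDFBox's TextPosition: a piece of text plus its coordinates.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    private final List<Glyph> line = new ArrayList<>();
    private final StringBuilder out = new StringBuilder();

    // Queues a glyph; flushes the current line first if the glyph starts a new one.
    void add(Glyph glyph) {
        if (!line.isEmpty() && line.get(line.size() - 1).y != glyph.y) {
            writeLine();
        }
        line.add(glyph);
    }

    // Writes every queued glyph through writeCharacters, then a line separator.
    void writeLine() {
        for (Glyph glyph : line) {
            writeCharacters(glyph);
        }
        writeLineSeparator();
        line.clear();
    }

    // Override points: a subclass sees each glyph with its position.
    protected void writeCharacters(Glyph glyph) {
        out.append(glyph.text);
    }

    protected void writeLineSeparator() {
        out.append('\n');
    }

    // Flushes any remaining glyphs and returns everything written so far.
    String finish() {
        if (!line.isEmpty()) {
            writeLine();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        LineQueueSketch sketch = new LineQueueSketch();
        sketch.add(new Glyph("H", 10f, 700f));
        sketch.add(new Glyph("i", 16f, 700f));
        sketch.add(new Glyph("!", 10f, 688f));
        System.out.print(sketch.finish());
    }
}
```

The per-glyph method call is the extra cost being weighed here: one virtual call per TextPosition instead of one String write per line.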

You could try overriding writePage() in this way in a subclass (look at my PDFTextStripper2 in PDFBOX-521 for an example of subclassing PDFTextStripper) and see how that works.  Unfortunately you'd have to copy all of the source in writePage() into your override version, then make the changes to it to add the instrumentation that you want.  That's what I had to do.

I hope this is helpful.




[jira] Issue Comment Edited: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Navendu Garg (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758737#action_12758737 ] 

Navendu Garg edited comment on PDFBOX-533 at 9/23/09 8:40 AM:
--------------------------------------------------------------

Mel,
Thanks for implementing writeCharacters method. It is going to save me a lot of time.

I tried to use PDFTextStripper2. However, it is giving me the following info/error messages:

INFO: unsupported/disabled operation: BDC
Sep 23, 2009 10:35:54 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: g
Sep 23, 2009 10:35:54 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.apache.pdfbox.encoding.EncodingManager.<clinit>(EncodingManager.java:38)
	at org.apache.pdfbox.pdmodel.font.PDFont.getEncoding(PDFont.java:518)
	at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:438)
	at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:343)
	at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:66)
	at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:516)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:229)
	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
	at org.apache.pdfbox.util.TestPDFTextStripperPerf.main(TestPDFTextStripperPerf.java:27)
Caused by: java.lang.NullPointerException
	at java.io.Reader.<init>(Reader.java:61)
	at java.io.InputStreamReader.<init>(InputStreamReader.java:55)
	at org.apache.pdfbox.encoding.Encoding.loadGlyphList(Encoding.java:98)
	at org.apache.pdfbox.encoding.Encoding.<clinit>(Encoding.java:58)
	... 12 more



  
> PDFTextStripper.writeCharacters is called no where in the class
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-533
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-533
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Navendu Garg
>         Attachments: TestPDFTextStripperPerf.java
>
>
> It seems writeCharacters method is not called anywhere in the PDFTextStripper class. This makes it impossible for handling character TextPosition as well as Line Separator because processLineSeparator method is no longer there and writeLineSeparator is called when actual writing happens.


[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758436#action_12758436 ] 

Mel Martinez commented on PDFBOX-533:
-------------------------------------

Navendu - please check out the latest version of PDFTextStripper2 that I posted in PDFBOX-521.

It re-enables the use of the writeWordSeparator() and writeCharacters(TextPosition) methods so you can, as you used to do, override their behavior in a subclass.
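A minimal sketch of that override pattern, for readers who have not used it. It does not use PDFBox itself: GlyphSketch, StripperSketch and PositionRecordingStripper are hypothetical stand-ins, and only the hook names (writeCharacters, writeWordSeparator, writeLineSeparator) are taken from this thread. The real signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for TextPosition: a run of text plus page coordinates.
class GlyphSketch {
    final String text;
    final float x;
    final float y;

    GlyphSketch(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical stand-in for the stripper: it owns the loop and calls the hooks.
class StripperSketch {
    protected final StringBuilder out = new StringBuilder();

    // Hook: called once per run of text.
    protected void writeCharacters(GlyphSketch g) {
        out.append(g.text);
    }

    // Hook: called between two words on the same line.
    protected void writeWordSeparator() {
        out.append(' ');
    }

    // Hook: called at the end of every line.
    protected void writeLineSeparator() {
        out.append('\n');
    }

    // Each inner list is one line; each element is one word on that line.
    String writeLines(List<List<GlyphSketch>> lines) {
        for (List<GlyphSketch> line : lines) {
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) {
                    writeWordSeparator();
                }
                writeCharacters(line.get(i));
            }
            writeLineSeparator();
        }
        return out.toString();
    }
}

// Subclass that records positions and line breaks while keeping the default output.
class PositionRecordingStripper extends StripperSketch {
    final List<GlyphSketch> seen = new ArrayList<GlyphSketch>();
    int lineBreaks = 0;

    @Override
    protected void writeCharacters(GlyphSketch g) {
        seen.add(g);
        super.writeCharacters(g);
    }

    @Override
    protected void writeLineSeparator() {
        lineBreaks++;
        super.writeLineSeparator();
    }
}
```

The point of the design is that the subclass only works if the base class really calls the hooks, which is what this issue is about.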

Contrary to my last comment above in PDFBOX-533, you should NOT have to normalize the string extracted by the TextPosition.getCharacters() call within a writeCharacters() override.  The normalization happens automagically when the getCharacters() method is invoked, so the resulting string is (hopefully) flipped to RTL order if necessary.  It will also have special chars and ligatures properly processed by the icu4j presentation utility.
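To illustrate what "ligatures properly processed" means, here is a small sketch that uses the JDK's own java.text.Normalizer as a stand-in for the icu4j utility. That swap is an assumption made only to keep the example free of extra jars. It covers compatibility normalization only and does not do any RTL reordering. NormalizeSketch is a hypothetical name.

```java
import java.text.Normalizer;

class NormalizeSketch {
    // Compatibility-normalize (NFKC) so ligatures and presentation forms
    // are replaced by their plain constituent characters.
    static String normalize(String raw) {
        return Normalizer.normalize(raw, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB01 is the single-glyph "fi" ligature.
        System.out.println(normalize("\uFB01nal"));
    }
}
```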

And the performance looks virtually identical to the 'linestr' version before (i.e. 40s to process the PDF 1.7 ref file).



[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758789#action_12758789 ] 

Mel Martinez commented on PDFBOX-533:
-------------------------------------

Navendu,

That exception is occurring prior to any of the code that PDFTextStripper2 overrides.

It is being thrown out of the 'processStream(PDPage, COSStream)' method which is called immediately _prior_ to writePage() - which is basically what PDFTextStripper2 overrides.

Whatever is causing this is in PDFBox itself.

The 'PDFStreamEngine.processStream()' method scans the COS stream for tokens and processes the operators it finds to 'draw' into the specified page.  For our purposes, when it sees a text operator, it invokes the ShowText operator processor, which calls back to the processEncodedText() method of the base PDFStreamEngine class.  This would in turn eventually invoke the overridden PDFTextStripper.processTextPosition() method, which loads the TextPosition objects found into a buffer that is later sent to the output by writePage().  But it never gets there.  It blows up while trying to process the encoded text.  Specifically, when trying to derive the unicode value, it uses the font named by the current graphics state to encode the character.  Inside that PDFont object it blows up with a NullPointerException.
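The shape of the posted trace (an ExceptionInInitializerError whose cause is a NullPointerException thrown from the Reader constructor inside a static initializer) is what you would see if a resource stream came back null while a class was being initialized. Whether that is what happens inside PDFBox here is an assumption. The sketch below only reproduces the same shape with hypothetical classes (GlyphTableSketch, StaticInitDemo) and a resource path that is made up so that it does not exist.

```java
import java.io.InputStream;
import java.io.InputStreamReader;

// Hypothetical class whose static initializer loads a resource that is missing.
class GlyphTableSketch {
    static final InputStreamReader READER;

    static {
        // getResourceAsStream returns null when the resource cannot be found.
        InputStream in = GlyphTableSketch.class.getResourceAsStream("/no/such/glyphlist.txt");
        // Constructing a reader around null throws NullPointerException.
        READER = new InputStreamReader(in);
    }

    // Any first use of the class triggers the static initializer above.
    static void touch() {
    }
}

class StaticInitDemo {
    // Returns the class name of the exception that made initialization fail.
    // Call this once per JVM: later uses of the broken class throw
    // NoClassDefFoundError instead of ExceptionInInitializerError.
    static String causeOfInitFailure() {
        try {
            GlyphTableSketch.touch();
            return "none";
        } catch (ExceptionInInitializerError e) {
            return e.getCause().getClass().getName();
        }
    }

    public static void main(String[] args) {
        System.out.println(causeOfInitFailure());
    }
}
```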

Can you parse the document using the base PDFTextStripper?

My bet is that it would blow up independent of PDFTextStripper2.

Hmm.... do you have both the fontbox and icu4j library jars in your classpath?   Both are required for this to work.  Specifically, you should have both

fontbox-0.8.0-incubating.jar   and
icu4j-3.8.jar

in your runtime classpath.
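One quick way to check this from code is to probe for a class from each jar. This is only a sketch: ClasspathCheck is a hypothetical helper, and the two class names in main() are assumptions about what those jars contain, so substitute names you know are in your versions.

```java
class ClasspathCheck {
    // True if the named class can be found by this class's loader.
    // The class is located but not initialized.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names, one per jar; adjust to match your jar versions.
        String[] probes = {
            "org.apache.fontbox.cmap.CMap",
            "com.ibm.icu.text.Normalizer"
        };
        for (String name : probes) {
            System.out.println(name + " -> " + isAvailable(name));
        }
    }
}
```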






[jira] Issue Comment Edited: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Navendu Garg (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757270#action_12757270 ] 

Navendu Garg edited comment on PDFBOX-533 at 9/18/09 10:14 AM:
---------------------------------------------------------------

System details:
JDK: 1.5_20
RAM: 2GB
Processor: Intel(R) Core Duo (T7300, 2.00 GHz)
PDFBox Version: 0.7.4-dev (this was available briefly on the PDFBox site a long time ago)

I have attached the code that I used to run my test. I ran the test with 256M Heap space and -server option.

It took about 27s on average to convert the PDF to text. I ran this 10 times.

I have attached the code I used to run this test. 

  

[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797125#action_12797125 ] 

Mel Martinez commented on PDFBOX-533:
-------------------------------------

Yes, that was a typo on my part.

I see that you fixed that with its own issue (PDFBOX-590).  Thanks, Phil!


[jira] Issue Comment Edited: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Phil Varner (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796828#action_12796828 ] 

Phil Varner edited comment on PDFBOX-533 at 1/5/10 8:10 PM:
------------------------------------------------------------

Mel, 

I think it should be 

            while(pdfSource.available() > 0 && objIter.hasNext())

instead, so the call to next() returns the correct Integer when next() is called later on.

This worked for me on a doc that threw the same exception.

I didn't see a separate JIRA issue for this, so I'll file and fix. (edited, I see the pdf now)

 

      was (Author: philvarner):
    Mel, 

I think it should be 

            while(pdfSource.available() > 0 && objIter.hasNext())

instead, so the call to next() returns the correct Integer when next() is called later on.

This worked for me on a doc that threw the same exception.

I didn't see a separate JIRA issue for this, I'll gladly file and fix if someone can provide a doc that the error occurs on (mine is confidential from a customer).

 
  

[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761607#action_12761607 ] 

Mel Martinez commented on PDFBOX-533:
-------------------------------------

Lars - thanks for posting the problematic file - I was able to reproduce the error.

This is actually a different error than what Navendu was hitting, but similarly unrelated to the text extraction code.

This is happening in the PDFXrefStreamParser.parse() method because there is no objIter.hasNext() test to protect the objIter.next() call on line 115.  This is an outright bug.

Specifically, the current code looks like so:

public void parse() throws IOException {
    ...
            Iterator objIter = objNums.iterator();   //<------- here we create the Iterator
            /*
             * Calculating the size of the line in bytes
             */
            int w0 = xrefFormat.getInt(0);
            int w1 = xrefFormat.getInt(1);
            int w2 = xrefFormat.getInt(2);
            int lineSize = w0 + w1 + w2;
            
            while(pdfSource.available() > 0)
            {
                byte[] currLine = new byte[lineSize];
                pdfSource.read(currLine);

                int type = 0;
                /*
                 * Grabs the number of bytes specified for the first column in 
                 * the W array and stores it.
                 */
                for(int i = 0; i < w0; i++)
                {
                    type += (currLine[i] & 0x00ff) << ((w0 - i - 1)* 8);
                }
                //Need to remember the current objID
                Integer objID = (Integer)objIter.next();    //<---- here we attempt to pull objects out of it.
                /*
                 * 3 different types of entries. 
                 */
                switch(type)
                {
                    // ... do stuff ...
                }
            }
    ...
}

The code seems to be written with the assumption that whenever pdfSource.available() > 0, the object-number iterator will have another element.  That seems a bit vulnerable to corrupt streams.  Further, it is a logic error, because the stream seems to contain lines of different types that are not processed as Xref objects.  At least that seems clear from my cursory step-through.

I modified line 100 to look like 

            while(pdfSource.available() > 0 && objIter.next())

and it now parses your test document just fine.  From what I can tell all the text is captured.  
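Note that, as typed above, objIter.next() returns an object rather than a boolean, so that condition would not compile. The follow-up comments in this thread settle on objIter.hasNext(). Here is a self-contained sketch of the guarded loop, not the actual PDFXrefStreamParser code: XrefLoopSketch and its pair() method are hypothetical, and the record layout is simplified to the first-column decode shown in the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class XrefLoopSketch {
    // Reads fixed-size records and pairs each one with the next object number.
    // The loop stops when either the data or the object numbers run out.
    static List<String> pair(byte[] data, List<Integer> objNums, int w0, int lineSize) {
        List<String> out = new ArrayList<String>();
        ByteArrayInputStream src = new ByteArrayInputStream(data);
        Iterator<Integer> objIter = objNums.iterator();

        while (src.available() > 0 && objIter.hasNext()) {
            byte[] currLine = new byte[lineSize];
            src.read(currLine, 0, lineSize);

            // Decode the first column (w0 bytes, big-endian) as the entry type.
            int type = 0;
            for (int i = 0; i < w0; i++) {
                type += (currLine[i] & 0x00ff) << ((w0 - i - 1) * 8);
            }

            // Safe: hasNext() was checked in the loop condition.
            Integer objID = objIter.next();
            out.add(objID + ":" + type);
        }
        return out;
    }
}
```

With three records but only two object numbers, the guarded loop stops after the second record. The unguarded version would throw NoSuchElementException on the third.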

If you use my PDFTextStripper2 you will need to adjust the vertical drop threshold used for paragraph tests.  The default is a bit too small and it breaks most paragraphs up into separate chunks.  I tried a value of 3 (the default is 2.5) and got decent results with your document.  My Deutsch is very, very rusty, but I think it did a decent job.  Note that I just uploaded a new version to PDFBOX-521 that fixes a small bug.

I will create a separate JIRA that covers this particular issue (the missing iterator test) and post the modified src file there (I am not a committer) for consideration by the devs.  I will link back to this one.


[jira] Commented: (PDFBOX-533) PDFTextStripper.writeCharacters is called no where in the class

Posted by "Navendu Garg (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756582#action_12756582 ] 

Navendu Garg commented on PDFBOX-533:
-------------------------------------

Use case: While extracting text, I need character as well as text position information. I also need to keep track of line breaks. The only way I could figure out to do this was to use writeLineSeparator and writeCharacters.

Currently, I am using processLineSeparator & writeCharacters methods, which are available in the previous version, to achieve this and it works fine. 

Now from an API standpoint, if writeCharacters and writeWordSeparator are not used then they should be deprecated/removed. 

I will take a look at the PDFTextStripper2.

thanks

Navendu Garg

