Posted to dev@pdfbox.apache.org by "Sascha Szott (JIRA)" <ji...@apache.org> on 2013/04/29 13:44:16 UTC

[jira] [Commented] (PDFBOX-1585) org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely

    [ https://issues.apache.org/jira/browse/PDFBOX-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644437#comment-13644437 ] 

Sascha Szott commented on PDFBOX-1585:
--------------------------------------

One way to work around this issue is to set a reasonable timeout for the extraction call, here using Guava's TimeLimiter (in my case, an EXTRACTION_TIMEOUT of 300 seconds gave proper results):
{code}
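// TimeLimiter and SimpleTimeLimiter come from Guava (com.google.common.util.concurrent).
TimeLimiter limiter = new SimpleTimeLimiter();
return limiter.callWithTimeout(new Callable<String>() {
  public String call() throws IOException {
    PDDocument doc = PDDocument.load("/home/sascha/testfile.pdf", true);
    try {
      return new PDFTextStripper().getText(doc);
    } finally {
      doc.close(); // release the document's resources even if extraction fails
    }
  }
}, EXTRACTION_TIMEOUT, TimeUnit.SECONDS, false);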
{code}
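If pulling in Guava is not an option, roughly the same effect can be sketched with java.util.concurrent alone. The class name TimeoutRunner and the thread name "extraction-worker" are made up for illustration and are not part of PDFBox or Guava. Note the caveat in the comments: a timeout only frees the calling thread. Whether the worker actually stops depends on the running code reacting to interruption, which I have not verified for the parser loop shown in the stack trace.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (illustration only, not part of PDFBox or Guava). */
public class TimeoutRunner {

    /**
     * Runs the task on a daemon worker thread and waits at most the given time.
     * Throws TimeoutException if the task has not finished by then.
     */
    public static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "extraction-worker");
                // Daemon thread: a worker that never returns cannot keep the JVM alive.
                t.setDaemon(true);
                return t;
            }
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // This only requests interruption. Code that never checks the
            // interrupt flag keeps running in the background.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The extraction call from the snippet above would go into the Callable passed to TimeoutRunner.callWithTimeout, together with EXTRACTION_TIMEOUT and TimeUnit.SECONDS. The caller then catches TimeoutException and skips the offending file.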
                
> org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1585
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1585
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDFReader, Text extraction
>    Affects Versions: 1.8.1
>         Environment: Ubuntu Linux 10.04
> Solaris 10
> Java 1.6.0_34
>            Reporter: Sascha Szott
>
> The problematic PDF file is available at http://www.redalyc.org/pdf/540/54017220.pdf
> My program tries to extract the full text of the given PDF file as follows:
> {code}
> String fileName = "/home/sascha/testfile.pdf";       // 1
> PDDocument pdDoc = PDDocument.load(fileName, true);  // 2
> PDFTextStripper text = new PDFTextStripper();        // 3
> String fullText = text.getText(pdDoc);               // 4
> {code}
> The call in line 4 causes the thread to block indefinitely (it has now been running for more than two days without making any progress). The file is stored on a local file system, so no network interaction occurs.
> jstack shows that the thread is in state RUNNABLE, not deadlocked:
> {code}
> "main" prio=10 tid=0x000000004187d800 nid=0x6ed8 runnable [0x00007f9e28e56000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         - locked <0x00000007d73a84a0> (a java.io.BufferedInputStream)
>         at java.io.FilterInputStream.read(FilterInputStream.java:66)
>         at java.io.PushbackInputStream.read(PushbackInputStream.java:122)
>         at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:91)
>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1006)
>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>         at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>         at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>         at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>         at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
>         at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
>         at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
>         at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
>         at de.kobv.ked.extraction.FulltextExtraction.getFulltext(FulltextExtraction.java:65)
> {code}
> Any ideas or advice on how to fix this problem? Is it possible to set a timeout for the extraction operation?
