Posted to users@pdfbox.apache.org by Jeff Hagen <un...@gmail.com> on 2011/09/30 09:42:18 UTC
Strange connections while extracting text, and a failure condition

Hi,

I've got some toy code I've been playing with that uses pdfbox, and I've noticed a few odd things in my output logs that I thought I should run past you.

My code takes in an already opened URLConnection whose getContentType() value has been checked against a list of valid PDF content-type strings.
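
For reference, the content-type check is roughly along these lines. This is a minimal sketch, not the actual code: the class name PdfContentTypes, the method isPdf, and the two entries in the list are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class PdfContentTypes {
    // Hypothetical list of accepted values; the real list may differ.
    private static final Set<String> PDF_TYPES = new HashSet<String>(
            Arrays.asList("application/pdf", "application/x-pdf"));

    /** Returns true if the Content-Type header value names a PDF type. */
    static boolean isPdf(String contentType) {
        if (contentType == null) {
            return false; // getContentType() may return null when the type is unknown
        }
        // Drop any parameters such as "; charset=..." before comparing.
        int semi = contentType.indexOf(';');
        String base = (semi >= 0) ? contentType.substring(0, semi) : contentType;
        return PDF_TYPES.contains(base.trim().toLowerCase(Locale.ROOT));
    }
}
```

It would be called as PdfContentTypes.isPdf(connection.getContentType()) before handing the connection to getText().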

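(One method of my code is included just so you can see what I'm doing. It relies on imports of java.io.InputStream, java.net.URLConnection, org.apache.pdfbox.pdmodel.PDDocument and org.apache.pdfbox.util.PDFTextStripper.)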

public void getText (URLConnection connection)
{
  InputStream input = null;
  try
  {
    input = connection.getInputStream();
  }
  catch (java.io.IOException e)
  {
    return;
  }

  PDDocument document = null;
  PDFTextStripper stripper = null;
     
  try
  {
    document = PDDocument.load(input);
    if (document != null)
    {
      stripper = new PDFTextStripper("UTF-8");
      if (stripper != null)
        this.text = stripper.getText(document);
      /* this has gotten me a null pointer exception before, so protection is needed?? */
      if (document != null)
        document.close();
      if (input != null)
        input.close();
    }
  }
  catch (java.io.IOException e)
  {
    /* try real hard to close the connections */
    try
    {
      if (document != null)
        document.close();
    }
    catch (java.io.IOException ex)
    {
    }
    try
    {
      if (input != null)
        input.close();
    }
    catch (java.io.IOException ex)
    {
    }
  }
}
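
The duplicated close logic above could probably be collapsed with a try/finally and a small helper. The following is only a sketch of that pattern using plain java.io types: QuietClose, closeQuietly and countBytes are hypothetical names, and countBytes merely stands in for the PDF parsing step. Whether PDDocument can be passed to such a helper depends on whether it implements Closeable in the pdfbox version in use; if it does not, document.close() would have to be called directly inside the finally block.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

class QuietClose {
    /** Closes a resource, tolerating a null reference and any IOException. */
    static void closeQuietly(Closeable c) {
        if (c == null) {
            return;
        }
        try {
            c.close();
        } catch (IOException ignored) {
            // nothing useful can be done if close() itself fails
        }
    }

    /** Stand-in for the real work: reads the stream to its end and counts the bytes. */
    static int countBytes(InputStream input) throws IOException {
        try {
            int n = 0;
            while (input.read() != -1) {
                n++;
            }
            return n;
        } finally {
            // Runs on success and on any exception, so the stream is closed exactly once here.
            closeQuietly(input);
        }
    }
}
```

With this shape the close calls appear in one place only, and they also run when something other than an IOException is thrown.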

Now onto the strangeness...

First off, it just seems to die on some PDF documents where the page size is set very large. Look at around page 34 of the document in the first trace.

- First Trace -
2011-09-29 23:46:47: 11: Error parsing PDF document: http://www.co.sanmateo.ca.us/bos.dir/BosAgendas/agendas2011/Agenda20110329/20110329_a_3.pdf
Sep 29, 2011 11:46:51 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EI
[Unloading class sun.reflect.GeneratedMethodAccessor2]
[Unloading class sun.reflect.GeneratedMethodAccessor3]
Sep 29, 2011 11:47:14 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:14 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:14 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:14 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
INFO: Can't read the embedded type1C font HEDMHP+MyriadPro-Regular
Sep 29, 2011 11:47:15 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:15 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:15 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:15 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:15 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:16 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:16 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:16 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
INFO: Can't read the embedded type1C font HGNGBI+TimesNewRomanPS-BoldItalicMT
Sep 29, 2011 11:47:16 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:16 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
INFO: Can't read the embedded type1C font HGNGCJ+TimesNewRomanPS-BoldMT
Sep 29, 2011 11:47:16 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:16 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
[Loaded sun.reflect.GeneratedConstructorAccessor26 from __JVM_DefineClass__]
Sep 29, 2011 11:47:16 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:16 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
INFO: Can't read the embedded type1C font HGNHHC+Verdana-Bold
Sep 29, 2011 11:47:17 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:17 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:17 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Sep 29, 2011 11:47:17 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
INFO: Can't read the embedded type1C font HGNHOA+Verdana-BoldItalic
Sep 29, 2011 11:47:17 PM org.apache.pdfbox.filter.FlateFilter decode
... many more lines of the same ...

Secondly, when I run pdfbox in a signed and trusted applet with the same text extraction code (ancient I know, but it ought to work...) it prints many of the following lines in the Java console:
network: Cache entry not found [url: http://mack.local/~jhagen/org/apache/pdfbox/resources/cmap/WinAnsiEncoding, version: null]
network: Connecting http://mack.local/~jhagen/org/apache/pdfbox/resources/cmap/WinAnsiEncoding with proxy=DIRECT
(repeated 10 more times)

The applet is hosted on the machine named "mack.local" (a Mac on an intranet), and it was trying to do text extraction on the following document: http://www.co.sanmateo.ca.us/bos.dir/BosAgendas/agendas2011/Agenda20110913/20110913_m_66.pdf

Why is it trying to connect to the machine hosting the applet to find a resource file? Is this a file that should be in my jar?

I do have a class file with that name baked into my jar; however, it is in a different location.
$ jar -tf target/webgrep-1.0-SNAPSHOT-jar-with-dependencies.jar  | grep WinAnsiEncoding
org/apache/pdfbox/encoding/WinAnsiEncoding.class

I do not see a file with that name as part of the pdfbox distribution otherwise.
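
One way to test whether that path resolves from inside the jar, without going through pdfbox at all, is to ask the class loader directly. This is a minimal sketch: ResourceCheck and isOnClasspath are hypothetical names, and the only assumption about the resource is the path shown in the console output above.

```java
import java.net.URL;

class ResourceCheck {
    /** Returns true if the named resource is visible to the loader of this class. */
    static boolean isOnClasspath(String resourcePath) {
        ClassLoader loader = ResourceCheck.class.getClassLoader();
        // getClassLoader() may return null for classes loaded by the bootstrap loader.
        URL url = (loader != null)
                ? loader.getResource(resourcePath)
                : ClassLoader.getSystemResource(resourcePath);
        return url != null;
    }
}
```

Calling ResourceCheck.isOnClasspath("org/apache/pdfbox/resources/cmap/WinAnsiEncoding") from the applet should show whether the lookup succeeds locally or comes back empty.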


Thanks in advance for any help!
-Jeff