Posted to users@pdfbox.apache.org by 叶严杰 <hu...@gmail.com> on 2012/05/09 19:26:13 UTC

bug report for v1.6.0

I tried to extract text from a PDF with PDFBox using stripper.getText (see the
code below).
The PDF is attached as a file, and the error output is below.
Is there any way to work around this bug?

Regards

*Code*
    public void read()
    {
        PDDocument document = null;
        FileInputStream is = null;
        try {
            is = new FileInputStream(file);
            PDFParser parser = new PDFParser(is);
            parser.parse();
            document = parser.getPDDocument();
            PDFTextStripper stripper = new PDFTextStripper();
            content = stripper.getText(document);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (is != null) {
                try {
                    is.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

*Bug Info*
Exception in thread "main" java.lang.NumberFormatException: For input string: "dup"
    at java.lang.NumberFormatException.forInputString(Unknown Source)
    at java.lang.Integer.parseInt(Unknown Source)
    at java.lang.Integer.parseInt(Unknown Source)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
    at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:242)
    at get.read(get.java:33)
    at get.main(get.java:60)

Re: bug report for v1.6.0

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

This issue is fixed in the current trunk; see [1] for further details.

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-1481
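
For anyone reading the trace above: it shows Integer.parseInt being handed the
token "dup" while PDType1Font.getEncodingFromFont reads the embedded font's
encoding. The following is only a minimal sketch of how that kind of failure can
arise and be tolerated; it is not the PDFBox source and not the actual fix. It
assumes encoding entries of the usual Type 1 form "dup <code> /<name> put", and
the class name Type1EncodingSketch, the method parseEncoding, and the sample
lines (including the malformed "dup dup ..." line) are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Type1EncodingSketch {

    /**
     * Collects code -> glyph name pairs from lines shaped like
     * "dup 65 /A put". Lines with any other shape are skipped
     * instead of aborting the whole parse.
     */
    static Map<Integer, String> parseEncoding(String[] lines) {
        Map<Integer, String> encoding = new LinkedHashMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            // Only lines that start with "dup" can be encoding entries.
            if (tokens.length < 3 || !tokens[0].equals("dup")) {
                continue;
            }
            // The glyph name must be a PostScript name literal such as /A.
            if (!tokens[2].startsWith("/")) {
                continue;
            }
            try {
                // An unguarded parseInt here throws NumberFormatException
                // when the second token is not a number (e.g. "dup").
                int code = Integer.parseInt(tokens[1]);
                encoding.put(code, tokens[2].substring(1));
            } catch (NumberFormatException e) {
                // Hypothetical policy for this sketch: ignore the bad entry.
            }
        }
        return encoding;
    }

    public static void main(String[] args) {
        String[] lines = {
            "dup 65 /A put",                // well-formed entry
            "dup dup 161 /exclamdown put",  // made-up malformed entry
            "readonly def"                  // not an encoding entry
        };
        System.out.println(parseEncoding(lines)); // prints {65=A}
    }
}
```

Note also that NumberFormatException is unchecked, so the catch blocks for
FileNotFoundException and IOException in the posted read() method do not
intercept it, which is why it reaches main.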

On 09.05.2012 20:15, 叶严杰 wrote:
> ..url for the pdf file:
> http://www.aclweb.org/anthology-new/P/P02/P02-1046.pdf


Re: bug report for v1.6.0

Posted by 叶严杰 <hu...@gmail.com>.
URL for the PDF file:
http://www.aclweb.org/anthology-new/P/P02/P02-1046.pdf
