Posted to users@pdfbox.apache.org by 叶严杰 <hu...@gmail.com> on 2012/05/09 19:26:13 UTC
bug report for v1.6.0
I tried to extract text from a PDF with PDFBox using stripper.getText (see the
code below).
The PDF is attached as a file, and the error output is included below.
Is there any way to solve this bug?
Regards
*Code*
public void read()
{
    PDDocument document = null;
    FileInputStream is = null;
    try {
        is = new FileInputStream(file);
        PDFParser parser = new PDFParser(is);
        parser.parse();
        document = parser.getPDDocument();
        PDFTextStripper stripper = new PDFTextStripper();
        content = stripper.getText(document);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (is != null) {
            try {
                is.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
*Bug Info*
Exception in thread "main" java.lang.NumberFormatException: For input string: "dup"
    at java.lang.NumberFormatException.forInputString(Unknown Source)
    at java.lang.Integer.parseInt(Unknown Source)
    at java.lang.Integer.parseInt(Unknown Source)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
    at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:242)
    at get.read(get.java:33)
    at get.main(get.java:60)
Re: bug report for v1.6.0
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi
this issue is solved in the current trunk, see [1] for further details.
BR
Andreas Lehmkühler
[1] https://issues.apache.org/jira/browse/PDFBOX-1481
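For readers wondering what the trace means: it shows Integer.parseInt being handed the token "dup" while PDType1Font read the encoding of an embedded font. Embedded Type 1 fonts usually declare their encoding with lines such as "dup 32 /space put", so a line that deviates from that layout can put a non-numeric token where the character code is expected. The following is only a minimal, standalone sketch of tolerant parsing for such lines. It is not the PDFBox code and not the actual fix; the class name Type1EncodingSketch, the method parseEncoding, and the sample font text are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not how PDFBox is implemented.
final class Type1EncodingSketch {

    // Collects "dup <code> /<glyphName> put" entries and skips every line
    // that does not match this layout instead of throwing.
    static Map<Integer, String> parseEncoding(String encodingSection) {
        Map<Integer, String> codeToName = new LinkedHashMap<>();
        for (String line : encodingSection.split("\\r?\\n")) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length < 4
                    || !tokens[0].equals("dup")
                    || !tokens[2].startsWith("/")
                    || !tokens[3].equals("put")) {
                continue; // not an encoding entry
            }
            try {
                int code = Integer.parseInt(tokens[1]);
                codeToName.put(code, tokens[2].substring(1));
            } catch (NumberFormatException e) {
                // e.g. "dup dup /A put": the code position holds "dup".
                // Skip the malformed entry rather than abort the extraction.
            }
        }
        return codeToName;
    }

    public static void main(String[] args) {
        // Made-up sample text, assumed to resemble a Type 1 encoding section.
        String sample = "/Encoding 256 array\n"
                + "0 1 255 {1 index exch /.notdef put} for\n"
                + "dup 32 /space put\n"
                + "dup dup /A put\n"
                + "dup 66 /B put\n"
                + "readonly def\n";
        System.out.println(parseEncoding(sample)); // prints {32=space, 66=B}
    }
}
```

Skipping a malformed entry loses that one glyph mapping but lets text extraction continue for the rest of the document; whether that trade-off is acceptable depends on the application.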
On 09.05.2012 at 20:15, 叶严杰 wrote:
> ..url for the pdf file:
> http://www.aclweb.org/anthology-new/P/P02/P02-1046.pdf
Re: bug report for v1.6.0
Posted by 叶严杰 <hu...@gmail.com>.
Here is the URL for the PDF file:
http://www.aclweb.org/anthology-new/P/P02/P02-1046.pdf