Posted to users@pdfbox.apache.org by Arjohn Kampman <ar...@aduna-software.com> on 2010/02/23 12:20:27 UTC

pdfbox 1.0.0 regression

Hi all,

I seem to be running into a regression with pdfbox 1.0.0. I have a PDF 
file that parses fine with 0.8.0, but triggers an exception with 1.0.0.
The stack trace looks like this:

java.lang.ArrayIndexOutOfBoundsException: 3
	at org.apache.fontbox.cff.CFFParser$IndexData.getBytes(CFFParser.java:585)
	at org.apache.fontbox.cff.CFFParser.parseFont(CFFParser.java:329)
	at org.apache.fontbox.cff.CFFParser.parse(CFFParser.java:65)
	at org.apache.pdfbox.pdmodel.font.PDType1CFont.ensureLoaded(PDType1CFont.java:290)
	at org.apache.pdfbox.pdmodel.font.PDType1CFont.getFontWidth(PDType1CFont.java:138)
	at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:323)
	at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
	at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:552)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:248)
	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:207)
	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
	at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)

A public file that triggers this exception can be found at:
http://domex.nps.edu/corp/files/govdocs1/000/000163.pdf

Any ideas?

Regards,
Arjohn Kampman

Re: pdfbox 1.0.0 regression

Posted by Arjohn Kampman <ar...@aduna-software.com>.
Villu Ruusmann wrote:
> I haven't had time to investigate PDFBOX-634 yet. Hope to do so over
> the next 2 to 3 days.
> 
> However, we have provided a fallback mechanism for "broken" CFF fonts
> as PDFBOX-635, which has been incorporated into SVN trunk as revision
> 915854:
> https://issues.apache.org/jira/browse/PDFBOX-635
> 
> All things considered, if it's not too much trouble for you, I suggest
> you check out the latest PDFBox version (1.0.1-SNAPSHOT) from SVN
> trunk and give it a try:
> http://svn.apache.org/repos/asf/pdfbox/trunk

The trunk is working fine. It logs a warning but produces acceptable
results. Many thanks for this fallback mechanism.

Arjohn

Re: pdfbox 1.0.0 regression

Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,

>
> Did you have time to investigate this issue yet? Anything I can do to
> help? We'd really like to upgrade our 0.8.0 dependency, but this issue
> is a blocker for us.
>

I haven't had time to investigate PDFBOX-634 yet. Hope to do so over
the next 2 to 3 days.

However, we have provided a fallback mechanism for "broken" CFF fonts
as PDFBOX-635, which has been incorporated into SVN trunk as revision
915854:
https://issues.apache.org/jira/browse/PDFBOX-635
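
For readers unfamiliar with the idea, the general shape of such a fallback can be sketched as below. This is not the PDFBOX-635 patch: the class name, the method name, and the use of java.util.function.Supplier are all assumptions made for illustration. It only shows the pattern of trying a primary loader, logging a warning on failure, and returning a substitute result.

```java
import java.util.function.Supplier;
import java.util.logging.Logger;

// Hypothetical illustration of a "try, warn, fall back" loader.
// None of these names exist in PDFBox or FontBox.
final class FallbackSketch {

    private static final Logger LOG = Logger.getLogger(FallbackSketch.class.getName());

    /** Returns the primary result, or the fallback result if the primary loader throws. */
    static <T> T loadOrFallback(Supplier<T> primary, Supplier<T> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // Log a warning instead of aborting the whole text extraction.
            LOG.warning("Primary loader failed, using fallback: " + e);
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        String font = FallbackSketch.<String>loadOrFallback(
                () -> { throw new ArrayIndexOutOfBoundsException(3); },
                () -> "substitute font");
        System.out.println(font);
    }
}
```

The trade-off is that a broken font no longer stops extraction, but the results may only be approximately right for text that uses that font.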

All things considered, if it's not too much trouble for you, I suggest
you check out the latest PDFBox version (1.0.1-SNAPSHOT) from SVN
trunk and give it a try:
http://svn.apache.org/repos/asf/pdfbox/trunk


VR

Re: pdfbox 1.0.0 regression

Posted by Arjohn Kampman <ar...@aduna-software.com>.
Hi Villu,

Did you have time to investigate this issue yet? Anything I can do to
help? We'd really like to upgrade our 0.8.0 dependency, but this issue
is a blocker for us.

Regards,
Arjohn

Villu Ruusmann wrote:
> Hello there,
> 
>> I seem to be running into a regression with pdfbox 1.0.0. I have a PDF file
>> that parses fine with 0.8.0, but triggers an exception with 1.0.0.
>>
> 
> PDFBox 1.0.0 (together with FontBox 1.0.0) introduced "native" support
> for CFF font programs:
> http://issues.apache.org/jira/browse/PDFBOX-542
> 
> The code has undergone a fair amount of testing, but it looks like
> there's room for improvement.
> 
>> A public file that triggers this exception can be found at:
>> http://domex.nps.edu/corp/files/govdocs1/000/000163.pdf
>>
> 
> I created a new issue with all the relevant information attached:
> https://issues.apache.org/jira/browse/PDFBOX-634
> 
> I'll take a closer look at it shortly.
> 
> 
> VR


-- 
Arjohn Kampman, Senior Software Engineer
Aduna - Semantic Power
www.aduna-software.com

Re: pdfbox 1.0.0 regression

Posted by Arjohn Kampman <ar...@aduna-software.com>.
Here's some more info that I got running the process in a debugger.

The problem seems to be located on line 329 in CFFParser:

   font.getCharStringsDict().put(glyphEntry.getName(), charStringsIndex.getBytes(i + 1));

This line appears in a loop that iterates over all glyphEntries.
However, the glyphEntries list has 228 elements while charStringsIndex
has a count of 2 (and three offsets).
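
The mismatch can be reproduced in isolation with the minimal sketch below. The class and method names are hypothetical and are not part of FontBox. It assumes charstring slot 0 is reserved for .notdef, which would explain the i + 1 in the line above, and it shows one possible guard: stop as soon as the glyph names outnumber the available charstrings.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not FontBox code: pairs glyph names with charstring slots
// without reading past the end of the charstring index.
final class GuardedGlyphLoop {

    /** Maps each glyph name to slot i + 1, skipping names that have no charstring. */
    static Map<String, Integer> mapGlyphs(List<String> glyphNames, int charStringCount) {
        Map<String, Integer> mapped = new LinkedHashMap<>();
        for (int i = 0; i < glyphNames.size(); i++) {
            int slot = i + 1; // assumption: slot 0 holds .notdef
            if (slot >= charStringCount) {
                break; // more names than charstrings: stop instead of overrunning
            }
            mapped.put(glyphNames.get(i), slot);
        }
        return mapped;
    }

    public static void main(String[] args) {
        // Three names stand in for the 228 entries; the count of 2 matches the dump.
        List<String> names = Arrays.asList("space", "exclam", "quotedbl");
        System.out.println(mapGlyphs(names, 2)); // prints {space=1}
    }
}
```

Whether silently skipping the remaining names is the right behaviour for the parser is a separate question; the sketch only shows where the bounds check would sit.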

For completeness, here are the toString() values of these two variables:

charStringsIndex:
org.apache.fontbox.cff.CFFParser$IndexData[count=2, offSize=1, 
offset=[1, 78, 81], data=[248, 136, 139, 189, 248, 236, 189, 1, 139, 
189, 248, 36, 189, 3, 139, 4, 248, 136, 249, 80, 252, 136, 6, 247, 142, 
251, 197, 21, 251, 62, 247, 147, 5, 247, 232, 6, 251, 32, 251, 192, 21, 
247, 62, 247, 147, 5, 252, 146, 7, 252, 6, 94, 21, 247, 62, 247, 147, 
247, 62, 251, 147, 5, 252, 6, 248, 191, 21, 247, 62, 251, 147, 251, 62, 
251, 147, 5, 14, 247, 132, 14]]
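
As a reading aid for the dump above: with count=2 there are three offsets, and each object spans from one offset to the next. The sketch below works that out for offset=[1, 78, 81]. It is a hypothetical helper written for this message, not the IndexData class from FontBox, and it assumes the offsets are 1-based relative to the start of the data array, which is consistent with the 80 data bytes shown.

```java
import java.util.Arrays;

// Hypothetical helper, not FontBox's IndexData: interprets a (count + 1)-entry offset array.
final class IndexOffsetsSketch {

    /** Length in bytes of object 'index' (0-based). */
    static int objectLength(int[] offsets, int index) {
        if (index < 0 || index + 1 >= offsets.length) {
            throw new IllegalArgumentException(
                    "no object " + index + " in an index holding " + (offsets.length - 1));
        }
        return offsets[index + 1] - offsets[index];
    }

    /** Copies object 'index' out of the data array. */
    static int[] objectBytes(int[] offsets, int[] data, int index) {
        int length = objectLength(offsets, index);
        int start = offsets[index] - 1; // offsets start at 1, Java arrays at 0
        return Arrays.copyOfRange(data, start, start + length);
    }

    public static void main(String[] args) {
        int[] offsets = {1, 78, 81};
        System.out.println(objectLength(offsets, 0)); // prints 77
        System.out.println(objectLength(offsets, 1)); // prints 3

        // Zero-filled stand-in for the 80 data bytes; only the last three
        // values are copied from the dump.
        int[] data = new int[80];
        data[77] = 247;
        data[78] = 132;
        data[79] = 14;
        System.out.println(Arrays.toString(objectBytes(offsets, data, 1))); // prints [247, 132, 14]
    }
}
```

So the index holds exactly two objects, of 77 and 3 bytes, and asking for a third one has to fail in some way.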

glyphEntries:
[[sid=1, name=space], [sid=2, name=exclam], [sid=3, name=quotedbl], 
[sid=4, name=numbersign], [sid=5, name=dollar], [sid=6, name=percent], 
[sid=7, name=ampersand], [sid=8, name=quoteright], [sid=9, 
name=parenleft], [sid=10, name=parenright], [sid=11, name=asterisk], 
[sid=12, name=plus], [sid=13, name=comma], [sid=14, name=hyphen], 
[sid=15, name=period], [sid=16, name=slash], [sid=17, name=zero], 
[sid=18, name=one], [sid=19, name=two], [sid=20, name=three], [sid=21, 
name=four], [sid=22, name=five], [sid=23, name=six], [sid=24, 
name=seven], [sid=25, name=eight], [sid=26, name=nine], [sid=27, 
name=colon], [sid=28, name=semicolon], [sid=29, name=less], [sid=30, 
name=equal], [sid=31, name=greater], [sid=32, name=question], [sid=33, 
name=at], [sid=34, name=A], [sid=35, name=B], [sid=36, name=C], [sid=37, 
name=D], [sid=38, name=E], [sid=39, name=F], [sid=40, name=G], [sid=41, 
name=H], [sid=42, name=I], [sid=43, name=J], [sid=44, name=K], [sid=45, 
name=L], [sid=46, name=M], [sid=47, name=N], [sid=48, name=O], [sid=49, 
name=P], [sid=50, name=Q], [sid=51, name=R], [sid=52, name=S], [sid=53, 
name=T], [sid=54, name=U], [sid=55, name=V], [sid=56, name=W], [sid=57, 
name=X], [sid=58, name=Y], [sid=59, name=Z], [sid=60, name=bracketleft], 
[sid=61, name=backslash], [sid=62, name=bracketright], [sid=63, 
name=asciicircum], [sid=64, name=underscore], [sid=65, name=quoteleft], 
[sid=66, name=a], [sid=67, name=b], [sid=68, name=c], [sid=69, name=d], 
[sid=70, name=e], [sid=71, name=f], [sid=72, name=g], [sid=73, name=h], 
[sid=74, name=i], [sid=75, name=j], [sid=76, name=k], [sid=77, name=l], 
[sid=78, name=m], [sid=79, name=n], [sid=80, name=o], [sid=81, name=p], 
[sid=82, name=q], [sid=83, name=r], [sid=84, name=s], [sid=85, name=t], 
[sid=86, name=u], [sid=87, name=v], [sid=88, name=w], [sid=89, name=x], 
[sid=90, name=y], [sid=91, name=z], [sid=92, name=braceleft], [sid=93, 
name=bar], [sid=94, name=braceright], [sid=95, name=asciitilde], 
[sid=96, name=exclamdown], [sid=97, name=cent], [sid=98, name=sterling], 
[sid=99, name=fraction], [sid=100, name=yen], [sid=101, name=florin], 
[sid=102, name=section], [sid=103, name=currency], [sid=104, 
name=quotesingle], [sid=105, name=quotedblleft], [sid=106, 
name=guillemotleft], [sid=107, name=guilsinglleft], [sid=108, 
name=guilsinglright], [sid=109, name=fi], [sid=110, name=fl], [sid=111, 
name=endash], [sid=112, name=dagger], [sid=113, name=daggerdbl], 
[sid=114, name=periodcentered], [sid=115, name=paragraph], [sid=116, 
name=bullet], [sid=117, name=quotesinglbase], [sid=118, 
name=quotedblbase], [sid=119, name=quotedblright], [sid=120, 
name=guillemotright], [sid=121, name=ellipsis], [sid=122, 
name=perthousand], [sid=123, name=questiondown], [sid=124, name=grave], 
[sid=125, name=acute], [sid=126, name=circumflex], [sid=127, 
name=tilde], [sid=128, name=macron], [sid=129, name=breve], [sid=130, 
name=dotaccent], [sid=131, name=dieresis], [sid=132, name=ring], 
[sid=133, name=cedilla], [sid=134, name=hungarumlaut], [sid=135, 
name=ogonek], [sid=136, name=caron], [sid=137, name=emdash], [sid=138, 
name=AE], [sid=139, name=ordfeminine], [sid=140, name=Lslash], [sid=141, 
name=Oslash], [sid=142, name=OE], [sid=143, name=ordmasculine], 
[sid=144, name=ae], [sid=145, name=dotlessi], [sid=146, name=lslash], 
[sid=147, name=oslash], [sid=148, name=oe], [sid=149, name=germandbls], 
[sid=150, name=onesuperior], [sid=151, name=logicalnot], [sid=152, 
name=mu], [sid=153, name=trademark], [sid=154, name=Eth], [sid=155, 
name=onehalf], [sid=156, name=plusminus], [sid=157, name=Thorn], 
[sid=158, name=onequarter], [sid=159, name=divide], [sid=160, 
name=brokenbar], [sid=161, name=degree], [sid=162, name=thorn], 
[sid=163, name=threequarters], [sid=164, name=twosuperior], [sid=165, 
name=registered], [sid=166, name=minus], [sid=167, name=eth], [sid=168, 
name=multiply], [sid=169, name=threesuperior], [sid=170, 
name=copyright], [sid=171, name=Aacute], [sid=172, name=Acircumflex], 
[sid=173, name=Adieresis], [sid=174, name=Agrave], [sid=175, 
name=Aring], [sid=176, name=Atilde], [sid=177, name=Ccedilla], [sid=178, 
name=Eacute], [sid=179, name=Ecircumflex], [sid=180, name=Edieresis], 
[sid=181, name=Egrave], [sid=182, name=Iacute], [sid=183, 
name=Icircumflex], [sid=184, name=Idieresis], [sid=185, name=Igrave], 
[sid=186, name=Ntilde], [sid=187, name=Oacute], [sid=188, 
name=Ocircumflex], [sid=189, name=Odieresis], [sid=190, name=Ograve], 
[sid=191, name=Otilde], [sid=192, name=Scaron], [sid=193, name=Uacute], 
[sid=194, name=Ucircumflex], [sid=195, name=Udieresis], [sid=196, 
name=Ugrave], [sid=197, name=Yacute], [sid=198, name=Ydieresis], 
[sid=199, name=Zcaron], [sid=200, name=aacute], [sid=201, 
name=acircumflex], [sid=202, name=adieresis], [sid=203, name=agrave], 
[sid=204, name=aring], [sid=205, name=atilde], [sid=206, name=ccedilla], 
[sid=207, name=eacute], [sid=208, name=ecircumflex], [sid=209, 
name=edieresis], [sid=210, name=egrave], [sid=211, name=iacute], 
[sid=212, name=icircumflex], [sid=213, name=idieresis], [sid=214, 
name=igrave], [sid=215, name=ntilde], [sid=216, name=oacute], [sid=217, 
name=ocircumflex], [sid=218, name=odieresis], [sid=219, name=ograve], 
[sid=220, name=otilde], [sid=221, name=scaron], [sid=222, name=uacute], 
[sid=223, name=ucircumflex], [sid=224, name=udieresis], [sid=225, 
name=ugrave], [sid=226, name=yacute], [sid=227, name=ydieresis], 
[sid=228, name=zcaron]]


Hope this helps,

Arjohn

Re: pdfbox 1.0.0 regression

Posted by Arjohn Kampman <ar...@aduna-software.com>.
Thank you very much. Hope to hear from you soon.

Arjohn

Villu Ruusmann wrote:
> Hello there,
> 
>> I seem to be running into a regression with pdfbox 1.0.0. I have a PDF file
>> that parses fine with 0.8.0, but triggers an exception with 1.0.0.
>>
> 
> PDFBox 1.0.0 (together with FontBox 1.0.0) introduced "native" support
> for CFF font programs:
> http://issues.apache.org/jira/browse/PDFBOX-542
> 
> The code has undergone a fair amount of testing, but it looks like
> there's room for improvement.
> 
>> A public file that triggers this exception can be found at:
>> http://domex.nps.edu/corp/files/govdocs1/000/000163.pdf
>>
> 
> I created a new issue with all the relevant information attached:
> https://issues.apache.org/jira/browse/PDFBOX-634
> 
> I'll take a closer look at it shortly.
> 
> 
> VR


-- 
Arjohn Kampman, Senior Software Engineer
Aduna - Semantic Power
www.aduna-software.com

Re: pdfbox 1.0.0 regression

Posted by Arjohn Kampman <ar...@aduna-software.com>.
I think so. I used maven to get all dependencies for pdfbox.

Arjohn

Daniel Wilson wrote:
> Arjohn,
> Are you indeed using version 1.0.0 of Fontbox?  I ran into some difficulty
> when I used the latest PDFBox with an older Fontbox.
> 
> Daniel
> 
> On Tue, Feb 23, 2010 at 10:08 AM, Villu Ruusmann
> <vi...@gmail.com> wrote:
> 
>> Hello there,
>>
>>> I seem to be running into a regression with pdfbox 1.0.0. I have a PDF file
>>> that parses fine with 0.8.0, but triggers an exception with 1.0.0.
>>>
>> PDFBox 1.0.0 (together with FontBox 1.0.0) introduced "native" support
>> for CFF font programs:
>> http://issues.apache.org/jira/browse/PDFBOX-542
>>
>> The code has undergone a fair amount of testing, but it looks like
>> there's room for improvement.
>>
>>> A public file that triggers this exception can be found at:
>>> http://domex.nps.edu/corp/files/govdocs1/000/000163.pdf
>>>
>> I created a new issue with all the relevant information attached:
>> https://issues.apache.org/jira/browse/PDFBOX-634
>>
>> I'll take a closer look at it shortly.
>>
>>
>> VR
>>
> 


-- 
Arjohn Kampman, Senior Software Engineer
Aduna - Semantic Power
www.aduna-software.com

Re: pdfbox 1.0.0 regression

Posted by Daniel Wilson <wi...@gmail.com>.
Arjohn,
Are you indeed using version 1.0.0 of Fontbox?  I ran into some difficulty
when I used the latest PDFBox with an older Fontbox.
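
One way to check which jar version is actually loaded at runtime is to read the Implementation-Version from the package metadata, as in the sketch below. The helper class is hypothetical, and whether the FontBox jar records this attribute in its manifest is an assumption; if it does not, the method simply reports "unknown".

```java
// Hypothetical helper: reports the Implementation-Version recorded for a class's package.
final class JarVersionSketch {

    /** Returns the manifest version for the package of 'type', or "unknown" if none is recorded. */
    static String implementationVersion(Class<?> type) {
        Package pkg = type.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return (version == null) ? "unknown" : version;
    }

    public static void main(String[] args) {
        // With FontBox on the classpath, one would pass
        // org.apache.fontbox.cff.CFFParser.class here instead.
        System.out.println(implementationVersion(JarVersionSketch.class));
    }
}
```

Running "mvn dependency:tree" on the project is another way to see which FontBox version Maven resolved.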

Daniel

On Tue, Feb 23, 2010 at 10:08 AM, Villu Ruusmann
<vi...@gmail.com> wrote:

> Hello there,
>
> >
> > I seem to be running into a regression with pdfbox 1.0.0. I have a PDF file
> > that parses fine with 0.8.0, but triggers an exception with 1.0.0.
> >
>
> PDFBox 1.0.0 (together with FontBox 1.0.0) introduced "native" support
> for CFF font programs:
> http://issues.apache.org/jira/browse/PDFBOX-542
>
> The code has undergone a fair amount of testing, but it looks like
> there's room for improvement.
>
> > A public file that triggers this exception can be found at:
> > http://domex.nps.edu/corp/files/govdocs1/000/000163.pdf
> >
>
> I created a new issue with all the relevant information attached:
> https://issues.apache.org/jira/browse/PDFBOX-634
>
> I'll take a closer look at it shortly.
>
>
> VR
>

Re: pdfbox 1.0.0 regression

Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,

>
> I seem to be running into a regression with pdfbox 1.0.0. I have a PDF file
> that parses fine with 0.8.0, but triggers an exception with 1.0.0.
>

PDFBox 1.0.0 (together with FontBox 1.0.0) introduced "native" support
for CFF font programs:
http://issues.apache.org/jira/browse/PDFBOX-542

The code has undergone a fair amount of testing, but it looks like
there's room for improvement.

> A public file that triggers this exception can be found at:
> http://domex.nps.edu/corp/files/govdocs1/000/000163.pdf
>

I created a new issue with all the relevant information attached:
https://issues.apache.org/jira/browse/PDFBOX-634

I'll take a closer look at it shortly.


VR