Posted to users@pdfbox.apache.org by Arjohn Kampman <ar...@aduna-software.com> on 2010/02/23 12:20:27 UTC
pdfbox 1.0.0 regression
Hi all,
I seem to be running into a regression with pdfbox 1.0.0. I have a PDF
file that parses fine with 0.8.0, but triggers an exception with 1.0.0.
The stack trace looks like this:
java.lang.ArrayIndexOutOfBoundsException: 3
at org.apache.fontbox.cff.CFFParser$IndexData.getBytes(CFFParser.java:585)
at org.apache.fontbox.cff.CFFParser.parseFont(CFFParser.java:329)
at org.apache.fontbox.cff.CFFParser.parse(CFFParser.java:65)
at org.apache.pdfbox.pdmodel.font.PDType1CFont.ensureLoaded(PDType1CFont.java:290)
at org.apache.pdfbox.pdmodel.font.PDType1CFont.getFontWidth(PDType1CFont.java:138)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:323)
at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:552)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:248)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:207)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
A public file that triggers this exception can be found at:
http://domex.nps.edu/corp/files/govdocs1/000/000163.pdf
Any ideas?
Regards,
Arjohn Kampman
Re: pdfbox 1.0.0 regression
Posted by Arjohn Kampman <ar...@aduna-software.com>.
Villu Ruusmann wrote:
> I haven't had time to investigate PDFBOX-634 yet. Hope to do so over
> the next 2 to 3 days.
>
> However, we have provided a fallback mechanism for "broken" CFF fonts
> as PDFBOX-635, which has been incorporated into SVN trunk as revision
> 915854:
> https://issues.apache.org/jira/browse/PDFBOX-635
>
> All things considered, if it's not too much trouble for you, I suggest
> you check out the latest PDFBox version (1.0.1-SNAPSHOT) from SVN
> trunk and give it a try:
> http://svn.apache.org/repos/asf/pdfbox/trunk
The trunk is working fine. It logs a warning but produces acceptable
results. Many thanks for this fallback mechanism.
Arjohn
Re: pdfbox 1.0.0 regression
Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,
>
> Did you have time to investigate this issue yet? Anything I can do to
> help? We'd really like to upgrade our 0.8.0 dependency, but this issue
> is a blocker for us.
>
I haven't had time to investigate PDFBOX-634 yet. Hope to do so over
the next 2 to 3 days.
However, we have provided a fallback mechanism for "broken" CFF fonts
as PDFBOX-635, which has been incorporated into SVN trunk as revision
915854:
https://issues.apache.org/jira/browse/PDFBOX-635
All things considered, if it's not too much trouble for you, I suggest
you check out the latest PDFBox version (1.0.1-SNAPSHOT) from SVN
trunk and give it a try:
http://svn.apache.org/repos/asf/pdfbox/trunk
VR
Re: pdfbox 1.0.0 regression
Posted by Arjohn Kampman <ar...@aduna-software.com>.
Hi Villu,
Did you have time to investigate this issue yet? Anything I can do to
help? We'd really like to upgrade our 0.8.0 dependency, but this issue
is a blocker for us.
Regards,
Arjohn
Villu Ruusmann wrote:
> Hello there,
>
>> I seem to be running into a regression with pdfbox 1.0.0. I have a PDF file
>> that parses fine with 0.8.0, but triggers an exception with 1.0.0.
>>
>
> PDFBox 1.0.0 (together with FontBox 1.0.0) introduced "native" support
> for CFF font programs:
> http://issues.apache.org/jira/browse/PDFBOX-542
>
> The code has undergone a fair amount of testing, but it looks like
> there's room for improvement.
>
>> A public file that triggers this exception can be found at:
>> http://domex.nps.edu/corp/files/govdocs1/000/000163.pdf
>>
>
> I created a new issue with all the relevant information attached:
> https://issues.apache.org/jira/browse/PDFBOX-634
>
> I'll take a closer look at it shortly.
>
>
> VR
--
Arjohn Kampman, Senior Software Engineer
Aduna - Semantic Power
www.aduna-software.com
Re: pdfbox 1.0.0 regression
Posted by Arjohn Kampman <ar...@aduna-software.com>.
Here's some more info that I got running the process in a debugger.
The problem seems to be located on line 329 in CFFParser:
font.getCharStringsDict().put(glyphEntry.getName(),
charStringsIndex.getBytes(i + 1));
This line appears in a loop that iterates over all glyphEntries.
However, the glyphEntries list has 228 elements while charStringsIndex
has a count of 2 (and three offsets).
For completeness, here are the toString() values of these two variables:
charStringsIndex:
org.apache.fontbox.cff.CFFParser$IndexData[count=2, offSize=1,
offset=[1, 78, 81], data=[248, 136, 139, 189, 248, 236, 189, 1, 139,
189, 248, 36, 189, 3, 139, 4, 248, 136, 249, 80, 252, 136, 6, 247, 142,
251, 197, 21, 251, 62, 247, 147, 5, 247, 232, 6, 251, 32, 251, 192, 21,
247, 62, 247, 147, 5, 252, 146, 7, 252, 6, 94, 21, 247, 62, 247, 147,
247, 62, 251, 147, 5, 252, 6, 248, 191, 21, 247, 62, 251, 147, 251, 62,
251, 147, 5, 14, 247, 132, 14]]
glyphEntries:
[[sid=1, name=space], [sid=2, name=exclam], [sid=3, name=quotedbl],
[sid=4, name=numbersign], [sid=5, name=dollar], [sid=6, name=percent],
[sid=7, name=ampersand], [sid=8, name=quoteright], [sid=9,
name=parenleft], [sid=10, name=parenright], [sid=11, name=asterisk],
[sid=12, name=plus], [sid=13, name=comma], [sid=14, name=hyphen],
[sid=15, name=period], [sid=16, name=slash], [sid=17, name=zero],
[sid=18, name=one], [sid=19, name=two], [sid=20, name=three], [sid=21,
name=four], [sid=22, name=five], [sid=23, name=six], [sid=24,
name=seven], [sid=25, name=eight], [sid=26, name=nine], [sid=27,
name=colon], [sid=28, name=semicolon], [sid=29, name=less], [sid=30,
name=equal], [sid=31, name=greater], [sid=32, name=question], [sid=33,
name=at], [sid=34, name=A], [sid=35, name=B], [sid=36, name=C], [sid=37,
name=D], [sid=38, name=E], [sid=39, name=F], [sid=40, name=G], [sid=41,
name=H], [sid=42, name=I], [sid=43, name=J], [sid=44, name=K], [sid=45,
name=L], [sid=46, name=M], [sid=47, name=N], [sid=48, name=O], [sid=49,
name=P], [sid=50, name=Q], [sid=51, name=R], [sid=52, name=S], [sid=53,
name=T], [sid=54, name=U], [sid=55, name=V], [sid=56, name=W], [sid=57,
name=X], [sid=58, name=Y], [sid=59, name=Z], [sid=60, name=bracketleft],
[sid=61, name=backslash], [sid=62, name=bracketright], [sid=63,
name=asciicircum], [sid=64, name=underscore], [sid=65, name=quoteleft],
[sid=66, name=a], [sid=67, name=b], [sid=68, name=c], [sid=69, name=d],
[sid=70, name=e], [sid=71, name=f], [sid=72, name=g], [sid=73, name=h],
[sid=74, name=i], [sid=75, name=j], [sid=76, name=k], [sid=77, name=l],
[sid=78, name=m], [sid=79, name=n], [sid=80, name=o], [sid=81, name=p],
[sid=82, name=q], [sid=83, name=r], [sid=84, name=s], [sid=85, name=t],
[sid=86, name=u], [sid=87, name=v], [sid=88, name=w], [sid=89, name=x],
[sid=90, name=y], [sid=91, name=z], [sid=92, name=braceleft], [sid=93,
name=bar], [sid=94, name=braceright], [sid=95, name=asciitilde],
[sid=96, name=exclamdown], [sid=97, name=cent], [sid=98, name=sterling],
[sid=99, name=fraction], [sid=100, name=yen], [sid=101, name=florin],
[sid=102, name=section], [sid=103, name=currency], [sid=104,
name=quotesingle], [sid=105, name=quotedblleft], [sid=106,
name=guillemotleft], [sid=107, name=guilsinglleft], [sid=108,
name=guilsinglright], [sid=109, name=fi], [sid=110, name=fl], [sid=111,
name=endash], [sid=112, name=dagger], [sid=113, name=daggerdbl],
[sid=114, name=periodcentered], [sid=115, name=paragraph], [sid=116,
name=bullet], [sid=117, name=quotesinglbase], [sid=118,
name=quotedblbase], [sid=119, name=quotedblright], [sid=120,
name=guillemotright], [sid=121, name=ellipsis], [sid=122,
name=perthousand], [sid=123, name=questiondown], [sid=124, name=grave],
[sid=125, name=acute], [sid=126, name=circumflex], [sid=127,
name=tilde], [sid=128, name=macron], [sid=129, name=breve], [sid=130,
name=dotaccent], [sid=131, name=dieresis], [sid=132, name=ring],
[sid=133, name=cedilla], [sid=134, name=hungarumlaut], [sid=135,
name=ogonek], [sid=136, name=caron], [sid=137, name=emdash], [sid=138,
name=AE], [sid=139, name=ordfeminine], [sid=140, name=Lslash], [sid=141,
name=Oslash], [sid=142, name=OE], [sid=143, name=ordmasculine],
[sid=144, name=ae], [sid=145, name=dotlessi], [sid=146, name=lslash],
[sid=147, name=oslash], [sid=148, name=oe], [sid=149, name=germandbls],
[sid=150, name=onesuperior], [sid=151, name=logicalnot], [sid=152,
name=mu], [sid=153, name=trademark], [sid=154, name=Eth], [sid=155,
name=onehalf], [sid=156, name=plusminus], [sid=157, name=Thorn],
[sid=158, name=onequarter], [sid=159, name=divide], [sid=160,
name=brokenbar], [sid=161, name=degree], [sid=162, name=thorn],
[sid=163, name=threequarters], [sid=164, name=twosuperior], [sid=165,
name=registered], [sid=166, name=minus], [sid=167, name=eth], [sid=168,
name=multiply], [sid=169, name=threesuperior], [sid=170,
name=copyright], [sid=171, name=Aacute], [sid=172, name=Acircumflex],
[sid=173, name=Adieresis], [sid=174, name=Agrave], [sid=175,
name=Aring], [sid=176, name=Atilde], [sid=177, name=Ccedilla], [sid=178,
name=Eacute], [sid=179, name=Ecircumflex], [sid=180, name=Edieresis],
[sid=181, name=Egrave], [sid=182, name=Iacute], [sid=183,
name=Icircumflex], [sid=184, name=Idieresis], [sid=185, name=Igrave],
[sid=186, name=Ntilde], [sid=187, name=Oacute], [sid=188,
name=Ocircumflex], [sid=189, name=Odieresis], [sid=190, name=Ograve],
[sid=191, name=Otilde], [sid=192, name=Scaron], [sid=193, name=Uacute],
[sid=194, name=Ucircumflex], [sid=195, name=Udieresis], [sid=196,
name=Ugrave], [sid=197, name=Yacute], [sid=198, name=Ydieresis],
[sid=199, name=Zcaron], [sid=200, name=aacute], [sid=201,
name=acircumflex], [sid=202, name=adieresis], [sid=203, name=agrave],
[sid=204, name=aring], [sid=205, name=atilde], [sid=206, name=ccedilla],
[sid=207, name=eacute], [sid=208, name=ecircumflex], [sid=209,
name=edieresis], [sid=210, name=egrave], [sid=211, name=iacute],
[sid=212, name=icircumflex], [sid=213, name=idieresis], [sid=214,
name=igrave], [sid=215, name=ntilde], [sid=216, name=oacute], [sid=217,
name=ocircumflex], [sid=218, name=odieresis], [sid=219, name=ograve],
[sid=220, name=otilde], [sid=221, name=scaron], [sid=222, name=uacute],
[sid=223, name=ucircumflex], [sid=224, name=udieresis], [sid=225,
name=ugrave], [sid=226, name=yacute], [sid=227, name=ydieresis],
[sid=228, name=zcaron]]
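
The mismatch described above (228 glyph entries against an index with count=2 and three offsets) can be reproduced in isolation. The sketch below is not the FontBox source: IndexData here is a simplified, hypothetical stand-in, and its getBytes() lookup (entry n spans offset[n] to offset[n + 1]) is an assumption chosen because it is consistent with the index 3 reported in the stack trace. The helper names firstFailingGlyph and mappableGlyphs are likewise made up for illustration.

```java
import java.util.Arrays;

class CharStringsIndexSketch {

    /** Hypothetical, simplified stand-in for CFFParser.IndexData (not the real class). */
    static final class IndexData {
        final int count;    // number of entries in the index
        final int[] offset; // count + 1 offsets, 1-based as in the dump above
        final int[] data;   // raw bytes of all entries, back to back

        IndexData(int[] offset, int[] data) {
            this.count = offset.length - 1;
            this.offset = offset;
            this.data = data;
        }

        /** Assumed lookup: entry 'index' spans offset[index] .. offset[index + 1]. */
        int[] getBytes(int index) {
            int from = offset[index] - 1;
            int to = offset[index + 1] - 1; // throws once index + 1 >= offset.length
            return Arrays.copyOfRange(data, from, to);
        }
    }

    /** Returns the first loop index i at which getBytes(i + 1) fails, or -1 if none. */
    static int firstFailingGlyph(int glyphCount, IndexData index) {
        for (int i = 0; i < glyphCount; i++) {
            try {
                index.getBytes(i + 1);
            } catch (ArrayIndexOutOfBoundsException e) {
                return i;
            }
        }
        return -1;
    }

    /** Guarded variant: how many glyphs have a charstring at position i + 1. */
    static int mappableGlyphs(int glyphCount, IndexData index) {
        return Math.max(0, Math.min(glyphCount, index.count - 1));
    }

    public static void main(String[] args) {
        // Same shape as the dump: offsets [1, 78, 81] over 80 data bytes (contents omitted).
        IndexData index = new IndexData(new int[] {1, 78, 81}, new int[80]);
        System.out.println("first failing i: " + firstFailingGlyph(228, index));
        System.out.println("mappable glyphs: " + mappableGlyphs(228, index));
    }
}
```

Under these assumptions the loop already fails on its second iteration, so bounding the loop by the index count (or skipping glyphs without a charstring) would avoid the exception. Whether that is the right fix for the parser is a separate question.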
Hope this helps,
Arjohn
Re: pdfbox 1.0.0 regression
Posted by Arjohn Kampman <ar...@aduna-software.com>.
Thank you very much. Hope to hear from you soon.
Arjohn
Villu Ruusmann wrote:
> Hello there,
>
>> I seem to be running into a regression with pdfbox 1.0.0. I have a PDF file
>> that parses fine with 0.8.0, but triggers an exception with 1.0.0.
>>
>
> PDFBox 1.0.0 (together with FontBox 1.0.0) introduced "native" support
> for CFF font programs:
> http://issues.apache.org/jira/browse/PDFBOX-542
>
> The code has undergone a fair amount of testing, but it looks like
> there's room for improvement.
>
>> A public file that triggers this exception can be found at:
>> http://domex.nps.edu/corp/files/govdocs1/000/000163.pdf
>>
>
> I created a new issue with all the relevant information attached:
> https://issues.apache.org/jira/browse/PDFBOX-634
>
> I'll take a closer look at it shortly.
>
>
> VR
--
Arjohn Kampman, Senior Software Engineer
Aduna - Semantic Power
www.aduna-software.com
Re: pdfbox 1.0.0 regression
Posted by Arjohn Kampman <ar...@aduna-software.com>.
I think so. I used maven to get all dependencies for pdfbox.
Arjohn
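
One way to confirm which PDFBox and FontBox jars are actually on the classpath, rather than relying on what Maven was asked to resolve, is to print the Implementation-Version recorded in each jar's manifest. The sketch below is only an illustration: the class JarVersionCheck and its helper are hypothetical names, and it assumes the jars carry an Implementation-Version manifest entry (it prints "unknown" when they do not).

```java
class JarVersionCheck {

    /** Returns the manifest Implementation-Version for the jar that supplied 'type', or "unknown". */
    static String implementationVersion(Class<?> type) {
        Package pkg = type.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return (version == null) ? "unknown" : version;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass fully qualified class names as arguments; defaults to a JDK class.
        String[] names = args.length > 0 ? args : new String[] {"java.lang.String"};
        for (String name : names) {
            System.out.println(name + " -> " + implementationVersion(Class.forName(name)));
        }
    }
}
```

Run with both jars on the classpath and the arguments org.apache.pdfbox.pdmodel.PDDocument and org.apache.fontbox.cff.CFFParser, this would print one version per library, which makes a PDFBox/FontBox mismatch easy to spot.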
Daniel Wilson wrote:
> Arjohn,
> Are you indeed using version 1.0.0 of Fontbox? I ran into some difficulty
> when I used the latest PDFBox with an older Fontbox.
>
> Daniel
>
> On Tue, Feb 23, 2010 at 10:08 AM, Villu Ruusmann
> <vi...@gmail.com> wrote:
>
>> Hello there,
>>
>>> I seem to be running into a regression with pdfbox 1.0.0. I have a PDF file
>>> that parses fine with 0.8.0, but triggers an exception with 1.0.0.
>>>
>> PDFBox 1.0.0 (together with FontBox 1.0.0) introduced "native" support
>> for CFF font programs:
>> http://issues.apache.org/jira/browse/PDFBOX-542
>>
>> The code has undergone a fair amount of testing, but it looks like
>> there's room for improvement.
>>
>>> A public file that triggers this exception can be found at:
>>> http://domex.nps.edu/corp/files/govdocs1/000/000163.pdf
>>>
>> I created a new issue with all the relevant information attached:
>> https://issues.apache.org/jira/browse/PDFBOX-634
>>
>> I'll take a closer look at it shortly.
>>
>>
>> VR
>>
>
--
Arjohn Kampman, Senior Software Engineer
Aduna - Semantic Power
www.aduna-software.com
Re: pdfbox 1.0.0 regression
Posted by Daniel Wilson <wi...@gmail.com>.
Arjohn,
Are you indeed using version 1.0.0 of Fontbox? I ran into some difficulty
when I used the latest PDFBox with an older Fontbox.
Daniel
On Tue, Feb 23, 2010 at 10:08 AM, Villu Ruusmann
<vi...@gmail.com> wrote:
> Hello there,
>
> >
> > I seem to be running into a regression with pdfbox 1.0.0. I have a PDF file
> > that parses fine with 0.8.0, but triggers an exception with 1.0.0.
> >
>
> PDFBox 1.0.0 (together with FontBox 1.0.0) introduced "native" support
> for CFF font programs:
> http://issues.apache.org/jira/browse/PDFBOX-542
>
> The code has undergone a fair amount of testing, but it looks like
> there's room for improvement.
>
> > A public file that triggers this exception can be found at:
> > http://domex.nps.edu/corp/files/govdocs1/000/000163.pdf
> >
>
> I created a new issue with all the relevant information attached:
> https://issues.apache.org/jira/browse/PDFBOX-634
>
> I'll take a closer look at it shortly.
>
>
> VR
>
Re: pdfbox 1.0.0 regression
Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,
>
> I seem to be running into a regression with pdfbox 1.0.0. I have a PDF file
> that parses fine with 0.8.0, but triggers an exception with 1.0.0.
>
PDFBox 1.0.0 (together with FontBox 1.0.0) introduced "native" support
for CFF font programs:
http://issues.apache.org/jira/browse/PDFBOX-542
The code has undergone a fair amount of testing, but it looks like
there's room for improvement.
> A public file that triggers this exception can be found at:
> http://domex.nps.edu/corp/files/govdocs1/000/000163.pdf
>
I created a new issue with all the relevant information attached:
https://issues.apache.org/jira/browse/PDFBOX-634
I'll take a closer look at it shortly.
VR