Posted to fop-dev@xmlgraphics.apache.org by bu...@apache.org on 2010/12/14 15:13:29 UTC

DO NOT REPLY [Bug 50471] New: Greek Extended character throwing ArrayIndexOutOfBoundException.

https://issues.apache.org/bugzilla/show_bug.cgi?id=50471

           Summary: Greek Extended character throwing
                    ArrayIndexOutOfBoundException.
           Product: Fop
           Version: 0.95
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: critical
          Priority: P2
         Component: pdf
        AssignedTo: fop-dev@xmlgraphics.apache.org
        ReportedBy: tvsudhir@rediffmail.com


We want to create a PDF using FOP. We transform an XML file with an XSL
stylesheet to create the PDF. The XML file contains a Greek Extended character
whose decimal code is 8062, hex code is 1F7E, and HTML representation is &#8062;.
As soon as this character is encountered in the string, the
transformer.transform method throws a TransformerException, which is actually
caused by an ArrayIndexOutOfBoundsException.
The exact exception stack trace is given below.
We tried reading the FOP code, but we could not understand the array
lineBreakProperties defined in LineBreakUtils.

Please help us find a way around this exception.

Base Exception in PDFGenerator.buildPdf() Error in Creating PDF
      at PDFTest.buildPdf(PDFTest.java:140)
      at PDFTest.main(PDFTest.java:50)
Caused by: javax.xml.transform.TransformerException: java.lang.ArrayIndexOutOfBoundsException: -1
      at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)
      at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)
      at PDFTest.buildPdf(PDFTest.java:118)
      ... 1 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
      at org.apache.fop.text.linebreak.LineBreakUtils.getLineBreakPairProperty(LineBreakUtils.java:668)
      at org.apache.fop.text.linebreak.LineBreakStatus.nextChar(LineBreakStatus.java:117)
      at org.apache.fop.layoutmgr.inline.TextLayoutManager.getNextKnuthElements(TextLayoutManager.java:543)
      at org.apache.fop.layoutmgr.inline.LineLayoutManager.collectInlineKnuthElements(LineLayoutManager.java:658)
      at org.apache.fop.layoutmgr.inline.LineLayoutManager.getNextKnuthElements(LineLayoutManager.java:594)
      at org.apache.fop.layoutmgr.BlockStackingLayoutManager.getNextKnuthElements(BlockStackingLayoutManager.java:294)
      at org.apache.fop.layoutmgr.BlockLayoutManager.getNextKnuthElements(BlockLayoutManager.java:116)
      at org.apache.fop.layoutmgr.table.TableCellLayoutManager.getNextKnuthElements(TableCellLayoutManager.java:170)
      at org.apache.fop.layoutmgr.table.RowGroupLayoutManager.createElementsForRowGroup(RowGroupLayoutManager.java:120)
      at org.apache.fop.layoutmgr.table.RowGroupLayoutManager.getNextKnuthElements(RowGroupLayoutManager.java:60)
      at org.apache.fop.layoutmgr.table.TableContentLayoutManager.getKnuthElementsForRowIterator(TableContentLayoutManager.java:228)
      at org.apache.fop.layoutmgr.table.TableContentLayoutManager.getNextKnuthElements(TableContentLayoutManager.java:172)
      at org.apache.fop.layoutmgr.table.TableLayoutManager.getNextKnuthElements(TableLayoutManager.java:247)
      at org.apache.fop.layoutmgr.BlockStackingLayoutManager.getNextKnuthElements(BlockStackingLayoutManager.java:294)
      at org.apache.fop.layoutmgr.BlockLayoutManager.getNextKnuthElements(BlockLayoutManager.java:116)
      at org.apache.fop.layoutmgr.BlockStackingLayoutManager.getNextKnuthElements(BlockStackingLayoutManager.java:294)
      at org.apache.fop.layoutmgr.BlockLayoutManager.getNextKnuthElements(BlockLayoutManager.java:116)
      at org.apache.fop.layoutmgr.FlowLayoutManager.getNextKnuthElements(FlowLayoutManager.java:107)
      at org.apache.fop.layoutmgr.PageBreaker.getNextKnuthElements(PageBreaker.java:145)
      at org.apache.fop.layoutmgr.AbstractBreaker.getNextBlockList(AbstractBreaker.java:552)
      at org.apache.fop.layoutmgr.PageBreaker.getNextBlockList(PageBreaker.java:137)
      at org.apache.fop.layoutmgr.AbstractBreaker.doLayout(AbstractBreaker.java:302)
      at org.apache.fop.layoutmgr.AbstractBreaker.doLayout(AbstractBreaker.java:264)
      at org.apache.fop.layoutmgr.PageSequenceLayoutManager.activateLayout(PageSequenceLayoutManager.java:106)
      at org.apache.fop.area.AreaTreeHandler.endPageSequence(AreaTreeHandler.java:234)
      at org.apache.fop.fo.pagination.PageSequence.endOfNode(PageSequence.java:123)
      at org.apache.fop.fo.FOTreeBuilder$MainFOHandler.endElement(FOTreeBuilder.java:340)
      at org.apache.fop.fo.FOTreeBuilder.endElement(FOTreeBuilder.java:169)
      at com.sun.org.apache.xml.internal.serializer.ToXMLSAXHandler.endElement(Unknown Source)
      at com.sun.org.apache.xml.internal.serializer.ToXMLSAXHandler.endElement(Unknown Source)
      at GregorSamsa.template$dot$0()
      at GregorSamsa.applyTemplates()
      at GregorSamsa.transform()
      at com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet.transform(Unknown Source)
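
A possible stop-gap on the calling side, sketched here under assumptions and
not taken from FOP: replace every code point that the JDK reports as
unassigned with U+FFFD before the text reaches the transformer. The class and
method names are hypothetical, and the outcome depends on the Unicode version
shipped with the running JDK.

```java
// Hypothetical helper, not part of FOP: sanitizes text before it is handed to the transformer.
final class UnassignedCodePointFilter {

    /** Replaces every code point without a Unicode assignment (such as U+1F7E) by U+FFFD. */
    static String replaceUnassigned(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // Character.isDefined is false for code points the JDK's Unicode tables leave unassigned.
            out.appendCodePoint(Character.isDefined(cp) ? cp : 0xFFFD);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceUnassigned("abc\u1F7Edef");
        // Reports whether the fourth character was replaced by U+FFFD.
        System.out.println(cleaned.codePointAt(3) == 0xFFFD);
    }
}
```

This changes the text content itself, so it may only be acceptable when the
unassigned code points are known to be data corruption.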

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

DO NOT REPLY [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=50471

Andreas L. Delmelle <ad...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #5 from Andreas L. Delmelle <ad...@apache.org> 2011-01-07 16:28:06 EST ---

Fixed in Trunk. See: http://svn.apache.org/viewvc?rev=1056518&view=rev


DO NOT REPLY [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=50471

Dominik Stadler <do...@gmx.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dominik.stadler@gmx.at
             Blocks|                            |49636, 41999


Custom FOP

Posted by Eric Douglas <ed...@blockhouse.com>.
That seems odd, to include information for fonts which are never used.
I'll just ignore that then.  It just seemed like that would be putting a
lot into the jar which will never be needed.  I'm running in webstart,
referencing fop.jar in the jnlp, so anything in the jar has to be copied
to the client.

My next task is print preview.  I'm using custom windows and embedding
org.apache.fop.render.awt.viewer.PreviewPanel.  That works except I have
to create it with a useragent and a renderer which takes forever.  I'm
wondering if I can use a version of this class without a useragent or a
renderer.  If I can pass in a rendered document as an array of pages as
images instead of a renderable object, can I still do the scale
resizing?  Other than the zoom in the preview there's no reason I need a
renderable object here.  I'm keeping a copy of the renderable source on
the server while displaying the preview.  Sending the output to a
printer or generating a PDF can be done on the server.  I'm saving the
PDF to the client but I'm getting the output from FOP as the byte stream
so I just copy bytes to the client and save them.  The renderer,
transformer, etc should never have to exist on the client.
 

-----Original Message-----
From: Jeremias Maerki [mailto:dev@jeremias-maerki.ch] 
Sent: Friday, January 07, 2011 9:35 AM
To: fop-dev@xmlgraphics.apache.org
Subject: Re: [Bug 50471] Greek Extended character throwing
ArrayIndexOutOfBoundException.

On 07.01.2011 15:06:19 Eric Douglas wrote:
> I've been trying to see if I can modify the source to eliminate the 
> fonts that come packaged with it.  I'm not sure why it needs to 
> include Courier, Helvetica, etc.

The PDF specification requires support for the so-called Base 14 fonts.
And so does the PostScript spec. We don't actually include the fonts,
just the font metrics. So this hardly needs any space.

Re: [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
On 07.01.2011 15:06:19 Eric Douglas wrote:
> I've been trying to see if I can modify the source to eliminate the
> fonts that come packaged with it.  I'm not sure why it needs to include
> Courier, Helvetica, etc. 

The PDF specification requires support for the so-called Base 14 fonts.
And so does the PostScript spec. We don't actually include the fonts,
just the font metrics. So this hardly needs any space.

>  I would think they're just a waste of space if
> FOP is designed to use custom fonts or installed fonts.  I pass in
> custom fonts using only Lucida which comes in one file for normal, one
> for bold, one for unicode, and should be a different one for italic
> which I haven't needed yet.
> 
> I'm passing in the files that came with Windows XP in the fonts folder,
> l_10646.ttf for unicode.  For FOP to display a unicode character for the
> 'glyph not found' error rather than one of standard ascii, it should
> come packaged with a unicode font set.  I print the &#x25A1; character
> to my reports and passing in the l_10646.ttf font that works fine.
> 
> 
> -----Original Message-----
> From: Jeremias Maerki [mailto:dev@jeremias-maerki.ch] 
> Sent: Friday, January 07, 2011 8:44 AM
> To: fop-dev@xmlgraphics.apache.org
> Subject: Re: [Bug 50471] Greek Extended character throwing
> ArrayIndexOutOfBoundException.
> 
> I think so. The use of "#" is mostly historical due to lack of Unicode
> support initially. At least I believe so. The first fonts were WinAnsi
> only. IMO, it makes sense to make that transition. However, for
> single-byte fonts, we might still need to use "#". Not sure.
> 
> On 07.01.2011 14:17:42 Simon Pepping wrote:
> > On Fri, Jan 07, 2011 at 07:31:07AM -0500, bugzilla@apache.org wrote:
> > > https://issues.apache.org/bugzilla/show_bug.cgi?id=50471
> > > 
> > > --- Comment #4 from Andreas L. Delmelle <ad...@apache.org> 
> > > 2011-01-07 07:31:03 EST ---
> > > 
> > > Very right indeed. 
> > > So, if no one objects, I will apply the patch as proposed. FOP will 
> > > > no longer crash, but simply show a '#' for such unassigned
> > > > codepoints in the output.
> > > Treating them as regular alphabetic characters seems to be safe 
> > > enough for the time being.
> > 
> > > Would it not be better to use character FFFD, 'Replacement Character',
> > > �, for this?
> > 
> > Simon
> 
> 
> 
> 
> Jeremias Maerki
> 




Jeremias Maerki


Re: [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by Simon Pepping <sp...@leverkruid.eu>.
On Fri, Jan 07, 2011 at 02:44:13PM +0100, Jeremias Maerki wrote:
> I think so. The use of "#" is mostly historical due to lack of Unicode
> support initially. At least I believe so. The first fonts were WinAnsi
> only. IMO, it makes sense to make that transition. However, for
> single-byte fonts, we might still need to use "#". Not sure.

Then it might be better to use '?', which many tools use for that
purpose.

Let us put that on our wish list. It is not part of the fix for this
bug report.

Simon

> On 07.01.2011 14:17:42 Simon Pepping wrote:
> > On Fri, Jan 07, 2011 at 07:31:07AM -0500, bugzilla@apache.org wrote:
> > > https://issues.apache.org/bugzilla/show_bug.cgi?id=50471
> > > 
> > > --- Comment #4 from Andreas L. Delmelle <ad...@apache.org> 2011-01-07 07:31:03 EST ---
> > > 
> > > Very right indeed. 
> > > So, if no one objects, I will apply the patch as proposed. FOP will no longer
> > > crash, but simply show a '#' for such unassigned codepoints in the output.
> > > Treating them as regular alphabetic characters seems to be safe enough for the
> > > time being.
> > 
> > Would it not be better to use character FFFD, 'Replacement Character',
> > > �, for this?

RE: [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by Eric Douglas <ed...@blockhouse.com>.
I've been trying to see if I can modify the source to eliminate the
fonts that come packaged with it.  I'm not sure why it needs to include
Courier, Helvetica, etc.  I would think they're just a waste of space if
FOP is designed to use custom fonts or installed fonts.  I pass in
custom fonts using only Lucida which comes in one file for normal, one
for bold, one for unicode, and should be a different one for italic
which I haven't needed yet.

I'm passing in the files that came with Windows XP in the fonts folder,
l_10646.ttf for unicode.  For FOP to display a unicode character for the
'glyph not found' error rather than one of standard ascii, it should
come packaged with a unicode font set.  I print the &#x25A1; character
to my reports and passing in the l_10646.ttf font that works fine.


-----Original Message-----
From: Jeremias Maerki [mailto:dev@jeremias-maerki.ch] 
Sent: Friday, January 07, 2011 8:44 AM
To: fop-dev@xmlgraphics.apache.org
Subject: Re: [Bug 50471] Greek Extended character throwing
ArrayIndexOutOfBoundException.

I think so. The use of "#" is mostly historical due to lack of Unicode
support initially. At least I believe so. The first fonts were WinAnsi
only. IMO, it makes sense to make that transition. However, for
single-byte fonts, we might still need to use "#". Not sure.

On 07.01.2011 14:17:42 Simon Pepping wrote:
> On Fri, Jan 07, 2011 at 07:31:07AM -0500, bugzilla@apache.org wrote:
> > https://issues.apache.org/bugzilla/show_bug.cgi?id=50471
> > 
> > --- Comment #4 from Andreas L. Delmelle <ad...@apache.org> 
> > 2011-01-07 07:31:03 EST ---
> > 
> > Very right indeed. 
> > So, if no one objects, I will apply the patch as proposed. FOP will 
> > > no longer crash, but simply show a '#' for such unassigned
> > > codepoints in the output.
> > Treating them as regular alphabetic characters seems to be safe 
> > enough for the time being.
> 
> > Would it not be better to use character FFFD, 'Replacement Character',
> > �, for this?
> 
> Simon




Jeremias Maerki


Re: [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
I think so. The use of "#" is mostly historical due to lack of Unicode
support initially. At least I believe so. The first fonts were WinAnsi
only. IMO, it makes sense to make that transition. However, for
single-byte fonts, we might still need to use "#". Not sure.

On 07.01.2011 14:17:42 Simon Pepping wrote:
> On Fri, Jan 07, 2011 at 07:31:07AM -0500, bugzilla@apache.org wrote:
> > https://issues.apache.org/bugzilla/show_bug.cgi?id=50471
> > 
> > --- Comment #4 from Andreas L. Delmelle <ad...@apache.org> 2011-01-07 07:31:03 EST ---
> > 
> > Very right indeed. 
> > So, if no one objects, I will apply the patch as proposed. FOP will no longer
> > crash, but simply show a '#' for such unassigned codepoints in the output.
> > Treating them as regular alphabetic characters seems to be safe enough for the
> > time being.
> 
> Would it not be better to use character FFFD, 'Replacement Character',
> > �, for this?
> 
> Simon




Jeremias Maerki


Re: [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by Andreas Delmelle <an...@telenet.be>.
On 07 Jan 2011, at 14:58, Simon Pepping wrote:

<snip />
> I had not yet thought so far. I reflected on the use of '#' as the
> replacement character for missing glyphs. Is that not particular to
> FOP, and should we not conform to Unicode and use the Unicode
> replacement character in such situations?

OK, I see now. That's indeed FOP legacy, and it seems preferable to show U+FFFD in such cases, if possible.

As for the effort involved, it could turn out to be fairly straightforward to change this behavior.
The # replacement char is hardcoded as org.apache.fop.fonts.Typeface.NOT_FOUND, which is used in only three places (the mapChar() implementations of Font, SingleByteFont and MultiByteFont).
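
A minimal sketch of what such a change could look like, assuming a font whose
coverage is given as a plain string. The class below is hypothetical and does
not reproduce FOP's Typeface, Font, SingleByteFont or MultiByteFont classes; it
only illustrates preferring U+FFFD and falling back to '#' when the font (for
example a single-byte one) has no glyph for U+FFFD either.

```java
// Hypothetical stand-in for a font's character mapping; not FOP's real API.
final class ReplacementGlyphSketch {
    static final char LEGACY_NOT_FOUND = '#';        // the '#' placeholder described in the thread
    static final char UNICODE_REPLACEMENT = '\uFFFD';

    private final String coverage;                   // characters this toy font has glyphs for

    private ReplacementGlyphSketch(String coverage) {
        this.coverage = coverage;
    }

    static ReplacementGlyphSketch withCoverage(String coverage) {
        return new ReplacementGlyphSketch(coverage);
    }

    private boolean hasGlyph(char c) {
        return coverage.indexOf(c) >= 0;
    }

    /** Maps a character to itself, or to a placeholder when the font has no glyph for it. */
    char mapChar(char c) {
        if (hasGlyph(c)) {
            return c;
        }
        // Prefer the Unicode replacement character, but only if the font can show it.
        return hasGlyph(UNICODE_REPLACEMENT) ? UNICODE_REPLACEMENT : LEGACY_NOT_FOUND;
    }
}
```

For example, ReplacementGlyphSketch.withCoverage("abc\uFFFD").mapChar('x')
yields U+FFFD, while withCoverage("abc").mapChar('x') still yields '#'.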

<snip />
> ...PDF/U1F00.pdf,
> shows no character assignment for this code point.) That means that it
> does not even have properties, such as a linebreaking class. Using
> class 'Ambiguous' seems the right solution for that problem.

OK, I will make sure this is reflected in the code-comment.


Regards,

Andreas
---


Re: [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by Simon Pepping <sp...@leverkruid.eu>.
On Fri, Jan 07, 2011 at 02:38:49PM +0100, Andreas Delmelle wrote:
> On 07 Jan 2011, at 14:17, Simon Pepping wrote:
> 
> Hi Simon,
> 
> > On Fri, Jan 07, 2011 at 07:31:07AM -0500, bugzilla@apache.org wrote:
> >> So, if no one objects, I will apply the patch as proposed. FOP will no longer
> >> crash, but simply show a '#' for such unassigned codepoints in the output.
> >> Treating them as regular alphabetic characters seems to be safe enough for the
> >> time being.
> > 
> > Would it not be better to use character FFFD, 'Replacement Character',
> > �, for this?
> 
> Interesting. In the context of linebreaking, that comes down to basically the same thing.
> 
> U+FFFD has linebreak class 'AI' or 'Ambiguous', which is currently also converted to 'Alphabetic' as part of the initial conversions.
> 
> Are you suggesting that we substitute the codepoint in the actual text content (rather than leave it there, and further rely on the default treatment of 'missing glyphs')?

I had not yet thought so far. I reflected on the use of '#' as the
replacement character for missing glyphs. Is that not particular to
FOP, and should we not conform to Unicode and use the Unicode
replacement character in such situations?

Really replacing the character in the text would go very far. A
missing glyph is usually dependent on the chosen font, while the
character itself is quite valid. In this case, however, the character
itself is invalid, in the sense that the code point has not been
assigned to a character in Unicode. (The bug report calls 1F7E a Greek
extended character, but the Unicode chart for Greek
extended characters, http://www.unicode.org/charts/PDF/U1F00.pdf,
shows no character assignment for this code point.) That means that it
does not even have properties, such as a linebreaking class. Using
class 'Ambiguous' seems the right solution for that problem.
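
The distinction drawn above can be sketched in Java as follows. The class,
enum and parameter names are hypothetical, the font lookup is passed in as a
predicate because no real font API is assumed, and the answer for a given code
point depends on the Unicode version of the running JDK.

```java
import java.util.function.IntPredicate;

// Hypothetical diagnostic, not FOP code: tells an unassigned code point from a missing glyph.
final class GlyphDiagnosis {
    enum Result { OK, MISSING_GLYPH, UNASSIGNED_CODE_POINT }

    static Result diagnose(int codePoint, IntPredicate fontHasGlyph) {
        // The code point itself is invalid in the sense used above: Unicode assigns no character to it.
        if (Character.getType(codePoint) == Character.UNASSIGNED) {
            return Result.UNASSIGNED_CODE_POINT;
        }
        // The character is valid; whether it can be shown depends on the chosen font.
        return fontHasGlyph.test(codePoint) ? Result.OK : Result.MISSING_GLYPH;
    }
}
```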

Simon

Re: [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by Andreas Delmelle <an...@telenet.be>.
On 07 Jan 2011, at 14:17, Simon Pepping wrote:

Hi Simon,

> On Fri, Jan 07, 2011 at 07:31:07AM -0500, bugzilla@apache.org wrote:
>> So, if no one objects, I will apply the patch as proposed. FOP will no longer
>> crash, but simply show a '#' for such unassigned codepoints in the output.
>> Treating them as regular alphabetic characters seems to be safe enough for the
>> time being.
> 
> Would it not be better to use character FFFD, 'Replacement Character',
> �, for this?

Interesting. In the context of linebreaking, that comes down to basically the same thing.

U+FFFD has linebreak class 'AI' or 'Ambiguous', which is currently also converted to 'Alphabetic' as part of the initial conversions.

Are you suggesting that we substitute the codepoint in the actual text content (rather than leave it there, and further rely on the default treatment of 'missing glyphs')?


Regards,

Andreas
---


Re: [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by Simon Pepping <sp...@leverkruid.eu>.
On Fri, Jan 07, 2011 at 07:31:07AM -0500, bugzilla@apache.org wrote:
> https://issues.apache.org/bugzilla/show_bug.cgi?id=50471
> 
> --- Comment #4 from Andreas L. Delmelle <ad...@apache.org> 2011-01-07 07:31:03 EST ---
> 
> Very right indeed. 
> So, if no one objects, I will apply the patch as proposed. FOP will no longer
> crash, but simply show a '#' for such unassigned codepoints in the output.
> Treating them as regular alphabetic characters seems to be safe enough for the
> time being.

Would it not be better to use character FFFD, 'Replacement Character',
�, for this?

Simon

DO NOT REPLY [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=50471

--- Comment #4 from Andreas L. Delmelle <ad...@apache.org> 2011-01-07 07:31:03 EST ---
(In reply to comment #3)
> At least there should be some configuration available to the end user to tell
> FOP to use some default line break in such special cases; the behaviour then
> becomes specific to the customer who is using FOP. The entire PDF generation
> should not be put at stake just because of one special character, should it?
> Giving the customer a set of options to choose from to get out of this
> situation would be better than crashing.

Very right indeed. 
So, if no one objects, I will apply the patch as proposed. FOP will no longer
crash, but simply show a '#' for such unassigned codepoints in the output.
Treating them as regular alphabetic characters seems to be safe enough for the
time being.
Customization of and/or more refined configuration possibilities for the
Unicode line-breaking algorithm is something that is still on the wish-list for
the longer term.


DO NOT REPLY [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=50471

--- Comment #3 from tvsudhir@rediffmail.com 2011-01-06 23:36:48 EST ---
Andreas,

Thanks a lot for your response. 

Actually we came across some special characters which are not intended to be
present in our database. We can figure out the reasons for this corruption and
correct it, but I still expect FOP to display whatever content is available.
Whatever the character may be, as long as it is a valid code point (even if it
is reserved and has no representation of its own), I do not expect FOP to
crash because of it.

At least there should be some configuration available to the end user to tell
FOP to use some default line break in such special cases; the behaviour then
becomes specific to the customer who is using FOP. The entire PDF generation
should not be put at stake just because of one special character, should it?
Giving the customer a set of options to choose from to get out of this
situation would be better than crashing.

Frankly speaking, we had lost hope of getting a response on this issue from
Apache. We searched for this problem on Google and saw many other people
complaining about a similar issue (i.e. getting an ArrayIndexOutOfBoundsException).
I believe they might also have some reserved character in their text. We
at least nailed down the cause of the problem. A proper resolution of this
issue would be of great help, not only to me but to many others.

Thanks again for looking into it and discussing it on the forum.


DO NOT REPLY [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=50471

Glenn Adams <gl...@skynav.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED

--- Comment #6 from Glenn Adams <gl...@skynav.com> 2012-04-01 06:18:03 UTC ---
batch transition to closed; if someone wishes to restore one of these to
resolved in order to perform a verification step, then feel free to do so


DO NOT REPLY [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=50471

--- Comment #2 from Chris Bowditch <bo...@hotmail.com> 2011-01-06 06:48:25 EST ---
Indeed you raise a very good point Andreas. Even if you make the code change, I
would expect # to appear in the output, because no font is likely to have a
glyph for a reserved code point. So I am also interested to hear the business
reason for using such a code point.


DO NOT REPLY [Bug 50471] Greek Extended character throwing ArrayIndexOutOfBoundException.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=50471

--- Comment #1 from Andreas L. Delmelle <ad...@apache.org> 2011-01-05 13:31:26 EST ---

Thanks for reporting, and apologies for the late reply...

At first glance, this seems like a minor oversight in the implementation of
Unicode linebreaking in FOP. This does not take into account the possibility
that a given codepoint is not assigned a 'class' in linebreaking context. (=
U+1F7E does not appear in the file
http://www.unicode.org/Public/UNIDATA/LineBreak.txt, which is used as a basis
to generate those arrays in LineBreakUtils.java)

On the other hand, one could obviously raise the question why you so
desperately need to have an unassigned codepoint in your output. Are you
absolutely sure you need this? If yes, then can you elaborate on the exact
reason? (i.e. What exactly is this unassigned codepoint used for?)

The most straightforward 'fix' seems to be roughly as follows:

Index: src/java/org/apache/fop/text/linebreak/LineBreakStatus.java
===================================================================
--- src/java/org/apache/fop/text/linebreak/LineBreakStatus.java    (revision 1054383)
+++ src/java/org/apache/fop/text/linebreak/LineBreakStatus.java    (working copy)
@@ -87,6 +87,7 @@

         /* Initial conversions */
         switch (currentClass) {
+            case 0: // Unassigned codepoint: consider as AL?
             case LineBreakUtils.LINE_BREAK_PROPERTY_AI:
             case LineBreakUtils.LINE_BREAK_PROPERTY_SG:
             case LineBreakUtils.LINE_BREAK_PROPERTY_XX:

What this does, is assign the class 'AL' or 'Alphabetic' to any codepoint that
has not been assigned a class by Unicode. This means it will be treated as a
regular letter.
Now, the reason why I am asking the question whether you are sure you know what
you're doing, is that this may turn out to be undesirable. Perhaps the
character in question needs to be treated as a space rather than a letter.
Unicode does not define U+1F7E other than as a 'reserved' character, so it
makes sense that Unicode cannot say what should happen with this character in
the context of linebreaking...

That said, it is also wrong of FOP to crash in this case, so the bug is
definitely genuine.
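
To illustrate the mechanism, here is a self-contained toy model, not FOP's
actual code. It assumes that class 0 means "no class assigned" and that the
pair table is indexed by class minus one, which would be consistent with the
reported ArrayIndexOutOfBoundsException: -1 but is an inference. The class
numbering and the table contents are made up for the example.

```java
// Toy model of the initial conversion; constants and table are hypothetical, not FOP's values.
final class LineBreakClassSketch {
    static final int UNASSIGNED = 0;   // assumed marker for code points absent from LineBreak.txt
    static final int AL = 1;
    static final int AI = 2;
    static final int SG = 3;
    static final int XX = 4;
    static final int SP = 5;

    // Toy pair table indexed by (class - 1); true means a break is allowed between the pair.
    static final boolean[][] BREAK_ALLOWED = new boolean[5][5];
    static {
        BREAK_ALLOWED[SP - 1][AL - 1] = true;   // simplified: break allowed after a space
    }

    /** Initial conversion: classes without pair-table behaviour of their own are folded into AL. */
    static int resolve(int lineBreakClass) {
        switch (lineBreakClass) {
            case UNASSIGNED:   // the case added by the patch
            case AI:
            case SG:
            case XX:
                return AL;
            default:
                return lineBreakClass;
        }
    }

    /** Lookup without the conversion: class 0 produces index -1 and throws. */
    static boolean rawLookup(int before, int after) {
        return BREAK_ALLOWED[before - 1][after - 1];
    }

    /** Lookup with the conversion applied first: class 0 is handled like a letter. */
    static boolean breakAllowed(int before, int after) {
        return BREAK_ALLOWED[resolve(before) - 1][resolve(after) - 1];
    }
}
```

In this model rawLookup(UNASSIGNED, AL) throws an
ArrayIndexOutOfBoundsException, while breakAllowed(UNASSIGNED, AL) returns
normally and treats the unassigned code point as a regular letter.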
