Posted to fop-dev@xmlgraphics.apache.org by "Nicholas Moser (Jira)" <ji...@apache.org> on 2022/07/22 15:46:00 UTC

[jira] [Commented] (FOP-2963) Add Option for Safer Hyphenation

    [ https://issues.apache.org/jira/browse/FOP-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570093#comment-17570093 ] 

Nicholas Moser commented on FOP-2963:
-------------------------------------

Just a heads up for anyone interested in using this patch: it does increase memory usage, since many more Knuth nodes are created. To help with this, I created an additional patch that reduces the number of object allocations. Specifically, I found many calls to log.debug(...) whose argument is built with String concatenation, which creates a StringBuilder object on every call, even when debug logging is disabled. This can be avoided by first checking whether debug logging is enabled. For example:
{code:java}
-                log.debug("PLM> break - " + getBreakClassName(breakPenalty.getBreakClass()));
+                if (log.isDebugEnabled()) {
+                    log.debug("PLM> break - " + getBreakClassName(breakPenalty.getBreakClass()));
+                }
{code}
I've attached a patch fixing this to this JIRA: [^perf_improvements.patch]

These debug logs are also a problem in the mainline branch of FOP, but they are even more of a problem after taking the patch from this Jira since there are more Knuth nodes and many of these log.debug(...) calls occur in a hot loop over the Knuth nodes.
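
To make the cost concrete, here is a minimal, self-contained sketch of the same guard pattern. It is not FOP code: it uses java.util.logging (FINE level) as a stand-in for the logging API FOP uses, and the class name, the describe(...) helper, and the loop are hypothetical illustrations. It counts how often the message-building helper runs in a loop while debug output is disabled.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration only; none of these names come from FOP.
public class GuardedDebugSketch {
    private static final Logger LOG = Logger.getLogger("sketch");

    static {
        // Debug-level (FINE) output is disabled for this logger.
        LOG.setLevel(Level.INFO);
    }

    private static int describeCalls = 0;

    // Stand-in for a helper such as getBreakClassName(...): counts its calls.
    private static String describe(int breakClass) {
        describeCalls++;
        return "class-" + breakClass;
    }

    // The message is built on every iteration, then discarded by the logger.
    public static int runUnguarded(int nodes) {
        describeCalls = 0;
        for (int i = 0; i < nodes; i++) {
            LOG.fine("PLM> break - " + describe(i));
        }
        return describeCalls;
    }

    // The message is only built when debug output is actually enabled.
    public static int runGuarded(int nodes) {
        describeCalls = 0;
        for (int i = 0; i < nodes; i++) {
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("PLM> break - " + describe(i));
            }
        }
        return describeCalls;
    }

    public static void main(String[] args) {
        System.out.println("unguarded helper calls: " + runUnguarded(1000)); // 1000
        System.out.println("guarded helper calls:   " + runGuarded(1000));   // 0
    }
}
```

With debug output disabled, the unguarded loop still does the concatenation work for every node, while the guarded loop skips it entirely.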

I've also attached a .fo file I used to create the original PDFs on this JIRA: [^example.fo]

> Add Option for Safer Hyphenation
> --------------------------------
>
>                 Key: FOP-2963
>                 URL: https://issues.apache.org/jira/browse/FOP-2963
>             Project: FOP
>          Issue Type: Improvement
>            Reporter: Nicholas Moser
>            Priority: Major
>         Attachments: example-after-disabled.pdf, example-after-enabled.pdf, example-before.pdf, example.fo, patch.diff, perf_improvements.patch
>
>
> This is a new proposed setting for FOP I have decided to call *safer hyphenation*.
> Currently, FOP may generate PDFs where text can overlap or go off the page. The most common scenarios I've seen this occur are:
>  # A very small amount of space is allocated for text, such as the cell of a table. Even if there are valid hyphenation points for words, a sufficiently large word may exit the cell as there aren't enough hyphenation points in it.
>  # A string of characters such as numbers will exit the space allocated for them even if there is plenty of room to line break. This is because hyphenation patterns do not set line breaks for strings of numbers, therefore it sees no valid hyphenation points.
> Examples of these issues can be seen in the attached PDF *example-before.pdf*. The third row of the first table has a really long word with many hyphenation points. Despite this, it exits the cell twice because there are not enough hyphenation points. Additionally, the rows below it contain a long series of numbers that have no hyphenation points and go off the page.
> My proposed fix for this involves a new configuration setting called *safer hyphenation*. It effectively does three things.
>  # Places hyphenation points between every character in a string buffer, ignoring hyphenation patterns.
>  # Moves hyphenation from the second pass to the third pass of findOptimalBreakingPoints(...).
>  # Massively increases the penalty for hyphenation.
> The first change is fairly simple. A hyphenation can occur anywhere in any word in the document. This effectively fixes both of the problems, since now they will line break before they exit their allocated space. The issue is that now, the line breaking algorithm will attempt to use these new hyphenation points even when not necessary. This will result in many ugly hyphenations. Since hyphenation patterns are no longer used, I argue that the best way to handle this is to avoid hyphenation now unless it is absolutely necessary.
> The second and third changes attempt to avoid hyphenation unless it is absolutely necessary. The second change only allows hyphenation during the third pass of the optimal breaking point search, after the max adjustment has been changed to 20. The third change massively increases the penalty for using a hyphenation. As a result, the algorithm avoids hyphenation unless there are no other options.
> Since this is a new configuration setting, I've included two additional PDFs, *example-after-disabled.pdf* and *example-after-enabled.pdf*. The first PDF shows that when the configuration is off, the changes are entirely passive and cause no difference. The second PDF shows the improvements of using safer hyphenation. It also shows the downside, in that old hyphenation (with hyphenation patterns) can no longer be used to improve the layout of a paragraph.
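
As a rough illustration of the first change in the quoted description (a hyphenation point between every character), here is a small, self-contained sketch. It is not the patch and does not use FOP's classes: the class and method names are hypothetical, and a simple greedy split stands in for the Knuth-based line breaker, so the hyphenation penalty and the pass ordering are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not FOP's hyphenation or line-breaking API.
public class SaferHyphenationSketch {

    // Candidate hyphenation points: every index between two characters,
    // ignoring hyphenation patterns entirely.
    public static List<Integer> everyCharacterPoints(String word) {
        List<Integer> points = new ArrayList<>();
        for (int i = 1; i < word.length(); i++) {
            points.add(i);
        }
        return points;
    }

    // Greedy split: each line holds at most maxChars characters, counting
    // the trailing hyphen on lines that were broken.
    public static List<String> breakToWidth(String word, int maxChars) {
        if (maxChars < 2) {
            throw new IllegalArgumentException("need room for one character plus a hyphen");
        }
        List<String> lines = new ArrayList<>();
        int start = 0;
        while (word.length() - start > maxChars) {
            int end = start + maxChars - 1; // leave one column for the hyphen
            lines.add(word.substring(start, end) + "-");
            start = end;
        }
        lines.add(word.substring(start));
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(everyCharacterPoints("abcd"));      // [1, 2, 3]
        System.out.println(breakToWidth("1234567890", 4));     // [123-, 456-, 7890]
    }
}
```

Because a break is available between any two characters, a long run of digits can always be split to fit the available width, which is the behaviour the description is after for table cells and number strings.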


