Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2015/06/17 08:22:00 UTC
[jira] [Commented] (PDFBOX-2831) ArrayIndexOutOfBoundsException in
mergeDiacritic() on extraction of text with diacritic text
[ https://issues.apache.org/jira/browse/PDFBOX-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589363#comment-14589363 ]
Tilman Hausherr commented on PDFBOX-2831:
-----------------------------------------
Do you have a file that reproduces the problem?
> ArrayIndexOutOfBoundsException in mergeDiacritic() on extraction of text with diacritic text
> --------------------------------------------------------------------------------------------
>
> Key: PDFBOX-2831
> URL: https://issues.apache.org/jira/browse/PDFBOX-2831
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Andreas Meier
> Priority: Minor
>
> PDFBox may fail on extraction of text in method mergeDiacritic(TextPosition diacritic):
> {code}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
> at org.apache.pdfbox.text.TextPosition.mergeDiacritic(TextPosition.java:532)
> at org.apache.pdfbox.text.PDFTextStripper.processTextPosition(PDFTextStripper.java:945)
> at org.apache.pdfbox.text.PDFTextStreamEngine.showGlyph(PDFTextStreamEngine.java:229)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:683)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:593)
> at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:795)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:462)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
> at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:369)
> at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:305)
> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:249)
> at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:210)
> ... 8 more
> {code}
> The exception is thrown because the variable "unicode" contains two diacritic signs (for example, Arabic shadda U+0651 and Arabic fatha U+064E, giving a unicode length of 2), while the array "widths" contains only one entry at that point ([x.xxxxxxxx]).
> A temporary workaround could be to check the size of the array
> (this does not address the actual problem, namely that the unicode and widths variables drift apart):
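The mismatch described above can be illustrated without PDFBox. The following is a minimal sketch, not the library's code: the class WidthsMismatchDemo and the helper firstIndexWithoutWidth are hypothetical names, and the width value 1.0f is a placeholder rather than a value from the report. It assumes only what the report states, namely that the string holds two chars while the array holds one entry.

```java
class WidthsMismatchDemo {

    /**
     * Returns the first index of unicode that has no matching widths entry,
     * or -1 if every character has one. Hypothetical helper for illustration.
     */
    static int firstIndexWithoutWidth(String unicode, float[] widths) {
        return unicode.length() > widths.length ? widths.length : -1;
    }

    public static void main(String[] args) {
        // Shadda (U+0651) followed by fatha (U+064E): two chars, as in the report.
        String unicode = "\u0651\u064E";
        // Only one width is present; the value itself is a placeholder.
        float[] widths = { 1.0f };

        System.out.println("unicode.length() = " + unicode.length());
        System.out.println("widths.length    = " + widths.length);
        System.out.println("first index without width = "
                + firstIndexWithoutWidth(unicode, widths));

        try {
            float total = 0;
            // Iterating up to unicode.length() reads past the end of widths.
            for (int i = 0; i < unicode.length(); i++) {
                total += widths[i];
            }
            System.out.println("total = " + total);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("caught: " + e);
        }
    }
}
```

Any loop that indexes widths by a position derived from unicode.length() has this shape, which is why the failure surfaces in mergeDiacritic().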
> {code}
> /**
>  * Merge a single character TextPosition into the current object. This is to be used only for
>  * cases where we have a diacritic that overlaps an existing TextPosition. In a graphical
>  * display, we could overlay them, but for text extraction we need to merge them. Use the
>  * contains() method to test if two objects overlap.
>  *
>  * @param diacritic TextPosition to merge into the current TextPosition.
>  */
> public void mergeDiacritic(TextPosition diacritic)
> {
>     if (diacritic.getUnicode().length() > 1)
>     {
>         return;
>     }
>     float diacXStart = diacritic.getXDirAdj();
>     float diacXEnd = diacXStart + diacritic.widths[0];
>     float currCharXStart = getXDirAdj();
>     int strLen = unicode.length();
>     boolean wasAdded = false;
>     for (int i = 0; i < strLen && !wasAdded; i++)
>     {
>         // workaround: only access widths[i] if the entry exists
>         if (i <= (widths.length - 1))
>         {
>             float currCharXEnd = currCharXStart + widths[i];
>             // this is the case where there is an overlap of the diacritic character with the
>             // current character and the previous character. If no previous character, just append
>             // the diacritic after the current one
>             if (diacXStart < currCharXStart && diacXEnd <= currCharXEnd)
>             {
>                 if (i == 0)
>                 {
>                     insertDiacritic(i, diacritic);
>                 }
>                 else
>                 {
>                     float distanceOverlapping1 = diacXEnd - currCharXStart;
>                     float percentage1 = distanceOverlapping1 / widths[i];
>                     float distanceOverlapping2 = currCharXStart - diacXStart;
>                     float percentage2 = distanceOverlapping2 / widths[i - 1];
>                     if (percentage1 >= percentage2)
>                     {
>                         insertDiacritic(i, diacritic);
>                     }
>                     else
>                     {
>                         insertDiacritic(i - 1, diacritic);
>                     }
>                 }
>                 wasAdded = true;
>             }
>             // diacritic completely covers this character and therefore we assume that this is the
>             // character the diacritic belongs to
>             else if (diacXStart < currCharXStart && diacXEnd > currCharXEnd)
>             {
>                 insertDiacritic(i, diacritic);
>                 wasAdded = true;
>             }
>             // otherwise, the diacritic modifies this character because it is completely
>             // contained by the character width
>             else if (diacXStart >= currCharXStart && diacXEnd <= currCharXEnd)
>             {
>                 insertDiacritic(i, diacritic);
>                 wasAdded = true;
>             }
>             // last character in the TextPosition so we add diacritic to the end
>             else if (diacXStart >= currCharXStart && diacXEnd > currCharXEnd && i == strLen - 1)
>             {
>                 insertDiacritic(i, diacritic);
>                 wasAdded = true;
>             }
>             // couldn't find anything useful so we go to the next character in the TextPosition
>             currCharXStart += widths[i];
>         }
>         else
>         {
>             // problem: unicode length and widths size differ
>         }
>     }
> }
> {code}
> So far, this problem has only occurred with Arabic text. Since there is no evidence that it is limited to Arabic text, I did not attach it to another issue. Further investigation is needed.
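An alternative way to write the same kind of guard is to bound the loop once by the shorter of the two lengths instead of testing the index on every iteration. The sketch below is an assumption-laden illustration, not the PDFBox implementation: BoundedWidthLoopDemo, safeLimit and findContainingChar are hypothetical names, the overlap logic is reduced to the "completely contained" case only, and all coordinates are made-up values. Like the workaround quoted above, it only avoids the exception and does not explain why unicode and widths drift apart.

```java
class BoundedWidthLoopDemo {

    /** Number of characters that can safely be paired with a widths entry. */
    static int safeLimit(String unicode, float[] widths) {
        return Math.min(unicode.length(), widths.length);
    }

    /**
     * Returns the index of the first character whose horizontal span fully
     * contains the diacritic, or -1 if none does. Simplified stand-in for the
     * overlap tests in mergeDiacritic(); only the "contained" case is modelled.
     */
    static int findContainingChar(String unicode, float[] widths, float xStart,
                                  float diacXStart, float diacXEnd) {
        float currCharXStart = xStart;
        int limit = safeLimit(unicode, widths);
        for (int i = 0; i < limit; i++) {
            float currCharXEnd = currCharXStart + widths[i];
            if (diacXStart >= currCharXStart && diacXEnd <= currCharXEnd) {
                return i;
            }
            // Advance to the start of the next character.
            currCharXStart = currCharXEnd;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Two chars but a single width entry, mirroring the reported state.
        String unicode = "\u0651\u064E";
        float[] widths = { 5.0f };
        // The loop stops after one iteration instead of reading widths[1].
        System.out.println(findContainingChar(unicode, widths, 0f, 6f, 8f));
    }
}
```

Bounding the loop keeps the body free of the extra nesting level, at the cost of silently ignoring characters that have no width entry.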