Posted to dev@pdfbox.apache.org by "Andreas Meier (JIRA)" <ji...@apache.org> on 2015/06/17 07:04:00 UTC

[jira] [Updated] (PDFBOX-2831) ArrayIndexOutOfBoundsException in mergeDiacritic() on extraction of text with diacritic text

     [ https://issues.apache.org/jira/browse/PDFBOX-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Meier updated PDFBOX-2831:
----------------------------------
    Description: 
PDFBox may fail during text extraction in the method mergeDiacritic(TextPosition diacritic):

{code}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
  	at org.apache.pdfbox.text.TextPosition.mergeDiacritic(TextPosition.java:532)
  	at org.apache.pdfbox.text.PDFTextStripper.processTextPosition(PDFTextStripper.java:945)
  	at org.apache.pdfbox.text.PDFTextStreamEngine.showGlyph(PDFTextStreamEngine.java:229)
  	at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:683)
  	at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:593)
  	at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
  	at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:795)
  	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:462)
  	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
  	at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
  	at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
  	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:369)
  	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:305)
  	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:249)
  	at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:210)
  	... 8 more
{code}

The exception is thrown because the variable "unicode" contains two diacritic signs (for example, Arabic shaddah U+0651 and Arabic fathah U+064E, unicode length = 2), while the array "widths" contains only one entry at that time ([x.xxxxxxxx]).
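
The mismatch can be illustrated outside of PDFBox. The following is only a minimal sketch: the class WidthsMismatchDemo and the method sumWidths are hypothetical stand-ins for the loop in mergeDiacritic(), and the width value is made up. It shows that indexing a one-entry array by the positions of a two-character string fails at index 1.

```java
class WidthsMismatchDemo {
    // Hypothetical stand-in for the loop in mergeDiacritic(): walks every
    // char of the text and reads the width stored at the same index.
    static float sumWidths(String unicode, float[] widths) {
        float x = 0f;
        for (int i = 0; i < unicode.length(); i++) {
            x += widths[i]; // throws once i reaches widths.length
        }
        return x;
    }

    public static void main(String[] args) {
        String unicode = "\u0651\u064E"; // shaddah + fathah, length 2
        float[] widths = {4.5f};         // a single entry; the value is made up
        try {
            sumWidths(unicode, widths);
        } catch (ArrayIndexOutOfBoundsException e) {
            // The message text depends on the Java version.
            System.out.println("index out of bounds: " + e.getMessage());
        }
    }
}
```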


A temporary workaround could be to check the size of the array before indexing it.
(This does not address the actual problem, which is that the "unicode" and "widths" variables drift apart.)

{code}
    /**
     * Merge a single character TextPosition into the current object. This is to be used only for
     * cases where we have a diacritic that overlaps an existing TextPosition. In a graphical
     * display, we could overlay them, but for text extraction we need to merge them. Use the
     * contains() method to test if two objects overlap.
     *
     * @param diacritic TextPosition to merge into the current TextPosition.
     */
    public void mergeDiacritic(TextPosition diacritic)
    {
        if (diacritic.getUnicode().length() > 1)
        {
            return;
        }

        float diacXStart = diacritic.getXDirAdj();
        float diacXEnd = diacXStart + diacritic.widths[0];

        float currCharXStart = getXDirAdj();

        int strLen = unicode.length();
        boolean wasAdded = false;

        for (int i = 0; i < strLen && !wasAdded; i++)
        {

            // workaround: only read widths[i] while an entry exists for this character
            if (i < widths.length)
            {

                float currCharXEnd = currCharXStart + widths[i];

                // this is the case where there is an overlap of the diacritic character with the
                // current character and the previous character. If no previous character, just append
                // the diacritic after the current one
                if (diacXStart < currCharXStart && diacXEnd <= currCharXEnd)
                {
                    if (i == 0)
                    {
                        insertDiacritic(i, diacritic);
                    }
                    else
                    {
                        float distanceOverlapping1 = diacXEnd - currCharXStart;
                        float percentage1 = distanceOverlapping1/widths[i];

                        float distanceOverlapping2 = currCharXStart - diacXStart;
                        float percentage2 = distanceOverlapping2/widths[i - 1];

                        if (percentage1 >= percentage2)
                        {
                            insertDiacritic(i, diacritic);
                        }
                        else
                        {
                            insertDiacritic(i - 1, diacritic);
                        }
                    }
                    wasAdded = true;
                }
                // diacritic completely covers this character and therefore we assume that this is the
                // character the diacritic belongs to
                else if (diacXStart < currCharXStart && diacXEnd > currCharXEnd)
                {
                    insertDiacritic(i, diacritic);
                    wasAdded = true;
                }
                // otherwise, the diacritic modifies this character because it is completely
                // contained by the character width
                else if (diacXStart >= currCharXStart && diacXEnd <= currCharXEnd)
                {
                    insertDiacritic(i, diacritic);
                    wasAdded = true;
                }
                // last character in the TextPosition so we add diacritic to the end
                else if (diacXStart >= currCharXStart && diacXEnd > currCharXEnd && i == strLen - 1)
                {
                    insertDiacritic(i, diacritic);
                    wasAdded = true;
                }

                // couldn't find anything useful so we go to the next character in the TextPosition
                currCharXStart += widths[i];

            } else {
                // problem: unicode length and widths size differ
            }
        }
    }
{code}
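
For the underlying drift between "unicode" and "widths", one possible direction is to grow both structures together whenever a diacritic is merged. The sketch below is an assumption for illustration, not the PDFBox implementation: GlyphRun and insertMarkAfter are hypothetical names, and giving the inserted mark a width of zero is a design choice made here, not something stated in this issue.

```java
import java.util.Arrays;

// Hypothetical holder, not the PDFBox TextPosition class. It only models
// the invariant text.length() == widths.length.
class GlyphRun {
    private String text;
    private float[] widths;

    GlyphRun(String text, float[] widths) {
        if (text.length() != widths.length) {
            throw new IllegalArgumentException("text and widths differ in length");
        }
        this.text = text;
        this.widths = widths.clone();
    }

    // Inserts a combining mark after the char at index i and gives it a
    // zero width, so both structures grow by the same amount.
    void insertMarkAfter(int i, String mark) {
        text = text.substring(0, i + 1) + mark + text.substring(i + 1);
        float[] grown = new float[widths.length + mark.length()];
        System.arraycopy(widths, 0, grown, 0, i + 1);
        // the entries for the mark keep the array default of 0f
        System.arraycopy(widths, i + 1, grown, i + 1 + mark.length(),
                widths.length - i - 1);
        widths = grown;
    }

    String text() { return text; }

    float[] widths() { return widths.clone(); }

    public static void main(String[] args) {
        GlyphRun run = new GlyphRun("\u0628", new float[]{5.0f}); // Arabic beh, width made up
        run.insertMarkAfter(0, "\u0651"); // shaddah
        run.insertMarkAfter(1, "\u064E"); // fathah
        System.out.println(run.text().length() + " chars, widths "
                + Arrays.toString(run.widths()));
    }
}
```

With this kind of bookkeeping, a loop over the characters can index the widths at the same position without a separate bounds check.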

So far this problem has only occurred with Arabic text. Since there is no evidence that it is limited to Arabic text, I did not attach it to another issue. Further investigation is needed.


> ArrayIndexOutOfBoundsException in mergeDiacritic() on extraction of text with diacritic text
> --------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2831
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2831
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>            Priority: Minor


