You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Gábor Stefanik (Jira)" <ji...@apache.org> on 2021/03/09 12:32:00 UTC
[jira] [Comment Edited] (PDFBOX-5126) Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width join, etc.) in a RTL context get reversed incorrectly on text extraction

    [ https://issues.apache.org/jira/browse/PDFBOX-5126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298045#comment-17298045 ] 

Gábor Stefanik edited comment on PDFBOX-5126 at 3/9/21, 12:31 PM:
------------------------------------------------------------------

If it helps, this is the workaround we currently have in place for this issue:
{code:java}
import java.io.IOException;
import java.text.Bidi;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class BiDirectionalPDFTextStripper extends PDFTextStripper {

	@Override
	protected void writeString(String string, List<TextPosition> pos) throws IOException {
		pos = processBidi(pos);
		super.writeString(positionsToString(pos), pos);
	}

	private static String positionsToString(List<TextPosition> pos) {
		return pos.stream().map(TextPosition::toString).collect(Collectors.joining());
	}

	private List<TextPosition> processBidi(List<TextPosition> pos) {
		String word = positionsToString(pos);
		Bidi bidi = new Bidi(word, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
		Map<Integer, Integer> char2glyph = new HashMap<>();
		int p = 0;

		for (int i = 0; i < pos.size(); i++) {
			char2glyph.put(p, i);
			p += pos.get(i).getUnicode().length();
		}
		char2glyph.put(p, pos.size());

		// if there is pure LTR text no need to process further
		if (!bidi.isMixed() && bidi.getBaseLevel() == Bidi.DIRECTION_LEFT_TO_RIGHT) {
			return pos;
		}

		// collect individual bidi information
		int runCount = bidi.getRunCount();
		byte[] levels = new byte[runCount];
		Integer[] runs = new Integer[runCount];

		for (int i = 0; i < runCount; i++) {
			levels[i] = (byte) bidi.getRunLevel(i);
			runs[i] = i;
		}

		// reorder individual parts based on their levels
		Bidi.reorderVisually(levels, 0, runs, 0, runCount);
		List<TextPosition> newPos = new ArrayList<>();

		for (int i = 0; i < runCount; i++) {
			int index = runs[i];
			int start = bidi.getRunStart(index);
			int end = bidi.getRunLimit(index);
			int level = levels[index];

			if ((level & 1) != 0) {
				while (--end >= start) {
					if (!char2glyph.containsKey(end))
						continue;
					newPos.add(pos.get(char2glyph.get(end)));
				}
			} else {
				newPos.addAll(pos.subList(char2glyph.get(start), char2glyph.get(end)));
			}
		}

		return newPos;
	}
}
{code}
I hope this is easier to adapt into an actual fix than working from scratch.

 


was (Author: googulator):
If it helps, this is the workaround we currently have in place for this issue:

 
{code:java}
import java.io.IOException;
import java.text.Bidi;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class BiDirectionalPDFTextStripper extends PDFTextStripper {

	@Override
	protected void writeString(String string, List<TextPosition> pos) throws IOException {
		pos = processBidi(pos);
		super.writeString(positionsToString(pos), pos);
	}

	private static String positionsToString(List<TextPosition> pos) {
		return pos.stream().map(TextPosition::toString).collect(Collectors.joining());
	}

	private List<TextPosition> processBidi(List<TextPosition> pos) {
		String word = positionsToString(pos);
		Bidi bidi = new Bidi(word, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
		Map<Integer, Integer> char2glyph = new HashMap<>();
		int p = 0;

		for (int i = 0; i < pos.size(); i++) {
			char2glyph.put(p, i);
			p += pos.get(i).getUnicode().length();
		}
		char2glyph.put(p, pos.size());

		// if there is pure LTR text no need to process further
		if (!bidi.isMixed() && bidi.getBaseLevel() == Bidi.DIRECTION_LEFT_TO_RIGHT) {
			return pos;
		}

		// collect individual bidi information
		int runCount = bidi.getRunCount();
		byte[] levels = new byte[runCount];
		Integer[] runs = new Integer[runCount];

		for (int i = 0; i < runCount; i++) {
			levels[i] = (byte) bidi.getRunLevel(i);
			runs[i] = i;
		}

		// reorder individual parts based on their levels
		Bidi.reorderVisually(levels, 0, runs, 0, runCount);
		List<TextPosition> newPos = new ArrayList<>();

		for (int i = 0; i < runCount; i++) {
			int index = runs[i];
			int start = bidi.getRunStart(index);
			int end = bidi.getRunLimit(index);
			int level = levels[index];

			if ((level & 1) != 0) {
				while (--end >= start) {
					if (!char2glyph.containsKey(end))
						continue;
					newPos.add(pos.get(char2glyph.get(end)));
				}
			} else {
				newPos.addAll(pos.subList(char2glyph.get(start), char2glyph.get(end)));
			}
		}

		return newPos;
	}
}
{code}
I hope this is easier to adapt into an actual fix than working from scratch.

 

> Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width join, etc.) in a RTL context get reversed incorrectly on text extraction
> --------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5126
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5126
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.22
>            Reporter: Gábor Stefanik
>            Priority: Major
>         Attachments: rovasvegyes.pdf
>
>
> The attached PDF contains old Hungarian runic script, which is both right-to-left and outside Unicode's Basic Multilingual Plane (and thus encoded as surrogate pairs in Java's internal UTF-16-like representation). When this text is extracted, the surrogate pairs are reversed due to an overly naive use of "char"-level reversal, leading to malformed Unicode output.
> Likewise, when combining diacritics/modifiers occur in a right-to-left context, their position relative to the "parent" character is reversed, and so they appear on the wrong glyph, as demonstrated by the Hebrew sample in the same PDF. I imagine the same thing would also happen to emoji using the "zero-width joiner" in an RTL context.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org