Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/11/06 14:10:01 UTC

[jira] [Updated] (TIKA-2484) Improve CharsetDetector to recognize UTF-16LE/BE,UTF-32LE/BE and UTF-7 with/without BOMs correctly

     [ https://issues.apache.org/jira/browse/TIKA-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2484:
------------------------------
    Attachment: charset.zip

File from [~AndreasMeier]

> Improve CharsetDetector to recognize UTF-16LE/BE,UTF-32LE/BE and UTF-7 with/without BOMs correctly
> --------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2484
>                 URL: https://issues.apache.org/jira/browse/TIKA-2484
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.16, 1.17
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: IUC10-ar.UTF-16BE.with-BOM, IUC10-ar.UTF-16BE.without-BOM, IUC10-ar.UTF-16LE.with-BOM, IUC10-ar.UTF-16LE.without-BOM, IUC10-ar.UTF-32BE.with-BOM, IUC10-ar.UTF-32BE.without-BOM, IUC10-ar.UTF-32LE.with-BOM, IUC10-ar.UTF-32LE.without-BOM, IUC10-ar.UTF-7.with-BOM, IUC10-ar.UTF-7.without-BOM, IUC10-fr.UTF-16BE.with-BOM, IUC10-fr.UTF-16BE.without-BOM, IUC10-fr.UTF-16LE.with-BOM, IUC10-fr.UTF-16LE.without-BOM, IUC10-fr.UTF-32BE.with-BOM, IUC10-fr.UTF-32BE.without-BOM, IUC10-fr.UTF-32LE.with-BOM, IUC10-fr.UTF-32LE.without-BOM, IUC10-fr.UTF-7.with-BOM, IUC10-fr.UTF-7.without-BOM, charset.zip
>
>
> I would like to help improve the recognition accuracy of the CharsetDetector.
> To that end I created a testset of text/plain files to check the quality of org.apache.tika.parser.txt.CharsetDetector: charset.tar.gz
> (Testset created from http://source.icu-project.org/repos/icu/icu4j/tags/release-4-8/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml)
> The testset was processed with TIKA 1.17 (commit 877d621, HEAD from 2017-10-26) and with the ICU4J 59.1 CharsetDetector plus custom UTF-7 improvements. Here are the results:
> {noformat}
> TIKA 1.17:
> charset.tar.gz
> Correct recognitions: 165/341
> {noformat}
> {noformat}
> TIKA 1.17 + UTF-7 recognizer:
> charset.tar.gz
> Correct recognitions: 213/341
> {noformat}
> {noformat}
> ICU4J 59.1 + UTF-7 recognizer:
> charset.tar.gz
> Correct recognitions: 333/341
> {noformat}
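> For reference, the correct recognitions were counted roughly like this (a minimal sketch, not my exact harness; it assumes the expected charset is the second dot-separated token of the file name, as in the attached testset, and that a UTF-7 charset provider such as jutf7 is on the classpath, because the stock JDK cannot instantiate a UTF-7 Charset). It uses the MyEncodingDetector class shown further below:
> {code:java}
> package test;
> import java.io.BufferedInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import java.nio.charset.Charset;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
> import java.util.stream.Stream;
> import org.apache.tika.metadata.Metadata;
> public class DetectorEvaluation {
>     public static void main(String[] args) throws IOException {
>         MyEncodingDetector detector = new MyEncodingDetector();
>         int correct = 0;
>         int total = 0;
>         try (Stream<Path> files = Files.list(Paths.get(args[0]))) {
>             for (Path file : (Iterable<Path>) files::iterator) {
>                 // file names like IUC10-ar.UTF-16BE.with-BOM carry the
>                 // expected charset as the second dot-separated token
>                 String expected = file.getFileName().toString().split("\\.")[1];
>                 try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
>                     Charset detected = detector.detect(in, new Metadata());
>                     total++;
>                     if (detected != null && detected.name().equalsIgnoreCase(expected)) {
>                         correct++;
>                     }
>                 }
>             }
>         }
>         System.out.println("Correct recognitions: " + correct + "/" + total);
>     }
> }
> {code}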
> As the UTF-7 recognizer, I used these two simple classes:
> {code:java}
> package test.utils;
> import java.util.Arrays;
> /**
>  * Pattern state container for the Boyer-Moore algorithm
>  */
> public final class BoyerMoorePattern
> {
>     private final byte[] pattern;
>     private final int[] skipArray;
>     public BoyerMoorePattern(byte[] pattern)
>     {
>         this.pattern = pattern;
>         skipArray = new int[256];
>         Arrays.fill(skipArray, -1);
>         // Initialize with pattern values
>         for (int i = 0; i < pattern.length; i++)
>         {
>             skipArray[pattern[i] & 0xFF] = i;
>         }
>     }
>     /**
>      * Get the pattern length
>      * 
>      * @return length as int
>      */
>     public int getLength()
>     {
>         return pattern.length;
>     }
>     /**
>      * Searches for the first occurrence of the pattern in the input byte array.
>      * 
>      * @param data - The data we want to search in
>      * @param startIdx - The startindex
>      * @param endIdx - The endindex
>      * @return offset as int or -1 if not found at all
>      */
>     public final int searchPattern(byte[] data, int startIdx, int endIdx)
>     {
>         int patternLength = pattern.length;
>         int skip = 0;
>         for (int i = startIdx; i <= endIdx - patternLength; i += skip)
>         {
>             skip = 0;
>             for (int j = patternLength - 1; j >= 0; j--)
>             {
>                 if (pattern[j] != data[i + j])
>                 {
>                     skip = Math.max(1, j - skipArray[data[i + j] & 0xFF]);
>                     break;
>                 }
>             }
>             if (skip == 0)
>             {
>                 return i;
>             }
>         }
>         return -1;
>     }
>     /**
>      * Searches for the first occurrence of the pattern in the input byte array.
>      * 
>      * @param data - The data we want to search in
>      * @param startIdx - The startindex
>      * @return offset as int or -1 if not found at all
>      */
>     public final int searchPattern(byte[] data, int startIdx)
>     {
>         return searchPattern(data, startIdx, data.length);
>     }
>     /**
>      * Searches for the first occurrence of the pattern in the input byte array.
>      * 
>      * @param data - The data we want to search in
>      * @return offset as int or -1 if not found at all
>      */
>     public final int searchPattern(byte[] data)
>     {
>         return searchPattern(data, 0, data.length);
>     }
> }
> {code}
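> To illustrate the intended usage of this class (a minimal sketch; the sample bytes are made up):
> {code:java}
> package test;
> import java.nio.charset.StandardCharsets;
> import test.utils.BoyerMoorePattern;
> public class BoyerMoorePatternDemo {
>     public static void main(String[] args) {
>         byte[] data = "some +SGVsbG8- modified UTF-7 text".getBytes(StandardCharsets.US_ASCII);
>         BoyerMoorePattern pattern = new BoyerMoorePattern("+".getBytes(StandardCharsets.US_ASCII));
>         // prints 5, the offset of the '+', or -1 if the pattern is absent
>         System.out.println("First match at: " + pattern.searchPattern(data));
>     }
> }
> {code}
> The second class, the detector itself: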
> {code:java}
> package test;
> import java.io.IOException;
> import java.io.InputStream;
> import java.nio.charset.Charset;
> import java.nio.charset.StandardCharsets;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> import org.apache.commons.io.IOUtils;
> import org.apache.tika.detect.EncodingDetector;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.txt.CharsetDetector;
> import org.apache.tika.parser.txt.CharsetMatch;
> import test.utils.BoyerMoorePattern;
> public class MyEncodingDetector implements EncodingDetector {
>
>     @Override
>     public Charset detect(InputStream input, Metadata metadata)
>             throws IOException {
>
>         CharsetDetector detector = new CharsetDetector();
>         detector.setText(input); // needs mark/reset support on the stream
>         CharsetMatch match = detector.detect();
>
>         String charsetName = isItUtf7(match, IOUtils.toByteArray(input)); // determines whether the match is actually UTF-7 or not
>
>         if (charsetName != null) {
>             // note: "UTF-7" requires a charset provider such as jutf7 on the
>             // classpath, since the stock JDK does not ship a UTF-7 Charset
>             return Charset.forName(charsetName);
>         }
>         return null;
>     }
>     /**
>      * Checks for BOM and determines whether it is UTF-7 or not.
>      * 
>      * @param match - The default match we expect, if it is not UTF-7
>      * @param data - The bytearray we want to check
>      * 
>      * @return match
>      */
>     private String isItUtf7(CharsetMatch match, byte[] data)
>     {
>         if (isUTF7withBOM(data) || isUTF7withoutBOM(data)) {
>             return "UTF-7";
>         }
>         if (match != null) {
>             return match.getName();
>         }
>         return null;
>     }
>     
>     private boolean isUTF7withBOM(byte[] data) {
>         if ((data.length > 4 && data[0] == 43 && data[1] == 47 && data[2] == 118)
>                 && (data[3] == 56 || data[3] == 57 || data[3] == 43 || data[3] == 47))
>         {
>             // Checking the byte array for the UTF-7 "byte order marks" (BOM),
>             // i.e. the ASCII sequences "+/v8", "+/v9", "+/v+" and "+/v/":
>             // 43 47 118 56
>             // 43 47 118 57
>             // 43 47 118 43
>             // 43 47 118 47
>             return true;
>         }
>         return false;
>     }
>     
>     private boolean isUTF7withoutBOM(byte[] data) {
>         byte[] utf7StartPattern = "+".getBytes(StandardCharsets.US_ASCII);
>         byte[] utf7EndPattern = "-".getBytes(StandardCharsets.US_ASCII);
>         BoyerMoorePattern bmpattern = new BoyerMoorePattern(utf7StartPattern); // create a new pattern with the bytes
>         int startPosSP = bmpattern.searchPattern(data);
>
>         BoyerMoorePattern empattern = new BoyerMoorePattern(utf7EndPattern); // create a new pattern with the bytes
>         int startPosEP = empattern.searchPattern(data);
>
>         if (startPosSP != -1 && startPosEP != -1) {
>             // both markers were found, so we can check for the basic "+xxx-" pattern with a regular expression
>             Pattern p = Pattern.compile("\\+[a-zA-Z]\\w{2,}-"); // an encoded word of at least 3 characters
>             // ISO-8859-1 maps every byte 1:1 to a char, so the ASCII-only regex is safe on arbitrary input
>             Matcher m = p.matcher(new String(data, StandardCharsets.ISO_8859_1));
>
>             int numberMatches = 0;
>             while (m.find()) {
>                 numberMatches++;
>             }
>
>             System.out.println("Number of possible UTF-7 regex matches: " + numberMatches);
>             int minimumMatches = 3;
>
>             if (numberMatches > minimumMatches) { // more than minimumMatches "+xxx-" words => expect UTF-7
>                 return true;
>             }
>         }
>
>         return false;
>     }
> }
> {code}
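> A quick sanity check of the heuristic on a handcrafted input (a minimal sketch; the sample text is made up and deliberately contains more than three "+xxx-" runs, and jutf7 must be on the classpath for the final Charset lookup):
> {code:java}
> package test;
> import java.io.ByteArrayInputStream;
> import java.nio.charset.Charset;
> import java.nio.charset.StandardCharsets;
> import org.apache.tika.metadata.Metadata;
> public class Utf7HeuristicDemo {
>     public static void main(String[] args) throws Exception {
>         // four "+xxx-" runs, so numberMatches > minimumMatches (3) holds
>         byte[] data = "x +ABc- y +DEf- z +GHi- w +JKl-".getBytes(StandardCharsets.US_ASCII);
>         Charset charset = new MyEncodingDetector().detect(
>                 new ByteArrayInputStream(data), new Metadata());
>         System.out.println("Detected: " + charset); // expected: UTF-7 (via jutf7)
>     }
> }
> {code}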
> There might be some false positive (FP) recognitions with the current regex and the fixed number of matches.
> A better approach might be to scale minimumMatches with the amount of text given to the detector, e.g. as sketched below.
> This is just a simple first try, nothing production-ready. It does not even cover all possible UTF-7 strings.
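> A rough sketch of such a scaling (the 512-byte divisor is an arbitrary guess, not something I have tuned):
> {code:java}
> package test;
> public final class Utf7Heuristics {
>     private Utf7Heuristics() {}
>     /**
>      * Scales the required number of "+xxx-" matches with the input size:
>      * short inputs still need 3 matches, longer inputs proportionally more.
>      */
>     public static int minimumMatches(int dataLength) {
>         // require one match per 512 bytes of input, with a floor of 3
>         return Math.max(3, dataLength / 512);
>     }
> }
> {code}
> In isUTF7withoutBOM the fixed constant would then become {{int minimumMatches = Utf7Heuristics.minimumMatches(data.length);}}.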
> By the way:
> I am perfectly aware that the current testset only covers a few encodings. However, the specified files address the main weakness of the current CharsetDetector.
> I don't know the history that led to the creation of the CharsetDetector in TIKA, or why ICU4J was rebuilt with extensions like the cp866 n-gram detection instead of contributing to ICU4J development.
> Wouldn't it be better to forward the CharsetDetector changes to the ICU4J developers so they can implement the missing encodings?
> Is it planned to include the newest version of ICU4J in future releases of TIKA?
> What about using neural networks to determine some or all charsets (given that there are enough test files)?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)