Posted to dev@tika.apache.org by "Andreas Meier (JIRA)" <ji...@apache.org> on 2017/11/07 08:04:01 UTC

[jira] [Comment Edited] (TIKA-2484) Improve CharsetDetector to recognize UTF-16LE/BE,UTF-32LE/BE and UTF-7 with/without BOMs correctly

    [ https://issues.apache.org/jira/browse/TIKA-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241633#comment-16241633 ] 

Andreas Meier edited comment on TIKA-2484 at 11/7/17 8:03 AM:
--------------------------------------------------------------

Thanks for the info [~gagravarr]

I think I understand the basic problem.
Their reservations about the changes make sense if other detections are completely broken by the supplied code.

However, I am still not happy with the "won't fix" as long as there is no large test corpus and no comparison of the results with/without the changes.
Whether the change makes sense might also depend on the use case of the software using the CharsetDetector.

I will open another ticket; maybe the ICU devs can provide a switch to activate the changes we need.
If the ICU devs refuse to do so (they might have their reasons), we should consider refactoring the TIKA CharsetDetector:
replace the TIKA CharsetDetector with the ICU4J CharsetDetector and trigger the IBM866/EBCDIC_500 detection afterwards if a certain charset was found, roughly as sketched below.
In my opinion it is not worth reworking the whole TIKA CharsetDetector every time ICU4J changes. (We might not even notice those changes in time, not to mention the errors that could occur while doing so...)
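Just to illustrate the idea, here is a minimal sketch of such a wrapper (class and method names and the post-check condition are only assumptions, not a worked-out design):
{code:java}
import com.ibm.icu.text.CharsetDetector;   // stock ICU4J detector
import com.ibm.icu.text.CharsetMatch;

/**
 * Sketch only: delegate the bulk of the work to the stock ICU4J CharsetDetector
 * and run the TIKA-specific recognizers (IBM866, EBCDIC/cp500, ...) as a post-step.
 */
public class DelegatingCharsetDetector {

    public String detect(byte[] data) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        CharsetMatch match = detector.detect();
        if (match == null) {
            return null;
        }
        String name = match.getName();
        // If ICU4J only finds a "nearby" single-byte charset, re-check with the
        // TIKA-only recognizers (hypothetical helper, not an existing API).
        if (name.startsWith("windows-125") || name.startsWith("ISO-8859")) {
            String tikaName = runTikaOnlyRecognizers(data);
            if (tikaName != null) {
                return tikaName;   // e.g. "IBM866" or "IBM500"
            }
        }
        return name;
    }

    private String runTikaOnlyRecognizers(byte[] data) {
        // Placeholder for TIKA's ngram-based IBM866/EBCDIC_500 detection.
        return null;
    }
}
{code}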

ICU Ticket: [http://bugs.icu-project.org/trac/ticket/13465]



> Improve CharsetDetector to recognize UTF-16LE/BE,UTF-32LE/BE and UTF-7 with/without BOMs correctly
> --------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2484
>                 URL: https://issues.apache.org/jira/browse/TIKA-2484
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.16, 1.17
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: IUC10-ar.UTF-16BE.with-BOM, IUC10-ar.UTF-16BE.without-BOM, IUC10-ar.UTF-16LE.with-BOM, IUC10-ar.UTF-16LE.without-BOM, IUC10-ar.UTF-32BE.with-BOM, IUC10-ar.UTF-32BE.without-BOM, IUC10-ar.UTF-32LE.with-BOM, IUC10-ar.UTF-32LE.without-BOM, IUC10-ar.UTF-7.with-BOM, IUC10-ar.UTF-7.without-BOM, IUC10-fr.UTF-16BE.with-BOM, IUC10-fr.UTF-16BE.without-BOM, IUC10-fr.UTF-16LE.with-BOM, IUC10-fr.UTF-16LE.without-BOM, IUC10-fr.UTF-32BE.with-BOM, IUC10-fr.UTF-32BE.without-BOM, IUC10-fr.UTF-32LE.with-BOM, IUC10-fr.UTF-32LE.without-BOM, IUC10-fr.UTF-7.with-BOM, IUC10-fr.UTF-7.without-BOM, charset.zip
>
>
> I would like to help improve the recognition accuracy of the CharsetDetector.
> Therefore I created a test set of plain-text files to check the quality of org.apache.tika.parser.txt.CharsetDetector: charset.tar.gz
> (The test set was created from http://source.icu-project.org/repos/icu/icu4j/tags/release-4-8/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml)
> The test set was processed with TIKA 1.17 (ID: 877d621, HEAD from 26.10.2017) and with the ICU4J 59.1 CharsetDetector plus custom UTF-7 improvements. Here are the results:
> {noformat}
> TIKA-1.17
> charset.tar.gz
> Correct recognitions: 165/341
> {noformat}
> {noformat}
> TIKA-1.17 + UTF-7 recognizer:
> charset.tar.gz
> Correct recognitions: 213/341
> {noformat}
> {noformat}
> ICU4J 59.1 + UTF-7 recognizer:
> charset.tar.gz
> Correct recognitions: 333/341
> {noformat}
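> (For reference: the correct-recognition counts were computed with a small harness roughly like the one below; it assumes the expected charset can be read from the file name, e.g. IUC10-ar.UTF-16BE.with-BOM. Just a sketch, not the exact code I used.)
> {code:java}
> import java.io.File;
> import java.nio.file.Files;
> import org.apache.tika.parser.txt.CharsetDetector;
> import org.apache.tika.parser.txt.CharsetMatch;
> public class DetectionBenchmark {
>     public static void main(String[] args) throws Exception {
>         File dir = new File(args[0]); // directory with the extracted test files
>         int correct = 0;
>         int total = 0;
>         for (File f : dir.listFiles()) {
>             // file names look like "IUC10-ar.UTF-16BE.with-BOM" -> second segment is the expected charset
>             String expected = f.getName().split("\\.")[1];
>             byte[] data = Files.readAllBytes(f.toPath());
>             CharsetDetector detector = new CharsetDetector();
>             detector.setText(data);
>             CharsetMatch match = detector.detect();
>             total++;
>             if (match != null && match.getName().equalsIgnoreCase(expected)) {
>                 correct++;
>             }
>         }
>         System.out.println("Correct recognitions: " + correct + "/" + total);
>     }
> }
> {code}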
> As the UTF-7 recognizer I used these two simple classes:
> {code:java}
> package test.utils;
> import java.util.Arrays;
> /**
>  * Pattern state container for the Boyer-Moore algorithm
>  */
> public final class BoyerMoorePattern
> {
>     private final byte[] pattern;
>     private final int[] skipArray;
>     public BoyerMoorePattern(byte[] pattern)
>     {
>         this.pattern = pattern;
>         skipArray = new int[256];
>         Arrays.fill(skipArray, -1);
>         // Initialize with pattern values
>         for (int i = 0; i < pattern.length; i++)
>         {
>             skipArray[pattern[i] & 0xFF] = i;
>         }
>     }
>     /**
>      * Get the pattern length
>      * 
>      * @return length as int
>      */
>     public int getLength()
>     {
>         return pattern.length;
>     }
>     /**
>      * Searches for the first occurrence of the pattern in the input byte array.
>      * 
>      * @param data - The data we want to search in
>      * @param startIdx - The start index
>      * @param endIdx - The end index
>      * @return offset as int or -1 if not found at all
>      */
>     public final int searchPattern(byte[] data, int startIdx, int endIdx)
>     {
>         int patternLength = pattern.length;
>         int skip = 0;
>         for (int i = startIdx; i <= endIdx - patternLength; i += skip)
>         {
>             skip = 0;
>             for (int j = patternLength - 1; j >= 0; j--)
>             {
>                 if (pattern[j] != data[i + j])
>                 {
>                     skip = Math.max(1, j - skipArray[data[i + j] & 0xFF]);
>                     break;
>                 }
>             }
>             if (skip == 0)
>             {
>                 return i;
>             }
>         }
>         return -1;
>     }
>     /**
>      * Searches for the first occurrence of the pattern in the input byte array.
>      * 
>      * @param data - The data we want to search in
>      * @param startIdx - The start index
>      * @return offset as int or -1 if not found at all
>      */
>     public final int searchPattern(byte[] data, int startIdx)
>     {
>         return searchPattern(data, startIdx, data.length);
>     }
>     /**
>      * Searches for the first occurrence of the pattern in the input byte array.
>      * 
>      * @param data - The data we want to search in
>      * @return offset as int or -1 if not found at all
>      */
>     public final int searchPattern(byte[] data)
>     {
>         return searchPattern(data, 0, data.length);
>     }
> }
> {code}
> {code:java}
> package test;
> import java.io.IOException;
> import java.io.InputStream;
> import java.nio.charset.Charset;
> import java.util.logging.Logger;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> import org.apache.commons.io.IOUtils;
> import org.apache.tika.detect.EncodingDetector;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.txt.CharsetDetector;
> import org.apache.tika.parser.txt.CharsetMatch;
> import test.utils.BoyerMoorePattern;
> public class MyEncodingDetector implements EncodingDetector {
> 	
> 	public Charset detect(InputStream input, Metadata metadata)
> 			throws IOException {
> 		
> 		CharsetDetector detector = new CharsetDetector();
> 		detector.setText(input); // setText() marks and resets the stream, so it can still be consumed below
> 		CharsetMatch match = detector.detect();
> 		
> 		String charsetName = isItUtf7(match, IOUtils.toByteArray(input)); // determines whether the match is UTF-7 or not
> 		
> 		if (charsetName != null) {
> 			return Charset.forName(charsetName);
> 		}
> 		return null;
> 	}
>     /**
>      * Checks for BOM and determines whether it is UTF-7 or not.
>      * 
>      * @param match - The default match we expect, if it is not UTF-7
>      * @param data - The bytearray we want to check
>      * 
>      * @return match
>      */
>     private String isItUtf7(CharsetMatch match, byte[] data)
>     {
>         if (isUTF7withBOM(data) || isUTF7withoutBOM(data)) {
>             return "UTF-7";
>         } else {
>         	if (match != null) {
>         		return match.getName();
>         	}
>         	return null;
>         }        
>     }
>     
>     private boolean isUTF7withBOM(byte[] data) {
>         if ((data.length > 4 && data[0] == 43 && data[1] == 47 && data[2] == 118)
>                 && (data[3] == 56 || data[3] == 57 || data[3] == 43 || data[3] == 47))
>         {
>             // Checking the byte array for the UTF-7 "byte order marks" (BOMs), i.e. the ASCII bytes of
>             // "+/v8" -> 43 47 118 56
>             // "+/v9" -> 43 47 118 57
>             // "+/v+" -> 43 47 118 43
>             // "+/v/" -> 43 47 118 47
>             return true;
>         }
>         return false;
>     }
>     
>     private boolean isUTF7withoutBOM(byte[] data) {
>         byte[] utf7StartPattern = "+".getBytes();
>         byte[] utf7EndPattern = "-".getBytes();
>         BoyerMoorePattern bmpattern = new BoyerMoorePattern(utf7StartPattern); // create a new pattern with the bytes
>         int startPosSP = bmpattern.searchPattern(data);
>         
>         BoyerMoorePattern empattern = new BoyerMoorePattern(utf7EndPattern); // create a new pattern with the bytes
>         int startPosEP = empattern.searchPattern(data);
> 		
>         if (startPosSP != -1 && startPosEP != -1) {
>         	// both markers were found, so check for the basic UTF-7 pattern with a regular expression
>         	Pattern p = Pattern.compile("\\+[a-zA-Z]\\w{2,}\\-");	// a '+' followed by at least three word characters and a closing '-'
>         	Matcher m = p.matcher(new String(data));
>         	
>         	int numberMatches = 0;
>         	while (m.find()) {
>         		numberMatches++;
>         	}
>         	
>         	System.out.println("Number of possible UTF-7 regex matches: " + numberMatches);
>         	int minimumMatches = 3;
>         	
>         	if (numberMatches > minimumMatches) {	// if there are more than minimumMatches "+xxx-" words the expected encoding shall be UTF-7
>         		return true;
>         	}
>         }
>         
>         return false;
>     }
> }
> {code}
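> A hypothetical way to run the detector on a single file, in case anyone wants to reproduce the numbers (just a usage sketch, not part of the attached classes):
> {code:java}
> import java.io.BufferedInputStream;
> import java.io.FileInputStream;
> import java.io.InputStream;
> import java.nio.charset.Charset;
> import org.apache.tika.metadata.Metadata;
> import test.MyEncodingDetector;
> public class DetectOneFile {
>     public static void main(String[] args) throws Exception {
>         // BufferedInputStream is used because CharsetDetector.setText() needs mark/reset support
>         try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
>             Charset charset = new MyEncodingDetector().detect(in, new Metadata());
>             System.out.println(args[0] + " -> " + charset);
>         }
>     }
> }
> {code}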
> There might be some false-positive (FP) recognitions with the current regex and the fixed number of required matches.
> A better approach might be to scale minimumMatches with the amount of text given to the detector, e.g. as sketched below.
> This is just a simple first try, nothing for production use. It does not even cover all possible UTF-7 strings.
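> A rough sketch of that idea (the divisor is an arbitrary assumption and would need tuning against a larger corpus):
> {code:java}
>     // Replace the hard-coded "int minimumMatches = 3;" in isUTF7withoutBOM() with
>     // a value derived from the input length: roughly one "+xxx-" run per 500 bytes,
>     // but never fewer than 3. The divisor 500 is an arbitrary assumption.
>     private int minimumMatchesFor(byte[] data)
>     {
>         return Math.max(3, data.length / 500);
>     }
> {code}
> isUTF7withoutBOM() would then compare numberMatches against minimumMatchesFor(data) instead of the constant.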
> By the way:
> I am perfectly aware that the current test set only covers a few encodings. However, the supplied files address the main weakness of the current CharsetDetector.
> I don't know the history that led to the creation of the CharsetDetector in TIKA, or why ICU4J was rebuilt with extensions like the cp866 ngram detection instead of contributing to ICU4J development.
> Wouldn't it be better to forward the changes to the CharsetDetector to the ICU4J developers so they can implement the missing encodings?
> Is it planned to include the newest version of ICU4J in future releases of TIKA?
> What about neural networks to determine some or all charsets (given that there are enough test files)?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)