You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/07/27 00:48:20 UTC
[jira] [Resolved] (TIKA-2041) Charset detection doesn't appear to
be thread-safe
[ https://issues.apache.org/jira/browse/TIKA-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2041.
-------------------------------
Resolution: Fixed
Assignee: Tim Allison
Fix Version/s: 2.0
1.14,
Thank you, [~c.leitinger], [~fnl], [~christian.aistleitner@selerityinc.com] for finding this issue and helping us to find the cause.
[~c.leitinger], thank you for taking the first (very important!) step of contacting us about this issue. Now you know how to reach us. Let us know what else you find. :)
> Charset detection doesn't appear to be thread-safe
> --------------------------------------------------
>
> Key: TIKA-2041
> URL: https://issues.apache.org/jira/browse/TIKA-2041
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Assignee: Tim Allison
> Fix For: 1.14,, 2.0
>
>
> On the user list, Christian Leitinger noted that his team found a potential issue with the thread safety of the encoding detector. I was able to reproduce this with on the corpus of html files in [~faghani]'s encoding detector.
> {noformat}
> @Test
> public void testMultiThreadingEncodingDetection() throws Exception {
> Path testDocs = Paths.get("C:/data/encodings/corpus");
> List<Path> paths = new ArrayList<>();
> Map<Path, String> encodings = new ConcurrentHashMap<>();
> for (File encodingDirs : testDocs.toFile().listFiles()) {
> for (File file : encodingDirs.listFiles()) {
> String encoding = getEncoding(file.toPath());
> paths.add(file.toPath());
> encodings.put(file.toPath(), encoding);
> }
> }
> int numThreads = 1000;
> ExecutorService ex = Executors.newFixedThreadPool(numThreads);
> CompletionService<String> completionService =
> new ExecutorCompletionService<>(ex);
> for (int i = 0; i < numThreads; i++) {
> completionService.submit(new EncodingDetectorRunner(paths, encodings), "done");
> }
> int completed = 0;
> while (completed < numThreads) {
> Future<String> future = completionService.take();
> if (future.isDone() && "done".equals(future.get())) {
> completed++;
> }
> }
> assertTrue("success!", true);
> }
> private class EncodingDetectorRunner implements Runnable {
> private final List<Path> paths;
> private final Map<Path, String> encodings;
> private final Random r = new Random();
> private EncodingDetectorRunner(List<Path> paths, Map<Path, String> encodings) {
> this.paths = paths;
> this.encodings = encodings;
> }
> @Override
> public void run() {
> for (int i = 0; i < 100; i++) {
> int pInd = r.nextInt(paths.size());
> String detectedEncoding = null;
> try {
> detectedEncoding = getEncoding(paths.get(pInd));
> } catch (Exception e) {
> throw new RuntimeException(e);
> }
> String trueEncoding = encodings.get(paths.get(pInd));
> if (! detectedEncoding.equals(trueEncoding)) {
> throw new RuntimeException("detected: " + detectedEncoding +
> " but should have been: "+trueEncoding + " for " + paths.get(pInd));
> }
> }
> }
> }
> public String getEncoding(Path p) throws Exception {
> try (InputStream is = TikaInputStream.get(p)) {
> AutoDetectReader reader = new AutoDetectReader(is);
> String val = reader.getCharset().toString();
> if (val == null) {
> return "NULL";
> } else {
> return val;
> }
> }
> }
> {noformat}
> yields:
> {noformat}
> ava.util.concurrent.ExecutionException: java.lang.RuntimeException: detected: ISO-8859-1 but should have been: windows-1252 for C:\data\encodings\corpus\Shift_JIS\1
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.tika.parser.html.HtmlParserTest.testMultiThreadingEncodingDetection(HtmlParserTest.java:1213)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)