Posted to user@tika.apache.org by c....@lirum.at on 2016/07/25 22:00:52 UTC

Is Tika (especially CharsetDetector) considered thread-safe?

Hi,

I am working on a project where Tika is used in a heavily
multi-threaded environment. Lately, we have seen cases where
character set detection gives plausible results when run in isolation,
but results that are way off when run in parallel.

The root cause has not yet been found, but within the team there has
been a fair amount of finger-pointing at Tika's thread-safety and lots
of FUD, especially around org.apache.tika.parser.txt.CharsetDetector.

But it seems no one on our team has bothered to either file a bug
report or ask on the mailing list.

So just to get rid of the FUD: Is
org.apache.tika.parser.txt.CharsetDetector considered to be
thread-safe?
(Some bugs suggest that Tika cares about thread-safety, but I could
not find anything in the javadoc for CharsetDetector)

Thanks and Best regards,
Christian


P.S.: We're building a fresh CharsetDetector for each byte array
whose character set encoding is to be detected, and only the thread
that created a CharsetDetector uses it.


P.P.S.: We're still using Tika 1.9.

Re: Is Tika (especially CharsetDetector) considered thread-safe?

Posted by Christian <c....@lirum.at>.
Hi,

On Tue, Jul 26, 2016 at 02:17:13AM +0000, Allison, Timothy B. wrote:
> Exactly what code are you using?  How are you doing detection?

I see that you already have something working on TIKA-2041.

But for completeness' sake:
Our code is a bit convoluted.
It boils down to running the following piece of code in multiple
threads in parallel:

    private String getCharset(final byte[] raw) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(raw);
        CharsetMatch match = detector.detect();
        if (match == null) {
            return null;
        }
        return match.getName();
    }

`raw` is isolated per thread. So CharsetDetector does not have the
byte array changed underneath its feet.
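
The confinement pattern described above (one fresh detector per call, inputs never shared) can be checked with a small harness that compares a serial baseline against a parallel run. The following is only a sketch under stated assumptions: it does not use Tika at all. The JDK's CharsetDecoder, which is documented as not safe for concurrent use, stands in for CharsetDetector, and the names ConfinedDetection, guessCharset and detectInParallel, as well as the ISO-8859-1 fallback, are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConfinedDetection {

    // Stand-in for getCharset(): builds a fresh, non-thread-safe decoder per
    // call, so no detector state is ever shared between threads.
    static String guessCharset(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(raw));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return "ISO-8859-1"; // hypothetical fallback, for illustration only
        }
    }

    // Runs guessCharset over every input on a thread pool and returns the
    // results in input order.
    static List<String> detectInParallel(List<byte[]> inputs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (byte[] raw : inputs) {
                tasks.add(() -> guessCharset(raw));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> inputs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            inputs.add(i % 2 == 0
                    ? "gr\u00fc\u00dfe".getBytes(StandardCharsets.UTF_8)
                    : "gr\u00fc\u00dfe".getBytes(StandardCharsets.ISO_8859_1));
        }
        // Serial baseline first, then the same inputs in parallel.
        List<String> serial = new ArrayList<>();
        for (byte[] raw : inputs) {
            serial.add(guessCharset(raw));
        }
        List<String> parallel = detectInParallel(inputs, 8);
        System.out.println(serial.equals(parallel) ? "parallel matches serial" : "MISMATCH");
    }
}
```

If a harness of this shape reports a mismatch with the real detector swapped in, that would point at state shared inside the library rather than at the calling code.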


Best regards,
Christian

RE: Is Tika (especially CharsetDetector) considered thread-safe?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Yes, we have a problem.  Thank you for raising this.

https://issues.apache.org/jira/browse/TIKA-2041



RE: Is Tika (especially CharsetDetector) considered thread-safe?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
<face_palm/>

Still couldn't find any problems with actual multithreaded code. :(

    @Test
    public void testMultiThreadingEncodingDetection() throws Exception {
        Path testDocs = Paths.get(this.getClass().getResource("/test-documents").toURI());
        List<Path> paths = new ArrayList<>();
        Map<Path, String> encodings = new ConcurrentHashMap<>();
        for (File file : testDocs.toFile().listFiles()) {
            if (file.getName().endsWith(".txt") || file.getName().endsWith(".html")) {
                System.out.println(file);
                String encoding = getEncoding(file.toPath());
                paths.add(file.toPath());
                encodings.put(file.toPath(), encoding);
            }
        }
        int numThreads = 100;
        ExecutorService ex = Executors.newFixedThreadPool(numThreads);
        CompletionService<String> completionService =
                new ExecutorCompletionService<>(ex);

        for (int i = 0; i < numThreads; i++) {
            completionService.submit(new EncodingDetector(paths, encodings), "done");
        }
        int completed = 0;
        while (completed < numThreads) {
            Future<String> future = completionService.take();
            if (future.isDone() && "done".equals(future.get())) {
                completed++;
            }
        }
        ex.shutdown(); // release the worker threads once all tasks have completed
        assertTrue("success!", true);
    }

    private class EncodingDetector implements Runnable {
        private final List<Path> paths;
        private final Map<Path, String> encodings;
        private final Random r = new Random();
        private EncodingDetector(List<Path> paths, Map<Path, String> encodings) {
            this.paths = paths;
            this.encodings = encodings;
        }

        @Override
        public void run() {
            for (int i = 0; i < 1000; i++) {
                int pInd = r.nextInt(paths.size());

                String detectedEncoding = null;
                try {
                    detectedEncoding = getEncoding(paths.get(pInd));
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                String trueEncoding = encodings.get(paths.get(pInd));
                if (! detectedEncoding.equals(trueEncoding)) {
                    throw new RuntimeException("detected: " + detectedEncoding +
                            " but should have been: "+trueEncoding);
                }
            }
        }
    }

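    public String getEncoding(Path p) throws Exception {
        // Close the reader along with the stream.
        try (InputStream is = TikaInputStream.get(p);
             AutoDetectReader reader = new AutoDetectReader(is)) {
            // Check for null before calling toString() to avoid an NPE.
            // (Charset is java.nio.charset.Charset.)
            Charset charset = reader.getCharset();
            return charset == null ? "NULL" : charset.toString();
        }
    }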
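
One property the test above relies on is that an exception thrown inside a submitted task is not lost: it is stored in the Future and rethrown, wrapped in an ExecutionException, when future.get() is called. The following is a minimal sketch of that behaviour using only the JDK; the class name FailurePropagation, the method countFailures and the exception message are made up for illustration.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class FailurePropagation {

    // Submits `tasks` Runnables; the one whose index equals `failing` throws.
    // Returns the number of failures observed through Future.get().
    static int countFailures(int tasks, int failing) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<String> cs = new ExecutorCompletionService<>(pool);
        try {
            for (int i = 0; i < tasks; i++) {
                final int id = i;
                cs.submit(() -> {
                    if (id == failing) {
                        throw new RuntimeException("detected: X but should have been: Y");
                    }
                }, "done");
            }
            int failures = 0;
            for (int i = 0; i < tasks; i++) {
                try {
                    cs.take().get(); // returns "done" or rethrows the task's failure
                } catch (ExecutionException e) {
                    failures++; // e.getCause() is the worker's RuntimeException
                }
            }
            return failures;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("failures seen: " + countFailures(10, 3));
    }
}
```

With plain Thread objects the same RuntimeException would only reach the thread's uncaught-exception handler, so a test built on bare threads needs another way to report mismatches back to the test method.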



RE: Is Tika (especially CharsetDetector) considered thread-safe?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
With 1.13 and this code, I'm not able to see any problems with our handful of test files in our unit tests.  

Exactly what code are you using?  How are you doing detection?


    @Test
    public void testMultiThreadedEncodingDetection() throws Exception {
        Path testDocs = Paths.get(this.getClass().getResource("/test-documents").toURI());
        List<Path> paths = new ArrayList<>();
        Map<Path, String> encodings = new ConcurrentHashMap<>();
        for (File file : testDocs.toFile().listFiles()) {
            if (file.getName().endsWith(".txt") || file.getName().endsWith(".html")) {
                String encoding = getEncoding(file.toPath());
                paths.add(file.toPath());
                encodings.put(file.toPath(), encoding);
            }
        }
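        for (int i = 0; i < 100; i++) {
            // Caution: run() executes the Runnable synchronously on the calling
            // thread; Thread.start() is needed for the work to run concurrently.
            new Thread(new EncodingDetector(paths, encodings)).run();
        }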
        assertTrue("success!", true);
    }

    private class EncodingDetector implements Runnable {
        private final List<Path> paths;
        private final Map<Path, String> encodings;
        private final Random r = new Random();
        private EncodingDetector(List<Path> paths, Map<Path, String> encodings) {
            this.paths = paths;
            this.encodings = encodings;
        }

        @Override
        public void run() {
            for (int i = 0; i < 100; i++) {
                int pInd = r.nextInt(paths.size());
                String detectedEncoding = null;
                try {
                    detectedEncoding = getEncoding(paths.get(pInd));
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                String trueEncoding = encodings.get(paths.get(pInd));
                if (! detectedEncoding.equals(trueEncoding)) {
                    throw new RuntimeException("detected: " + detectedEncoding +
                            " but should have been: "+trueEncoding);
                }
            }
        }
    }

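    public String getEncoding(Path p) throws Exception {
        // Close the reader along with the stream.
        try (InputStream is = TikaInputStream.get(p);
             AutoDetectReader reader = new AutoDetectReader(is)) {
            // Check for null before calling toString() to avoid an NPE.
            // (Charset is java.nio.charset.Charset.)
            Charset charset = reader.getCharset();
            return charset == null ? "NULL" : charset.toString();
        }
    }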
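
A note on the loop in the test above: Thread.run() simply invokes the Runnable on the calling thread, whereas Thread.start() spawns a new thread. A test that calls run() therefore executes its workers one after another and cannot expose a race. The difference can be seen with a few lines of plain JDK code; the class name RunVersusStart and the method executingThreads are hypothetical, chosen only for this sketch.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class RunVersusStart {

    // Launches `n` workers via run() or start() and returns the names of the
    // threads that actually executed the work.
    static Set<String> executingThreads(int n, boolean useStart) throws InterruptedException {
        Set<String> names = ConcurrentHashMap.newKeySet();
        Runnable work = () -> names.add(Thread.currentThread().getName());
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(work, "worker-" + i);
            if (useStart) {
                threads[i].start(); // work runs on the new thread
            } else {
                threads[i].run();   // work runs on the calling thread
            }
        }
        for (Thread t : threads) {
            t.join(); // returns immediately for threads that were never started
        }
        return names;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("run():   " + executingThreads(4, false).size() + " distinct thread(s)");
        System.out.println("start(): " + executingThreads(4, true).size() + " distinct thread(s)");
    }
}
```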



RE: Is Tika (especially CharsetDetector) considered thread-safe?

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 26 Jul 2016, Allison, Timothy B. wrote:
> Charset detection _should_ be thread safe.  If you can help us track 
> down the problem (unit test?), we need to fix this.

On the whole, I think Tika is following the POI model on thread-safety as 
a minimum. That is, two threads working on two different documents should 
always be fine. Two threads trying to work on the same document may not be.

Nick

RE: Is Tika (especially CharsetDetector) considered thread-safe?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Charset detection _should_ be thread safe.  If you can help us track down the problem (unit test?), we need to fix this.

Thank you for raising this.

Best,

         Tim

Re: Is Tika (especially CharsetDetector) considered thread-safe?

Posted by "c.leitinger@lirum.at" <c....@lirum.at>.
Hi,

On Tue, Jul 26, 2016 at 03:07:59AM +0000, Allison, Timothy B. wrote:
> If you could open an account on JIRA, it would be helpful for
> discussion on this issue.

Done.

Thanks!

Best regards,
Christian

RE: Is Tika (especially CharsetDetector) considered thread-safe?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Christian,
  If you could open an account on JIRA, it would be helpful for discussion on this issue.  Thank you, again.

        Best,

                Tim
       
