Posted to dev@tika.apache.org by Paul Jakubik <pa...@purediscovery.com> on 2010/08/12 21:37:55 UTC

Faster charset detection or turn off charset detection?

Hi,

I'm wondering if there is a way to turn off character set detection when
parsing with the AutoDetectParser, or if there is a way to speed up
character set detection.

I ran a test that converted 52,717 documents to text. The documents were
emails embedded in a .tar file.

With character set detection, the test took 220 seconds. Without character set
detection, the test took 21 seconds and only 6% of that time was spent in
Tika.

According to a profiler, the following methods took the bulk of the runtime
when character set detection was used:
 61.7%  org.apache.tika.parser.txt.CharsetRecog_sbcs$NGramParser.parse
  4.3%  org.apache.tika.parser.txt.CharsetRecog_sbcs$CharsetRecog_IBM420_ar.isLamAlef
  3.1%  org.apache.tika.parser.txt.CharsetRecog_sbcs$CharsetRecog_IBM420_ar.unshapeLamAlef
  2.6%  org.apache.tika.parser.txt.CharsetDetector.setText(byte[])
  2.3%  org.apache.tika.parser.txt.CharsetRecog_mbcs.match

One problem that seems to contribute to this is that every character set is
tested for each document, instead of starting with common character sets and
stopping as soon as an adequate character set is found.
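
To illustrate the "common character sets first, stop early" idea, here is a minimal sketch using only the JDK. It is not Tika's algorithm: Tika's recognizers score n-gram statistics, whereas this sketch swaps in a much cruder test (strict decoding without errors). The class name, method name, and candidate order are assumptions made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class EarlyExitCharsetGuess {

    // Hypothetical priority order: cheapest and most common candidates first.
    static final List<Charset> COMMON_FIRST =
            Arrays.asList(StandardCharsets.US_ASCII, StandardCharsets.UTF_8);

    // Returns the first candidate that decodes the bytes without any
    // malformed or unmappable input, and stops testing at that point.
    static Optional<Charset> firstStrictMatch(byte[] data, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(data));
                return Optional.of(cs); // adequate match: skip remaining candidates
            } catch (CharacterCodingException e) {
                // This candidate cannot represent the bytes; try the next one.
            }
        }
        return Optional.empty(); // caller would fall back to full detection
    }
}
```

Strict decoding cannot tell single-byte encodings apart (ISO-8859-1 accepts any byte sequence), so a check like this could only serve as a fast path in front of statistical detection, not as a replacement for it.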

To turn off character set detection, I created a new class that is
essentially the TXTParser with character set detection removed. I then
replaced every instance of TXTParser in AutoDetectParser's map of parsers
with a text parser that does not determine the character set.
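
The core of such a detection-free text parser is simply decoding the stream with a charset chosen up front. A minimal JDK-only sketch of that step follows; it deliberately does not use Tika's Parser API, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class FixedCharsetText {

    // Decodes the whole stream with a caller-supplied charset.
    // No detection is attempted; a wrong assumption yields garbled text.
    static String readAll(InputStream in, Charset assumed) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, assumed)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

In an actual Tika parser the decoded characters would be passed on to the content handler rather than collected into a String.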

I'm left with the following questions:
- Can character set detection be sped up?
- If character set detection can't be sped up, is there an easier way to
turn it off?
- If character set detection can't be sped up and there isn't an easier way
to turn off character set detection, could an easier way to turn off
character set detection be added?

Thanks for your help,
Paul

Re: Faster charset detection or turn off charset detection?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Paul,

Thanks for providing some interesting statistics.

On Aug 12, 2010, at 12:37pm, Paul Jakubik wrote:

> I'm wondering if there is a way to turn off character set detection when
> parsing with the AutoDetectParser, or if there is a way to speed up
> character set detection.

There are ways to make it faster, yes. Mostly they involve changing the
underlying algorithm, which requires processing a significant amount
of text (currently it processes all of the text). Some related issues:

https://issues.apache.org/jira/browse/TIKA-322

https://issues.apache.org/jira/browse/TIKA-369
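
As a rough illustration of processing less than all of the text, the sketch below peeks at a bounded prefix of a stream and then rewinds it, so that a detector could be handed only that sample. This is an assumption about one possible approach, not necessarily what the issues above propose; the class name, method name, and the idea of a fixed byte limit are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class PrefixSample {

    // Reads at most `limit` bytes from the stream, then resets it so the
    // caller can still parse the full content afterwards.
    static byte[] peek(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] buf = new byte[limit];
        int total = 0;
        try {
            in.mark(limit);
            try {
                int n;
                while (total < limit
                        && (n = in.read(buf, total, limit - total)) != -1) {
                    total += n;
                }
            } finally {
                in.reset(); // rewind to the marked position
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(buf, total); // trimmed to the bytes actually read
    }
}
```

A stream that does not support mark/reset can be wrapped in a java.io.BufferedInputStream first. The trade-off is accuracy: a short sample may not contain enough distinctive bytes to identify the encoding.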

-- Ken


--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g