Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/01/24 17:54:17 UTC

[jira] Commented: (TIKA-354) ProfilingHandler should take a length-limiting parameter

    [ https://issues.apache.org/jira/browse/TIKA-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804259#action_12804259 ] 

Ken Krugler commented on TIKA-354:
----------------------------------

I'm working on speeding up language identification, since it's consuming close to 90% of the time during document parsing for my big web crawls.

I have some changes that make it about 2.2x faster on the test files, and a more significant change (data sampling) that should substantially reduce processing time for larger documents.
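
To illustrate the data sampling idea, here is a minimal sketch, not the actual patch. It assumes character trigrams as the profile unit and evenly spaced fixed-size chunks as the sampling scheme. The class SamplingSketch, its methods, and the maxChars and chunkSize parameters are hypothetical names, not part of the Tika API. Trigrams are counted inside each chunk separately so that no n-gram spans two chunks that were not adjacent in the original text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Tika.
final class SamplingSketch {

    // Counts trigrams over at most maxChars characters of text, taken as
    // evenly spaced chunks of chunkSize characters each.
    static Map<String, Integer> sampledTrigrams(CharSequence text, int maxChars, int chunkSize) {
        if (maxChars <= 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("maxChars and chunkSize must be positive");
        }
        Map<String, Integer> counts = new HashMap<>();
        int length = text.length();

        // Short documents are analyzed in full.
        if (length <= maxChars) {
            addTrigrams(text, 0, length, counts);
            return counts;
        }

        int size = Math.min(chunkSize, maxChars); // a chunk never exceeds the budget
        int chunks = maxChars / size;             // number of chunks that fit the budget
        int stride = length / chunks;             // distance between chunk starts (>= size)

        for (int i = 0; i < chunks; i++) {
            int start = i * stride;
            addTrigrams(text, start, start + size, counts);
        }
        return counts;
    }

    // Adds every trigram that lies fully inside [start, end) to counts.
    static void addTrigrams(CharSequence text, int start, int end, Map<String, Integer> counts) {
        for (int i = start; i + 3 <= end; i++) {
            counts.merge(text.subSequence(i, i + 3).toString(), 1, Integer::sum);
        }
    }
}
```

With maxChars set to something like 8192 (the 8K default suggested in the issue description), the cost of building the profile no longer grows with document size, at the price of a noisier profile.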

One problem is that the confidence threshold (for certainty) needs to be lowered a bit when text is sampled, at least for the unit tests to pass. But based on email conversations with Ted Dunning, this approach of using an absolute value doesn't work very well in principle, and fails badly for shorter documents. I've been looking at Ted's paper on a more sophisticated approach, and will open a separate issue to track that.


> ProfilingHandler should take a length-limiting parameter
> --------------------------------------------------------
>
>                 Key: TIKA-354
>                 URL: https://issues.apache.org/jira/browse/TIKA-354
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.5
>            Reporter: Vivek Magotra
>            Assignee: Ken Krugler
>
> ProfilingHandler currently parses the entire document (thereby analyzing n-grams for the entire doc).
> ProfilingHandler should take a length-limiting parameter that allows a user to specify the amount of data that should get analyzed.
> In fact, by default that limit should be set to something like 8K.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.