Posted to dev@tika.apache.org by "Shuai Liu (JIRA)" <ji...@apache.org> on 2015/01/15 04:23:34 UTC
[jira] [Commented] (TIKA-1517) MIME type selection with probability

    [ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278190#comment-14278190 ] 

Shuai Liu commented on TIKA-1517:
---------------------------------

Proposed design:
The idea is to attach probabilities, as weights, to each of the MIME type identification methods currently implemented in Tika (the magic-bytes approach, file extension matching, and the metadata content-type hint).

For example,
as a user, I would probably like to give preference to the method I trust the most, and to have the results ordered when the methods do not agree.
Bayes' rule seems appropriate here because it matches that intuition.
The following is what is needed to implement Bayes' rule.

> Prior probability P(file_type), e.g. P(pdf). In theory this is computed from samples, so it depends on the domain or use case. For example, if we crawl a website that contains mostly PDF files, we can collect some samples and compute the prior from them: if 90% of the sampled documents are PDFs, the prior is P(pdf) = 0.9. Here we propose to make the prior a configurable parameter, and by default to leave it "not applicable". Alternatively, we could give every file type the same prior, 1/[number of supported file types in Tika], which I think would be approximately 1/1157, and using this number seems fair. The reason to avoid it is that such a prior is the same for every type, and in the end we care more about the order of the results than about the actual numbers. Bringing 1/1157 into the Bayesian equation cannot change the order, and it only burdens the implementation with extra computation. So we leave the prior as "not applicable", which means we assign it the value 1, as if it did not exist. Because the parameter is configurable, we believe it still provides a lot of flexibility for use cases where a real prior is available.
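
To illustrate why a prior that is identical for every type can be dropped, here is a minimal Python sketch. The likelihood values and the `rank` helper are made up for illustration and are not part of Tika; only the 1/1157 figure comes from the estimate above.

```python
# Hypothetical likelihood scores for three candidate types (illustrative values only).
likelihoods = {"application/pdf": 0.34, "text/plain": 0.11, "application/zip": 0.03}


def rank(scores, prior):
    """Order candidate types by prior * likelihood, best first."""
    return sorted(scores, key=lambda t: prior * scores[t], reverse=True)


uniform_prior = 1 / 1157   # the same prior for every supported type
no_prior = 1.0             # the default that treats the prior as 1

# A prior shared by every type scales all scores equally, so the order is unchanged.
same_order = rank(likelihoods, uniform_prior) == rank(likelihoods, no_prior)
print(same_order)
```

A prior only affects the ranking when it differs between types, which is the case the configurable parameter is meant for.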


> Conditional probability of a positive test given a file type, P(test | file_type), e.g. P(test1 = pdf | pdf). This probability also depends on collected samples and on the domain or use case, so we leave it configurable. Based on our intuition, test1 (i.e. the magic-bytes method) is the most trustworthy, so the default value of P(test1 = a_file_type | a_file_type) is 0.75. That is, given a file whose type is "a_file_type", the probability that test1 predicts "a_file_type" is 0.75. Next we propose 0.7 for test3 and 0.65 for test2.
(Note again: test1 = magic bytes, test2 = file extension, test3 = metadata content-type hint.)

> Conditional probability of a positive test given a different file type (the "negative" case) also needs to be defined intuitively.
E.g. by default, given a file that is not a pdf, the probability that test1 predicts pdf is 1 - P(test1 = pdf | pdf), thus P(test1 = pdf | ~pdf) = 1 - 0.75 = 0.25, since we trust test1 the most. By the same intuition, the defaults for test2 and test3 are 0.35 and 0.3 respectively.
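
The defaults described so far could be written down as a small configuration. This is only a sketch: the dictionary keys and the `false_positive_rate` helper are hypothetical names, not existing Tika settings.

```python
# Default trust in each detection method: P(test = t | file really is t).
# The values are the defaults proposed above; the key names are hypothetical.
TRUE_POSITIVE = {
    "magic_bytes": 0.75,      # test1
    "file_extension": 0.65,   # test2
    "metadata_hint": 0.70,    # test3
}


def false_positive_rate(method):
    """P(test = t | file is not t), defined by default as 1 - P(test = t | t)."""
    return 1.0 - TRUE_POSITIVE[method]


for method in TRUE_POSITIVE:
    print(method, round(false_positive_rate(method), 2))
```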

 
>> The goal is to find
P(file_type | test1 = file_type, test2 = file_type, test3 = file_type)

(Please note: because we are mostly interested in the order of the choices rather than the exact values, we selectively drop some of the parameters used in Bayes' rule. Parameters that are not considered are set to 1 by default.)

For example, suppose that for a given file the 3 tests predict the following:
test1 = pdf
test2 = pdf
test3 = pdf

prior: P(pdf) = 1 and P(~pdf) = 1
P(test1=pdf|pdf) = 0.75
P(test2=pdf|pdf) = 0.65
P(test3=pdf|pdf) = 0.7
By the same intuition, we have the negative conditional probabilities
P(test1=pdf|~pdf) = 0.25
P(test2=pdf|~pdf) = 0.35
P(test3=pdf|~pdf) = 0.3

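Then we are ready to compute.
Our goal is P(pdf|test1=pdf, test2=pdf, test3=pdf)

P(pdf|test1=pdf, test2=pdf, test3=pdf) = [P(pdf) * P(test1=pdf|pdf) * P(test2=pdf|pdf) * P(test3=pdf|pdf)] / total probability

where, treating the tests as independent given the file type,

total probability = [P(pdf) * P(test1=pdf|pdf) * P(test2=pdf|pdf) * P(test3=pdf|pdf)] + [P(~pdf) * P(test1=pdf|~pdf) * P(test2=pdf|~pdf) * P(test3=pdf|~pdf)]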
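
As a worked check of the numbers above, here is a minimal Python sketch. It assumes the tests are independent given the file type and that the total probability is the sum of the pdf and ~pdf products; the `posterior` function is a hypothetical helper, not Tika code.

```python
def posterior(prior_pos, prior_neg, p_given_pos, p_given_neg):
    """Bayes' rule for 'type' vs 'not type', treating the tests as independent."""
    pos = prior_pos
    for p in p_given_pos:
        pos *= p
    neg = prior_neg
    for p in p_given_neg:
        neg *= p
    return pos / (pos + neg)   # pos + neg is the total probability


# All three tests predict pdf; both priors are left at 1 as in the example.
p_pdf = posterior(
    prior_pos=1.0, prior_neg=1.0,
    p_given_pos=[0.75, 0.65, 0.70],   # P(test_i = pdf | pdf)
    p_given_neg=[0.25, 0.35, 0.30],   # P(test_i = pdf | ~pdf)
)
print(round(p_pdf, 4))   # 0.9286
```

With the default values the numerator is 0.34125 and the total probability is 0.34125 + 0.02625 = 0.3675, so the result is 13/14, about 0.93. Setting both priors to 1 gives the same result as setting both to 0.5, because the common factor cancels.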


More examples of this will follow.
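
One possible way to extend the example to tests that disagree is sketched below in Python. The treatment of a disagreeing test is an assumption, not something stated in the proposal: for each candidate type t, a test that did not predict t contributes 1 - P(test = t | t) and 1 - P(test = t | ~t). All names are hypothetical.

```python
TP = {"test1": 0.75, "test2": 0.65, "test3": 0.70}   # P(test = t | t)
FP = {"test1": 0.25, "test2": 0.35, "test3": 0.30}   # P(test = t | ~t)


def score(candidate, predictions):
    """Posterior of 'candidate' vs 'not candidate', with both priors set to 1."""
    pos, neg = 1.0, 1.0
    for test, predicted in predictions.items():
        if predicted == candidate:
            pos *= TP[test]
            neg *= FP[test]
        else:
            # Assumption: a test that voted for another type counts as a miss.
            pos *= 1.0 - TP[test]
            neg *= 1.0 - FP[test]
    return pos / (pos + neg)


# Magic bytes say pdf; the extension and the metadata hint both say txt.
predictions = {"test1": "pdf", "test2": "txt", "test3": "txt"}
candidates = sorted(set(predictions.values()),
                    key=lambda t: score(t, predictions), reverse=True)
for t in candidates:
    print(t, round(score(t, predictions), 3))
```

Under these assumptions and the default weights, the two less trusted methods agreeing on txt outrank magic bytes voting for pdf alone (about 0.59 against 0.41). A user who wants magic bytes to win such a case would raise its weight.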




> MIME type selection with probability
> ------------------------------------
>
>                 Key: TIKA-1517
>                 URL: https://issues.apache.org/jira/browse/TIKA-1517
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
>            Reporter: Shuai Liu
>
> Problem and intuition
> The original implementation of MIME type determination is not very flexible, and it relies heavily on the outcome of magic bytes; e.g. if magic bytes are applicable to a file, Tika will follow the file type detected by magic bytes.
> This proposed approach incorporates Bayes' theorem in a lightweight way: users are able to assign weights, expressed as probabilities, to each of the MIME type identification methods available in Tika. There are currently 3 such methods (i.e. magic bytes, file extension and the metadata content-type hint). By putting weights on the methods, users choose which method they trust most, although the magic-bytes method is often the most trustworthy. The benefit is that in some situations file type identification is critical: some users might want all of the MIME type identification methods to arrive at the same file type before they start processing a file, because an incorrect identification is not tolerable. The current implementation seems less flexible and relies heavily on the magic-bytes method (although magic bytes are the most reliable of the 3 methods currently available in Tika).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)