Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2018/11/06 17:47:36 UTC

[Tika Wiki] Update of "CommonCrawl3" by TimothyAllison

https://wiki.apache.org/tika/CommonCrawl3

New page:
= Refreshing the Regression Corpus =

Since the last effort to refresh the regression corpus (TIKA-2038), CommonCrawl has added important metadata items to its indices, including ''mime-detected'', ''languages'', and ''charset''. I opened [[https://issues.apache.org/jira/browse/TIKA-2750|TIKA-2750]] to track progress on updating our corpus.

There are two primary goals for TIKA-2750: include more "interesting" files, and refetch some of the files that are truncated in CommonCrawl. I don't have a precise definition of "interesting," but the goal is broad coverage of file formats and languages.

While I recognize that the new metadata may contain errors, it allows for more accurate oversampling of the file formats and charsets that are of interest.

I started by downloading the 300 index files for September 2018's crawl, CC-MAIN-2018-39 (~226 GB).
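
For reference, fetching the shards can be scripted. Below is a minimal Python sketch, assuming the standard CommonCrawl layout (a ''cc-index.paths.gz'' manifest that lists the ''cdx-*.gz'' shards); the host and paths are assumptions, not the exact commands used here.

{{{
import gzip
import os
import urllib.request

# Assumed layout: each crawl publishes a gzipped manifest listing its ~300 cdx shards.
BASE = "https://data.commoncrawl.org/"
MANIFEST = BASE + "crawl-data/CC-MAIN-2018-39/cc-index.paths.gz"

def download_indices(dest_dir="cdx"):
    os.makedirs(dest_dir, exist_ok=True)
    with urllib.request.urlopen(MANIFEST) as resp:
        paths = gzip.decompress(resp.read()).decode("utf-8").splitlines()
    for path in paths:
        if "/indexes/cdx-" not in path:
            continue  # skip non-shard entries such as cluster.idx
        dest = os.path.join(dest_dir, os.path.basename(path))
        urllib.request.urlretrieve(BASE + path, dest)

if __name__ == "__main__":
    download_indices()
}}}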

Given the work on TIKA-2038 and its focus on country top-level domains (TLDs), I then counted the number of ''mimes'' and the number of charsets by TLD ([[https://issues.apache.org/jira/secure/attachment/12945796/CC-MAIN-2018-39-mimes-charsets-by-tld.zip|here]]).
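
Those tallies can be computed by streaming the gzipped index shards. Here is a minimal Python sketch, assuming the usual cdx-json line format (a SURT key, a timestamp, then a JSON payload that carries ''mime-detected''); the function name and the TLD extraction are illustrative:

{{{
import gzip
import json
from collections import Counter

def count_mimes_by_tld(cdx_path):
    """Tally (tld, mime-detected) pairs from one gzipped cdx shard."""
    counts = Counter()
    with gzip.open(cdx_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            # Each line is "<surt key> <timestamp> <json>"; the JSON starts at the first '{'.
            brace = line.find("{")
            if brace < 0:
                continue
            record = json.loads(line[brace:])
            # SURT keys reverse the host ("uk,co,example)/path"), so the TLD is
            # the first component; charset tallies work the same way on
            # record.get("charset").
            key = line.split(" ", 1)[0]
            tld = key.split(",")[0].split(")")[0]
            counts[(tld, record.get("mime-detected", ""))] += 1
    return counts
}}}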

I also counted the number of "detected mimes," and the top 10 are:
||mime||count||
||text/html||2,070,375,191||
||application/xhtml+xml||749,683,874||
||image/jpeg||6,207,029||
||application/pdf||4,128,740||
||application/rss+xml||3,495,173||
||application/atom+xml||2,868,625||
||application/xml||1,353,092||
||image/png||585,019||
||text/plain||492,429||
||text/calendar||470,624||

Given our interest in office-ish files, I chose to break the sampling into three passes:

 1. MSOffice and PDFs
 2. Other binaries
 3. HTML/Text

I wanted to keep the corpus below 1 TB and on the order of a few million files.

== MSOffice and PDFs ==
The top 10 file formats in this category were:

||mime||count||
||application/pdf||4,128,740||
||application/vnd.openxmlformats-officedocument.wordprocessingml.document||53,579||
||application/msword||52,087||
||application/rtf||22,509||
||application/vnd.ms-excel||22,067||
||application/vnd.openxmlformats-officedocument.spreadsheetml.sheet||16,290||
||application/vnd.oasis.opendocument.text||8,314||
||application/vnd.openxmlformats-officedocument.presentationml.presentation||6,835||
||application/vnd.ms-powerpoint||5,799||
||application/vnd.openxmlformats-officedocument.presentationml.slideshow||2,465||

{{{
-- query used to generate the counts above (PostgreSQL-style ilike/similar to)
select mime, sum(count) cnt
from detected_mimes
where
(mime ilike '%pdf%'
 OR
 mime similar to '%(word|doc|power|ppt|excel|xls|application.*access|outlook|msg|visio|rtf|iwork|pages|numbers|keynot)%'
)
group by mime
order by cnt desc;
}}}

Given how quickly the tail drops off, we could afford to take all of the non-PDFs.  For PDFs, we created a sampling frame by TLD.  
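
To make the sampling frame concrete, here is a minimal Python sketch of stratified sampling by TLD: group the PDF index records by TLD, then draw a quota from each stratum. The proportional-allocation-with-cap scheme and the function name are illustrative assumptions, not the exact frame used here.

{{{
import random
from collections import defaultdict

def sample_by_tld(records, total_target, cap_per_tld):
    """Stratified sample: bucket (tld, record) pairs by TLD, then draw per-TLD quotas."""
    by_tld = defaultdict(list)
    for tld, rec in records:
        by_tld[tld].append(rec)
    total = sum(len(recs) for recs in by_tld.values())
    sample = []
    for tld, recs in by_tld.items():
        # Proportional allocation, capped so no single TLD dominates the sample.
        quota = min(cap_per_tld, max(1, round(total_target * len(recs) / total)))
        sample.extend(random.sample(recs, min(quota, len(recs))))
    return sample
}}}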

== Other Binaries ==


== HTML/Text ==