Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2018/11/26 19:08:13 UTC

[Tika Wiki] Update of "ComparisonTikaAndPDFToText201811" by TimothyAllison

The "ComparisonTikaAndPDFToText201811" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811

New page:
= Overview =
After refreshing Tika's regression corpus (see and ), we thought it might be interesting to compare the text extracted by pdftotext with the text extracted by Tika.  Given that pdftotext does not extract content from embedded files, and given that it neither performs Optical Character Recognition (OCR) nor offers integration with OCR, this evaluation focused only on extracting the electronic text stored within PDFs.  The goals of this study include:

 1. Identify areas for improvements for PDFBox and pdftotext
 2. Identify areas for improvements for tika-eval

The reader should '''not''' read the following as a recommendation for one tool over another.

= Tools and Data =
 * pdftotext -- we downloaded the most recent available binaries, version 4.00.01, and we followed the directions to install all language modules (see [https://wiki.apache.org/tika/VirtualMachine#pdftotext]). We wrote a simple Groovy wrapper to call a new pdftotext process for every file; if no extract file was generated by pdftotext, the Groovy script generated a 0-byte file; also, we forced a timeout after 300 seconds (5 minutes).
 * Tika/PDFBox -- we used a snapshot version of Tika 1.20, which uses PDFBox 2.0.12.
 * Tika identified 528,618 PDF files in the new pull from Common Crawl. Many of these files are truncated, and 6,787 caused permission exceptions (these are either encrypted or they do not allow extraction of text).
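
The per-file wrapper described above can be sketched roughly as follows. This is a minimal Python sketch, not the Groovy script we actually ran: the function name `extract_with_timeout`, its parameters, and the status strings are hypothetical, and the only thing assumed about pdftotext is its basic `pdftotext PDF-file text-file` invocation.

```python
import subprocess
from pathlib import Path


def extract_with_timeout(pdf_path, out_path, cmd=("pdftotext",), timeout_s=300):
    """Run one extractor process for one file and always leave a file at out_path.

    Returns "ok", "error", or "timeout" (hypothetical status labels).
    """
    out_path = Path(out_path)
    status = "ok"
    try:
        # A new process per file, so one bad PDF cannot take down the whole run.
        proc = subprocess.run(
            [*cmd, str(pdf_path), str(out_path)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
        if proc.returncode != 0:
            status = "error"
    except subprocess.TimeoutExpired:
        status = "timeout"
    except OSError:
        # For example, the binary is not installed or not on the PATH.
        status = "error"

    # Mirror the behaviour described above: no extract file means a 0-byte file.
    if not out_path.exists():
        out_path.touch()
    return status
```

Passing the command as a parameter keeps the sketch testable without pdftotext installed; a real run would simply call `extract_with_timeout("a.pdf", "a.txt")` for each file.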

= Exceptions =
There were 58,077 empty files generated by our wrapper of pdftotext.  Given that pdftotext respects access permissions, that means that there were up to 51,290 runtime exceptions (the 58,077 empty files minus the 6,787 permission exceptions); of these, seven files caused timeouts. Some kind of content (number of tokens > 0) was extracted for 399,241 files; there were 371,485 extracts that contained >= 100 tokens.

Aside from the "permission exceptions", there were 38,158 files that caused a runtime exception for PDFBox.  Some kind of content (number of tokens > 0) was extracted for 428,706 files.  Note that for this evaluation, we used a standard default content handler that appends the title to the extracted content.  Therefore, we also report that there were 384,277 extracts that contained >= 100 tokens.
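
For readers who want to check the arithmetic, the counts reported in this section can be put side by side as in the sketch below. The counts are copied from the text above; the percentages are derived here and were not part of the original report, and the helper name `pct` is hypothetical.

```python
# Counts copied from the text above.
TOTAL_PDFS = 528_618
PERMISSION_EXCEPTIONS = 6_787

PDFTOTEXT_EMPTY_FILES = 58_077
PDFTOTEXT_WITH_TOKENS = 399_241
PDFTOTEXT_100_PLUS_TOKENS = 371_485

PDFBOX_RUNTIME_EXCEPTIONS = 38_158
PDFBOX_WITH_TOKENS = 428_706
PDFBOX_100_PLUS_TOKENS = 384_277

# Upper bound only: this assumes every permission exception produced an empty file.
max_pdftotext_runtime_exceptions = PDFTOTEXT_EMPTY_FILES - PERMISSION_EXCEPTIONS


def pct(count, total=TOTAL_PDFS):
    """Share of all identified PDFs, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    print("pdftotext runtime exceptions (upper bound):", max_pdftotext_runtime_exceptions)
    print("PDFBox runtime exceptions:", PDFBOX_RUNTIME_EXCEPTIONS)
    for label, count in [
        ("pdftotext, tokens > 0", PDFTOTEXT_WITH_TOKENS),
        ("pdftotext, tokens >= 100", PDFTOTEXT_100_PLUS_TOKENS),
        ("PDFBox, tokens > 0", PDFBOX_WITH_TOKENS),
        ("PDFBox, tokens >= 100", PDFBOX_100_PLUS_TOKENS),
    ]:
        print(f"{label}: {count} ({pct(count)}% of {TOTAL_PDFS})")
```

Note that the PDFBox counts include extracts where only the title was appended, so the "tokens > 0" figures for the two tools are not strictly comparable.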


= High Level Comparison =

== Languages ==

In the following, we show the top 20 language id pairs identified in the extracted text.  The first language is the one identified in the pdftotext extract, and the second is the one identified in the Tika/PDFBox extract.  For example, 'en->fa' means that language id returned 'en' on the pdftotext extract, but 'fa' on the Tika/PDFBox extract.

||Language id||Number of Files||
||en->en||143710||
||ru->ru||44460||
||fr->fr||38870||
||it->it||36428||
||de->de||30150||
||es->es||18334||
||ja->ja||12446||
||el->el||9760||
||fa->fa||8486||
||ko->ko||7574||
||zh-cn->zh-cn||5657||
||tr->tr||5472||
||null||3365||
||vi->vi||2912||
||he->he||2280||
||ar->ar||2087||
||el->ja||1705||
||ca->ca||1273||
||en->fa||1240||
||pt->pt||1106||

In the following, we show the top 10 language id pairs, where the language id differs between the extracts.

||Language ids||Number of Files||
||el->ja||1705||
||en->fa||1240||
||de->en||921||
||en->de||518||
||en->bn||392||
||ar->fa||391||
||it->en||209||
||en->it||208||
||fr->en||200||
||vi->ja||175||
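
A tally like the two tables above could be produced along the following lines. This is a sketch under assumptions, not the tika-eval implementation: it assumes the per-file language ids are already available as `(pdftotext_lang, pdfbox_lang)` tuples, the function names are hypothetical, and files without a language id (the 'null' row) are not modeled.

```python
from collections import Counter


def tally_pairs(lang_id_pairs):
    """Count how often each (pdftotext_lang, pdfbox_lang) pair occurs."""
    return Counter(lang_id_pairs)


def top_pairs(counts, n, differing_only=False):
    """Return the n most frequent pairs formatted as 'a->b', most frequent first.

    With differing_only=True, pairs where both tools agree are dropped.
    Ties are broken alphabetically so the output is deterministic.
    """
    items = [
        (pair, count)
        for pair, count in counts.items()
        if not differing_only or pair[0] != pair[1]
    ]
    items.sort(key=lambda item: (-item[1], item[0]))
    return [(f"{a}->{b}", count) for (a, b), count in items[:n]]


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the corpus.
    sample = [("en", "en"), ("en", "en"), ("el", "ja"), ("en", "fa"), ("el", "ja")]
    counts = tally_pairs(sample)
    for label, count in top_pairs(counts, 10, differing_only=True):
        print(f"||{label}||{count}||")
```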

= Overall improvements to this process =
 * The wrapper around pdftotext should have "caught" the error messages written to stderr and stored them, as we do with exceptions from Tika.
 * Tika currently includes the file's 'title' metadata in the content of the file.  This gives the misleading impression that some content was extracted from the file when, in fact, only the title was extracted from the XMP or metadata.  Next time, we should use a content handler that only includes the extracted text.
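
For the first point above, one way a wrapper could keep what the external process writes to stderr is sketched below. The function name `run_and_record`, the JSON record layout, and the status labels are all hypothetical; this is not how Tika stores its own exceptions.

```python
import json
import subprocess


def run_and_record(cmd, record_path, timeout_s=300):
    """Run cmd, then write its exit code and stderr to a small JSON record."""
    record = {"cmd": list(cmd), "status": "ok", "exit_code": None, "stderr": ""}
    try:
        proc = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,  # keep stderr instead of discarding it
            text=True,
            errors="replace",  # tolerate undecodable bytes in error messages
            timeout=timeout_s,
        )
        record["exit_code"] = proc.returncode
        record["stderr"] = proc.stderr
        if proc.returncode != 0:
            record["status"] = "error"
    except subprocess.TimeoutExpired:
        record["status"] = "timeout"
    except OSError as exc:
        # The process could not be started at all.
        record["status"] = "error"
        record["stderr"] = str(exc)

    with open(record_path, "w", encoding="utf-8") as fh:
        json.dump(record, fh)
    return record
```

Storing one record per input file would let the stderr messages be grouped and counted afterwards, in the same way as stack traces from Tika.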

= Improvements to tika-eval =
 * If there's an "extract exception", meaning an empty file or an incomplete json file, we include that information in the containers table, but we don't include that