Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2017/10/23 15:50:54 UTC

[Tika Wiki] Update of "TikaEvalOnVM" by TimothyAllison

The "TikaEvalOnVM" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEvalOnVM

New page:
'''ROUGH DRAFT OF HOW TO RUN TIKA_EVAL ON THE RACKSPACE VM'''

While users can run tika-eval on their own machines with their own documents, the Apache Tika, Apache PDFBox and Apache POI communities have gathered ~1TB of documents from govdocs1 and from Common Crawl to serve as a regression testing corpus. Before a release, we run both the last release and the candidate for the new release against this corpus and compare the results to identify potential regressions.

Rackspace generously hosts this VM, and we are extremely grateful.

This page is intended for committers/PMC members with access to the VM who want to run the regression tests. The example focuses on testing a SNAPSHOT version of PDFBox, but the steps are nearly identical for the full Tika eval or for subprojects.

== An Example with Apache PDFBox ==

 0. Clean up from any previous runs
   a. Remove tika-app.jar from /work/batch_apps/tika_working/lib
   b. Remove or rename /work/batch_apps/tika_working/logs
   c. Remove or rename /work/batch_apps/tika_working/nohup.out
 1. Run the current "A" version
   a. Place the "A" version of tika-app.jar in /work/batch_apps/tika_working/lib.
   b. Modify `appBatchExecutor.sh` to
    i. put the output in a new output directory `-o /data4/batch_runs/pdfboxA`
    ii. confirm that the correct file list is specified `-fileList pdf_files_single_col.txt`
   c. Execute: `nohup ./appBatchExecutor.sh`
   d. Wait for the "A" version to complete before starting the "B" version
 2. Build and run the "B" version
   a. Update PDFBox from SVN and build it with `mvn install`
   b. Update the PDFBox and Fontbox versions in the Tika project tika-parsers/pom.xml
   c. Run the PDFParser tests in tika-parsers/src/test/java/o.a.t.p.pdf.* (package org.apache.tika.parser.pdf) to make sure that the Tika unit tests still pass.
   d. Build the new tika-app.jar with `mvn install`
   e. Remove the "A" version of tika-app.jar from /work/batch_apps/tika_working/lib, rename nohup.out to nohup-A.out, and rename /work/batch_apps/tika_working/logs to /work/batch_apps/tika_working/logs-A
   f. Modify `appBatchExecutor.sh` to
    i. put the output in a new output directory `-o /data4/batch_runs/pdfboxB`
    ii. confirm that the correct file list is specified `-fileList pdf_files_single_col.txt`
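
The hand-off between the "A" and "B" runs (step 2.e) could be scripted roughly as follows. This is only a sketch, assuming a POSIX shell and the directory layout described above; `archive_run` is a hypothetical helper, not part of tika-eval or of `appBatchExecutor.sh`, and the `-A` suffix simply follows the naming used in step 2.e.

```shell
# Hypothetical helper: archive the artifacts of a finished run so that the
# next run starts with an empty lib/, no nohup.out and no logs/ directory.
#   $1 = working directory (assumed layout: lib/tika-app.jar, nohup.out, logs/)
#   $2 = label for the finished run, e.g. A
archive_run() {
    work_dir=$1
    label=$2

    # Refuse to overwrite the artifacts of an earlier archived run.
    if [ -e "$work_dir/nohup-$label.out" ] || [ -e "$work_dir/logs-$label" ]; then
        echo "archive for run $label already exists in $work_dir" >&2
        return 1
    fi

    # Remove the jar that was used for the finished run.
    rm -f "$work_dir/lib/tika-app.jar" || return 1

    # Keep the console output and the logs under labelled names.
    if [ -e "$work_dir/nohup.out" ]; then
        mv "$work_dir/nohup.out" "$work_dir/nohup-$label.out" || return 1
    fi
    if [ -d "$work_dir/logs" ]; then
        mv "$work_dir/logs" "$work_dir/logs-$label" || return 1
    fi
}
```

On the VM this would be called as `archive_run /work/batch_apps/tika_working A` once the "A" run has completed. The guard against existing `-A` artifacts is a design choice of this sketch, meant to avoid silently mixing the logs of two runs.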