Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2017/10/23 15:50:54 UTC
[Tika Wiki] Update of "TikaEvalOnVM" by TimothyAllison
The "TikaEvalOnVM" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEvalOnVM
New page:
'''ROUGH DRAFT OF HOW TO RUN TIKA_EVAL ON THE RACKSPACE VM'''
While users can run tika-eval on their own machines with their own documents, the Apache Tika, Apache PDFBox and Apache POI communities have gathered ~1TB of documents from govdocs1 and from Common Crawl to serve as a regression testing corpus. Before a release, we'll run the last release against the current release to identify potential regressions.
Rackspace generously hosts this VM, and we are extremely grateful.
This page is intended for committers/PMC members with access to the VM who want to run the regression tests. The example focuses on testing a SNAPSHOT version of PDFBox, but the steps are nearly identical for the full Tika eval or for subprojects.
== An Example with Apache PDFBox ==
0. Clean up from any previous runs
a. Remove tika-app.jar from /work/batch_apps/tika_working/lib
b. Remove or rename /work/batch_apps/tika_working/logs
c. Remove or rename /work/batch_apps/tika_working/nohup.out
1. Run the current "A" version
a. Place the "A" version of tika-app.jar in /work/batch_apps/tika_working/lib.
b. Modify `appBatchExecutor.sh` to
i. put the output in a new output directory `-o /data4/batch_runs/pdfboxA`
ii. confirm that the correct file list is specified `-fileList pdf_files_single_col.txt`
c. execute: `nohup ./appBatchExecutor.sh`
d. wait for the "A" version to complete before starting the "B" version
2. Build and run the "B" version
a. Update PDFBox from SVN and run `mvn install`
b. Update the PDFBox and Fontbox versions in the Tika project tika-parsers/pom.xml
c. Run the PDFParser tests in tika-parsers/src/test/java/o.a.t.p.pdf.* to make sure that the Tika unit tests pass.
d. Build the new tika-app.jar: `mvn install`
e. Remove the "A" version of tika-app.jar from /work/batch_apps/tika_working/lib, rename nohup.out to nohup-A.out, and rename /work/batch_apps/tika_working/logs to /work/batch_apps/tika_working/logs-A
f. Modify `appBatchExecutor.sh` to
i. put the output in a new output directory `-o /data4/batch_runs/pdfboxB`
ii. confirm that the correct file list is specified `-fileList pdf_files_single_col.txt`
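The housekeeping in steps 0, 1b, 2e and 2f is easy to get wrong by hand, so here is a minimal shell sketch of it. It is not a script that exists on the VM: the function names `clean_previous_run`, `archive_run_a` and `set_output_dir` and the `TIKA_WORKING` variable are hypothetical. It also assumes that the "A" jar is named `tika-app.jar` as in step 1a, and that `appBatchExecutor.sh` passes its output directory as a single ` -o <path>` argument with no spaces in the path. Check both assumptions against the real files before relying on it.

```shell
#!/bin/sh
# Sketch only. TIKA_WORKING defaults to the working directory named in the
# steps above; point it at a scratch directory to try the functions safely.
TIKA_WORKING="${TIKA_WORKING:-/work/batch_apps/tika_working}"

# Step 0: clean up from a previous run.
# This sketch removes the old logs and nohup.out instead of renaming them.
clean_previous_run() {
    rm -f "$TIKA_WORKING/lib/tika-app.jar"
    rm -rf "$TIKA_WORKING/logs"
    rm -f "$TIKA_WORKING/nohup.out"
}

# Step 2e: set the finished "A" run aside before starting "B".
# Assumption: the "A" jar is lib/tika-app.jar, as placed in step 1a.
# Note: if logs-A already exists, mv would move logs inside it.
archive_run_a() {
    rm -f "$TIKA_WORKING/lib/tika-app.jar"
    if [ -f "$TIKA_WORKING/nohup.out" ]; then
        mv "$TIKA_WORKING/nohup.out" "$TIKA_WORKING/nohup-A.out"
    fi
    if [ -d "$TIKA_WORKING/logs" ]; then
        mv "$TIKA_WORKING/logs" "$TIKA_WORKING/logs-A"
    fi
}

# Steps 1b.i and 2f.i: point the batch script at a new output directory.
# Assumption: the script contains a single " -o <path>" argument and
# neither path contains spaces or the characters "|" and "&".
set_output_dir() {
    script=$1
    new_dir=$2
    tmp=$(mktemp) || return 1
    # Rewrite the first " -o <path>" on each line, then copy the result back
    # with cat so the script keeps its executable permission.
    sed "s| -o  *[^ ]*| -o $new_dir|" "$script" > "$tmp" && cat "$tmp" > "$script"
    status=$?
    rm -f "$tmp"
    return $status
}
```

With these functions loaded, the hand-off between the two runs would be `archive_run_a` followed by `set_output_dir ./appBatchExecutor.sh /data4/batch_runs/pdfboxB`. It is still worth opening the script afterwards to confirm the `-o` and `-fileList` values before starting the run.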