Posted to dev@tika.apache.org by Iain Fraser <fr...@gmail.com> on 2020/03/09 00:56:30 UTC

Request to publish to Wiki

Hi there

My name is Iain Fraser and I'm a software developer from Perth in Western
Australia. Recently, I was tasked with improving the parsing performance of
an ASP.NET application that was using tika-app via the command line.
Without going into too much detail, the solution looks like running
tika-server and raising the maxMainMemoryBytes parameter of the
PDFParser. We needed to raise that value because parsing large files was
unacceptably slow, and this was a sticking point that kept the team from
adopting tika-server (tika-app exhibited no such issue for us).

I'm getting in touch today because I think I can help with documentation. I
was only able to arrive at the solution I did through intense googling,
reading message boards, experimentation and eventually, just reading the
source code. Perhaps I can help cut that process short for someone else in
the future?

Specifically, I had no idea that the config XML could accept <params>
elements under the <parser> element. Furthermore, I couldn't find any
documentation showing the parameters available and what they did. Through
my work, I have extracted a list of possible parameters for PDFParser as
well as comments from the implementing developer, which I'd really like to
document somewhere outside of source. I might also add something to your
"Troubleshooting Tika" section about raising main memory when you get
slow performance with large or complex PDF files (with some sample XML),
and perhaps a note to Windows users about why their config.xml files might
not work when they extract them from the app via the command line (check
that the encoding is actually UTF-8; PowerShell outputs UTF-16).
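
As an illustration of the kind of sample XML meant here, a minimal sketch
of a config file that passes maxMainMemoryBytes to the PDFParser might look
like the following. The value is an arbitrary placeholder, not a
recommendation. The layout (excluding PDFParser from the DefaultParser,
declaring it separately with a <params> block, and using type="long") is an
assumption based on the Tika 1.x config format and should be checked against
the version in use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: element layout assumed from the Tika 1.x config format. -->
<properties>
  <parsers>
    <!-- Keep the default parsers, but leave PDFParser out so it can be
         configured on its own below. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Placeholder value (1 GiB in bytes); tune for your workload. -->
        <param name="maxMainMemoryBytes" type="long">1073741824</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The file itself needs to be saved as UTF-8 to match its XML declaration,
which is where the PowerShell encoding issue mentioned above comes in.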

To do this, I would need create/edit access to the Tika space in your
Confluence instance. Would it be possible to arrange this, please? I
already have an account there under my name and this email address
(fraser.iain@gmail.com).

Kind regards
Iain