Posted to dev@tika.apache.org by "Jens Emil Schulz Østergaard (Jira)" <ji...@apache.org> on 2021/08/26 11:39:00 UTC
[jira] [Created] (TIKA-3538) TikaServer, cancelling request client-side does not kill working OCR process
Jens Emil Schulz Østergaard created TIKA-3538:
-------------------------------------------------
Summary: TikaServer, cancelling request client-side does not kill working OCR process
Key: TIKA-3538
URL: https://issues.apache.org/jira/browse/TIKA-3538
Project: Tika
Issue Type: Bug
Components: server
Affects Versions: 2.0.0-BETA
Environment: OS: ArcoLinux
Kernel: 5.10.60-1-lts
CPU: Intel i5-8400 (6) @ 4.000GHz
Memory: 32Gb
Reporter: Jens Emil Schulz Østergaard
Attachments: tika_error.log
It appears that cancelling a request does not stop the work in Tika. The handler finishes the job and only fails when it attempts to return the data.
I would have expected Tika to detect client-side cancellations and propagate them to the relevant child processes, such as tesseract, thus avoiding unnecessary work.
I send a request as shown below, where FILE is a PDF that has inline images and requires OCR.
{code:bash}
curl -T "$FILE" \
  -s "http://localhost:9998/tika/text" \
  -H "Accept: application/json" \
  -H "X-Tika-OCRLanguage: dan+eng" \
  -H "X-Tika-PDFextractInlineImages: true"
{code}
Then I press Ctrl-C before the response is returned.
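The manual steps above can probably be scripted for easier reproduction. The following is only a sketch: cancel_after is a hypothetical helper (not part of Tika or curl) that starts a command in the background and terminates it after a delay. It sends SIGTERM rather than the SIGINT that Ctrl-C produces, because background jobs in a non-interactive shell ignore SIGINT. For the server the effect should be the same, since the client closes the connection either way. The 5 second delay and the use of docker-compose top in the commented usage lines are assumptions about the setup described here.

```bash
#!/usr/bin/env bash

# Hypothetical helper: run a command in the background, wait <delay> seconds,
# then terminate it to mimic a client-side cancellation.
cancel_after() {
  local delay="$1"
  shift
  "$@" &                                # start the client request
  local pid=$!
  sleep "$delay"
  kill -TERM "$pid" 2>/dev/null || true # ignore "already exited"
  wait "$pid" 2>/dev/null || true       # reap it; exit status is not of interest
  return 0
}

# Usage (assumes the docker-compose setup from this report is running):
#   cancel_after 5 curl -T "$FILE" -s "http://localhost:9998/tika/text" \
#     -H "X-Tika-OCRLanguage: dan+eng" -H "X-Tika-PDFextractInlineImages: true"
#   docker-compose top tika   # check whether a tesseract process is still listed
```

If a tesseract process still shows up in the process listing after the client has gone away, that would match the behaviour described in this report.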
Dockerfile:
{code:bash}
FROM apache/tika:2.0.0-full
RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install tesseract-ocr-dan
{code}
docker-compose.yaml:
{noformat}
version: "3.9"
services:
  tika:
    build: tika/
    ports:
      - "9998:9998"
{noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)