Posted to dev@tika.apache.org by "Jens Emil Schulz Østergaard (Jira)" <ji...@apache.org> on 2021/08/26 11:39:00 UTC

[jira] [Created] (TIKA-3538) TikaServer, cancelling request client-side does not kill working OCR process

Jens Emil Schulz Østergaard created TIKA-3538:
-------------------------------------------------

             Summary: TikaServer, cancelling request client-side does not kill working OCR process
                 Key: TIKA-3538
                 URL: https://issues.apache.org/jira/browse/TIKA-3538
             Project: Tika
          Issue Type: Bug
          Components: server
    Affects Versions: 2.0.0-BETA
         Environment: OS: ArcoLinux
                      Kernel: 5.10.60-1-lts
                      CPU: Intel i5-8400 (6) @ 4.000GHz
                      Memory: 32 GB
            Reporter: Jens Emil Schulz Østergaard
         Attachments: tika_error.log

It appears that canceling a request client-side does not stop the work in Tika Server. The handler finishes the job and only then fails, when it attempts to return the data.

I would have expected Tika to detect client-side cancellations and propagate them to the relevant child processes, such as Tesseract, thus avoiding unnecessary work.

I send a request as follows, where FILE is a PDF that has inline images and requires OCR.

{code:bash}
curl -s -T "$FILE" "http://localhost:9998/tika/text" \
     -H "Accept: application/json" \
     -H "X-Tika-OCRLanguage: dan+eng" \
     -H "X-Tika-PDFextractInlineImages: true"
{code}

I then press Ctrl-C before the response is returned.
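
The manual Ctrl-C step can probably be scripted to make the reproduction repeatable. Below is a minimal sketch, not a verified test harness. The helper names cancel_after and ocr_still_running are hypothetical, and the check assumes the OCR worker shows up as a process named "tesseract" that is visible to pgrep from where the script runs (for example, on a Linux host running the container).

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a client command in the background and abort it
# after DELAY seconds, mimicking a client-side cancellation.
cancel_after() {
  local delay="$1"; shift
  "$@" &                        # start the client (e.g. curl)
  local pid=$!
  sleep "$delay"
  # SIGTERM rather than SIGINT: background jobs started by a
  # non-interactive shell ignore SIGINT, so kill -INT may do nothing.
  kill -TERM "$pid" 2>/dev/null
  wait "$pid" 2>/dev/null       # reap the client; its exit status is ignored
  return 0
}

# Hypothetical check: succeeds if a process named exactly "tesseract" exists.
# Assumption: the OCR process is visible from this shell.
ocr_still_running() {
  pgrep -x tesseract >/dev/null
}

# Example use against the setup in this report (not run automatically):
#   cancel_after 5 curl -s -T "$FILE" "http://localhost:9998/tika/text" \
#       -H "X-Tika-OCRLanguage: dan+eng" \
#       -H "X-Tika-PDFextractInlineImages: true"
#   sleep 2
#   if ocr_still_running; then echo "tesseract still running after cancel"; fi
```

If the final check reports a running tesseract process a few seconds after the client was aborted, that would match the behavior described above. The 5 and 2 second delays are arbitrary and would need tuning to the size of the PDF.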


Dockerfile:
{code:bash}
FROM apache/tika:2.0.0-full

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get -y install tesseract-ocr-dan
{code}

docker-compose.yaml:
{noformat}
version: "3.9"


services:
  tika:
    build: tika/
    ports:
      - "9998:9998"
{noformat}


 


