Posted to dev@tika.apache.org by "Konstantin Gribov (Jira)" <ji...@apache.org> on 2021/04/24 01:21:00 UTC

[jira] [Updated] (TIKA-3369) Flaky Tesseract OCR confirmMultiPageTiffHandling test

     [ https://issues.apache.org/jira/browse/TIKA-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov updated TIKA-3369:
------------------------------------
    Description: 
Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with

{noformat}
[ERROR]   TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79 Page 2 not found in:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Type-Parser-Override" content="image/ocr-tiff" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser" />
<title></title>
</head>
<body><div class="ocr">Multipage
TIFF
Example
Page 1
Multipage
TIFF
Example
Page?2
</div>
</body></html>
{noformat}

Note that Tesseract extracts {{Page?2}} instead of {{Page 2}}.

  was:
Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with

{noformat}
[ERROR]   TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79 Page 2 not found in:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Type-Parser-Override" content="image/ocr-tiff" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser" />
<title></title>
</head>
<body><div class="ocr">Multipage
TIFF
Example
Page 1
Multipage
TIFF
Example
Page?2
</div>
</body></html>
{noformat}



> Flaky Tesseract OCR confirmMultiPageTiffHandling test
> -----------------------------------------------------
>
>                 Key: TIKA-3369
>                 URL: https://issues.apache.org/jira/browse/TIKA-3369
>             Project: Tika
>          Issue Type: Test
>          Components: ocr
>    Affects Versions: 2.0.0
>         Environment: Arch Linux, kernel: 5.11.16-arch1-1 #1 SMP PREEMPT Wed, 21 Apr 2021 17:22:13 +0000 x86_64 GNU/Linux
> OpenJDK 15.0.2.u7-1
> Tesseract 4.1.1-5 with icu 69.1-1, cairo 1.17.4-5, pango 1:1.48.4-1, tesseract-data-{eng,deu,fra,rus,ukr} 2:4.0.0-1 (other languages not installed)
>            Reporter: Konstantin Gribov
>            Priority: Minor
>
> Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with
> {noformat}
> [ERROR]   TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79 Page 2 not found in:
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Type-Parser-Override" content="image/ocr-tiff" />
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser" />
> <title></title>
> </head>
> <body><div class="ocr">Multipage
> TIFF
> Example
> Page 1
> Multipage
> TIFF
> Example
> Page?2
> </div>
> </body></html>
> {noformat}
> Note that Tesseract extracts {{Page?2}} instead of {{Page 2}}.
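
For illustration only, here is a minimal sketch of how a test could tolerate this kind of single-character misread between words. It is not Tika's existing TikaTest.assertContains, and the class and method names (OcrAssertSketch, containsLenient) are hypothetical. The assumption is that the OCR output may replace the whitespace between two words with one arbitrary character, as in the Page?2 output above.

```java
import java.util.regex.Pattern;

public class OcrAssertSketch {

    /**
     * Hypothetical helper (not part of Tika): returns true if haystack
     * contains the words of needle in order, where each gap between words
     * is either a run of whitespace or exactly one non-whitespace character.
     */
    public static boolean containsLenient(String needle, String haystack) {
        String[] words = needle.trim().split("\\s+");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                // Gap between words: whitespace, or one misread character
                regex.append("(?:\\s+|\\S)");
            }
            // Quote each word so it is matched literally
            regex.append(Pattern.quote(words[i]));
        }
        return Pattern.compile(regex.toString()).matcher(haystack).find();
    }

    public static void main(String[] args) {
        // OCR text copied from the failing output in this issue
        String ocr = "Multipage\nTIFF\nExample\nPage 1\n"
                + "Multipage\nTIFF\nExample\nPage?2\n";
        System.out.println(containsLenient("Page 1", ocr));
        System.out.println(containsLenient("Page 2", ocr));
        System.out.println(containsLenient("Page 3", ocr));
    }
}
```

The trade-off is that such a check would also accept strings like Pages2, so whether loosening the assertion is acceptable for this test is a judgment call.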


