Posted to dev@nutch.apache.org by "eldk (JIRA)" <ji...@apache.org> on 2016/03/15 16:12:33 UTC
[jira] [Comment Edited] (NUTCH-2138) Tika cannot OCR embedded images from PDF

    [ https://issues.apache.org/jira/browse/NUTCH-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195439#comment-15195439 ] 

eldk edited comment on NUTCH-2138 at 3/15/16 3:11 PM:
------------------------------------------------------

Hello,

OCR for images embedded in a PDF is still not working with Nutch 1.11 (lib/tika-core-1.11.jar, plugins/parse-tika/tika-parsers-1.11.jar). The parsechecker run below succeeds but returns an empty ParseText.

tesseract -v
tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

bin/nutch parsechecker -dumpText http://domain.tld/file.pdf

fetching: http://domain.tld/file.pdf
robots.txt whitelist not configured.
parsing: http://domain.tld/file.pdf
contentType: application/pdf
signature: af00322e75c5eb43085df668f2faca2f
---------
Url
---------------

http://domain.tld/file.pdf
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: 
Outlinks: 0
Content Metadata: nutch.fetch.time=1458053976974 Age=0 Content-Language=fr-FR Served-by=domain.tld Content-Length=5052242 Content-Transfer-Encoding=binary Expires=Tue, 15 Mar 2016 15:09:37 GMT Last-Modified=Fri, 12 Jun 2015 14:58:13 GMT Set-Cookie=eZSESSID=6ns8c06tnu40kd3ohfpl6vnrj5; path=/ Connection=close X-Cache=Miss from Varnish Server=nginx X-Powered-By=eZ Publish Cache-Control= Pragma= X-Varnish=1186703160 Date=Tue, 15 Mar 2016 14:59:37 GMT Content-Disposition=inline; filename="file.pdf" nutch.crawl.score=0.0 Via=1.1 varnish Accept-Ranges=bytes Content-Type=application/pdf 
Parse Metadata: access_permission:extract_for_accessibility=true meta:save-date=2015-06-12T14:47:32Z dcterms:created=2015-06-12T14:47:32Z date=2015-06-12T14:47:32Z access_permission:can_modify=true access_permission:modify_annotations=true Creation-Date=2015-06-12T14:47:32Z created=Fri Jun 12 16:47:32 CEST 2015 access_permission:fill_in_form=true access_permission:can_print=true dc:format=application/pdf; version=1.4 xmp:CreatorTool=RICOH MP 3353 Last-Save-Date=2015-06-12T14:47:32Z access_permission:assemble_document=true meta:creation-date=2015-06-12T14:47:32Z dcterms:modified=2015-06-12T14:47:32Z Last-Modified=2015-06-12T14:47:32Z pdf:PDFVersion=1.4 modified=2015-06-12T14:47:32Z xmpTPg:NPages=45 access_permission:can_print_degraded=true pdf:encrypted=false access_permission:extract_content=true producer=RICOH MP 3353 Content-Type=application/pdf 
---------
ParseText
---------


Thanks,

Eric

https://tika.apache.org/1.11/gettingstarted.html
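
The symptom in the parsechecker run above is a successful parse status with nothing under the ParseText header. A minimal sketch of how that check could be automated on captured output follows; the function name `parse_text_section` is hypothetical, and the header layout (a `ParseText` line followed by a line of dashes) is assumed from the output pasted in this comment, not from the Nutch source.

```python
def parse_text_section(output):
    """Return the text that follows the 'ParseText' header, stripped.

    Assumes the layout shown in the pasted parsechecker output:
    a line reading 'ParseText' followed by a line made only of dashes.
    """
    lines = output.splitlines()
    for i, line in enumerate(lines[:-1]):
        underline = lines[i + 1].strip()
        if line.strip() == "ParseText" and underline and set(underline) == {"-"}:
            # Everything after the dashed underline is the extracted text.
            return "\n".join(lines[i + 2:]).strip()
    raise ValueError("no ParseText section found")


def ocr_produced_text(output):
    """True if the ParseText section holds any non-whitespace text."""
    return bool(parse_text_section(output))
```

Fed the output of `bin/nutch parsechecker -dumpText <url>`, `ocr_produced_text` would return False for the run shown above, since only blank lines follow the ParseText header.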


> Tika cannot OCR embedded images from PDF
> ----------------------------------------
>
>                 Key: NUTCH-2138
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2138
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.10
>         Environment: Nutch v1.10
> openjdk version "1.8.0_60-internal"
> Debian 7.8
> Tika 1.8 or Tika 1.10
>            Reporter: jean blue
>
> Tika 1.10 can OCR images embedded in a PDF when PDFParser.properties is modified accordingly inside tika-app-1.10.jar, but parse-tika does not when the same modifications are made inside runtime/local/plugins/parse-tika/tika-parsers-1.10.jar.
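
Since the report hinges on a PDFParser.properties edit made inside a jar, it may help to confirm that the jar Nutch actually loads contains the edited file. The following is a rough sketch, not part of Nutch or Tika. It assumes the file lives at `org/apache/tika/parser/pdf/PDFParser.properties` inside the jar and that the relevant key is `extractInlineImages`; both should be checked against the Tika version in use. The function names are hypothetical.

```python
import re
import zipfile

# Assumed location of the defaults file inside tika-parsers-*.jar.
PROPS_PATH = "org/apache/tika/parser/pdf/PDFParser.properties"


def read_pdf_parser_properties(jar_path, props_path=PROPS_PATH):
    """Return the key/value pairs of PDFParser.properties stored in a jar."""
    with zipfile.ZipFile(jar_path) as jar:
        text = jar.read(props_path).decode("utf-8")
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and Java properties comments.
        if not line or line.startswith(("#", "!")):
            continue
        # Java properties accept '=', ':' or whitespace as the separator.
        parts = re.split(r"\s*[=:]\s*|\s+", line, maxsplit=1)
        props[parts[0]] = parts[1] if len(parts) > 1 else ""
    return props


def inline_images_enabled(jar_path):
    """True if the jar's properties file sets extractInlineImages to true."""
    props = read_pdf_parser_properties(jar_path)
    return props.get("extractInlineImages", "false").lower() == "true"
```

Calling `inline_images_enabled("runtime/local/plugins/parse-tika/tika-parsers-1.10.jar")` would show whether the edit survived repackaging. A True result only confirms the file contents; it does not show that parse-tika reads this file or that Tesseract is reachable from the Nutch process.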


