Posted to solr-user@lucene.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2016/01/05 16:01:24 UTC

RE: Unable to extract images content (OCR) from PDF files using Solr

I concur with Erick and Upayavira that it is best to keep Tika in a separate JVM...well, ideally a separate box or rack or even data center [0][1]. :)
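
For anyone wondering what the separate-JVM approach might look like, here is a rough sketch, not a tested recipe. It shells out to the Tika CLI (the same "-t" invocation Damien shows further down) and builds a JSON update for Solr. The jar path, the Solr URL and core name, and the use of attr_content as the target field are all assumptions for illustration; the jar is assumed to be one that already extracts inline images as described in the thread.

```python
import json
import subprocess
import urllib.request


def extract_text(pdf_path, tika_jar="tika-app-1.7.jar"):
    """Run the Tika CLI in its own JVM and return the extracted plain text.

    tika_jar is an assumed path; point it at whatever Tika app jar you use.
    """
    result = subprocess.run(
        ["java", "-jar", tika_jar, "-t", pdf_path],
        capture_output=True,
        check=True,  # raise if the Tika process exits with an error
    )
    # The CLI's output encoding depends on the platform; replace bad bytes.
    return result.stdout.decode("utf-8", errors="replace")


def build_update_request(doc_id, text, solr_update_url):
    """Build (but do not send) a JSON update request for one document.

    The field name attr_content is an assumption borrowed from the thread;
    use whatever field your schema defines.
    """
    body = json.dumps([{"id": doc_id, "attr_content": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_pdf(doc_id, pdf_path, solr_update_url):
    """Extract text outside Solr, then post the raw content to Solr."""
    text = extract_text(pdf_path)
    request = build_update_request(doc_id, text, solr_update_url)
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical usage (URL and core name are placeholders):
# index_pdf("ocrpdf8", "/data/docs/imagepdf.pdf",
#           "http://localhost:8983/solr/mycore/update?commit=true")
```

The point of splitting extraction from indexing this way is that a Tika crash or a slow OCR run only affects the ingestion process, not the Solr JVM.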

But seriously, if you're using DIH/SolrCell, you have to configure Tika to parse documents recursively.  This was made possible in SOLR-7189...see the test case/patch [2] for how to configure this.  Given that this is the behavior that most people probably expect, we may want to modify the default setting in DIH; this may be a major/breaking default change, though.

As always, please ping the Tika users list if you have any questions.

Looks like we should update our wiki [3] to include guidance on OCR'ing embedded images.

[0] http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
[1] http://events.linuxfoundation.org/sites/events/files/slides/WhatsNewWithApacheTika.pdf
[2] https://issues.apache.org/jira/browse/SOLR-7189
[3] https://wiki.apache.org/tika/TikaOCR

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Thursday, December 24, 2015 2:52 PM
To: solr-user <so...@lucene.apache.org>
Subject: Re: Unable to extract images content (OCR) from PDF files using Solr

Here's an example of what Upayavira is talking about.
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

It has some RDBMS bits, but you can take those out.

Best,
Erick

On Wed, Dec 23, 2015 at 1:27 AM, Upayavira <uv...@odoko.co.uk> wrote:
> If your needs of Tika fall outside of those provided by the embedded 
> Tika, I would suggest you include Tika in your own ingestion pipeline, 
> and just post raw content to Solr. This will probably perform better 
> anyway, as you are otherwise using up valuable Solr resources to do 
> your extraction work, and, as you are seeing, have far less control 
> over what happens inside than you would if Tika was consumed by your 
> own application.
>
> Upayavira
>
> On Wed, Dec 23, 2015, at 03:11 AM, Zheng Lin Edwin Yeo wrote:
>> Hi,
>>
>> I'm facing the same issue you ran into 2 months back: I can extract 
>> the image content when the images are in .jpg or .png format, but 
>> not from images embedded in a PDF, even after setting 
>> "extractInlineImages true" in PDFParser.properties.
>>
>> Have you managed to find alternative solutions to this problem?
>>
>> Regards,
>> Edwin
>>
>> On 22 October 2015 at 18:05, Damien Picard <pi...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I'm using Solr 5.3.0 on Red Hat EL 7 and I'm trying to extract content 
>> > from PDF, Word, LibreOffice, etc. docs using the ExtractingRequestHandler.
>> >
>> > Everything works fine, except when I want to extract content from 
>> > images embedded in PDF/Word etc. documents:
>> >
>> > I send an extract request like this:
>> > POST /update/extract?literal.id=ocrpdf8&fmap.content=attr_content&uprefix=attr_
>> >
>> > In attr_content, I get :
>> > \n \n date 2015-08-28T13:23:03Z \n
>> > pdf:PDFVersion 1.4 \n
>> > xmp:CreatorTool PDFCreator Version 1.2.3 \n  stream_content_type 
>> > application/pdf \n  Keywords \n  subject \n  dc:creator S050735 \n  
>> > dcterms:created 2015-08-28T13:23:03Z \n  Last-Modified 
>> > 2015-08-28T13:23:03Z \n  dcterms:modified 2015-08-28T13:23:03Z \n  
>> > dc:format application/pdf; version=1.4 \n  Last-Save-Date 
>> > 2015-08-28T13:23:03Z \n  stream_name imagepdf.pdf \n  
>> > meta:save-date 2015-08-28T13:23:03Z \n  pdf:encrypted false \n  
>> > dc:title imagepdf \n  modified 2015-08-28T13:23:03Z \n  cp:subject 
>> > \n  Content-Type application/pdf \n  stream_size 423660 \n  
>> > X-Parsed-By org.apache.tika.parser.DefaultParser \n  X-Parsed-By 
>> > org.apache.tika.parser.pdf.PDFParser \n  creator S050735 \n  
>> > meta:author S050735 \n  dc:subject \n  meta:creation-date 
>> > 2015-08-28T13:23:03Z \n  stream_source_info the-file \n  created 
>> > Fri Aug 28 13:23:03 UTC 2015 \n  xmpTPg:NPages 1 \n  Creation-Date 
>> > 2015-08-28T13:23:03Z \n  meta:keyword \n  Author S050735 \n  
>> > producer GPL Ghostscript 9.04 \n  imagepdf \n  \n  page \n  Page 1 
>> > sur 1\n \n
>> >  28/08/2015
>> > http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4...
>> > \n \n embedded:image0.jpg image0.jpg embedded:image1.jpg image1.jpg 
>> > embedded:image2.jpg image2.jpg \n
>> >
>> > So Tika works fine, but it doesn't apply OCR to the embedded 
>> > images.
>> >
>> > When I post an image (JPG) to /update/extract, its content is 
>> > indexed through Tesseract OCR (attr_content field):
>> > \n \n stream_size 55422 \n
>> >  X-Parsed-By org.apache.tika.parser.DefaultParser \n  X-Parsed-By 
>> > org.apache.tika.parser.ocr.TesseractOCRParser \n  
>> > stream_content_type image/jpeg \n  stream_name OM_1.jpg \n  
>> > stream_source_info the-file \n  Content-Type image/jpeg \n \n \n  ‘ 
>> > '\"I“ \" \"' ./\nlrast. Shortly before the classes started I was 
>> > visiting a.\ncertain public school, a school set in a typically 
>> > English\ncountryside, which on the June clay of my visit was 
>> > wonder-\nfully beauliful. The Head Master—-no less typical than 
>> > his\nschool and the country-side—pointed out the charms of\nboth, 
>> > and his pride came out in the final remark which he made\nbeforehe 
>> > left me. He explained that he had a class to take\nin'I'heocritus. 
>> > Then (with a. buoyant gesture); “ Can you\n\n, conceive anything 
>> > more delightful than a class in Theocritus,\n\non such a day and in 
>> > such a place?\"\n\n \n \n \n stream_size 55422 \n X-Parsed-By 
>> > org.apache.tika.parser.DefaultParser \n X-Parsed-By 
>> > org.apache.tika.parser.ocr.TesseractOCRParser \n X-Parsed-By 
>> > org.apache.tika.parser.jpeg.JpegParser \n stream_content_type 
>> > image/jpeg \n Resolution Units inch \n stream_source_info the-file 
>> > \n Compression Type Progressive, Huffman \n Data Precision 8 bits 
>> > \n Number of Components 3 \n tiff:ImageLength 286 \n Component 2 Cb 
>> > component: Quantization table 1, Sampling factors 1 horiz/1 vert \n 
>> > Component 1 Y component: Quantization table 0, Sampling factors 2 
>> > horiz/2 vert \n Image Height 286 pixels \n X Resolution 72 dots \n 
>> > Image Width 690 pixels \n stream_name OM_1.jpg \n Component 3 Cr 
>> > component: Quantization table 1, Sampling factors 1 horiz/1 vert \n 
>> > tiff:BitsPerSample 8 \n tiff:ImageWidth 690 \n Content-Type 
>> > image/jpeg \n Y Resolution 72 dots
>> >
>> > I saw in the Tika JIRA that I have to enable extractInlineImages in 
>> > org/apache/tika/parser/pdf/PDFParser.properties to force image 
>> > extraction from PDFs. So I did that, and packaged a tika-app-1.7.jar 
>> > containing a tika-parsers-1.7.jar in which this property is set to true.
>> > Then I tested my Tika JAR from the CLI:
>> >
>> > # java -jar tika-app-1.7.jar -t /data/docs/imagepdf.pdf
>> >
>> > In this case, I get the content of the images:
>> >
>> >
>> > Page 1 sur 1
>> >
>> > 28/08/2015
>> > http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4...
>> >
>> > Simple Evan!
>> > Use Case
>> > Sdsedulet
>> >
>> > So I replaced solr/contrib/extraction/lib/tika-parsers-1.7.jar 
>> > with my modified one, but the images in my PDF are still not extracted.
>> >
>> > Does anybody know what I'm doing wrong ?
>> >
>> > Thank you.
>> >
>> > --
>> > Damien Picard
>> > Expert GWT
>> > <http://www.editions-eni.fr/livres/gwt-google-web-toolkit-developpez-des-applications-internet-riches-ria-en-java/.97a1a26e7d5be94763fc45ac2a1e961a.html>
>> > Mob : 06 11 51 47 78
>> >