Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2015/12/23 04:11:16 UTC

Re: Unable to extract images content (OCR) from PDF files using Solr

Hi,

I'm facing the same issue you reported two months ago: I can extract the
image content from .jpg or .png files, but not from images embedded in a
PDF, even after setting "extractInlineImages true" in
PDFParser.properties.

Have you managed to find alternative solutions to this problem?

Regards,
Edwin

On 22 October 2015 at 18:05, Damien Picard <pi...@gmail.com> wrote:

> Hi,
>
> I'm using Solr 5.3.0 on Red Hat EL 7 and I am trying to extract content
> from PDF, Word, LibreOffice, etc. documents using the
> ExtractingRequestHandler.
>
> Everything works fine, except when I want to extract content from images
> embedded in PDF/Word etc. documents:
>
> I send an extract request like this:
> POST /update/extract?literal.id=ocrpdf8&fmap.content=attr_content&uprefix=attr_
>
> In attr_content, I get:
> \n \n date 2015-08-28T13:23:03Z \n
> pdf:PDFVersion 1.4 \n
> xmp:CreatorTool PDFCreator Version 1.2.3 \n
>  stream_content_type application/pdf \n
>  Keywords \n
>  subject \n
>  dc:creator S050735 \n
>  dcterms:created 2015-08-28T13:23:03Z \n
>  Last-Modified 2015-08-28T13:23:03Z \n
>  dcterms:modified 2015-08-28T13:23:03Z \n
>  dc:format application/pdf; version=1.4 \n
>  Last-Save-Date 2015-08-28T13:23:03Z \n
>  stream_name imagepdf.pdf \n
>  meta:save-date 2015-08-28T13:23:03Z \n
>  pdf:encrypted false \n
>  dc:title imagepdf \n
>  modified 2015-08-28T13:23:03Z \n
>  cp:subject \n
>  Content-Type application/pdf \n
>  stream_size 423660 \n
>  X-Parsed-By org.apache.tika.parser.DefaultParser \n
>  X-Parsed-By org.apache.tika.parser.pdf.PDFParser \n
>  creator S050735 \n
>  meta:author S050735 \n
>  dc:subject \n
>  meta:creation-date 2015-08-28T13:23:03Z \n
>  stream_source_info the-file \n
>  created Fri Aug 28 13:23:03 UTC 2015 \n
>  xmpTPg:NPages 1 \n
>  Creation-Date 2015-08-28T13:23:03Z \n
>  meta:keyword \n
>  Author S050735 \n
>  producer GPL Ghostscript 9.04 \n
>  imagepdf \n
>  \n
>  page \n
>  Page 1 sur 1\n \n
>  28/08/2015
> http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4.
> ..
> \n \n embedded:image0.jpg image0.jpg embedded:image1.jpg image1.jpg
> embedded:image2.jpg image2.jpg \n
>
> So Tika works fine, but it doesn't apply OCR content extraction to the
> embedded images.
>
> When I post an image (JPG) to /update/extract, its content is indexed
> through Tesseract OCR (attr_content field):
> \n \n stream_size 55422 \n
>  X-Parsed-By org.apache.tika.parser.DefaultParser \n
>  X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n
>  stream_content_type image/jpeg \n
>  stream_name OM_1.jpg \n
>  stream_source_info the-file \n
>  Content-Type image/jpeg \n \n \n
>  ‘ '\"I“ \" \"' ./\nlrast. Shortly before the classes started I was
> visiting a.\ncertain public school, a school set in a typically
> English\ncountryside, which on the June clay of my visit was wonder-\nfully
> beauliful. The Head Master—-no less typical than his\nschool and the
> country-side—pointed out the charms of\nboth, and his pride came out in the
> ﬁnal remark which he made\nbeforehe left me. He explained that he had a
> class to take\nin'I'heocritus. Then (with a. buoyant gesture); “ Can
> you\n\n, conceive anything more delightful than a class in
> Theocritus,\n\non such a day and in such a place?\"\n\n \n \n \n
> stream_size 55422 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n
> X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n X-Parsed-By
> org.apache.tika.parser.jpeg.JpegParser \n stream_content_type image/jpeg \n
> Resolution Units inch \n stream_source_info the-file \n Compression Type
> Progressive, Huffman \n Data Precision 8 bits \n Number of Components 3 \n
> tiff:ImageLength 286 \n Component 2 Cb component: Quantization table 1,
> Sampling factors 1 horiz/1 vert \n Component 1 Y component: Quantization
> table 0, Sampling factors 2 horiz/2 vert \n Image Height 286 pixels \n X
> Resolution 72 dots \n Image Width 690 pixels \n stream_name OM_1.jpg \n
> Component 3 Cr component: Quantization table 1, Sampling factors 1 horiz/1
> vert \n tiff:BitsPerSample 8 \n tiff:ImageWidth 690 \n Content-Type
> image/jpeg \n Y Resolution 72 dots
>
> I saw in the Tika JIRA that I have to enable extractInlineImages in
> org/apache/tika/parser/pdf/PDFParser.properties to force image extraction
> from PDFs. So I did, and I packaged a tika-app-1.7.jar that contains the
> tika-parsers-1.7.jar with this property set to true in that file.
> Then I tested my Tika JAR using the CLI:
>
> # java -jar tika-app-1.7.jar -t /data/docs/imagepdf.pdf
>
> In this case, I get the image content:
>
>
> Page 1 sur 1
>
> 28/08/2015
> http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4
> .
> ..
>
> Simple Evan!
> Use Case
> Sdsedulet
>
> So I replaced solr/contrib/extraction/lib/tika-parsers-1.7.jar with my
> modified one, but the images in my PDF are still not extracted.
>
> Does anybody know what I'm doing wrong?
>
> Thank you.
>
> --
> Damien Picard
> Expert GWT
> <
> http://www.editions-eni.fr/livres/gwt-google-web-toolkit-developpez-des-applications-internet-riches-ria-en-java/.97a1a26e7d5be94763fc45ac2a1e961a.html
> >
> Mob : 06 11 51 47 78
>
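
The extract request quoted above was wrapped across two lines by the mail client. As a minimal sketch, it can be assembled in one piece like this. The host, port, and core name (`mycore`) are assumptions for illustration; only the three query parameters come from the thread.

```python
from urllib.parse import urlencode


def build_extract_url(core_url: str, doc_id: str) -> str:
    """Build the /update/extract URL with the parameters quoted in the thread."""
    params = {
        "literal.id": doc_id,            # value stored in the document's id field
        "fmap.content": "attr_content",  # send the extracted body to attr_content
        "uprefix": "attr_",              # prefix applied to fields unknown to the schema
    }
    return f"{core_url.rstrip('/')}/update/extract?{urlencode(params)}"


# Hypothetical core URL; adjust to the local installation.
print(build_extract_url("http://localhost:8983/solr/mycore", "ocrpdf8"))
```

The file itself would then be sent as the body of a POST to that URL.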

RE: Unable to extract images content (OCR) from PDF files using Solr

Posted by "Allison, Timothy B." <ta...@mitre.org>.

I concur with Erick and Upayavira that it is best to keep Tika in a separate JVM...well, ideally a separate box or rack or even data center [0][1]. :)

But seriously, if you're using DIH/SolrCell, you have to configure Tika to parse documents recursively.  This was made possible in SOLR-7189...see the test case/patch [2] for how to configure this.  Given that this is the behavior that most people probably expect, we may want to modify the default setting in DIH; this may be a major/breaking default change, though.

As always, please ping the Tika users list if you have any questions.

Looks like we should update our wiki [3] to include guidance on OCR'ing embedded images.

[0] http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
[1] http://events.linuxfoundation.org/sites/events/files/slides/WhatsNewWithApacheTika.pdf
[2] https://issues.apache.org/jira/browse/SOLR-7189
[3] https://wiki.apache.org/tika/TikaOCR


Re: Unable to extract images content (OCR) from PDF files using Solr

Posted by Erick Erickson <er...@gmail.com>.

Here's an example of what Upayavira is talking about.
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

It has some RDBMS bits, but you can take those out.

Best,
Erick


Re: Unable to extract images content (OCR) from PDF files using Solr

Posted by Upayavira <uv...@odoko.co.uk>.

If your needs of Tika fall outside of those provided by the embedded
Tika, I would suggest you include Tika in your own ingestion pipeline,
and just post raw content to Solr. This will probably perform better
anyway, as you are otherwise using up valuable Solr resources to do your
extraction work, and, as you are seeing, have far less control over what
happens inside than you would if Tika was consumed by your own
application.

Upayavira
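
A rough sketch of the approach described above (run Tika outside Solr, then post the extracted text to Solr) might look like the following. It reuses the `java -jar tika-app-1.7.jar -t` invocation from the original report. The core URL, the `attr_content` field as the target, and the function names are assumptions for illustration, not something prescribed in this thread.

```python
import json
import subprocess
import urllib.request


def tika_command(jar_path, doc_path):
    """Command line from the thread: -t asks tika-app for plain-text output."""
    return ["java", "-jar", jar_path, "-t", doc_path]


def extract_text(jar_path, doc_path):
    """Run Tika in its own JVM and return whatever text it prints."""
    result = subprocess.run(tika_command(jar_path, doc_path),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def build_update_request(core_url, doc_id, text):
    """Prepare a JSON POST to Solr's /update handler (nothing is sent yet)."""
    body = json.dumps([{"id": doc_id, "attr_content": text}]).encode("utf-8")
    return urllib.request.Request(
        core_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_document(jar_path, doc_path, core_url, doc_id):
    """Extract with Tika, then send the text to Solr; returns the HTTP status."""
    request = build_update_request(core_url, doc_id,
                                   extract_text(jar_path, doc_path))
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call such as `index_document("tika-app-1.7.jar", "/data/docs/imagepdf.pdf", "http://localhost:8983/solr/mycore", "ocrpdf8")` would need Java, the Tika JAR, and a running Solr core. With this split, any Tika setting (such as inline image extraction) is controlled by the ingestion side rather than by the copy of Tika bundled with Solr.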
