Posted to user@nutch.apache.org by je...@tutanota.com on 2015/10/06 15:55:48 UTC

OCR images from PDF with Tika

Hello,

I use Nutch v1.10 and would like to know whether Nutch with the Tika parser v1.8 can
natively OCR images embedded in PDF files. I can OCR JPEG and PNG files, but Tika does
not process the images inside PDFs. I use Elasticsearch for indexing.

Thank you

Re: OCR images from PDF with Tika

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

I've just verified with Nutch trunk (upcoming 1.11):
- Tika 1.10 is able to OCR embedded images if
  PDFParser.properties is modified accordingly
  in tika-app-1.10.jar
- but parse-tika doesn't, even if the same modifications
  are made in runtime/local/plugins/parse-tika/tika-parsers-1.10.jar

This needs some debugging to find out what is wrong.

Please, feel free to file a bug report on
https://issues.apache.org/jira/browse/NUTCH

Thanks,
Sebastian
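
As a possible starting point for that debugging, here is a minimal sketch in Python (standard library only) that reads PDFParser.properties out of each jar, so the tika-app jar and the plugin's tika-parsers jar can be compared, and checks whether a tesseract executable is visible on PATH. The function names are made up for illustration, the property parsing is a simplification of the Java .properties format, and the assumption that Tika locates tesseract via PATH when no explicit path is configured should be verified against your Tika version.

```python
import re
import shutil
import tempfile
import zipfile
from pathlib import Path

# Location of the PDF parser defaults inside the Tika jars, as quoted in this thread.
ENTRY = "org/apache/tika/parser/pdf/PDFParser.properties"

# Simplified matcher for "key value", "key=value" and "key: value" lines.
# Blank lines and comment lines (starting with # or !) do not match.
_LINE = re.compile(r"\s*([^\s=:#!][^\s=:]*)\s*[=:]?\s*(.*)")


def read_pdf_parser_properties(jar_path):
    """Return the PDFParser.properties entries of one jar as a dict.

    Raises KeyError if the jar does not contain ENTRY.
    """
    with zipfile.ZipFile(jar_path) as jar:
        # Java .properties files are ISO-8859-1 encoded.
        text = jar.read(ENTRY).decode("latin-1")
    props = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            props[match.group(1)] = match.group(2).strip()
    return props


def report(jar_paths, key="extractInlineImages"):
    """Map each jar path to the value it carries for `key` (None if unset)."""
    return {str(p): read_pdf_parser_properties(p).get(key) for p in jar_paths}


def tesseract_on_path():
    """Return the tesseract executable found on PATH, or None."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    # Demo on a throwaway jar; pass the real jar paths from your install instead.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo.jar"
        with zipfile.ZipFile(demo, "w") as jar:
            jar.writestr(ENTRY, "extractInlineImages true\nsortByPosition  false\n")
        print(report([demo]))
        print("tesseract:", tesseract_on_path())
```

Calling report() with the tika-app jar and runtime/local/plugins/parse-tika/tika-parsers-1.10.jar would show whether both really carry the same setting. If they do, the difference is more likely in how parse-tika invokes Tika than in the properties file.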

On 10/09/2015 06:21 PM, Sebastian Nagel wrote:
> Hi,
> 
> Sorry, I didn't try this myself; I just remembered
> that there had been a thread on the Tika
> mailing list.
> 
>> What is the difference between ./plugins/parse-tika/parse-tika.jar and
>> ./plugins/parse-tika/tika-parsers-1.8.jar?
> 
> parse-tika.jar contains the classes of Nutch's parse-tika plugin,
> which depends on the library tika-parsers-1.x.jar.
> 
> Sebastian

Re: OCR images from PDF with Tika

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

Sorry, I didn't try this myself; I just remembered
that there had been a thread on the Tika
mailing list.

> What is the difference between ./plugins/parse-tika/parse-tika.jar and
> ./plugins/parse-tika/tika-parsers-1.8.jar?

parse-tika.jar contains the classes of Nutch's parse-tika plugin,
which depends on the library tika-parsers-1.x.jar.

Sebastian

On 10/09/2015 02:54 PM, jeanblue@tutanota.com wrote:
> Hello,
> 
> I tried to edit the JAR file and changed
> 'org/apache/tika/parser/pdf/PDFParser.properties' as follows:
> 
>   enableAutospace true
>   extractAnnotationText true
>   sortByPosition  false
>   suppressDuplicateOverlappingText  false
>   useNonSequentialParser  false
>   extractAcroFormContent  true
>   extractInlineImages true
>   extractUniqueInlineImagesOnly false
>   checkExtractAccessPermission false
>   allowExtractionForAccessibility true
> 
> but I get the same result. Tesseract is also installed.
> 
> What is the difference between ./plugins/parse-tika/parse-tika.jar and
> ./plugins/parse-tika/tika-parsers-1.8.jar?
> 
> Thanks for your help!

Re: OCR images from PDF with Tika

Posted by je...@tutanota.com.

Hello,

I tried to edit the JAR file and changed
'org/apache/tika/parser/pdf/PDFParser.properties' as follows:

  enableAutospace true
  extractAnnotationText true
  sortByPosition  false
  suppressDuplicateOverlappingText  false
  useNonSequentialParser  false
  extractAcroFormContent  true
  extractInlineImages true
  extractUniqueInlineImagesOnly false
  checkExtractAccessPermission false
  allowExtractionForAccessibility true

but I get the same result. Tesseract is also installed.

What is the difference between ./plugins/parse-tika/parse-tika.jar and
./plugins/parse-tika/tika-parsers-1.8.jar?

Thanks for your help!
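
For readers who want to repeat the jar edit described in this message, the following is a minimal sketch in Python (standard library only). It rewrites one key in PDFParser.properties inside a jar and replaces the jar in place. The helper names set_property and patch_jar are hypothetical, and the code assumes the whitespace-separated "key value" layout shown in the listing above. Keep a backup of the original jar before trying it.

```python
import os
import tempfile
import zipfile
from pathlib import Path

# Entry path inside the Tika jar, as quoted in this message.
ENTRY = "org/apache/tika/parser/pdf/PDFParser.properties"


def set_property(text, key, value):
    """Return properties text with `key` set to `value` (appended if absent).

    Assumes whitespace-separated "key value" lines, as in the listing above.
    """
    lines, found = [], False
    for line in text.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0] == key:
            lines.append(f"{key} {value}")
            found = True
        else:
            lines.append(line)  # keep every other line untouched
    if not found:
        lines.append(f"{key} {value}")
    return "\n".join(lines) + "\n"


def patch_jar(jar_path, key, value):
    """Rewrite `jar_path` in place with one property changed in ENTRY."""
    jar_path = Path(jar_path)
    # Build the new jar next to the old one, then swap it in.
    with tempfile.NamedTemporaryFile(
        dir=jar_path.parent, suffix=".jar", delete=False
    ) as tmp:
        tmp_path = Path(tmp.name)
    with zipfile.ZipFile(jar_path) as src, zipfile.ZipFile(tmp_path, "w") as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == ENTRY:
                # Java .properties files are ISO-8859-1 encoded.
                text = set_property(data.decode("latin-1"), key, value)
                data = text.encode("latin-1")
            dst.writestr(info, data)  # reuses each entry's original metadata
    os.replace(tmp_path, jar_path)
```

A call such as patch_jar("plugins/parse-tika/tika-parsers-1.8.jar", "extractInlineImages", "true") would apply the same change as the manual edit; the path here is taken from the question in this message and may differ in your install. Note that, according to the later reply in this thread, the same modification made Tika 1.10 OCR embedded images when run from tika-app but not when run through the parse-tika plugin, so this edit alone may not be sufficient.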

8. Oct 2015 20:43 by wastl.nagel@googlemail.com:


> Hi,
>
> there has been a similar question on the Tika mailing list recently:
>
> http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3CDM2PR09MB071346D01729FC9367308E94C7D90@DM2PR09MB0713.namprd09.prod.outlook.com%3E
>
> If you get Tika to OCR the embedded images, the parse-tika
> plugin will probably do so too if the Tika jar is replaced.
>
> Sebastian

Re: OCR images from PDF with Tika

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

there has been a similar question on the Tika mailing list recently:

http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3CDM2PR09MB071346D01729FC9367308E94C7D90@DM2PR09MB0713.namprd09.prod.outlook.com%3E

If you get Tika to OCR the embedded images, the parse-tika
plugin will probably do so too if the Tika jar is replaced.

Sebastian

On 10/06/2015 03:55 PM, jeanblue@tutanota.com wrote:
> Hello,
> 
> I use Nutch v1.10 and would like to know whether Nutch with the Tika parser v1.8 can
> natively OCR images embedded in PDF files. I can OCR JPEG and PNG files, but Tika does
> not process the images inside PDFs. I use Elasticsearch for indexing.
> 
> Thank you
>