Posted to user@nutch.apache.org by je...@tutanota.com on 2015/10/06 15:55:48 UTC
OCR images from PDF with Tika
Hello,
I use Nutch v1.10 and just want to know whether Nutch with Tika parser v1.8
can natively OCR images from PDF files. I can OCR JPEG or PNG files, but
Tika does not convert images from PDF. I use Elastic to index.
Thank you
Re: OCR images from PDF with Tika
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
I've just verified with Nutch trunk (upcoming 1.11):
- Tika 1.10 is able to OCR embedded images if
  PDFParser.properties is modified accordingly
  in tika-app-1.10.jar
- but parse-tika does not, even if the same modifications
  are made in runtime/local/plugins/parse-tika/tika-parsers-1.10.jar
This needs some debugging to find out what is wrong.
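
One way to narrow this down is to confirm that both jars really contain the same settings. The following is a minimal sketch, not part of Nutch or Tika: the function name `read_pdf_parser_properties` is hypothetical, and the entry path is the one quoted elsewhere in this thread. It only reads the file out of the jar with Python's standard `zipfile` module.

```python
import re
import zipfile

# Entry path inside the jar, as quoted in this thread.
PROPS_ENTRY = "org/apache/tika/parser/pdf/PDFParser.properties"

# Java properties lines may be written "key value", "key=value" or "key:value".
_LINE = re.compile(r"([^\s=:]+)\s*[=:\s]\s*(.*)")


def read_pdf_parser_properties(jar_path):
    """Return the settings of PDFParser.properties inside jar_path as a dict.

    Hypothetical helper: raises KeyError if the jar has no such entry.
    """
    with zipfile.ZipFile(jar_path) as jar:
        text = jar.read(PROPS_ENTRY).decode("iso-8859-1")
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":  # skip blank lines and comments
            continue
        match = _LINE.match(line)
        if match:
            props[match.group(1)] = match.group(2).strip()
    return props
```

Calling it once on tika-app-1.10.jar and once on the tika-parsers-1.10.jar under runtime/local/plugins/parse-tika, then comparing the value of `extractInlineImages`, would show whether the plugin's jar actually carries the edit.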
Please feel free to file a bug report at
https://issues.apache.org/jira/browse/NUTCH
Thanks,
Sebastian
On 10/09/2015 06:21 PM, Sebastian Nagel wrote:
> Hi,
>
> sorry, but I didn't try this by myself, just had
> in mind that there has been a thread on the Tika
> mailing list.
>
>> What is the difference between ./plugins/parse-tika/parse-tika.jar and
>> ./plugins/parse-tika/tika-parsers-1.8.jar?
>
> parse-tika.jar contains the classes of Nutch's parse-tika plugin
> which depends on the library tika-parsers-1.x.jar.
>
> Sebastian
>
> On 10/09/2015 02:54 PM, jeanblue@tutanota.com wrote:
>> Hello,
>>
>> I tried to edit the JAR file and changed
>> 'org/apache/tika/parser/pdf/PDFParser.properties':
>>
>> enableAutospace true
>> extractAnnotationText true
>> sortByPosition false
>> suppressDuplicateOverlappingText false
>> useNonSequentialParser false
>> extractAcroFormContent true
>> extractInlineImages true
>> extractUniqueInlineImagesOnly false
>> checkExtractAccessPermission false
>> allowExtractionForAccessibility true
>>
>> but I get the same result. Tesseract has also been installed.
>>
>> What is the difference between ./plugins/parse-tika/parse-tika.jar and
>> ./plugins/parse-tika/tika-parsers-1.8.jar?
>>
>> Thanks for your help!
>>
>> 8. Oct 2015 20:43 by wastl.nagel@googlemail.com:
>>
>>
>>> Hi,
>>>
>>> there has been a similar question on the Tika mailing list recently:
>>>
>>> http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3CDM2PR09MB071346D01729FC9367308E94C7D90@DM2PR09MB0713.namprd09.prod.outlook.com%3E
>>>
>>> If you get Tika to OCR the embedded images, the parse-tika
>>> plugin will probably also do so if the Tika jar is replaced.
>>>
>>> Sebastian
>>>
>>> On 10/06/2015 03:55 PM, jeanblue@tutanota.com wrote:
>>>> Hello,
>>>>
>>>> I use Nutch v1.10 and just want to know whether Nutch with Tika parser v1.8
>>>> can natively OCR images from PDF files. I can OCR JPEG or PNG files, but
>>>> Tika does not convert images from PDF. I use Elastic to index.
>>>>
>>>> Thank you
>>>>
>
Re: OCR images from PDF with Tika
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
sorry, but I didn't try this by myself, just had
in mind that there has been a thread on the Tika
mailing list.
> What is the difference between ./plugins/parse-tika/parse-tika.jar and
> ./plugins/parse-tika/tika-parsers-1.8.jar?
parse-tika.jar contains the classes of Nutch's parse-tika plugin
which depends on the library tika-parsers-1.x.jar.
Sebastian
On 10/09/2015 02:54 PM, jeanblue@tutanota.com wrote:
> Hello,
>
> I tried to edit the JAR file and changed
> 'org/apache/tika/parser/pdf/PDFParser.properties':
>
> enableAutospace true
> extractAnnotationText true
> sortByPosition false
> suppressDuplicateOverlappingText false
> useNonSequentialParser false
> extractAcroFormContent true
> extractInlineImages true
> extractUniqueInlineImagesOnly false
> checkExtractAccessPermission false
> allowExtractionForAccessibility true
>
> but I get the same result. Tesseract has also been installed.
>
> What is the difference between ./plugins/parse-tika/parse-tika.jar and
> ./plugins/parse-tika/tika-parsers-1.8.jar?
>
> Thanks for your help!
>
> 8. Oct 2015 20:43 by wastl.nagel@googlemail.com:
>
>
>> Hi,
>>
>> there has been a similar question on the Tika mailing list recently:
>>
>> http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3CDM2PR09MB071346D01729FC9367308E94C7D90@DM2PR09MB0713.namprd09.prod.outlook.com%3E
>>
>> If you get Tika to OCR the embedded images, the parse-tika
>> plugin will probably also do so if the Tika jar is replaced.
>>
>> Sebastian
>>
>> On 10/06/2015 03:55 PM, jeanblue@tutanota.com wrote:
>>> Hello,
>>>
>>> I use Nutch v1.10 and just want to know whether Nutch with Tika parser v1.8
>>> can natively OCR images from PDF files. I can OCR JPEG or PNG files, but
>>> Tika does not convert images from PDF. I use Elastic to index.
>>>
>>> Thank you
>>>
Re: OCR images from PDF with Tika
Posted by je...@tutanota.com.
Hello,
I tried to edit the JAR file and changed
'org/apache/tika/parser/pdf/PDFParser.properties':
enableAutospace true
extractAnnotationText true
sortByPosition false
suppressDuplicateOverlappingText false
useNonSequentialParser false
extractAcroFormContent true
extractInlineImages true
extractUniqueInlineImagesOnly false
checkExtractAccessPermission false
allowExtractionForAccessibility true
but I get the same result. Tesseract has also been installed.
What is the difference between ./plugins/parse-tika/parse-tika.jar and
./plugins/parse-tika/tika-parsers-1.8.jar?
Thanks for your help!
8. Oct 2015 20:43 by wastl.nagel@googlemail.com:
> Hi,
>
> there has been a similar question on the Tika mailing list recently:
>
> http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3CDM2PR09MB071346D01729FC9367308E94C7D90@DM2PR09MB0713.namprd09.prod.outlook.com%3E
>
> If you get Tika to OCR the embedded images, the parse-tika
> plugin will probably also do so if the Tika jar is replaced.
>
> Sebastian
>
> On 10/06/2015 03:55 PM, jeanblue@tutanota.com wrote:
>> Hello,
>>
>> I use Nutch v1.10 and just want to know whether Nutch with Tika parser v1.8
>> can natively OCR images from PDF files. I can OCR JPEG or PNG files, but
>> Tika does not convert images from PDF. I use Elastic to index.
>>
>> Thank you
>>
Re: OCR images from PDF with Tika
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
there has been a similar question on the Tika mailing list recently:
http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3CDM2PR09MB071346D01729FC9367308E94C7D90@DM2PR09MB0713.namprd09.prod.outlook.com%3E
If you get Tika to OCR the embedded images, the parse-tika
plugin will probably also do so if the Tika jar is replaced.
Sebastian
On 10/06/2015 03:55 PM, jeanblue@tutanota.com wrote:
> Hello,
>
> I use Nutch v1.10 and just want to know whether Nutch with Tika parser v1.8
> can natively OCR images from PDF files. I can OCR JPEG or PNG files, but
> Tika does not convert images from PDF. I use Elastic to index.
>
> Thank you
>