Posted to users@nifi.apache.org by mayank rathi <ma...@gmail.com> on 2017/10/08 13:53:11 UTC

Re: NiFi: Extracting text from images

Thanks Joe and Kevin for replying.

Just to close the loop on this issue. I was "incorrectly" expecting
ExtractImageMetadata to identify Text in Images. As per Kevin's suggestion
we are now exploring tesseract for this.

Regards
Mayank



On Thu, Sep 28, 2017 at 4:23 PM, Kevin Doran <kd...@gmail.com>
wrote:

> Hi Mayank,
>
> To clarify, are you attempting to extract image metadata (e.g., timestamp,
> width, height) or convert a photo/graphic of text into a text string using
> optical character recognition (OCR)?
>
> If the former (metadata), Joe has you on the correct track for digging
> deeper. If OCR, this has been done by others with NiFi using custom
> processors. The keyword search is "NiFi OCR". I would recommend giving
> Jeremy Dyer's Tesseract Processor [1] a look. Here is a guide Jeremy
> published [2].
>
> Cheers,
> Kevin
>
> [1] https://github.com/jdye64/nifi-addons/tree/master/Processors/nifi-tesseract
> [2] https://community.hortonworks.com/articles/28380/nifi-ocr-using-apache-nifi-to-read-childrens-books.html
>
> On 9/28/17, 16:14, "Joe Witt" <jo...@gmail.com> wrote:
>
>     Mayank,
>
>     When you tried it what happened?  Did you look at the flow file
>     attributes after the extraction?
>
>     Can you share a jpeg you're using that is not working as you'd expect?
>
>     Thanks
>     Joe
>
>     On Thu, Sep 28, 2017 at 4:12 PM, mayank rathi <ma...@gmail.com> wrote:
>     > Hello All,
>     >
>     > Can NiFi extract text from jpg images? If Yes, then which processor do I
>     > need to use? I tried ExtractImageMetadata processor and it did not help.
>     >
>     > Thanks in advance.
>     >
>
>
>
>


-- 
NOTICE: This email message is for the sole use of the intended recipient(s)
and may contain confidential and privileged information. Any unauthorized
review, use, disclosure or distribution is prohibited. If you are not the
intended recipient, please contact the sender by reply email and destroy
all copies of the original message.

Re: NiFi: Extracting text from images

Posted by mayank rathi <ma...@gmail.com>.

Thanks Jeremy.

Indeed, the quality of the output text was not up to our expectations.
Instead of spending time on tuning tesseract parameters, we identified a way
to get the text as part of the metadata. As a result, we most probably won't
be using tesseract.

Thanks for offering help though.


On Sun, Oct 8, 2017 at 2:00 PM, Jeremy Dyer <jd...@gmail.com> wrote:

> Mayank - As you may already know, I had a few licensing issues getting the
> tesseract project into the formal project, so development stalled. It does
> work, however, and I will be glad to help you out along the way if you have
> any questions while adding it.
>
> One thing I would highly recommend is to first use tesseract from the
> command line to tune your parameters and gain some insight into the
> realistic accuracy you will get extracting the text from your images before
> going through the hassle of doing the complete setup.
>
> Thanks
> Jeremy Dyer
>

Re: NiFi: Extracting text from images

Posted by Jeremy Dyer <jd...@gmail.com>.

Chandra - Yes, you must first install tesseract as a system dependency
before the tesseract processor will work. Depending on your system, you will
need to alter your java.library.path to include the directory containing
your tesseract.so (for Linux) library. I find the simplest place to do this
is in your NiFi bootstrap.conf file, since this will not alter any other
Java applications that might be on your system.

It will be easy to know if you have it properly configured because NiFi
will fail to start if it cannot find your tesseract library. If NiFi boots
up you should be good to go! Let me know if you have any trouble.

- Jeremy Dyer
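
The library-path step described above could be sketched roughly as follows.
This is not part of Jeremy's processor; it is a minimal Python helper,
written under assumptions, that looks for a tesseract shared library in a
few common Linux directories and prints a candidate line for bootstrap.conf.
The search directories, the "*tesseract*.so*" file pattern, and the
java.arg index 20 are all hypothetical; pick an index that is not already
used by another java.arg.N entry in your bootstrap.conf.

```python
import glob
import os
from typing import Iterable, List

# Assumed locations; adjust for your distribution.
DEFAULT_SEARCH_DIRS = (
    "/usr/lib",
    "/usr/lib64",
    "/usr/local/lib",
    "/usr/lib/x86_64-linux-gnu",
)


def find_tesseract_dirs(search_dirs: Iterable[str] = DEFAULT_SEARCH_DIRS) -> List[str]:
    """Return the directories that contain a file matching *tesseract*.so*."""
    found = []
    for directory in search_dirs:
        pattern = os.path.join(directory, "*tesseract*.so*")
        if glob.glob(pattern):
            found.append(directory)
    return found


def bootstrap_line(arg_index: int, library_dir: str) -> str:
    """Format a JVM argument line in the java.arg.N style used by bootstrap.conf."""
    return f"java.arg.{arg_index}=-Djava.library.path={library_dir}"


def suggest_bootstrap_lines(arg_index: int = 20) -> List[str]:
    """Build one candidate bootstrap.conf line per directory that was found."""
    return [bootstrap_line(arg_index, d) for d in find_tesseract_dirs()]
```

Calling suggest_bootstrap_lines() returns an empty list when no library is
found in the assumed directories, which usually means tesseract is not
installed yet or lives somewhere else.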

On Thu, Oct 12, 2017 at 3:42 PM, emceemouli <
chandramouli.muthukumaran@calgary.ca> wrote:

> Hi Jeremy,
>
> Can you please let me know the steps. I looked on GitHub, but I wasn't
> sure if installing tesseract is a prerequisite for the custom processor to
> work?
>
> Any help or more details would be greatly appreciated.
>
> Thanks,
> Chandra
>
>
>
> --
> Sent from: http://apache-nifi-users-list.2361937.n4.nabble.com/
>

Re: NiFi: Extracting text from images

Posted by emceemouli <ch...@calgary.ca>.

Hi Jeremy,

Can you please let me know the steps. I looked on GitHub, but I wasn't
sure if installing tesseract is a prerequisite for the custom processor to
work?

Any help or more details would be greatly appreciated.

Thanks,
Chandra



--
Sent from: http://apache-nifi-users-list.2361937.n4.nabble.com/

Re: NiFi: Extracting text from images

Posted by Jeremy Dyer <jd...@gmail.com>.

Mayank - As you may already know, I had a few licensing issues getting the
tesseract project into the formal project, so development stalled. It does
work, however, and I will be glad to help you out along the way if you have
any questions while adding it.

One thing I would highly recommend is to first use tesseract from the
command line to tune your parameters and gain some insight into the
realistic accuracy you will get extracting the text from your images before
going through the hassle of doing the complete setup.

Thanks
Jeremy Dyer
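
One way to try the command-line tuning suggested above is a small wrapper
that runs the tesseract executable with a few page segmentation modes and
collects the text for comparison. This is only a sketch under assumptions:
it expects a tesseract binary on PATH, uses the "stdout" output base and the
"--psm" spelling from Tesseract 4 and later (3.x releases used "-psm"), and
the function names and the choice of modes 3, 4 and 6 are illustrative, not
taken from this thread.

```python
import shutil
import subprocess
from typing import Dict, Iterable, List, Optional


def build_tesseract_command(image_path: str,
                            psm: Optional[int] = None,
                            lang: str = "eng") -> List[str]:
    """Build a tesseract invocation that writes recognized text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if psm is not None:
        # Assumption: Tesseract 4+ flag spelling; 3.x used "-psm" instead.
        cmd += ["--psm", str(psm)]
    return cmd


def run_ocr(image_path: str, psm: Optional[int] = None, lang: str = "eng") -> str:
    """Run tesseract once and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def compare_modes(image_path: str, modes: Iterable[int] = (3, 4, 6)) -> Dict[int, str]:
    """Run the same image through several page segmentation modes."""
    return {psm: run_ocr(image_path, psm=psm) for psm in modes}
```

Reading the outputs of compare_modes("sample.jpg") side by side (the file
name is a placeholder) gives a rough feel for achievable accuracy before
any NiFi setup is attempted.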
