
Posted to users@nifi.apache.org by Ralf Meier <ne...@cht3.com> on 2016/02/20 16:31:40 UTC

Using Apache Nifi and Tika to extract content from pdf

Hi Everybody, 

I’m new to NiFi and I want to find out whether it is possible to extract content and metadata from PDFs using a library like Tika.
My first idea was to use the following processors:
- GetFile (Watch a specific Folder)
- IdentifyMimeType (Identify whether the file is of type application/pdf)
- RouteOnAttribute (If it is a pdf)
- ExecuteStreamCommand:
	I changed the following settings.
	Command Arguments: {flowfilw_contents}
	Command Path: tika-python parse all
	
I use the Python Tika wrapper from https://github.com/chrismattmann/tika-python

But it is not working.
Does anybody have an idea how to use Tika to extract the content and the metadata using NiFi, or what I’m doing wrong?

Thanks for your help.
BR 
Ralf
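
For anyone who wants to stay with tika-python and ExecuteStreamCommand, here is a minimal sketch of one way it could work. It assumes that ExecuteStreamCommand pipes the flowfile content to the command's stdin and takes stdout as the new content, that Command Path holds only the executable, and that tika-python (with Java available for its Tika server) is installed. The function names and the JSON output shape are made up for illustration.

```python
import json
import sys


def extract(pdf_bytes, parse=None):
    """Return (text, metadata) for PDF bytes.

    `parse` defaults to tika-python's parser.from_buffer; passing another
    callable makes the function usable without a Tika server.
    """
    if parse is None:
        # Imported lazily; requires the tika-python package and Java.
        from tika import parser
        parse = parser.from_buffer
    parsed = parse(pdf_bytes) or {}
    # Tika may report no content at all, so fall back to empty values.
    text = (parsed.get("content") or "").strip()
    metadata = parsed.get("metadata") or {}
    return text, metadata


def main(stdin=None, stdout=None, parse=None):
    """Read a PDF from stdin and write one JSON document to stdout."""
    if stdin is None:
        stdin = sys.stdin.buffer  # binary stream carrying the flowfile content
    if stdout is None:
        stdout = sys.stdout
    text, metadata = extract(stdin.read(), parse)
    json.dump({"metadata": metadata, "content": text}, stdout)
```

Saved under a hypothetical name such as extract_pdf.py with a final `main()` call, the processor would then use Command Path `python` and put the script's path in Command Arguments, rather than placing the whole command line in Command Path.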

Re: Using Apache Nifi and Tika to extract content from pdf

Posted by Russell Whitaker <ru...@gmail.com>.

On Sun, Feb 21, 2016 at 3:30 PM, Matt Burgess <ma...@gmail.com> wrote:
[SNIP]
>
> To that end, I would love to hear any comments, questions, or suggestions
> to make the scripting processors better. Russell's suggestion for adding
> Clojure is a great example; I am hoping we can take this thing as far as it
> can go :)
>

I've done quite a bit of wrapping Clojure functions in ExecuteStreamProcess
processors in recent months at my workplace, so I'd be very willing to help
review & test code enabling scripting support for Clojure.

R

-- 
Russell Whitaker
https://github.com/russellwhitaker
http://twitter.com/OrthoNormalRuss
http://www.linkedin.com/pub/russell-whitaker/0/b86/329

Re: Using Apache Nifi and Tika to extract content from pdf

Posted by Matt Burgess <ma...@gmail.com>.

There are some RegEx processors you can use to see if the PDF parsed text
is "empty" or full of just whitespace, or you can use the scripting
processor for that too.

For Jython, check the unit test:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-scripting-bundle/nifi-scripting-processors/src/test/java/org/apache/nifi/processors/script/TestExecuteJython.java
It refers to resources in
nifi/nifi-nar-bundles/nifi-scripting-bundle/nifi-scripting-processors/src/test/resources/jython,
and also does some flowfile manipulation. Remember that if you use Jython
you can't use JARs like PDFBox; you'd need a Jython-compatible
module/script. The scripting processors should eventually support both
same-language modules (currently only JRuby and Jython support them) and
JVM libraries (JARs), to allow maximum flexibility and power.

To that end, I would love to hear any comments, questions, or
suggestions to make the scripting processors better. Russell's suggestion
for adding Clojure is a great example; I am hoping we can take this thing
as far as it can go :)

Regards,
Matt
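
A minimal sketch of the "empty or just whitespace" check mentioned above, written as plain Python so it can be tried outside NiFi. The name `is_blank` is made up for illustration, and wiring the same regular expression into a NiFi routing processor or script is not shown here. One cause worth ruling out (an assumption, not something confirmed in this thread): PDFs that contain only scanned page images have no text layer, so a text extractor returns nothing but whitespace for them.

```python
import re

# Matches strings that consist only of whitespace (or nothing at all).
_BLANK = re.compile(r"\s*")


def is_blank(text):
    """Return True if the extracted text is None, empty, or whitespace only."""
    return text is None or _BLANK.fullmatch(text) is not None


print(is_blank(" \n\t"))    # True
print(is_blank("Page 1"))   # False
```

Flowfiles for which such a check is true could be routed to a separate relationship for inspection instead of being written to disk or indexed.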



Re: Using Apache Nifi and Tika to extract content from pdf

Posted by Ralf Meier <ne...@cht3.com>.

Hi,

thanks for your help. The workflow is now working, but I still have some issues. The PutFile at the end of the workflow writes the file to disk, but in my case the content of the flow file is mostly empty (only one PDF worked for me), even though the rest is processed just fine. When I try to put the result into Elasticsearch, for example, it is empty as well.

Do you have any hints for this?

In addition, I searched the documentation for the Python syntax to read the input flowfile, create a new flowfile, and pass it back. Is there any documentation for this, or where can I find more information?

Sorry for all my questions.

BR and thanks.

Ralf



Re: Using Apache Nifi and Tika to extract content from pdf

Posted by Matt Burgess <ma...@gmail.com>.

I will update the blog to make these points clearer. I used PDFBox 1.8.10, so I'm not sure what else you need for the 2.0 series. For the JAR issue with 1.8.10, the PDFBox docs say you need three JARs: PDFBox, fontbox, and jempbox, plus commons-logging, but I think that's already in NiFi.

The stack trace from the script error should be in logs/nifi-app.log; if you send it along I can take a look. In the Module Directory property you should be able to point to the folder containing the JARs, or supply a comma-separated list of the individual JARs.

For the Groovy "magic" stuff (syntactic sugar and closure coercion while using the NiFi APIs), I explain some of that in another post on that blog:
http://funnifi.blogspot.com/2016/02/executescript-processor-replacing-flow.html?m=1

Hope this helps,
Matt
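
To double-check what the property value might look like, here is a small sketch in plain Python (run outside NiFi) that lists the JARs in a folder as a comma-separated string and reports which of the three PDFBox 1.8.x JARs named above appear to be missing. The function name and the file-name prefixes are assumptions for illustration; real JAR file names may differ.

```python
from pathlib import Path

# Assumed file-name prefixes for the JARs mentioned above (PDFBox 1.8.x).
EXPECTED_PREFIXES = ("pdfbox", "fontbox", "jempbox")


def describe_module_dir(folder):
    """Return (comma_separated_jar_paths, missing_prefixes) for a folder."""
    jars = sorted(Path(folder).glob("*.jar"))
    names = [jar.name.lower() for jar in jars]
    # A prefix counts as missing when no JAR file name starts with it.
    missing = [
        prefix for prefix in EXPECTED_PREFIXES
        if not any(name.startswith(prefix) for name in names)
    ]
    return ",".join(str(jar) for jar in jars), missing
```

Calling `describe_module_dir` on the folder that holds the downloaded JARs gives a string that could be pasted into the property, plus a quick hint when a dependency such as fontbox was never downloaded.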


Re: Using Apache Nifi and Tika to extract content from pdf

Posted by Ralf Meier <ne...@cht3.com>.

Hi,

thanks for your information. I am trying to understand your workflow, but I get an error when I test it:

: org.apache.nifi.processor.exception.ProcessException: javax.script.ScriptException: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script36800.groovy: 15: unable to resolve class PDFTextStripper 
 @ line 15, column 9.
   def s = new PDFTextStripper()

I downloaded pdfbox-2.0.0-RC3.jar and copied it into a folder named pdfbox in my download folder. I then changed the path (Module Directory) in the ExecuteScript processor to this folder. I didn’t change anything else.

But I still get this error. Do you have any hints? That would be great.

To be honest (I’m totally new to Groovy), I also did not understand in detail what happens here:

flowFile = session.write(flowFile, {inputStream, outputStream ->
        doc = PDDocument.load(inputStream)
        info = doc.getDocumentInformation()
        s.writeText(doc, new OutputStreamWriter(outputStream))
    } as StreamCallback
)

Thanks for your help.

BR
Ralf


Re: Using Apache Nifi and Tika to extract content from pdf

Posted by Russell Whitaker <ru...@gmail.com>.

Yes! I, for one, will weigh in with my interest in Clojure support in
the scripting processors.

Russell

-- 
Russell Whitaker
http://twitter.com/OrthoNormalRuss
http://www.linkedin.com/pub/russell-whitaker/0/b86/329

Re: Using Apache Nifi and Tika to extract content from pdf

Posted by Matt Burgess <ma...@gmail.com>.

Clojure libraries (or any JARs) can be used by the supported scripting languages. However, Clojure itself is not yet supported by the NiFi scripting processors; there were issues with the Clojure ScriptEngine bridge, so it was left off the original list. If there is interest in adding Clojure, I can write up an improvement Jira with the initial findings.

Regards,
Matt



Re: Using Apache Nifi and Tika to extract content from pdf

Posted by Russell Whitaker <ru...@gmail.com>.

Don't forget Clojure as well. 

Russell Whitaker
Sent from my iPhone


Re: Using Apache Nifi and Tika to extract content from pdf

Posted by Matt Burgess <ma...@gmail.com>.

I have a blog post on how to do this with NiFi using a Groovy script in the ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:

http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1

Jython is also supported but can't yet use Java libraries (it uses Jython scripts/modules instead). The other languages (Groovy, Lua, JavaScript, JRuby) can use Java libraries like Tika and PDFBox.

Regards,
Matt

Sent from my iPhone
