Posted to dev@nifi.apache.org by Dmitry Goldenberg <dg...@hexastax.com> on 2016/04/01 02:30:56 UTC

Re: Text and metadata extraction processor

Simon,

Interesting commentary.  The issue that Joe and I have both looked at, with
the splitting of metadata and content extraction, is that if they're split
then the underlying Tika extraction has to process the file twice: once to
pull out the attributes and once to pull out the content.  Perhaps it may
be good to add ExtractMetadata and ExtractTextContent in addition to
ExtractMediaAttributes?  Seems like overkill, but I may be wrong.

It seems prudent to provide one comprehensive, out-of-the-box extractor
processor with options to extract just metadata, just content, or both
metadata and content.
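
For reference, getting both from Tika in a single pass looks roughly like
this (a minimal sketch, not the actual processor code):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class SinglePassExtract {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                parser.parse(in, handler, metadata);  // one pass over the bytes
            }
            for (String name : metadata.names()) {    // -> flow file attributes
                System.out.println(name + ": " + metadata.get(name));
            }
            System.out.println(handler.toString());   // -> extracted text content
        }
    }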

I think what I'm hearing is that we need a way to check whether text/content
has already been extracted by the time we get to the ExtractMediaAttributes
processor.  If that is the issue, then I believe the user would use
RouteOnAttribute: if the content is already filled in, they'd simply not
route to ExtractMediaAttributes.
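
If it helps, a RouteOnAttribute ahead of it could look like this (assuming
the extracted text lands in a hypothetical "text" attribute):

    Routing Strategy: Route to Property name
    needs.extraction (dynamic property): ${text:isEmpty()}

FlowFiles matching "needs.extraction" would go to ExtractMediaAttributes; the
rest would bypass it.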

As far as OCR goes: Tika internally supports OCR by directing image files
to Tesseract (if Tesseract is installed and configured properly).  We've
started talking about how this could be reconciled within
ExtractMediaAttributes.

I think that once we have the basic ExtractMediaAttributes, we could add
filters for what files to enable the OCR on, and we'd need to expose a few
config parameters specific to OCR, such as the location of the
Tesseract installation and the maximum file size on which to attempt the
OCR.  Perhaps there can also be a RunOCR processor which would be dedicated
to running OCR.  But since Tika already has OCR integrated we'd probably
want to take care of that in the ExtractMediaAttributes configuration.
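
On the Tika side that maps roughly onto TesseractOCRConfig; a sketch (our
processor's own property names are TBD):

    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;

    TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
    ocrConfig.setTesseractPath("/usr/local/bin");     // where Tesseract is installed
    ocrConfig.setMaxFileSizeToOcr(10 * 1024 * 1024);  // skip OCR above ~10 MB
    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, ocrConfig);
    // then: parser.parse(in, handler, metadata, parseContext)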

Additionally, I've proposed the idea of a ProcessPDF processor which would
ascertain whether a PDF is 'text' or 'scanned'. If scanned, we would break
it up into pages and run OCR on each page, then aggregate the extracted
text.
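
A rough sketch of that check with PDFBox (the page cap and DPI are arbitrary,
and the Tesseract hand-off is left as a hypothetical helper):

    import java.awt.image.BufferedImage;
    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.rendering.PDFRenderer;
    import org.apache.pdfbox.text.PDFTextStripper;

    try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
        String text = new PDFTextStripper().getText(doc);
        if (text.trim().isEmpty()) {
            // no text layer -> 'scanned' PDF, so OCR page by page
            PDFRenderer renderer = new PDFRenderer(doc);
            int pages = Math.min(doc.getNumberOfPages(), 50);  // configurable max
            StringBuilder ocrText = new StringBuilder();
            for (int i = 0; i < pages; i++) {
                BufferedImage page = renderer.renderImageWithDPI(i, 300);
                // ocrText.append(runTesseract(page));  // hypothetical OCR call
            }
            text = ocrText.toString();
        }
    }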

- Dmitry



On Thu, Mar 31, 2016 at 3:19 PM, Simon Ball <sb...@hortonworks.com> wrote:

> Just a thought…
>
> To keep consistent with other Nifi Parse patterns, would it make sense to
> base the extraction of content on the presence of a relation? So your tika
> processor would have an original relation which would have metadata
> attached as attributes, and an extracted relation which would have the
> metadata and the processed content (text from an OCRed image, for example).
> That way you can just use context.hasConnection(relationship) to determine
> whether to enable the tika content processing.
>
> This seems more idiomatic than a mode flag.
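>
> A minimal sketch of that check (REL_EXTRACTED is an assumed relationship
> name, not from the actual PR):
>
>     final boolean extractContent = context.hasConnection(REL_EXTRACTED);
>     // run Tika once either way; only produce the content flow file, and
>     // transfer it to REL_EXTRACTED, when that relationship is connected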
>
> Simon
>
> > On 31 Mar 2016, at 19:48, Joe Skora <js...@gmail.com> wrote:
> >
> > Dmitry,
> >
> > I think we're good.  I was confused because the "XXX_METADATA_MIMETYPE_FILTER"
> > entries seemed to refer to some MIME type of the metadata, but you meant to
> > use the file's MIME type to select what files have metadata extracted.
> >
> > Sorry about that; I think we are on the same page.
> >
> > Joe
> >
> > On Thu, Mar 31, 2016 at 11:40 AM, Dmitry Goldenberg <
> > dgoldenberg@hexastax.com> wrote:
> >
> >> Hi Joe,
> >>
> >> I think if we have the filters in place then there's no need for the
> >> 'mode' enum, as the filters themselves guide the processor in deciding
> >> whether metadata and/or content is extracted for a given input file.
> >>
> >> Agreed on the handling of archives as a separate processor (a template,
> >> it seems).
> >>
> >> I think it's easiest to do both metadata and/or content in one processor
> >> since it can tell Tika whether to extract metadata and/or content, in
> >> one pass over the file bytes (as you pointed out).
> >>
> >> Agreed on the exclusions trumping inclusions; I think that makes sense.
> >>
> >>>> We will only have a mimetype for the original flow file itself so I'm
> >> not sure about the metadata mimetype filter.
> >>
> >> I'm not sure where there might be an issue here.  The metadata MIME type
> >> filter tells the processor for which MIME types to perform the metadata
> >> extraction.  For instance: extract metadata for images and videos only.
> >> This could possibly be coupled with an exclusion filter for content that
> >> says: don't try to extract content from images and videos.
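> >>
> >> For instance (hypothetical property values, Java regex syntax assumed):
> >>
> >>     INCLUDE_METADATA_MIMETYPE_FILTER = (image|video)/.*
> >>     EXCLUDE_CONTENT_MIMETYPE_FILTER  = (image|video)/.*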
> >>
> >> I think with the six filters we get all the bases covered:
> >>
> >>   1. include metadata? --
> >>      1. yes --
> >>         1. determine the inclusion of metadata by filename pattern
> >>         2. determine the inclusion of metadata by MIME type pattern
> >>      2. no --
> >>         1. determine the exclusion of metadata by filename pattern
> >>         2. determine the exclusion of metadata by MIME type pattern
> >>   2. include content? --
> >>      1. yes --
> >>         1. determine the inclusion of content by filename pattern
> >>         2. determine the inclusion of content by MIME type pattern
> >>      2. no --
> >>         1. determine the exclusion of content by filename pattern
> >>         2. determine the exclusion of content by MIME type pattern
> >>
> >> Does this work?
> >>
> >> Thanks,
> >> - Dmitry
> >>
> >>
> >> On Thu, Mar 31, 2016 at 9:27 AM, Joe Skora <js...@gmail.com> wrote:
> >>
> >>> Dmitry,
> >>>
> >>> Looking at this and your prior email.
> >>>
> >>>
> >>>   1. I can see "extract metadata only" being as popular as "extract
> >>>   metadata and content".  It will all depend on the type of media: for
> >>>   audio/video files, adding the metadata to the flow file is enough, but
> >>>   for Word, PDF, etc. files the content may be wanted as well.
> >>>   2. After thinking about it, I agree on an enum for mode.
> >>>   3. I think any handling of zips or archive files should be handled by
> >>>   another processor; that keeps this processor cleaner and improves its
> >>>   ability for re-use.
> >>>   4. I like the addition of exclude filters, but I'm not sure about
> >>>   adding content filters.  We will only have a mimetype for the
> >>>   original flow file itself, so I'm not sure about the metadata
> >>>   mimetype filter.  I think content filtering may be best left for
> >>>   another downstream processor, but it might run faster if included
> >>>   here since the entire content will be handled during extraction.  If
> >>>   the content filters are implemented, for performance they need to
> >>>   short circuit so that if the property is not set or is set to ".*"
> >>>   they don't evaluate the regex (see the sketch after this list).
> >>>      1. FILENAME_FILTER - selects flow files to process based on
> >>>         filename matching regex. (exists)
> >>>      2. MIMETYPE_FILTER - selects flow files to process based on
> >>>         mimetype matching regex. (exists)
> >>>      3. FILENAME_EXCLUDE - excludes already selected flow files from
> >>>         processing based on filename matching regex. (new)
> >>>      4. MIMETYPE_EXCLUDE - excludes already selected flow files from
> >>>         processing based on mimetype matching regex. (new)
> >>>      5. CONTENT_FILTER (optional) - selects flow files for output based
> >>>         on extracted content matching regex. (new)
> >>>      6. CONTENT_EXCLUDE (optional) - excludes flow files from output
> >>>         based on extracted content matching regex. (new)
> >>>   5. As indicated in the descriptions in #4, I don't think overlapping
> >>>   filters are an error; instead, excludes should take precedence over
> >>>   includes.  Then I can include a domain (like A*) but exclude sub-sets
> >>>   (like AXYZ*).
> >>>
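> >>> For the short-circuit in #4, something like this (a sketch; the
> >>> property name and the precompiled Pattern are assumptions):
> >>>
> >>>     final String filter = context.getProperty(CONTENT_FILTER).getValue();
> >>>     if (filter == null || filter.isEmpty() || ".*".equals(filter)) {
> >>>         return true;  // nothing to evaluate, keep the flow file
> >>>     }
> >>>     return contentPattern.matcher(content).find();
> >>>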
> >>> I'm sure there's something we missed, but I think that covers most of
> it.
> >>>
> >>> Regards,
> >>> Joe
> >>>
> >>>
> >>> On Wed, Mar 30, 2016 at 1:56 PM, Dmitry Goldenberg <
> >>> dgoldenberg@hexastax.com
> >>>> wrote:
> >>>
> >>>> Joe,
> >>>>
> >>>> Upon some thinking, I've started wondering whether all the cases can
> >>>> be covered by the following filters:
> >>>>
> >>>> INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input
> >>>> files get their content extracted, by file name
> >>>> INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which
> >>>> input files get their metadata extracted, by file name
> >>>> INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input
> >>>> files get their content extracted, by MIME type
> >>>> INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which
> >>>> input files get their metadata extracted, by MIME type
> >>>>
> >>>> EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input
> >>>> files do NOT get their content extracted, by file name
> >>>> EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which
> >>>> input files do NOT get their metadata extracted, by file name
> >>>> EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input
> >>>> files do NOT get their content extracted, by MIME type
> >>>> EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which
> >>>> input files do NOT get their metadata extracted, by MIME type
> >>>>
> >>>> I believe this gets all the bases covered.  At processor init time, we
> >>>> can analyze the inclusions vs. exclusions; any overlap would cause a
> >>>> configuration error.
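> >>>>
> >>>> In NiFi terms that check would live in customValidate(); a naive
> >>>> sketch (real regex-overlap analysis is much harder than the literal
> >>>> comparison shown here):
> >>>>
> >>>>     @Override
> >>>>     protected Collection<ValidationResult> customValidate(final ValidationContext context) {
> >>>>         final List<ValidationResult> results = new ArrayList<>();
> >>>>         final String include = context.getProperty(INCLUDE_CONTENT_FILENAME_FILTER).getValue();
> >>>>         final String exclude = context.getProperty(EXCLUDE_CONTENT_FILENAME_FILTER).getValue();
> >>>>         if (include != null && include.equals(exclude)) {
> >>>>             results.add(new ValidationResult.Builder()
> >>>>                 .subject(EXCLUDE_CONTENT_FILENAME_FILTER.getName())
> >>>>                 .valid(false)
> >>>>                 .explanation("include and exclude patterns overlap")
> >>>>                 .build());
> >>>>         }
> >>>>         return results;
> >>>>     }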
> >>>>
> >>>> Let me know what you think, thanks.
> >>>> - Dmitry
> >>>>
> >>>> On Wed, Mar 30, 2016 at 10:41 AM, Dmitry Goldenberg <
> >>>> dgoldenberg@hexastax.com> wrote:
> >>>>
> >>>>> Hi Joe,
> >>>>>
> >>>>> I follow your reasoning on the semantics of "media".  One might argue
> >>>>> that media files are a case of "document" or that a document is a
> >>>>> case of "media".
> >>>>>
> >>>>> I'm not proposing filters for the mode of processing, I'm proposing a
> >>>>> flag/enum with 3 values:
> >>>>>
> >>>>> A) extract metadata only;
> >>>>> B) extract content only and place it into the flowfile content;
> >>>>> C) extract both metadata and content.
> >>>>>
> >>>>> I think the default should be C, to extract both.  At least in my
> >>>>> experience, most flows I've dealt with were interested in extracting
> >>>>> both.
> >>>>>
> >>>>> I don't see how this mode would benefit from being expression-driven.
> >>>>>
> >>>>> I think we can add this enum mode and have the basic use case covered.
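> >>>>>
> >>>>> As a property, the mode could look roughly like this (a sketch; the
> >>>>> name is assumed):
> >>>>>
> >>>>>     static final PropertyDescriptor EXTRACTION_MODE = new PropertyDescriptor.Builder()
> >>>>>         .name("Extraction Mode")
> >>>>>         .allowableValues("metadataOnly", "contentOnly", "metadataAndContent")
> >>>>>         .defaultValue("metadataAndContent")
> >>>>>         .required(true)
> >>>>>         .build();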
> >>>>>
> >>>>> Additionally, further down the line, I was thinking we could ponder
> >>>>> the following (these have been essential in search engine ingestion):
> >>>>>
> >>>>>   1. Extraction from compressed files/archives. How would
> >>>>>   UnpackContent work with ExtractMediaAttributes? Use-case being,
> >>>>>   we've got a zip file as input and want to crack it open and unravel
> >>>>>   it recursively; it may have other, nested zips inside, along with
> >>>>>   other documents. One way to handle this is to treat the whole
> >>>>>   archive as one document and merge all attributes into one FlowFile.
> >>>>>   The other way would be to treat each archive entry as its own flow
> >>>>>   file and keep a pointer back at the parent archive.  Yet another
> >>>>>   case is when the user might want to only extract the 'leaf' entries
> >>>>>   and discard any parent container archives.
> >>>>>
> >>>>>   2. Attachments and embeddings. Users may want to treat any attached
> >>>>>   or embedded files as separate flowfiles with perhaps pointers back
> >>>>>   to the parent files. This definitely warrants a filter. Oftentimes
> >>>>>   Office documents have 'media' embeddings which are often not of
> >>>>>   interest, especially for the case of ingesting into a search engine.
> >>>>>
> >>>>>   3. PDF. For PDFs, we can do OCR. This is important for the
> >>>>>   'image'/scanned PDFs for which Tika won't extract text.
> >>>>>
> >>>>> I'd like to understand how much of this is already supported in NiFi
> >>>>> and, if not, I'd volunteer/collaborate to implement some of this.
> >>>>>
> >>>>> - Dmitry
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 30, 2016 at 9:03 AM, Joe Skora <js...@gmail.com> wrote:
> >>>>>
> >>>>>> Dmitry,
> >>>>>>
> >>>>>> Are you proposing separate filters that determine the mode of
> >>>>>> processing, metadata/content/metadataAndContent?  I was thinking of
> >>>>>> one set of selection filters and a static mode switch at the
> >>>>>> processor instance level, to make configuration more obvious, such
> >>>>>> that one instance of the processor will handle a known set of files
> >>>>>> regardless of the processing mode.
> >>>>>>
> >>>>>> I was thinking it would be useful for the mode switch to support
> >>>>>> expression language, but I'm not sure about that since the selection
> >>>>>> filters will control what files get processed and it would be harder
> >>>>>> to configure if the output flow file could vary between source
> >>>>>> format and extracted text.  So, while it might be easy to do, and
> >>>>>> occasionally useful, I think in normal use I'd never have a varying
> >>>>>> mode but would more likely have multiple processor instances with
> >>>>>> some routing or selection going on further upstream.
> >>>>>>
> >>>>>> I wrestled with the naming issue too.  I went with
> >>>>>> "ExtractMediaAttributes" over "ExtractDocumentAttributes" because it
> >>>>>> seemed to represent the broader context better.  In reality, media
> >>>>>> files are documents and documents are media files, but in the end
> >>>>>> it's all just semantics.
> >>>>>>
> >>>>>> I don't think I would change the NAR bundle name, because I think
> >>>>>> "nifi-media-nar" establishes it as a place to collect this and other
> >>>>>> media-related processors in the future.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Joe
> >>>>>>
> >>>>>> On Tue, Mar 29, 2016 at 3:09 PM, Dmitry Goldenberg <
> >>>>>> dgoldenberg@hexastax.com
> >>>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Joe,
> >>>>>>>
> >>>>>>> Thanks for all the details.
> >>>>>>>
> >>>>>>> I wanted to propose that I do some of this work so as to go through
> >>>>>>> the full cycle of developing a processor and committing it.
> >>>>>>>
> >>>>>>> Once your changes are merged, I could extend your
> >>>>>>> 'ExtractMediaMetadata' processor to handle the content, in addition
> >>>>>>> to the metadata.
> >>>>>>>
> >>>>>>> We could keep the FILENAME_FILTER and MIMETYPE_FILTER but add a
> >>>>>>> mode with 3 values: metadataOnly, contentOnly, metadataAndContent.
> >>>>>>>
> >>>>>>> One thing that looks to be a design issue right now: your changes
> >>>>>>> and the 'nomenclature' seem media-oriented ("nifi-media-nar", etc.).
> >>>>>>>
> >>>>>>> Would it make sense to have a generic processor
> >>>>>>> ExtractDocumentMetadataAndContent?  Are there enough specifics in
> >>>>>>> the image/video processing stuff to warrant a separate layer,
> >>>>>>> perhaps a subclass of ExtractDocumentMetadataAndContent?  Might it
> >>>>>>> make sense to rename nifi-media-nar to nifi-text-extract-nar?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> - Dmitry
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora <js...@gmail.com>
> >>> wrote:
> >>>>>>>
> >>>>>>>> Dmitry,
> >>>>>>>>
> >>>>>>>> Yeah, I agree, Tika is pretty impressive.  The original ticket,
> >>>>>>>> NIFI-615 <https://issues.apache.org/jira/browse/NIFI-615>, wanted
> >>>>>>>> extraction of metadata from WAV files, but as I got into it I
> >>>>>>>> found Tika, so for the same effort it supports the 1,000+ file
> >>>>>>>> formats Tika understands.  That new processor is called
> >>>>>>>> "ExtractMediaMetadata"; you can pull PR-252
> >>>>>>>> <https://github.com/apache/nifi/pull/252> from GitHub if you want
> >>>>>>>> to give it a try before it's merged.
> >>>>>>>>
> >>>>>>>> Extracting content for those 1,000+ formats would be a valuable
> >>>>>>>> addition.  I see two possible approaches: 1) create a new
> >>>>>>>> "ExtractMediaContent" processor that would put the document
> >>>>>>>> content in a new flow file, and 2) extend the new
> >>>>>>>> "ExtractMediaMetadata" processor so it can extract metadata,
> >>>>>>>> content, or both.  One combined processor makes sense if it can
> >>>>>>>> provide a performance gain; otherwise two complementary processors
> >>>>>>>> may make usage easier.
> >>>>>>>>
> >>>>>>>> I'm glad to help if you want to take a cut at the processor
> >>>>>>>> yourself, or I can take a crack at it myself if you'd prefer.
> >>>>>>>>
> >>>>>>>> Don't hesitate to ask questions or share comments and feedback
> >>>>>>>> regarding the ExtractMediaMetadata processor or the addition of
> >>>>>>>> content handling.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Joe Skora
> >>>>>>>>
> >>>>>>>> On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <
> >>>>>>>> dgoldenberg@hexastax.com> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks, Joe!
> >>>>>>>>>
> >>>>>>>>> Hi Joe S. - I'm definitely up for discussing and contributing.
> >>>>>>>>>
> >>>>>>>>> While building search-related ingestion systems, I've seen
> >>>>>>>>> metadata and text extraction being done all the time; it's always
> >>>>>>>>> there and always has to be done for building search indexes.
> >>>>>>>>> Beyond that, OCR-related capabilities are often requested, and
> >>>>>>>>> the advantage of Tika is that it supports OCR out of the box.
> >>>>>>>>>
> >>>>>>>>> - Dmitry
> >>>>>>>>>
> >>>>>>>>> On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <
> >> joe.witt@gmail.com>
> >>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Dmitry,
> >>>>>>>>>>
> >>>>>>>>>> Another community member (Joe Skora) has a PR outstanding for
> >>>>>>>>>> extracting metadata from media files using Tika.  Perhaps it
> >>>>>>>>>> makes sense to broaden that to extract, in general, what Tika
> >>>>>>>>>> can find.  Joe - perhaps you can discuss your ideas with Dmitry
> >>>>>>>>>> and see if broadening is a good idea or if domain-specific
> >>>>>>>>>> processors make more sense.
> >>>>>>>>>>
> >>>>>>>>>> This concept of extracting metadata from documents/text files,
> >>>>>>>>>> etc., using something like Tika is certainly useful, as that
> >>>>>>>>>> can then drive nice automated routing decisions.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> Joe
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg
> >>>>>>>>>> <dg...@hexastax.com> wrote:
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> I see that the ExtractText processor extracts text using regex.
> >>>>>>>>>>>
> >>>>>>>>>>> What about a processor that extracts text and metadata from
> >>>>>>>>>>> incoming files?  That doesn't seem to exist - but perhaps I
> >>>>>>>>>>> didn't quite look in the right spots.
> >>>>>>>>>>>
> >>>>>>>>>>> If that doesn't exist, I'd like to implement and commit it,
> >>>>>>>>>>> using Apache Tika.  There may also be a couple of related
> >>>>>>>>>>> processors for that.
> >>>>>>>>>>>
> >>>>>>>>>>> Thoughts?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> - Dmitry
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: Text and metadata extraction processor

Posted by Dmitry Goldenberg <dg...@hexastax.com>.
Got it.

What's the typical JIRA ticket triage process like within NiFi?  I'm
curious as to how consensus is built around designs, ticket assignments,
and what goes into a release.

On Fri, Apr 1, 2016 at 10:33 AM, Mark Payne <ma...@hotmail.com> wrote:

> As far as I know, the processors haven't made it into any release yet. If
> that is the case, then we could just remove those properties altogether
> and it's easy.
>
> If they have already been released, then we would need to ensure that the
> processor is invalid on startup (it doesn't accept those as dynamic
> properties) and then we update the migration guide to explain how to
> obtain the same behavior.
>
> But either way, we can definitely remove the properties if it's determined
> that there is not a good enough reason to keep them in.
>
> -Mark
>
>
> > On Apr 1, 2016, at 10:10 AM, Dmitry Goldenberg <dg...@hexastax.com>
> wrote:
> >
> > Hi Mark,
> >
> > That is a good point.  It also has crossed my mind.  AFAIK,
> > ExtractMediaAttributes already has a couple of similar filters on it; Joe
> > S., please correct me if I'm wrong.  I merely suggested that we extend
> > these filters.
> >
> > I'd have to agree with your points, Mark, that it's cleaner to keep the
> > conditionals separate, on RouteOnAttribute and the like.
> >
> > If that is the consensus, then I believe we're back to the idea of a
> > "mode" configuration on ExtractMediaAttributes, with 3 values: a)
> > extractMetadataOnly, b) extractContentOnly, c) extractMetadataAndContent.
> > As an alternative, we have also considered rolling 3 separate processors:
> > ExtractMetadata, ExtractContent, and ExtractMetadataAndContent.  Given
> > that ExtractMediaAttributes already exists, I think it may be easiest to
> > roll with the new "mode" config parameter.
> >
> > One question, then: what to do with the filters that are already on
> > ExtractMediaAttributes?  Should they still be there?
> >
> > BTW, I've filed the following JIRA tickets related to the topics we've
> > been discussing:
> >
> > Extract metadata and text - NIFI-1717
> > <https://issues.apache.org/jira/browse/NIFI-1717>
> > PerformOCR - NIFI-1718 <https://issues.apache.org/jira/browse/NIFI-1718>
> > ProcessPDF - NIFI-1719 <https://issues.apache.org/jira/browse/NIFI-1719>
> >
> > I'll propagate more info into those as we discuss things more.
> >
> > Mark, could you take a look at NIFI-1716
> > <https://issues.apache.org/jira/browse/NIFI-1716>?  This is a separate
> > topic, so we could create a separate discussion thread for the CSV
> > splitter.
> >
> > Thanks,
> > - Dmitry
> >
> >
> > On Fri, Apr 1, 2016 at 9:06 AM, Mark Payne <ma...@hotmail.com> wrote:
> >
> >> Dmitry,
> >>
> >> I would be a bit concerned about providing options for filters that
> >> include and exclude certain things. I believe that if you send a
> >> FlowFile to the Processor, then the Processor should do its thing. If
> >> you want to filter out which FlowFiles have their content extracted,
> >> for example, I would suggest using a Processor like RouteOnAttribute to
> >> ensure that only the appropriate FlowFiles are processed by the
> >> ExtractMediaMetadata processor.
> >>
> >> This allows the metadata extraction processor to focus purely on
> >> extracting metadata and doesn't have to deal with all of the logic of
> >> filtering things out. The logic for filtering things out is almost
> >> guaranteed to grow much more complex as people start to use this more
> >> and more. NiFi already provides several route-based processors to allow
> >> for a great deal of flexibility with this type of logic
> >> (RouteOnAttribute, RouteOnContent, ScanAttribute, ScanContent, etc.).
> >>
> >> Thanks
> >> -Mark
> >>
> >>
> >>
> >>> On Apr 1, 2016, at 12:55 AM, Dmitry Goldenberg <
> dgoldenberg@hexastax.com>
> >> wrote:
> >>>
> >>> Simon,
> >>>
> >>> I believe we've moved on past the 'mode' option and have now switched
> >>> to talking about how the include/exclude filters - for metadata and
> >>> content on the one hand, and filename- or MIME-type-based on the other
> >>> - would drive whether meta, content, or both would get extracted.
> >>>
> >>> For example, a user could configure the ExtractMediaAttributes
> >>> processor to extract metadata for all image files (but not content),
> >>> extract content only for plain text documents (but no metadata), or
> >>> both meta and content for documents with an extension ".pqr", based on
> >>> the filename.
> >>>
> >>> Could you elaborate on your vision of how relationships could "drive"
> >>> this type of functionality?  Joe has already built some of the
> >>> filtering into the processor; I just suggested extending that further,
> >>> and then we get all the bases covered.
> >>>
> >>> I'm not sure I followed your comment on the extracted content being
> >>> transferred into a new FlowFile.  My thoughts were that the extracted
> >>> content would be inserted into a new, dedicated attribute called, for
> >>> example, "text", on *the same* FlowFile.  I imagine that for a lot of
> >>> use-cases, especially data ingestion into a search engine, the
> >>> extracted attributes *and* the extracted text must travel together as
> >>> part of the ingested document, with the original flowfile content most
> >>> likely getting dropped on the way into the index.
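> >>>
> >>> In code that could be as simple as (a sketch; the "text" attribute
> >>> name is just a suggestion, and very large bodies of text may be better
> >>> kept in the flow file content, since attributes are held in memory):
> >>>
> >>>     flowFile = session.putAttribute(flowFile, "text", extractedText);
> >>>     flowFile = session.putAllAttributes(flowFile, tikaMetadataMap);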
> >>>
> >>> I guess an alternative could be to have an option to represent the
> >>> extraction results as a new document, an option to drop the original,
> >>> and an option to copy the original's attributes onto the new doc.
> >>> Seems rather complex.  I like the "in-place" extraction.
> >>>
> >>> Could you also elaborate on how a controller service would handle OCR?
> >>> When a document floats into ExtractMediaAttributes, assuming Tesseract
> >>> is installed properly, Tika will already automatically fire off OCR,
> >>> unless we turn that off and cause OCR to only be supported via this
> >>> service.  I'm tempted to say why don't we just let Tika do its job for
> >>> all cases, OCR included.  The caveat being that OCR is expensive, and
> >>> it would be nice to have ways of ensuring it has enough resources and
> >>> doesn't bog the flow down.
> >>>
> >>> For the PDF processor, I'm thinking, yes, PDFBox to break it up into
> >>> pages and then apply Tika page by page, then aggregate the output
> >>> together, with a configurable max of up to N pages per document to
> >>> process (due to how slow OCR is).  I already have a prototype of this
> >>> going; I'll file a JIRA ticket for this feature.
> >>>
> >>> - Dmitry
> >>>
> >>>
> >>>
> >>> On Thu, Mar 31, 2016 at 8:43 PM, Simon Ball <sb...@hortonworks.com>
> >> wrote:
> >>>
> >>>> What I’m suggesting is a single processor for both, but instead of
> >>>> using a mode property to determine which bits get extracted, you use
> >>>> the state of the relations on the processor to configure which options
> >>>> tika uses, making a single pass to parse metadata into attributes and
> >>>> content into a new flow file transferred to the parsed relation.
> >>>>
> >>>> On the tesseract front, it may make sense to do this through a
> >>>> controller service.
> >>>>
> >>>> A PDF processor might be interesting. Are you thinking of something
> >>>> like PDFBox, or tika again?
> >>>>
> >>>> Simon

Re: Text and metadata extraction processor

Posted by Mark Payne <ma...@hotmail.com>.
As far as I know, the processors haven't made it into any release yet. If that is the case,
then we could just remove those properties altogether and it's easy.

If they have already been released, then we would need to ensure that the processor
is invalid on startup (it doesn't accept those as dynamic properties) and then we update
the migration guide to explain how to obtain the same behavior.

But either way, we can definitely remove the properties if it's determined that there is not
a good enough reason to keep them in.

-Mark


> On Apr 1, 2016, at 10:10 AM, Dmitry Goldenberg <dg...@hexastax.com> wrote:
> 
> Hi Mark,
> 
> That is a good point.  It also has crossed my mind.  AFAIK,
> ExtractMediaAttributes already has a couple of similar filters on it; Joe
> S., please correct me if I'm wrong.  I merely suggested that we extend
> these filters.
> 
> I'd have to agree with your points, Mark, that it's cleaner to keep the
> conditionals separate, on RouteOnAttribute and the like.
> 
> If that is the consensus then I believe we're back to the idea of a "mode"
> configuration on ExtractMediaAttributes, with 3 values: a)
> extractMetadataOnly, b) extractContentOnly, c) extractMetadataAndContent.
> As an alternative we have also considered rolling 3 separate processors:
> ExtractMetadata, ExtractContent, and ExtractMetadataAndContent.  Given that
> ExtractMediaAttributes already exists, I think it may be easiest to roll
> with the new "mode" config parameter.
> 
> One question then is also, what to do with the filters that are already on
> ExtractMediaAttributes - ?  Should they still be there?
> 
> BTW, I've filed the following JIRA tickets related to the topics we've been
> discussing:
> 
> Extract metadata and text - NIFI1717
> <https://issues.apache.org/jira/browse/NIFI-1717>
> PerformOCR - NIFI1718 <https://issues.apache.org/jira/browse/NIFI-1718>
> ProcessPDF - NIFI1719 <https://issues.apache.org/jira/browse/NIFI-1719>
> 
> I'll propagate more info into those as we discuss things more.
> 
> Mark, could you take a look at: NIFI1716
> <https://issues.apache.org/jira/browse/NIFI-1716>.  This is a separate
> topic so we could create a separate discussion thread for the CSV splitter.
> 
> Thanks,
> - Dmitry
> 
> 
> On Fri, Apr 1, 2016 at 9:06 AM, Mark Payne <ma...@hotmail.com> wrote:
> 
>> Dmitry,
>> 
>> I would be a bit concerned about providing options for filters that
>> include and
>> exclude certain things. I believe that if you send a FlowFile to the
>> Processor,
>> then the Processor should do its thing. If you want to filter out which
>> FlowFiles
>> have their content extracted, for example, I would suggest using a
>> Processor
>> like RouteOnAttribute to ensure that only the appropriate FlowFiles are
>> processed
>> by the ExtractMediaMetadata processor.
>> 
>> This allows the metadata extraction processor to focus purely on extracting
>> metadata and doesn't have to deal with all of the logic of filtering
>> things out. The logic
>> for filtering things out is almost guaranteed to grow much more complex as
>> people
>> start to use this more and more. NiFi already provides several route-based
>> processors
>> to allow for a great deal of flexibility with this type of logic
>> (RouteOnAttribute, RouteOnContent,
>> ScanAttribute, ScanContent, etc.).
>> 
>> Thanks
>> -Mark
>> 


Re: Text and metadata extraction processor

Posted by Dmitry Goldenberg <dg...@hexastax.com>.
Hi Mark,

That's a good point; it has crossed my mind as well.  AFAIK,
ExtractMediaAttributes already has a couple of similar filters on it; Joe
S., please correct me if I'm wrong.  I merely suggested that we extend
those filters.

I'd have to agree with your points, Mark, that it's cleaner to keep the
conditionals separate, on RouteOnAttribute and the like.

If that is the consensus then I believe we're back to the idea of a "mode"
configuration on ExtractMediaAttributes, with 3 values: a)
extractMetadataOnly, b) extractContentOnly, c) extractMetadataAndContent.
As an alternative we have also considered rolling 3 separate processors:
ExtractMetadata, ExtractContent, and ExtractMetadataAndContent.  Given that
ExtractMediaAttributes already exists, I think it may be easiest to roll
with the new "mode" config parameter.
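
To make that concrete, here is a minimal sketch of what the "mode" property
and the single Tika pass behind it might look like (property values, method
and attribute names here are illustrative only, not what's in Joe's PR):

    import java.io.InputStream;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.nifi.components.PropertyDescriptor;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.helpers.DefaultHandler;

    // Hypothetical "mode" property with the three values discussed.
    static final PropertyDescriptor EXTRACTION_MODE = new PropertyDescriptor.Builder()
            .name("Extraction Mode")
            .allowableValues("extractMetadataOnly", "extractContentOnly",
                             "extractMetadataAndContent")
            .defaultValue("extractMetadataAndContent")
            .required(true)
            .build();

    // Whatever the mode, Tika makes one pass over the file bytes; the mode
    // only decides which results are kept.
    static Map<String, String> extract(InputStream in, boolean wantMetadata,
                                       boolean wantContent) throws Exception {
        BodyContentHandler body = new BodyContentHandler(-1); // -1 lifts the default write limit
        ContentHandler handler = wantContent ? body : new DefaultHandler(); // DefaultHandler discards text
        Metadata metadata = new Metadata();
        new AutoDetectParser().parse(in, handler, metadata);

        Map<String, String> attributes = new HashMap<>();
        if (wantMetadata) {
            for (String name : metadata.names()) {
                attributes.put("media." + name, metadata.get(name));
            }
        }
        if (wantContent) {
            attributes.put("text", body.toString()); // or write to the flowfile content instead
        }
        return attributes;
    }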

One remaining question: what should we do with the filters that are already
on ExtractMediaAttributes?  Should they stay?

BTW, I've filed the following JIRA tickets related to the topics we've been
discussing:

Extract metadata and text - NIFI-1717
<https://issues.apache.org/jira/browse/NIFI-1717>
PerformOCR - NIFI-1718 <https://issues.apache.org/jira/browse/NIFI-1718>
ProcessPDF - NIFI-1719 <https://issues.apache.org/jira/browse/NIFI-1719>

I'll propagate more info into those as we discuss things more.
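
For NIFI-1719, the 'text' vs. 'scanned' determination could start out as a
crude heuristic along these lines (a sketch against the PDFBox 2.x API; the
threshold is a made-up knob that would need tuning):

    import java.io.InputStream;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // If a PDF yields almost no text relative to its page count, treat it
    // as scanned and route its pages to OCR instead.
    static boolean isScanned(InputStream in) throws Exception {
        final int MIN_CHARS_PER_PAGE = 10; // hypothetical threshold
        try (PDDocument doc = PDDocument.load(in)) {
            String text = new PDFTextStripper().getText(doc);
            return text.trim().length() < MIN_CHARS_PER_PAGE * doc.getNumberOfPages();
        }
    }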

Mark, could you take a look at NIFI-1716
<https://issues.apache.org/jira/browse/NIFI-1716>?  This is a separate
topic, so we could create a separate discussion thread for the CSV splitter.

Thanks,
- Dmitry


On Fri, Apr 1, 2016 at 9:06 AM, Mark Payne <ma...@hotmail.com> wrote:

> Dmitry,
>
> I would be a bit concerned about providing options for filters that
> include and
> exclude certain things. I believe that if you send a FlowFile to the
> Processor,
> then the Processor should do its thing. If you want to filter out which
> FlowFiles
> have their content extracted, for example, I would suggest using a
> Processor
> like RouteOnAttribute to ensure that only the appropriate FlowFiles are
> processed
> by the ExtractMediaMetadata processor.
>
> This allows the metadata extraction processor to focus purely on extracting
> metadata and doesn't have to deal with all of the logic of filtering
> things out. The logic
> for filtering things out is almost guaranteed to grow much more complex as
> people
> start to use this more and more. NiFi already provides several route-based
> processors
> to allow for a great deal of flexibility with this type of logic
> (RouteOnAttribute, RouteOnContent,
> ScanAttribute, ScanContent, etc.).
>
> Thanks
> -Mark
>
>
>
> > On Apr 1, 2016, at 12:55 AM, Dmitry Goldenberg <dg...@hexastax.com>
> wrote:
> >
> > Simon,
> >
> > I believe we've moved on past the 'mode' option and have now switched to
> > talking about how the include/exclude filters, for metadata and content, on
> > the one hand, and filename- or MIME-type-based, on the other, would drive
> > whether meta, content, or both would get extracted.
> >
> > For example, a user could configure the ExtractMediaAttributes processor
> to
> > extract metadata for all image files (but not content), extract content
> > only for plain text documents (but no metadata), or both meta and content
> > for documents with an extension ".pqr", based on the filename.
> >
> > Could you elaborate on your vision of how relationships could "drive"
> this
> > type of functionality?  Joe has already built some of the filtering into
> > the processor; I just suggested to extend that further, and we get all
> the
> > bases covered.
> >
> > I'm not sure I followed your comment on the extracted content being
> > transferred into a new FlowFile.  My thoughts were that the extracted
> > content would be inserted into a new, dedicated field, called for
> example,
> > "text", on *the same* FlowFile.  I imagine that for a lot of use-cases,
> > especially data ingestion into a search engine, the extracted attributes
> > *and* the extracted text must travel together as part of the ingested
> > document, with the original flowfile-content most likely getting dropped
> on
> > the way into the index.
> >
> > I guess an alternative could be to have an option to represent the
> > extraction results as a new document, and an option to drop the original,
> > and an option to copy the original's attributes onto the new doc. Seems
> > rather complex.  I like the "in-place" extraction.
> >
> > Could you also elaborate on how a controller service would handle OCR?
> > When a document floats into ExtractMediaAttributes, assuming Tesseract is
> > installed properly, Tika will already automatically fire off OCR.  Unless
> > we turn that off and cause OCR to only be supported via this service.
> I'm
> > tempted to say why don't we just let Tika do its job for all cases, OCR
> > included.  Caveat being that OCR is expensive and it would be nice to
> have
> > ways of ensuring it has enough resources and doesn't bog the flow down.
> >
> > For the PDF processor, I'm thinking, yes, PDFBox to break it up into
> pages
> > and then apply Tika page by page, then aggregate the output together,
> with
> > a configurable max of up to N pages per document to process (due to how
> > slow OCR is).  I already have a prototype of this going, I'll file a JIRA
> > ticket for this feature.
> >
> > - Dmitry
> >
> >
> >
> > On Thu, Mar 31, 2016 at 8:43 PM, Simon Ball <sb...@hortonworks.com>
> wrote:
> >
> >> What I’m suggesting is a single processor for both, but instead of
> using a
> >> mode property to determine which bits get extracted, you use the state
> of
> >> the relations on the processor to configure which options tika uses and
> >> using a single pass to actually parse metadata into attributes, and
> content
> >> into a new flow file transfer into the parsed relation.
> >>
> >> On the tesseract front, it may make sense to do this through a
> controller
> >> service.
> >>
> >> A PDF processor might be interesting. Are you thinking of something like
> >> PDFBox, or tika again?
> >>
> >> Simon
> >>
> >>
> >>> On 1 Apr 2016, at 01:30, Dmitry Goldenberg <dg...@hexastax.com>
> >> wrote:
> >>>
> >>> Simon,
> >>>
> >>> Interesting commentary.  The issue that Joe and I have both looked at,
> >> with
> >>> the splitting of metadata and content extraction, is that if they're
> >> split
> >>> then the underlying Tika extraction has to process the file twice: once
> >> to
> >>> pull out the attributes and once to pull out the content.  Perhaps it
> may
> >>> be good to add ExtractMetadata and ExtractTextContent in addition to
> >>> ExtractMediaAttributes - ? Seems kind of an overkill but I may be
> wrong.
> >>>
> >>> It seems prudent to provide one wholesome, out-of-the-box extractor
> >>> processor with options to extract just metadata, just content, or both
> >>> metadata and content.
> >>>
> >>> I think what I'm hearing is that we need to allow for checking
> somewhere
> >>> for whether text/content has already been extracted by the time we get
> to
> >>> the ExtractMediaAttributes processor - ?  If that is the issue then I
> >>> believe the user would use RouteOnAttribute and if the content is
> already
> >>> filled in then they'd not route to ExtractMediaAttributes.
> >>>
> >>> As far as the OCR.  Tika internally supports OCR by directing image
> files
> >>> to Tesseract (if Tesseract is installed and configured properly).
> We've
> >>> started talking about how this could be reconciled in the
> >>> ExtractMediaAttributes.
> >>>
> >>> I think that once we have the basic ExtractMediaAttributes, we could
> add
> >>> filters for what files to enable the OCR on, and we'd need to expose a
> >> few
> >>> config parameters specific to OCR, such as e.g. the location of the
> >>> Tesseract installation and the maximum file size on which to attempt
> the
> >>> OCR.  Perhaps there can also be a RunOCR processor which would be
> >> dedicated
> >>> to running OCR.  But since Tika already has OCR integrated we'd
> probably
> >>> want to take care of that in the ExtractMediaAttributes configuration.
> >>>
> >>> Additionally, I've proposed the idea of a ProcessPDF processor which
> >> would
> >>> ascertain whether a PDF is 'text' or 'scanned'. If scanned, we would
> >> break
> >>> it up into pages and run OCR on each page, then aggregate the extracted
> >>> text.
> >>>
> >>> - Dmitry
> >>>
> >>>
> >>>
> >>> On Thu, Mar 31, 2016 at 3:19 PM, Simon Ball <sb...@hortonworks.com>
> >> wrote:
> >>>
> >>>> Just a thought…
> >>>>
> >>>> To keep consistent with other Nifi Parse patterns, would it make sense
> >> to
> >>>> based the extraction of content on the presence of a relation. So your
> >> tika
> >>>> processor would have an original relation which would have meta data
> >>>> attached as attributed, and an extracted relation which would have the
> >>>> metadata and the processed content (text from OCRed image for
> example).
> >>>> That way you can just use context.hasConnection(relationship) to
> >> determine
> >>>> whether to enable the tika content processing.
> >>>>
> >>>> This seems more idiomatic than a mode flag.
> >>>>
> >>>> Simon
> >>>>
> >>>>> On 31 Mar 2016, at 19:48, Joe Skora <js...@gmail.com> wrote:
> >>>>>
> >>>>> Dmitry,
> >>>>>
> >>>>> I think we're good.  I was confused because "XXX_METADATA MIMETYPE
> >>>> FILTER"
> >>>>> entries referred to some MIME type of the metadata, but you meant to
> >> use
> >>>>> the file's MIME type to select what files have metadata extracted.
> >>>>>
> >>>>> Sorry, about that, I think we are on the same page.
> >>>>>
> >>>>> Joe
> >>>>>
> >>>>> On Thu, Mar 31, 2016 at 11:40 AM, Dmitry Goldenberg <
> >>>>> dgoldenberg@hexastax.com> wrote:
> >>>>>
> >>>>>> Hi Joe,
> >>>>>>
> >>>>>> I think if we have the filters in place then there's no need for the
> >>>> 'mode'
> >>>>>> enum, as the filters themselves guide the processor in deciding
> >> whether
> >>>>>> metadata and/or content is extracted for a given input file.
> >>>>>>
> >>>>>> Agreed on the handling of archives as a separate processor
> (template,
> >>>> seems
> >>>>>> like).
> >>>>>>
> >>>>>> I think it's easiest to do both metadata and/or content in one
> >> processor
> >>>>>> since it can tell Tika whether to extract metadata and/or content,
> in
> >>>> one
> >>>>>> pass over the file bytes (as you pointed out).
> >>>>>>
> >>>>>> Agreed on the exclusions trumping inclusions; I think that makes
> >> sense.
> >>>>>>
> >>>>>>>> We will only have a mimetype for the original flow file itself so
> >> I'm
> >>>>>> not sure about the metadata mimetype filter.
> >>>>>>
> >>>>>> I'm not sure where there might be an issue here. The metadata MIME
> >> type
> >>>>>> filter tells the processor for which MIME types to perform the
> >> metadata
> >>>>>> extraction.  For instance, extract metadata for images and videos,
> >> only.
> >>>>>> This could possibly be coupled with an exclusion filter for content
> >> that
> >>>>>> says, don't try to extract content from images and videos.
> >>>>>>
> >>>>>> I think with the six filters we get all the bases covered:
> >>>>>>
> >>>>>> 1. include metadata? --
> >>>>>>    1. yes --
> >>>>>>       1. determine the inclusion of metadata by filename pattern
> >>>>>>       2. determine the inclusion of metadata by MIME type pattern
> >>>>>>    2. no --
> >>>>>>       1. determine the exclusion of metadata by filename pattern
> >>>>>>       2. determine the exclusion of metadata by MIME type pattern
> >>>>>> 2. include content? --
> >>>>>>    1. yes --
> >>>>>>       1. determine the inclusion of content by filename pattern
> >>>>>>       2. determine the inclusion of content by MIME type pattern
> >>>>>>    2. no --
> >>>>>>       1. determine the exclusion of content by filename pattern
> >>>>>>       2. determine the exclusion of content by MIME type pattern
> >>>>>>
> >>>>>> Does this work?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> - Dmitry
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Mar 31, 2016 at 9:27 AM, Joe Skora <js...@gmail.com>
> wrote:
> >>>>>>
> >>>>>>> Dmitry,
> >>>>>>>
> >>>>>>> Looking at this and your prior email.
> >>>>>>>
> >>>>>>>
> >>>>>>> 1. I can see "extract metadata only" being as popular as "extract
> >>>>>>> metadata and content".  It will all depend on the type of media,
> for
> >>>>>>> audio/video files adding the metadata to the flow file is enough
> but
> >>>>>> for
> >>>>>>> Word, PDF, etc. files the content may be wanted as well.
> >>>>>>> 2. After thinking about it, I agree on an enum for mode.
> >>>>>>> 3. I think any handling of zips or archive files should be handled
> >> by
> >>>>>>> another processor, that keeps this processor cleaner and improves
> >> its
> >>>>>>> ability for re-use.
> >>>>>>> 4. I like the addition of exclude filters but I'm not sure about
> >>>>>> adding
> >>>>>>> content filters.  We will only have a mimetype for the original
> flow
> >>>>>>> file
> >>>>>>> itself so I'm not sure about the metadata mimetype filter.  I think
> >>>>>>> content
> >>>>>>> filtering may be best left for another downstream processor, but it
> >>>>>>> might
> >>>>>>> be run faster if included here since the entire content will be
> >>>>>> handled
> >>>>>>> during extraction.  If the content filters are implemented, for
> >>>>>>> performance
> >>>>>>> they need to short circuit so that if the property is not set or is
> >>>>>> set
> >>>>>>> to
> >>>>>>> ".*" they don't evaluate the regex.
> >>>>>>> 1. FILENAME_FILTER - selects flow files to process based on
> filename
> >>>>>>>    matching regex. (exists)
> >>>>>>>    2. MIMETYPE_FILTER - selects flow files to process based on
> >>>>>> mimetype
> >>>>>>>    matching regex. (exists)
> >>>>>>>    3. FILENAME_EXCLUDE - excludes already selected flow files from
> >>>>>>>    processing based on filename matching regex. (new)
> >>>>>>>    4. MIMETYPE_EXCLUDE - excludes already selected flow  files from
> >>>>>>>    processing based on mimetype matching regex. (new)
> >>>>>>>    5. CONTENT_FILTER (optional) - selects flow files for output
> >> based
> >>>>>> on
> >>>>>>>    extracted content matching regex. (new)
> >>>>>>>    6. CONTENT_EXCLUDE (optional) - excludes flow files from output
> >>>>>> based
> >>>>>>>    on extracted content matching regex. (new)
> >>>>>>> 5. As indicated in the descriptions in #4, I don't think
> overlapping
> >>>>>>> filters are an error, instead excludes should take precedence over
> >>>>>>> includes.  Then I can include a domain (like A*) but exclude
> >> sub-sets
> >>>>>>> (like
> >>>>>>> AXYZ*).
> >>>>>>>
> >>>>>>> I'm sure there's something we missed, but I think that covers most
> of
> >>>> it.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Joe
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Mar 30, 2016 at 1:56 PM, Dmitry Goldenberg <
> >>>>>>> dgoldenberg@hexastax.com
> >>>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Joe,
> >>>>>>>>
> >>>>>>>> Upon some thinking, I've started wondering whether all the cases
> can
> >>>> be
> >>>>>>>> covered by the following filters:
> >>>>>>>>
> >>>>>>>> INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which
> >> input
> >>>>>>>> files get their content extracted, by file name
> >>>>>>>> INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which
> >>>> input
> >>>>>>>> files get their metadata extracted, by file name
> >>>>>>>> INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which
> >> input
> >>>>>>>> files get their content extracted, by MIME type
> >>>>>>>> INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which
> >>>> input
> >>>>>>>> files get their metadata extracted, by MIME type
> >>>>>>>>
> >>>>>>>> EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which
> >> input
> >>>>>>>> files do NOT get their content extracted, by file name
> >>>>>>>> EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which
> >>>> input
> >>>>>>>> files do NOT get their metadata extracted, by file name
> >>>>>>>> EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which
> >> input
> >>>>>>>> files do NOT get their content extracted, by MIME type
> >>>>>>>> EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which
> >>>> input
> >>>>>>>> files do NOT get their metadata extracted, by MIME type
> >>>>>>>>
> >>>>>>>> I believe this gets all the bases covered. At processor init time,
> >> we
> >>>>>> can
> >>>>>>>> analyze the inclusions vs. exclusions; any overlap would cause a
> >>>>>>>> configuration error.
> >>>>>>>>
> >>>>>>>> Let me know what you think, thanks.
> >>>>>>>> - Dmitry
> >>>>>>>>
> >>>>>>>> On Wed, Mar 30, 2016 at 10:41 AM, Dmitry Goldenberg <
> >>>>>>>> dgoldenberg@hexastax.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Joe,
> >>>>>>>>>
> >>>>>>>>> I follow your reasoning on the semantics of "media".  One might
> >> argue
> >>>>>>>> that
> >>>>>>>>> media files are a case of "document" or that a document is a case
> >> of
> >>>>>>>>> "media".
> >>>>>>>>>
> >>>>>>>>> I'm not proposing filters for the mode of processing, I'm
> >> proposing a
> >>>>>>>>> flag/enum with 3 values:
> >>>>>>>>>
> >>>>>>>>> A) extract metadata only;
> >>>>>>>>> B) extract content only and place it into the flowfile content;
> >>>>>>>>> C) extract both metadata and content.
> >>>>>>>>>
> >>>>>>>>> I think the default should be C, to extract both.  At least in my
> >>>>>>>>> experience most flows I've dealt with were interested in
> extracting
> >>>>>>> both.
> >>>>>>>>>
> >>>>>>>>> I don't see how this mode would benefit from being expression
> >> driven
> >>>>>> -
> >>>>>>> ?
> >>>>>>>>>
> >>>>>>>>> I think we can add this enum mode and have the basic use case
> >>>>>> covered.
> >>>>>>>>>
> >>>>>>>>> Additionally, further down the line, I was thinking we could
> ponder
> >>>>>> the
> >>>>>>>>> following (these have been essential in search engine ingestion):
> >>>>>>>>>
> >>>>>>>>> 1. Extraction from compressed files/archives. How would
> >>>>>>> UnpackContent
> >>>>>>>>> work with ExtractMediaAttributes? Use-case being, we've got a zip
> >>>>>>>> file as
> >>>>>>>>> input and want to crack it open and unravel it recursively; it
> may
> >>>>>>>> have
> >>>>>>>>> other, nested zips inside, along with other documents. One way to
> >>>>>>>> handle
> >>>>>>>>> this is to treat the whole archive as one document and merge all
> >>>>>>>> attributes
> >>>>>>>>> into one FlowFile.  The other way would be to treat each archive
> >>>>>>>> entry as
> >>>>>>>>> its own flow file and keep a pointer back at the parent archive.
> >>>>>>> Yet
> >>>>>>>>> another case is when the user might want to only extract the
> >>>>>> 'leaf'
> >>>>>>>> entries
> >>>>>>>>> and discard any parent container archives.
> >>>>>>>>>
> >>>>>>>>> 2. Attachments and embeddings. Users may want to treat any
> >>>>>> attached
> >>>>>>> or
> >>>>>>>>> embedded files as separate flowfiles with perhaps pointers back
> to
> >>>>>>> the
> >>>>>>>>> parent files. This definitely warrants a filter. Oftentimes
> Office
> >>>>>>>>> documents have 'media' embeddings which are often not of
> interest,
> >>>>>>>>> especially for the case of ingesting into a search engine.
> >>>>>>>>>
> >>>>>>>>> 3. PDF. For PDF's, we can do OCR. This is important for the
> >>>>>>>>> 'image'/scanned PDF's for which Tika won't extract text.
> >>>>>>>>>
> >>>>>>>>> I'd like to understand how much of this is already supported in
> >> NiFi
> >>>>>>> and
> >>>>>>>>> if not I'd volunteer/collaborate to implement some of this.
> >>>>>>>>>
> >>>>>>>>> - Dmitry
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Mar 30, 2016 at 9:03 AM, Joe Skora <js...@gmail.com>
> >> wrote:
> >>>>>>>>>
> >>>>>>>>>> Dmitry,
> >>>>>>>>>>
> >>>>>>>>>> Are you proposing separate filters that determine the mode of
> >>>>>>>> processing,
> >>>>>>>>>> metadata/content/metadataAndContent?  I was thinking of one
> >>>>>> selection
> >>>>>>>>>> filters and a static mode switch at the processor instance
> level,
> >> to
> >>>>>>>> make
> >>>>>>>>>> configuration more obvious such that one instance of the
> processor
> >>>>>>> will
> >>>>>>>>>> handle a known set of files regardless of the processing mode.
> >>>>>>>>>>
> >>>>>>>>>> I was thinking it would be useful for the mode switch to support
> >>>>>>>>>> expression
> >>>>>>>>>> language, but I'm not sure about that since the selection
> filters
> >>>>>> will
> >>>>>>>>>> control what files get processed and it would be harder to
> >> configure
> >>>>>>> if
> >>>>>>>>>> the
> >>>>>>>>>> output flow file could vary between source format and extracted
> >>>>>> text.
> >>>>>>>> So,
> >>>>>>>>>> while it might be easy to do, and occasionally useful, I think
> in
> >>>>>>> normal
> >>>>>>>>>> use I'd never have a varying mode but would more likely have
> >>>>>> multiple
> >>>>>>>>>> processor instances with some routing or selection going on
> >> further
> >>>>>>>>>> upstream.
> >>>>>>>>>>
> >>>>>>>>>> I wrestled with the naming issue too.  I went with
> >>>>>>>>>> "ExtractMediaAttributes"
> >>>>>>>>>> over "ExtractDocumentAttributes" because it seemed to represent
> >> the
> >>>>>>>>>> broader
> >>>>>>>>>> context better.  In reality, media files and documents and
> >> documents
> >>>>>>> are
> >>>>>>>>>> media files, but in the end it's all just semantics.
> >>>>>>>>>>
> >>>>>>>>>> I don't think I would change the NAR bundle name, because I
> think
> >>>>>>>>>> "nifi-media-nar" establishes it as a place to collect this and
> >> other
> >>>>>>>> media
> >>>>>>>>>> related processors in the future.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Joe
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Mar 29, 2016 at 3:09 PM, Dmitry Goldenberg <
> >>>>>>>>>> dgoldenberg@hexastax.com
> >>>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Joe,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for all the details.
> >>>>>>>>>>>
> >>>>>>>>>>> I wanted to propose that I do some of this work so as to go
> >>>>>> through
> >>>>>>>> the
> >>>>>>>>>>> full cycle of developing a processor and committing it.
> >>>>>>>>>>>
> >>>>>>>>>>> Once your changes are merged, I could extend your
> >>>>>>>> 'ExtractMediaMetadata'
> >>>>>>>>>>> processor to handle the content, in addition to the metadata.
> >>>>>>>>>>>
> >>>>>>>>>>> We could keep the FILENAME_FILTER and MIMETYPE_FILTER but add a
> >>>>>> mode
> >>>>>>>>>> with 3
> >>>>>>>>>>> values: metadataOnly, contentOnly, metadataAndContent.
> >>>>>>>>>>>
> >>>>>>>>>>> One thing that looks to be a design issue right now is, your
> >>>>>> changes
> >>>>>>>> and
> >>>>>>>>>>> the 'nomenclature' seem media-oriented ("nifi-media-nar" etc.)
> >>>>>>>>>>>
> >>>>>>>>>>> Would it make sense to have a generic processor
> >>>>>>>>>>> ExtractDocumentMetadataAndContent?  Are there enough specifics
> in
> >>>>>>> the
> >>>>>>>>>>> image/video processing stuff to warrant that to be a separate
> >>>>>> layer;
> >>>>>>>>>>> perhaps a subclass of ExtractDocumentMetadataAndContent ?
> Might
> >>>>>> it
> >>>>>>>> make
> >>>>>>>>>>> sense to rename nifi-media-nar into nifi-text-extract-nar ?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> - Dmitry
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora <js...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Dmitry,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yeah, I agree, Tika is pretty impressive.  The original
> ticket,
> >>>>>>>>>> NIFI-615
> >>>>>>>>>>>> <https://issues.apache.org/jira/browse/NIFI-615>, wanted
> >>>>>>> extraction
> >>>>>>>>>> of
> >>>>>>>>>>>> metadata from WAV files, but as I got into it I found Tika so
> >>>>>> for
> >>>>>>>> the
> >>>>>>>>>>> same
> >>>>>>>>>>>> effort it supports the 1,000+ file formats Tika understands.
> >>>>>> That
> >>>>>>>> new
> >>>>>>>>>>>> processor called "ExtractMediaMetadata", you can pull that
> pull
> >>>>>>>> PR-252
> >>>>>>>>>>>> <https://github.com/apache/nifi/pull/252> from GitHub if you
> >>>>>> want
> >>>>>>>> to
> >>>>>>>>>>> give
> >>>>>>>>>>>> it a try before it's merged.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Extraction content for those 1,000+ formats would be a
> valuable
> >>>>>>>>>> addition.
> >>>>>>>>>>>> I see two possible approaches, 1) create a new
> >>>>>>> "ExtractMediaContent"
> >>>>>>>>>>>> processor that would put the document content in a new flow
> >>>>>> file,
> >>>>>>>> and
> >>>>>>>>>> 2)
> >>>>>>>>>>>> extend the new "ExtractMediaMetadata" processor so it can
> >>>>>> extract
> >>>>>>>>>>> metadata,
> >>>>>>>>>>>> content, or both.  One combined processor makes sense if it
> can
> >>>>>>>>>> provide a
> >>>>>>>>>>>> performance gain, otherwise two complementary processors may
> >>>>>> make
> >>>>>>>>>> usage
> >>>>>>>>>>>> easier.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm glad to help if you want to take a cut at the processor
> >>>>>>>> yourself,
> >>>>>>>>>> or
> >>>>>>>>>>> I
> >>>>>>>>>>>> can take a crack at it myself if you'd prefer.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Don't hesitate to ask questions or share comments and feedback
> >>>>>>>>>> regarding
> >>>>>>>>>>>> the ExtractMediaMetadata processor or the addition of content
> >>>>>>>>>> handling.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Joe Skora
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <
> >>>>>>>>>>>> dgoldenberg@hexastax.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks, Joe!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Joe S. - I'm definitely up for discussing and
> contributing.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> While building search-related ingestion systems, I've seen
> >>>>>>>> metadata
> >>>>>>>>>> and
> >>>>>>>>>>>>> text extraction being done all the time; it's always there
> and
> >>>>>>>>>> always
> >>>>>>>>>>> has
> >>>>>>>>>>>>> to be done for building search indexes.  Beyond that,
> >>>>>>> OCR-related
> >>>>>>>>>>>>> capabilities are often requested, and the advantage of Tika
> is
> >>>>>>>> that
> >>>>>>>>>> it
> >>>>>>>>>>>>> supports OCR out of the box.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Dmitry
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <
> >>>>>> joe.witt@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Dmitry,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Another community member (Joe Skora) has a PR outstanding
> >>>>>> for
> >>>>>>>>>>>>>> extracting metadata from media files using Tika.  Perhaps it
> >>>>>>>> makes
> >>>>>>>>>>>>>> sense to broaden that to in general extract what Tika can
> >>>>>>> find.
> >>>>>>>>>> Joe
> >>>>>>>>>>> -
> >>>>>>>>>>>>>> perhaps you can discuss your ideas with Dmitry and see if
> >>>>>>>>>> broadening
> >>>>>>>>>>>>>> is a good idea or if rather domain specific ones make more
> >>>>>>>> sense.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This concept of extracting metadata from documents/text
> >>>>>> files,
> >>>>>>>>>> etc..
> >>>>>>>>>>>>>> using something like Tika is certainly useful as that then
> >>>>>> can
> >>>>>>>>>> drive
> >>>>>>>>>>>>>> nice automated routing decisions.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>> Joe
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg
> >>>>>>>>>>>>>> <dg...@hexastax.com> wrote:
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I see that the ExtractText processor extracts text using
> >>>>>>>> regex.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> What about a processor that extracts text and metadata
> >>>>>> from
> >>>>>>>>>>> incoming
> >>>>>>>>>>>>>>> files?  That doesn't seem to exist - but perhaps I didn't
> >>>>>>>> quite
> >>>>>>>>>>> look
> >>>>>>>>>>>> in
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> right spots.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If that doesn't exist I'd like to implement and commit it,
> >>>>>>>> using
> >>>>>>>>>>>> Apache
> >>>>>>>>>>>>>>> Tika.  There may also be a couple of related processors to
> >>>>>>>> that.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thoughts?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>> - Dmitry
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>

Re: Text and metadata extraction processor

Posted by Mark Payne <ma...@hotmail.com>.
Dmitry,

I would be a bit concerned about providing options for filters that include and
exclude certain things. I believe that if you send a FlowFile to the Processor,
then the Processor should do its thing. If you want to filter out which FlowFiles
have their content extracted, for example, I would suggest using a Processor
like RouteOnAttribute to ensure that only the appropriate FlowFiles are processed
by the ExtractMediaMetadata processor.

This allows the metadata extraction processor to focus purely on extracting
metadata and doesn't have to deal with all of the logic of filtering things out. The logic
for filtering things out is almost guaranteed to grow much more complex as people
start to use this more and more. NiFi already provides several route-based processors
to allow for a great deal of flexibility with this type of logic (RouteOnAttribute, RouteOnContent,
ScanAttribute, ScanContent, etc.).
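
For example, a single upstream RouteOnAttribute (Routing Strategy set to
"Route to Property name") can gate content extraction with one user-defined
property; the rule below assumes IdentifyMimeType has populated mime.type
and is purely illustrative:

    extract-content: ${mime.type:startsWith('text/'):or(${mime.type:equals('application/pdf')})}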

Thanks
-Mark



> On Apr 1, 2016, at 12:55 AM, Dmitry Goldenberg <dg...@hexastax.com> wrote:
> 
> Simon,
> 
> I believe we've moved on past the 'mode' option and have now switched to
> talking about how the include/exclude filters, for metadata and content, on
> the one hand, and filename- or MIME-type-based, on the other, would drive
> whether meta, content, or both would get extracted.
> 
> For example, a user could configure the ExtractMediaAttributes processor to
> extract metadata for all image files (but not content), extract content
> only for plain text documents (but no metadata), or both meta and content
> for documents with an extension ".pqr", based on the filename.
> 
> Could you elaborate on your vision of how relationships could "drive" this
> type of functionality?  Joe has already built some of the filtering into
> the processor; I just suggested to extend that further, and we get all the
> bases covered.
> 
> I'm not sure I followed your comment on the extracted content being
> transferred into a new FlowFile.  My thoughts were that the extracted
> content would be inserted into a new, dedicated field, called for example,
> "text", on *the same* FlowFile.  I imagine that for a lot of use-cases,
> especially data ingestion into a search engine, the extracted attributes
> *and* the extracted text must travel together as part of the ingested
> document, with the original flowfile-content most likely getting dropped on
> the way into the index.
> 
> I guess an alternative could be to have an option to represent the
> extraction results as a new document, and an option to drop the original,
> and an option to copy the original's attributes onto the new doc. Seems
> rather complex.  I like the "in-place" extraction.
> 
> Could you also elaborate on how a controller service would handle OCR?
> When a document floats into ExtractMediaAttributes, assuming Tesseract is
> installed properly, Tika will already automatically fire off OCR.  Unless
> we turn that off and cause OCR to only be supported via this service.  I'm
> tempted to say why don't we just let Tika do its job for all cases, OCR
> included.  Caveat being that OCR is expensive and it would be nice to have
> ways of ensuring it has enough resources and doesn't bog the flow down.
> 
> For the PDF processor, I'm thinking, yes, PDFBox to break it up into pages
> and then apply Tika page by page, then aggregate the output together, with
> a configurable max of up to N pages per document to process (due to how
> slow OCR is).  I already have a prototype of this going, I'll file a JIRA
> ticket for this feature.
> 
> - Dmitry
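
For reference on the OCR knobs discussed above, Tika 1.x already exposes
per-parse Tesseract settings through a ParseContext, roughly as below (a
sketch only; the install path and size cap are assumed values):

    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;

    // Per-parse Tesseract settings, passed alongside the Tika parser call.
    TesseractOCRConfig ocr = new TesseractOCRConfig();
    ocr.setTesseractPath("/usr/local/bin");      // assumed Tesseract install location
    ocr.setMaxFileSizeToOcr(10 * 1024 * 1024);   // assumed cap: skip OCR above ~10 MB
    ParseContext context = new ParseContext();
    context.set(TesseractOCRConfig.class, ocr);
    // parser.parse(in, handler, metadata, context);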


Re: Text and metadata extraction processor

Posted by Dmitry Goldenberg <dg...@hexastax.com>.
Simon,

I believe we've moved past the 'mode' option and are now discussing how the
include/exclude filters (for metadata and content on one axis, and filename-
or MIME-type-based on the other) would drive whether metadata, content, or
both get extracted.

For example, a user could configure the ExtractMediaAttributes processor to
extract metadata for all image files (but not content), extract content
only for plain text documents (but no metadata), or extract both metadata
and content for documents with the extension ".pqr", based on the filename.
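
To make that concrete, here is a rough Java sketch of the decision logic
(the class and property names are purely illustrative, not the actual
processor code); per the earlier consensus, excludes trump includes:

  import java.util.regex.Pattern;

  // Illustrative only: one gate instance for metadata, one for content.
  final class ExtractionGate {
      private final Pattern includeFilename;  // INCLUDE_xxx_FILENAME_FILTER
      private final Pattern includeMimeType;  // INCLUDE_xxx_MIMETYPE_FILTER
      private final Pattern excludeFilename;  // EXCLUDE_xxx_FILENAME_FILTER
      private final Pattern excludeMimeType;  // EXCLUDE_xxx_MIMETYPE_FILTER

      ExtractionGate(String incName, String incMime, String excName, String excMime) {
          includeFilename = incName == null ? null : Pattern.compile(incName);
          includeMimeType = incMime == null ? null : Pattern.compile(incMime);
          excludeFilename = excName == null ? null : Pattern.compile(excName);
          excludeMimeType = excMime == null ? null : Pattern.compile(excMime);
      }

      // Excludes take precedence; an unset include pattern matches nothing.
      boolean shouldExtract(String filename, String mimeType) {
          if (matches(excludeFilename, filename) || matches(excludeMimeType, mimeType)) {
              return false;
          }
          return matches(includeFilename, filename) || matches(includeMimeType, mimeType);
      }

      private static boolean matches(Pattern p, String value) {
          return p != null && value != null && p.matcher(value).matches();
      }
  }

The processor would consult one gate for metadata and one for content up
front, make a single Tika pass if either answers yes, and pass the flow
file through untouched if both answer no.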

Could you elaborate on your vision of how relationships could "drive" this
type of functionality?  Joe has already built some of the filtering into
the processor; I just suggested extending that further so we get all the
bases covered.

I'm not sure I followed your comment on the extracted content being
transferred into a new FlowFile.  My thoughts were that the extracted
content would be inserted into a new, dedicated field, called, for example,
"text", on *the same* FlowFile.  I imagine that for a lot of use-cases,
especially data ingestion into a search engine, the extracted attributes
*and* the extracted text must travel together as part of the ingested
document, with the original flowfile-content most likely getting dropped on
the way into the index.
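
As a rough illustration of the single pass (assuming Tika's
AutoDetectParser; the "text" attribute name is just my working example):

  import java.io.InputStream;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  // One read of the stream yields both the metadata and the body text.
  static Map<String, String> parseOnce(InputStream in) throws Exception {
      BodyContentHandler handler = new BodyContentHandler(-1);  // -1 = no write limit
      Metadata metadata = new Metadata();
      new AutoDetectParser().parse(in, handler, metadata, new ParseContext());

      Map<String, String> attributes = new HashMap<>();
      for (String name : metadata.names()) {
          attributes.put(name, metadata.get(name));
      }
      attributes.put("text", handler.toString());  // the in-place "text" field
      return attributes;
  }

In the processor, the map would go onto the flow file via
session.putAllAttributes(flowFile, attributes), leaving the original
flowfile-content untouched.  One caveat: NiFi holds attributes in memory
and in the flowfile repository, so storing very large extracted text as an
attribute deserves some care.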

I guess an alternative could be to represent the extraction results as a
new document, with options to drop the original and to copy the original's
attributes onto the new doc.  Seems rather complex.  I like the "in-place"
extraction.

Could you also elaborate on how a controller service would handle OCR?
When a document floats into ExtractMediaAttributes, assuming Tesseract is
installed properly, Tika will already automatically fire off OCR.  Unless
we turn that off and cause OCR to only be supported via this service.  I'm
tempted to say why don't we just let Tika do its job for all cases, OCR
included.  Caveat being that OCR is expensive and it would be nice to have
ways of ensuring it has enough resources and doesn't bog the flow down.
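
For instance (a sketch; the path and timeout values would come from
processor properties):

  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.parser.ocr.TesseractOCRConfig;

  // Point Tika's Tesseract parser at a given install and bound its runtime.
  static ParseContext ocrContext(String tesseractDir, int timeoutSeconds) {
      TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
      ocrConfig.setTesseractPath(tesseractDir);  // e.g. "/usr/local/bin"
      ocrConfig.setTimeout(timeoutSeconds);      // keeps slow OCR from bogging the flow down
      ParseContext parseContext = new ParseContext();
      parseContext.set(TesseractOCRConfig.class, ocrConfig);
      return parseContext;
  }

The maximum-file-size check would be a simple flowFile.getSize() guard
before parsing at all.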

For the PDF processor, I'm thinking, yes, PDFBox to break it up into pages
and then apply Tika page by page, then aggregate the output, with a
configurable max of up to N pages per document to process (due to how slow
OCR is).  I already have a prototype of this going; I'll file a JIRA
ticket for this feature.
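
The core of the prototype looks roughly like the sketch below (PDFBox 2.x
API; the scanned-vs-text check is a crude length heuristic and the page cap
is an illustrative constant):

  import java.awt.image.BufferedImage;
  import java.io.ByteArrayInputStream;
  import java.io.ByteArrayOutputStream;
  import java.io.InputStream;
  import javax.imageio.ImageIO;
  import org.apache.pdfbox.pdmodel.PDDocument;
  import org.apache.pdfbox.rendering.PDFRenderer;
  import org.apache.pdfbox.text.PDFTextStripper;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  static final int MAX_OCR_PAGES = 20;  // configurable cap, since OCR is slow

  static String extractPdfText(PDDocument doc) throws Exception {
      // 'Text' PDFs: a plain text pass is cheap and sufficient.
      String text = new PDFTextStripper().getText(doc);
      if (text.trim().length() > 50) {  // heuristic: "has real text"
          return text;
      }
      // 'Scanned' PDFs: render each page, OCR it, then aggregate.
      PDFRenderer renderer = new PDFRenderer(doc);
      int pages = Math.min(doc.getNumberOfPages(), MAX_OCR_PAGES);
      StringBuilder aggregated = new StringBuilder();
      for (int i = 0; i < pages; i++) {
          BufferedImage pageImage = renderer.renderImageWithDPI(i, 300);  // 300 DPI suits OCR
          ByteArrayOutputStream buf = new ByteArrayOutputStream();
          ImageIO.write(pageImage, "png", buf);
          aggregated.append(ocrPage(new ByteArrayInputStream(buf.toByteArray()))).append('\n');
      }
      return aggregated.toString();
  }

  // Tika parse of a page image; with Tesseract installed, this runs OCR.
  static String ocrPage(InputStream image) throws Exception {
      BodyContentHandler handler = new BodyContentHandler(-1);
      new AutoDetectParser().parse(image, handler, new Metadata(), new ParseContext());
      return handler.toString();
  }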

- Dmitry



On Thu, Mar 31, 2016 at 8:43 PM, Simon Ball <sb...@hortonworks.com> wrote:

> What I’m suggesting is a single processor for both, but instead of using a
> mode property to determine which bits get extracted, you use the state of
> the relations on the processor to configure which options Tika uses, with
> a single pass that parses metadata into attributes and content into a new
> flow file transferred to the parsed relation.
> 
> On the Tesseract front, it may make sense to do this through a controller
> service.
> 
> A PDF processor might be interesting. Are you thinking of something like
> PDFBox, or Tika again?
>
> Simon

Re: Text and metadata extraction processor

Posted by Simon Ball <sb...@hortonworks.com>.
What I’m suggesting is a single processor for both, but instead of using a mode property to determine which bits get extracted, you use the state of the relations on the processor to configure which options Tika uses, with a single pass that parses metadata into attributes and content into a new flow file transferred to the parsed relation.
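
Roughly, with illustrative relationship names (a sketch, not working processor code):

  import org.apache.nifi.flowfile.FlowFile;
  import org.apache.nifi.processor.ProcessContext;
  import org.apache.nifi.processor.ProcessSession;
  import org.apache.nifi.processor.Relationship;
  import org.apache.tika.sax.BodyContentHandler;
  import org.xml.sax.ContentHandler;
  import org.xml.sax.helpers.DefaultHandler;

  static final Relationship REL_ORIGINAL = new Relationship.Builder()
          .name("original").description("Input file, metadata attached as attributes").build();
  static final Relationship REL_EXTRACTED = new Relationship.Builder()
          .name("extracted").description("New flow file carrying the parsed content").build();

  // Inside onTrigger: only pay for content parsing when 'extracted' is wired.
  void onTriggerSketch(ProcessContext context, ProcessSession session) {
      FlowFile original = session.get();
      if (original == null) return;
      boolean wantContent = context.hasConnection(REL_EXTRACTED);
      // BodyContentHandler captures text; DefaultHandler discards it, so a
      // single Tika pass yields metadata either way.
      ContentHandler handler = wantContent ? new BodyContentHandler(-1) : new DefaultHandler();
      // ... run Tika with 'handler', copy metadata onto 'original', and if
      // wantContent, session.create(original) a child flow file, write
      // handler.toString() to it, and transfer it to REL_EXTRACTED ...
  }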

On the Tesseract front, it may make sense to do this through a controller service.

A PDF processor might be interesting. Are you thinking of something like PDFBox, or Tika again?

Simon


> On 1 Apr 2016, at 01:30, Dmitry Goldenberg <dg...@hexastax.com> wrote:
> 
> Simon,
> 
> Interesting commentary.  The issue that Joe and I have both looked at, with
> the splitting of metadata and content extraction, is that if they're split
> then the underlying Tika extraction has to process the file twice: once to
> pull out the attributes and once to pull out the content.  Perhaps it may
> be good to add ExtractMetadata and ExtractTextContent in addition to
> ExtractMediaAttributes - ? Seems kind of an overkill but I may be wrong.
> 
> It seems prudent to provide one wholesome, out-of-the-box extractor
> processor with options to extract just metadata, just content, or both
> metadata and content.
> 
> I think what I'm hearing is that we need to allow for checking somewhere
> for whether text/content has already been extracted by the time we get to
> the ExtractMediaAttributes processor - ?  If that is the issue then I
> believe the user would use RouteOnAttribute and if the content is already
> filled in then they'd not route to ExtractMediaAttributes.
> 
> As far as the OCR.  Tika internally supports OCR by directing image files
> to Tesseract (if Tesseract is installed and configured properly).  We've
> started talking about how this could be reconciled in the
> ExtractMediaAttributes.
> 
> I think that once we have the basic ExtractMediaAttributes, we could add
> filters for what files to enable the OCR on, and we'd need to expose a few
> config parameters specific to OCR, such as e.g. the location of the
> Tesseract installation and the maximum file size on which to attempt the
> OCR.  Perhaps there can also be a RunOCR processor which would be dedicated
> to running OCR.  But since Tika already has OCR integrated we'd probably
> want to take care of that in the ExtractMediaAttributes configuration.
> 
> Additionally, I've proposed the idea of a ProcessPDF processor which would
> ascertain whether a PDF is 'text' or 'scanned'. If scanned, we would break
> it up into pages and run OCR on each page, then aggregate the extracted
> text.
> 
> - Dmitry
> 
> 
> 
> On Thu, Mar 31, 2016 at 3:19 PM, Simon Ball <sb...@hortonworks.com> wrote:
> 
>> Just a thought…
>> 
>> To keep consistent with other Nifi Parse patterns, would it make sense to
>> based the extraction of content on the presence of a relation. So your tika
>> processor would have an original relation which would have meta data
>> attached as attributed, and an extracted relation which would have the
>> metadata and the processed content (text from OCRed image for example).
>> That way you can just use context.hasConnection(relationship) to determine
>> whether to enable the tika content processing.
>> 
>> This seems more idiomatic than a mode flag.
>> 
>> Simon
>> 
>>> On 31 Mar 2016, at 19:48, Joe Skora <js...@gmail.com> wrote:
>>> 
>>> Dmitry,
>>> 
>>> I think we're good.  I was confused because "XXX_METADATA MIMETYPE
>> FILTER"
>>> entries referred to some MIME type of the metadata, but you meant to use
>>> the file's MIME type to select what files have metadata extracted.
>>> 
>>> Sorry, about that, I think we are on the same page.
>>> 
>>> Joe
>>> 
>>> On Thu, Mar 31, 2016 at 11:40 AM, Dmitry Goldenberg <
>>> dgoldenberg@hexastax.com> wrote:
>>> 
>>>> Hi Joe,
>>>> 
>>>> I think if we have the filters in place then there's no need for the
>> 'mode'
>>>> enum, as the filters themselves guide the processor in deciding whether
>>>> metadata and/or content is extracted for a given input file.
>>>> 
>>>> Agreed on the handling of archives as a separate processor (template,
>> seems
>>>> like).
>>>> 
>>>> I think it's easiest to do both metadata and/or content in one processor
>>>> since it can tell Tika whether to extract metadata and/or content, in
>> one
>>>> pass over the file bytes (as you pointed out).
>>>> 
>>>> Agreed on the exclusions trumping inclusions; I think that makes sense.
>>>> 
>>>>>> We will only have a mimetype for the original flow file itself so I'm
>>>> not sure about the metadata mimetype filter.
>>>> 
>>>> I'm not sure where there might be an issue here. The metadata MIME type
>>>> filter tells the processor for which MIME types to perform the metadata
>>>> extraction.  For instance, extract metadata for images and videos, only.
>>>> This could possibly be coupled with an exclusion filter for content that
>>>> says, don't try to extract content from images and videos.
>>>> 
>>>> I think with the six filters we get all the bases covered:
>>>> 
>>>>  1. include metadata? --
>>>>     1. yes --
>>>>        1. determine the inclusion of metadata by filename pattern
>>>>        2. determine the inclusion of metadata by MIME type pattern
>>>>     2. no --
>>>>        1. determine the exclusion of metadata by filename pattern
>>>>        2. determine the exclusion of metadata by MIME type pattern
>>>>     2. include content? --
>>>>     1. yes --
>>>>        1. determine the inclusion of content by filename pattern
>>>>        2. determine the inclusion of content by MIME type pattern
>>>>     2. no --
>>>>        1. determine the exclusion of content by filename pattern
>>>>        2. determine the exclusion of content by MIME type pattern
>>>> 
>>>> Does this work?
>>>> 
>>>> Thanks,
>>>> - Dmitry
>>>> 
>>>> 
>>>> On Thu, Mar 31, 2016 at 9:27 AM, Joe Skora <js...@gmail.com> wrote:
>>>> 
>>>>> Dmitry,
>>>>> 
>>>>> Looking at this and your prior email.
>>>>> 
>>>>> 
>>>>>  1. I can see "extract metadata only" being as popular as "extract
>>>>>  metadata and content".  It will all depend on the type of media, for
>>>>>  audio/video files adding the metadata to the flow file is enough but
>>>> for
>>>>>  Word, PDF, etc. files the content may be wanted as well.
>>>>>  2. After thinking about it, I agree on an enum for mode.
>>>>>  3. I think any handling of zips or archive files should be handled by
>>>>>  another processor, that keeps this processor cleaner and improves its
>>>>>  ability for re-use.
>>>>>  4. I like the addition of exclude filters but I'm not sure about
>>>> adding
>>>>>  content filters.  We will only have a mimetype for the original flow
>>>>> file
>>>>>  itself so I'm not sure about the metadata mimetype filter.  I think
>>>>> content
>>>>>  filtering may be best left for another downstream processor, but it
>>>>> might
>>>>>  be run faster if included here since the entire content will be
>>>> handled
>>>>>  during extraction.  If the content filters are implemented, for
>>>>> performance
>>>>>  they need to short circuit so that if the property is not set or is
>>>> set
>>>>> to
>>>>>  ".*" they don't evaluate the regex.
>>>>>  1. FILENAME_FILTER - selects flow files to process based on filename
>>>>>     matching regex. (exists)
>>>>>     2. MIMETYPE_FILTER - selects flow files to process based on
>>>> mimetype
>>>>>     matching regex. (exists)
>>>>>     3. FILENAME_EXCLUDE - excludes already selected flow files from
>>>>>     processing based on filename matching regex. (new)
>>>>>     4. MIMETYPE_EXCLUDE - excludes already selected flow  files from
>>>>>     processing based on mimetype matching regex. (new)
>>>>>     5. CONTENT_FILTER (optional) - selects flow files for output based
>>>> on
>>>>>     extracted content matching regex. (new)
>>>>>     6. CONTENT_EXCLUDE (optional) - excludes flow files from output
>>>> based
>>>>>     on extracted content matching regex. (new)
>>>>>  5. As indicated in the descriptions in #4, I don't think overlapping
>>>>>  filters are an error, instead excludes should take precedence over
>>>>>  includes.  Then I can include a domain (like A*) but exclude sub-sets
>>>>> (like
>>>>>  AXYZ*).
>>>>> 
>>>>> I'm sure there's something we missed, but I think that covers most of
>> it.
>>>>> 
>>>>> Regards,
>>>>> Joe
>>>>> 
>>>>> 
>>>>> On Wed, Mar 30, 2016 at 1:56 PM, Dmitry Goldenberg <
>>>>> dgoldenberg@hexastax.com
>>>>>> wrote:
>>>>> 
>>>>>> Joe,
>>>>>> 
>>>>>> Upon some thinking, I've started wondering whether all the cases can
>> be
>>>>>> covered by the following filters:
>>>>>> 
>>>>>> INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input
>>>>>> files get their content extracted, by file name
>>>>>> INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which
>> input
>>>>>> files get their metadata extracted, by file name
>>>>>> INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input
>>>>>> files get their content extracted, by MIME type
>>>>>> INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which
>> input
>>>>>> files get their metadata extracted, by MIME type
>>>>>> 
>>>>>> EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input
>>>>>> files do NOT get their content extracted, by file name
>>>>>> EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which
>> input
>>>>>> files do NOT get their metadata extracted, by file name
>>>>>> EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input
>>>>>> files do NOT get their content extracted, by MIME type
>>>>>> EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which
>> input
>>>>>> files do NOT get their metadata extracted, by MIME type
>>>>>> 
>>>>>> I believe this gets all the bases covered. At processor init time, we
>>>> can
>>>>>> analyze the inclusions vs. exclusions; any overlap would cause a
>>>>>> configuration error.
>>>>>> 
>>>>>> Let me know what you think, thanks.
>>>>>> - Dmitry
>>>>>> 
>>>>>> On Wed, Mar 30, 2016 at 10:41 AM, Dmitry Goldenberg <
>>>>>> dgoldenberg@hexastax.com> wrote:
>>>>>> 
>>>>>>> Hi Joe,
>>>>>>> 
>>>>>>> I follow your reasoning on the semantics of "media".  One might argue
>>>>>> that
>>>>>>> media files are a case of "document" or that a document is a case of
>>>>>>> "media".
>>>>>>> 
>>>>>>> I'm not proposing filters for the mode of processing, I'm proposing a
>>>>>>> flag/enum with 3 values:
>>>>>>> 
>>>>>>> A) extract metadata only;
>>>>>>> B) extract content only and place it into the flowfile content;
>>>>>>> C) extract both metadata and content.
>>>>>>> 
>>>>>>> I think the default should be C, to extract both.  At least in my
>>>>>>> experience most flows I've dealt with were interested in extracting
>>>>> both.
>>>>>>> 
>>>>>>> I don't see how this mode would benefit from being expression driven
>>>> -
>>>>> ?
>>>>>>> 
>>>>>>> I think we can add this enum mode and have the basic use case
>>>> covered.
>>>>>>> 
>>>>>>> Additionally, further down the line, I was thinking we could ponder
>>>> the
>>>>>>> following (these have been essential in search engine ingestion):
>>>>>>> 
>>>>>>>  1. Extraction from compressed files/archives. How would
>>>>> UnpackContent
>>>>>>>  work with ExtractMediaAttributes? Use-case being, we've got a zip
>>>>>> file as
>>>>>>>  input and want to crack it open and unravel it recursively; it may
>>>>>> have
>>>>>>>  other, nested zips inside, along with other documents. One way to
>>>>>> handle
>>>>>>>  this is to treat the whole archive as one document and merge all
>>>>>> attributes
>>>>>>>  into one FlowFile.  The other way would be to treat each archive
>>>>>> entry as
>>>>>>>  its own flow file and keep a pointer back at the parent archive.
>>>>> Yet
>>>>>>>  another case is when the user might want to only extract the
>>>> 'leaf'
>>>>>> entries
>>>>>>>  and discard any parent container archives.
>>>>>>> 
>>>>>>>  2. Attachments and embeddings. Users may want to treat any
>>>> attached
>>>>> or
>>>>>>>  embedded files as separate flowfiles with perhaps pointers back to
>>>>> the
>>>>>>>  parent files. This definitely warrants a filter. Oftentimes Office
>>>>>>>  documents have 'media' embeddings which are often not of interest,
>>>>>>>  especially for the case of ingesting into a search engine.
>>>>>>> 
>>>>>>>  3. PDF. For PDF's, we can do OCR. This is important for the
>>>>>>>  'image'/scanned PDF's for which Tika won't extract text.
>>>>>>> 
>>>>>>> I'd like to understand how much of this is already supported in NiFi
>>>>> and
>>>>>>> if not I'd volunteer/collaborate to implement some of this.
>>>>>>> 
>>>>>>> - Dmitry
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Mar 30, 2016 at 9:03 AM, Joe Skora <js...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Dmitry,
>>>>>>>> 
>>>>>>>> Are you proposing separate filters that determine the mode of
>>>>>> processing,
>>>>>>>> metadata/content/metadataAndContent?  I was thinking of one
>>>> selection
>>>>>>>> filters and a static mode switch at the processor instance level, to
>>>>>> make
>>>>>>>> configuration more obvious such that one instance of the processor
>>>>> will
>>>>>>>> handle a known set of files regardless of the processing mode.
>>>>>>>> 
>>>>>>>> I was thinking it would be useful for the mode switch to support
>>>>>>>> expression
>>>>>>>> language, but I'm not sure about that since the selection filters
>>>> will
>>>>>>>> control what files get processed and it would be harder to configure
>>>>> if
>>>>>>>> the
>>>>>>>> output flow file could vary between source format and extracted
>>>> text.
>>>>>> So,
>>>>>>>> while it might be easy to do, and occasionally useful, I think in
>>>>> normal
>>>>>>>> use I'd never have a varying mode but would more likely have
>>>> multiple
>>>>>>>> processor instances with some routing or selection going on further
>>>>>>>> upstream.
>>>>>>>> 
>>>>>>>> I wrestled with the naming issue too.  I went with
>>>>>>>> "ExtractMediaAttributes" over "ExtractDocumentAttributes" because it
>>>>>>>> seemed to represent the broader context better.  In reality, media
>>>>>>>> files are documents and documents are media files, but in the end
>>>>>>>> it's all just semantics.
>>>>>>>> 
>>>>>>>> I don't think I would change the NAR bundle name, because I think
>>>>>>>> "nifi-media-nar" establishes it as a place to collect this and other
>>>>>>>> media-related processors in the future.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Joe
>>>>>>>> 
>>>>>>>> On Tue, Mar 29, 2016 at 3:09 PM, Dmitry Goldenberg <
>>>>>>>> dgoldenberg@hexastax.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Joe,
>>>>>>>>> 
>>>>>>>>> Thanks for all the details.
>>>>>>>>> 
>>>>>>>>> I wanted to propose that I do some of this work so as to go through
>>>>>>>>> the full cycle of developing a processor and committing it.
>>>>>>>>> 
>>>>>>>>> Once your changes are merged, I could extend your
>>>>>>>>> 'ExtractMediaMetadata' processor to handle the content, in addition
>>>>>>>>> to the metadata.
>>>>>>>>> 
>>>>>>>>> We could keep the FILENAME_FILTER and MIMETYPE_FILTER but add a
>>>>>>>>> mode with 3 values: metadataOnly, contentOnly, metadataAndContent.
>>>>>>>>> 
>>>>>>>>> One thing that looks to be a design issue right now is that your
>>>>>>>>> changes and the 'nomenclature' seem media-oriented
>>>>>>>>> ("nifi-media-nar", etc.).
>>>>>>>>> 
>>>>>>>>> Would it make sense to have a generic processor,
>>>>>>>>> ExtractDocumentMetadataAndContent?  Are there enough specifics in
>>>>>>>>> the image/video processing stuff to warrant a separate layer,
>>>>>>>>> perhaps a subclass of ExtractDocumentMetadataAndContent?  Might it
>>>>>>>>> make sense to rename nifi-media-nar to nifi-text-extract-nar?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> - Dmitry
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora <js...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Dmitry,
>>>>>>>>>> 
>>>>>>>>>> Yeah, I agree, Tika is pretty impressive.  The original ticket,
>>>>>>>>>> NIFI-615 <https://issues.apache.org/jira/browse/NIFI-615>, wanted
>>>>>>>>>> extraction of metadata from WAV files, but as I got into it I found
>>>>>>>>>> Tika, so for the same effort the processor supports the 1,000+ file
>>>>>>>>>> formats Tika understands.  The new processor is called
>>>>>>>>>> "ExtractMediaMetadata"; you can pull PR-252
>>>>>>>>>> <https://github.com/apache/nifi/pull/252> from GitHub if you want
>>>>>>>>>> to give it a try before it's merged.
>>>>>>>>>> 
>>>>>>>>>> Extracting content for those 1,000+ formats would be a valuable
>>>>>>>>>> addition.  I see two possible approaches: 1) create a new
>>>>>>>>>> "ExtractMediaContent" processor that would put the document content
>>>>>>>>>> in a new flow file, and 2) extend the new "ExtractMediaMetadata"
>>>>>>>>>> processor so it can extract metadata, content, or both.  One
>>>>>>>>>> combined processor makes sense if it can provide a performance
>>>>>>>>>> gain; otherwise, two complementary processors may make usage
>>>>>>>>>> easier.
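>>>>>>>>>>
>>>>>>>>>> A rough sketch of option 2 inside onTrigger, assuming NiFi's
>>>>>>>>>> ProcessSession/StreamCallback API (REL_SUCCESS and the "media."
>>>>>>>>>> attribute prefix are placeholders, and most imports are omitted):
>>>>>>>>>>
>>>>>>>>>>   // import org.apache.nifi.processor.io.StreamCallback;
>>>>>>>>>>   // import java.nio.charset.StandardCharsets;
>>>>>>>>>>   FlowFile flowFile = session.get();
>>>>>>>>>>   if (flowFile == null) {
>>>>>>>>>>     return;
>>>>>>>>>>   }
>>>>>>>>>>   final Metadata tikaMetadata = new Metadata();
>>>>>>>>>>   // Replace the flowfile content with the text Tika extracts.
>>>>>>>>>>   flowFile = session.write(flowFile, new StreamCallback() {
>>>>>>>>>>     @Override
>>>>>>>>>>     public void process(InputStream in, OutputStream out)
>>>>>>>>>>         throws IOException {
>>>>>>>>>>       BodyContentHandler handler = new BodyContentHandler(-1);
>>>>>>>>>>       try {
>>>>>>>>>>         new AutoDetectParser().parse(in, handler, tikaMetadata,
>>>>>>>>>>             new ParseContext());
>>>>>>>>>>       } catch (Exception e) {
>>>>>>>>>>         throw new IOException("Tika extraction failed", e);
>>>>>>>>>>       }
>>>>>>>>>>       out.write(handler.toString()
>>>>>>>>>>           .getBytes(StandardCharsets.UTF_8));
>>>>>>>>>>     }
>>>>>>>>>>   });
>>>>>>>>>>   // Copy the Tika metadata onto the flowfile as attributes.
>>>>>>>>>>   for (String name : tikaMetadata.names()) {
>>>>>>>>>>     flowFile = session.putAttribute(flowFile,
>>>>>>>>>>         "media." + name, tikaMetadata.get(name));
>>>>>>>>>>   }
>>>>>>>>>>   session.transfer(flowFile, REL_SUCCESS);
>>>>>>>>>>
>>>>>>>>>> Since one parse yields both text and metadata, the combined form
>>>>>>>>>> avoids running Tika over the file twice.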
>>>>>>>>>> 
>>>>>>>>>> I'm glad to help if you want to take a cut at the processor
>>>>>>>>>> yourself, or I can take a crack at it myself if you'd prefer.
>>>>>>>>>> 
>>>>>>>>>> Don't hesitate to ask questions or share comments and feedback
>>>>>>>>>> regarding the ExtractMediaMetadata processor or the addition of
>>>>>>>>>> content handling.
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Joe Skora
>>>>>>>>>> 
>>>>>>>>>> On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <
>>>>>>>>>> dgoldenberg@hexastax.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Thanks, Joe!
>>>>>>>>>>> 
>>>>>>>>>>> Hi Joe S. - I'm definitely up for discussing and contributing.
>>>>>>>>>>> 
>>>>>>>>>>> While building search-related ingestion systems, I've seen
>>>>>>>>>>> metadata and text extraction being done all the time; it's always
>>>>>>>>>>> there and always has to be done for building search indexes.
>>>>>>>>>>> Beyond that, OCR-related capabilities are often requested, and the
>>>>>>>>>>> advantage of Tika is that it supports OCR out of the box.
>>>>>>>>>>> 
>>>>>>>>>>> - Dmitry
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <joe.witt@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Dmitry,
>>>>>>>>>>>> 
>>>>>>>>>>>> Another community member (Joe Skora) has a PR outstanding for
>>>>>>>>>>>> extracting metadata from media files using Tika.  Perhaps it
>>>>>>>>>>>> makes sense to broaden that to extract, in general, whatever
>>>>>>>>>>>> Tika can find.  Joe - perhaps you can discuss your ideas with
>>>>>>>>>>>> Dmitry and see if broadening is a good idea or if domain-specific
>>>>>>>>>>>> processors make more sense.
>>>>>>>>>>>> 
>>>>>>>>>>>> This concept of extracting metadata from documents/text files,
>>>>>>>>>>>> etc., using something like Tika is certainly useful, as it can
>>>>>>>>>>>> then drive nice automated routing decisions.
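>>>>>>>>>>>>
>>>>>>>>>>>> For example (attribute name assumed), a RouteOnAttribute property
>>>>>>>>>>>> such as
>>>>>>>>>>>>
>>>>>>>>>>>>   images = ${'media.Content-Type':startsWith('image/')}
>>>>>>>>>>>>
>>>>>>>>>>>> could send images down an OCR branch while text-bearing documents
>>>>>>>>>>>> go straight to indexing.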
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Joe
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg
>>>>>>>>>>>> <dg...@hexastax.com> wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I see that the ExtractText processor extracts text using regex.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What about a processor that extracts text and metadata from
>>>>>>>>>>>>> incoming files?  That doesn't seem to exist - but perhaps I
>>>>>>>>>>>>> didn't quite look in the right spots.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If that doesn't exist, I'd like to implement and commit it,
>>>>>>>>>>>>> using Apache Tika.  There may also be a couple of related
>>>>>>>>>>>>> processors to go with it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>>