Posted to users@nifi.apache.org by "Peter Wicks (pwicks)" <pw...@micron.com> on 2017/05/18 05:54:17 UTC

RE: [EXT] Parsing Email Attachments

Nick,

Try escaping your \n’s, see if that helps.

(?s)(.*\\n\\n${boundary}\\nContent-Type: text\/plain; charset="UTF-8"\\n\\n)(.*?)(\\n\\n${boundary}.*)

From: Nick Carenza [mailto:nick.carenza@thecontrolgroup.com]
Sent: Thursday, May 18, 2017 11:27 AM
To: users@nifi.apache.org
Subject: [EXT] Parsing Email Attachments

Hey Nifi-ers,

I haven't had any luck parsing emails after consuming them with POP3.

I am composing a simple plain-text message in Gmail, and it comes out like this (with many headers removed):

Delivered-To: slack@company.com
Return-Path: <em...@company.com>
MIME-Version: 1.0
Received: by 0.0.0.0 with HTTP; Tue, 16 May 2017 17:54:04 -0700 (PDT)
From: User <em...@company.com>
Date: Tue, 16 May 2017 17:54:04 -0700
Subject: test subject
To: email@company.com
Content-Type: multipart/alternative; boundary="f403045f83d499711a054fadb980"

--f403045f83d499711a054fadb980
Content-Type: text/plain; charset="UTF-8"

test email body

--f403045f83d499711a054fadb980
Content-Type: text/html; charset="UTF-8"

<div dir="ltr">test email body</div>

--f403045f83d499711a054fadb980--

I just want the email body and ExtractEmailAttachments doesn't seem to extract the parts between the boundaries like I hoped it would.
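
For reference, "the parts between the boundaries" are MIME body parts, and a general-purpose MIME parser returns the text/plain part directly. Below is a minimal sketch outside NiFi, using Python's standard email package on a trimmed copy of the sample above. The function name plain_text_body is made up for this illustration, and nothing here describes how ExtractEmailAttachments works internally.

```python
import email
from email import policy
from typing import Optional

# Trimmed copy of the sample message from this thread.
RAW_MESSAGE = "\n".join([
    "MIME-Version: 1.0",
    "Subject: test subject",
    'Content-Type: multipart/alternative; boundary="f403045f83d499711a054fadb980"',
    "",
    "--f403045f83d499711a054fadb980",
    'Content-Type: text/plain; charset="UTF-8"',
    "",
    "test email body",
    "",
    "--f403045f83d499711a054fadb980",
    'Content-Type: text/html; charset="UTF-8"',
    "",
    '<div dir="ltr">test email body</div>',
    "",
    "--f403045f83d499711a054fadb980--",
    "",
])


def plain_text_body(raw: str) -> Optional[str]:
    """Return the text/plain part of a message, or None if there is none."""
    message = email.message_from_string(raw, policy=policy.default)
    # get_body walks the MIME tree and picks the first part matching the preference list.
    part = message.get_body(preferencelist=("plain",))
    if part is None:
        return None
    # get_content decodes the part using its declared charset.
    return part.get_content().strip()


print(plain_text_body(RAW_MESSAGE))
```

The parser finds the boundary itself from the Content-Type header, so no boundary attribute or regex is needed in this version.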

So instead I use ExtractEmailHeaders, configured to also extract the Content-Type header, and then pull just the boundary value out of that header with an UpdateAttribute processor configured like:

boundary: ${email.headers.content-type:substringAfter('boundary="'):substringBefore('"'):prepend('--')}
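
To make the intent of that expression concrete, here is a rough Python equivalent of the three chained steps, applied to the Content-Type value from the sample message. The function name boundary_delimiter is hypothetical, and raising an error when the marker is missing is a choice made for this sketch, not a statement about how the NiFi functions behave.

```python
def boundary_delimiter(content_type: str) -> str:
    """Build the '--boundary' delimiter line from a Content-Type header value."""
    # Step 1, like substringAfter('boundary="'): keep what follows the marker.
    _, found, rest = content_type.partition('boundary="')
    if not found:
        # Sketch-only behaviour: fail loudly instead of returning something odd.
        raise ValueError("no quoted boundary parameter in Content-Type")
    # Step 2, like substringBefore('"'): keep what precedes the closing quote.
    value, _, _ = rest.partition('"')
    # Step 3, like prepend('--'): MIME delimiter lines start with two hyphens.
    return "--" + value


header = 'multipart/alternative; boundary="f403045f83d499711a054fadb980"'
print(boundary_delimiter(header))  # prints --f403045f83d499711a054fadb980
```

Note that this only handles a quoted boundary parameter, which is what the sample message uses.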

Then I wrote a sweet regex for ReplaceText to clean this up:

(?s)(.*\n\n${boundary}\nContent-Type: text\/plain; charset="UTF-8"\n\n)(.*?)(\n\n${boundary}.*)

[Inline image 1]

... but even though this works in regex testers and sublimetext, it seems to have no effect in my flow.
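
As a sanity check outside NiFi, the sketch below runs the same pattern in Python against the sample body. It also tries a copy with CRLF line endings, since Internet messages use CRLF on the wire while text pasted into a regex tester often has bare LF. Whether the flowfile content in this flow actually contains CRLF is an assumption and is not confirmed anywhere in this thread. The helper names build_pattern and extract are made up for the illustration.

```python
import re
from typing import Optional

BOUNDARY = "--f403045f83d499711a054fadb980"

# Sample body with LF-only line endings, as it would look pasted into a regex tester.
MESSAGE_LF = "\n".join([
    'Content-Type: multipart/alternative; boundary="f403045f83d499711a054fadb980"',
    "",
    BOUNDARY,
    'Content-Type: text/plain; charset="UTF-8"',
    "",
    "test email body",
    "",
    BOUNDARY,
    'Content-Type: text/html; charset="UTF-8"',
    "",
    '<div dir="ltr">test email body</div>',
    "",
    BOUNDARY + "--",
    "",
])

# The same text with CRLF line endings (assumed here, not observed in the flow).
MESSAGE_CRLF = MESSAGE_LF.replace("\n", "\r\n")


def build_pattern(newline: str, boundary: str) -> re.Pattern:
    """Rebuild the three-group pattern with a configurable newline sub-pattern."""
    b = re.escape(boundary)  # treat the boundary as literal text
    return re.compile(
        "(?s)(.*" + newline * 2 + b + newline
        + r'Content-Type: text\/plain; charset="UTF-8"' + newline * 2
        + ")(.*?)(" + newline * 2 + b + ".*)"
    )


def extract(message: str, newline: str) -> Optional[str]:
    """Return capture group 2 (the plain-text body), or None if nothing matches."""
    match = build_pattern(newline, BOUNDARY).search(message)
    return match.group(2) if match else None


print(extract(MESSAGE_LF, r"\n"))       # original pattern on LF text
print(extract(MESSAGE_CRLF, r"\n"))     # original pattern on CRLF text
print(extract(MESSAGE_CRLF, r"\r?\n"))  # tolerant pattern on CRLF text
```

In this sketch the original pattern matches the LF copy but finds nothing in the CRLF copy, while the `\r?\n` variant matches both. If the line endings turn out to be the difference, the same substitution could be tried in the ReplaceText search value.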

Anyone have any insight on this?

Thanks,
Nick

Re: [EXT] Parsing Email Attachments

Posted by Nick Carenza <ni...@thecontrolgroup.com>.
@pwicks no luck on escaping \n. Thanks for the suggestion.

I even tried hardcoding the boundary value, in case it had something to do
with using NiFi expressions in the regex, but that didn't work either.
