Posted to solr-user@lucene.apache.org by "Anatharaman, Srinatha (Contractor)" <Sr...@comcast.com> on 2017/02/13 22:35:43 UTC

Issues with Solr Morphline reading RFC822 files

Hi,

I am loading email files in RFC822 format into SolrCloud using Flume, but some of the emails' metadata is not getting loaded into Solr.
Please find a sample email below; the text colored in bold red is ignored by Solr.
I can read these files ONLY using the org.apache.tika.parser.mail.RFC822Parser parser. If I try to read them using TXTParser, Solr ignores the files with the error "No supported MIME type found for _attachment_mimetype=message/rfc822"

How do I overcome this issue? I want to read the email files without losing a single word from the file.

Received: from resqmta-po-08v.sys.XXXX.net ([196.114.154.167])
        by csp-imta02.westchester.pa.bo.XXXX.net with bizsmtp
        id EClZ1u0013cy81c01E9enp; Wed, 30 Nov 2016 14:09:38 +0000
Received: from resimta-po-14v.sys. XXXX.net ([96.114.154.142])
        by resqmta-po-08v.sys.XXXX.net with SMTP
        id C5ZqcRB3e2dNjC5ZqcQvHl; Wed, 30 Nov 2016 14:09:38 +0000
Received: from outgoingemail1.digitalrightscorp.com ([69.36.73.150])
        by resimta-po-14v.sys.XXXX.net with SMTP
        id C5ZNcJfg9npCYC5Zcceh9K; Wed, 30 Nov 2016 14:09:25 +0000
X-Xfinity-Message-Heuristics: IPv6:N;TLS=0;SPF=0;DMARC=
Received: from outgoingemail1-69-150 (localhost [127.0.0.1])
        by outgoingemail1. XXXXXRightsCorp.com (Postfix) with ESMTP id 15EB7100419
        for <dm...@XXXX.net>; Wed, 30 Nov 2016 06:05:52 -0800 (PST)
From: APMC@XXXXXRightsCorp.com
To: dmca@XXXX.net
Message-ID: <55...@outgoingemail1-69-150>
Subject: Unauthorized Use of Copyrights RE:
TC-cc0ae97d-8918-4a4b-8515-749ff9303bc0
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Date: Wed, 30 Nov 2016 06:05:52 -0800 (PST)
X-CMAE-Envelope: MS4wfAIoEnMl1VVV7nPS/7pis5Gr/ijSjTNaioaGiZVCAo4cXRoeTl9Z1Nt8SYSY4kX7RpDlZuxzGbzyeRDJIorfdeodi9fzNtQETs56Or8SwlysmgQQQt4R
kKDdiZaRx3Q0be579K6C4XZGyRC6JMDzDi1X6bXgBL8KYDFFA/aEyOBd+2Zrz1YpOi2aTjzyRc4d4MXJwaIGivtlXtZc6R5KypOhVP6eX1kx/qV9OwVzXAz6

**NOTE TO ISP: PLEASE FORWARD THE ENTIRE NOTICE***

Re: Unauthorized Use of Copyrights Owned Exclusively by The Bicycle Music Company

Reference#: ZBP96D4  IP Address: 73.166.122.44

Dear Sir or Madam:
.
.
.
.
.
.


Regards,
~Sri

RE: Issues with Solr Morphline reading RFC822 files

Posted by "Anatharaman, Srinatha (Contractor)" <Sr...@comcast.com>.
From the original email, the lines below are not indexed. These are the metadata that appear before the actual email body:

> Received: from resqmta-po-08v.sys.XXXX.net ([196.114.154.167])
>        by csp-imta02.westchester.pa.bo.XXXX.net with bizsmtp
>        id EClZ1u0013cy81c01E9enp; Wed, 30 Nov 2016 14:09:38 +0000
> Received: from resimta-po-14v.sys. XXXX.net ([96.114.154.142])
>        by resqmta-po-08v.sys.XXXX.net with SMTP
>        id C5ZqcRB3e2dNjC5ZqcQvHl; Wed, 30 Nov 2016 14:09:38 +0000
> Received: from outgoingemail1.digitalrightscorp.com ([69.36.73.150])
>        by resimta-po-14v.sys.XXXX.net with SMTP
>        id C5ZNcJfg9npCYC5Zcceh9K; Wed, 30 Nov 2016 14:09:25 +0000
> X-Xfinity-Message-Heuristics: IPv6:N;TLS=0;SPF=0;DMARC=
> Received: from outgoingemail1-69-150 (localhost [127.0.0.1])
>        by outgoingemail1. XXXXXRightsCorp.com (Postfix) with ESMTP id 15EB7100419
>        for <dm...@XXXX.net>; Wed, 30 Nov 2016 06:05:52 -0800 (PST)
> From: APMC@XXXXXRightsCorp.com
> To: dmca@XXXX.net
> Message-ID: <55...@outgoingemail1-69-150>
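
To check what the raw file contains independently of the Flume/Morphline pipeline, here is a minimal sketch using Python's standard email module (not Tika or Morphlines). It keeps every header, including the repeated Received lines, as multi-valued fields and stores the body separately. The sample message, the hdr_ field prefix, and the email_to_fields name are assumptions made up for illustration; they are not part of any Solr schema or Morphline config mentioned in this thread.

```python
from collections import defaultdict
from email import message_from_string

# Hypothetical sample message; addresses and IDs are placeholders.
SAMPLE = """\
Received: from relay-b.example.net ([192.0.2.2])
        by relay-a.example.net with SMTP
        id HYPOTHETICAL1; Wed, 30 Nov 2016 14:09:38 +0000
Received: from sender.example.com ([192.0.2.3])
        by relay-b.example.net with SMTP
        id HYPOTHETICAL2; Wed, 30 Nov 2016 14:09:25 +0000
From: sender@example.com
To: abuse@example.net
Subject: Sample notice
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Body text of the notice.
"""


def email_to_fields(raw):
    """Return a dict of multi-valued fields: one list per header, plus the body."""
    msg = message_from_string(raw)
    fields = defaultdict(list)
    for name, value in msg.items():
        # Hypothetical field naming: "hdr_" + lowercased header name.
        key = "hdr_" + name.lower().replace("-", "_")
        # Folded headers span several lines; collapse the folding whitespace.
        fields[key].append(" ".join(value.split()))
    # For a non-multipart message, get_payload() returns the body as a string.
    fields["body"].append(msg.get_payload())
    return dict(fields)


if __name__ == "__main__":
    for field, values in email_to_fields(SAMPLE).items():
        for v in values:
            print(field + ": " + v)
```

Comparing this output with the fields present in the indexed Solr document may help pinpoint which headers the parser drops.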



-----Original Message-----
From: Dave [mailto:hastings.recursive@gmail.com] 
Sent: Monday, February 13, 2017 5:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Issues with Solr Morphline reading RFC822 files

Can't see what's color coded in the email. 



Re: Issues with Solr Morphline reading RFC822 files

Posted by Dave <ha...@gmail.com>.
Can't see what's color coded in the email. 
