Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2019/08/01 01:38:25 UTC

Indexing information on number of attachments and their names in EML file

Hi,

I would like to check: is there any way to detect the number of
attachments and their names when indexing EML files in Solr, and to index
that information into Solr?

Currently, Solr is able to use Tika and Tesseract OCR to extract the
contents of the attachments. However, I could not find any information
about the number of attachments in the EML file or their filenames.

I am using Solr 7.6.0 in production, and also trying out on the new Solr
8.2.0.

Regards,
Edwin

Re: Indexing information on number of attachments and their names in EML file

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Tim,

Regarding the returned list of Metadata objects, is the code
supposed to include information on the number of attachments in a
particular email and/or the names of the attachments?
For example, if there are 3 attachments in the email, should we be able to
see immediately from the Metadata that there are attachments, and that
there are 3 of them?

Thank you.

Regards,
Edwin

On Sat, 3 Aug 2019 at 07:19, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Thanks for the reply, will find out more about it.
>
> Currently I am able to retrieve the normal Metadata of the email, but not
> the Metadata of the attachments, which appear as parts of the EML file's
> contents and look something like this:
>
> --000000000000d8b77b057d59ca19--
>
> --000000000000d8b77e057d59ca1b
> Content-Type: application/pdf; name="file1.pdf"
> Content-Disposition: attachment; filename="file1.pdf"
> Content-Transfer-Encoding: base64
> Content-ID: <f_jpurtpnk0>
> X-Attachment-Id: f_jpurtpnk0
>
> Regards,
> Edwin
>
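
Once the attachment names are known, getting the count and the names into Solr is mostly a matter of adding two fields to the document that is sent for indexing. A minimal sketch in Python, assuming the names have already been extracted elsewhere; the document id, the sample filenames, and the field names attachment_count_i and attachment_names_ss are hypothetical (the suffixes assume dynamic-field rules like those in Solr's default configset, so adjust them to your own schema):

```python
import json

def to_solr_doc(doc_id, attachment_names):
    """Build one Solr document carrying the attachment count and names."""
    return {
        "id": doc_id,
        # Hypothetical field names; they must match fields in your schema.
        "attachment_count_i": len(attachment_names),
        "attachment_names_ss": list(attachment_names),
    }

# Made-up example values for an email with three attachments.
doc = to_solr_doc("email-0001", ["file1.pdf", "file2.docx", "file3.png"])

# Solr's JSON update handler accepts a JSON array of documents.
payload = json.dumps([doc], indent=2)
print(payload)
```

The payload would then be posted to the collection's update handler by whatever ingest code you write; the sketch stops before any network call on purpose.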

Re: Indexing information on number of attachments and their names in EML file

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Thanks for the reply, will find out more about it.

Currently I am able to retrieve the normal Metadata of the email, but not
the Metadata of the attachments, which appear as parts of the EML file's
contents and look something like this:

--000000000000d8b77b057d59ca19--

--000000000000d8b77e057d59ca1b
Content-Type: application/pdf; name="file1.pdf"
Content-Disposition: attachment; filename="file1.pdf"
Content-Transfer-Encoding: base64
Content-ID: <f_jpurtpnk0>
X-Attachment-Id: f_jpurtpnk0

Regards,
Edwin

On Sat, 3 Aug 2019 at 05:38, Tim Allison <ta...@apache.org> wrote:

> I'd strongly recommend rolling your own ingest code.  See Erick's
> superb: https://lucidworks.com/post/indexing-with-solrj/
>
> You can easily get attachments via the RecursiveParserWrapper, e.g.
>
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java#L351
>
> This will return a list of Metadata objects; the first one will be the
> main/container, each other entry will be an attachment.  Let us know
> if you have any questions/surprises.  There are a couple of todos for
> .eml...
>
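
The part headers quoted above already carry the attachment name, in the filename parameter of Content-Disposition. As a rough illustration of reading them, here is a sketch that uses Python's standard email module rather than Tika, so it is a stand-in for the idea and not the RecursiveParserWrapper approach Tim suggested. The sample message, its boundary, and its payload are made up:

```python
from email import message_from_string, policy

# A tiny hand-written message for illustration only.
SAMPLE_EML = """\
From: sender@example.com
To: recipient@example.com
Subject: Report
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="utf-8"

See the attached file.

--BOUNDARY
Content-Type: application/pdf; name="file1.pdf"
Content-Disposition: attachment; filename="file1.pdf"
Content-Transfer-Encoding: base64

JVBERi0xLjQK

--BOUNDARY--
"""

def attachment_names(eml_text):
    """Return the filenames of every MIME part marked as an attachment."""
    msg = message_from_string(eml_text, policy=policy.default)
    return [
        part.get_filename()
        for part in msg.walk()  # visits the message and all nested parts
        if part.get_content_disposition() == "attachment"
    ]

names = attachment_names(SAMPLE_EML)
print(len(names), names)
```

Parts that a sender marks as inline, or leaves without a Content-Disposition header, are not counted by this check, so real mail may need a looser rule.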

Re: Indexing information on number of attachments and their names in EML file

Posted by Tim Allison <ta...@apache.org>.
I'd strongly recommend rolling your own ingest code.  See Erick's
superb post: https://lucidworks.com/post/indexing-with-solrj/

You can easily get attachments via the RecursiveParserWrapper, e.g.
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java#L351

This will return a list of Metadata objects; the first one will be the
main/container document, and every subsequent entry will be an
attachment.  Let us know if you have any questions/surprises.  There are
a couple of todos for .eml...

On Fri, Aug 2, 2019 at 3:43 AM Jan Høydahl <ja...@cominvent.com> wrote:
>
> Try the Apache Tika mailing list.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
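
Given a list shaped the way described above (container first, one entry per embedded document after it), the attachment count is simply the list length minus one, and the names can be read from each later entry. A small sketch in Python with a hand-written list standing in for real parser output; the key names (including resourceName) and all values are illustrative assumptions, not captured Tika output:

```python
# Stand-in for the list a recursive parse returns:
# entry 0 is the container email, every later entry is an embedded document.
metadata_list = [
    {"Content-Type": "message/rfc822", "subject": "Report"},
    {"Content-Type": "application/pdf", "resourceName": "file1.pdf"},
    {"Content-Type": "image/png", "resourceName": "chart.png"},
    {"Content-Type": "text/plain", "resourceName": "notes.txt"},
]

def summarize_attachments(entries):
    """Derive the attachment count and names from a container-first list."""
    embedded = entries[1:]  # skip the container entry at index 0
    names = [entry.get("resourceName", "(unnamed)") for entry in embedded]
    return {"count": len(embedded), "names": names}

summary = summarize_attachments(metadata_list)
print(summary)
```

Whether every embedded entry corresponds to what a mail client would call an attachment is worth checking against real output for your .eml files before relying on the count.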

Re: Indexing information on number of attachments and their names in EML file

Posted by Jan Høydahl <ja...@cominvent.com>.
Try the Apache Tika mailing list.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 2 Aug 2019, at 05:01, Zheng Lin Edwin Yeo <ed...@gmail.com> wrote:
> 
> Hi,
> 
> Does anyone know whether this can be done on the Solr side,
> or does it have to be done on the Tika side?
> 
> Regards,
> Edwin
> 


Re: Indexing information on number of attachments and their names in EML file

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi,

Does anyone know whether this can be done on the Solr side,
or does it have to be done on the Tika side?

Regards,
Edwin

On Thu, 1 Aug 2019 at 09:38, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi,
>
> I would like to check: is there any way to detect the number of
> attachments and their names when indexing EML files in Solr, and to index
> that information into Solr?
>
> Currently, Solr is able to use Tika and Tesseract OCR to extract the
> contents of the attachments. However, I could not find any information
> about the number of attachments in the EML file or their filenames.
>
> I am using Solr 7.6.0 in production, and also trying out on the new Solr
> 8.2.0.
>
> Regards,
> Edwin
>