Posted to java-user@lucene.apache.org by lude <lu...@googlemail.com> on 2006/08/15 19:28:41 UTC

Best Practice: emails and file-attachments

Hello,

Does anybody have an idea of the best design approach for realizing
the following:

The goal is to index emails and their corresponding file attachments.
One email could contain for example:

1 x subject
1 x sender-address
1 x to-addresses
1 x message-text
0..n x file-attachments  (each contains a 'file-name' and the
'file-content')

How should I build the index?

First approach:
Each email + attachments gets one document with the following fields:
subject, sender_address, to_address, message_text, 1_attachment_name,
1_attachment_content, 2_attachment_name, 2_attachment_content,
3_attachment_name, 3_attachment_content
Disadvantage:
Only three attachments could be indexed. It isn't a generic solution for
indexing 'n' file-attachments.

Second approach:
Each email gets one document with the main email-data and 0 to n documents
of file-attachments:
1 x  email_id, subject, sender_address, to_address, message_text
0..n x  email_id, attachment_name, attachment_content
Disadvantage:
At query time it is difficult to aggregate the documents that belong to
each other. One hit per email (including attachments) should be shown.
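
One possible way to get from the second approach to one hit per email is to
collapse the raw hits on email_id after the search. This is only a sketch in
plain Java with no Lucene calls; the Hit record, its fields, the collapse
method and the scores are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HitGrouping {
    // Hypothetical stand-in for one search hit: either the email document
    // itself (attachmentName == null) or one of its attachment documents.
    record Hit(String emailId, String attachmentName, float score) {}

    // Keep only the best-scoring hit per email_id, best score first.
    static List<Hit> collapse(List<Hit> rawHits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit h : rawHits) {
            best.merge(h.emailId(), h, (a, b) -> a.score() >= b.score() ? a : b);
        }
        List<Hit> result = new ArrayList<>(best.values());
        result.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
        return result;
    }

    public static void main(String[] args) {
        List<Hit> raw = List.of(
            new Hit("mail-1", null, 0.9f),
            new Hit("mail-2", "filename.doc", 0.7f),
            new Hit("mail-2", null, 0.4f));
        for (Hit h : collapse(raw)) {
            String where = h.attachmentName() == null
                ? "message text"
                : "attachment '" + h.attachmentName() + "'";
            System.out.println(h.emailId() + " matched in " + where);
        }
    }
}
```

The cost is that every matching document has to be looked at before the
result list can be cut to the first page.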

Any thoughts?

Thanks
lude

Re: Best Practice: emails and file-attachments

Posted by John Haxby <jc...@scalix.com>.
Oh rats. Thunderbird ate the indenting. The two examples should be:

multipart/alternative
	text/plain
	multipart/related
		text/html
		image/gif
		image/gif
	application/msword

and

multipart/related
	text/html
	image/gif
	application/msword

The indenting indicates nesting. A message isn't just a bodypart
followed by attachments; it has structure like a file system, something
that escapes most mail readers. Sigh.




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Best Practice: emails and file-attachments

Posted by John Haxby <jc...@scalix.com>.
lude wrote:
>> You also mentioned indexing each bodypart ("attachment") separately.
>> Why? ....
>> To my mind, there is no use case where it makes sense to search a 
>> particular bodypart
>
> I will give you the use case:
>
> [snip]
> 3.) The result list would show this:
> 1. mail-1 'subject'
> 'Abstract of the message-text'
> 2. mail-2 'subject'
> Attachment with name 'filename.doc' contains 'Abstract of
> file-content'
>
> Another use case would be an extended search that lets the user choose
> whether "attached files" should be searched (yes or no).

That's a good use case. File it as a bug and close it WONTFIX :-) The 
problem that you have is trying to determine whether something is going 
to be inline or an attachment. I'll give you a real-life example that 
caught out some old code the other day. We had a message with this 
structure:

multipart/alternative
text/plain
multipart/related
text/html
image/gif
image/gif
application/msword

Is there an attached file in there? Think before you read on.






The answer should be "no". Are you surprised that at least one client 
decided that there was? What we have is three representations of the 
same document: plain text, html (with two pictures) and MS Word. The 
original, the Word document, obviously has the best fidelity and comes
last. The one client I'm thinking of (and I've lost track of which one 
it was) correctly suppressed the display of the text/plain alternative, 
displayed the HTML with its pictures in-line and then mistakenly 
displayed the Word document as an attachment.

This is a fictional example, but it could exist:

multipart/related
text/html
image/gif
application/msword

The gif image (and let's assume it can be indexed sensibly) is 
"obviously" a picture in the HTML bodypart. What's the word document? 
It's referenced from the HTML as a link just like the picture is. Is it 
an attachment? What's the difference between the word document 
referenced as a link within the multipart/related (by content-id) and a 
link to an external document (by http URL)? From a user perspective both 
are the same, but is one an attachment and the other not? I'm being
unfair: this is not only an unrealistic problem, but there isn't a right
or a wrong answer. The word document isn't an attachment because it
doesn't (or shouldn't) appear in the list of attachments and it's not 
in-line because you have to click on something to see it.

So yes, I agree, your use-cases are good; I'm just not sure how you're 
going to identify an attachment :-)

I do like the idea, though, that when you search for "xyzzy" you get
the abstract of the bodypart that contains "xyzzy" rather than the
abstract (or subject) of the entire message, and I'm going to think about
that one some more. The problem that immediately springs to mind though
is that a message can have an arbitrary number of bodyparts so if I have 
BODY-1, BODY-2, ..., BODY-N (where N is unknown) how hard is it for me 
to construct the search? I think I probably should construct the search 
that way because the score depends upon the size of the document and it 
seems to make sense that the document is the bodypart, not the entire 
message, but it seems more complex than is useful for mail messages.

jch



Re: Best Practice: emails and file-attachments

Posted by lude <lu...@googlemail.com>.
Hi John,

thanks again for the many words and explanations!

> You also mentioned indexing each bodypart ("attachment") separately.
> Why? ....
> To my mind, there is no use case where it makes sense to search a
> particular bodypart

I will give you the use case:

1.)  User searches for "abcd"
2.) Lucene matches the search term (at least) two times:
    - One email has the term in the plain message text   (mail-1)
    - One email contains five file attachments. One of these files matches
      the search term  (mail-2)
3.) The result list would show this:
    1.   mail-1  'subject'
         'Abstract of the message-text'
    2.   mail-2 'subject'
         Attachment with name 'filename.doc' contains
         'Abstract of file-content'

Another use case would be an extended search that lets the user choose
whether "attached files" should be searched (yes or no).

Greetings
lude




Re: Best Practice: emails and file-attachments

Posted by John Haxby <jc...@scalix.com>.
lude wrote:
> Hi John,
>
> thanks for the detailed answer.
>
> You wrote:
>> If you're indexing a
>> multipart/alternative bodypart then index all the MIME headers, but only
>> index the content of the *first* bodypart.
>
> Does this mean you index just the first file-attachment?
> What do you advise if you have to index multipart bodies (== more than
> one file-attachment)?
> One lucene-document for each part (==file)?
> How do you handle the queries?
MIME has no concept of "attachment", that's something that the user 
agent programs have a concept of -- you "attach" a file to a message.  
The file might be a picture, a word document, a compressed tar archive 
-- as far as MIME is concerned they're all the same (well, apart from the
content-* headers that describe what's "attached").   The MIME type for 
a message with "attachments" is "multipart".   There are several 
subtypes though.   If you're typing a plain text message (whose MIME 
type is text/plain, a message like this one) and you attach a jpeg image 
to it you'll be sending a message whose type is multipart/mixed;  the 
first part will have type text/plain and the second image/jpeg.   In 
Google Mail  under "more options" you can "show original" to see the 
complete MIME message and you'll see the different parts separated by a 
boundary.

OK.   Now I'm in a position to answer your question.   Often, when you 
send an HTML formatted message the content of the message is sent twice: 
once as text/plain and once as text/html (or multipart/related if it has 
pictures and stuff).   The two parts are alternatives, apart from the 
formatting (and pictures) there's no difference between the two parts, 
you can read either.  The best fidelity of the alternatives (and there 
can be more than two) is last, the poorest fidelity first, but the 
intent of the sender is that you can read any of them.   This is a 
multipart/alternative bodypart.   Because all parts of the 
multipart/alternative have the same text then you can index any of them, 
so index the first as that's going to be the easiest to process (it's 
almost always going to be text/plain).
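
The rule above can be sketched as a walk over the bodypart tree. This is a
simplified illustration only: the Part record is a hypothetical stand-in for
whatever your MIME parser hands you, and a real message needs the headers,
encodings and error handling that the RFCs describe:

```java
import java.util.ArrayList;
import java.util.List;

class PartSelection {
    // Hypothetical, simplified model of a MIME bodypart: a content type
    // plus child parts (an empty list for a leaf such as text/plain).
    record Part(String contentType, List<Part> children) {
        static Part leaf(String type) {
            return new Part(type, List.of());
        }
    }

    // Collect the leaf parts whose content should be indexed.
    static List<Part> partsToIndex(Part root) {
        List<Part> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Part part, List<Part> out) {
        if (part.children().isEmpty()) {
            out.add(part);
        } else if (part.contentType().equalsIgnoreCase("multipart/alternative")) {
            // The alternatives carry the same text, so only the first is taken.
            collect(part.children().get(0), out);
        } else {
            // Any other multipart: visit every child in order.
            for (Part child : part.children()) {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        Part message = new Part("multipart/mixed", List.of(
            new Part("multipart/alternative", List.of(
                Part.leaf("text/plain"), Part.leaf("text/html"))),
            Part.leaf("image/jpeg")));
        for (Part p : partsToIndex(message)) {
            System.out.println(p.contentType());
        }
    }
}
```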

I've skipped loads.   You need to read the RFCs.   Start with RFC2045 
(http://www.rfc.net/rfc2045.html) and keep going.  If you get stuck with 
the details of how messages are constructed, go back and read RFC2822 
first, or at least skim it (it's quite long).  Note that RFC2045
references RFC822 in its abstract; wherever you see references to
RFC821 and RFC822 you can read them as references to RFC2821 and RFC2822
respectively -- the newer ones are a little more precise when they need 
to be and have rather more explanation of awkward cases that you need to 
know about.

Someone earlier (and I'm sorry, I deleted the message before realising
I should reply) said something about attached files really being in an
attached .tar.gz file.   Well, yes and no.   An attached compressed tar 
archive is a bodypart like any other and will need to be indexed like 
any other.   That will involve breaking it open and indexing the files 
that it contains.   It's not really any different to indexing an open 
office document (which is actually a zip file).
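
As a rough illustration of "breaking it open" for the zip case, the JDK's
java.util.zip classes are enough. The ArchiveParts class and unpack method
are names made up for this sketch, and decoding every entry as UTF-8 text is
an assumption that only holds for simple content. The JDK has no tar reader,
so a .tar.gz would need a third-party library after GZIPInputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ArchiveParts {
    // Map each file in the zip to its content, decoded as UTF-8 text.
    // Directory entries are skipped.
    static Map<String, String> unpack(InputStream zipData) throws IOException {
        Map<String, String> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.isDirectory()) {
                    continue;
                }
                // readAllBytes() stops at the end of the current entry.
                files.put(e.getName(),
                    new String(zip.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Build a one-entry zip in memory so the example needs no files.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf)) {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write("<text>xyzzy</text>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        unpack(new ByteArrayInputStream(buf.toByteArray()))
            .forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

Each extracted entry can then be fed through the same indexing path as any
other bodypart.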

You also mentioned indexing each bodypart ("attachment") separately.   
Why?   When I'm searching, am I going to look for the word "xyzzy" in 
the first bodypart?   What if it was a multipart/alternative and 
Thunderbird (in my case) suppressed the first bodypart and "xyzzy" is 
something that couldn't be rendered in the (first) text/plain 
alternative?   To my mind, there is no use case where it makes sense to 
search a particular bodypart.  There *might* be a case for searching the 
"prime" bodypart and "attachments" but when you read the MIME spec 
you'll realise that detecting what the user sees as an attachment is not 
easy: it gets even harder when you discover that different mail user 
agents have different and legal (and sometimes reasonable) ways of 
deciding whether to treat something as in-line or as an attachment.   To 
be honest, people don't remember whether something was an attachment.   
They think "I remember reading about xyzzy in a mail message" and go off 
looking for that.   They often can't tell and remember even less that 
the "xyzzy" was in something that you decided was an attachment.   And 
if your rules for deciding whether you have something that's intended to 
be viewed as an attachment or in-line are different to the rules that 
the  user's mail reader is using then you'll have Awkward Bugs to 
explain.   You'll read about "Content-Disposition" in the RFCs, but 
don't believe that it's a foolproof way of deciding whether or not 
something is an attachment: lack of a content-disposition header doesn't
mean "inline" or "attachment", and Microsoft, bless them, have weird rules all
of their own for deciding whether to display something in-line or not.

jch



Re: Best Practice: emails and file-attachments

Posted by lude <lu...@googlemail.com>.
Hi John,

thanks for the detailed answer.

You wrote:
> If you're indexing a
> multipart/alternative bodypart then index all the MIME headers, but only
> index the content of the *first* bodypart.

Does this mean you index just the first file-attachment?
What do you advise if you have to index multipart bodies (== more than one
file-attachment)?
One lucene-document for each part (==file)?
How do you handle the queries?

Greetings
lude




Re: Best Practice: emails and file-attachments

Posted by John Haxby <jc...@scalix.com>.
lude wrote:
> does anybody has an idea what is the best design approch for realizing
> the following:
>
> The goal is to index emails and their corresponding file attachments.
> One email could contain for example:
I put a fair amount of thought into this when I was doing the design for 
our mail server -- I know about mail :-)   After a little trial and 
error I came up with the following scheme:

   1. All header fields indexed under their own name with the name
      converted to lower case.
   2. Almost all bodyparts indexed in a single field called BODY (in
      upper case)
   3. Meta-data such as SIZE, DELIVERY-DATE and similar indexed with
      uppercase fields
   4. Extensions for other bodypart-specific or application-specific
      fields indexed as something with an initial uppercase letter and
      at least one lowercase letter

That gives an extensible set of fields and does not require that the index
knows ahead of time what header fields will be present or relevant.   It
means that there are potentially a lot of fields: we're running at about 
60 depending on the user.
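
To make the naming scheme concrete, here is a small sketch that builds the
field map for one message. It is plain Java rather than Lucene code; the
FieldNaming class, its helper methods and the sample values are invented for
the example, and in a real indexer each entry would become an index field:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class FieldNaming {
    // Rule 1: header fields are indexed under their lower-cased name.
    static String headerField(String headerName) {
        return headerName.toLowerCase(Locale.ROOT);
    }

    // Rule 3: meta-data fields are upper case.
    static String metaField(String name) {
        return name.toUpperCase(Locale.ROOT);
    }

    // A field may hold several values, e.g. one BODY value per bodypart.
    static void add(Map<String, List<String>> doc, String field, String value) {
        doc.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    static Map<String, List<String>> exampleDocument() {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        add(doc, headerField("Subject"), "Best Practice: emails and file-attachments");
        // A header nobody planned for simply becomes another field.
        add(doc, headerField("X-Mailer"), "Thunderbird");
        // Rule 2: bodyparts share the single BODY field.
        add(doc, "BODY", "text of the first bodypart");
        add(doc, "BODY", "text of the second bodypart");
        add(doc, metaField("size"), "2048");
        return doc;
    }

    public static void main(String[] args) {
        exampleDocument().forEach((field, values) ->
            System.out.println(field + " = " + values));
    }
}
```

Because header names are lower case and everything else starts with an upper
case letter, a header can never collide with BODY or a meta-data field.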

Some header fields are special.   The various message-id fields 
(Message-Id, Resent-Message-Id, In-Reply-To and References) need to have 
their message-ids carefully extracted and then indexed untokenized.
Recipient fields (to, cc, from, etc) need to be parsed and then have their
addresses re-assembled as a friendly-name and an RFC822 address -- the 
reason for the re-assembly is that addresses can be presented in 
equivalent but odd fashions.   Most header fields can have RFC2047 
encoded text which needs to be decoded.
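
A minimal sketch of the message-id extraction, assuming the common
"<left@right>" form: the regular expression is my simplification, not the
full RFC2822 grammar, so it will be fooled by angle brackets inside comments
or quoted strings. Each returned id would then be indexed as one untokenized
term:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MessageIds {
    // Simplified pattern: anything between angle brackets that contains
    // no whitespace and no nested angle brackets.
    private static final Pattern MSG_ID = Pattern.compile("<([^<>\\s]+)>");

    // Extract every message-id from a header value such as References,
    // which may hold several ids separated by folding whitespace.
    static List<String> extract(String headerValue) {
        List<String> ids = new ArrayList<>();
        Matcher m = MSG_ID.matcher(headerValue);
        while (m.find()) {
            ids.add(m.group(1)); // the id without the angle brackets
        }
        return ids;
    }

    public static void main(String[] args) {
        String references = "<a1@example.com>\r\n <b2@example.com>";
        System.out.println(extract(references));
    }
}
```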

When indexing the bodyparts you need to be a little careful.   In 
general, the MIME headers for each part are all indexed as other message 
headers (content-id is a message-id field) and I also indexed the
canonical content type under a CONTENT-TYPE field, again to get rid of 
fluff so that I can search for, say, 
CONTENT-TYPE:application/x-vnd-powerpoint to find all those annoyingly 
huge messages :-)  An attached message probably doesn't want all its 
headers indexed: subject is good; recipients are probably bad as it'll 
confuse the normal search and give unexpected results; message-id fields 
are almost certainly a bad idea.  If you're indexing a 
multipart/alternative bodypart then index all the MIME headers, but only 
index the content of the *first* bodypart.
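
For the canonical content type, something along these lines is probably
enough to strip the fluff. It is a sketch only: the canonical method is a
made-up helper that drops the parameters and lower-cases the rest, and it
does not handle RFC822 comments in the header value:

```java
import java.util.Locale;

class ContentTypes {
    // Reduce a Content-Type header value to "type/subtype" in lower case,
    // dropping parameters such as charset or boundary.
    static String canonical(String headerValue) {
        int semi = headerValue.indexOf(';');
        String type = semi >= 0 ? headerValue.substring(0, semi) : headerValue;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(canonical("Text/HTML; charset=\"utf-8\""));
    }
}
```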

Does that all make sense?  JavaMail is great for this, it's good at
parsing and extracting the content of messages.  However, it's not
enough to just read what I've said and the JavaMail doc.   You also need
to be intimately familiar with the MIME RFCs (I think the first one is
RFC2045, but they're not difficult to find as they're all around RFC2047)
and RFC2822, the message structure RFC itself.   If you just guess
because the structure is "obvious" you'll come unstuck.

jch



Re: Best Practice: emails and file-attachments

Posted by lude <lu...@googlemail.com>.
Hi Dejan,

How do you query for email and(!) attachment documents
if you just want to present one hit per email (even if the search term
matches in the email and(!) in the corresponding attachment document)?

Thanks
lude



RE: Best Practice: emails and file-attachments

Posted by Dejan Nenov <de...@jollyobject.com>.
The approach I find best is to create both Email documents, which contain a
list of (and links to) all attachments, and individual Attachment
documents.

It gets a little tricky when you have a forwarded email, containing an
original Email that contains a tar.gz attachment, which contains the
"actual" attached files :)

(Shameless promotion follows) If you are a Windows user, for a _very_ good
example get a copy of X1 Desktop (free - also distributed as Yahoo! Desktop
search) - then right-click on the column headers and look at the available
fields for email.


Dejan
