Posted to java-user@lucene.apache.org by lude <lu...@googlemail.com> on 2006/08/15 19:28:41 UTC
Best Practice: emails and file-attachments
Hello,
does anybody have an idea what the best design approach is for realizing
the following:
The goal is to index emails and their corresponding file attachments.
One email could contain for example:
1 x subject
1 x sender-address
1 x to-addresses
1 x message-text
0..n x file-attachments (each contains a 'file-name' and the
'file-content')
How should I build the index?
First approach:
Each email + attachments gets one document with the following fields:
subject, sender_address, to_address, message_text, 1_attachment_name,
1_attachment_content, 2_attachment_name, 2_attachment_content,
3_attachment_name, 3_attachment_content
Disadvantage:
Only three attachments could be indexed. It isn't a generic solution for
indexing 'n' file-attachments.
Second approach:
Each email gets one document with the main email-data and 0 to n documents
of file-attachments:
1 x email_id, subject, sender_address, to_address, message_text
0..n x email_id, attachment_name, attachment_content
Disadvantage:
At query time it is difficult to aggregate the documents that belong
together. One hit per email (including attachments) should be shown.
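One possible way around this disadvantage is to collapse the raw hits on email_id after the search has run. The following is only a minimal sketch in plain Java under that assumption: the names HitCollapser, Hit and collapse are hypothetical, nothing here is Lucene API, and the scores are assumed to come from whatever search produced the hits.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HitCollapser {
    /** One raw hit: the owning email, which document matched, and its score. */
    static final class Hit {
        final String emailId;
        final String source; // hypothetical label: "message" or an attachment name
        final float score;

        Hit(String emailId, String source, float score) {
            this.emailId = emailId;
            this.source = source;
            this.score = score;
        }
    }

    /** Keeps only the best-scoring hit per email_id, ordered by descending score. */
    static List<Hit> collapse(List<Hit> rawHits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit hit : rawHits) {
            Hit current = best.get(hit.emailId);
            if (current == null || hit.score > current.score) {
                best.put(hit.emailId, hit);
            }
        }
        List<Hit> result = new ArrayList<>(best.values());
        result.sort((a, b) -> Float.compare(b.score, a.score));
        return result;
    }
}
```

Because the surviving hit remembers its source, the result list can still say whether the match was in the message text or in a particular attachment.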
Any thoughts?
Thanks
lude
Re: Best Practice: emails and file-attachments
Posted by John Haxby <jc...@scalix.com>.
Oh rats. Thunderbird ate the indenting. The two examples should be:
multipart/alternative
    text/plain
    multipart/related
        text/html
        image/gif
        image/gif
    application/msword
and
multipart/related
    text/html
    image/gif
    application/msword
The indenting indicates nesting. A message isn't just a bodypart
followed by attachments, it has structure like a file system. Something
which escapes most mail readers. Sigh.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Best Practice: emails and file-attachments
Posted by John Haxby <jc...@scalix.com>.
lude wrote:
>> You also mentioned indexing each bodypart ("attachment") separately.
>> Why? ....
>> To my mind, there is no use case where it makes sense to search a
>> particular bodypart
>
> I will give you the use case:
>
> [snip]
> 3.) The result list would show this:
>     1. mail-1 'subject'
>        'Abstract of the message-text'
>     2. mail-2 'subject'
>        Attachment with name 'filename.doc' contains 'Abstract of
>        file-content'
>
> Another use case would be an extended search that lets the user select
> whether "attached files" should be searched (yes or no).
That's a good use case. File it as a bug and close it WONTFIX :-) The
problem that you have is trying to determine whether something is going
to be inline or an attachment. I'll give you a real-life example that
caught out some old code the other day. We had a message with this
structure:
multipart/alternative
    text/plain
    multipart/related
        text/html
        image/gif
        image/gif
    application/msword
Is there an attached file in there? Think before you read on.
The answer should be "no". Are you surprised that at least one client
decided that there was? What we have is three representations of the
same document: plain text, html (with two pictures) and MS Word. The
original, the Word document, obviously has the best fidelity and comes
last. The one client I'm thinking of (and I've lost track of which one
it was) correctly suppressed the display of the text/plain alternative,
displayed the HTML with its pictures in-line and then mistakenly
displayed the Word document as an attachment.
This is a fictional example, but it could exist:
multipart/related
    text/html
    image/gif
    application/msword
The gif image (and let's assume it can be indexed sensibly) is
"obviously" a picture in the HTML bodypart. What's the Word document?
It's referenced from the HTML as a link just like the picture is. Is it
an attachment? What's the difference between the Word document
referenced as a link within the multipart/related (by content-id) and a
link to an external document (by http URL)? From a user perspective both
are the same, but is one an attachment and the other not? I'm being
unfair: not only is this an unrealistic problem, but there isn't a right
or a wrong answer. The Word document isn't an attachment because it
doesn't (or shouldn't) appear in the list of attachments and it's not
in-line because you have to click on something to see it.
So yes, I agree, your use-cases are good; I'm just not sure how you're
going to identify an attachment :-)
I do like the idea, though, of when you do a search for "xyzzy" that you
get the abstract of the bodypart that contains "xyzzy" rather than the
abstract (or subject) of the entire message and I'm going to think about
that one some more. The problem that immediately springs to mind though
is that a message can have an arbitrary number of bodyparts so if I have
BODY-1, BODY-2, ..., BODY-N (where N is unknown) how hard is it for me
to construct the search? I think I probably should construct the search
that way because the score depends upon the size of the document and it
seems to make sense that the document is the bodypart, not the entire
message, but it seems more complex than is useful for mail messages.
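To make the "abstract of the bodypart that contains xyzzy" idea concrete, here is a minimal sketch in plain Java. It assumes the bodypart texts have already been extracted into a list of unknown length, and the names BodypartAbstract and abstractFor are hypothetical. It uses a simple case-sensitive substring match, which is an assumption for illustration only; a real index would match on analyzed terms.

```java
import java.util.List;

final class BodypartAbstract {
    /**
     * Returns the text around the first match in the first bodypart that
     * contains the term (up to `radius` characters either side), or null
     * if no bodypart contains it. Works for any number of bodyparts.
     */
    static String abstractFor(List<String> bodyparts, String term, int radius) {
        for (String text : bodyparts) {
            int at = text.indexOf(term); // case-sensitive on purpose: keeps offsets exact
            if (at >= 0) {
                int from = Math.max(0, at - radius);
                int to = Math.min(text.length(), at + term.length() + radius);
                return text.substring(from, to);
            }
        }
        return null;
    }
}
```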
jch
Re: Best Practice: emails and file-attachments
Posted by lude <lu...@googlemail.com>.
Hi John,
thanks again for the many words and explanations!
> You also mentioned indexing each bodypart ("attachment") separately.
> Why? ....
> To my mind, there is no use case where it makes sense to search a
> particular bodypart
I will give you the use case:
1.) User searches for "abcd"
2.) Lucene matches the search term (at least) two times:
    - One email has the term in the plain message text (mail-1)
    - One email contains five file attachments. One of these files matches
      the search term (mail-2)
3.) The result list would show this:
    1. mail-1 'subject'
       'Abstract of the message-text'
    2. mail-2 'subject'
       Attachment with name 'filename.doc' contains 'Abstract of
       file-content'
Another use case would be an extended search that lets the user select
whether "attached files" should be searched (yes or no).
Greetings
lude
Re: Best Practice: emails and file-attachments
Posted by John Haxby <jc...@scalix.com>.
lude wrote:
> Hi John,
>
> thanks for the detailed answer.
>
> You wrote:
>> If you're indexing a
>> multipart/alternative bodypart then index all the MIME headers, but only
>> index the content of the *first* bodypart.
>
> Does this mean you index just the first file-attachment?
> What do you advise if you have to index multipart bodies (== more than
> one file-attachment)?
> One lucene-document for each part (==file)?
> How do you handle the queries?
MIME has no concept of "attachment", that's something that the user
agent programs have a concept of -- you "attach" a file to a message.
The file might be a picture, a Word document, a compressed tar archive
-- as far as MIME is concerned they're all the same (well, apart from the
content-* headers that describe what's "attached"). The MIME type for
a message with "attachments" is "multipart". There are several
subtypes though. If you're typing a plain text message (whose MIME
type is text/plain, a message like this one) and you attach a jpeg image
to it you'll be sending a message whose type is multipart/mixed; the
first part will have type text/plain and the second image/jpeg. In
Google Mail under "more options" you can "show original" to see the
complete MIME message and you'll see the different parts separated by a
boundary.
OK. Now I'm in a position to answer your question. Often, when you
send an HTML formatted message the content of the message is sent twice:
once as text/plain and once as text/html (or multipart/related if it has
pictures and stuff). The two parts are alternatives: apart from the
formatting (and pictures) there's no difference between the two parts,
and you can read either. The best fidelity of the alternatives (and there
can be more than two) is last, the poorest fidelity first, but the
intent of the sender is that you can read any of them. This is a
multipart/alternative bodypart. Because all parts of the
multipart/alternative have the same text, you can index any of them,
so index the first as that's going to be the easiest to process (it's
almost always going to be text/plain).
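The rule described above (index only the first part of a multipart/alternative, and every part of other multipart types) can be sketched as a small tree walk. This is only an illustration in plain Java: Part is a hypothetical stand-in for a parsed bodypart, not a JavaMail class, and treating every non-alternative multipart the same way is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BodypartSelector {
    /** Hypothetical stand-in for a parsed MIME bodypart. */
    static final class Part {
        final String contentType;
        final List<Part> children;

        Part(String contentType, Part... children) {
            this.contentType = contentType;
            this.children = Arrays.asList(children);
        }
    }

    /** Content types of the leaf parts whose content would be indexed. */
    static List<String> partsToIndex(Part root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Part part, List<String> out) {
        if (!part.contentType.startsWith("multipart/")) {
            out.add(part.contentType); // leaf: index its content
            return;
        }
        if (part.children.isEmpty()) {
            return; // empty multipart: nothing to index
        }
        if (part.contentType.equals("multipart/alternative")) {
            collect(part.children.get(0), out); // same text in every alternative: first only
        } else {
            for (Part child : part.children) { // mixed, related, ...: walk every part
                collect(child, out);
            }
        }
    }
}
```

For a text/plain message with a jpeg attached (multipart/mixed) this selects both parts; for a multipart/alternative it selects only what the first alternative contains.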
I've skipped loads. You need to read the RFCs. Start with RFC2045
(http://www.rfc.net/rfc2045.html) and keep going. If you get stuck with
the details of how messages are constructed, go back and read RFC2822
first, or at least skim it (it's quite long). Note that RFC2045
references RFC822 in its abstract; wherever you see references to
RFC821 and RFC822 you can read them as references to RFC2821 and RFC2822
respectively -- the newer ones are a little more precise when they need
to be and have rather more explanation of awkward cases that you need to
know about.
Someone earlier (and I'm sorry, I deleted the message before realising
I should reply) said something about attached files really being in an
attached .tar.gz file. Well, yes and no. An attached compressed tar
archive is a bodypart like any other and will need to be indexed like
any other. That will involve breaking it open and indexing the files
that it contains. It's not really any different to indexing an
OpenOffice document (which is actually a zip file).
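As a rough illustration of "breaking it open", the sketch below reads every file out of a zip container using only java.util.zip from the JDK. The class and method names (ContainerUnpacker, unpackZip, sampleZip) are hypothetical, and the sample entry name content.xml is just an example. A .tar.gz would need a tar reader, which the JDK does not provide, so that case is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ContainerUnpacker {
    /** Reads every regular file in a zip into memory, keyed by entry name. */
    static Map<String, byte[]> unpackZip(byte[] zipBytes) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buf)) != -1) {
                    content.write(buf, 0, n);
                }
                files.put(entry.getName(), content.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    /** Builds a tiny zip in memory so the sketch runs without a real attachment. */
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write("<text>xyzzy</text>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Each unpacked file would then go through the same text extraction as any other bodypart before being added to the index.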
You also mentioned indexing each bodypart ("attachment") separately.
Why? When I'm searching, am I going to look for the word "xyzzy" in
the first bodypart? What if it was a multipart/alternative and
Thunderbird (in my case) suppressed the first bodypart and "xyzzy" is
something that couldn't be rendered in the (first) text/plain
alternative? To my mind, there is no use case where it makes sense to
search a particular bodypart. There *might* be a case for searching the
"prime" bodypart and "attachments" but when you read the MIME spec
you'll realise that detecting what the user sees as an attachment is not
easy: it gets even harder when you discover that different mail user
agents have different and legal (and sometimes reasonable) ways of
deciding whether to treat something as in-line or as an attachment. To
be honest, people don't remember whether something was an attachment.
They think "I remember reading about xyzzy in a mail message" and go off
looking for that. They often can't tell and remember even less that
the "xyzzy" was in something that you decided was an attachment. And
if your rules for deciding whether you have something that's intended to
be viewed as an attachment or in-line are different to the rules that
the user's mail reader is using then you'll have Awkward Bugs to
explain. You'll read about "Content-Disposition" in the RFCs, but
don't believe that it's a foolproof way of deciding whether or not
something is an attachment: lack of a content-disposition header doesn't
mean "inline" or "attachment", and Microsoft, bless, have weird rules all
of their own for deciding whether to display something in-line or not.
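If you do look at Content-Disposition anyway, a cautious sketch is to keep "header absent" as a distinct third answer rather than forcing it into one of the other two. The names DispositionGuess, Kind and guess below are hypothetical, and the result is only a hint: as explained above, a mail reader may decide differently.

```java
import java.util.Locale;

final class DispositionGuess {
    enum Kind { ATTACHMENT, INLINE, UNKNOWN }

    /**
     * Guesses from a raw Content-Disposition header value.
     * Pass null when the bodypart has no such header.
     */
    static Kind guess(String contentDisposition) {
        if (contentDisposition == null) {
            return Kind.UNKNOWN; // no header: neither "inline" nor "attachment"
        }
        // The disposition type is the token before the first ';' (parameters follow it).
        String type = contentDisposition.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (type.equals("attachment")) {
            return Kind.ATTACHMENT;
        }
        if (type.equals("inline")) {
            return Kind.INLINE;
        }
        return Kind.UNKNOWN;
    }
}
```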
jch
Re: Best Practice: emails and file-attachments
Posted by lude <lu...@googlemail.com>.
Hi John,
thanks for the detailed answer.
You wrote:
> If you're indexing a
> multipart/alternative bodypart then index all the MIME headers, but only
> index the content of the *first* bodypart.
Does this mean you index just the first file-attachment?
What do you advise if you have to index multipart bodies (== more than one
file-attachment)?
One lucene-document for each part (==file)?
How do you handle the queries?
Greetings
lude
Re: Best Practice: emails and file-attachments
Posted by John Haxby <jc...@scalix.com>.
lude wrote:
> does anybody have an idea what the best design approach is for realizing
> the following:
>
> The goal is to index emails and their corresponding file attachments.
> One email could contain for example:
I put a fair amount of thought into this when I was doing the design for
our mail server -- I know about mail :-) After a little trial and
error I came up with the following scheme:
1. All header fields indexed under their own name with the name
converted to lower case.
2. Almost all bodyparts indexed in a single field called BODY (in
upper case)
3. Meta-data such as SIZE, DELIVERY-DATE and similar indexed with
uppercase fields
4. Extensions for other bodypart-specific or application-specific
fields indexed as something with an initial uppercase letter and
at least one lowercase letter
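The four naming rules above can be sketched as a few helper functions. This is an illustration only: the class FieldNames and its method names are hypothetical, and requiring at least two characters for an extension name is an assumption made so that the "at least one lowercase letter" rule can hold.

```java
import java.util.Locale;

final class FieldNames {
    /** Rule 2: bodypart content goes into one field, in upper case. */
    static final String BODY = "BODY";

    /** Rule 1: header fields keep their own name, lower-cased. */
    static String header(String headerName) {
        return headerName.toLowerCase(Locale.ROOT);
    }

    /** Rule 3: meta-data fields are all upper case. */
    static String meta(String name) {
        return name.toUpperCase(Locale.ROOT);
    }

    /** Rule 4: extensions start with an upper-case letter, the rest lower case. */
    static String extension(String name) {
        if (name.length() < 2) {
            throw new IllegalArgumentException("need at least two characters: " + name);
        }
        return name.substring(0, 1).toUpperCase(Locale.ROOT)
                + name.substring(1).toLowerCase(Locale.ROOT);
    }
}
```

The point of the case convention is that the three namespaces cannot collide: a header called "size" would be indexed as size, while the meta-data field is SIZE.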
That gives an extensible set of fields and does not require that the index
knows ahead of time what header fields will be present or relevant. It
means that there are potentially a lot of fields: we're running at about
60 depending on the user.
Some header fields are special. The various message-id fields
(Message-Id, Resent-Message-Id, In-Reply-To and References) need to have
their message-ids carefully extracted and then indexed untokenized.
Recipient fields (to, cc, from, etc) need to be parsed and then have their
addresses re-assembled as a friendly-name and an RFC822 address -- the
reason for the re-assembly is that addresses can be presented in
equivalent but odd fashions. Most header fields can have RFC2047
encoded text which needs to be decoded.
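As a rough sketch of the message-id extraction, a regular expression can pull each id out of a header value that may contain several ids or surrounding free text. The class name MessageIds and the pattern are assumptions for illustration; the pattern is deliberately simplistic and does not cover every form that RFC2822 allows (quoted local parts, for example). Dropping the angle brackets is also just a choice made here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MessageIds {
    // "<left@right>" with no whitespace or angle brackets inside.
    private static final Pattern MSG_ID = Pattern.compile("<([^<>\\s]+@[^<>\\s]+)>");

    /** Returns every message-id found in the header value, without angle brackets. */
    static List<String> extract(String headerValue) {
        List<String> ids = new ArrayList<>();
        Matcher m = MSG_ID.matcher(headerValue);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }
}
```

Each extracted id would then be indexed as a single untokenized term, so that a References header with several ids yields several exact-match values.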
When indexing the bodyparts you need to be a little careful. In
general, the MIME headers for each part are all indexed as other message
headers (content-id is a message-id field) and I also indexed the
canonical content type under a CONTENT-TYPE field, again to get rid of
fluff so that I can search for, say,
CONTENT-TYPE:application/x-vnd-powerpoint to find all those annoyingly
huge messages :-) An attached message probably doesn't want all its
headers indexed: subject is good; recipients are probably bad as it'll
confuse the normal search and give unexpected results; message-id fields
are almost certainly a bad idea. If you're indexing a
multipart/alternative bodypart then index all the MIME headers, but only
index the content of the *first* bodypart.
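"Getting rid of fluff" for the CONTENT-TYPE field could look something like the sketch below: drop the parameters and normalise case. The class name ContentTypes and the method canonical are hypothetical, and this simple version assumes the header value contains no comments.

```java
import java.util.Locale;

final class ContentTypes {
    /** Strips parameters and whitespace and lower-cases the type/subtype. */
    static String canonical(String headerValue) {
        String type = headerValue.split(";", 2)[0]; // drop "; charset=...", "; name=..." etc.
        return type.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }
}
```

With this, both "Text/Plain; charset=us-ascii" and "text/plain" end up as the same term, so a search on CONTENT-TYPE finds them together.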
Does that all make sense? JavaMail is great for this, it's good at
parsing and extracting the content of messages. However, it's not
enough to just read what I've said and the JavaMail doc. You need to be
intimately familiar with the MIME RFCs (I think the first one is
RFC2045, but they're not difficult to find as they're all around RFC2047)
and RFC2822, the message structure RFC itself. If you just guess
because the structure is "obvious" you'll come unstuck.
jch
Re: Best Practice: emails and file-attachments
Posted by lude <lu...@googlemail.com>.
Hi Dejan,
how do you query for email and(!) attachment documents
if you just want to present one hit per email (even if the search term
matches in the email and(!) in the corresponding attachment document)?
Thanks
lude
RE: Best Practice: emails and file-attachments
Posted by Dejan Nenov <de...@jollyobject.com>.
The approach I find best is to create both Email documents, which contain a
list of (and links to) all attachments, and individual Attachment
documents.
It gets a little tricky when you have a forwarded email, containing an
original Email that contains a tar.gz attachment, which contains the
"actual" attached files :)
(Shameless promotion follows) If you are a Windows user, for a _very_ good
example get a copy of X1 Desktop (free - also distributed as Yahoo! Desktop
search) - then right-click on the column headers and look at the available
fields for email.
Dejan