Posted to solr-user@lucene.apache.org by Lukáš Vlček <lu...@gmail.com> on 2010/03/08 15:55:56 UTC

Extracting content from mailman managed mail list archive

Hi,

is anybody willing to share experience about how to extract content from
mailing list archives in order to have it indexed by Lucene or Solr?

Imagine that we have access to the archive of some mailing list (e.g.
http://www.mail-archive.com/mailman-users%40python.org/) and we would like
to index individual emails. Is there an easy way to extract just the
text written by the sender of each email? I am interested in the
content generated by a particular sender, omitting the original quoted text. We
can either access individual emails via the web or download the monthly
archive in plain text format (but the content of individual emails depends
on the email client of the author, e.g. plain text, HTML, HTML mixed with
plain text in <table> ... etc. It is very messy).

I would prefer information about mailing lists managed by Mailman, but I
don't want to limit the scope of this question, so any general ideas are
welcome.

Regards,
Lukas
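
For illustration, the extraction described in this question could be sketched in Python using only the standard library. This is a rough sketch under stated assumptions, not a tested solution for real archives: it assumes each email is available as a raw RFC 822 message string, that quoted lines start with ">", and that attribution lines look like "On ... wrote:". The function names (html_to_text, message_text, drop_quoted) are made up for this example.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores tags (a crude HTML-to-text step)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Strip tags from an HTML fragment; script/style content is NOT filtered."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def message_text(raw_message):
    """Return a best-effort text body for one raw email message."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    # Prefer a text/plain part; fall back to text/html if that is all there is.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    text = part.get_content()
    if part.get_content_type() == "text/html":
        text = html_to_text(text)
    return text


def drop_quoted(text):
    """Keep only lines that do not look like quoted text or attributions.

    Assumption: quotes start with '>' and attributions match 'On ... wrote:'.
    Many mail clients quote differently, so this will miss some cases.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            continue
        if stripped.startswith("On ") and stripped.endswith("wrote:"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

If the monthly archive is a valid mbox file, the standard mailbox module (mailbox.mbox) can iterate over its messages, and each one can be passed through the same two steps. Attribution lines that wrap over two lines, top-posted replies with "-----Original Message-----" separators, and HTML blockquotes would each need extra handling.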

Re: Extracting content from mailman managed mail list archive

Posted by Chris Hostetter <ho...@fucit.org>.

: I just checked popular search services and it seems that neither
: lucidimagination search nor search-lucene support this:

it really depends on what you want to do ... most people i know who index
email want to include quoted portions in the message because it's part of
the context of the message ... if you are looking for emails from "Jim"
sent on day "X" about "Foo" then shouldn't a message match even if
Jim never wrote the word "Foo" himself but it appeared several times in
the quoted text he was commenting on?

your opinion may vary, but that's the reasoning i've heard behind why 
people index quoted portions even when they've identified them for the 
purposes of display which is what seems to be happening in the lucid 
example you cited...

: http://www.lucidimagination.com/search/document/954e8589ebbc4b16/terminating_slashes_in_url_normalization

...Jukka sent that message in plain text, but the lucid system detected 
the quoted portion and converted it to an html blockquote tag.  
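
For illustration only, the kind of conversion described here (plain text quotes turned into an html blockquote for display) could look roughly like the Python sketch below. It is not how the lucid system actually works, which is not known from this thread; it assumes quoted lines start with ">" and handles only one level of quoting. The names is_quoted and to_html are made up for this example.

```python
from html import escape
from itertools import groupby


def is_quoted(line):
    """Assumption: a quoted line is one whose first non-space char is '>'."""
    return line.lstrip().startswith(">")


def to_html(text):
    """Render plain text as HTML, wrapping runs of quoted lines in blockquote."""
    parts = []
    # groupby yields consecutive runs of lines sharing the same quoted flag.
    for quoted, lines in groupby(text.splitlines(), key=is_quoted):
        if quoted:
            # Remove one leading '>' marker and surrounding whitespace per line.
            body = "\n".join(line.lstrip()[1:].strip() for line in lines)
            parts.append("<blockquote>" + escape(body) + "</blockquote>")
        else:
            body = "\n".join(lines).strip()
            if body:
                parts.append("<p>" + escape(body) + "</p>")
    return "\n".join(parts)
```

Because the quoted text is only marked up and not thrown away, the full message can still be indexed while the display layer distinguishes the quote from the reply.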


-Hoss

Re: Extracting content from mailman managed mail list archive

Posted by Lukáš Vlček <lu...@gmail.com>.

I just checked popular search services and it seems that neither
lucidimagination search nor search-lucene support this:
http://www.lucidimagination.com/search/document/954e8589ebbc4b16/terminating_slashes_in_url_normalization
http://www.search-lucene.com/m?id=510143ac0608042241k49f4afe7wcd25df3fbacc7729@mail.gmail.com||mailman

Markmail does not support this either:
http://markmail.org/message/papbjx3aoz3uvbhh

Hmmm....
I think it would be useful to extract just the *NEW* content without all
quotes because this influences Lucene scoring.
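
One possible way to keep quoted text searchable while limiting its effect on scoring is to put new and quoted text into separate fields. The Python sketch below only shows the splitting step; the field names body_new and body_quoted are hypothetical, and it assumes quoted lines start with ">".

```python
def split_fields(text):
    """Split a message body into new text and quoted text.

    Returns a dict that could serve as the basis of an index document.
    The field names are illustrative, not part of any existing schema.
    """
    new_lines, quoted_lines = [], []
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            # Drop all leading '>' markers and spaces (handles '>>' and '> >').
            quoted_lines.append(stripped.lstrip("> ").rstrip())
        else:
            new_lines.append(line)
    return {
        "body_new": "\n".join(new_lines).strip(),
        "body_quoted": "\n".join(quoted_lines).strip(),
    }
```

With two fields, a query could weight them differently, for example through per-field boosts in the qf parameter of Solr's DisMax query parser (something like qf=body_new^2 body_quoted), so that a term repeated in quotes counts for less than the same term in the sender's own text.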

Regards,
Lukas

On Mon, Mar 8, 2010 at 3:55 PM, Lukáš Vlček <lu...@gmail.com> wrote:

> Hi,
>
> is anybody willing to share experience about how to extract content from
> mailing list archives in order to have it indexed by Lucene or Solr?
>
> Imagine that we have access to archive of some mailling list (e.g.
> http://www.mail-archive.com/mailman-users%40python.org/) and we would like
> to index individual emails. Is there any easy way how to extract just the
> text content produced by sender individual emails? I am interested in
> content generated by particular sender omitting the original quoted text. We
> can either access individual emails via web or we can download monthly
> archive in plain text format (but the content of individual emails depends
> on the email client of the author, i.e. plain text, html, html mixed with
> plain text in <table> ... etc... it is very messy).
>
> I would prefer information about mailing lists managed by mailman but I
> don't want to limit the scope of this question so any general ideas are
> welcome.
>
> Regards,
> Lukas
>