Posted to solr-user@lucene.apache.org by Lukáš Vlček <lu...@gmail.com> on 2010/03/08 15:55:56 UTC
Extracting content from mailman managed mail list archive
Hi,
is anybody willing to share their experience of extracting content from
mailing list archives so that it can be indexed by Lucene or Solr?
Imagine that we have access to the archive of some mailing list (e.g.
http://www.mail-archive.com/mailman-users%40python.org/) and we would like
to index individual emails. Is there an easy way to extract just the
text that the sender of each email actually wrote? I am interested in the
content written by a particular sender, omitting the original quoted text. We
can either access individual emails via the web or download the monthly
archive in plain text format (but the content of individual emails depends
on the author's email client, e.g. plain text, HTML, HTML mixed with
plain text in <table> ... etc., so it is very messy).
I would prefer information about mailing lists managed by Mailman, but I
don't want to limit the scope of this question, so any general ideas are
welcome.
Regards,
Lukas
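For the "download monthly archive in plain text format" route, a minimal Python sketch might look like the one below. It assumes the monthly file is in mbox format (messages separated by "From " lines), which should be checked against the actual archive. The function names and the dictionary keys (id, from, subject, body) are hypothetical and chosen only for illustration. It takes the first text/plain part of each message and ignores HTML parts entirely.

```python
import mailbox
from email.message import Message


def plain_text_body(msg: Message) -> str:
    """Return the first text/plain part of a message, or '' if there is none."""
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # bytes, or None for containers
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label in the message; fall back to UTF-8.
            return payload.decode("utf-8", errors="replace")
    return ""


def iter_archive(path: str):
    """Yield one dict per message in an mbox file (keys are hypothetical)."""
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            yield {
                "id": str(msg.get("Message-ID", "")),
                "from": str(msg.get("From", "")),
                "subject": str(msg.get("Subject", "")),
                "body": plain_text_body(msg),
            }
    finally:
        box.close()
```

Each yielded dict could then be turned into one Lucene or Solr document. Messages that carry only an HTML part would come out with an empty body here and would need a separate HTML-to-text step.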
Re: Extracting content from mailman managed mail list archive
Posted by Chris Hostetter <ho...@fucit.org>.
: I just checked popular search services and it seems that neither
: lucidimagination search nor search-lucene support this:
it really depends on what you want to do ... most people i know who index
email want to include quoted portions in the message because it's part of
the context of the message ... if you are looking for emails from "Jim"
sent on day "X" about "Foo" then shouldn't a message match even if
Jim never wrote the word "Foo" himself but it appeared in the quoted text
he was commenting on several times?
your opinion may vary, but that's the reasoning i've heard behind why
people index quoted portions even when they've identified them for the
purposes of display which is what seems to be happening in the lucid
example you cited...
: http://www.lucidimagination.com/search/document/954e8589ebbc4b16/terminating_slashes_in_url_normalization
...Jukka sent that message in plain text, but the lucid system detected
the quoted portion and converted it to an html blockquote tag.
-Hoss
Re: Extracting content from mailman managed mail list archive
Posted by Lukáš Vlček <lu...@gmail.com>.
I just checked popular search services and it seems that neither
lucidimagination search nor search-lucene support this:
http://www.lucidimagination.com/search/document/954e8589ebbc4b16/terminating_slashes_in_url_normalization
http://www.search-lucene.com/m?id=510143ac0608042241k49f4afe7wcd25df3fbacc7729@mail.gmail.com||mailman
Markmail does not support this either:
http://markmail.org/message/papbjx3aoz3uvbhh
Hmmm....
I think it would be useful to extract just the *NEW* content without all
the quotes, because the quoted text influences Lucene scoring.
Regards,
Lukas
On Mon, Mar 8, 2010 at 3:55 PM, Lukáš Vlček <lu...@gmail.com> wrote:
> Hi,
>
> is anybody willing to share experience about how to extract content from
> mailing list archives in order to have it indexed by Lucene or Solr?
>
> Imagine that we have access to archive of some mailling list (e.g.
> http://www.mail-archive.com/mailman-users%40python.org/) and we would like
> to index individual emails. Is there any easy way how to extract just the
> text content produced by sender individual emails? I am interested in
> content generated by particular sender omitting the original quoted text. We
> can either access individual emails via web or we can download monthly
> archive in plain text format (but the content of individual emails depends
> on the email client of the author, i.e. plain text, html, html mixed with
> plain text in <table> ... etc... it is very messy).
>
> I would prefer information about mailing lists managed by mailman but I
> don't want to limit the scope of this question so any general ideas are
> welcome.
>
> Regards,
> Lukas
>
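The two positions in this thread (index only the new text, or keep the quoted text because it is context) can both be served by splitting each body into a "new" part and a "quoted" part and indexing them separately. The sketch below is a simple line-based heuristic in Python, not what any of the services mentioned above actually do. It assumes quotes are marked with a leading ">" (or ": " as in Hoss's reply) and that the attribution line ("On ... wrote:") fits on a single line. The function name split_quoted is hypothetical.

```python
import re

# Leading ">" markers (possibly nested, e.g. "> > text") or a ": " prefix.
# Assumption: any line starting with ": " is a quote, which can misfire.
QUOTE_PREFIX = re.compile(r"^(?:\s*>)+ ?|^: ")

# Single-line attribution such as "On Mon, Mar 8, 2010 ... wrote:".
# Assumption: attributions wrapped over two lines are not detected.
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def split_quoted(body: str) -> tuple[str, str]:
    """Split an email body into (new_text, quoted_text)."""
    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        if QUOTE_PREFIX.match(line):
            # Drop the quote marker and keep the quoted words.
            quoted_lines.append(QUOTE_PREFIX.sub("", line, count=1))
        elif ATTRIBUTION.match(line):
            continue  # belongs to neither part
        else:
            new_lines.append(line)
    return "\n".join(new_lines).strip(), "\n".join(quoted_lines).strip()
```

The two strings could go into two separate index fields, so the quoted field can be weighted lower or left out of scoring while still being searchable. HTML mail and clients that quote without any prefix are not handled by this heuristic.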