Posted to dev@apr.apache.org by Justin Erenkrantz <je...@ebuilt.com> on 2001/04/22 05:09:09 UTC

Web archival of mailing lists

Hi all,

I believe that Roy has mentioned to some of you that I've been
working on a module that will process mbox archives and display
them in a nice format on the web with some other cool features.

Well, I think that we are at a stage where we would like some feedback
from the Apache community.  It has progressed far enough that I think it
is stable and feature-complete.  Everyone I have shown it to so far has
given positive feedback.  Now, for the real critics...

You may see mod_mbox in action at:

http://www.apachelabs.org/

I currently have the entire new-httpd and apr-dev archives on there.
Note that this month's archive of both these lists is from a few days 
ago.

I also have ht://Dig running which should allow searching of the
archives.  Please feel free to hammer the box.  I'm not exactly
sure how efficient ht://Dig is, but it seems to work reasonably
well (the search databases are too large for my taste though).

The current snapshot of the mod_mbox code is on the website.  mod_mbox
is an Apache-2.0 module.  The indexing programs use only APR.  Note 
that I do not currently have access to Win32 platforms - it may not 
compile on there, but I doubt that there is anything too platform 
specific - it is all based on APR.  I have tested this on Linux, 
FreeBSD, and Solaris.

You take your mbox file and generate the index (see the provided
generate_index.c file).  This creates all of the DBMs necessary for
mod_mbox.  Simply add "AddHandler mbox-file .mbox" to your httpd.conf
(or use another mechanism that achieves the same goal of setting the
handler to either mbox-file or mbox-handler) and you are up with mod_mbox.
Due to the current build system, it is not particularly
straightforward to build an external module with dependent objects.
I have tried to include enough "hints" in the tarball to provide
guidelines as to building mod_mbox from the source.  I don't intend
for what is on apachelabs.org to be a "release," but rather a 
"snapshot."

mod_mbox has the advantage over MHonArc in that it will only index
the mbox file when you explicitly tell it to (use the generate_index
program) rather than when a new message is delivered.  Here at eBuilt,
we've had to alter our internal mailing-list archival strategy to
compensate for the fact that MHonArc cannot handle large lists
well.  Ideally, mod_mbox scales better.  generate_index on a 750MB
mbox file takes about two or three minutes (Sun U5/360).  The only
storage explicitly required for mod_mbox is the DBMs.  And, with
such a high-traffic list, you can run the index a few times a day
rather than when each new message is delivered.

I do believe that Roy intends to check mod_mbox into the httpd-2.0
and apr-util trees so that it becomes part of the standard Apache
distribution.  Since I don't have commit access, please don't discuss 
the merits of mod_mbox's inclusion with me (I'm biased anyway).  =-)  
I do think a lot of sites would find this incredibly useful - in my 
opinion, apache.org is number one on this list.

Note that we intend to convert parts of the display logic to filters, 
but that really shouldn't affect the majority of the mbox code and what 
it displays (just how).  I think this is a good time to gauge feedback of 
what we have so far.

Now, to provide an overview of the mod_mbox module (functionally and
architecturally):

There are two real components to mod_mbox.  The first is mod_mbox.c
which is the actual Apache module.  Currently, there is not much to
this file - it is basically a wrapper around the other files.  This
file handles the displaying of the actual message.  mod_mbox is
intended to be a handler ("mbox-file" and "mbox-handler") and
produces a "virtual namespace" that the user can browse.

There are two main URIs of interest for each mbox:

http://foo.example.com/your.mbox/index.html
http://foo.example.com/your.mbox/threads.html

The default index is sorted by date, and the threading index is
sorted by date as well.  (I'll explain how the threading works 
later.)  The indexes provide links based on the message-id into the 
mbox file of the format:

http://foo.example.com/your.mbox/message-id
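
As a rough illustration of that virtual namespace, here is a minimal
Python sketch (not the module's C code) of how the part of the URI after
the mbox file name could be mapped to a view.  The function name
resolve_view, the view labels, and the choice to treat an empty path as
the date index are assumptions made for the example.

```python
from urllib.parse import unquote


def resolve_view(path_info):
    """Map the URI part after your.mbox to a (view, message_id) pair."""
    name = unquote(path_info.lstrip("/"))
    if name in ("", "index.html"):
        # Assumption: a bare request falls back to the date-sorted index.
        return ("date_index", None)
    if name == "threads.html":
        return ("thread_index", None)
    # Anything else is treated as a message-id lookup.
    return ("message", name)
```

For example, resolve_view("/threads.html") selects the thread index,
while any other path component is handed on as a message-id.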

All of the other files constitute the core of the mbox functionality
(parsing, threading, sorting, etc.).  My intention is that these could
be placed within apr-util.  mod_mbox uses DBMs to "cache" all of the 
relevant information about the mbox (date, subject, from, references, 
offset within the original file, etc.).  This makes the display of the 
index and retrieval of a message fairly efficient while retaining the 
original archive.
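
To make the caching idea concrete, here is a minimal Python sketch of
such an index.  It is not generate_index.c: it uses Python's dbm.dumb in
place of SDBM, stores each entry as JSON, and the names split_mbox,
build_index, and read_message are invented for the example.  It also
assumes the whole mbox fits in memory, which a real indexer for a 750MB
file would want to avoid.

```python
import dbm.dumb
import json
from email.parser import BytesHeaderParser


def split_mbox(data):
    """Return (offset, raw_bytes) for each message in mbox-format bytes."""
    starts = []
    pos = 0
    prev_blank = True  # the start of the file counts as a message boundary
    for line in data.splitlines(keepends=True):
        # A "From " line at the start or after a blank line opens a message.
        if prev_blank and line.startswith(b"From "):
            starts.append(pos)
        prev_blank = line in (b"\n", b"\r\n")
        pos += len(line)
    ends = starts[1:] + [len(data)]
    return [(start, data[start:end]) for start, end in zip(starts, ends)]


def build_index(data, db_path):
    """Store one JSON record per Message-ID in a DBM file at db_path."""
    parser = BytesHeaderParser()
    with dbm.dumb.open(db_path, "n") as db:
        for offset, raw in split_mbox(data):
            # Drop the "From " separator line, then parse only the headers.
            headers = parser.parsebytes(raw.partition(b"\n")[2])
            msgid = str(headers.get("Message-ID", "")).strip()
            if not msgid:
                continue  # nothing to key the record on
            db[msgid] = json.dumps({
                "offset": offset,
                "length": len(raw),
                "date": str(headers.get("Date", "")),
                "from": str(headers.get("From", "")),
                "subject": str(headers.get("Subject", "")),
                "references": str(headers.get("References", "")).split(),
            })


def read_message(data, db_path, msgid):
    """Use the cached offset to slice one message out of the mbox bytes."""
    with dbm.dumb.open(db_path, "r") as db:
        entry = json.loads(db[msgid])
    return data[entry["offset"]:entry["offset"] + entry["length"]]
```

The point of the design is that displaying a message needs one DBM
lookup and one slice of the original file; the mbox itself is never
rewritten.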

Note that I have only tried it with the SDBM included in apr-util - I 
imagine that it'd work with Sleepycat DB and GDBM (apr-dbm has hooks 
for these, but part of this project was to test out the 
httpd/apr/apr-util code).  

The other key functionality is the threading algorithm.  I based
my threading implementation off of Jamie Zawinski's mail threading
algorithms (he wrote the original versions of Netscape Mail - see 
http://www.jwz.org/doc/threading.html).  His key point was not to 
store the threading tree in the database, but generate the tree on 
the fly.  It has proved to be very efficient and highly accurate.  

Note that I did not use any of his code - I only used his description 
of the algorithm.  This portion of the code is quite complex (although 
I wrote it in a span of 24 hours).  I have managed to test it with 
threads I know (with our internal mailing lists) and it seems reasonably 
accurate.  Subtle bugs may still exist.  If you find one, any help
tracking it down would be greatly appreciated.
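
For readers who have not seen the approach, here is a deliberately
simplified Python sketch of building the tree on the fly from References
headers.  It is neither the mod_mbox code nor the complete algorithm
from the jwz page: subject-based grouping and pruning of empty
containers are left out, and the names build_threads, is_ancestor, and
walk are made up for the example.

```python
def is_ancestor(parent, candidate, node):
    """True if candidate is node itself or one of node's ancestors."""
    while node is not None:
        if node == candidate:
            return True
        node = parent.get(node)
    return False


def build_threads(messages):
    """Link messages into trees using only their References headers.

    messages: iterable of dicts with 'id', 'date' and 'references'
    (a list of message-ids, oldest first).
    Returns (roots, children), where children maps an id to its replies.
    """
    messages = sorted(messages, key=lambda m: m["date"])
    known = {m["id"]: m for m in messages}  # id -> message, in date order
    parent = {}                             # child id -> parent id
    for msg in messages:
        refs = list(msg["references"])
        for ref in refs:
            known.setdefault(ref, None)     # placeholder for a missing message
        # Link each reference to the one before it, unless already linked.
        for older, newer in zip(refs, refs[1:]):
            if newer not in parent and not is_ancestor(parent, newer, older):
                parent[newer] = older
        # The last reference is the message's own parent; never form a loop.
        if refs and not is_ancestor(parent, msg["id"], refs[-1]):
            parent[msg["id"]] = refs[-1]
    order = {mid: i for i, mid in enumerate(known)}
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    for replies in children.values():
        replies.sort(key=order.get)
    roots = [mid for mid in known if mid not in parent]
    return roots, children


def walk(roots, children, depth=0):
    """Yield (depth, message_id) pairs in display order."""
    for mid in roots:
        yield depth, mid
        yield from walk(children.get(mid, []), children, depth + 1)
```

Nothing here is stored: the parent links are rebuilt from the cached
References each time the thread view is requested.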

For the rest of the implementation details, please see the source code.  
Open source is nice that way.

I look forward to hearing any comments or suggestions y'all might have.

Thanks in advance,
Justin Erenkrantz
jerenkrantz@ebuilt.com


Re: Web archival of mailing lists

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Wed, Apr 25, 2001 at 03:17:55PM -0700, Ask Bjoern Hansen wrote:
> [...] 
> > well, if mod_mbox actually generated an xml format of some kind,
> 
> What he said. The core of mod_mbox should do that
>
> With Apache 2.0 filters it will then be fairly easy to make it look
> like whatever we want and add whatever features. :)

A filter to convert from the mod_mbox core data structures into XML should 
be fairly straightforward.  But, I don't have any plans to write one.  
I have yet to be satisfied with the scalability of the XSLTs.
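
To show why such a conversion would be straightforward, here is a small
Python sketch that serializes one index entry to XML.  No such filter
exists in mod_mbox; the element names (message, from, subject, date,
references, ref) and the entry layout are invented for the example.

```python
import xml.etree.ElementTree as ET


def entry_to_xml(entry):
    """Serialize one index entry (a dict) to an XML string."""
    msg = ET.Element("message", {"id": entry["id"]})
    for field in ("from", "subject", "date"):
        ET.SubElement(msg, field).text = entry.get(field, "")
    refs = ET.SubElement(msg, "references")
    for ref in entry.get("references", []):
        ET.SubElement(refs, "ref").text = ref
    # ElementTree escapes <, > and & in text and attributes for us.
    return ET.tostring(msg, encoding="unicode")
```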

> In any case, it looks great. I'm definitely planning to use it for
> the perl.org archives. (They use Mhonarc right now and it sucks).

Please let me know if you use it.  You don't have to email me, but I'd 
appreciate any feedback.  =)  I wouldn't dump MHonArc quite yet though.

Roy and I combed through the SDBM code and found some locking problems
which caused the scalability to suffer tremendously - only one request
can open a particular SDBM at a time.  I am not sure what the exact
problem is - our current guess is fcntl-based implementations.
(Hey, Roy just posted about it...)

Expect a patch to APR in the next few days that resolves this.  There
are probably more scalability problems hiding in APR...

(I have a midterm tomorrow at 8AM - I need to go study...) -- justin


Re: Web archival of mailing lists

Posted by "David N. Welton" <da...@apache.org>.
Jim Winstead <ji...@trainedmonkey.com> writes:

> but just to throw out a suggestion on the presentation side -- i
> have been playing with a web interface on top of the nntp server for
> the php mailing lists at news.php.net, and one presentation thing
> that i think looks really slick is to apply a different style to
> quotes and signatures. for example:

>   http://news.php.net/article.php?group=php.general&article=107358

FWIW, the gnus mailreader does something like that...

Looks nice, though.

-- 
David N. Welton
Free Software: http://people.debian.org/~davidw/
   Apache Tcl: http://tcl.apache.org/
     Personal: http://www.efn.org/~davidw/
         Work: http://www.innominate.com/

Re: Web archival of mailing lists

Posted by Ask Bjoern Hansen <as...@valueclick.com>.
On Tue, 24 Apr 2001, Jim Winstead wrote:

[...] 
> well, if mod_mbox actually generated an xml format of some kind,

What he said. The core of mod_mbox should do that

With Apache 2.0 filters it will then be fairly easy to make it look
like whatever we want and add whatever features. :)

In any case, it looks great. I'm definitely planning to use it for
the perl.org archives. (They use Mhonarc right now and it sucks).

 - ask

-- 
ask bjoern hansen, http://ask.netcetera.dk/   !try; do();
more than 100M impressions per day, http://valueclick.com


Re: Web archival of mailing lists

Posted by Sander van Zoest <sa...@covalent.net>.
On Tue, 24 Apr 2001, Jim Winstead wrote:

> On Tue, Apr 24, 2001 at 11:42:14AM -0700, Justin Erenkrantz wrote:
> > Yup.  That is the plan.  Somehow use mod_include in mod_mbox.  Define a 
> > template file and use that each time.  That's the goal.  Not sure whether 
> > we duplicate mod_include or add mod_include to the filter chain (we may 
> > want to define our own include syntax - this could define how you want 
> > the message or index displayed, headers, footers, etc.).  Roy and I have 
> > also kicked around the idea of using mod_php as well.  Not sure yet.
> well, if mod_mbox actually generated an xml format of some kind,
> you could insert an xslt filter that transformed that xml into html.
> of course, no such xslt filter exists yet, as far as i know.
  
This is one of the attempts we made at the mailing list archive that
was tested by Covalent Technologies (archive currently down).

It had an XML DTD that was used to transform mbox files into XML and
do XSLT translations on the fly using AxKit/Apache/modperl. You might
want to look at the mod_xslt project or xalan c++ of course.  

There was a BOF in Santa Clara for XSLT solutions with Apache, but 
sadly I missed it.

You could look at the IETF I-D <draft-klyne-message-rfc822-xml-01.txt>
which came out after our project, but surprisingly resembles roughly
the XML format we used.

Of course parsing XML isn't faster than parsing mbox files, but at
least you can pre-parse MIME attachments and have the ability to apply
any stylesheet you want. One of the hardest areas you run into is
I18N and charset issues, because most MUAs aren't really good about those.
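
As a small illustration of the charset side, here is a Python sketch
that decodes MIME encoded-words in a header value using the standard
email.header module.  The helper name decode_mime_header and the
fallback of returning the raw value on failure are assumptions for the
example, not something our project did.

```python
from email.errors import HeaderParseError
from email.header import decode_header, make_header


def decode_mime_header(raw_value):
    """Decode encoded-words such as =?iso-8859-1?q?...?= to text."""
    try:
        return str(make_header(decode_header(raw_value)))
    except (HeaderParseError, LookupError, UnicodeDecodeError):
        # Unknown or mislabeled charset: fall back to the raw header text.
        return raw_value
```

For example, decode_mime_header("=?iso-8859-1?q?Caf=E9_menu?=") yields
"Café menu".  Headers that mislabel their charset are the hard part, and
this sketch only covers them by falling back to the undecoded value.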

Cheers,

--
Sander van Zoest                                         [sander@covalent.net]
Covalent Technologies, Inc.                           http://www.covalent.net/
(415) 536-5218                                     http://Sander.vanZoest.com/


Re: Web archival of mailing lists

Posted by Jim Winstead <ji...@trainedmonkey.com>.
On Tue, Apr 24, 2001 at 11:42:14AM -0700, Justin Erenkrantz wrote:
> Yup.  That is the plan.  Somehow use mod_include in mod_mbox.  Define a 
> template file and use that each time.  That's the goal.  Not sure whether 
> we duplicate mod_include or add mod_include to the filter chain (we may 
> want to define our own include syntax - this could define how you want 
> the message or index displayed, headers, footers, etc.).  Roy and I have 
> also kicked around the idea of using mod_php as well.  Not sure yet.

well, if mod_mbox actually generated an xml format of some kind,
you could insert an xslt filter that transformed that xml into html.
of course, no such xslt filter exists yet, as far as i know.

but just to throw out a suggestion on the presentation side --
i have been playing with a web interface on top of the nntp server
for the php mailing lists at news.php.net, and one presentation
thing that i think looks really slick is to apply a different style
to quotes and signatures. for example:

  http://news.php.net/article.php?group=php.general&article=107358

and turning links into real links is useful, too.
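
a minimal python sketch of that presentation idea follows. it is not the
news.php.net code; the class names (quote, signature, body), the
"--" signature rule, and the url pattern are simplifications picked for
the example.

```python
import html
import re

URL_RE = re.compile(r"https?://[^\s<>\"]+")


def linkify(line):
    """Escape a line for HTML and wrap bare URLs in anchor tags."""
    parts = []
    last = 0
    for match in URL_RE.finditer(line):
        # Leave trailing sentence punctuation outside the link.
        url = match.group(0).rstrip(".,;:!?)")
        end = match.start() + len(url)
        parts.append(html.escape(line[last:match.start()]))
        escaped = html.escape(url)
        parts.append('<a href="{0}">{0}</a>'.format(escaped))
        last = end
    parts.append(html.escape(line[last:]))
    return "".join(parts)


def classify_lines(text):
    """Return (css_class, html_line) pairs for a message body."""
    result = []
    in_signature = False
    for line in text.splitlines():
        if line.rstrip() == "--":
            in_signature = True  # everything from here on is signature
        if in_signature:
            css_class = "signature"
        elif line.lstrip().startswith(">"):
            css_class = "quote"
        else:
            css_class = "body"
        result.append((css_class, linkify(line)))
    return result
```

each pair can then be wrapped in an element carrying that class, so the
look of quotes and signatures is decided by a stylesheet.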

jim

Re: Web archival of mailing lists

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Tue, Apr 24, 2001 at 12:59:19PM -0400, Greg Marr wrote:
> At 09:19 PM 04/23/2001, Chris Pepper wrote:
> >>You may see mod_mbox in action at:
> >>
> >>http://www.apachelabs.org/
> >
> >Looks cool. Number-one nit: can you make the backgrounds white?
> 
> I happen to prefer gray backgrounds, actually, or in this case, 
> whatever background I have defined.  In any case, for the maximum in 
> usability and customization, the style of the pages should ultimately 
> come from some kind of configuration/template, or even better an 
> external style sheet, and not be hard-coded into the source of the 
> module.

Yup.  That is the plan.  Somehow use mod_include in mod_mbox.  Define a 
template file and use that each time.  That's the goal.  Not sure whether 
we duplicate mod_include or add mod_include to the filter chain (we may 
want to define our own include syntax - this could define how you want 
the message or index displayed, headers, footers, etc.).  Roy and I have 
also kicked around the idea of using mod_php as well.  Not sure yet.
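
To give a feel for the template idea, here is a minimal Python sketch
that fills a page template with the fields of one message.  It stands in
for whatever mod_include-style syntax we end up with; the placeholder
names, the template text, and the /archive.css path are invented for
the example.

```python
import html
from string import Template

# Hypothetical template; a site would supply its own file.
PAGE_TEMPLATE = Template("""\
<html><head><title>$subject</title>
<link rel="stylesheet" href="$stylesheet"></head>
<body>
<h1>$subject</h1>
<p>From: $sender</p>
<pre>$body</pre>
</body></html>
""")


def render_message(message, stylesheet="/archive.css"):
    """Fill the template with HTML-escaped fields from one message dict."""
    fields = {key: html.escape(value) for key, value in message.items()}
    return PAGE_TEMPLATE.substitute(fields, stylesheet=html.escape(stylesheet))
```

With this split, changing the background color means editing the
stylesheet or the template, not the module.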

I believe that the core of mod_mbox is complete - I'm working on the 
presentation of the messages now (which is confined to mod_mbox.c).  As
I've said before, I'm not a presentation person, so if someone wants to
help out with that, I'd greatly appreciate it.  =)

I'll probably be pushing out a new build of mod_mbox to www.apachelabs.org
tonight (including prev/next on each message and use of filters...).  I 
will probably also add a cronjob to fetch the latest mbox files from
mail.apache.org and reindex them every few hours or so.

Thanks again for the input.  -- justin


Re: Web archival of mailing lists

Posted by Greg Marr <gr...@alum.wpi.edu>.
At 09:19 PM 04/23/2001, Chris Pepper wrote:
>>You may see mod_mbox in action at:
>>
>>http://www.apachelabs.org/
>
>Looks cool. Number-one nit: can you make the backgrounds white?

I happen to prefer gray backgrounds, actually, or in this case, 
whatever background I have defined.  In any case, for the maximum in 
usability and customization, the style of the pages should ultimately 
come from some kind of configuration/template, or even better an 
external style sheet, and not be hard-coded into the source of the 
module.

-- 
Greg Marr
gregm@alum.wpi.edu
"We thought you were dead."
"I was, but I'm better now." - Sheridan, "The Summoning"


Re: Web archival of mailing lists

Posted by Justin Erenkrantz <je...@ebuilt.com>.
> 	Looks cool. Number-one nit: can you make the backgrounds white?

I'll look into it.  I'm not a UI person.  I only use text browsers.

> 	Also, I think mod_mbox could be immediately useful with 
> forward/backward/thread-listing arrows on every message. This is 
> non-trivial, but otherwise mod_mbox looks great.

Actually, implementation-wise, it is fairly trivial (based on how I
do indexing and stuff), but it is awful speed-wise, since the only
way to know the neighboring messages (by date or by thread) is to
compute the index or threading tree.  Since I don't store the indexing
information in the DBMs, I compute it on each hit.  I like not computing
the indexes until they are requested - my gut feeling is it is cheaper to
compute the indexes than to load them from the DBMs each time, but that
may not prove to be the case.
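
The cost is easy to see in a minimal Python sketch of a date-based
prev/next lookup.  This is not the mod_mbox code; the function name
neighbors and the plain dict of dates are assumptions for the example.

```python
def neighbors(dates_by_id, msgid):
    """Return (prev_id, next_id) for msgid in date order.

    dates_by_id maps each message-id to a sortable date value.
    """
    # The whole index is re-sorted on every call, i.e. on every hit.
    ordered = sorted(dates_by_id, key=lambda mid: (dates_by_id[mid], mid))
    pos = ordered.index(msgid)
    prev_id = ordered[pos - 1] if pos > 0 else None
    next_id = ordered[pos + 1] if pos + 1 < len(ordered) else None
    return prev_id, next_id
```

Each single-message view pays for a full sort of the list just to find
two neighbors, which is the trade-off against storing the sorted order.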

That said, I am adding that functionality right now.  When it is done,
I'll post it on www.apachelabs.org.  We'll see how it scales...  AFAIK, 
MHonArc and Hypermail generate static HTML pages, so it is easy for them
just to include the prev/next links (since they know what they are
when they generate the page).  I'm generating almost everything on the
fly.

Thanks for the feedback!  -- justin


Re: Web archival of mailing lists

Posted by Chris Pepper <pe...@mail.reppep.com>.
>You may see mod_mbox in action at:
>
>http://www.apachelabs.org/

Justin,

	Looks cool. Number-one nit: can you make the backgrounds white?

>The other key functionality is the threading algorithm.  I based
>my threading implementation off of Jamie Zawinski's mail threading
>algorithms (he wrote the original versions of Netscape Mail - see
>http://www.jwz.org/doc/threading.html).  His key point was not to
>store the threading tree in the database, but generate the tree on
>the fly.  It has proved to be very efficient and highly accurate.

	Also, I think mod_mbox could be immediately useful with 
forward/backward/thread-listing arrows on every message. This is 
non-trivial, but otherwise mod_mbox looks great.


						Thanks,


						Chris Pepper
-- 
Chris Pepper:                   <http://www.reppep.com/~pepper/>
Rockefeller U Computing Services:  <http://www.rockefeller.edu/>
Mac OS X Software:                      <http://www.mosxsw.com/>