Posted to dev@httpd.apache.org by Ingo Luetkebohle <in...@devcon.net> on 1998/01/09 20:24:15 UTC

sane mail-archive

I just browsed through the archive linked at dev.apache.org and found it
to be a real drag to browse the mbox directly. Is there a "real",
HTMLized archive? If not, would anyone mind if I did one?

---Ingo Luetkebohle
dev/consulting Gesellschaft fuer Netzwerkentwicklung und -beratung mbH
url: http://www.devconsult.de/ - fon: 0521-1365800 - fax: 0521-1365803

Re: sane mail-archive

Posted by Brian Behlendorf <br...@organic.com>.
At 12:46 PM 1/10/98 +0100, Ingo Luetkebohle wrote:
>Brian Behlendorf wrote:
>> Disk space chewing.  huge problem.  That affects search index size too.
>
>Sure, but disk-space is cheap. gzipping the generated files does help,
>too.

It's not so much the size as the fact that they're all separate files.  You
know how much fun it is to do an "ls" in a directory with 10K files, or a
copy or a backup for that matter.  Add to that the disk space blocksize
overhead, and 10K messages are a lot more troublesome to deal with than
individual messages.
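
To put rough numbers on the blocksize overhead mentioned above, here is a
minimal Python sketch. The 4096-byte block size and the message sizes are
assumptions chosen for illustration, not figures from this thread, and the
helper names are made up. It ignores inode and directory overhead.

```python
import math

def allocated_bytes(size: int, block_size: int = 4096) -> int:
    """Bytes a file of `size` bytes occupies when rounded up to whole blocks."""
    return math.ceil(size / block_size) * block_size

def overhead(sizes, block_size: int = 4096):
    """Return (bytes allocated as separate files, bytes allocated as one file)."""
    separate = sum(allocated_bytes(s, block_size) for s in sizes)
    combined = allocated_bytes(sum(sizes), block_size)
    return separate, combined

if __name__ == "__main__":
    # Hypothetical archive: 10,000 messages of 1,500 bytes each.
    separate, combined = overhead([1500] * 10_000)
    print(separate, combined)
```

Under these assumptions every small message still costs a full block when it
lives in its own file, which is where the extra space goes.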

And CPU power is getting cheaper faster than high-quality backed-up disk
space.

>You mention pine... What about setting up a public IMAP server
>containing the mail archive? Browsing and searching is built right in
and it surely beats ftp'ing all those mbox-files and browsing them
>locally. With cyrus-imapd, you could even place all the mbox files into
>one big archive, without any performance degradation at all.

I've not investigated recent developments in imap clients.  Are they
finally usable for public archives?  Is it possible to refer to individual
messages on an IMAP server by a URL?  

	Brian



--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
specialization is for insects				  brian@organic.com

Re: sane mail-archive

Posted by Rodent of Unusual Size <Ke...@Golux.Com>.
Brian Behlendorf wrote:
> 
> At 06:06 PM 1/9/98 -0500, Rodent of Unusual Size wrote:
> >
> >I have a half-implemented version of exactly this.  The major stumbling
> >block has been the index; I want full-text, and my first pass (before
> >I had to turn my attention elsewhere) involved breaking the mbox into
> >separate pieces and running a wais index of it, and rewriting the
> >index pointers to be MIDs rather than file names.
> 
> For the search tool I'd suggest just grabbing glimpse or swish or something
> and hacking it to not be file-system based but "feed me text, feed me an
> identification string"-based.

Cool - but where would I find those?  I've never heard of them..

#ken	P-)}

Re: sane mail-archive

Posted by Ingo Luetkebohle <in...@blank.pages.de>.
Brian Behlendorf wrote:
> For the search tool I'd suggest just grabbing glimpse or swish or something
> and hacking it to not be file-system based but "feed me text, feed me an
> identification string"-based.

Another approach is htdig, which just crawls over the HTML tree, indexes
it and doesn't care where the web-server gets its data from. It's not
suited for selective searching of document parts, though.
 
---Ingo Luetkebohle
dev/consulting Gesellschaft fuer Netzwerkentwicklung und -beratung mbH
url: http://www.devconsult.de/ - fon: 0521-1365800 - fax: 0521-1365803

Re: sane mail-archive

Posted by Kevin Hughes <ke...@webhistory.org>.

On Fri, 9 Jan 1998, Brian Behlendorf wrote:

> >I have a half-implemented version of exactly this.  The major stumbling
> >block has been the index; I want full-text, and my first pass (before
> >I had to turn my attention elsewhere) involved breaking the mbox into
> >separate pieces and running a wais index of it, and rewriting the
> >index pointers to be MIDs rather than file names.
> 
> For the search tool I'd suggest just grabbing glimpse or swish or something
> and hacking it to not be file-system based but "feed me text, feed me an
> identification string"-based.

	Note that now hypermail is under GNU, and you folks are encouraged
to hack it to bits. Personally, I'd take a few ideas from MHonArc
(MIME support, page customization) and add them in if you all find it
useful. You can also get some ideas from www.findmail.com, which pretty
much rewrote hypermail from scratch, making it use a database of pointers
instead of individual files.

	Swish is also under GNU, and IMHO I'd start from the code at:

	http://sunsite.berkeley.edu/SWISH-E/

	and add ideas from htDig (also under GNU), rather than the other
way around, since htDig is getting rather big and complex and dependent
on wacky C++ library stuff.
	The SWISH-E project has added multiple fields support, and folks
on its mailing list are contributing code, patches, etc.

	-- Kevin

Re: sane mail-archive

Posted by Brian Behlendorf <br...@organic.com>.
At 06:06 PM 1/9/98 -0500, Rodent of Unusual Size wrote:
>Brian Behlendorf wrote:
>> 
>> I'll post the spec soon... but the basic idea is that for each mbox file
>> you have a dbm file (or sql database table, whatever) storing message
>> beginnings, the basic headers, x-ref's, etc.  I.e., a threads database like
>> news servers have.  The search engine returns hits to particular messages,
>> showing the metainfo.  every message has a unique URL.  Etc.
>
>I have a half-implemented version of exactly this.  The major stumbling
>block has been the index; I want full-text, and my first pass (before
>I had to turn my attention elsewhere) involved breaking the mbox into
>separate pieces and running a wais index of it, and rewriting the
>index pointers to be MIDs rather than file names.

For the search tool I'd suggest just grabbing glimpse or swish or something
and hacking it to not be file-system based but "feed me text, feed me an
identification string"-based.
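
A toy sketch of what a "feed me text, feed me an identification string"
interface could look like, in Python. This is not the API of glimpse or
swish; the class and method names are invented for illustration, and the
index is a plain in-memory word-to-id map.

```python
import re
from collections import defaultdict

class TextIndex:
    """Toy inverted index keyed by caller-supplied identification strings."""

    def __init__(self):
        self._postings = defaultdict(set)  # word -> set of document ids

    def add(self, doc_id: str, text: str) -> None:
        """Index `text` under `doc_id`; no file on disk is involved."""
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self._postings[word].add(doc_id)

    def search(self, query: str) -> set:
        """Return the ids of documents containing every word in `query`."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        result = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            result &= self._postings.get(word, set())
        return result

if __name__ == "__main__":
    index = TextIndex()
    index.add("<1@example.org>", "browsing the mbox directly is a drag")
    index.add("<2@example.org>", "a public IMAP server for the archive")
    print(index.search("mbox"))
```

The caller decides what the identification string is, so it could just as
well be a Message-ID or a URL as a file name.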

	Brian


--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
specialization is for insects				  brian@organic.com

Re: sane mail-archive

Posted by Rodent of Unusual Size <Ke...@Golux.Com>.
Brian Behlendorf wrote:
> 
> I'll post the spec soon... but the basic idea is that for each mbox file
> you have a dbm file (or sql database table, whatever) storing message
> beginnings, the basic headers, x-ref's, etc.  I.e., a threads database like
> news servers have.  The search engine returns hits to particular messages,
> showing the metainfo.  every message has a unique URL.  Etc.

I have a half-implemented version of exactly this.  The major stumbling
block has been the index; I want full-text, and my first pass (before
I had to turn my attention elsewhere) involved breaking the mbox into
separate pieces and running a wais index of it, and rewriting the
index pointers to be MIDs rather than file names.
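
A minimal sketch of the "break the mbox into pieces keyed by MID" step, using
Python's standard mailbox module. It is not the half-implemented version
mentioned above (that code is not shown in this thread); the function name is
made up, and messages without a Message-ID are simply skipped here.

```python
import mailbox

def messages_by_mid(mbox_path: str) -> dict:
    """Map each Message-ID in an mbox file to that message's full text."""
    pieces = {}
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key in box.iterkeys():
            msg = box[key]
            mid = msg["Message-ID"]
            if mid is None:
                continue  # nothing to key on; a real tool needs a fallback
            pieces[str(mid).strip()] = msg.as_string()
    finally:
        box.close()
    return pieces
```

Each value could then be handed to an indexer together with its MID instead
of a file name.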

#ken	P-)}

Re: sane mail-archive

Posted by Ingo Luetkebohle <in...@blank.pages.de>.
Brian Behlendorf wrote:
> Disk space chewing.  huge problem.  That affects search index size too.

Sure, but disk-space is cheap. gzipping the generated files does help,
too.
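
As a rough illustration of how much gzip can help on generated HTML, a
minimal Python sketch. The sample page is invented, the function name is
made up, and real ratios depend on the content; very small files can even
grow slightly because of the gzip header.

```python
import gzip

def gzip_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    if not data:
        return 1.0
    return len(gzip.compress(data)) / len(data)

if __name__ == "__main__":
    # Hypothetical index page: many near-identical list items.
    page = b'<li><a href="msg.html">Re: sane mail-archive</a></li>\n' * 500
    print(round(gzip_ratio(page), 3))
```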

> And the fact you had to heavily customize it too doesn't make me feel good.

By customization I meant building a specific resource file for
MHonArc, not hacking at the MHonArc source. To explain: MHonArc
lets you customize the look of the generated HTML files through resource
files. This makes them generally much more useful.
 
> Not pine!  :)  Not eudora either, whose mailbox file format is the same as
> the Unix format.

You mention pine... What about setting up a public IMAP server
containing the mail archive? Browsing and searching is built right in
and it surely beats ftp'ing all those mbox-files and browsing them
locally. With cyrus-imapd, you could even place all the mbox files into
one big archive, without any performance degradation at all.
 
---/dev/il

Re: sane mail-archive

Posted by Dean Gaudet <dg...@arctic.org>.
Not if you use IMAP and a different mailbox format.  Insert caveats here. 

Dean

On Fri, 9 Jan 1998, Marc Slemko wrote:

> On Fri, 9 Jan 1998, Brian Behlendorf wrote:
> 
> > >Anyway, I'd say that a 1700k mbox file (and that's just for the last 9
> > >days) does pretty much choke any reader, too...
> > 
> > Not pine!  :)  Not eudora either, whose mailbox file format is the same as
> > the Unix format.
> 
> No, but my mailboxes have forced me to buy another 64 megs of RAM for my
> home box to load 40-100 meg folders into pine.  pine sucks because it
> reads the whole darn thing into memory.
> 
> Adding memory is cheaper than learning a new mailer though.
> 
> 
> 


Re: sane mail-archive

Posted by Marc Slemko <ma...@worldgate.com>.
On Fri, 9 Jan 1998, Brian Behlendorf wrote:

> >Anyway, I'd say that a 1700k mbox file (and that's just for the last 9
> >days) does pretty much choke any reader, too...
> 
> Not pine!  :)  Not eudora either, whose mailbox file format is the same as
> the Unix format.

No, but my mailboxes have forced me to buy another 64 megs of RAM for my
home box to load 40-100 meg folders into pine.  pine sucks because it
reads the whole darn thing into memory.

Adding memory is cheaper than learning a new mailer though.



Re: sane mail-archive

Posted by Brian Behlendorf <br...@organic.com>.
At 08:31 PM 1/9/98 +0100, Ingo Luetkebohle wrote:
>Marc Slemko wrote:
>> I am not aware of any program that produces a sane archive that is
>> useful and doesn't choke with volume.
>
>Hmm, we are using a combination of MHonArc, very customized, for
>mail2html and htdig for searching, which works quite well, even on large
>mailing-lists (bugtraq...)

Disk space chewing.  huge problem.  That affects search index size too.
And the fact you had to heavily customize it too doesn't make me feel good.

>Anyway, I'd say that a 1700k mbox file (and that's just for the last 9
>days) does pretty much choke any reader, too...

Not pine!  :)  Not eudora either, whose mailbox file format is the same as
the Unix format.

I'll post the spec soon... but the basic idea is that for each mbox file
you have a dbm file (or sql database table, whatever) storing message
beginnings, the basic headers, x-ref's, etc.  I.e., a threads database like
news servers have.  The search engine returns hits to particular messages,
showing the metainfo.  every message has a unique URL.  Etc.
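
Pending the spec, here is a rough Python sketch of the per-mbox index
described above: one dbm entry per message, holding where the message begins
plus a few basic headers. The record layout, the choice of Message-ID as the
key, JSON as the value encoding, and all function names are assumptions made
for illustration. It also assumes every line starting with "From " begins a
new message, which holds only if body lines are escaped in the usual way.

```python
import dbm
import json
from email.parser import BytesHeaderParser

def _text(value):
    """Header values may be None or Header objects; store plain strings."""
    return None if value is None else str(value).strip()

def scan_mbox(fileobj):
    """Yield (message_id, record) for each message in a binary mbox stream."""
    starts = []
    offset = 0
    for line in fileobj:
        if line.startswith(b"From "):
            starts.append(offset)  # byte offset where this message begins
        offset += len(line)
    ends = starts[1:] + [offset]
    parser = BytesHeaderParser()
    for start, end in zip(starts, ends):
        fileobj.seek(start)
        raw = fileobj.read(end - start)
        # Drop the "From " separator line before parsing the headers.
        parts = raw.split(b"\n", 1)
        headers = parser.parsebytes(parts[1] if len(parts) > 1 else b"")
        mid = _text(headers["Message-ID"])
        if mid is None:
            continue
        yield mid, {
            "offset": start,
            "length": end - start,
            "subject": _text(headers["Subject"]),
            "from": _text(headers["From"]),
            "in_reply_to": _text(headers["In-Reply-To"]),  # x-ref for threads
        }

def build_index(mbox_path, index_path):
    """Write one dbm entry per message: Message-ID -> JSON metadata."""
    with open(mbox_path, "rb") as mbox, dbm.open(index_path, "c") as db:
        for mid, record in scan_mbox(mbox):
            db[mid] = json.dumps(record)

def fetch(mbox_path, index_path, mid):
    """Return the raw bytes of one message, located through the index."""
    with dbm.open(index_path, "r") as db:
        record = json.loads(db[mid])
    with open(mbox_path, "rb") as mbox:
        mbox.seek(record["offset"])
        return mbox.read(record["length"])
```

With an index like this, a URL only needs to carry the key; the server looks
up the offset and reads that one message out of the mbox file.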

	Brian


--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
specialization is for insects				  brian@organic.com

Re: sane mail-archive

Posted by Ingo Luetkebohle <in...@devcon.net>.
Marc Slemko wrote:
> I am not aware of any program that produces a sane archive that is
> useful and doesn't choke with volume.

Hmm, we are using a combination of MHonArc, very customized, for
mail2html and htdig for searching, which works quite well, even on large
mailing-lists (bugtraq...)

Anyway, I'd say that a 1700k mbox file (and that's just for the last 9
days) does pretty much choke any reader, too...

> cf. some of Brian's messages...

Which messages are you referring to?
 
---Ingo Luetkebohle
dev/consulting Gesellschaft fuer Netzwerkentwicklung und -beratung mbH
url: http://www.devconsult.de/ - fon: 0521-1365800 - fax: 0521-1365803

Re: sane mail-archive

Posted by Marc Slemko <ma...@worldgate.com>.
I am not aware of any program that produces a sane archive that is
useful and doesn't choke with volume.

cf. some of Brian's messages...

On Fri, 9 Jan 1998, Ingo Luetkebohle wrote:

> I just browsed through the archive linked at dev.apache.org and found it
> to be a real drag to browse the mbox directly. Is there a "real",
> HTMLized archive? If not, would anyone mind if I did one?
> 
> ---Ingo Luetkebohle
> dev/consulting Gesellschaft fuer Netzwerkentwicklung und -beratung mbH
> url: http://www.devconsult.de/ - fon: 0521-1365800 - fax: 0521-1365803
>