Posted to dev@httpd.apache.org by "Robert S. Thau" <rs...@ai.mit.edu> on 1996/05/30 18:31:43 UTC
Stuff that happened at a workshop in Cambridge.
I got a last-minute invite to a workshop on distributed searching and
indexing which happened in Cambridge over the last couple of days.
Some things that happened which might be of interest to people here,
in no particular order:
1) I was asked a couple of times, "could Apache support XXX"? My
general answer, after stressing that I was not in a position to make
any commitments on behalf of the group, and that we're still trying
to make up our minds about what *exactly* the criteria ought to be,
was that if someone wanted to add Apache support for their hack of
choice, the right approach would be to:
a) Supply code, if it isn't easy to do that, and be willing to see
it distributed on our terms.
b) Persuade the group that it was a good feature for Apache to
support, fit well with the distribution, and was worth the risk
of taking a CERT advisory on the code.
c) Be willing to stick around and support the code when problems
came up (as is inevitable in the real world).
I hope this isn't too far from the actual (still evolving, I think)
consensus.
2) It turns out that a *lot* of people want some way of asking a
server, "what's changed since Tuesday?". Netscape actually has
implemented exactly that capability in their enterprise and catalog
servers; they've just been very quiet about the spec. Hopefully,
in this instance at least, they'll be adopting a more open attitude
--- if the stars are right, a description of what they have
implemented (which, BTW, the Netscape rep said he wouldn't mind
seeing cloned), may be showing up as a W3C draft in the near term.
One thing that would fall out of this sort of effort, BTW, is the
ability to give a well-mannered spider very precise directions
about what it should and should not try to index at any particular
site. (Ill-mannered spiders are, unfortunately, uncontrollable by
definition).
BTW, one of the reasons that people were asking about putting stuff
into Apache is that we could potentially provide a vehicle for
widespread deployment of this sort of thing if we *did* slot it
into our own core distribution.
3) Another item which was of interest to a number of people and groups
is establishing a common query format. There's a group at Stanford
which is trying to do exactly this by getting Excite, Verity, etc.,
to agree on whatever they can manage to agree on (with a spec that
leaves hooks for proprietary extensions).
4) In the meantime, the Z39.50 community is wondering "why don't these
people just use *our* stuff"? General Magic is wondering the same
thing, but they have even less of a chance. (Z39.50 is a common
query protocol framework which is used by a lot of library catalog
systems, and has been used in other fields; it's also the basis of
WAIS).
5) Yet *another* thing that may appear over the medium term is a
common full-text inverted index format. There is an interest on
the part of a number of search-engine vendors in defining such a
thing, provided that it leaves them with enough information to
deploy their own proprietary tricks without exposing them in
public.
6) Since there was a representative from Microsoft, I asked him if he
knew anything about the WBCLI and TinyWeb spiders which seemed to
be giving people trouble here last month; I also forwarded him a
few samples of the complaints I was seeing (with names stripped
off, in case any of you are contemplating business deals with
Microsoft).
FWIW, jericho2 is actually Microsoft's firewall. It has a lot of
individual users behind it, and the access patterns they
collectively generate can wind up looking like a "stupid robot".
However, it isn't obvious to either me or the Microsoft guy that
that accounts for WBCLI or TinyWeb; he said he'd try talking things
over with the people who run jericho, and see if they have any idea
what's up.
7) For your amusement, since the better spiders have stopped indexing
the now-ubiquitous <!-- sex breast sex breast sex breast ... -->
HTML comments, current practice in fooling search engines has
evolved. The new state of the art is apparently as follows:
<body bgcolor=white>
<font color=white>
sex breast sex breast sex breast...
</font>
This says something about human ingenuity and resourcefulness, but
I'm not sure I want to know what.
8) On a similar note, there was a talk by one of the guys running the
c|net virtual software library; one of their serious concerns is
trying to figure out a way *not* to tell a ten-year-old who looks
at their list of most popular downloads that the number 3 item (or
whatever it is) is the hooters screen saver --- in particular,
they'd like a way to do this which does not involve human editorial
decisions and the associated potential for legal liability.
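To make the trick in item 7 concrete, here is a minimal sketch of how an indexer might notice text whose font colour matches the page colour. It is an illustration only, not how any spider mentioned above actually works. The names HiddenTextFinder and find_hidden_text are hypothetical, it only looks at the bgcolor and background attributes of <body> and the color attribute of <font>, and it compares colour values as plain strings.

```python
from html.parser import HTMLParser


class HiddenTextFinder(HTMLParser):
    """Collect text whose <font color> equals the colour declared on <body>."""

    def __init__(self):
        super().__init__()
        self.page_color = None   # colour declared on <body>, lowercased
        self.font_colors = []    # stack of colours from open <font> tags
        self.hidden = []         # text runs that match the page colour

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            # Assumption: accept either attribute as the page colour.
            color = attrs.get("bgcolor") or attrs.get("background")
            self.page_color = color.lower() if color else None
        elif tag == "font":
            color = attrs.get("color")
            self.font_colors.append(color.lower() if color else None)

    def handle_endtag(self, tag):
        if tag == "font" and self.font_colors:
            self.font_colors.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or self.page_color is None:
            return
        # The innermost <font> that sets a colour decides the text colour.
        current = next((c for c in reversed(self.font_colors) if c), None)
        if current == self.page_color:
            self.hidden.append(text)


def find_hidden_text(html):
    """Return the text runs that would render in the page colour."""
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden


if __name__ == "__main__":
    sample = """<body bgcolor=white>
<font color=white>
hidden words
</font>
<p>visible words</p>
</body>"""
    print(find_hidden_text(sample))  # prints ['hidden words']
```

A string comparison like this misses equivalent spellings of the same colour (for example a name versus a hex value), so a real filter would need to normalise colours first.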
rst
Re: Stuff that happened at a workshop in Cambridge.
Posted by Brian Behlendorf <br...@organic.com>.
On Thu, 30 May 1996, Robert S. Thau wrote:
> 1) I was asked a couple of times, "could Apache support XXX"? My
> general answer, after stressing that I was not in a position to make
> any commitments on behalf of the group, and that we're still trying
> to make up our minds about what *exactly* the criteria ought to be,
> was that if someone wanted to add Apache support for their hack of
> choice, the right approach would be to:
>
> a) Supply code, if it isn't easy to do that, and be willing to see
> it distributed on our terms.
> b) Persuade the group that it was a good feature for Apache to
> support, fit well with the distribution, and was worth the risk
> of taking a CERT advisory on the code.
> c) Be willing to stick around and support the code when problems
> came up (as is inevitable in the real world).
>
> I hope this isn't too far from the actual (still evolving, I think)
> consensus.
100% right.
> 2) It turns out that a *lot* of people want some way of asking a
> server, "what's changed since Tuesday?". Netscape actually has
> implemented exactly that capability in their enterprise and catalog
> servers; they've just been very quiet about the spec. Hopefully,
> in this instance at least, they'll be adopting a more open attitude
> --- if the stars are right, a description of what they have
> implemented (which, BTW, the Netscape rep said he wouldn't mind
> seeing cloned), may be showing up as a W3C draft in the near term.
Hmm, I was under the impression that the Catalog servers were using the
Harvest model and technology... they hired two of the top three Harvest
developers from colorado state, with the other one going to @Home for
their proxy server work.
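As a rough illustration of the "what's changed since Tuesday?" question in item 2, the sketch below walks a document tree and lists files modified after a cutoff time. It is neither Netscape's mechanism (whose spec is described above as unpublished) nor Harvest's; the function name changed_since is hypothetical, and it assumes file modification times are a good enough stand-in for "changed".

```python
import os
import tempfile
import time


def changed_since(root, cutoff):
    """Return sorted paths (relative to root) modified after cutoff.

    cutoff is a timestamp in seconds since the epoch.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.getmtime(full) > cutoff:
                changed.append(os.path.relpath(full, root))
    return sorted(changed)


if __name__ == "__main__":
    # Demo on a throwaway directory: one file a week old, one fresh.
    with tempfile.TemporaryDirectory() as root:
        old = os.path.join(root, "old.html")
        new = os.path.join(root, "new.html")
        for path in (old, new):
            with open(path, "w") as fh:
                fh.write("<html></html>\n")
        now = time.time()
        week_ago = now - 7 * 86400
        os.utime(old, (week_ago, week_ago))
        os.utime(new, (now, now))
        # Ask for everything changed in the last two days.
        print(changed_since(root, now - 2 * 86400))  # prints ['new.html']
```

A spider given such a list could fetch only the named documents instead of re-crawling the whole site.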
> BTW, one of the reasons that people were asking about putting stuff
> into Apache is that we could potentially provide a vehicle for
> widespread deployment of this sort of thing if we *did* slot it
> into our own core distribution.
As a module in the distribution, sure. :)
> 5) Yet *another* thing that may appear over the medium term is a
> common full-text inverted index format. There is an interest on
> the part of a number of search-engine vendors in defining such a
> thing, provided that it leaves them with enough information to
> deploy their own proprietary tricks without exposing them in
> public.
I thought that's what SOIF was... ah well.
Actually, on my long to-do list is to investigate harvest more deeply,
seeing if it made sense to build a bridge module between a harvest
indexer/gatherer and the httpd. Someone with more time than I may want
to pursue this; it would be a nice add-on.
Brian
--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
brian@organic.com | We're hiring! http://www.organic.com/Home/Info/Jobs/