Posted to xindice-dev@xml.apache.org by Gary Shea <sh...@gtsdesign.com> on 2003/03/12 21:06:44 UTC

metadata thoughts

I just finished adding binary resource support, and in the process ended
up writing an 'inline' metadata facility, where the metadata is stored
as a header on the data.  The metadata facility is enabled and
configured on a per-collection basis.

I'm now re-considering metadata, mostly because I don't think I gave the
existing metadata facility a fair chance, and want to get some group
feedback.  There were three reasons why I didn't use Dave Viner's
metadata facility:

1) it doubles the number of disk writes needed when a resource
    is inserted/updated
2) I _think_ the current implementation is not safe for internal use,
    as I believe there is public code for changing arbitrary metadata
    values (please correct me if I'm wrong...)
3) sheer laziness

A while back there was a metadata discussion on this list, and I've read
that discussion.  I didn't detect any consensus about what sort of
metadata should be supported.

It seems clear that collection-level metadata is best off in a 'system
table', which Dave Viner's metadata system models nicely.  Per-document
data is less clear.  Some of it will change with every save/update, some
won't.  The resource type stuff I just did isn't likely to change all
that often and might be a candidate for the non-inline metadata, if it
is safe from user tampering.  On the other hand, I am currently working
on Xindice enhancements that require per-document metadata likely to
change with each update/save.

I'm interested in hearing arguments pro and con.

Regards,

	Gary

RE: metadata thoughts

Posted by Gary Shea <sh...@gtsdesign.com>.
On Thu, 13 Mar 2003, at 10:14 [-0800], Dave Viner (dviner@yahoo-inc.com) wrote:
> 	Have you contacted Murray about the XNode implementation?  It was put into
> the scratchpad, and looked promising but we hit some odd licensing issues
> that were never resolved.  (Or at least that's the last I remember of it.)
> 
> dave

I haven't talked to him and haven't had a chance to look at the XNode
spec yet (probably Monday).  I've just read previous list threads.

One way to support it (from the inline metadata perspective) would be to
use it as a read-only way to return system metadata with the document.
I mean something like:

   * store the document in the usual way, no XNode wrapping.
   * when the document is extracted from the BTree due to a query
      or whatever, take the document + metadata and rewrite it
      as an XNode doc and return it.

With inline metadata read-only XNode is easy, but only system metadata
can be available inline.  Kind of a waste of the XNode capabilities, but
it could be handy.
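
A minimal sketch of the read-only wrapping step described above, assuming
the envelope layout from Murray's XNode example elsewhere in this thread.
The class and method names (XNodeWrapSketch, wrap, escape) are hypothetical
and are not part of the XNode API or of Xindice; the stored document is
assumed to carry no XML declaration.

```java
import java.util.Map;

// Hypothetical helper; not part of the XNode API or of Xindice.
final class XNodeWrapSketch {
    // Namespace URI as it appears in Murray's syntax example.
    static final String XNODE_NS = "http://www.apache.org/xnode/1.0/";

    // Wrap an already-stored document plus its system metadata in an
    // XNode envelope at read time. The stored record itself is untouched.
    static String wrap(String documentXml, Map<String, String> systemMeta) {
        StringBuilder sb = new StringBuilder();
        sb.append("<xnode:Envelope xmlns:xnode=\"").append(XNODE_NS).append("\">\n");
        sb.append("  <xnode:Header>\n");
        for (Map.Entry<String, String> e : systemMeta.entrySet()) {
            sb.append("    <xnode:Property name=\"").append(escape(e.getKey()))
              .append("\" value=\"").append(escape(e.getValue())).append("\" />\n");
        }
        sb.append("  </xnode:Header>\n");
        sb.append("  <xnode:Body>\n");
        sb.append(documentXml).append('\n');
        sb.append("  </xnode:Body>\n");
        sb.append("</xnode:Envelope>\n");
        return sb.toString();
    }

    // Escape the characters that are unsafe inside a double-quoted attribute.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }
}
```

Because the envelope is only built on the way out, nothing a client puts
in the header would be stored, which is why this variant can only expose
system metadata.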

Regards,

	Gary

Re: metadata thoughts

Posted by Murray Altheim <m....@open.ac.uk>.
Murray Altheim wrote:
 > Dave Viner wrote:
 > [...]
 >
 >>     Have you contacted Murray about the XNode implementation?  It was
 >> put into
 >> the scratchpad, and looked promising but we hit some odd licensing issues
 >> that were never resolved.  (Or at least that's the last I remember of it.)
 >
 > If the code in the scratchpad has the wrong license, that's my
 > fault. All of the XNode API code *should* have an Apache license.
 > The API was designed for release to Apache, and I know that nobody
 > at Sun would care to alter that idea.

Evidence of this (if anyone cares to know) is the XNode XML namespace
URI, which is an Apache URL.

 > If you need a copy that does I'd be happy to send it on to whoever
 > is willing to update the scratchpad (I'm behind a firewall that
 > won't let my cvs work remotely).
 >
 > Sorry, I've been buried under a lit review lately and am only
 > resurfacing today... There's no reason XNode couldn't include
 > hierarchical metadata in its <xnode:Header> element. We'd just
 > need to add a method to obtain it (I can't remember right now
 > where I left that issue).

Okay, I looked at my notes and here's a syntax example. The
asterisked line is the beginning of what could be used to store
hierarchical metadata, in this case Dublin Core RDF. You'd just
obtain the entire element rather than simply its name-value
attributes. We would have to add a method to do this, since the
current one returns the property value from the attribute, not
the entire element. It's an easy fix.

I'd still recommend using 'name' to identify the metadata though.

   <xnode:Envelope xmlns:xnode="http://www.apache.org/xnode/1.0/">
     <xnode:Header xnode:created="2001-10-22T18:33:24">
       <xnode:Property name="type" value="application/xhtml+xml" />
*     <xnode:Property name="DC">
         <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:dc="http://purl.org/dc/elements/1.1/">
           <rdf:Description rdf:about="http://www.ilrt.bristol.ac.uk/people/cmdjb/">
             <dc:title>Dave Beckett's Home Page</dc:title>
             <dc:creator>Dave Beckett</dc:creator>
             <dc:publisher>ILRT, University of Bristol</dc:publisher>
             <dc:date>2002-07-31</dc:date>
           </rdf:Description>
         </rdf:RDF>
       </xnode:Property>
     </xnode:Header>
     <xnode:Body>
       <html xmlns="http://www.w3.org/1999/xhtml">
         <head>
         <title>Dave Beckett's Home Page</title>
         [rest of stored content...]
       </html>
     </xnode:Body>
   </xnode:Envelope>

I stole the RDF from

   http://dublincore.org/documents/2002/07/31/dcmes-xml/
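
A rough sketch of the "obtain the entire element" method mentioned above,
written against the standard Java DOM API rather than the XNode API. The
class and method names are hypothetical, and the real XNode accessor may
look quite different; only the namespace URI and the 'name' attribute are
taken from the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper; the XNode API does not (yet) offer this method.
final class XNodePropertySketch {
    static final String XNODE_NS = "http://www.apache.org/xnode/1.0/";

    // Return the first child element of the <xnode:Property> whose 'name'
    // attribute matches, or null if the property has only a 'value'
    // attribute or does not exist.
    static Element propertyElement(String envelopeXml, String name) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(envelopeXml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagNameNS(XNODE_NS, "Property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                if (!name.equals(prop.getAttribute("name"))) {
                    continue;
                }
                // Skip whitespace text nodes and return the nested element.
                for (Node c = prop.getFirstChild(); c != null; c = c.getNextSibling()) {
                    if (c instanceof Element) {
                        return (Element) c;
                    }
                }
            }
            return null;
        } catch (Exception e) {
            // Sketch only: collapse parser and I/O errors into one unchecked type.
            throw new IllegalStateException("could not parse envelope", e);
        }
    }
}
```

With the example above, propertyElement(xml, "DC") would hand back the
<rdf:RDF> element, while a lookup of "type" would return null because that
property carries its value in an attribute.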

One potential failing of XNode is it was not designed to add
metadata to Collections. I've not thought a whit about this
and would certainly entertain any suggestions on what changes to
the API might be needed, if any. I'm not sure there would be.
The <xnode:Body> would just contain the entire Collection --
there might be some Xindice-based problems with that though.

API javadocs at

   http://kmi.open.ac.uk/projects/ceryle/doc/api/org/apache/xnode/package-summary.html

Murray

......................................................................
Murray Altheim                  <http://kmi.open.ac.uk/people/murray/>
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK

     "In Las Vegas Mr Gates also demonstrated a prototype
      fridge magnet which can be programmed to receive traffic
      reports, sports results and advertisements from local
      restaurants using the same FM signal as the wristwatch."
                                  -- The Guardian, 10 Jan 2003.


Re: metadata thoughts

Posted by Murray Altheim <m....@open.ac.uk>.
Dave Viner wrote:
[...]
> 	Have you contacted Murray about the XNode implementation?  It was put into
> the scratchpad, and looked promising but we hit some odd licensing issues
> that were never resolved.  (Or at least that's the last I remember of it.)

If the code in the scratchpad has the wrong license, that's my
fault. All of the XNode API code *should* have an Apache license.
The API was designed for release to Apache, and I know that nobody
at Sun would care to alter that idea.

If you need a copy that does I'd be happy to send it on to whoever
is willing to update the scratchpad (I'm behind a firewall that
won't let my cvs work remotely).

Sorry, I've been buried under a lit review lately and am only
resurfacing today... There's no reason XNode couldn't include
hierarchical metadata in its <xnode:Header> element. We'd just
need to add a method to obtain it (I can't remember right now
where I left that issue).

Murray



RE: metadata thoughts

Posted by Gary Shea <sh...@gtsdesign.com>.
Hi Dave --

On Thu, 13 Mar 2003, at 10:14 [-0800], Dave Viner (dviner@yahoo-inc.com) wrote:

> Hi Gary,
> 	I too don't want this discussion to be a "mine is better than yours".
> Bickering rarely leads to a good solution.  Sorry if my initial comments
> came off that way.  I really didn't intend them to have that effect.

No problem, I was just making it clear that I don't want to go there
either...

To clarify a little more, even... right now I think we are still
clarifying understanding of 1) what we can do with the different
implementations, 2) what the gains and penalties are for the different
implementations, and 3) whether the gains justify having two.
Until we seem to share the same idea of all three, at least in terms of
the facts, I'm going to keep bringing up relevant points until I can't
think of any more :)  So, in the immortal words of Frank Zappa, "here's
some more".

> 
> 	You're correct in that your implementation avoids the double disk write
> issue.  I'm not sure what craziness was coursing thru me when I missed that,
> rather obvious, point.
> 
> 	This topic was discussed (at length) before:
> http://marc.theaimsgroup.com/?l=xindice-dev&m=104066672331546&w=2
> http://marc.theaimsgroup.com/?l=xindice-dev&m=103946437104874&w=2
> http://marc.theaimsgroup.com/?l=xindice-dev&m=103828918030140&w=2
> http://marc.theaimsgroup.com/?t=102873960400001&r=1&w=2

I'd read the more recent thread, will read the 8/2002 thread later today.
Thanks for the link!

> 
> The "inline" metadata approach certainly provides a lot of functionality.
> However, as noted in some of the archived messages, for some applications
> (mine included), altering the document itself is simply not an option.
> True, one could provide an API to fetch the document without the inlined
> metadata, but that requires more work than I'd want to do simply for the
> possibility of accessing metadata.

Let me see if I understand what you are implying with "altering the
document itself is simply not an option".  There are essentially four
cases of interest:

1) The original save of the document.  No problem, inline metadata goes
	into the record with the document.

2) Modification of the document.  Doesn't happen!

3) Modification of the system metadata without changing the document.
Last access time seems like the canonical example where this is an
issue, and I hadn't thought of it before.

4) Introducing metadata into a collection which doesn't have it.  Again,
the document must be read and re-written.

I am guessing that (3) is what you're talking about and it's a good
point.  Is "most recent access time" typically available in SQL db's for
instance?  Is it something your applications need/use?  It _sounds_
useful, if only for reports, and the inline model cannot support it
efficiently.  I suspect that the BTree could be modified to manage
certain kinds of metadata efficiently, but that's yet another set of
changes.  It does make me wonder if that's the way I should be
approaching this issue, though... first record block is metadata,
following blocks are data?  Interesting...


As far as an API to fetch the document without the metadata, no separate
API is required.  Fetching a document works the same way it always has.
The metadata is automatically stripped off by the reader plugins, there
is essentially no performance penalty.  I haven't yet dealt with
accessing system metadata from outside of the internals, but I
understand the desire to do so.  I suspect the API you're using now
would work fine for the inline system metadata as well, and the inline
performance advantage would still apply, assuming caching.

> 
> 	On a separate note, have you performance tested the existing MetaData
> implementation and found it to be below your requirements?  If so, are you
> at liberty to disclose your requirements and tests?  I'd love to see Xindice
> improve the performance of Metadata if it's subpar.

That's an excellent question.  I saw the doubling issue and thought
"this is crazy".  That's as far as I went.  I take your point: if ya
can't tell, does it matter?

On the other hand, for a general-purpose tool like a database, I have
the impression that the goal is to go as fast as possible, within
reason.  Xindice could have been built as file-per-document,
directory per collection, and it would have only been a little slower.
Lots of much easier things could have been done, but in fact the current
solution is fairly close to state of the art.  What drove that?  Not a
particular application, I'd wager (and maybe lose :), but probably
someone's urge to "get it right".  I am all for metadata.  For optional
metadata that is only updated by the user, independent of the document,
performance is not that critical, and your solution is optimal anyway!
But doubling access time for every document read or write truly concerns
me.  Can it really be argued that doubling disk accesses is irrelevant?

I hear your argument and would like to pursue this question further.

> 
> 	Have you contacted Murray about the XNode implementation?  It was put into
> the scratchpad, and looked promising but we hit some odd licensing issues
> that were never resolved.  (Or at least that's the last I remember of it.)

No I haven't.  I will try to read about it again, I've forgotten
what I read the first time!

Regards,

	Gary


> dave

RE: metadata thoughts

Posted by Dave Viner <dv...@yahoo-inc.com>.
Hi Gary,
	I too don't want this discussion to be a "mine is better than yours".
Bickering rarely leads to a good solution.  Sorry if my initial comments
came off that way.  I really didn't intend them to have that effect.

	You're correct in that your implementation avoids the double disk write
issue.  I'm not sure what craziness was coursing thru me when I missed that,
rather obvious, point.

	This topic was discussed (at length) before:
http://marc.theaimsgroup.com/?l=xindice-dev&m=104066672331546&w=2
http://marc.theaimsgroup.com/?l=xindice-dev&m=103946437104874&w=2
http://marc.theaimsgroup.com/?l=xindice-dev&m=103828918030140&w=2
http://marc.theaimsgroup.com/?t=102873960400001&r=1&w=2

The "inline" metadata approach certainly provides a lot of functionality.
However, as noted in some of the archived messages, for some applications
(mine included), altering the document itself is simply not an option.
True, one could provide an API to fetch the document without the inlined
metadata, but that requires more work than I'd want to do simply for the
possibility of accessing metadata.

	On a separate note, have you performance tested the existing MetaData
implementation and found it to be below your requirements?  If so, are you
at liberty to disclose your requirements and tests?  I'd love to see Xindice
improve the performance of Metadata if it's subpar.

	Have you contacted Murray about the XNode implementation?  It was put into
the scratchpad, and looked promising but we hit some odd licensing issues
that were never resolved.  (Or at least that's the last I remember of it.)

dave



RE: metadata thoughts

Posted by Gary Shea <sh...@gtsdesign.com>.
Hey, thanks for the detailed reply!

On Wed, 12 Mar 2003, at 13:46 [-0800], Dave Viner (dviner@yahoo-inc.com) wrote:

> I'm not sure I understand your arguments against using the MetaData stuff I
> wrote.  You listed two reasons, doubling the number of disk writes on
> update, and public methods for changing metadata values.  You also said that
> in your application, you'll need "requires per-document metadata [that is]
> likely to change with each update/save."  So, you'll be bitten by your
> first objection, doubling disk writes.

I don't really want this to be a "my metadata is better than your
metadata (nah nah nah :)" argument, hopefully the end result will be
that my stuff will either 1) go away, or 2) take over part of yours
where it can do so more efficiently.

The "disk write doubling" bites m1 (your and David's implementation)
because m1 metadata is stored in a different BTree record than the
data record.  My implementation (m2) stores metadata in the same BTree
record as the data, so anytime the record's data is stored, storing the
changed metadata is free.
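
To make the write-count argument concrete, here is a toy sketch that
counts record puts against an in-memory map. It is not Xindice's BTree
code: CountingStore, saveSeparate and saveInline are hypothetical names,
and a "write" here is one record put, not a measured disk access.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; it counts record puts, not real disk I/O.
final class WriteCountSketch {
    static final class CountingStore {
        final Map<String, byte[]> records = new HashMap<>();
        int writes = 0;

        void put(String key, byte[] value) {
            records.put(key, value);
            writes++;
        }
    }

    // m1 style: metadata lives in its own record, so one save is two puts.
    static void saveSeparate(CountingStore store, String key, byte[] data, byte[] meta) {
        store.put(key, data);
        store.put("meta:" + key, meta);
    }

    // m2 style: metadata is a header on the data record, so one save is one put.
    static void saveInline(CountingStore store, String key, byte[] data, byte[] meta) {
        byte[] record = new byte[meta.length + data.length];
        System.arraycopy(meta, 0, record, 0, meta.length);
        System.arraycopy(data, 0, record, meta.length, data.length);
        store.put(key, record);
    }
}
```

The flip side shows up in the same model: changing only the metadata under
the inline scheme still rewrites the whole record.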

> As for public methods for changing metadata, it is true that there are such
> methods.  The concept of the MetaData design that David Ku and I implemented
> is that there are 3 types of metadata stored.  First, there are "system"
> elements, like last modified time, last access time.  These are handled by
> Xindice.  Second, there are "attributes", which is a big Hashtable.  This is
> for the user to specify whatever key-value type metadata (s)he wants that
> might be app-specific.  Third, there is a custom XML document space.  This
> is for the user to specify whatever hierarchical metadata (s)he wants.
> Therefore, the methods available allow the user to easily add and remove
> key-value pairs from the hashtable, and the xml document section.  There is
> also code to let the "power" user change system attributes, but you'd have
> to write the code to call those methods.  The XMLRPC methods provided don't
> provide a way to change system elements.

The system attributes are the ones I am interested in.  What I want is a
maximally efficient, moderately configurable system metadata facility.
Consider data type (binary/xml) or data digests (MD5, etc), possibly
even Lamport type change-counter/timestamps for replication.  All
computed automatically by plugins registered and triggered at the
Collection level.
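
As an illustration of the kind of plugin meant here, a digest writer might
look roughly like the following. The MetaWriterSketch interface and the
DigestWriterSketch class are hypothetical stand-ins, not the actual
inlinemeta classes; only java.security.MessageDigest is a real API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical plugin contract: given the bytes about to be stored,
// produce the metadata bytes to store alongside them.
interface MetaWriterSketch {
    byte[] metadataFor(byte[] data);
}

// Hypothetical digest plugin, configured with an algorithm name such as "MD5".
final class DigestWriterSketch implements MetaWriterSketch {
    private final String algorithm;

    DigestWriterSketch(String algorithm) {
        this.algorithm = algorithm;
    }

    @Override
    public byte[] metadataFor(byte[] data) {
        try {
            // A fresh MessageDigest per call avoids sharing state between writes.
            return MessageDigest.getInstance(algorithm).digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("unsupported digest: " + algorithm, e);
        }
    }
}
```

The point of the shape is that the collection only needs to call
metadataFor on each registered plugin at save time; the user never
supplies these values.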

From your description it sounds like I could easily move my
metadata-generation code into your system metadata.  We pay the
performance penalty but gain reduced complexity.

Another alternative is to move your system metadata into m2, so it stays
with the data it refers to, thereby eliminating the disk-access doubling
penalty, but increasing complexity.

I don't know that it would be desirable to add all of the features
offered by m1 into m2, although it could be done.  I really see m2 as at
most useful for "system" metadata.

> Your application might need some metadata capabilities that are provided by
> the metadata design and implementation that's in Xindice now.  Without
> knowing more about your requirements, it's hard to say.  But I think the
> current design is pretty flexible, and should provide you with the
> functionality you need.  If it doesn't, then we should identify if the
> missing piece is related to the design or the implementation of metadata
> storage in Xindice.  If it's the implementation, then we should add it.  If
> it's the design, then we should definitely examine what is wrong with the
> design.
> 
> dave


To give you the flavor of what I'm doing, here's the simplest possible
useful example (this is a working example):

<collection name="test">
  <filer class="org.apache.xindice.core.filer.BTreeFiler" />
  <writer class="org.apache.xindice.core.inlinemeta.ResourceTypeWriter"/>
</collection>

In this example the data in each record in the "test" collection will be
prefixed by three additional bytes:

    byte 1: length of the header
    byte 2: id of the metadata reader (ResourceType => 1)
    byte 3: resource type byte (1 => xml, 2 => binary)

Yeah it's a bit wasteful, but three bytes, so what.
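
For readers who want to see the layout in code, here is a small sketch of
writing and stripping that three-byte header. It is not the actual
ResourceTypeWriter or reader code; the class name is hypothetical, and it
assumes the length byte counts all three header bytes, which the
description above does not spell out.

```java
import java.util.Arrays;

// Hypothetical sketch of the three-byte inline header.
final class InlineHeaderSketch {
    static final byte RESOURCE_TYPE_READER_ID = 1; // "ResourceType => 1"
    static final byte TYPE_XML = 1;
    static final byte TYPE_BINARY = 2;

    // Prefix the payload with [header length][reader id][resource type].
    static byte[] wrap(byte resourceType, byte[] payload) {
        byte[] record = new byte[3 + payload.length];
        record[0] = 3; // assumption: the length includes the length byte itself
        record[1] = RESOURCE_TYPE_READER_ID;
        record[2] = resourceType;
        System.arraycopy(payload, 0, record, 3, payload.length);
        return record;
    }

    // Read the resource type flag back out of a stored record.
    static byte resourceType(byte[] record) {
        if (record.length < 3 || record[1] != RESOURCE_TYPE_READER_ID) {
            throw new IllegalArgumentException("not a ResourceType header");
        }
        return record[2];
    }

    // Strip the header using its length byte, leaving only the document bytes.
    static byte[] payload(byte[] record) {
        int headerLength = record[0] & 0xFF;
        return Arrays.copyOfRange(record, headerLength, record.length);
    }
}
```

Stripping by the length byte rather than a fixed offset is what would let
a reader skip headers it does not understand.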


Here's a somewhat more complex (and also currently working) example in
which the metadata will include an MD5 digest and a resource type flag.

<collection name="test" compressed="true">
  <filer class="org.apache.xindice.core.filer.BTreeFiler" />
  <writer class="org.apache.xindice.core.inlinemeta.AggregatingWriter">
    <writer
      class="org.apache.xindice.core.inlinemeta.DigestWriter"
      algorithm="MD5" />
    <writer class="org.apache.xindice.core.inlinemeta.ResourceTypeWriter"/>
  </writer>
</collection>

In this example, there's an overall header consisting of the two bytes
mentioned above, and then each of the aggregated metadata bits has its
own two byte header which the AggregatingReader uses to figure out what
to do.
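
One way the aggregated layout could be walked is sketched below. The byte
order inside each two-byte sub-header ([data length][reader id]), the id
used for the aggregating reader, and the class name are all assumptions
made for illustration; the real AggregatingReader may differ.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an aggregated inline header.
final class AggregatedHeaderSketch {
    static final int AGGREGATING_ID = 0; // assumed id, not from the post

    // Layout: [total header length][AGGREGATING_ID]
    //         then, per entry: [data length][reader id][data bytes...]
    static byte[] build(Map<Integer, byte[]> entries) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (Map.Entry<Integer, byte[]> e : entries.entrySet()) {
            byte[] data = e.getValue();
            body.write(data.length);          // one-byte data length
            body.write(e.getKey());           // one-byte reader id
            body.write(data, 0, data.length); // the metadata itself
        }
        byte[] bodyBytes = body.toByteArray();
        int total = 2 + bodyBytes.length;
        if (total > 255) {
            throw new IllegalArgumentException("header too long for a one-byte length");
        }
        byte[] header = new byte[total];
        header[0] = (byte) total;
        header[1] = (byte) AGGREGATING_ID;
        System.arraycopy(bodyBytes, 0, header, 2, bodyBytes.length);
        return header;
    }

    // Walk the sub-headers and return metadata bytes keyed by reader id.
    static Map<Integer, byte[]> parse(byte[] record) {
        int total = record[0] & 0xFF;
        Map<Integer, byte[]> out = new LinkedHashMap<>();
        int pos = 2; // skip the overall two-byte header
        while (pos < total) {
            int length = record[pos] & 0xFF;
            int readerId = record[pos + 1] & 0xFF;
            out.put(readerId, Arrays.copyOfRange(record, pos + 2, pos + 2 + length));
            pos += 2 + length;
        }
        return out;
    }
}
```

With a 16-byte MD5 digest and a one-byte resource type, this layout would
come to 2 + (2 + 16) + (2 + 1) = 23 header bytes per record.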


This could easily be expanded to include the other stuff you guys are
maintaining separately from the data record:

<collection name="test" compressed="true" enable-inline-meta="true">
  <filer class="org.apache.xindice.core.filer.BTreeFiler" />
  <writer class="org.apache.xindice.core.inlinemeta.AggregatingWriter">
    <writer
      class="org.apache.xindice.core.inlinemeta.LastModifiedWriter" />
    <writer
      class="org.apache.xindice.core.inlinemeta.LastAccessWriter" />
    <writer
      class="org.apache.xindice.core.inlinemeta.DigestWriter"
      algorithm="MD5" />
    <writer class="org.apache.xindice.core.inlinemeta.ResourceTypeWriter"/>
  </writer>
</collection>

Incidentally, the writer configuration can be changed at any time
without breaking anything.  It's only turning on the inline metadata
initially which is a bit tricky.


My only concern is whether the performance gain is worth the complexity
pain.  The code is pretty simple and totally modular and non-intrusive
(there's probably only 5 or 10 lines of inline-metadata-specific code in
Collection), but every new line of code is a new place for stuff to
break, and a challenge to the poor fool who has to figure out what's
going on....

	Gary

Regards,

	Gary Shea
	GTS Design Consulting
	shea AT gtsdesign DOT com

RE: metadata thoughts

Posted by Dave Viner <dv...@yahoo-inc.com>.
I'm not sure I understand your arguments against using the MetaData stuff I
wrote.  You listed two reasons, doubling the number of disk writes on
update, and public methods for changing metadata values.  You also said that
in your application, you'll need "requires per-document metadata [that is]
likely to change with each update/save."  So, you'll be bitten by your
first objection, doubling disk writes.

As for public methods for changing metadata, it is true that there are such
methods.  The concept of the MetaData design that David Ku and I implemented
is that there are 3 types of metadata stored.  First, there are "system"
elements, like last modified time, last access time.  These are handled by
Xindice.  Second, there are "attributes", which is a big Hashtable.  This is
for the user to specify whatever key-value type metadata (s)he wants that
might be app-specific.  Third, there is a custom XML document space.  This
is for the user to specify whatever hierarchical metadata (s)he wants.
Therefore, the methods available allow the user to easily add and remove
key-value pairs from the hashtable, and the xml document section.  There is
also code to let the "power" user change system attributes, but you'd have
to write the code to call those methods.  The XMLRPC methods provided don't
provide a way to change system elements.


Your application might need some metadata capabilities that are provided by
the metadata design and implementation that's in Xindice now.  Without
knowing more about your requirements, it's hard to say.  But I think the
current design is pretty flexible, and should provide you with the
functionality you need.  If it doesn't, then we should identify if the
missing piece is related to the design or the implementation of metadata
storage in Xindice.  If it's the implementation, then we should add it.  If
it's the design, then we should definitely examine what is wrong with the
design.

dave

