Posted to dev@subversion.apache.org by Ben Collins-Sussman <su...@red-bean.com> on 2007/05/15 14:00:01 UTC

searchable revprops?

For years now, people have been criticizing Subversion for telling
users to go ahead and invent new revprops, but then discovering that
revprops aren't searchable.  "What's the point of all this cool
metadata, if I can't even execute a query like 'show me all revisions
written by a specific author?'"

IIRC, our response has always been, "yeah, you're right, someday we
should build revprop indices on the server for this."

It occurred to me that now that we unconditionally require sqlite in
the repository, it would be really trivial to implement this feature.
If sqlite is indexing revprops, it would be pretty easy to add a new
RA interface to "return all revisions where revpropname matches
revpropvalue".

I certainly don't want to distract from our focus on merge-tracking,
but if anyone is hungry for a relatively easy and fun project... this
seems like a great opportunity.  Some real low-hanging fruit!
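
As a rough illustration of the "return all revisions where revpropname
matches revpropvalue" lookup, here is a minimal sketch using Python's
built-in sqlite3 module.  It is not Subversion code: the table layout,
the index, and the revisions_where() helper are assumptions made up for
the example, and no RA interface is shown.

```python
import sqlite3

# Hypothetical schema (an assumption, not Subversion's): one row per
# (revision, property name) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revprops (revision INTEGER, name TEXT, value TEXT,"
    " PRIMARY KEY (revision, name))"
)
# Index so that lookups by (name, value) do not scan the whole table.
conn.execute("CREATE INDEX revprops_name_value ON revprops (name, value)")

conn.executemany(
    "INSERT INTO revprops VALUES (?, ?, ?)",
    [
        (1, "svn:author", "alice"),
        (2, "svn:author", "bob"),
        (3, "svn:author", "alice"),
        (3, "bugid", "1234"),
    ],
)


def revisions_where(conn, name, value):
    """Return revisions whose property `name` equals `value`, ascending."""
    rows = conn.execute(
        "SELECT revision FROM revprops WHERE name = ? AND value = ?"
        " ORDER BY revision",
        (name, value),
    )
    return [row[0] for row in rows]


print(revisions_where(conn, "svn:author", "alice"))
```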

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: searchable revprops?

Posted by David Glasser <gl...@mit.edu>.
On 5/15/07, Ben Collins-Sussman <su...@red-bean.com> wrote:
> It occurred to me that now that we unconditionally require sqlite in
> the repository, it would be really trivial to implement this feature.

I seem to recall that when sqlite was first added for merge tracking,
there was talk of it being used only for FSFS, with the BDB backend
potentially just using BDB instead (to minimize dependencies, etc).  Has
this changed?  (Sure, I know that we haven't actually implemented
non-sqlite merge tracking...)

--dave

-- 
David Glasser | glasser@mit.edu | http://www.davidglasser.net/


Re: apr_dbd vs. direct sqlite (was Re: searchable revprops?)

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 5/15/07, Eric Gillespie <ep...@pretzelnet.org> wrote:
> "Garrett Rooney" <ro...@electricjellyfish.net> writes:
>
> > apr_dbd is rather new (i.e. not in apr-util 0.9.x), and doesn't solve
>
> People are still using that?  1.2's been out for a while...  I
> understand we can't *break* people using 0.9.x when they upgrade
> to new versions, but who says the *new* features have to be
> usable with the old stuff?

Well, we've supported the APR 0.9.x line (and even encouraged its use
over the 1.x line by only shipping 0.9.x binaries) for some time, for
binary compatibility reasons.  Now there's no reason that can't
change, but if we're going to make new features depend on things like
APR 1.x then we need to make that very clear to users.

-garrett


Re: apr_dbd vs. direct sqlite (was Re: searchable revprops?)

Posted by Charles Acknin <ch...@gmail.com>.
On 5/15/07, Eric Gillespie <ep...@pretzelnet.org> wrote:
> "Garrett Rooney" <ro...@electricjellyfish.net> writes:
>
> > apr_dbd is rather new (i.e. not in apr-util 0.9.x), and doesn't solve
>
> People are still using that?  1.2's been out for a while...  I

Yes.  I'm using gentoo and apr-util-1.x is not flagged as stable yet,
see http://packages.gentoo.org/search/?sstring=apr-util

I guess/hope I'm not the only one in this situation : )

Charles


Re: apr_dbd vs. direct sqlite (was Re: searchable revprops?)

Posted by Daniel Rall <dl...@collab.net>.
On Tue, 15 May 2007, Eric Gillespie wrote:

> "Garrett Rooney" <ro...@electricjellyfish.net> writes:
...
> > many of the problems with swapping new db back ends in (i.e. sql
> > dialect differences), so I'd recommend sticking with raw sqlite unless
> > there's a pressing reason not to do so.
> 
> Dan Berlin assured me that svn uses a narrow subset of SQL, and
> should work with any back-end.

Not a whole lot has changed in our SQL usage since DannyB's initial
implementation -- porting to a new SQL-compatible backend should be a
matter of changing no more than 1 SQL statement.

While I don't really think it's necessary, moving to apr_dbd or an
alternate abstraction in a later release should be no problem, as we
aren't currently exposing sqlite as part of our public API.

Re: apr_dbd vs. direct sqlite (was Re: searchable revprops?)

Posted by Eric Gillespie <ep...@pretzelnet.org>.
"Garrett Rooney" <ro...@electricjellyfish.net> writes:

> apr_dbd is rather new (i.e. not in apr-util 0.9.x), and doesn't solve

People are still using that?  1.2's been out for a while...  I
understand we can't *break* people using 0.9.x when they upgrade
to new versions, but who says the *new* features have to be
usable with the old stuff?

> many of the problems with swapping new db back ends in (i.e. sql
> dialect differences), so I'd recommend sticking with raw sqlite unless
> there's a pressing reason not to do so.

Dan Berlin assured me that svn uses a narrow subset of SQL, and
should work with any back-end.

-- 
Eric Gillespie <*> epg@pretzelnet.org


Re: apr_dbd vs. direct sqlite (was Re: searchable revprops?)

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 5/15/07, Eric Gillespie <ep...@pretzelnet.org> wrote:
> "Ben Collins-Sussman" <su...@red-bean.com> writes:
>
> > On 5/15/07, C. Michael Pilato <cm...@collab.net> wrote:
> >
> > > That's a good observation, Ben.  Let's not be guilty of rushing something
> > > under-designed into the codebase, though.
> >
> > Totally agree.  I meant, "here's a yummy feature someone could take
> > the time to write a design spec for."  :-)  Whatever the design may
> > be, I imagine that the implementation will be fairly easy, now that
> > we've got SQL at our disposal.
>
> Speaking of that, should we be using apr_dbd instead of sqlite
> directly?  If sqlite doesn't perform well enough for large
> installations (as I suspect), this would make it easier to plug in
> a replacement for all svn's SQL usage in one fell swoop.
>
> As long as we only use sqlite in one place, it's not as much of
> an issue, but if we think we're going to expand our usage, I
> think it makes sense to think about this.  Especially before 1.5
> ships and it becomes too late.  Or, hmm, I don't know, would
> switching from direct sqlite to apr_dbd in some post-1.5 release
> be a compatibility issue?  Still better to think about it sooner
> rather than later...

apr_dbd is rather new (i.e. not in apr-util 0.9.x), and doesn't solve
many of the problems with swapping new db back ends in (i.e. sql
dialect differences), so I'd recommend sticking with raw sqlite unless
there's a pressing reason not to do so.

-garrett


apr_dbd vs. direct sqlite (was Re: searchable revprops?)

Posted by Eric Gillespie <ep...@pretzelnet.org>.
"Ben Collins-Sussman" <su...@red-bean.com> writes:

> On 5/15/07, C. Michael Pilato <cm...@collab.net> wrote:
> 
> > That's a good observation, Ben.  Let's not be guilty of rushing something
> > under-designed into the codebase, though.
> 
> Totally agree.  I meant, "here's a yummy feature someone could take
> the time to write a design spec for."  :-)  Whatever the design may
> be, I imagine that the implementation will be fairly easy, now that
> we've got SQL at our disposal.

Speaking of that, should we be using apr_dbd instead of sqlite
directly?  If sqlite doesn't perform well enough for large
installations (as I suspect), this would make it easier to plug in
a replacement for all svn's SQL usage in one fell swoop.

As long as we only use sqlite in one place, it's not as much of
an issue, but if we think we're going to expand our usage, I
think it makes sense to think about this.  Especially before 1.5
ships and it becomes too late.  Or, hmm, I don't know, would
switching from direct sqlite to apr_dbd in some post-1.5 release
be a compatibility issue?  Still better to think about it sooner
rather than later...

-- 
Eric Gillespie <*> epg@pretzelnet.org


Re: searchable revprops?

Posted by David Glasser <gl...@mit.edu>.
On 6/7/07, Ben Collins-Sussman <su...@red-bean.com> wrote:
> This is extremely cool!  I'd love to see folks run with this ... (at
> least those not currently working on merge tracking)!

I too would love to see folks run with this, but I'm not sure that
I'll have time to focus on it myself.  Rather than let the patch
languish as an attachment on the mailing list, I've created a branch
(revprop-sqlite) in the repository, applied the patch, and added a
README.branch describing the current status and two major design
questions (query API and index-vs-canonical).

--dave

-- 
David Glasser | glasser@mit.edu | http://www.davidglasser.net/


Re: searchable revprops?

Posted by Ben Collins-Sussman <su...@red-bean.com>.
This is extremely cool!  I'd love to see folks run with this ... (at
least those not currently working on merge tracking)!

On 6/7/07, David Glasser <gl...@mit.edu> wrote:
> On 5/15/07, Ben Collins-Sussman <su...@red-bean.com> wrote:
> > On 5/15/07, C. Michael Pilato <cm...@collab.net> wrote:
> >
> > > That's a good observation, Ben.  Let's not be guilty of rushing something
> > > under-designed into the codebase, though.
> >
> > Totally agree.  I meant, "here's a yummy feature someone could take
> > the time to write a design spec for."  :-)  Whatever the design may
> > be, I imagine that the implementation will be fairly easy, now that
> > we've got SQL at our disposal.
>
> I can confirm that the latter is true; last night just for the sake of
> seeing how much work it would be, I implemented the creation of
> revprop indices.  As recognized above, the hardest part would be
> designing a flexible API for searching the index (over RA,
> presumably), but as Ben said, it's not too hard to implement the
> functionality once it's designed.
>
> Other than figuring out how search would work, the other big question
> would be whether sqlite should be used as the canonical location of
> the data or as an auxiliary index.  Advantages for the former include
> avoiding redundancy and (for FSFS) space efficiency: on filesystems
> with large minimum file sizes, the FSFS revprops directory is very
> wasteful.  For example, on my OSX machine, the minimum file size is 4k
> and most revprop files are around 250 bytes; my
> ~/.svk/local/db/revprops/ takes up half a gig!  In practice, sqlite
> seems to give about 5-6x space reduction.  Advantages to just being an
> index include not having to deal with blocking for reads (the same
> issue I raised in another thread about mergeinfo).
>
> I'm attaching a patch of what I did last night, though of course it's
> certainly not ready for production.  (It only writes to the index:
> there are no read APIs.  I only bothered to hook it into FSFS, though
> it should be trivial to hook into BDB.  The API for setting a revprop
> takes the hash of all the revprops for a revision even in the code
> path from "propset --revprop" which is only setting one.  It has the
> same SQLITE_BUSY issues as the mergeinfo code.  Much of the sqlite
> code is copied from mergeinfo and would be factored out if this were
> actually applied.  I didn't think incredibly carefully about what
> indices to put on the table.  etc.)  But it does work well enough for
> me to be able to run svnsync on svn.collab.net and then run queries
> like "select value, count(*) as c from revprops where name =
> 'svn:author' group by value order by c desc" on the sqlite db!
>
> --dave
>
> --
> David Glasser | glasser@mit.edu | http://www.davidglasser.net/
>
>


Re: searchable revprops?

Posted by David Glasser <gl...@mit.edu>.
On 5/15/07, Ben Collins-Sussman <su...@red-bean.com> wrote:
> On 5/15/07, C. Michael Pilato <cm...@collab.net> wrote:
>
> > That's a good observation, Ben.  Let's not be guilty of rushing something
> > under-designed into the codebase, though.
>
> Totally agree.  I meant, "here's a yummy feature someone could take
> the time to write a design spec for."  :-)  Whatever the design may
> be, I imagine that the implementation will be fairly easy, now that
> we've got SQL at our disposal.

I can confirm that the latter is true; last night just for the sake of
seeing how much work it would be, I implemented the creation of
revprop indices.  As recognized above, the hardest part would be
designing a flexible API for searching the index (over RA,
presumably), but as Ben said, it's not too hard to implement the
functionality once it's designed.

Other than figuring out how search would work, the other big question
would be whether sqlite should be used as the canonical location of
the data or as an auxiliary index.  Advantages for the former include
avoiding redundancy and (for FSFS) space efficiency: on filesystems
with large minimum file sizes, the FSFS revprops directory is very
wasteful.  For example, on my OSX machine, the minimum file size is 4k
and most revprop files are around 250 bytes; my
~/.svk/local/db/revprops/ takes up half a gig!  In practice, sqlite
seems to give about 5-6x space reduction.  Advantages to just being an
index include not having to deal with blocking for reads (the same
issue I raised in another thread about mergeinfo).
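
To put rough numbers on the space argument, here is a back-of-the-envelope
sketch.  It assumes that "half a gig" means 512 MiB, that every revprop
file is exactly 250 bytes, and that each file occupies exactly one 4k
block; none of those is exact, so treat the result as an estimate only.

```python
import math

BLOCK_SIZE = 4096          # minimum on-disk file size quoted above (4k)
TYPICAL_REVPROP = 250      # typical revprop file size quoted above, in bytes
DIR_SIZE = 512 * 1024**2   # assumption: "half a gig" taken as 512 MiB


def on_disk_size(nbytes, block=BLOCK_SIZE):
    """Round a file size up to a whole number of blocks."""
    return math.ceil(nbytes / block) * block


# Number of one-block files that add up to the directory size.
files = DIR_SIZE // on_disk_size(TYPICAL_REVPROP)
# Bytes of actual revprop data held in those files.
payload = files * TYPICAL_REVPROP
# How many bytes of disk are spent per byte of real data.
waste_factor = on_disk_size(TYPICAL_REVPROP) / TYPICAL_REVPROP

print(files, payload, round(waste_factor, 1))
```

Under those assumptions the directory holds about 31 MiB of real data
in 512 MiB of disk, roughly a 16x overhead, so a 5-6x reduction from
sqlite looks plausible even after the database's own overhead.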

I'm attaching a patch of what I did last night, though of course it's
certainly not ready for production.  (It only writes to the index:
there are no read APIs.  I only bothered to hook it into FSFS, though
it should be trivial to hook into BDB.  The API for setting a revprop
takes the hash of all the revprops for a revision even in the code
path from "propset --revprop" which is only setting one.  It has the
same SQLITE_BUSY issues as the mergeinfo code.  Much of the sqlite
code is copied from mergeinfo and would be factored out if this were
actually applied.  I didn't think incredibly carefully about what
indices to put on the table.  etc.)  But it does work well enough for
me to be able to run svnsync on svn.collab.net and then run queries
like "select value, count(*) as c from revprops where name =
'svn:author' group by value order by c desc" on the sqlite db!
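
For anyone who wants to try that kind of query without the patch, here
is a small self-contained sketch using Python's sqlite3 module.  The
table name and the name/value columns come from the query above; the
revision column and the sample rows are assumptions, and the real
table in the patch may well look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout: the query above only tells us about `name` and `value`.
conn.execute("CREATE TABLE revprops (revision INTEGER, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO revprops VALUES (?, ?, ?)",
    [
        (1, "svn:author", "alice"),
        (2, "svn:author", "bob"),
        (3, "svn:author", "alice"),
        (4, "svn:author", "carol"),
        (5, "svn:author", "carol"),
        (6, "svn:author", "alice"),
        (6, "svn:log", "not an author row"),
    ],
)

# Commits per author, most prolific first (same query as in the text).
commits_per_author = conn.execute(
    "select value, count(*) as c from revprops where name = 'svn:author'"
    " group by value order by c desc"
).fetchall()

for author, count in commits_per_author:
    print(author, count)
```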

--dave

-- 
David Glasser | glasser@mit.edu | http://www.davidglasser.net/

Re: searchable revprops?

Posted by Ben Collins-Sussman <su...@red-bean.com>.
On 5/15/07, C. Michael Pilato <cm...@collab.net> wrote:

> That's a good observation, Ben.  Let's not be guilty of rushing something
> under-designed into the codebase, though.

Totally agree.  I meant, "here's a yummy feature someone could take
the time to write a design spec for."  :-)  Whatever the design may
be, I imagine that the implementation will be fairly easy, now that
we've got SQL at our disposal.


Re: searchable revprops?

Posted by "C. Michael Pilato" <cm...@collab.net>.
Ben Collins-Sussman wrote:
> For years now, people have been criticizing Subversion for telling
> users to go ahead and invent new revprops, but then discovering that
> revprops aren't searchable.  "What's the point of all this cool
> metadata, if I can't even execute a query like 'show me all revisions
> written by a specific author?'"
> 
> IIRC, our response has always been, "yeah, you're right, someday we
> should build revprop indices on the server for this."
> 
> It occurred to me that now that we unconditionally require sqlite in
> the repository, it would be really trivial to implement this feature.
> If sqlite is indexing revprops, it would be pretty easy to add a new
> RA interface to "return all revisions where revpropname matches
> revpropvalue".
> 
> I certainly don't want to distract from our focus on merge-tracking,
> but if anyone is hungry for a relatively easy and fun project... this
> seems like a great opportunity.  Some real low-hanging fruit!

That's a good observation, Ben.  Let's not be guilty of rushing something
under-designed into the codebase, though.  "return all revisions where
revpropname matches revpropvalue" is certainly one use-case, but also common
are:

   * return all revisions (and revpropvalues?) for which revpropname is set
     at all

   * return all revisions for which revpropname's value matches some regexp

In fact, if you do that last one alone, you've hit all three of them:

   .?  --  is the property set at all?
   ^value$  --  does the property match exactly?

Spend a little bit of time thinking through high-value functionality so we
don't have to maintain 15 new ra-dav REPORTs and more client-server compat code.
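
A minimal sketch of how a single regexp-based query could cover all
three cases, using Python's sqlite3 module.  SQLite has a REGEXP
operator but ships no implementation of it by default, so the sketch
registers one from Python; the table layout and the
revisions_matching() helper are assumptions for illustration, not a
proposed API.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Back the SQL `value REGEXP pattern` operator; NULL never matches."""
    return value is not None and re.search(pattern, value) is not None


conn = sqlite3.connect(":memory:")
# SQLite rewrites `X REGEXP Y` as a call to regexp(Y, X).
conn.create_function("REGEXP", 2, regexp)
# Hypothetical layout: one row per (revision, property name).
conn.execute("CREATE TABLE revprops (revision INTEGER, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO revprops VALUES (?, ?, ?)",
    [
        (1, "bugid", "1234"),
        (2, "bugid", "12345"),
        (3, "svn:author", "alice"),
        (4, "bugid", ""),
    ],
)


def revisions_matching(conn, name, pattern):
    """Return revisions where property `name` is set and matches `pattern`."""
    rows = conn.execute(
        "SELECT revision FROM revprops WHERE name = ? AND value REGEXP ?"
        " ORDER BY revision",
        (name, pattern),
    )
    return [row[0] for row in rows]


print(revisions_matching(conn, "bugid", ".?"))       # set at all
print(revisions_matching(conn, "bugid", "^1234$"))   # exact match
print(revisions_matching(conn, "bugid", "^12"))      # general regexp
```

Note that a regexp match like this cannot use an ordinary index on
the value column, so the narrower exact-match case may still deserve
its own fast path.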

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand