Posted to dev@subversion.apache.org by Mark Phippard <ma...@gmail.com> on 2008/01/25 02:54:41 UTC

Node origins cache rewrite

I see David has rewritten this to no longer use SQLite.  Yay!

That being said, I do still have some reservations.  Keep in mind that
CollabNet uses BDB repositories, so I am just speaking from what we
have heard in the past from users.

How many nodes will a large repository have?  We have heard from users
with working copies with thousands of folders and tens and hundreds of
thousands of files.  If this represents their trunk, and they have many
branches with modifications, how many nodes can they expect?

As I said previously, just 100,000 nodes X 4kb block size is 400 MB of
disk space used.  Don't we think users might complain about the
increase?  Even if the repository is already 4 GB, I am sure they
would still notice the increase.
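
To make the arithmetic above concrete, here is a back-of-the-envelope
sketch.  It assumes one cache file per node and that every file occupies
at least one filesystem block; the function name is purely illustrative
and is not part of Subversion.

```python
def cache_overhead_bytes(node_count, block_size_bytes):
    """Lower bound on disk usage when every node gets its own file.

    Assumption: each cache file is tiny, so it consumes exactly one
    filesystem block regardless of how few bytes it actually holds.
    """
    return node_count * block_size_bytes


# 100,000 nodes on a filesystem with 4 KB (4096-byte) blocks.
overhead = cache_overhead_bytes(100_000, 4096)
print(f"{overhead / 1_000_000:.1f} MB")
```

With 4096-byte blocks this comes out slightly above the round 400 MB
figure; the point is the order of magnitude, not the exact number.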

Does the Python script to generate the cache still work?  I wonder if
we could modify it or otherwise make it available for people to run on
some repositories to get an idea of the number of nodes in their
repository.  It would be interesting to see how many nodes are in the
ASF repository.  Perhaps we could run it on some of our large
repositories at CollabNet as well.

That being said, I suppose this is only worth doing if there is some
node count beyond which we would want to reconsider the current
design.

When we came up with this design, how many nodes were we thinking
might typically exist?  What is it optimized for?

Lots of questions, sorry.  Glad to see the progress being made towards
the 1.5 branch though.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Node origins cache rewrite

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On Jan 28, 2008 1:17 PM, C. Michael Pilato <cm...@collab.net> wrote:

> So while I like the elegance of David's plan for storing new node-origins at
> commit time, I honestly believe the compat plan for existing repositories --
> that is, the utter lack thereof -- would be a horrendous mistake for this
> community to make.  You might as well tell big projects that merges are
> effectively disabled for them until they dump and load.

How about splitting the difference and doing both the
one-file-per-node cache and the encoding into the node id?  If users
really care about the cost of the cache they can dump/load.

-garrett

Re: Node origins cache rewrite

Posted by "C. Michael Pilato" <cm...@collab.net>.

C. Michael Pilato wrote:
> Mark Phippard wrote:
>> On Jan 28, 2008 12:55 PM, David Glasser <gl...@davidglasser.net> wrote:
>>> On Jan 25, 2008 12:50 PM, David Glasser <gl...@davidglasser.net> 
>>> wrote:
>>>> On Jan 25, 2008 11:16 AM, David Glasser <gl...@davidglasser.net> 
>>>> wrote:
>>>>> On Jan 24, 2008 6:54 PM, Mark Phippard <ma...@gmail.com> wrote:
>>>>>> I see David has rewritten this to no longer use SQLite.  Yay!
>>>>> Here's an alternative implementation.  In FSFS, at commit time, new
>>>>> node IDs are rewritten from a temporary value like "_ab3" to a unique
>>>>> value by adding "ab3" to the "start_node_id" field in the current
>>>>> file.  This makes them not only unique, but also part of an ordered
>>>>> sequence without gaps.
>>>>>
>>>>> Is it actually important that node IDs be ordered and gapless?  We
>>>>> could just change new node-IDs (in format 3 repositories) to be built
>>>>> as "<rev>-ab3".  get-node-origin-rev would be trivial on these nodes.
>>>>> Pre-format-3 repositories, or nodes in format 3 repositories that
>>>>> aren't dumped and loaded, would require the slow crawl.
>>>> Like this.  Can somebody review?
>>> New version, supporting "svnadmin recover".  Barring objections, will
>>> commit later today.
>>
>> I do not have objections, but I did ask some questions in this message
>> that have not been answered:
>>
>> http://subversion.tigris.org/servlets/ReadMsg?list=dev&msgNo=134583
> 
> Requiring a dump and load just to get node-origins is a non-starter, in 
> my opinion.  And it is repositories large enough to making dumping and 
> loading such a pain that will suffer the most from *not* have the 
> node-origins table.
> 
> The get-location-segments fallback logic is based on 'svn log', which 
> requires a loooooong time to run against, say, APR's trunk (after four 
> minutes, get-location-segments.py against that URL still hadn't 
> completed).  Worse still, because of this new way David wishes to 
> implement the feature, all that cost would be paid *every time a merge 
> was requested*, rather than simply the first time as it is in the 
> current implementation.

Oops.  Besides the slew of typing errors made in this post, I also mis-thunk 
this bit.  We wouldn't need to hit the fallback logic because that is keyed 
on the server being pre-1.5, which is not the case we're discussing here. 
However, I still expect costs to be quite high as the server crawls all the 
revisions in APR's trunk, unable to make use of such shortcuts as that which 
the svn_fs_closest_copy() API provides.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Node origins cache rewrite

Posted by "C. Michael Pilato" <cm...@collab.net>.

Mark Phippard wrote:
> On Jan 28, 2008 12:55 PM, David Glasser <gl...@davidglasser.net> wrote:
>> On Jan 25, 2008 12:50 PM, David Glasser <gl...@davidglasser.net> wrote:
>>> On Jan 25, 2008 11:16 AM, David Glasser <gl...@davidglasser.net> wrote:
>>>> On Jan 24, 2008 6:54 PM, Mark Phippard <ma...@gmail.com> wrote:
>>>>> I see David has rewritten this to no longer use SQLite.  Yay!
>>>> Here's an alternative implementation.  In FSFS, at commit time, new
>>>> node IDs are rewritten from a temporary value like "_ab3" to a unique
>>>> value by adding "ab3" to the "start_node_id" field in the current
>>>> file.  This makes them not only unique, but also part of an ordered
>>>> sequence without gaps.
>>>>
>>>> Is it actually important that node IDs be ordered and gapless?  We
>>>> could just change new node-IDs (in format 3 repositories) to be built
>>>> as "<rev>-ab3".  get-node-origin-rev would be trivial on these nodes.
>>>> Pre-format-3 repositories, or nodes in format 3 repositories that
>>>> aren't dumped and loaded, would require the slow crawl.
>>> Like this.  Can somebody review?
>> New version, supporting "svnadmin recover".  Barring objections, will
>> commit later today.
> 
> I do not have objections, but I did ask some questions in this message
> that have not been answered:
> 
> http://subversion.tigris.org/servlets/ReadMsg?list=dev&msgNo=134583

Requiring a dump and load just to get node-origins is a non-starter, in my 
opinion.  And it is repositories large enough to making dumping and loading 
such a pain that will suffer the most from *not* have the node-origins table.

The get-location-segments fallback logic is based on 'svn log', which 
requires a loooooong time to run against, say, APR's trunk (after four 
minutes, get-location-segments.py against that URL still hadn't completed). 
Worse still, because of this new way David wishes to implement the 
feature, all that cost would be paid *every time a merge was requested*, 
rather than simply the first time as it is in the current implementation.

So while I like the elegance of David's plan for storing new node-origins at 
commit time, I honestly believe the compat plan for existing repositories -- 
that is, the utter lack thereof -- would be a horrendous mistake for this 
community to make.  You might as well tell big projects that merges are 
effectively disabled for them until they dump and load.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Node origins cache rewrite

Posted by Mark Phippard <ma...@gmail.com>.

On Jan 28, 2008 12:55 PM, David Glasser <gl...@davidglasser.net> wrote:
> On Jan 25, 2008 12:50 PM, David Glasser <gl...@davidglasser.net> wrote:
> > On Jan 25, 2008 11:16 AM, David Glasser <gl...@davidglasser.net> wrote:
> > > On Jan 24, 2008 6:54 PM, Mark Phippard <ma...@gmail.com> wrote:
> > > > I see David has rewritten this to no longer use SQLite.  Yay!
> > >
> > > Here's an alternative implementation.  In FSFS, at commit time, new
> > > node IDs are rewritten from a temporary value like "_ab3" to a unique
> > > value by adding "ab3" to the "start_node_id" field in the current
> > > file.  This makes them not only unique, but also part of an ordered
> > > sequence without gaps.
> > >
> > > Is it actually important that node IDs be ordered and gapless?  We
> > > could just change new node-IDs (in format 3 repositories) to be built
> > > as "<rev>-ab3".  get-node-origin-rev would be trivial on these nodes.
> > > Pre-format-3 repositories, or nodes in format 3 repositories that
> > > aren't dumped and loaded, would require the slow crawl.
> >
> > Like this.  Can somebody review?
>
> New version, supporting "svnadmin recover".  Barring objections, will
> commit later today.

I do not have objections, but I did ask some questions in this message
that have not been answered:

http://subversion.tigris.org/servlets/ReadMsg?list=dev&msgNo=134583


-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Node origins cache rewrite

Posted by "C. Michael Pilato" <cm...@collab.net>.

David Glasser wrote:
> On Jan 25, 2008 12:50 PM, David Glasser <gl...@davidglasser.net> wrote:
>> On Jan 25, 2008 11:16 AM, David Glasser <gl...@davidglasser.net> wrote:
>>> On Jan 24, 2008 6:54 PM, Mark Phippard <ma...@gmail.com> wrote:
>>>> I see David has rewritten this to no longer use SQLite.  Yay!
>>> Here's an alternative implementation.  In FSFS, at commit time, new
>>> node IDs are rewritten from a temporary value like "_ab3" to a unique
>>> value by adding "ab3" to the "start_node_id" field in the current
>>> file.  This makes them not only unique, but also part of an ordered
>>> sequence without gaps.
>>>
>>> Is it actually important that node IDs be ordered and gapless?  We
>>> could just change new node-IDs (in format 3 repositories) to be built
>>> as "<rev>-ab3".  get-node-origin-rev would be trivial on these nodes.
>>> Pre-format-3 repositories, or nodes in format 3 repositories that
>>> aren't dumped and loaded, would require the slow crawl.
>> Like this.  Can somebody review?
> 
> New version, supporting "svnadmin recover".  Barring objections, will
> commit later today.

I object.  See other mails in the thread.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Node origins cache rewrite

Posted by David Glasser <gl...@davidglasser.net>.

On Jan 25, 2008 12:50 PM, David Glasser <gl...@davidglasser.net> wrote:
> On Jan 25, 2008 11:16 AM, David Glasser <gl...@davidglasser.net> wrote:
> > On Jan 24, 2008 6:54 PM, Mark Phippard <ma...@gmail.com> wrote:
> > > I see David has rewritten this to no longer use SQLite.  Yay!
> >
> > Here's an alternative implementation.  In FSFS, at commit time, new
> > node IDs are rewritten from a temporary value like "_ab3" to a unique
> > value by adding "ab3" to the "start_node_id" field in the current
> > file.  This makes them not only unique, but also part of an ordered
> > sequence without gaps.
> >
> > Is it actually important that node IDs be ordered and gapless?  We
> > could just change new node-IDs (in format 3 repositories) to be built
> > as "<rev>-ab3".  get-node-origin-rev would be trivial on these nodes.
> > Pre-format-3 repositories, or nodes in format 3 repositories that
> > aren't dumped and loaded, would require the slow crawl.
>
> Like this.  Can somebody review?

New version, supporting "svnadmin recover".  Barring objections, will
commit later today.

[[[
In FSFS, instead of having a node-origin cache on disk, just change
the node-id to contain the node-origin-rev.

That is, instead of (at commit finalization time) rewriting node IDs
based on a node ID counter in the "current" file, rewrite them as
"<base36>-<rev>".  Do the same for copy IDs, for consistency.  Do this
only in Format 3.

Now svn_fs_node_origin_rev is a trivial "look in the node ID"
operation, unless you're in Format 2 or a repository sneakily upgraded
without a dump and load (not really supported anyway), in which case
you still do the history walk.

* subversion/libsvn_fs_fs/fs.h
  (PATH_NODE_ORIGINS_DIR): Remove.
  (SVN_FS_FS__MIN_NO_GLOBAL_IDS_FORMAT): New.

* subversion/libsvn_fs_fs/fs_fs.c
  (path_node_origin): Remove.
  (svn_fs_fs__hotcopy): Don't copy node origins cache.
  (write_final_rev): Depending on FS format, make new IDs either from
   revnum or from counter.  Remove node_origins hash parameter.
  (write_current, write_final_current): Only write out node/copy IDs
   for old formats.
  (struct commit_baton): Remove node_origins hash.
  (commit_body): Only read in node/copy IDs for old formats.  Don't
   pass node_origins hash to write_final_rev.
  (svn_fs_fs__commit): Remove post-commit node origins cache update.
  (svn_fs_fs__create): Don't write initial values of node/copy ID
   counters for new file format.
  (recover_body): Only calculate maximum node/copy IDs in the
   repository for old formats.
  (svn_fs_fs__ensure_dir_exists): Move back to lock.c (where it was
   before r29018).
  (svn_fs_fs__get_node_origin, set_node_origin,
   svn_fs_fs__set_node_origins, svn_fs_fs__set_node_origin): Remove.

* subversion/libsvn_fs_fs/fs_fs.h
  (svn_fs_fs__ensure_dir_exists, svn_fs_fs__set_node_origin,
   svn_fs_fs__set_node_origins, svn_fs_fs__get_node_origin): Remove.

* subversion/libsvn_fs_fs/lock.c
  (ensure_dir_exists): Move this back here (reverting r29018).
  (write_digest_file): Adjust.

* subversion/libsvn_fs_fs/structure
  Remove references to node origin cache.  Describe new node-ID
  structure.

* subversion/libsvn_fs_fs/tree.c
  (fs_node_origin_rev): Remove use of cache.  If the node-ID contains
   a '-', return the number after it.
]]]
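
As a rough illustration of the lookup described in the log message
above (the real change is in C, in fs_node_origin_rev), the logic might
be sketched like this.  The function name, the slow_history_walk
callback, and the assumption that the revision after the '-' is written
in decimal are all illustrative and not taken from the patch.

```python
def node_origin_rev(node_id, slow_history_walk):
    """Return the revision in which the node was created.

    node_id: the node-ID part of an FSFS ID, e.g. "ab3-1234" for the
        proposed "<base36>-<rev>" form, or "ab3" for the old counter
        form.
    slow_history_walk: hypothetical callback standing in for the
        existing history crawl used when no revision is embedded.
    """
    _base36, sep, rev = node_id.partition("-")
    if sep:
        # New-style ID: the origin revision is embedded after the '-'.
        # Assumption: the revision is stored as a decimal number.
        return int(rev)
    # Old-style ID: nothing embedded, so fall back to the crawl.
    return slow_history_walk(node_id)
```

Only IDs minted under the new scheme take the fast path; anything
created before the format change still pays for the walk.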

-- 
David Glasser | glasser@davidglasser.net | http://www.davidglasser.net/

Re: Node origins cache rewrite

Posted by Mark Phippard <ma...@gmail.com>.

On Jan 27, 2008 12:10 PM, Justin Erenkrantz <ju...@erenkrantz.com> wrote:
> On Jan 27, 2008 8:43 AM, Mark Phippard <ma...@gmail.com> wrote:
> > If there was no fallback would you take the performance hit (which is
> > huge on your repository) or do the dump/load?  Both options are so
> > bad, that I was suggesting it might be worth the effort to possibly
> > support both options to give a better upgrade path.
>
> If doing a dump-load would give us remarkably better disk usage, ya,
> we'd do that.  But, AIUI, that's not the case here, is it?  That's
> only if a hypothetical separate 'node origin cache' were to be
> introduced separated from the per-inode one glasser just committed,
> right?

Glasser has proposed an alternative system that would remove the need
for the cache.  But this would require a dump/load for existing
repositories.  Compared to 1.4, your repository size would stay the
same.  The problem is that if you do not dump/load, then the code
needs to crawl the repository to get the same information.  This is
really slow.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Node origins cache rewrite

Posted by David Glasser <gl...@davidglasser.net>.

On Jan 27, 2008 9:10 AM, Justin Erenkrantz <ju...@erenkrantz.com> wrote:
> If doing a dump-load would give us remarkably better disk usage, ya,
> we'd do that.  But, AIUI, that's not the case here, is it?  That's
> only if a hypothetical separate 'node origin cache' were to be
> introduced separated from the per-inode one glasser just committed,
> right?

Yeah, see my posted patch which encodes the correct answer into node
IDs instead of having any sort of cache.

--dave

-- 
David Glasser | glasser@davidglasser.net | http://www.davidglasser.net/

Re: Node origins cache rewrite

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On Jan 27, 2008 8:43 AM, Mark Phippard <ma...@gmail.com> wrote:
> If there was no fallback would you take the performance hit (which is
> huge on your repository) or do the dump/load?  Both options are so
> bad, that I was suggesting it might be worth the effort to possibly
> support both options to give a better upgrade path.

If doing a dump-load would give us remarkably better disk usage, ya,
we'd do that.  But, AIUI, that's not the case here, is it?  That's
only if a hypothetical separate 'node origin cache' were to be
introduced separated from the per-inode one glasser just committed,
right?

> Well, I was suggesting some kind of filesystem format that handles
> lots of small files efficiently.  Depending on the format, you might
> not need a lot of disk space to handle it.

We currently use FreeBSD's ufs/ffs - IIRC, the FFS default block size
is 8KB; we *may* consider using FreeBSD 7's zfs implementation for our
new SVN server.  (So, at 8KB/inode with currently 1.2 million inodes,
I think that'd be 9GB for what is now a 24GB repository...I just did a
du, so the 24GB overall size is accurate...)

But, having multiple file-systems on one disk isn't efficient (or
reliable), so we try to avoid that.  -- justin

Re: Node origins cache rewrite

Posted by Mark Phippard <ma...@gmail.com>.

On Jan 27, 2008 11:34 AM, Justin Erenkrantz <ju...@erenkrantz.com> wrote:
> On Jan 27, 2008 8:08 AM, Mark Phippard <ma...@gmail.com> wrote:
> > Justin, what would ASF likely do?  Would you be able to dump/load that
> > repository or would you just incur the disk space cost of the cache
> > that currently exists?
>
> We'd just burn the disk space, I guess.  But, I'm not really sure it's
> worth an FS inode per SVN node-id...that just doesn't seem to scale.

If there was no fallback would you take the performance hit (which is
huge on your repository) or do the dump/load?  Both options are so
bad, that I was suggesting it might be worth the effort to possibly
support both options to give a better upgrade path.

> >  I imagine you could symlink that folder to a
> > volume that handles lots of small files more efficiently.
>
> We're not Google, so, throwing dedicated disks at it isn't a
> cost-effective option.  =)  -- justin

Well, I was suggesting some kind of filesystem format that handles
lots of small files efficiently.  Depending on the format, you might
not need a lot of disk space to handle it.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Node origins cache rewrite

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On Jan 27, 2008 8:08 AM, Mark Phippard <ma...@gmail.com> wrote:
> Justin, what would ASF likely do?  Would you be able to dump/load that
> repository or would you just incur the disk space cost of the cache
> that currently exists?

We'd just burn the disk space, I guess.  But, I'm not really sure it's
worth an FS inode per SVN node-id...that just doesn't seem to scale.

>  I imagine you could symlink that folder to a
> volume that handles lots of small files more efficiently.

We're not Google, so, throwing dedicated disks at it isn't a
cost-effective option.  =)  -- justin

Re: Node origins cache rewrite

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On Jan 28, 2008 10:25 AM, Mark Phippard <ma...@gmail.com> wrote:
> I do not recall Justin saying he would be happy with this.  Are you
> saying he would use svnsync to migrate and then "flip the switch"?

Doing a dump/load would be acceptable.  The actual downtime would be
quite minimal.

FWIW, we do plan to roll out a read-only replica using dav-proxy &
svnsync, so that'd minimize the downtime even further...  -- justin

Re: Node origins cache rewrite

Posted by Mark Phippard <ma...@gmail.com>.

On Jan 28, 2008 1:18 PM, David Glasser <gl...@davidglasser.net> wrote:
>
> On Jan 27, 2008 8:08 AM, Mark Phippard <ma...@gmail.com> wrote:
> >
> > Can you explain a little more the impact on existing repositories.
> >
> > 1) Dump/Load would generate the new node-ID so all would be good if
> > you did that approach.
>
> Yes.
>
> > 2) Say I have 10,000 revisions in my 1.4 repository.  I move to 1.5.
>
> See, what does that mean?  It could mean one of four things:
>
> (a) You don't change the FS format of the repository at all, so it's
> still '2'.  You have the poor performance of uncached history-walks.
> On the other hand, it is very likely that we will have to forbid you
> from using merge tracking on such a repository anyway.
>
> (b) You change the FS format to '3' using the *only currently
> officially supported method*, which is a dump and a load.  All's well.
>
> (c) You change the FS format to '3' using the unsupported method of
> manually changing the format number (and creating txn-current, and
> creating txn-protorevs, etc etc etc).  New node-IDs contain the rev in
> them; old node-IDs (including new noderevs from old nodes) don't and
> require the slow walk.  But hey, you just did something unsupported
> anyway.
>
> (d) You run some sort of "svnadmin upgrade" command which does the
> same thing as (c), except it's actually supported.  Then, well, yeah,
> you have the same downside, and us developers don't get the excuse of
> "you did something unsupported".

Well, we have the script to shard a repository.  That must change the
format.  I was also under the impression (back when SQLite was used)
that the first time you committed something with mergeinfo, the
SQLite db was created.  Did this not bump the format?  Maybe that was
just tied to sharding?


> Really I think if you care about the performance of this particular
> operation, then you dump and load.  (Or run svnsync, or whatever.)
> It's not that hard, and as long as you have space on the machine it
> shouldn't even require downtime.  Justin said the ASF would be happy
> with that.

I do not recall Justin saying he would be happy with this.  Are you
saying he would use svnsync to migrate and then "flip the switch"?

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Node origins cache rewrite

Posted by David Glasser <gl...@davidglasser.net>.

On Jan 27, 2008 8:08 AM, Mark Phippard <ma...@gmail.com> wrote:
>
> On Jan 25, 2008 3:50 PM, David Glasser <gl...@davidglasser.net> wrote:
> > On Jan 25, 2008 11:16 AM, David Glasser <gl...@davidglasser.net> wrote:
> > > On Jan 24, 2008 6:54 PM, Mark Phippard <ma...@gmail.com> wrote:
> > > > I see David has rewritten this to no longer use SQLite.  Yay!
> > >
> > > Here's an alternative implementation.  In FSFS, at commit time, new
> > > node IDs are rewritten from a temporary value like "_ab3" to a unique
> > > value by adding "ab3" to the "start_node_id" field in the current
> > > file.  This makes them not only unique, but also part of an ordered
> > > sequence without gaps.
> > >
> > > Is it actually important that node IDs be ordered and gapless?  We
> > > could just change new node-IDs (in format 3 repositories) to be built
> > > as "<rev>-ab3".  get-node-origin-rev would be trivial on these nodes.
> > > Pre-format-3 repositories, or nodes in format 3 repositories that
> > > aren't dumped and loaded, would require the slow crawl.
> >
> > Like this.  Can somebody review?
> >
> > [[[
> > In FSFS, instead of having a node-origin cache on disk, just change
> > the node-id to contain the node-origin-rev.
> >
> > That is, instead of (at commit finalization time) rewriting node IDs
> > based on a node ID counter in the "current" file, rewrite them as
> > "base36-REV".  Do the same for copy IDs, just for the hell of it.  Do
> > this only in Format 3.
> >
> > Now svn_fs_node_origin_rev is a trivial "look in the node ID"
> > operation, unless you're in Format 2 or a repository sneakily upgraded
> > without a dump and load (not really supported anyway), in which case
> > you still do the history walk.
> >
> > *******************************************************************
> > *** svn 1.5 adds "svnadmin recover" to FSFS which fixes the two ***
> > *** of current that were removed here; this code has not been   ***
> > *** updated.                                                    ***
> > *******************************************************************
>
> Can you explain a little more the impact on existing repositories.
>
> 1) Dump/Load would generate the new node-ID so all would be good if
> you did that approach.

Yes.

> 2) Say I have 10,000 revisions in my 1.4 repository.  I move to 1.5.

See, what does that mean?  It could mean one of four things:

(a) You don't change the FS format of the repository at all, so it's
still '2'.  You have the poor performance of uncached history-walks.
On the other hand, it is very likely that we will have to forbid you
from using merge tracking on such a repository anyway.

(b) You change the FS format to '3' using the *only currently
officially supported method*, which is a dump and a load.  All's well.

(c) You change the FS format to '3' using the unsupported method of
manually changing the format number (and creating txn-current, and
creating txn-protorevs, etc etc etc).  New node-IDs contain the rev in
them; old node-IDs (including new noderevs from old nodes) don't and
require the slow walk.  But hey, you just did something unsupported
anyway.

(d) You run some sort of "svnadmin upgrade" command which does the
same thing as (c), except it's actually supported.  Then, well, yeah,
you have the same downside, and us developers don't get the excuse of
"you did something unsupported".
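
The four cases above can be summarized in a small sketch.  It is only a
model of the behaviour described in this thread: the function name is
made up, and it assumes that old counter-style node IDs never contain a
'-'.

```python
def origin_lookup_is_fast(fs_format, node_id):
    """True if the origin revision can be read straight off the ID.

    fs_format: the FS format number of the repository (2 or 3 here).
    node_id: the node-ID string, old style ("ab3") or new ("ab3-1234").
    """
    if fs_format < 3:
        # Case (a): format 2 never embeds a revision in the node ID.
        return False
    # Format 3: only IDs minted under the new scheme carry a revision.
    return "-" in node_id


# (a) repository left at format 2: always the slow walk
print(origin_lookup_is_fast(2, "ab3"))
# (b) dump and load into format 3: every node gets a new-style ID
print(origin_lookup_is_fast(3, "ab3-1234"))
# (c)/(d) upgraded in place: nodes that predate the upgrade stay slow
print(origin_lookup_is_fast(3, "ab3"))
```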

Really I think if you care about the performance of this particular
operation, then you dump and load.  (Or run svnsync, or whatever.)
It's not that hard, and as long as you have space on the machine it
shouldn't even require downtime.  Justin said the ASF would be happy
with that.

> Do new nodes that are created pick up the new node ID's?  So you get
> mixture of performance based on what node was created?  Does the code
> detect whether the node-ID contains a revision based on some
> heuristic, or does it assume based on the format?

Assumes based on format.

> 3) Would it be possible to have a conversion routine that re-writes
> the node-ID's?

Not without being as expensive (and far more error-prone) as a dump and load.

> 4) How hard would it be to have a hybrid approach?  Someone with a 1.4
> repository could incur the time to a dump/load, or they could run the
> svn-populate-node-origins-index routine to generate the current style
> cache.  If we detect the new node-ID we use that, if not, we fallback
> to code that looks for the cache and lazy populates it?

I don't see how it's worth it.

--dave

-- 
David Glasser | glasser@davidglasser.net | http://www.davidglasser.net/

Re: Node origins cache rewrite

Posted by Mark Phippard <ma...@gmail.com>.

On Jan 25, 2008 3:50 PM, David Glasser <gl...@davidglasser.net> wrote:
> On Jan 25, 2008 11:16 AM, David Glasser <gl...@davidglasser.net> wrote:
> > On Jan 24, 2008 6:54 PM, Mark Phippard <ma...@gmail.com> wrote:
> > > I see David has rewritten this to no longer use SQLite.  Yay!
> >
> > Here's an alternative implementation.  In FSFS, at commit time, new
> > node IDs are rewritten from a temporary value like "_ab3" to a unique
> > value by adding "ab3" to the "start_node_id" field in the current
> > file.  This makes them not only unique, but also part of an ordered
> > sequence without gaps.
> >
> > Is it actually important that node IDs be ordered and gapless?  We
> > could just change new node-IDs (in format 3 repositories) to be built
> > as "<rev>-ab3".  get-node-origin-rev would be trivial on these nodes.
> > Pre-format-3 repositories, or nodes in format 3 repositories that
> > aren't dumped and loaded, would require the slow crawl.
>
> Like this.  Can somebody review?
>
> [[[
> In FSFS, instead of having a node-origin cache on disk, just change
> the node-id to contain the node-origin-rev.
>
> That is, instead of (at commit finalization time) rewriting node IDs
> based on a node ID counter in the "current" file, rewrite them as
> "base36-REV".  Do the same for copy IDs, just for the hell of it.  Do
> this only in Format 3.
>
> Now svn_fs_node_origin_rev is a trivial "look in the node ID"
> operation, unless you're in Format 2 or a repository sneakily upgraded
> without a dump and load (not really supported anyway), in which case
> you still do the history walk.
>
> *******************************************************************
> *** svn 1.5 adds "svnadmin recover" to FSFS which fixes the two ***
> *** of current that were removed here; this code has not been   ***
> *** updated.                                                    ***
> *******************************************************************

Can you explain a little more the impact on existing repositories.

1) Dump/Load would generate the new node-ID so all would be good if
you did that approach.

2) Say I have 10,000 revisions in my 1.4 repository.  I move to 1.5.
Do new nodes that are created pick up the new node ID's?  So you get a
mixture of performance based on what node was created?  Does the code
detect whether the node-ID contains a revision based on some
heuristic, or does it assume based on the format?

3) Would it be possible to have a conversion routine that re-writes
the node-ID's?

4) How hard would it be to have a hybrid approach?  Someone with a 1.4
repository could incur the time to do a dump/load, or they could run the
svn-populate-node-origins-index routine to generate the current style
cache.  If we detect the new node-ID we use that, if not, we fall back
to code that looks for the cache and lazily populates it?

I imagine #4 has a lot of ickiness in terms of code bloat.  I just
know that dump/load can be a real burden for some repositories and
they might prefer the option to carry the extra disk space of the
current style cache.

Justin, what would ASF likely do?  Would you be able to dump/load that
repository or would you just incur the disk space cost of the cache
that currently exists?  I imagine you could symlink that folder to a
volume that handles lots of small files more efficiently.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Node origins cache rewrite

Posted by David Glasser <gl...@davidglasser.net>.

On Jan 25, 2008 11:16 AM, David Glasser <gl...@davidglasser.net> wrote:
> On Jan 24, 2008 6:54 PM, Mark Phippard <ma...@gmail.com> wrote:
> > I see David has rewritten this to no longer use SQLite.  Yay!
>
> Here's an alternative implementation.  In FSFS, at commit time, new
> node IDs are rewritten from a temporary value like "_ab3" to a unique
> value by adding "ab3" to the "start_node_id" field in the current
> file.  This makes them not only unique, but also part of an ordered
> sequence without gaps.
>
> Is it actually important that node IDs be ordered and gapless?  We
> could just change new node-IDs (in format 3 repositories) to be built
> as "<rev>-ab3".  get-node-origin-rev would be trivial on these nodes.
> Pre-format-3 repositories, or nodes in format 3 repositories that
> aren't dumped and loaded, would require the slow crawl.

Like this.  Can somebody review?

[[[
In FSFS, instead of having a node-origin cache on disk, just change
the node-id to contain the node-origin-rev.

That is, instead of (at commit finalization time) rewriting node IDs
based on a node ID counter in the "current" file, rewrite them as
"base36-REV".  Do the same for copy IDs, just for the hell of it.  Do
this only in Format 3.

Now svn_fs_node_origin_rev is a trivial "look in the node ID"
operation, unless you're in Format 2 or a repository sneakily upgraded
without a dump and load (not really supported anyway), in which case
you still do the history walk.

*******************************************************************
*** svn 1.5 adds "svnadmin recover" to FSFS which fixes the two ***
*** fields of current that were removed here; this code has not ***
*** been updated.                                               ***
*******************************************************************

* subversion/libsvn_fs_fs/fs.h
  (PATH_NODE_ORIGINS_DIR): Remove.
  (SVN_FS_FS__MIN_NO_GLOBAL_IDS_FORMAT): New.

* subversion/libsvn_fs_fs/fs_fs.c
  (path_node_origin): Remove.
  (svn_fs_fs__hotcopy): Don't copy node origins cache.
  (write_final_rev): Depending on FS format, make new IDs either from
   revnum or from counter.  Remove node_origins hash parameter.
  (write_current, write_final_current): Only write out node/copy IDs
   for old formats.
  (struct commit_baton): Remove node_origins hash.
  (commit_body): Only read in node/copy IDs for old formats.  Don't
   pass node_origins hash to write_final_rev.
  (svn_fs_fs__commit): Remove post-commit node origins cache update.
  (svn_fs_fs__create): Don't write initial values of node/copy ID
   counters for new file format.
  (svn_fs_fs__ensure_dir_exists): Move back to lock.c (where it was
   before r29018).
  (svn_fs_fs__get_node_origin, set_node_origin,
   svn_fs_fs__set_node_origins, svn_fs_fs__set_node_origin): Remove.

* subversion/libsvn_fs_fs/fs_fs.h
  (svn_fs_fs__ensure_dir_exists, svn_fs_fs__set_node_origin,
   svn_fs_fs__set_node_origins, svn_fs_fs__get_node_origin): Remove.

* subversion/libsvn_fs_fs/lock.c
  (ensure_dir_exists): Move this back here (reverting r29018).
  (write_digest_file): Adjust.

* subversion/libsvn_fs_fs/structure
  Remove references to node origin cache.  Describe new node-ID
  structure.

* subversion/libsvn_fs_fs/tree.c
  (fs_node_origin_rev): Remove use of cache.  If the node-ID contains
   a '-', return the number after it.
]]]
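
For reviewers who want the idea without reading the diff, here is a
minimal sketch in Python rather than the C in libsvn_fs_fs.  It assumes
the "base36-REV" layout from the log message, with REV written in
decimal after the '-'; the function names are made up for illustration.

```python
from typing import Optional


def finalize_with_rev(temp_id: str, rev: int) -> str:
    """Turn a transaction-local ID such as "_ab3" into "ab3-<rev>"."""
    base = temp_id[1:] if temp_id.startswith("_") else temp_id
    return f"{base}-{rev}"


def origin_rev(node_id: str) -> Optional[int]:
    """Return the revision embedded in a new-style node ID.

    Returns None for an old-style ID (no '-'), in which case the
    caller still has to do the history walk.
    """
    _, sep, rev = node_id.partition("-")
    return int(rev) if sep else None
```

Here origin_rev("ab3-3219") gives 3219, and origin_rev("1z7") gives None.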

-- 
David Glasser | glasser@davidglasser.net | http://www.davidglasser.net/

Re: Node origins cache rewrite

Posted by David Glasser <gl...@davidglasser.net>.

On Jan 24, 2008 6:54 PM, Mark Phippard <ma...@gmail.com> wrote:
> I see David has rewritten this to no longer use SQLite.  Yay!

Here's an alternative implementation.  In FSFS, at commit time, new
node IDs are rewritten from a temporary value like "_ab3" to a unique
value by adding "ab3" to the "start_node_id" field in the current
file.  This makes them not only unique, but also part of an ordered
sequence without gaps.
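
A rough Python sketch of that existing counter scheme, as I understand
it (the real code is C, and the names here are invented for
illustration): strip the leading '_', treat the rest as a base-36
offset, and add it to the start value read from the current file.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def int_to_base36(value: int) -> str:
    """Encode a non-negative integer using digits 0-9 then a-z."""
    if value == 0:
        return "0"
    out = []
    while value:
        value, rem = divmod(value, 36)
        out.append(DIGITS[rem])
    return "".join(reversed(out))


def finalize_with_counter(temp_id: str, start_node_id: str) -> str:
    """Map a temporary ID like "_ab3" to start_node_id + ab3 (base 36)."""
    if not temp_id.startswith("_"):
        return temp_id  # already a permanent ID
    offset = int(temp_id[1:], 36)
    return int_to_base36(int(start_node_id, 36) + offset)
```

Because every commit adds its offsets to the previous end of the
sequence, the resulting IDs are unique and leave no gaps.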

Is it actually important that node IDs be ordered and gapless?  We
could just change new node-IDs (in format 3 repositories) to be built
as "<rev>-ab3".  get-node-origin-rev would be trivial on these nodes.
Pre-format-3 repositories, or nodes in format 3 repositories that
aren't dumped and loaded, would require the slow crawl.

--dave


-- 
David Glasser | glasser@davidglasser.net | http://www.davidglasser.net/

Re: Node origins cache rewrite

Posted by Eric Gillespie <ep...@pretzelnet.org>.

"David Glasser" <gl...@davidglasser.net> writes:

> On Jan 24, 2008 7:37 PM, Eric Gillespie <ep...@pretzelnet.org> wrote:
> > I don't think so.  We do need to "shard" it, though, based on the
> > sharded line in the format file.
> 
> I don't see why the shardedness needs to be the same as the rev
> shardedness.  The keys are not revnums.  A fixed shardedness should be
> fine.

You don't know how many nodes I'll have, nor what the best upper
limit for directories on my file system is.  However, I've
already informed svn of an appropriate sharding level in the
format file.  It would be rude to make me configure this twice.

--  
Eric Gillespie <*> epg@pretzelnet.org

Re: Node origins cache rewrite

Posted by David Glasser <gl...@davidglasser.net>.

On Jan 24, 2008 7:37 PM, Eric Gillespie <ep...@pretzelnet.org> wrote:
> I don't think so.  We do need to "shard" it, though, based on the
> sharded line in the format file.

I don't see why the shardedness needs to be the same as the rev
shardedness.  The keys are not revnums.  A fixed shardedness should be
fine.

--dave

-- 
David Glasser | glasser@davidglasser.net | http://www.davidglasser.net/

Re: Node origins cache rewrite

Posted by Eric Gillespie <ep...@pretzelnet.org>.

"Mark Phippard" <ma...@gmail.com> writes:

> As I said previously, just 100,000 nodes X 4kb block size is 400 MB of
> disk space used.  Don't we think users might complain about the
> increase?  Even if the repository is already 4 GB, I am sure they
> would still notice the increase.

Admins of large repositories want a separate file system for this
cache, one created with small files in mind, and maybe one in
memory rather than on disk.  This kind of flexibility is the
beauty of using the file system directly.  With BDB, you get what
you get ;->.

> That being said, I suppose we should only do this if there is some
> number of nodes at which we would want to consider changing the
> design.

I don't think so.  We do need to "shard" it, though, based on the
sharded line in the format file.

-- 
Eric Gillespie <*> epg@pretzelnet.org

Re: Node origins cache rewrite

Posted by "C. Michael Pilato" <cm...@collab.net>.

David Glasser wrote:
>> Does the Python script to generate the cache still work?  I wonder if
>> we could modify it or otherwise make it available for people to run on
>> some repositories to get an idea of the number of nodes in their
>> repository.  It would be interesting to see how many nodes are in the
>> ASF repository.  Perhaps we could run it on some of our large
>> repositories at CollabNet as well.
> 
> I expect it ought to still work, yes.  Though hmm, I thought you said
> your repositories are BDB?

It's not a Python script, it's a compiled program (so that packagers can 
deliver it easily on every platform).  The program assumes that 
svn_fs_node_origin_rev() will, upon *not* finding an origin in the index, 
compute the answer *and store it in the index*.  This is not a guarantee 
of the API, of course (because FSFS should promise no such thing).  But 
it'll be our dirty little secret.

At any rate, I think it will still work fine after this change.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Node origins cache rewrite

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On Jan 25, 2008 8:10 AM, Mark Phippard <ma...@gmail.com> wrote:
> > Decimal translations of the base-36 values above: 1212766 326012
>
> So the cache would add roughly 4.9 GB of disk space for you?

Eww.  The repository is only ~12GB now...  -- justin

Re: Node origins cache rewrite

Posted by Mark Phippard <ma...@gmail.com>.

On Jan 25, 2008 11:03 AM, Justin Erenkrantz <ju...@erenkrantz.com> wrote:
> On Jan 25, 2008 5:55 AM, Mark Phippard <ma...@gmail.com> wrote:
> > the design.  As an example, if the ASF repository had over a million
> > nodes would we want to change the design?  By the same token, we would
> > probably all feel better to learn that the ASF repository only had
> > 10,000 nodes, were that the case.
>
> 615253 pzry 6zjw
>
> Decimal translations of the base-36 values above: 1212766 326012

So the cache would add roughly 4.9 GB of disk space for you?

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Node origins cache rewrite

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On Jan 25, 2008 5:55 AM, Mark Phippard <ma...@gmail.com> wrote:
> the design.  As an example, if the ASF repository had over a million
> nodes would we want to change the design?  By the same token, we would
> probably all feel better to learn that the ASF repository only had
> 10,000 nodes, were that the case.

615253 pzry 6zjw

Decimal translations of the base-36 values above: 1212766 326012

HTH.  -- justin

Re: Node origins cache rewrite

Posted by Malcolm Rowe <ma...@farside.org.uk>.

On Fri, Jan 25, 2008 at 09:38:47AM -0500, C. Michael Pilato wrote:
> David Glasser wrote:
>> If you want to know the number of nodes (for FSFS) you can just look
>> at the current file.
>
> Does that number include (as BDB's equivalent would) nodes that came into 
> being only in failed commit transactions?
>

No, transactions have temporary node IDs; they get converted to real
IDs during commit finalisation.

Regards,
Malcolm

Re: Node origins cache rewrite

Posted by "C. Michael Pilato" <cm...@collab.net>.

David Glasser wrote:
> If you want to know the number of nodes (for FSFS) you can just look
> at the current file.

Does that number include (as BDB's equivalent would) nodes that came into 
being only in failed commit transactions?

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Node origins cache rewrite

Posted by "C. Michael Pilato" <cm...@collab.net>.

kmradke@rockwellcollins.com wrote:
> Possibly.  However, we have been VERY happy with FSFS.  I'd
> hate to change just for this reason.  It would be interesting
> to hear why collabnet still uses BDB, but that would
> belong in a different thread.

No need to spawn a whole thread for a simple answer:  when CollabNet started 
hosting with Subversion, there was no FSFS.  Then there was an FSFS, but it 
was brand new and not field tested to the degree that BDB was.  Then it was 
field tested and found to be on par with BDB's stability, and so now the 
CollabNet product can use either back-end.  But CollabNet's Operations group 
isn't in the habit of changing things without good reason, and actually 
finds that due to the peculiarities of the way they run the show, BDB is a 
better fit.  Besides, BDB's shortcomings are well known at this point, 
whereas we still hear today of random seriously lossy corruptions of FSFS 
repositories.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Node origins cache rewrite

Posted by km...@rockwellcollins.com.

"Mark Phippard" <ma...@gmail.com> wrote on 01/25/2008 09:49:19 AM:
> On Jan 25, 2008 10:45 AM,  <km...@rockwellcollins.com> wrote:
> > The current physical size on disk of this repo is 9.6GB.
> > (svn 1.4)
> >
> > 190k * 4k = 760MB.
> >
> > So worst case, that would be another 300GB of storage
> > for our ~400 repos.  Ugh.
> 
> Wow, 400 repositories!  You must have quite the SAN.

I'm just glad I didn't have to buy it... :)
(Luckily, not all of our repos have that many revisions, but
 our largest physical repo is ~45GB, so it all averages out.)

> Would using BDB be an option?

Possibly.  However, we have been VERY happy with FSFS.  I'd
hate to change just for this reason.  It would be interesting
to hear why collabnet still uses BDB, but that would
belong in a different thread.

Kevin R.

Re: Node origins cache rewrite

Posted by Mark Phippard <ma...@gmail.com>.

On Jan 25, 2008 10:45 AM,  <km...@rockwellcollins.com> wrote:
> The current physical size on disk of this repo is 9.6GB.
> (svn 1.4)
>
> 190k * 4k = 760MB.
>
> So worst case, that would be another 300GB of storage
> for our ~400 repos.  Ugh.

Wow, 400 repositories!  You must have quite the SAN.

Would using BDB be an option?

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Node origins cache rewrite

Posted by km...@rockwellcollins.com.

"Mark Phippard" <ma...@gmail.com> wrote on 01/25/2008 09:25:42 AM:
> On Jan 25, 2008 10:16 AM,  <km...@rockwellcollins.com> wrote:
> > Malcolm Rowe <ma...@farside.org.uk> wrote on 01/25/2008 09:07:18 AM:
> > > On Fri, Jan 25, 2008 at 09:47:23AM -0500, Mark Phippard wrote:
> > > > I thought that file just stored the HEAD revision.
> > > >
> > > > I have a local cache of the Subclipse repository that has this value:
> > > >
> > > > 3218 1z7 160
> > > >
> > > > What does it mean?
> > >
> > > Last revision: 3218
> > > next node id: 1z7
> > > next copy id: 160
> >
> > So our "biggest" one is this:
> >
> > 56242 42yz dn3
> 
> 42yz = 190475
> 
> Assuming this is Base-36.

The current physical size on disk of this repo is 9.6GB.
(svn 1.4)

190k * 4k = 760MB.

So worst case, that would be another 300GB of storage
for our ~400 repos.  Ugh.
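
The same back-of-the-envelope estimate as a small Python sketch, in case
others want to plug in their own "current" values.  It assumes one cache
file per node, one 4096-byte block per file, and that the next-node-id
field can stand in for the node count; the function name is made up.

```python
def cache_overhead_bytes(next_node_id: str, block_size: int = 4096) -> int:
    """Worst-case disk use: one cache file per node, one block per file."""
    return int(next_node_id, 36) * block_size


biggest = cache_overhead_bytes("42yz")
print(f"one repo:  {biggest / 1e6:.0f} MB")
print(f"400 repos: {400 * biggest / 1e9:.0f} GB (if all were this large)")
```

Using 4096 rather than a round 4000 bytes per block nudges the totals
up slightly from the figures above.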

Kevin R.

Re: Node origins cache rewrite

Posted by Mark Phippard <ma...@gmail.com>.

On Jan 25, 2008 10:16 AM,  <km...@rockwellcollins.com> wrote:
> Malcolm Rowe <ma...@farside.org.uk> wrote on 01/25/2008 09:07:18
> AM:
>
> > On Fri, Jan 25, 2008 at 09:47:23AM -0500, Mark Phippard wrote:
> > > I thought that file just stored the HEAD revision.
> > >
> > > I have a local cache of the Subclipse repository that has this value:
> > >
> > > 3218 1z7 160
> > >
> > > What does it mean?
> > >
> >
> > Last revision: 3218
> > next node id: 1z7
> > next copy id: 160
>
> So our "biggest" one is this:
>
> 56242 42yz dn3

42yz = 190475

Assuming this is Base-36.
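
For anyone else converting their own values, a quick Python sketch
(assuming, as above, digits 0-9 followed by a-z; the helper name is
made up):

```python
def base36_to_int(text: str) -> int:
    """Decode a base-36 string (digits 0-9, then a-z) into an integer."""
    return int(text, 36)


# Values quoted in this thread: next node id and next copy id.
for key in ("1z7", "160", "42yz", "dn3"):
    print(key, "=", base36_to_int(key))
```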

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Node origins cache rewrite

Posted by Mark Phippard <ma...@gmail.com>.

On Jan 25, 2008 10:15 AM, C. Michael Pilato <cm...@collab.net> wrote:
> >>> What does it mean?
> >>>
> >> Last revision: 3218
> >> next node id: 1z7
> >
> > Right, but what's the radix? (Or, more interesting: what's the decimal
> > representation of 1z7?)
>
> radix 36; 2563?

I did a dump/load on this repository.  OS X reports the cache as
occupying 10.1 MB on disk, although the cache files themselves total
only 146 KB.
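
The gap is what you would expect if every tiny file still occupies a
whole block.  A small Python sketch of that arithmetic, assuming
4096-byte blocks and no special handling of small files by the file
system (the function name is made up):

```python
from typing import Iterable


def allocated_bytes(file_sizes: Iterable[int], block_size: int = 4096) -> int:
    """Sum of file sizes after rounding each one up to whole blocks."""
    total = 0
    for size in file_sizes:
        blocks = (size + block_size - 1) // block_size  # ceiling division
        total += blocks * block_size
    return total


# 2563 cache files of roughly 57 bytes each (146 KB spread over 2563 nodes).
sizes = [57] * 2563
print("apparent: ", sum(sizes), "bytes")
print("allocated:", allocated_bytes(sizes), "bytes")
```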

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Node origins cache rewrite

Posted by "C. Michael Pilato" <cm...@collab.net>.

Erik Huelsmann wrote:
> On 1/25/08, Malcolm Rowe <ma...@farside.org.uk> wrote:
>> On Fri, Jan 25, 2008 at 09:47:23AM -0500, Mark Phippard wrote:
>>> I thought that file just stored the HEAD revision.
>>>
>>> I have a local cache of the Subclipse repository that has this value:
>>>
>>> 3218 1z7 160
>>>
>>> What does it mean?
>>>
>> Last revision: 3218
>> next node id: 1z7
> 
> Right, but what's the radix? (Or, more interesting: what's the decimal
> representation of 1z7?)

radix 36; 2563?

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Node origins cache rewrite

Posted by Erik Huelsmann <eh...@gmail.com>.

On 1/25/08, Malcolm Rowe <ma...@farside.org.uk> wrote:
> On Fri, Jan 25, 2008 at 09:47:23AM -0500, Mark Phippard wrote:
> > I thought that file just stored the HEAD revision.
> >
> > I have a local cache of the Subclipse repository that has this value:
> >
> > 3218 1z7 160
> >
> > What does it mean?
> >
>
> Last revision: 3218
> next node id: 1z7

Right, but what's the radix? (Or, more interesting: what's the decimal
representation of 1z7?)

> next copy id: 160


Bye,

Erik.

Re: Node origins cache rewrite

Posted by km...@rockwellcollins.com.

Malcolm Rowe <ma...@farside.org.uk> wrote on 01/25/2008 09:07:18 AM:
> On Fri, Jan 25, 2008 at 09:47:23AM -0500, Mark Phippard wrote:
> > I thought that file just stored the HEAD revision.
> > 
> > I have a local cache of the Subclipse repository that has this value:
> > 
> > 3218 1z7 160
> > 
> > What does it mean?
> > 
> 
> Last revision: 3218
> next node id: 1z7
> next copy id: 160

So our "biggest" one is this:

56242 42yz dn3

Can the last 2 be converted to "real" numbers of nodes?

Kevin R.

Re: Node origins cache rewrite

Posted by Malcolm Rowe <ma...@farside.org.uk>.

On Fri, Jan 25, 2008 at 09:47:23AM -0500, Mark Phippard wrote:
> I thought that file just stored the HEAD revision.
> 
> I have a local cache of the Subclipse repository that has this value:
> 
> 3218 1z7 160
> 
> What does it mean?
> 

Last revision: 3218
next node id: 1z7
next copy id: 160
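
If it helps, the line can be pulled apart with a few lines of Python.
This is only a sketch: it assumes the three space-separated fields shown
above, with the two IDs in base 36, and the names are invented.

```python
from typing import NamedTuple


class CurrentFile(NamedTuple):
    youngest_rev: int
    next_node_id: int
    next_copy_id: int


def parse_current(line: str) -> CurrentFile:
    """Split a "current" line into revision, next node ID, next copy ID."""
    rev, node_id, copy_id = line.split()
    # The revision is decimal; the two IDs are assumed to be base 36.
    return CurrentFile(int(rev), int(node_id, 36), int(copy_id, 36))


print(parse_current("3218 1z7 160"))
```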

Regards,
Malcolm

Re: Node origins cache rewrite

Posted by Mark Phippard <ma...@gmail.com>.

On Jan 25, 2008 9:31 AM, David Glasser <gl...@davidglasser.net> wrote:
>
> On Jan 25, 2008 5:55 AM, Mark Phippard <ma...@gmail.com> wrote:
> > On Jan 24, 2008 10:04 PM, David Glasser <gl...@davidglasser.net> wrote:
> >
> > > > Does the Python script to generate the cache still work?  I wonder if
> > > > we could modify it or otherwise make it available for people to run on
> > > > some repositories to get an idea of the number of nodes in their
> > > > repository.  It would be interesting to see how many nodes are in the
> > > > ASF repository.  Perhaps we could run it on some of our large
> > > > repositories at CollabNet as well.
> > >
> > > I expect it ought to still work, yes.  Though hmm, I thought you said
> > > your repositories are BDB?
> >
> > What I was getting at, was that the script (turns out it is not a
> > script) could likely be modified to simply count the number of nodes.
> > We could then run it on some large repositories to get an idea how
> > many nodes they contain.  As I said though, there is no point in doing
> > this unless there is a number at which point we would want to change
> > the design.  As an example, if the ASF repository had over a million
> > nodes would we want to change the design?  By the same token, we would
> > probably all feel better to learn that the ASF repository only had
> > 10,000 nodes, were that the case.
>
> If you want to know the number of nodes (for FSFS) you can just look
> at the current file.

I thought that file just stored the HEAD revision.

I have a local cache of the Subclipse repository that has this value:

3218 1z7 160

What does it mean?

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Node origins cache rewrite

Posted by David Glasser <gl...@davidglasser.net>.

On Jan 25, 2008 5:55 AM, Mark Phippard <ma...@gmail.com> wrote:
> On Jan 24, 2008 10:04 PM, David Glasser <gl...@davidglasser.net> wrote:
>
> > > Does the Python script to generate the cache still work?  I wonder if
> > > we could modify it or otherwise make it available for people to run on
> > > some repositories to get an idea of the number of nodes in their
> > > repository.  It would be interesting to see how many nodes are in the
> > > ASF repository.  Perhaps we could run it on some of our large
> > > repositories at CollabNet as well.
> >
> > I expect it ought to still work, yes.  Though hmm, I thought you said
> > your repositories are BDB?
>
> What I was getting at, was that the script (turns out it is not a
> script) could likely be modified to simply count the number of nodes.
> We could then run it on some large repositories to get an idea how
> many nodes they contain.  As I said though, there is no point in doing
> this unless there is a number at which point we would want to change
> the design.  As an example, if the ASF repository had over a million
> nodes would we want to change the design?  By the same token, we would
> probably all feel better to learn that the ASF repository only had
> 10,000 nodes, were that the case.

If you want to know the number of nodes (for FSFS) you can just look
at the current file.

--dave


-- 
David Glasser | glasser@davidglasser.net | http://www.davidglasser.net/

Re: Node origins cache rewrite

Posted by David O'Shea <da...@s3group.com>.

On 25/01/2008 13:55, Mark Phippard wrote:
> What I was getting at, was that the script (turns out it is not a
> script) could likely be modified to simply count the number of nodes.
> We could then run it on some large repositories to get an idea how
> many nodes they contain.  As I said though, there is no point in doing
> this unless there is a number at which point we would want to change
> the design.  As an example, if the ASF repository had over a million
> nodes would we want to change the design?  By the same token, we would
> probably all feel better to learn that the ASF repository only had
> 10,000 nodes, were that the case.

The biggest repository I've checked has

5094 dhtu m58

or 629634 * 4k -> ~2.5GB

Current repository size is ~25GB so a growth of 10% in this case.

David.

Re: Node origins cache rewrite

Posted by Mark Phippard <ma...@gmail.com>.

On Jan 24, 2008 10:04 PM, David Glasser <gl...@davidglasser.net> wrote:

> > Does the Python script to generate the cache still work?  I wonder if
> > we could modify it or otherwise make it available for people to run on
> > some repositories to get an idea of the number of nodes in their
> > repository.  It would be interesting to see how many nodes are in the
> > ASF repository.  Perhaps we could run it on some of our large
> > repositories at CollabNet as well.
>
> I expect it ought to still work, yes.  Though hmm, I thought you said
> your repositories are BDB?

What I was getting at, was that the script (turns out it is not a
script) could likely be modified to simply count the number of nodes.
We could then run it on some large repositories to get an idea how
many nodes they contain.  As I said though, there is no point in doing
this unless there is a number at which point we would want to change
the design.  As an example, if the ASF repository had over a million
nodes would we want to change the design?  By the same token, we would
probably all feel better to learn that the ASF repository only had
10,000 nodes, were that the case.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Node origins cache rewrite

Posted by David Glasser <gl...@davidglasser.net>.

On Jan 24, 2008 6:54 PM, Mark Phippard <ma...@gmail.com> wrote:
> I see David has rewritten this to no longer use SQLite.  Yay!
>
> That being said, I do still have some reservations.  Keep in mind that
> CollabNet uses BDB repositories, so I am just speaking from what we
> have heard in the past from users.

Right, BDB is irrelevant here (it uses its own tables).

> How many nodes will a large repository have?  We have heard from users
> with working copies with thousands of folders and tens or hundreds of
> thousands of files.  If this represents their trunk, and they have many
> branches with modifications, how many nodes can they expect?
>
> As I said previously, just 100,000 nodes X 4kb block size is 400 MB of
> disk space used.  Don't we think users might complain about the
> increase?  Even if the repository is already 4 GB, I am sure they
> would still notice the increase.

My assumption is that the cache will be strictly smaller than a single
checkout of *one branch* of (every project in) the repository, and if
that doesn't fit on your server, then something's wrong already.

(As a completely separate issue, perhaps the structure should be
sharded; I'm fine with people trying that.)

> Does the Python script to generate the cache still work?  I wonder if
> we could modify it or otherwise make it available for people to run on
> some repositories to get an idea of the number of nodes in their
> repository.  It would be interesting to see how many nodes are in the
> ASF repository.  Perhaps we could run it on some of our large
> repositories at CollabNet as well.

I expect it ought to still work, yes.  Though hmm, I thought you said
your repositories are BDB?

> That being said, I suppose we should only do this if there is some
> number of nodes at which we would want to consider changing the
> design.
>
> When we came up with this design, how many nodes were we thinking
> might typically exist?  What is it optimized for?

Optimized for "not requiring a prereq for one lousy little cache".

--dave


-- 
David Glasser | glasser@davidglasser.net | http://www.davidglasser.net/
