You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Justin Erenkrantz <ju...@erenkrantz.com> on 2010/09/04 17:18:56 UTC

Repeated SQL queries when doing 'svn st'

When compiled with SVN_DEBUG and SQLITE3_DEBUG and 'svn st' against a
svn trunk WC, a number of things pop out.

We perform 28,062 SQL queries.

---
DBG: sqlite.c:  63: sql="select root, uuid from repository where id = 1;"
---

We execute *this* query (STMT_SELECT_REPOSITORY_BY_ID) 2215 times.  Yikes.

I think this has to do with svn_wc__db_base_get_info's call to
fetch_repos_info.  I'd think we'd be able to cache this result.  I'll
take a stab and see if this reduction saves us any real time.  The
root and uuid should be constant for an wc_id...right?

For each file that we hit the DB for, we execute the following queries:

---
DBG: sqlite.c:  63: sql="select repos_id, repos_relpath, presence,
kind, revnum, checksum,   translated_size, changed_rev, changed_date,
changed_author, depth,   symlink_target, last_mod_time, properties
from base_node where wc_id = 1 and local_relpath =
'contrib/server-side/fsfsverify.py';"
DBG: sqlite.c:  63: sql="select presence, kind, checksum,
translated_size,   changed_rev, changed_date, changed_author, depth,
symlink_target,   copyfrom_repos_id, copyfrom_repos_path,
copyfrom_revnum,   moved_here, moved_to, last_mod_time, properties
from working_node where wc_id = 1 and local_relpath =
'contrib/server-side/fsfsverify.py';"
DBG: sqlite.c:  63: sql="select prop_reject, changelist, conflict_old,
conflict_new, conflict_working, tree_conflict_data, properties from
actual_node where wc_id = 1 and local_relpath =
'contrib/server-side/fsfsverify.py';"
DBG: sqlite.c:  63: sql="select base_node.repos_id,
base_node.repos_relpath, presence, kind,   revnum, checksum,
translated_size, changed_rev, changed_date,   changed_author, depth,
symlink_target, last_mod_time, properties,   lock_token, lock_owner,
lock_comment, lock_date from base_node left outer join lock on
base_node.repos_id = lock.repos_id   and base_node.repos_relpath =
lock.repos_relpath where wc_id = 1 and local_relpath =
'contrib/server-side/fsfsverify.py';"
DBG: sqlite.c:  63: sql="select presence, kind, checksum,
translated_size,   changed_rev, changed_date, changed_author, depth,
symlink_target,   copyfrom_repos_id, copyfrom_repos_path,
copyfrom_revnum,   moved_here, moved_to, last_mod_time, properties
from working_node where wc_id = 1 and local_relpath =
'contrib/server-side/fsfsverify.py';"
DBG: sqlite.c:  63: sql="select prop_reject, changelist, conflict_old,
conflict_new, conflict_working, tree_conflict_data, properties from
actual_node where wc_id = 1 and local_relpath =
'contrib/server-side/fsfsverify.py';"
DBG: sqlite.c:  63: sql="select root, uuid from repository where id = 1;"
DBG: sqlite.c:  63: sql="select prop_reject, changelist, conflict_old,
conflict_new, conflict_working, tree_conflict_data, properties from
actual_node where wc_id = 1 and local_relpath =
'contrib/server-side';"
DBG: sqlite.c:  63: sql="SELECT properties, presence FROM WORKING_NODE
WHERE wc_id = 1 AND local_relpath =
'contrib/server-side/fsfsverify.py';"
DBG: sqlite.c:  63: sql="select properties from base_node where wc_id
= 1 and local_relpath = 'contrib/server-side/fsfsverify.py';"
DBG: sqlite.c:  63: sql="select properties from actual_node where
wc_id = 1 and local_relpath = 'contrib/server-side/fsfsverify.py';"
DBG: sqlite.c:  63: sql="SELECT properties, presence FROM WORKING_NODE
WHERE wc_id = 1 AND local_relpath =
'contrib/server-side/fsfsverify.py';"
DBG: sqlite.c:  63: sql="select properties from base_node where wc_id
= 1 and local_relpath = 'contrib/server-side/fsfsverify.py';"
---

Notably, AFAICT, we're repeating a few of these queries:

- STMT_SELECT_WORKING_NODE (2 times)
- STMT_SELECT_ACTUAL_NODE (3 times)
- STMT_SELECT_WORKING_PROPS (2 times)
- STMT_SELECT_BASE_PROPS (2 times)

I haven't yet dug into why we're repeating the queries.

So, I'd bet we can cut our volume of SQL calls dramatically with some
minor rejiggering not to lose the values when we do the first calls of
the statement.  -- justin

Re: Repeated SQL queries when doing 'svn st'

Posted by Stefan Fuhrmann <st...@alice-dsl.de>.

Philip Martin wrote:
> Branko Čibej <br...@xbc.nu> writes:
>
>   
>>  On 06.09.2010 12:16, Philip Martin wrote:
>>     
>>> To use a per-directory query strategy we would probably have to cache
>>> data in memory, although not to the same extent as in 1.6.  We should
>>> probably avoid having Subversion make status callbacks into the
>>> application while a query is in progress, so we would accumulate all
>>> the row data and complete the query before making any callbacks.  Some
>>> sort of private svn_wc__db_node_t to hold the results of the select
>>> would probably be sufficient.
>>>       
>> I wonder if per-directory is really necessary; I guess I'm worrying
>> about the case were the WC tree has lots of directories with few files.
>> Do we not have the whole tree in a single Sqlide DB now? Depending on
>> the schema, it might be possible to load the status information from the
>> database in one single query.
>>     
>
> Yes, per-tree would probably work but I expect most WCs have more
> files than directories so the gains over per-dir would be small.  One
> big advantage of doing status per-tree is that it gives a proper
> snapshot, the tree cannot be modified during the status walk.  I'm not
> pushing per-dir as the final solution, my point is that per-node
> SQLite queries are not going to be fast enough.
There are actually two or three reasons why status should
run queries on directory granularity:

* directories usually resemble files in that opening them is
  expensive relative to reading their content
* operation can be canceled in a timely manner (may or may
  not be an issue with huge SQL query results)
* maybe: queries for a specific folder may be simpler / faster
  than for sub-trees (depends on schema)

Also, I don't think there is a need to cache query results.
Instead, the algorithm should be modified to look like this:

dir_status:

    // get all relevant info; each array sorted by name
    stat_recorded = sql_query("BASE + recorded change info of dir entries")
    stat_actual = read_dir()
    prop_changes = sql_query("find prop changes in dir")
  
    // "align" / "merge" arrays and send results to client
    foreach name do
       recorded= has(stat_recorded,name) ? stat_recorded[name] : NULL;
       actual = has(stat_actual,name) ? stat_actual[name] : NULL;
       changed_props = has(prop_changes,name) ? prop_changes[name] : NULL;

       // compare file content if necessary
       if (recorded&& actual && needs_content_check(recorded, actual))
          actual = check_content(name)

       send_node_status(recorded, actual, changed_props)

Only two SQL queries (give or take) per directory.

-- Stefan^2.

Re: Repeated SQL queries when doing 'svn st'

Posted by Philip Martin <ph...@wandisco.com>.

Branko Čibej <br...@xbc.nu> writes:

>  On 06.09.2010 12:16, Philip Martin wrote:
>> To use a per-directory query strategy we would probably have to cache
>> data in memory, although not to the same extent as in 1.6.  We should
>> probably avoid having Subversion make status callbacks into the
>> application while a query is in progress, so we would accumulate all
>> the row data and complete the query before making any callbacks.  Some
>> sort of private svn_wc__db_node_t to hold the results of the select
>> would probably be sufficient.
>
> I wonder if per-directory is really necessary; I guess I'm worrying
> about the case were the WC tree has lots of directories with few files.
> Do we not have the whole tree in a single Sqlide DB now? Depending on
> the schema, it might be possible to load the status information from the
> database in one single query.

Yes, per-tree would probably work but I expect most WCs have more
files than directories so the gains over per-dir would be small.  One
big advantage of doing status per-tree is that it gives a proper
snapshot, the tree cannot be modified during the status walk.  I'm not
pushing per-dir as the final solution, my point is that per-node
SQLite queries are not going to be fast enough.

-- 
Philip

Re: Repeated SQL queries when doing 'svn st'

Posted by Branko Čibej <br...@xbc.nu>.

 On 06.09.2010 12:16, Philip Martin wrote:
> To use a per-directory query strategy we would probably have to cache
> data in memory, although not to the same extent as in 1.6.  We should
> probably avoid having Subversion make status callbacks into the
> application while a query is in progress, so we would accumulate all
> the row data and complete the query before making any callbacks.  Some
> sort of private svn_wc__db_node_t to hold the results of the select
> would probably be sufficient.

I wonder if per-directory is really necessary; I guess I'm worrying
about the case were the WC tree has lots of directories with few files.
Do we not have the whole tree in a single Sqlide DB now? Depending on
the schema, it might be possible to load the status information from the
database in one single query.

(On a side note ... it's all very well to shun specialized caching and
expect your database layer do do it for you [optimally], but I see no
reason to believe that Sqlite would be significantly better at that than
BDB.)

-- Brane

Re: Repeated SQL queries when doing 'svn st'

Posted by Philip Martin <ph...@wandisco.com>.

Justin Erenkrantz <ju...@erenkrantz.com> writes:

> When compiled with SVN_DEBUG and SQLITE3_DEBUG and 'svn st' against a
> svn trunk WC, a number of things pop out.
>
> We perform 28,062 SQL queries.

It's not just repeat queries that are the problem, we simply make too
many queries.  This is mainly because the code is converted from 1.6
when we had in an in memory entry cache.

In 1.7 'svn status' is doing queries per-node, something like:

   dir_status:
     node_status
     "select count(*) from base_node where parent_relpath = "
     "select local_relpath, kind from base_node where parent_relpath = "

     for each child
       "select * from base_node where local_relpath = "
       if kind is dir
         recurse
       else
         node_status

Back in February I wrote a test program to investigate whether it was
possible for SQLite and single-db to be fast enough:

http://mail-archives.apache.org/mod_mbox/subversion-dev/201002.mbox/<87y6iq2ym4.fsf%40stat.home.lan>
http://svn.haxx.se/dev/archive-2010-02/0457.shtml

The status code in my test program was doing queries per-directory,
not per-node, something like:

   dir_status:
     node_status
     "select * from base_node where parent_relpath = "
     for each row
       if kind is dir
         push subdirs(name)
       else
         node_status
     for each subdir
       recurse


I've modified my test program to do per-node queries[1], making it
more like 1.7, and the performance is significantly worse.  It's about
a factor of 7 slower, both in CPU and runtime, on the dataset I was
using. The test program goes from being a little faster than 1.6 to be
being significantly slower.

To use a per-directory query strategy we would probably have to cache
data in memory, although not to the same extent as in 1.6.  We should
probably avoid having Subversion make status callbacks into the
application while a query is in progress, so we would accumulate all
the row data and complete the query before making any callbacks.  Some
sort of private svn_wc__db_node_t to hold the results of the select
would probably be sufficient.



[1] client to query database

#include "svn_pools.h"
#include "svn_sqlite.h"
#include <stdio.h>

static svn_error_t *
status_query(svn_sqlite__db_t *sdb,
             const char *local_relpath,
             svn_boolean_t display,
             apr_pool_t *pool)
{
  svn_sqlite__stmt_t *stmt;
  svn_boolean_t have_row;
  const char *kind;
  apr_pool_t *subpool;
  apr_array_header_t *subdirs = apr_array_make(pool, 10, sizeof(const char *));
  int i;
  apr_array_header_t *files;

  SVN_ERR(svn_sqlite__get_statement(&stmt, sdb, 0));
  SVN_ERR(svn_sqlite__bindf(stmt, "is", 1, local_relpath));
  SVN_ERR(svn_sqlite__step(&have_row, stmt));
  if (!have_row)
    {
      SVN_ERR(svn_sqlite__reset(stmt));
      return SVN_NO_ERROR;
    }
  kind = svn_sqlite__column_text(stmt, 0, pool);
  if (display)
    printf("%s %s\n", local_relpath, kind);
  SVN_ERR(svn_sqlite__reset(stmt));
  
  if (!strcmp(kind, "dir"))
    {
#ifdef PER_NODE
      int num;

      SVN_ERR(svn_sqlite__get_statement(&stmt, sdb, 2));
      SVN_ERR(svn_sqlite__bindf(stmt, "is", 1, local_relpath));
      SVN_ERR(svn_sqlite__step(&have_row, stmt));
      if (!have_row)
        {
          SVN_ERR(svn_sqlite__reset(stmt));
          return SVN_NO_ERROR;
        }
      num =  svn_sqlite__column_int64(stmt, 0);
      SVN_ERR(svn_sqlite__reset(stmt));
      files = apr_array_make(pool, num, sizeof(const char *));
#endif
      SVN_ERR(svn_sqlite__get_statement(&stmt, sdb, 1));
      SVN_ERR(svn_sqlite__bindf(stmt, "is", 1, local_relpath));
      SVN_ERR(svn_sqlite__step(&have_row, stmt));
      while (have_row)
        {
          const char *child_relpath = svn_sqlite__column_text(stmt, 0, NULL);
          kind = svn_sqlite__column_text(stmt, 1, NULL);
          if (!strcmp(kind, "dir"))
            APR_ARRAY_PUSH(subdirs, const char *)
              = apr_pstrdup(pool, child_relpath);
#ifdef PER_NODE
          else
            APR_ARRAY_PUSH(files, const char *)
              = apr_pstrdup(pool, child_relpath);
#else
          else if (display)
            printf("%s %s\n", child_relpath, kind);
#endif
          SVN_ERR(svn_sqlite__step(&have_row, stmt));
        }
      SVN_ERR(svn_sqlite__reset(stmt));
    }

  subpool = svn_pool_create(pool);

#ifdef PER_NODE
  for (i = 0; i < files->nelts; ++i)
    {
      const char *child_relpath = APR_ARRAY_IDX(files, i, const char*);
      svn_pool_clear(subpool);
      SVN_ERR(svn_sqlite__get_statement(&stmt, sdb, 0));
      SVN_ERR(svn_sqlite__bindf(stmt, "is", 1, child_relpath));
      SVN_ERR(svn_sqlite__step(&have_row, stmt));
      if (!have_row)
        {
          SVN_ERR(svn_sqlite__reset(stmt));
          return SVN_NO_ERROR;
        }
      kind = svn_sqlite__column_text(stmt, 0, pool);
      if (display)
        printf("%s %s\n", child_relpath, kind);
      SVN_ERR(svn_sqlite__reset(stmt));
    }
#endif

  for (i = 0; i < subdirs->nelts; ++i)
    {
      svn_pool_clear(subpool);
      SVN_ERR(status_query(sdb, APR_ARRAY_IDX(subdirs, i, const char*), display,
                           subpool));
    }
  svn_pool_destroy(subpool);

  return SVN_NO_ERROR;
}

int main()
{
  apr_pool_t *pool;  
  svn_sqlite__db_t *sdb;
  const char * const statements[] = {
    "select kind from base_node" \
    "   where wc_id = ?1 and local_relpath = ?2;",
    "select local_relpath, kind from base_node" \
    "   where wc_id = ?1 and parent_relpath = ?2;",
    "select count(*) from base_node" \
    "   where wc_id = ?1 and parent_relpath = ?2;",
    NULL
  };

  apr_initialize();
  pool = svn_pool_create(NULL);
  SVN_INT_ERR(svn_sqlite__open(&sdb, "wcx.db", svn_sqlite__mode_rwcreate,
                               statements, 0, NULL, pool, pool));
  SVN_INT_ERR(status_query(sdb, "", FALSE, pool));

  return EXIT_SUCCESS;
}

-- 
Philip

Re: Repeated SQL queries when doing 'svn st'

Posted by Branko Čibej <br...@xbc.nu>.

 On 04.09.2010 21:45, Justin Erenkrantz wrote:
> On Sat, Sep 4, 2010 at 10:18 AM, Justin Erenkrantz
> <ju...@erenkrantz.com> wrote:
>> Notably, AFAICT, we're repeating a few of these queries:
>>
>> - STMT_SELECT_WORKING_NODE (2 times)
>> - STMT_SELECT_ACTUAL_NODE (3 times)
>> - STMT_SELECT_WORKING_PROPS (2 times)
>> - STMT_SELECT_BASE_PROPS (2 times)
>>
>> I haven't yet dug into why we're repeating the queries.
> Okay - I now know why we're repeating the core queries twice.
>
> In get_dir_status, we want to do a check to identify if the node
> exists and what kind it is - which is done by a call to
> svn_wc__db_read_info (around line 1269 in status.c).  But, most of the
> parameters (except for node and kind) are NULL.  If it's not excluded
> and we can go into the depth, then we call handle_dir_entry on the
> entry a few lines down - which turns right around and calls
> svn_wc__db_read_info - this time asking for everything.
>
> This causes the core per-file queries to be executed twice.
>
> I'm going to see what a quick check to retrieve just the kind and
> status will do for the query volume.  I think it's unlikely we have to
> pull everything out of sqlite to answer that basic question.  --
> justin

Possibly this existence check could be one single query for the whole WC
and the results cached in memory? There shouldn't be a significant
difference in per-query overhead, and you need all those results in any
case for a whole-depth status. Of course it increases memory usage, but
really ... I can't see that as terribly significant.

    $ sudo find -x / -print | wc
      775161 1091167 81342644

80 megs of "file metadata" on my box with some 120 gigs of stuff and OS
install on it, I doubt even a fairly large working copy would do worse
than that.

-- Brane

Re: Repeated SQL queries when doing 'svn st'

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On Sat, Sep 4, 2010 at 12:45 PM, Justin Erenkrantz
<ju...@erenkrantz.com> wrote:
> I'm going to see what a quick check to retrieve just the kind and
> status will do for the query volume.  I think it's unlikely we have to
> pull everything out of sqlite to answer that basic question.  --
> justin

We can reduce the query volume by one (per file) as we can skip the
active table query - see quick & dirty patch below.

It does replace 2 larger queries with 2 smaller ones, but I think our
bottleneck is likely more around query volume than query complexity...

Anyway, I'll stop replying to myself and enjoy the long weekend.  =)  -- justin

Index: status.c
===================================================================
--- status.c	(revision 992534)
+++ status.c	(working copy)
@@ -1263,10 +1270,7 @@ get_dir_status(const struct walk_status_baton *wb,
           svn_wc__db_status_t node_status;
           svn_wc__db_kind_t node_kind;

-          SVN_ERR(svn_wc__db_read_info(&node_status, &node_kind, NULL, NULL,
-                                   NULL, NULL, NULL, NULL, NULL, NULL, NULL,
-                                   NULL, NULL, NULL, NULL, NULL, NULL, NULL,
-                                   NULL, NULL, NULL, NULL, NULL, NULL,
+          SVN_ERR(svn_wc__db_read_info_exist(&node_status, &node_kind,
                                    wb->db, node_abspath, iterpool, iterpool));

           if (node_status != svn_wc__db_status_not_present
Index: wc_db.c
===================================================================
--- wc_db.c	(revision 992534)
+++ wc_db.c	(working copy)
@@ -51,7 +51,6 @@
 #include "private/svn_wc_private.h"
 #include "private/svn_token.h"

-
 #define NOT_IMPLEMENTED() SVN__NOT_IMPLEMENTED()


@@ -5051,6 +5050,161 @@ svn_wc__db_temp_op_delete(svn_wc__db_t *db,


 svn_error_t *
+svn_wc__db_read_info_exist(svn_wc__db_status_t *status,
+                     svn_wc__db_kind_t *kind,
+                     svn_wc__db_t *db,
+                     const char *local_abspath,
+                     apr_pool_t *result_pool,
+                     apr_pool_t *scratch_pool)
+{
+  svn_wc__db_pdh_t *pdh;
+  const char *local_relpath;
+  svn_sqlite__stmt_t *stmt_base;
+  svn_sqlite__stmt_t *stmt_work;
+  svn_boolean_t have_base;
+  svn_boolean_t have_work;
+  svn_error_t *err = NULL;
+
+  SVN_ERR_ASSERT(svn_dirent_is_absolute(local_abspath));
+
+  SVN_ERR(svn_wc__db_pdh_parse_local_abspath(&pdh, &local_relpath, db,
+                              local_abspath, svn_sqlite__mode_readonly,
+                              scratch_pool, scratch_pool));
+  VERIFY_USABLE_PDH(pdh);
+
+  SVN_ERR(svn_sqlite__get_statement(&stmt_base, pdh->wcroot->sdb,
+                                    STMT_SELECT_BASE_NODE_EXIST));
+  SVN_ERR(svn_sqlite__bindf(stmt_base, "is",
+                            pdh->wcroot->wc_id, local_relpath));
+  SVN_ERR(svn_sqlite__step(&have_base, stmt_base));
+
+  SVN_ERR(svn_sqlite__get_statement(&stmt_work, pdh->wcroot->sdb,
+                                    STMT_SELECT_WORKING_NODE_EXIST));
+  SVN_ERR(svn_sqlite__bindf(stmt_work, "is",
+                            pdh->wcroot->wc_id, local_relpath));
+  SVN_ERR(svn_sqlite__step(&have_work, stmt_work));
+
+  if (have_base || have_work)
+    {
+      svn_wc__db_kind_t node_kind;
+
+      if (have_work)
+        node_kind = svn_sqlite__column_token(stmt_work, 1, kind_map);
+      else
+        node_kind = svn_sqlite__column_token(stmt_base, 1, kind_map);
+
+      if (status)
+        {
+          if (have_base)
+            {
+              *status = svn_sqlite__column_token(stmt_base, 0, presence_map);
+
+              /* We have a presence that allows a WORKING_NODE override
+                 (normal or not-present), or we don't have an override.  */
+              /* ### for now, allow an override of an incomplete BASE_NODE
+                 ### row. it appears possible to get rows in BASE/WORKING
+                 ### both set to 'incomplete'.  */
+              SVN_ERR_ASSERT((*status != svn_wc__db_status_absent
+                              && *status != svn_wc__db_status_excluded
+                              /* && *status != svn_wc__db_status_incomplete */)
+                             || !have_work);
+
+#ifndef SVN_WC__SINGLE_DB
+              if (node_kind == svn_wc__db_kind_subdir
+                  && *status == svn_wc__db_status_normal)
+                {
+                  /* We should have read a row from the subdir wc.db. It
+                     must be obstructed in some way.
+
+                     It is also possible that a WORKING node will override
+                     this value with a proper status.  */
+                  *status = svn_wc__db_status_obstructed;
+                }
+#endif
+            }
+
+          if (have_work)
+            {
+              svn_wc__db_status_t work_status;
+
+              work_status = svn_sqlite__column_token(stmt_work, 0,
+                                                     presence_map);
+              SVN_ERR_ASSERT(work_status == svn_wc__db_status_normal
+                             || work_status == svn_wc__db_status_not_present
+                             || work_status == svn_wc__db_status_base_deleted
+                             || work_status == svn_wc__db_status_incomplete
+                             || work_status == svn_wc__db_status_excluded);
+
+              if (work_status == svn_wc__db_status_incomplete)
+                {
+                  *status = svn_wc__db_status_incomplete;
+                }
+              else if (work_status == svn_wc__db_status_excluded)
+                {
+                  *status = svn_wc__db_status_excluded;
+                }
+              else if (work_status == svn_wc__db_status_not_present
+                       || work_status == svn_wc__db_status_base_deleted)
+                {
+                  /* The caller should scan upwards to detect whether this
+                     deletion has occurred because this node has been moved
+                     away, or it is a regular deletion. Also note that the
+                     deletion could be of the BASE tree, or a child of
+                     something that has been copied/moved here. */
+
+#ifndef SVN_WC__SINGLE_DB
+                  /* If we're looking at the data in the parent, then
+                     something has obstructed the child data. Inform
+                     the caller.  */
+                  if (node_kind == svn_wc__db_kind_subdir)
+                    *status = svn_wc__db_status_obstructed_delete;
+                  else
+#endif
+                    *status = svn_wc__db_status_deleted;
+                }
+              else /* normal */
+                {
+                  /* The caller should scan upwards to detect whether this
+                     addition has occurred because of a simple addition,
+                     a copy, or is the destination of a move. */
+
+#ifndef SVN_WC__SINGLE_DB
+                  /* If we're looking at the data in the parent, then
+                     something has obstructed the child data. Inform
+                     the caller.  */
+                  if (node_kind == svn_wc__db_kind_subdir)
+                    *status = svn_wc__db_status_obstructed_add;
+                  else
+#endif
+                    *status = svn_wc__db_status_added;
+                }
+            }
+        }
+      if (kind)
+        {
+#ifndef SVN_WC__SINGLE_DB
+          if (node_kind == svn_wc__db_kind_subdir)
+            *kind = svn_wc__db_kind_dir;
+          else
+#endif
+            *kind = node_kind;
+        }
+    }
+  else
+    {
+      err = svn_error_createf(SVN_ERR_WC_PATH_NOT_FOUND, NULL,
+                              _("The node '%s' was not found."),
+                              svn_dirent_local_style(local_abspath,
+                                                     scratch_pool));
+    }
+
+  err = svn_error_compose_create(err, svn_sqlite__reset(stmt_base));
+  SVN_ERR(svn_error_compose_create(err, svn_sqlite__reset(stmt_work)));
+
+  return SVN_NO_ERROR;
+}
+
+svn_error_t *
 svn_wc__db_read_info(svn_wc__db_status_t *status,
                      svn_wc__db_kind_t *kind,
                      svn_revnum_t *revision,
Index: wc_db.h
===================================================================
--- wc_db.h	(revision 992534)
+++ wc_db.h	(working copy)
@@ -1523,7 +1523,16 @@ svn_wc__db_read_info(svn_wc__db_status_t *status,
                      apr_pool_t *result_pool,
                      apr_pool_t *scratch_pool);

+svn_error_t *
+svn_wc__db_read_info_exist(svn_wc__db_status_t *status,  /* ### derived */
+                     svn_wc__db_kind_t *kind,

+                     svn_wc__db_t *db,
+                     const char *local_abspath,
+                     apr_pool_t *result_pool,
+                     apr_pool_t *scratch_pool);
+
+
 /* Set *PROPVAL to the value of the property named PROPNAME of the node
    LOCAL_ABSPATH in the ACTUAL tree (looking through to the WORKING or BASE
    tree as required).
Index: wc-queries.sql
===================================================================
--- wc-queries.sql	(revision 992534)
+++ wc-queries.sql	(working copy)
@@ -43,6 +43,11 @@ left outer join lock on base_node.repos_id = lock.
   and base_node.repos_relpath = lock.repos_relpath
 where wc_id = ?1 and local_relpath = ?2;

+-- STMT_SELECT_BASE_NODE_EXIST
+select presence, kind
+from base_node
+where wc_id = ?1 and local_relpath = ?2;
+
 -- STMT_SELECT_WORKING_NODE
 select presence, kind, checksum, translated_size,
   changed_rev, changed_date, changed_author, depth, symlink_target,
@@ -51,6 +56,11 @@ select presence, kind, checksum, translated_size,
 from working_node
 where wc_id = ?1 and local_relpath = ?2;

+-- STMT_SELECT_WORKING_NODE_EXIST
+select presence, kind
+from working_node
+where wc_id = ?1 and local_relpath = ?2;
+
 -- STMT_SELECT_ACTUAL_NODE
 select prop_reject, changelist, conflict_old, conflict_new,
 conflict_working, tree_conflict_data, properties

Re: Repeated SQL queries when doing 'svn st'

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On Sat, Sep 4, 2010 at 10:18 AM, Justin Erenkrantz
<ju...@erenkrantz.com> wrote:
> Notably, AFAICT, we're repeating a few of these queries:
>
> - STMT_SELECT_WORKING_NODE (2 times)
> - STMT_SELECT_ACTUAL_NODE (3 times)
> - STMT_SELECT_WORKING_PROPS (2 times)
> - STMT_SELECT_BASE_PROPS (2 times)
>
> I haven't yet dug into why we're repeating the queries.

Okay - I now know why we're repeating the core queries twice.

In get_dir_status, we want to do a check to identify if the node
exists and what kind it is - which is done by a call to
svn_wc__db_read_info (around line 1269 in status.c).  But, most of the
parameters (except for node and kind) are NULL.  If it's not excluded
and we can go into the depth, then we call handle_dir_entry on the
entry a few lines down - which turns right around and calls
svn_wc__db_read_info - this time asking for everything.

This causes the core per-file queries to be executed twice.

I'm going to see what a quick check to retrieve just the kind and
status will do for the query volume.  I think it's unlikely we have to
pull everything out of sqlite to answer that basic question.  --
justin

RE: Repeated SQL queries when doing 'svn st'

Posted by Bert Huijben <be...@qqmail.nl>.

> -----Original Message-----
> From: justin.erenkrantz@gmail.com [mailto:justin.erenkrantz@gmail.com]
> On Behalf Of Justin Erenkrantz
> Sent: zaterdag 4 september 2010 20:20
> To: Subversion Development
> Subject: Re: Repeated SQL queries when doing 'svn st'
> 
> On Sat, Sep 4, 2010 at 10:18 AM, Justin Erenkrantz
> <ju...@erenkrantz.com> wrote:
> > When compiled with SVN_DEBUG and SQLITE3_DEBUG and 'svn st' against a
> > svn trunk WC, a number of things pop out.
> >
> > We perform 28,062 SQL queries.
> >
> > ---
> > DBG: sqlite.c:  63: sql="select root, uuid from repository where id =
> 1;"
> > ---
> >
> > We execute *this* query (STMT_SELECT_REPOSITORY_BY_ID) 2215 times.
>  Yikes.
> >
> > I think this has to do with svn_wc__db_base_get_info's call to
> > fetch_repos_info.  I'd think we'd be able to cache this result.  I'll
> > take a stab and see if this reduction saves us any real time.  The
> > root and uuid should be constant for an wc_id...right?
> 
> It's actually svn_wc__db_read_info's fetch_repos_info call...

A working copy should have only one repository, so while we could cache the
result (usually 1 value per wcroot) we might be able to avoid the call
completely.
(Using switch --relocate halfway in your working copy most likely breaks
'svn update' anyway).

A major issue in our code is that in all of our info and scan functions we
usually use a pattern like
* get_statement()
* <use-statement>
* reset_statement()
* get_statement()
...
* reset_statement()
(This is the safest pattern that makes sure that we reset statements)

With this pattern we release the shared lock with each reset and when we
start using the next statement we have to reobtain the lock. Releasing all
of them at the end would be much faster

[<snip from "Re: Worried about single-db performance" thread>]
> Aha.  Adding exclusive locking into our pragma
> [http://www.sqlite.org/pragma.html] calls in "svn_sqlite__open":
> 
> "PRAGMA locking_mode=exclusive;"
> 
> brings the time for "svn st" down from 0.680 to 0.310 seconds.  And,
> yes, the I/O percentages drop dramatically:

I had even better results by just adding the SQLITE 3.6.8+ specific
"SAVEPOINT q;"
After the initial pragma block. 

This just keeps the shared read lock (instead of an exclusive write lock),
but for some reason it had a better result for svn status. 
(Maybe sqlite verifies if it still has the write lock? (Don't know))

(Just taking an exclusive write lock would break current API users like
TortoiseSVN, which perform background status requests while other Subversion
tools access the working copy)

I think that for a 'svn status' it would be safe to use this block around
all operations, but if we are worried about taking the lock to long we could
use it per directory. (The SAVEPOINT operations can be nested, so that would
also safely speed up 'svn status -u' which does multiple small local status
walks).

It would be interesting to see what combining the SAVEPOINT and
read_uncommitted pragma would do here. (And maybe we should also look at the
Write Ahead Logging here, but that only works for local disks)

	Bert

Re: Repeated SQL queries when doing 'svn st'

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On Sat, Sep 4, 2010 at 10:18 AM, Justin Erenkrantz
<ju...@erenkrantz.com> wrote:
> When compiled with SVN_DEBUG and SQLITE3_DEBUG and 'svn st' against a
> svn trunk WC, a number of things pop out.
>
> We perform 28,062 SQL queries.
>
> ---
> DBG: sqlite.c:  63: sql="select root, uuid from repository where id = 1;"
> ---
>
> We execute *this* query (STMT_SELECT_REPOSITORY_BY_ID) 2215 times.  Yikes.
>
> I think this has to do with svn_wc__db_base_get_info's call to
> fetch_repos_info.  I'd think we'd be able to cache this result.  I'll
> take a stab and see if this reduction saves us any real time.  The
> root and uuid should be constant for an wc_id...right?

It's actually svn_wc__db_read_info's fetch_repos_info call...

With 2215 queries: ~/Applications/svn-trunk-no-debug/bin/svn st  0.26s
user 0.05s system 98% cpu 0.311 total
With a quick-and-hacky cache:
~/Applications/svn-trunk-no-debug/bin/svn st  0.25s user 0.05s system
98% cpu 0.298 total

It's worth a good 4% time savings...

A quick back-of-the-envelope calculation says that if we can remove
all of the extraneous 13,290 SQL queries (out of 28,062 ; leaving
behind 14,772 queries) - we will likely gain something like 25% from
the 0.311 down to around 0.233 seconds.

It's still much higher than 0.050 than 'svn st' on 1.6.x yields, but
inching closer...  -- justin