You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2022/10/24 16:34:43 UTC

[GitHub] [couchdb] nickva opened a new pull request, #4238: Optimize _bulk_get endpoint

nickva opened a new pull request, #4238:
URL: https://github.com/apache/couchdb/pull/4238

   Use new `fabric:open_revs/3` API implemented in #4201 to optimize the _bulk_get HTTP API. Since `open_revs/3` itself is new, allow reverting to individual doc fetches using the previous `open_revs/4` API via a config setting, mostly as a precautionary measure.
   
   The implementation consists of three main parts:
     * Parse and validate args
     * Fetch the docs using `open_revs/3` or `open_revs/4`
     * Emit results as json or multipart, based on the `Accept` header value
   
   Parsing and validation checks for various errors and then returns a map of `#{Ref => {DocId, RevOrError, DocOptions}}` and a list of Refs in the original argument order. The middle tuple element of `RevOrError` is notable that it may hold either the revision ID (`[Rev]` or `all`) or `{error, {Rev, ErrorTag, ErrorReason}}`.
   
   Fetching the docs is fairly straightforward. The slightly interesting aspect is when an error is returned from `open_revs/3` we have to pretend that all the batched docs failed with that error. That is done to preserve the "zip" property, where all the input arguments have their matching result at the same position in results list. Another notable thing here is we fixed a bug where the error returned from `fabric:open_revs/3,4` was not formatted in a way it could have been emitted as json resulting in a function clause. That is why we call `couch_util:to_binary/1` on it. This was detected by the integration testing outline before and was missed by the previous mocked unit test.
   
   The last part is emitting the results as either json or multipart. Here most changes are cleanups and grouping into separate handler functions. The `Accept` header can be either `multipart/related` or `multipart/mixed` and we try to emit the same content type as it was passed in the `Accept` header. One notable thing here is by DRY-ing the filtering of attachments in `non_stubbed_attachments/1` we fixed another bug when the multipart result was returning nonsense in cases when all attachments were stubs. The doc was returned as a multipart chunk with content type `multipart/...` instead of application/json. This was also detected in the integration tests described below.
   
   The largest changes are in the testing area. Previous multipart tests were using mocks heavily, were quite fragile, and didn't have good coverage. Those tests were removed and replaced by new end-to-end tests in `chttpd_bulk_get_test.erl`. To make that happen add a simple multipart parser utility function which knows how to parse multipart responses into maps. Those maps preserve chunk headers and we can match those with `?assertMatch(...)` fairly easily. The tests try to get decent coverage for `chttpd_db.erl` bulk_get implementation and its utility functions, but they are also end-to-end tests so they test everything below, including fabric and couch layers as well.
   
   Quick 1 node testing using the couchdyno replicating of 1 million docs shows at least a 2x speedup to complete the replication using this PR.
   
   On main:
   
   ```
   r=rep.Rep(); r.replicate_1_to_n_and_compare(1, num=1000000, normal=True)
   330 sec
   ```
   
   With this PR:
   ```
   r=rep.Rep(); r.replicate_1_to_n_and_compare(1, num=1000000, normal=True)
   160 sec
   ```
   
   Individual `_bulk_get` response times shows an even higher improvement: an 8x speedup:
   
   On main:
   ```
   [notice] ... POST /cdyno-0000001/_bulk_get?latest=true&revs=true&attachments=false 200 ok 468
   [notice] ... POST /cdyno-0000001/_bulk_get?latest=true&revs=true&attachments=false 200 ok 479
   ```
   
   With this PR:
   ```
   [notice] ... POST /cdyno-0000001/_bulk_get?latest=true&revs=true&attachments=false 200 ok 54
   [notice] ... POST /cdyno-0000001/_bulk_get?latest=true&revs=true&attachments=false 200 ok 61
   ```
   
   Fixes: https://github.com/apache/couchdb/issues/4183


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] jaydoane commented on a diff in pull request #4238: Optimize _bulk_get endpoint

Posted by GitBox <gi...@apache.org>.
jaydoane commented on code in PR #4238:
URL: https://github.com/apache/couchdb/pull/4238#discussion_r1004912556


##########
src/chttpd/src/chttpd_db.erl:
##########
@@ -2363,6 +2288,175 @@ set_namespace(NS, #mrargs{} = Args) ->
 
 %% /db/_bulk_get stuff
 
+bulk_get_is_multipart(#httpd{mochi_req = MochiReq}) ->
+    Json = MochiReq:accepts_content_type("application/json"),
+    Mixed = MochiReq:accepts_content_type("multipart/mixed"),
+    Related = MochiReq:accepts_content_type("multipart/related"),
+    not Json andalso (Mixed orelse Related).
+
+bulk_get_docs(Db, #{} = ArgsMap, Options) ->
+    % Sort args by doc ID to hopefully make querying B-trees a bit faster
+    KeyFun = fun({Ref, {DocId, _, _}}) -> {DocId, Ref} end,
+    CmpFun = fun(A, B) -> KeyFun(A) =< KeyFun(B) end,
+    ArgsList = lists:sort(CmpFun, maps:to_list(ArgsMap)),
+    % Split out known errors. Later, before returning, recombine them back into
+    % the final result map.
+    PartFun = fun({_Ref, {_DocId, RevsOrError, _DocOpts}}) ->
+        case RevsOrError of
+            L when is_list(L) -> true;
+            all -> true;
+            {error, _} -> false
+        end
+    end,
+    {ValidArgs, ErrorArgs} = lists:partition(PartFun, ArgsList),
+    UseBatches = config:get_boolean("chttpd", "bulk_get_use_batches", true),
+    Responses =
+        case UseBatches of
+            true -> bulk_get_docs_batched(Db, ValidArgs, Options);
+            false -> bulk_get_docs_individually(Db, ValidArgs, Options)
+        end,
+    MapFun = fun({Ref, {DocId, Response, _}} = RespTuple) ->
+        case Response of
+            [] ->
+                % Remap empty reponses to errors. This is a peculiarity of the

Review Comment:
   Calling this annoyance a "peculiarity" seems quite gracious!



##########
src/chttpd/src/chttpd_db.erl:
##########
@@ -2363,6 +2288,175 @@ set_namespace(NS, #mrargs{} = Args) ->
 
 %% /db/_bulk_get stuff
 
+bulk_get_is_multipart(#httpd{mochi_req = MochiReq}) ->
+    Json = MochiReq:accepts_content_type("application/json"),
+    Mixed = MochiReq:accepts_content_type("multipart/mixed"),
+    Related = MochiReq:accepts_content_type("multipart/related"),
+    not Json andalso (Mixed orelse Related).
+
+bulk_get_docs(Db, #{} = ArgsMap, Options) ->
+    % Sort args by doc ID to hopefully make querying B-trees a bit faster
+    KeyFun = fun({Ref, {DocId, _, _}}) -> {DocId, Ref} end,
+    CmpFun = fun(A, B) -> KeyFun(A) =< KeyFun(B) end,
+    ArgsList = lists:sort(CmpFun, maps:to_list(ArgsMap)),
+    % Split out known errors. Later, before returning, recombine them back into
+    % the final result map.
+    PartFun = fun({_Ref, {_DocId, RevsOrError, _DocOpts}}) ->
+        case RevsOrError of
+            L when is_list(L) -> true;
+            all -> true;
+            {error, _} -> false
+        end
+    end,
+    {ValidArgs, ErrorArgs} = lists:partition(PartFun, ArgsList),
+    UseBatches = config:get_boolean("chttpd", "bulk_get_use_batches", true),
+    Responses =
+        case UseBatches of
+            true -> bulk_get_docs_batched(Db, ValidArgs, Options);
+            false -> bulk_get_docs_individually(Db, ValidArgs, Options)
+        end,
+    MapFun = fun({Ref, {DocId, Response, _}} = RespTuple) ->
+        case Response of
+            [] ->
+                % Remap empty reponses to errors. This is a peculiarity of the
+                % _bulk_get HTTP API. If revision was not specifed, `undefined`
+                % must be returned as the error revision ID.
+                #{Ref := {_, Revs, _}} = ArgsMap,
+                RevStr = bulk_get_rev_error(Revs),
+                Error = {RevStr, <<"not_found">>, <<"missing">>},
+                {Ref, {DocId, {error, Error}, []}};
+            [_ | _] = DocRevisions ->
+                chttpd_stats:incr_reads(length(DocRevisions)),
+                RespTuple;
+            _ ->
+                RespTuple
+        end
+    end,
+    % Recombine with the inital known errors and return as a map
+    maps:from_list(lists:map(MapFun, Responses) ++ ErrorArgs).
+
+bulk_get_docs_batched(Db, Args, Options) when is_list(Args) ->
+    % Args is [{Ref, {DocId, Revs, DocOpts}}, ...] but fabric:open_revs/3
+    % accepts [{{DocId, Revs}, DocOpts}, ...] so we need to transform them
+    ArgsFun = fun({_Ref, {DocId, Revs, DocOpts}}) ->
+        {{DocId, Revs}, DocOpts}
+    end,
+    OpenRevsArgs = lists:map(ArgsFun, Args),
+    case fabric:open_revs(Db, OpenRevsArgs, Options) of
+        {ok, Responses} ->
+            ZipFun = fun({Ref, {DocId, _Rev, DocOpts}}, Response) ->
+                {Ref, {DocId, Response, DocOpts ++ Options}}
+            end,
+            lists:zipwith(ZipFun, Args, Responses);
+        {error, Error} ->
+            % Distribute error to all request args, so it looks like they
+            % individually failed with that error
+            MapFun = fun({Ref, {DocId, Revs, _DocOpts}}) ->
+                RevStr = bulk_get_rev_error(Revs),
+                Tag = internal_fabric_error,
+                % This error will be emitted as json so make sure it's rendered
+                % to a string first.
+                Reason = couch_util:to_binary(Error),
+                {Ref, {DocId, {error, {RevStr, Tag, Reason}}, []}}
+            end,
+            lists:map(MapFun, Args)
+    end.
+
+bulk_get_docs_individually(Db, Args, Options) when is_list(Args) ->
+    MapFun = fun({Ref, {DocId, Revs, DocOpts}}) ->
+        case fabric:open_revs(Db, DocId, Revs, DocOpts ++ Options) of
+            {ok, Response} ->
+                {Ref, {DocId, Response, DocOpts}};
+            {error, Error} ->
+                RevStr = bulk_get_rev_error(Revs),
+                Tag = internal_fabric_error,
+                % This error will be emitted as json so make sure it's rendered
+                % to a string first.
+                Reason = couch_util:to_binary(Error),
+                {Ref, {DocId, {error, {RevStr, Tag, Reason}}, []}}
+        end
+    end,
+    lists:map(MapFun, Args).
+
+bulk_get_ret_json(#httpd{} = Req, ArgsRefs, ResultsMap, Options) ->
+    send_json(Req, 200, #{
+        <<"results">> => lists:map(
+            fun(Ref) ->
+                #{Ref := {DocId, Result, DocOpts}} = ResultsMap,
+                % We are about to encode the document into json and some of the
+                % provided general options might affect that so we make sure to
+                % combine all doc options and the general options together
+                AllOptions = DocOpts ++ Options,
+                #{
+                    <<"id">> => DocId,
+                    <<"docs">> => bulk_get_result(DocId, Result, AllOptions)
+                }
+            end,
+            ArgsRefs
+        )
+    }).
+
+bulk_get_result(DocId, {error, {Rev, Error, Reason}}, _Options) ->
+    [bulk_get_json_error_map(DocId, Rev, Error, Reason)];
+bulk_get_result(DocId, [_ | _] = DocRevs, Options) ->
+    MapFun = fun
+        ({ok, Doc}) ->
+            #{<<"ok">> => couch_doc:to_json_obj(Doc, Options)};
+        ({{Error, Reason}, RevId}) ->
+            Rev = couch_doc:rev_to_str(RevId),
+            bulk_get_json_error_map(DocId, Rev, Error, Reason)
+    end,
+    lists:map(MapFun, DocRevs).
+
+bulk_get_json_error_map(DocId, Rev, Error, Reason) ->
+    #{
+        <<"error">> => #{
+            <<"id">> => DocId,
+            <<"rev">> => Rev,
+            <<"error">> => Error,
+            <<"reason">> => Reason
+        }
+    }.
+
+bulk_get_ret_multipart(#httpd{} = Req, ArgsRefs, ResultsMap, Options) ->
+    MochiReq = Req#httpd.mochi_req,
+    Mixed = MochiReq:accepts_content_type("multipart/mixed"),
+    MpType =
+        case Mixed of
+            true -> "multipart/mixed";
+            false -> "multipart/related"
+        end,
+    Boundary = bulk_get_multipart_boundary(),
+    BoundaryCType = MpType ++ "; boundary=\"" ++ ?b2l(Boundary) ++ "\"",
+    CType = {"Content-Type", BoundaryCType},
+    {ok, Resp} = start_chunked_response(Req, 200, [CType]),
+    ForeachFun = fun(Ref) ->
+        #{Ref := {DocId, Result, DocOpts}} = ResultsMap,
+        case Result of
+            [_ | _] = DocRevs ->
+                AllOptions = DocOpts ++ Options,
+                send_docs_multipart_bulk_get(DocRevs, AllOptions, Boundary, Resp);
+            {error, {RevId, Error, Reason}} ->
+                EJson = bulk_get_json_error_map(DocId, RevId, Error, Reason),
+                Json = ?JSON_ENCODE(map_get(<<"error">>, EJson)),
+                ErrCType = <<"Content-Type: application/json">>,
+                Prefix = <<"\r\n", ErrCType/binary, "; error=\"true\"\r\n\r\n">>,
+                ErrorChunk = [<<"\r\n--", Boundary/binary>>, Prefix, Json],
+                couch_httpd:send_chunk(Resp, ErrorChunk)
+        end
+    end,
+    lists:foreach(ForeachFun, ArgsRefs),
+    case ArgsRefs of
+        [] ->
+            % Didn't send any docs, don't need to send a closing boundary
+            ok;
+        [_ | _] ->

Review Comment:
   👍 



##########
src/chttpd/test/eunit/chttpd_bulk_get_test.erl:
##########
@@ -654,6 +1303,34 @@ t_missing_rev_latest({_, DbUrl}) ->
         Res
     ).
 
+t_fabric_worker_error({_, DbUrl}) ->
+    % Check the case of latest and a missing rev

Review Comment:
   Copy-pasta error?



##########
src/chttpd/src/chttpd_db.erl:
##########
@@ -2363,6 +2288,175 @@ set_namespace(NS, #mrargs{} = Args) ->
 
 %% /db/_bulk_get stuff
 
+bulk_get_is_multipart(#httpd{mochi_req = MochiReq}) ->
+    Json = MochiReq:accepts_content_type("application/json"),
+    Mixed = MochiReq:accepts_content_type("multipart/mixed"),
+    Related = MochiReq:accepts_content_type("multipart/related"),
+    not Json andalso (Mixed orelse Related).
+
+bulk_get_docs(Db, #{} = ArgsMap, Options) ->
+    % Sort args by doc ID to hopefully make querying B-trees a bit faster
+    KeyFun = fun({Ref, {DocId, _, _}}) -> {DocId, Ref} end,
+    CmpFun = fun(A, B) -> KeyFun(A) =< KeyFun(B) end,
+    ArgsList = lists:sort(CmpFun, maps:to_list(ArgsMap)),
+    % Split out known errors. Later, before returning, recombine them back into
+    % the final result map.
+    PartFun = fun({_Ref, {_DocId, RevsOrError, _DocOpts}}) ->
+        case RevsOrError of
+            L when is_list(L) -> true;
+            all -> true;
+            {error, _} -> false
+        end
+    end,
+    {ValidArgs, ErrorArgs} = lists:partition(PartFun, ArgsList),
+    UseBatches = config:get_boolean("chttpd", "bulk_get_use_batches", true),
+    Responses =
+        case UseBatches of
+            true -> bulk_get_docs_batched(Db, ValidArgs, Options);
+            false -> bulk_get_docs_individually(Db, ValidArgs, Options)
+        end,
+    MapFun = fun({Ref, {DocId, Response, _}} = RespTuple) ->
+        case Response of
+            [] ->
+                % Remap empty reponses to errors. This is a peculiarity of the
+                % _bulk_get HTTP API. If revision was not specifed, `undefined`
+                % must be returned as the error revision ID.
+                #{Ref := {_, Revs, _}} = ArgsMap,
+                RevStr = bulk_get_rev_error(Revs),
+                Error = {RevStr, <<"not_found">>, <<"missing">>},
+                {Ref, {DocId, {error, Error}, []}};
+            [_ | _] = DocRevisions ->

Review Comment:
   Although it gives you the opportunity to use this awesome pattern a lot ;)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] jaydoane commented on a diff in pull request #4238: Optimize _bulk_get endpoint

Posted by GitBox <gi...@apache.org>.
jaydoane commented on code in PR #4238:
URL: https://github.com/apache/couchdb/pull/4238#discussion_r1005026285


##########
src/chttpd/test/eunit/chttpd_bulk_get_test.erl:
##########
@@ -654,6 +1303,34 @@ t_missing_rev_latest({_, DbUrl}) ->
         Res
     ).
 
+t_fabric_worker_error({_, DbUrl}) ->
+    % Check the case of latest and a missing rev

Review Comment:
   Copy-pasta?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] nickva commented on a diff in pull request #4238: Optimize _bulk_get endpoint

Posted by GitBox <gi...@apache.org>.
nickva commented on code in PR #4238:
URL: https://github.com/apache/couchdb/pull/4238#discussion_r1005150409


##########
src/chttpd/test/eunit/chttpd_bulk_get_test.erl:
##########
@@ -654,6 +1303,34 @@ t_missing_rev_latest({_, DbUrl}) ->
         Res
     ).
 
+t_fabric_worker_error({_, DbUrl}) ->
+    % Check the case of latest and a missing rev

Review Comment:
   Yes it is. Good find!
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] nickva commented on pull request #4238: Optimize _bulk_get endpoint

Posted by GitBox <gi...@apache.org>.
nickva commented on PR #4238:
URL: https://github.com/apache/couchdb/pull/4238#issuecomment-1289443795

   This PR also fixed https://github.com/apache/couchdb/issues/3328


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] nickva merged pull request #4238: Optimize _bulk_get endpoint

Posted by GitBox <gi...@apache.org>.
nickva merged PR #4238:
URL: https://github.com/apache/couchdb/pull/4238


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] nickva commented on a diff in pull request #4238: Optimize _bulk_get endpoint

Posted by GitBox <gi...@apache.org>.
nickva commented on code in PR #4238:
URL: https://github.com/apache/couchdb/pull/4238#discussion_r1005148598


##########
src/chttpd/src/chttpd_db.erl:
##########
@@ -2363,6 +2288,175 @@ set_namespace(NS, #mrargs{} = Args) ->
 
 %% /db/_bulk_get stuff
 
+bulk_get_is_multipart(#httpd{mochi_req = MochiReq}) ->
+    Json = MochiReq:accepts_content_type("application/json"),
+    Mixed = MochiReq:accepts_content_type("multipart/mixed"),
+    Related = MochiReq:accepts_content_type("multipart/related"),
+    not Json andalso (Mixed orelse Related).
+
+bulk_get_docs(Db, #{} = ArgsMap, Options) ->
+    % Sort args by doc ID to hopefully make querying B-trees a bit faster
+    KeyFun = fun({Ref, {DocId, _, _}}) -> {DocId, Ref} end,
+    CmpFun = fun(A, B) -> KeyFun(A) =< KeyFun(B) end,
+    ArgsList = lists:sort(CmpFun, maps:to_list(ArgsMap)),
+    % Split out known errors. Later, before returning, recombine them back into
+    % the final result map.
+    PartFun = fun({_Ref, {_DocId, RevsOrError, _DocOpts}}) ->
+        case RevsOrError of
+            L when is_list(L) -> true;
+            all -> true;
+            {error, _} -> false
+        end
+    end,
+    {ValidArgs, ErrorArgs} = lists:partition(PartFun, ArgsList),
+    UseBatches = config:get_boolean("chttpd", "bulk_get_use_batches", true),
+    Responses =
+        case UseBatches of
+            true -> bulk_get_docs_batched(Db, ValidArgs, Options);
+            false -> bulk_get_docs_individually(Db, ValidArgs, Options)
+        end,
+    MapFun = fun({Ref, {DocId, Response, _}} = RespTuple) ->
+        case Response of
+            [] ->
+                % Remap empty reponses to errors. This is a peculiarity of the
+                % _bulk_get HTTP API. If revision was not specifed, `undefined`
+                % must be returned as the error revision ID.
+                #{Ref := {_, Revs, _}} = ArgsMap,
+                RevStr = bulk_get_rev_error(Revs),
+                Error = {RevStr, <<"not_found">>, <<"missing">>},
+                {Ref, {DocId, {error, Error}, []}};
+            [_ | _] = DocRevisions ->

Review Comment:
   [Yup](https://medium.com/erlang-battleground/ode-to-the-robot-butt-bbd69e69beb2)!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org