Posted to user@couchdb.apache.org by "Eli Stevens (Gmail)" <wi...@gmail.com> on 2011/06/08 10:36:53 UTC

Upload speed for large attachments

Running the following code on a MacBook Pro, using CouchDBX 1.0.2
(everything local), we're seeing the following output when trying to
attach a file with 10MB of random data:

Code: https://gist.github.com/bc0c36f36be0c85e2a36 (code included in full below)
Output (times in seconds):

Using curl: 0.168450117111
Using put_attachment: 0.309157133102
post time: 2.5557808876
Using multipart: 2.61283898354
Encoding base64: 0.0497629642487
Updating: 5.0550069809

Server log: https://gist.github.com/a80a495fd35049ff871f (there's a
HEAD/DELETE/PUT/GET cycle that's just cleanup)

The calls in question are:

Using curl: 0.168450117111
1> [info] [<0.27828.7>] 127.0.0.1 - - 'PUT'
/benchmark_entity/bigfile/bigfile/bigfile.gz?rev=78-db58ded2899c5546e349feb5a8c0eee4
201

Using put_attachment: 0.309157133102
1> [info] [<0.27809.7>] 127.0.0.1 - - 'PUT'
/benchmark_entity/bigfile/smallfile?rev=81-c538b38a8463952f0136143cfa49e9fa
201

Using multipart: 2.61283898354 (post time: 2.5557808876)
1> [info] [<0.27809.7>] 127.0.0.1 - - 'POST' /benchmark_entity/bigfile 201

Updating: 5.0550069809
1> [info] [<0.27809.7>] 127.0.0.1 - - 'POST' /benchmark_entity/_bulk_docs 201

Profiling shows 1.5 sec of CPU time in our own code (which includes
the setup/cleanup work not counted in the times above) and 11.8 sec
of total run time, which roughly matches the PUT/POST times above.
So I'm fairly confident that the bulk of those times is due to
CouchDB's handling time rather than to our client code.

Why is the form/multipart handler so much slower than a bare PUT of
the attachment?  Why is the base64 approach slower still?  Is it a
bandwidth issue, CouchDB CPU usage...?
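
In case it helps narrow things down, here's a minimal sketch (standard
library httplib only, timed with time.time() like the script below; the
attachment name rawput.gz is just a placeholder) of a bare PUT that
bypasses both curl and couchdb-python:

import httplib
import json
import time

conn = httplib.HTTPConnection('localhost', 5984)

# fetch the current rev of the benchmark doc
conn.request('GET', '/benchmark_entity/bigfile')
doc = json.loads(conn.getresponse().read())

# time a bare PUT of the same file, with no client library in the way
with open('/tmp/smallfile', 'rb') as f:
    body = f.read()
t0 = time.time()
conn.request('PUT',
             '/benchmark_entity/bigfile/rawput.gz?rev={}'.format(doc['_rev']),
             body,
             {'Content-Type': 'application/gzip'})
resp = conn.getresponse()
resp.read()
print 'Using raw httplib: {}'.format(time.time() - t0)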

Thanks for any help,
Eli

Full code from: https://gist.github.com/bc0c36f36be0c85e2a36

import base64
import contextlib
import cStringIO
import subprocess
import time

import couchdb
import couchdb.json
import couchdb.multipart

@contextlib.contextmanager
def stopwatch(m=''):
    t0=time.time()
    yield
    tdiff=time.time() - t0
    if m:
        print '{}: {}'.format(m, tdiff)
    else:
        print tdiff

def reset(d):
    try:
        del d['bigfile']
    except couchdb.http.ResourceNotFound:
        pass
    d['bigfile'] = {'foo': 'bar'}
    return d['bigfile']

s = couchdb.Server()
d = s['benchmark_entity']

fn = '/tmp/bigfile.gz'
fn = '/tmp/smallfile'  # note: this overrides the previous line; every case below uploads /tmp/smallfile
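
# (sketch, not part of the original run: the test file was described as 10MB of
# random data; something like this would generate a comparable file if needed)
import os
if not os.path.exists(fn):
    with open(fn, 'wb') as f:
        f.write(os.urandom(10 * 1024 * 1024))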

doc = reset(d)
with stopwatch('Using curl'):
    p = subprocess.Popen([
        'curl',
        '-X', 'PUT',
        'http://localhost:5984/benchmark_entity/{}/bigfile/bigfile.gz?rev={}'.format(doc.id, doc.rev),
        # note: curl's -d strips CR and LF bytes when reading @file; --data-binary sends it untouched
        '-d', '@{}'.format(fn),
        '-H', 'Content-Type: application/gzip'
        ])
    p.wait()

doc = reset(d)
with open(fn, 'r') as f:
    with stopwatch('Using put_attachment'):
        d.put_attachment(doc, f)
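        # (note, not part of the original run: put_attachment also accepts an
        # explicit filename and content type, e.g.
        #   d.put_attachment(doc, f, filename='bigfile.gz', content_type='application/gzip')
        # without them the attachment is named after the open file, hence
        # 'smallfile' in the server log above)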

doc = reset(d)
with open(fn, 'r') as f:
    content_name = 'bigfile.gz'
    content = f.read()
    content_type = 'application/gzip'
    with stopwatch('Using multipart'):
        fileobj = cStringIO.StringIO()

        with couchdb.multipart.MultipartWriter(fileobj, headers=None, subtype='form-data') as mpw:
            mime_headers = {'Content-Disposition': '''form-data; name="_doc"'''}
            mpw.add('application/json', couchdb.json.encode(doc), mime_headers)

            mime_headers = {'Content-Disposition': '''form-data; name="_attachments"; filename="{}"'''.format(content_name)}
            mpw.add(content_type, content, mime_headers)

        header_str, blank_str, body = fileobj.getvalue().split('\r\n', 2)

        http_headers = {'Referer': d.resource.url, 'Content-Type': header_str[len('Content-Type: '):]}
        params = {}
        t0 = time.time()
        status, msg, data = d.resource.post(doc['_id'], body, http_headers, **params)
        print 'post time: {}'.format(time.time() - t0)

doc = reset(d)
with open(fn, 'r') as f:
    content_name = 'bigfile.gz'
    content = f.read()
    content_type = 'application/gzip'
    with stopwatch('Encoding base64'):
        doc['_attachments'] = {content_name: {'content_type': content_type, 'data': base64.b64encode(content)}}
    with stopwatch('Updating'):
        d.update([doc])
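
# rough check (not part of the original timings): base64 expands the payload by
# about a third, so the _bulk_docs request body is noticeably larger than the raw file
print 'raw size: {}, base64 size: {}'.format(len(content), len(base64.b64encode(content)))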

Re: Upload speed for large attachments

Posted by "Eli Stevens (Gmail)" <wi...@gmail.com>.
Tilgovi on IRC asked me to open an issue:

https://issues.apache.org/jira/browse/COUCHDB-1192

Cheers,
Eli
