Posted to user@couchdb.apache.org by Brian Candler <B....@pobox.com> on 2010/01/14 17:45:58 UTC

Bulk CSV import?

I can use a _list function to export a database in a non-JSON format (e.g.
CSV)

But what about importing? I see there is _update, but as far as I can see it
only acts on a single document.  Is there something that acts at the
database level and can build a bulk_save batch?

Thanks,

Brian.
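
[Editor's note: pending such a feature, the usual workaround is client-side: convert the CSV into a _bulk_docs payload and POST it to /dbname/_bulk_docs. A rough sketch (naive comma splitting, no quoting support; the field names are made up):]

```javascript
// Build a _bulk_docs payload from CSV text. This naive split-based
// parser handles only unquoted fields -- real CSV needs a proper parser.
function csvToBulkDocs(csv) {
  var lines = csv.trim().split("\n");
  var header = lines[0].split(",");
  var docs = lines.slice(1).map(function (line) {
    var fields = line.split(",");
    var doc = {};
    header.forEach(function (name, i) { doc[name] = fields[i]; });
    return doc;
  });
  return { docs: docs }; // POST this as JSON to /dbname/_bulk_docs
}

var payload = csvToBulkDocs("name,city\nalice,London\nbob,Leeds");
console.log(JSON.stringify(payload));
```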

Re: Bulk CSV import?

Posted by Chris Anderson <jc...@apache.org>.
On Thu, Jan 14, 2010 at 2:08 PM, Brian Candler <B....@pobox.com> wrote:
> On Thu, Jan 14, 2010 at 12:58:36PM -0800, Roger Binns wrote:
>> > But what about importing?
>>
>> Although CSV may seem trivial, it is actually a nightmarish format.
>
> I never said it was a good format for all applications, but there are cases
> for which it works well.
>
>> That said, if you do want to import CSV may I suggest using APSW.
>
> You miss my point. I was asking for a way by which a CouchApp author could
> provide their user with an import facility (whether it be XML, CSV or
> whatever), which they could drive through their browser.
>
> Anyway, it's not high on my priority list. I'm just identifying things in my
> current Rails-based app which would need to be sorted if porting it to a
> CouchApp.

Yeah I definitely like this feature. Bonus points if it is streamed
instead of buffered (or supports a stream mode).

Chris

>
> Regards,
>
> Brian.
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: Bulk CSV import?

Posted by Brian Candler <B....@pobox.com>.
On Thu, Jan 14, 2010 at 12:58:36PM -0800, Roger Binns wrote:
> > But what about importing? 
> 
> Although CSV may seem trivial, it is actually a nightmarish format.

I never said it was a good format for all applications, but there are cases
for which it works well.

> That said, if you do want to import CSV may I suggest using APSW.

You miss my point. I was asking for a way by which a CouchApp author could
provide their user with an import facility (whether it be XML, CSV or
whatever), which they could drive through their browser.

Anyway, it's not high on my priority list. I'm just identifying things in my
current Rails-based app which would need to be sorted if porting it to a
CouchApp.

Regards,

Brian.

Re: Bulk CSV import?

Posted by Roger Binns <ro...@rogerbinns.com>.

Brian Candler wrote:
> But what about importing? 

Although CSV may seem trivial, it is actually a nightmarish format.  No two
programs do exactly the same thing.  If CouchDB accepted CSV directly there
would be all sorts of extra parameters that would need to be supplied.  For
example, when are columns or values considered numbers?  (A sequence of
digits is not a good test, since they could be a phone number, and an 'e'
within the digits could really just be a letter or it could be part of a
real number.)  Similarly, if a row has a zero-length value for one column, what
do you translate that into?  (A zero-length string? null? Omit the key for
that doc?)  See the SQLite mailing list for frequent postings about CSV, and
that is for a system that has supported CSV import and export for almost a decade!
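
[Editor's note: a tiny illustration of the typing ambiguity Roger describes, using JavaScript's Number() as a stand-in for whatever coercion heuristic an importer might apply:]

```javascript
// Values that "look numeric" can be wrong to coerce:
// an 'e' inside digits makes a float, empty means zero,
// and a leading zero (as in a phone prefix) is silently lost.
console.log(Number("1e3"));   // 1000
console.log(Number(""));      // 0
console.log(Number("01624")); // 1624
```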

That said, if you do want to import CSV may I suggest using APSW.  It has an
interactive shell so you do not need to know or care about Python.  APSW
comes with a module that bridges between SQLite and CouchDB.  You can do the
CSV import, data typing and cleanup using SQLite and then push the results
into CouchDB:

  http://apsw.googlecode.com/svn/publish/couchdb.html#importing-csv-data

[Disclaimer: I am the author of APSW]

Roger

Re: Bulk CSV import?

Posted by Chris Anderson <jc...@apache.org>.
On Fri, Jan 15, 2010 at 12:45 AM, Brian Candler <B....@pobox.com> wrote:
> On Thu, Jan 14, 2010 at 02:15:54PM -0800, Chris Anderson wrote:
>> > More difficult would be to allow bulk *updates* via this mechanism, because
>> > having parsed out the IDs you'd need to be able to fetch existing docs,
>> > modify and write back.
>> >
>>
>> If the CSV source was responsible for tracking _revs then it could work easily.
>
> What I mean is, couchdb itself can't parse out the _id and _rev as the
> stream comes in (since the CSV parsing isn't built into couchdb), so it
> can't pre-fetch the docs. The doc fetch requests would have to be bounced
> back to couchdb core. e.g.

I was thinking of an API where the function just transforms CSV to
JSON, not where it also compares it to existing documents.

You're right it gets complex as a real bulk update function. Maybe
this is better addressed with a 3rd party script.

Chris


Re: Binary data

Posted by Chris Anderson <jc...@apache.org>.
On Mon, Jan 18, 2010 at 6:39 AM, Brian Candler <B....@pobox.com> wrote:
>> > On Fri, Jan 15, 2010 at 09:32:34AM -0800, Chris Anderson wrote:
>> >> list and show should be able to return base64 data, to be decoded to
>> >> binary before sending to the client. I know there is a test that _show
>> >> can serve a favicon.ico file.
>> >
>> > Thank you for pointing that out, I found the _show test.
>> >
>> > I had already grepped the source for base64, and found only one relevant
>> > instance in couch_httpd_external.erl. Now I see that gets called from
>> > couch_httpd_show.erl too.
>> >
>> > I can't see how _list calls it though?
>>
>> that's worth writing a test for. written tests rock. -- move to dev@?
>
> Moved to dev.
>
> I've done the test for _update (and it already passes):
> https://issues.apache.org/jira/browse/COUCHDB-626
>
> However I'm not sure what you want to do with _list. At the moment you emit
> chunks of plain strings. Do you want something like
>    send({base64:"..."})
> ?

I think the send({base64:"..."}) option is easier to understand and
probably less burden on implementors.

Thanks,
Chris

>
> Or do you want to put a tag in the header which says all the chunks are
> base64? (like a Content-Transfer-Encoding: base64 which is stripped)
>




Binary data

Posted by Brian Candler <B....@pobox.com>.
> > On Fri, Jan 15, 2010 at 09:32:34AM -0800, Chris Anderson wrote:
> >> list and show should be able to return base64 data, to be decoded to
> >> binary before sending to the client. I know there is a test that _show
> >> can serve a favicon.ico file.
> >
> > Thank you for pointing that out, I found the _show test.
> >
> > I had already grepped the source for base64, and found only one relevant
> > instance in couch_httpd_external.erl. Now I see that gets called from
> > couch_httpd_show.erl too.
> >
> > I can't see how _list calls it though?
> 
> that's worth writing a test for. written tests rock. -- move to dev@?

Moved to dev.

I've done the test for _update (and it already passes):
https://issues.apache.org/jira/browse/COUCHDB-626

However I'm not sure what you want to do with _list. At the moment you emit
chunks of plain strings. Do you want something like
    send({base64:"..."})
?

Or do you want to put a tag in the header which says all the chunks are
base64? (like a Content-Transfer-Encoding: base64 which is stripped)
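
[Editor's note: for what it's worth, the send({base64:"..."}) shape Brian proposes could look like this inside a _list function. This is entirely hypothetical API, and send() is stubbed here so the sketch runs outside couchjs:]

```javascript
// Hypothetical: a _list function emitting binary chunks as base64.
// In CouchDB the runtime would provide send(); we stub it to demonstrate.
var chunks = [];
function send(obj) { chunks.push(obj); }

function listBinary(head, req) {
  // Pretend each chunk carries raw bytes we want to pass through;
  // the server would base64-decode each chunk before sending it on.
  ["PK\u0003\u0004", "more-bytes"].forEach(function (raw) {
    send({ base64: Buffer.from(raw, "binary").toString("base64") });
  });
}

listBinary({}, {});
console.log(chunks);
```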

Re: Binary data [was Bulk CSV import?]

Posted by Chris Anderson <jc...@apache.org>.
On Fri, Jan 15, 2010 at 2:16 PM, Brian Candler <B....@pobox.com> wrote:
> On Fri, Jan 15, 2010 at 09:32:34AM -0800, Chris Anderson wrote:
>> list and show should be able to return base64 data, to be decoded to
>> binary before sending to the client. I know there is a test that _show
>> can serve a favicon.ico file.
>
> Thank you for pointing that out, I found the _show test.
>
> I had already grepped the source for base64, and found only one relevant
> instance in couch_httpd_external.erl. Now I see that gets called from
> couch_httpd_show.erl too.
>
> I can't see how _list calls it though?
>

that's worth writing a test for. written tests rock. -- move to dev@?




Re: Binary data [was Bulk CSV import?]

Posted by Brian Candler <B....@pobox.com>.
On Fri, Jan 15, 2010 at 09:32:34AM -0800, Chris Anderson wrote:
> list and show should be able to return base64 data, to be decoded to
> binary before sending to the client. I know there is a test that _show
> can serve a favicon.ico file.

Thank you for pointing that out, I found the _show test.

I had already grepped the source for base64, and found only one relevant
instance in couch_httpd_external.erl. Now I see that gets called from
couch_httpd_show.erl too.

I can't see how _list calls it though?

Re: Binary data [was Bulk CSV import?]

Posted by Chris Anderson <jc...@apache.org>.
On Fri, Jan 15, 2010 at 1:50 AM, Brian Candler <B....@pobox.com> wrote:
> A couple more thoughts about importing and exporting aggregate batches of
> document-oriented data.
>
> * A desktop-friendly way to bundle documents is in a ZIP file.
> Unfortunately that's a binary format, and _list/_external use a JSON (UTF-8)
> protocol.
>
> I see that an _external function can give back base64-encoded binary data:
> http://wiki.apache.org/couchdb/ExternalProcesses
>
> I don't think _list or _show can, and in any case you'd need a ZIP library
> written in Javascript.

list and show should be able to return base64 data, to be decoded to
binary before sending to the client. I know there is a test that _show
can serve a favicon.ico file.

but yes, it seems a bit much to do ZIP in JS. maybe not in... 2010!

>
> Maybe this is too far out of scope for a JS-backed CouchApp. But with
> erlview it makes a lot more sense: you have a binary-clean interface, and
> many libraries available for handling binary formats. e.g.
> http://www.erlang.org/doc/man/zip.html
>
> * The other option I've tried is MIME multipart documents, but in my testing
> I found that browsers don't handle them well. At best you just get the first
> one saved. I think that option can be discarded.
>
> Regards,
>
> Brian.
>




Binary data [was Bulk CSV import?]

Posted by Brian Candler <B....@pobox.com>.
A couple more thoughts about importing and exporting aggregate batches of
document-oriented data.

* A desktop-friendly way to bundle documents is in a ZIP file. 
Unfortunately that's a binary format, and _list/_external use a JSON (UTF-8)
protocol.

I see that an _external function can give back base64-encoded binary data:
http://wiki.apache.org/couchdb/ExternalProcesses

I don't think _list or _show can, and in any case you'd need a ZIP library
written in Javascript.

Maybe this is too far out of scope for a JS-backed CouchApp. But with
erlview it makes a lot more sense: you have a binary-clean interface, and
many libraries available for handling binary formats. e.g.
http://www.erlang.org/doc/man/zip.html

* The other option I've tried is MIME multipart documents, but in my testing
I found that browsers don't handle them well. At best you just get the first
one saved. I think that option can be discarded.

Regards,

Brian.

Re: Bulk CSV import?

Posted by Brian Candler <B....@pobox.com>.
On Thu, Jan 14, 2010 at 02:15:54PM -0800, Chris Anderson wrote:
> > More difficult would be to allow bulk *updates* via this mechanism, because
> > having parsed out the IDs you'd need to be able to fetch existing docs,
> > modify and write back.
> >
> 
> If the CSV source was responsible for tracking _revs then it could work easily.

What I mean is, couchdb itself can't parse out the _id and _rev as the
stream comes in (since the CSV parsing isn't built into couchdb), so it
can't pre-fetch the docs. The doc fetch requests would have to be bounced
back to couchdb core. e.g.


data over HTTP            opaque data
-------------> couchdb ----------------> updater function
                         _ids and _revs
                       <----------------
                          original docs
                       ---------------->
                          updated docs
                       <----------------

But if we allow streaming that's going to be awkward; the 'opaque data'
stream may have to be interleaved with the 'original docs' stream.

Then after updating the docs, what is couchdb going to do with the results
of each save, i.e.  success/fail and new _revs?  It could send them back to
the client in JSON format like the result of a _bulk_save, but that won't
mean much to most users. So you probably also want:

                         save statuses
                       ---------------->
                        response stream or HTML status page
                       <----------------

If you want to stream all this, and you don't want couchjs functions to be
able to make asynchronous callbacks to couchdb, you could run three separate
couchjs processes in parallel:

data over HTTP            opaque data
-------------> couchdb ----------------> parser function
                         _ids,_revs and updates
                       <----------------

                         JSON docs+updates
                       ----------------> updater function
                          updated docs
                       <----------------

                          doc statuses
                       ----------------> results list function
                          opaque data
                       <----------------

Maybe there's a way to do this multipass load using some sort of staging
docs in the database itself. Imagine saving '_bulk_docs' requests and
responses as docs themselves, then spooling them out using a list function.

It could be simpler without streaming:

                      -------> blob
                      <------- _all_docs request
                      -------> _all_docs response
                      <------- _bulk_save request
                      -------> _bulk_save response
                      <------- blob

That would let you import 10MB of data via a couchapp, but for 10GB you'd
need a custom app in front.
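
[Editor's note: the non-streaming round trip above boils down to merging current _revs from an _all_docs response into the parsed updates before the _bulk_save. A sketch of that merge step (hypothetical helper; the row shape follows the real _all_docs response):]

```javascript
// Merge _revs from an _all_docs response into freshly parsed update docs,
// so the subsequent _bulk_docs request updates rather than conflicts.
function mergeRevs(allDocsRows, updates) {
  var revById = {};
  allDocsRows.forEach(function (row) { revById[row.id] = row.value.rev; });
  return updates.map(function (doc) {
    if (revById[doc._id]) doc._rev = revById[doc._id]; // existing doc
    return doc;                                        // else a fresh insert
  });
}

var rows = [{ id: "a", value: { rev: "1-abc" } }];
var merged = mergeRevs(rows, [{ _id: "a", qty: 5 }, { _id: "b", qty: 9 }]);
console.log(merged);
```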

Re: Bulk CSV import?

Posted by Chris Anderson <jc...@apache.org>.
On Thu, Jan 14, 2010 at 2:12 PM, Brian Candler <B....@pobox.com> wrote:
> On Thu, Jan 14, 2010 at 01:37:02PM -0600, Zachary Zolton wrote:
>> What would a function (API, really) look like for importing CSV?
>
> Probably something like a cross between the _update and _list APIs:
>
> * raw data is passed in, parsing is completely the responsibility of the
>  user-written function (so it could be CSV, XML, or whatever)
>
> * it has a callback which lets it emit a stream of objects like _list,
>  and these are emitted as JSON docs and written to the database.
>
> More difficult would be to allow bulk *updates* via this mechanism, because
> having parsed out the IDs you'd need to be able to fetch existing docs,
> modify and write back.
>

If the CSV source was responsible for tracking _revs then it could work easily.


Re: Bulk CSV import?

Posted by Brian Candler <B....@pobox.com>.
On Thu, Jan 14, 2010 at 01:37:02PM -0600, Zachary Zolton wrote:
> What would a function (API, really) look like for importing CSV?

Probably something like a cross between the _update and _list APIs:

* raw data is passed in, parsing is completely the responsibility of the
  user-written function (so it could be CSV, XML, or whatever)

* it has a callback which lets it emit a stream of objects like _list,
  and these are emitted as JSON docs and written to the database.

More difficult would be to allow bulk *updates* via this mechanism, because
having parsed out the IDs you'd need to be able to fetch existing docs,
modify and write back.
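
[Editor's note: the cross-breed API sketched above might look something like this. All names are made up; emit() stands in for the _list-style callback, and the runtime would batch emitted docs into bulk writes:]

```javascript
// Hypothetical bulk-import function: raw request body in, stream of docs out.
// Parsing is entirely the function's job, so the body could be CSV, XML, etc.
function importCsv(req, emit) {
  var lines = req.body.trim().split("\n");
  var header = lines[0].split(",");
  lines.slice(1).forEach(function (line) {
    var fields = line.split(",");
    var doc = {};
    header.forEach(function (name, i) { doc[name] = fields[i]; });
    emit(doc); // each emitted object becomes one JSON doc in the database
  });
}

// Driving it with a stubbed emit:
var docs = [];
importCsv({ body: "sku,qty\nA1,3\nB2,7" }, function (d) { docs.push(d); });
console.log(docs);
```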

Re: Bulk CSV import?

Posted by Zachary Zolton <za...@gmail.com>.
What would a function (API, really) look like for importing CSV? Also,
would you need to be able to stream the input data?

On Thu, Jan 14, 2010 at 11:07 AM, Chris Anderson <jc...@apache.org> wrote:
> On Thu, Jan 14, 2010 at 8:45 AM, Brian Candler <B....@pobox.com> wrote:
>> I can use a _list function to export a database in a non-JSON format (e.g.
>> CSV)
>>
>> But what about importing? I see there is _update, but as far as I can see it
>> only acts on a single document.  Is there something that acts at the
>> database level and can build a bulk_save batch?
>>
>
> This isn't the first request I've heard for this functionality. Maybe
> you can add a Jira ticket?
>
> Chris
>
>> Thanks,
>>
>> Brian.
>>
>
>
>
>

Re: Bulk CSV import?

Posted by Chris Anderson <jc...@apache.org>.
On Thu, Jan 14, 2010 at 8:45 AM, Brian Candler <B....@pobox.com> wrote:
> I can use a _list function to export a database in a non-JSON format (e.g.
> CSV)
>
> But what about importing? I see there is _update, but as far as I can see it
> only acts on a single document.  Is there something that acts at the
> database level and can build a bulk_save batch?
>

This isn't the first request I've heard for this functionality. Maybe
you can add a Jira ticket?

Chris

> Thanks,
>
> Brian.
>


