You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@couchdb.apache.org by David Mitchell <mo...@gmail.com> on 2011/04/04 09:58:23 UTC

Uploading CSV data to Couchapp

Hello all,

I'm just about to start on my first (wildly ambitious) Couchapp.  I've had
quite a bit of Erlang experience, but not for the past couple of years so
I'm a bit rusty.  I've had a tiny bit of experience with CouchDB via various
Python scripts, but that's all been treating CouchDB as a "black box"
database so I've currently got little knowledge of what it can do beyond
being a document datastore.

Initially, I'm trying to understand my options for uploading CSV files,
parsing out the content and storing them in CouchDB (one CouchDB record per
line of CSV content).  While it's reasonably straightforward to do this if I
was using e.g. Python as a batch load tool, I don't want to go outside
Javascript for this project if I can avoid it.  The CSV files are anywhere
from 1k-30k records, with 8-10 fields in each that are straightforward
timestamps and floating point numbers.

For an old-school Web app with distinct database and app server layers,
there's a straightforward option - upload the data to a file on the web
server, then process the data out of the file and load it into your
database.  Sure there's variations on this approach such as saving data as a
database blob, but I'm looking for the best CouchApp-specific approach if
one exists.

Options I can see:
- upload the data & save it into a single "uploaded_csv" document in
CouchDB.  Within CouchDB, detect the presence of a new "uploaded_csv"
document, extract and process the content using Javascript and save it into
multiple "data" records, with appropriate indexing, then dispose of the
"uploaded_csv" document or mark it as "processed".  This seems reasonably
straightforward, but I'm not sure how to detect the presence of a new
"uploaded_csv" document (is there a cron equivalent in Couch?) and I'd have
to track the progress of processing each uploaded CSV file to detect when
they've been processed into "data" records
- upload the data & save it into a single "uploaded_csv" document in
CouchDB.  Have CouchDB running embedded in an Erlang app, and use Erlang to
read the "uploaded_csv" data, then send a series of e.g. HTTP PUTs to load
the data into multiple "data" records in CouchDB.  This just seems ugly to
me, but I'm pretty confident I could get it working pretty easily
- upload the data and process it directly into "data" records from a web
page served from CouchApp.  This seems like it could impact on scalability
due to having long-running connections between client and server, but at
least a user would know when their data has been uploaded and processed
successfully with trivial extra work on my part
- upload the data, convert it to JSON on the client using clientside
Javascript, then send it as a set of document uploads (i.e. one document per
CSV record) from the client to the Couch server.  This would let me parse
out any bogus CSV content without sending it to the server, but I'll have
users running browsers on mobile devices and I'm not sure I want to put that
processing load onto the client

Are there any "recommended" approaches for this type of task?  I suspect
this question and others I'll ask have probably already been considered and
dealt with by various experts; if there's a "CouchApp cookbook" with
recommended solutions for these and other common situations, I'd appreciate
a pointer to it so I could start to answer my own questions.

Thanks in advance

Dave M.

Re: Uploading CSV data to Couchapp

Posted by Cliff Williams <cl...@aol.com>.

David,

This is my "snippet" for consuming a continuous _changes feed using 
node. Take a look and and see what you think. We can take it offline to 
discuss in detail if you wish.

Best regards

cliff

var
   sys = require("sys"),
   http = require('http'),
   events = require("events"),
   host = "10.0.0.10",
   port = 5984,
   database="test",
   changesurl = "http://"+ host +":"+ port.toString() +"/"+database+ 
"/_changes?feed=continuous&include_docs=true&heartbeat=10000",
   infourl =  "http://"+ host +":"+port.toString()+"/"+database,
   lastsequence=0,
   stream = new process.EventEmitter(),
   timeout=0;

     sys.puts(changesurl);
// set up connection
     couchdbconnection = http.createClient(port, host);
// handlers to monitor the actual connection
     couchdbconnection.addListener('close', function() {
         couchdbconnection.destroy();
       return stream.emit('seq', info.update_seq);
     });
     couchdbconnection.addListener('error',function() {
         console.log("couchdbconnection listener ....error");
     });

// Send request
         sys.puts(infourl);
         request = couchdbconnection.request('GET', infourl);
         request.end();

// Listen for any responses
     request.addListener("response", function(res) {
/* "data" gets triggered on receipt of line end information so no need 
to do
  any special buffering ...... or at least I dont think so */
     res.addListener('data', function(infodata) {

// info comes in as a string
         info = JSON.parse(infodata);
      });
     });

// Wait until the code used to get the last sequence number completes
     stream.addListener('seq',function (lastseq) {
// set up connection
     couchdbconnection = http.createClient(port, host);
     couchdbconnection.setTimeout(timeout);
// handlers to monitor the actual connection
     couchdbconnection.addListener('close', function() {
     console.log("couchdbconnection Listener .....connection closed");
   });
     couchdbconnection.addListener('error',function() {
         sys.puts("couchdbconnection listener ....error");
     });
// Send request
     request = couchdbconnection.request('GET', 
changesurl+"&since="+lastseq);
     request.end();

// Listen for any responses
     request.addListener("response", function(res) {
/* "data" gets triggered on receipt of line end information so no need 
to do
  any special buffering */
     res.addListener('data', function(changedata) {

// Couch sends an empty line as the "heartbeat"
         if (changedata == "\n") {
                   console.log("heartbeat");
                     }
                     else {
// info comes in as a string so create JSON object to add additional info
                     changes=JSON.parse(changedata);
                     changes.database=database;
                     changedata=JSON.stringify(changes);
                     console.log("Actual data Changed "+ changedata);

               };});});});


On 04/04/11 11:50, David Mitchell wrote:
> Hi Cliff,
>
> What you're describing doing using Node.js sounds exactly like what I 
> want to do - process the uploads transparently in the background, 
> entirely within the context of my Couchapp.
>
> I've searched around for info on using Node.js and CouchDB like this, 
> but only found links that describe the technique in very broad detail. 
>  Do you know of any links that describe it in reasonable detail?
>
> I'm an experienced Python/Ruby/C/C#/... coder and an occasional Erlang 
> coder, but stuff like Node.js is completely new to me - I'm assuming 
> that if I could see how to do this sort of thing in a fairly concise 
> example, it'd trigger the "aha" moment in my brain and then I'd be off 
> and running...
>
> Thanks again
>
> Dave M.
>
> On 4 April 2011 19:34, Cliff Williams <cliffywills@aol.com 
> <ma...@aol.com>> wrote:
>
>     David,
>
>     I hope you are well.
>
>     I think that you have covered your options pretty well.
>
>
>     "- upload the data&  save it into a single "uploaded_csv" document in
>     CouchDB.  Within CouchDB, detect the presence of a new "uploaded_csv"
>     document, extract and process the content using Javascript and
>     save it into
>     multiple "data" records, with appropriate indexing, then dispose
>     of the
>     "uploaded_csv" document or mark it as "processed".  This seems
>     reasonably
>     straightforward, but I'm not sure how to detect the presence of a new
>     "uploaded_csv" document"
>
>     This is the approach that I would take.
>
>     Couchdb has a quite excellent _changes feed which will notify you
>     (or can be set up to notify) in real time on any changes made to
>     specific databases.
>
>     I personally would use Node.js to monitor the changes feed and
>     process your csv files (Javascript and very fast) but you could of
>     course use anything (erlang python (Ruby's CSV processing
>     libraries are also quite good)).
>
>     best regards
>
>     Cliff
>
>     On 04/04/11 08:58, David Mitchell wrote:
>
>         Hello all,
>
>         I'm just about to start on my first (wildly ambitious)
>         Couchapp.  I've had
>         quite a bit of Erlang experience, but not for the past couple
>         of years so
>         I'm a bit rusty.  I've had a tiny bit of experience with
>         CouchDB via various
>         Python scripts, but that's all been treating CouchDB as a
>         "black box"
>         database so I've currently got little knowledge of what it can
>         do beyond
>         being a document datastore.
>
>         Initially, I'm trying to understand my options for uploading
>         CSV files,
>         parsing out the content and storing them in CouchDB (one
>         CouchDB record per
>         line of CSV content).  While it's reasonably straightforward
>         to do this if I
>         was using e.g. Python as a batch load tool, I don't want to go
>         outside
>         Javascript for this project if I can avoid it.  The CSV files
>         are anywhere
>         from 1k-30k records, with 8-10 fields in each that are
>         straightforward
>         timestamps and floating point numbers.
>
>         For an old-school Web app with distinct database and app
>         server layers,
>         there's a straightforward option - upload the data to a file
>         on the web
>         server, then process the data out of the file and load it into
>         your
>         database.  Sure there's variations on this approach such as
>         saving data as a
>         database blob, but I'm looking for the best CouchApp-specific
>         approach if
>         one exists.
>
>         Options I can see:
>         - upload the data&  save it into a single "uploaded_csv"
>         document in
>         CouchDB.  Within CouchDB, detect the presence of a new
>         "uploaded_csv"
>         document, extract and process the content using Javascript and
>         save it into
>         multiple "data" records, with appropriate indexing, then
>         dispose of the
>         "uploaded_csv" document or mark it as "processed".  This seems
>         reasonably
>         straightforward, but I'm not sure how to detect the presence
>         of a new
>         "uploaded_csv" document (is there a cron equivalent in Couch?)
>         and I'd have
>         to track the progress of processing each uploaded CSV file to
>         detect when
>         they've been processed into "data" records
>         - upload the data&  save it into a single "uploaded_csv"
>         document in
>         CouchDB.  Have CouchDB running embedded in an Erlang app, and
>         use Erlang to
>         read the "uploaded_csv" data, then send a series of e.g. HTTP
>         PUTs to load
>         the data into multiple "data" records in CouchDB.  This just
>         seems ugly to
>         me, but I'm pretty confident I could get it working pretty easily
>         - upload the data and process it directly into "data" records
>         from a web
>         page served from CouchApp.  This seems like it could impact on
>         scalability
>         due to having long-running connections between client and
>         server, but at
>         least a user would know when their data has been uploaded and
>         processed
>         successfully with trivial extra work on my part
>         - upload the data, convert it to JSON on the client using
>         clientside
>         Javascript, then send it as a set of document uploads (i.e.
>         one document per
>         CSV record) from the client to the Couch server.  This would
>         let me parse
>         out any bogus CSV content without sending it to the server,
>         but I'll have
>         users running browsers on mobile devices and I'm not sure I
>         want to put that
>         processing load onto the client
>
>         Are there any "recommended" approaches for this type of task?
>          I suspect
>         this question and others I'll ask have probably already been
>         considered and
>         dealt with by various experts; if there's a "CouchApp
>         cookbook" with
>         recommended solutions for these and other common situations,
>         I'd appreciate
>         a pointer to it so I could start to answer my own questions.
>
>         Thanks in advance
>
>         Dave M.
>
>

Re: Uploading CSV data to Couchapp

Posted by David Mitchell <mo...@gmail.com>.

Hi Cliff,

What you're describing doing using Node.js sounds exactly like what I want
to do - process the uploads transparently in the background, entirely within
the context of my Couchapp.

I've searched around for info on using Node.js and CouchDB like this, but
only found links that describe the technique in very broad detail.  Do you
know of any links that describe it in reasonable detail?

I'm an experienced Python/Ruby/C/C#/... coder and an occasional Erlang
coder, but stuff like Node.js is completely new to me - I'm assuming that if
I could see how to do this sort of thing in a fairly concise example, it'd
trigger the "aha" moment in my brain and then I'd be off and running...

Thanks again

Dave M.

On 4 April 2011 19:34, Cliff Williams <cl...@aol.com> wrote:

> David,
>
> I hope you are well.
>
> I think that you have covered your options pretty well.
>
>
> "- upload the data&  save it into a single "uploaded_csv" document in
> CouchDB.  Within CouchDB, detect the presence of a new "uploaded_csv"
> document, extract and process the content using Javascript and save it into
> multiple "data" records, with appropriate indexing, then dispose of the
> "uploaded_csv" document or mark it as "processed".  This seems reasonably
> straightforward, but I'm not sure how to detect the presence of a new
> "uploaded_csv" document"
>
> This is the approach that I would take.
>
> Couchdb has a quite excellent _changes feed which will notify you (or can
> be set up to notify) in real time on any changes made to specific databases.
>
> I personally would use Node.js to monitor the changes feed and process your
> csv files (Javascript and very fast) but you could of course use anything
> (erlang python (Ruby's CSV processing libraries are also quite good)).
>
> best regards
>
> Cliff
>
> On 04/04/11 08:58, David Mitchell wrote:
>
>> Hello all,
>>
>> I'm just about to start on my first (wildly ambitious) Couchapp.  I've had
>> quite a bit of Erlang experience, but not for the past couple of years so
>> I'm a bit rusty.  I've had a tiny bit of experience with CouchDB via
>> various
>> Python scripts, but that's all been treating CouchDB as a "black box"
>> database so I've currently got little knowledge of what it can do beyond
>> being a document datastore.
>>
>> Initially, I'm trying to understand my options for uploading CSV files,
>> parsing out the content and storing them in CouchDB (one CouchDB record
>> per
>> line of CSV content).  While it's reasonably straightforward to do this if
>> I
>> was using e.g. Python as a batch load tool, I don't want to go outside
>> Javascript for this project if I can avoid it.  The CSV files are anywhere
>> from 1k-30k records, with 8-10 fields in each that are straightforward
>> timestamps and floating point numbers.
>>
>> For an old-school Web app with distinct database and app server layers,
>> there's a straightforward option - upload the data to a file on the web
>> server, then process the data out of the file and load it into your
>> database.  Sure there's variations on this approach such as saving data as
>> a
>> database blob, but I'm looking for the best CouchApp-specific approach if
>> one exists.
>>
>> Options I can see:
>> - upload the data&  save it into a single "uploaded_csv" document in
>> CouchDB.  Within CouchDB, detect the presence of a new "uploaded_csv"
>> document, extract and process the content using Javascript and save it
>> into
>> multiple "data" records, with appropriate indexing, then dispose of the
>> "uploaded_csv" document or mark it as "processed".  This seems reasonably
>> straightforward, but I'm not sure how to detect the presence of a new
>> "uploaded_csv" document (is there a cron equivalent in Couch?) and I'd
>> have
>> to track the progress of processing each uploaded CSV file to detect when
>> they've been processed into "data" records
>> - upload the data&  save it into a single "uploaded_csv" document in
>> CouchDB.  Have CouchDB running embedded in an Erlang app, and use Erlang
>> to
>> read the "uploaded_csv" data, then send a series of e.g. HTTP PUTs to load
>> the data into multiple "data" records in CouchDB.  This just seems ugly to
>> me, but I'm pretty confident I could get it working pretty easily
>> - upload the data and process it directly into "data" records from a web
>> page served from CouchApp.  This seems like it could impact on scalability
>> due to having long-running connections between client and server, but at
>> least a user would know when their data has been uploaded and processed
>> successfully with trivial extra work on my part
>> - upload the data, convert it to JSON on the client using clientside
>> Javascript, then send it as a set of document uploads (i.e. one document
>> per
>> CSV record) from the client to the Couch server.  This would let me parse
>> out any bogus CSV content without sending it to the server, but I'll have
>> users running browsers on mobile devices and I'm not sure I want to put
>> that
>> processing load onto the client
>>
>> Are there any "recommended" approaches for this type of task?  I suspect
>> this question and others I'll ask have probably already been considered
>> and
>> dealt with by various experts; if there's a "CouchApp cookbook" with
>> recommended solutions for these and other common situations, I'd
>> appreciate
>> a pointer to it so I could start to answer my own questions.
>>
>> Thanks in advance
>>
>> Dave M.
>>
>>

Re: Uploading CSV data to Couchapp

Posted by Cliff Williams <cl...@aol.com>.

David,

I hope you are well.

I think that you have covered your options pretty well.

"- upload the data&  save it into a single "uploaded_csv" document in
CouchDB.  Within CouchDB, detect the presence of a new "uploaded_csv"
document, extract and process the content using Javascript and save it into
multiple "data" records, with appropriate indexing, then dispose of the
"uploaded_csv" document or mark it as "processed".  This seems reasonably
straightforward, but I'm not sure how to detect the presence of a new
"uploaded_csv" document"

This is the approach that I would take.

Couchdb has a quite excellent _changes feed which will notify you (or 
can be set up to notify) in real time on any changes made to specific 
databases.

I personally would use Node.js to monitor the changes feed and process 
your csv files (Javascript and very fast) but you could of course use 
anything (erlang python (Ruby's CSV processing libraries are also quite 
good)).

best regards

Cliff
On 04/04/11 08:58, David Mitchell wrote:
> Hello all,
>
> I'm just about to start on my first (wildly ambitious) Couchapp.  I've had
> quite a bit of Erlang experience, but not for the past couple of years so
> I'm a bit rusty.  I've had a tiny bit of experience with CouchDB via various
> Python scripts, but that's all been treating CouchDB as a "black box"
> database so I've currently got little knowledge of what it can do beyond
> being a document datastore.
>
> Initially, I'm trying to understand my options for uploading CSV files,
> parsing out the content and storing them in CouchDB (one CouchDB record per
> line of CSV content).  While it's reasonably straightforward to do this if I
> was using e.g. Python as a batch load tool, I don't want to go outside
> Javascript for this project if I can avoid it.  The CSV files are anywhere
> from 1k-30k records, with 8-10 fields in each that are straightforward
> timestamps and floating point numbers.
>
> For an old-school Web app with distinct database and app server layers,
> there's a straightforward option - upload the data to a file on the web
> server, then process the data out of the file and load it into your
> database.  Sure there's variations on this approach such as saving data as a
> database blob, but I'm looking for the best CouchApp-specific approach if
> one exists.
>
> Options I can see:
> - upload the data&  save it into a single "uploaded_csv" document in
> CouchDB.  Within CouchDB, detect the presence of a new "uploaded_csv"
> document, extract and process the content using Javascript and save it into
> multiple "data" records, with appropriate indexing, then dispose of the
> "uploaded_csv" document or mark it as "processed".  This seems reasonably
> straightforward, but I'm not sure how to detect the presence of a new
> "uploaded_csv" document (is there a cron equivalent in Couch?) and I'd have
> to track the progress of processing each uploaded CSV file to detect when
> they've been processed into "data" records
> - upload the data&  save it into a single "uploaded_csv" document in
> CouchDB.  Have CouchDB running embedded in an Erlang app, and use Erlang to
> read the "uploaded_csv" data, then send a series of e.g. HTTP PUTs to load
> the data into multiple "data" records in CouchDB.  This just seems ugly to
> me, but I'm pretty confident I could get it working pretty easily
> - upload the data and process it directly into "data" records from a web
> page served from CouchApp.  This seems like it could impact on scalability
> due to having long-running connections between client and server, but at
> least a user would know when their data has been uploaded and processed
> successfully with trivial extra work on my part
> - upload the data, convert it to JSON on the client using clientside
> Javascript, then send it as a set of document uploads (i.e. one document per
> CSV record) from the client to the Couch server.  This would let me parse
> out any bogus CSV content without sending it to the server, but I'll have
> users running browsers on mobile devices and I'm not sure I want to put that
> processing load onto the client
>
> Are there any "recommended" approaches for this type of task?  I suspect
> this question and others I'll ask have probably already been considered and
> dealt with by various experts; if there's a "CouchApp cookbook" with
> recommended solutions for these and other common situations, I'd appreciate
> a pointer to it so I could start to answer my own questions.
>
> Thanks in advance
>
> Dave M.
>

Re: Uploading CSV data to Couchapp

Posted by Gabor Ratky <ga...@secretsaucepartners.com>.

Dave,

we're doing something similar, where we push data we initially receive as CSV into CouchDB. As we receive multiple CSVs and need to do transformations on them, we first load it into a SQLite DB locally (where the processing happens), perform the transformations that include some relational queries (hence the SQLite) then iterate through the data and push it into CouchDB using a (hierarchical) document structure that makes the most sense for our documents. We're using Ruby and the couchrest gem (just because we've been using it for other purposes and this was the easiest), which batches up document creates and updates into batches of 500 documents (or whatever fits you) to reduce the HTTP overhead.

As we also need to do updates to documents and check for their existence, this can incur quite a few HTTP requests with big payloads so closer you are to CouchDB the better (preferably local to it).

I haven't really thought of doing this 'inside CouchDB' and not sure how that would be possible. Talking to CouchDB with whatever tools (we also use node.js and would be a great candidate for this as well) would be done through the HTTP API in my mind. If there are any other interesting ways of doing processing on CouchDB documents (such as a hypothetical 'uploaded_csv' document) that is as close to CouchDB as the JS View Engine is, I'd love to hear it.

Gabor

On Apr 4, 2011, at 9:58 AM, David Mitchell wrote:

> Hello all,
> 
> I'm just about to start on my first (wildly ambitious) Couchapp.  I've had
> quite a bit of Erlang experience, but not for the past couple of years so
> I'm a bit rusty.  I've had a tiny bit of experience with CouchDB via various
> Python scripts, but that's all been treating CouchDB as a "black box"
> database so I've currently got little knowledge of what it can do beyond
> being a document datastore.
> 
> Initially, I'm trying to understand my options for uploading CSV files,
> parsing out the content and storing them in CouchDB (one CouchDB record per
> line of CSV content).  While it's reasonably straightforward to do this if I
> was using e.g. Python as a batch load tool, I don't want to go outside
> Javascript for this project if I can avoid it.  The CSV files are anywhere
> from 1k-30k records, with 8-10 fields in each that are straightforward
> timestamps and floating point numbers.
> 
> For an old-school Web app with distinct database and app server layers,
> there's a straightforward option - upload the data to a file on the web
> server, then process the data out of the file and load it into your
> database.  Sure there's variations on this approach such as saving data as a
> database blob, but I'm looking for the best CouchApp-specific approach if
> one exists.
> 
> Options I can see:
> - upload the data & save it into a single "uploaded_csv" document in
> CouchDB.  Within CouchDB, detect the presence of a new "uploaded_csv"
> document, extract and process the content using Javascript and save it into
> multiple "data" records, with appropriate indexing, then dispose of the
> "uploaded_csv" document or mark it as "processed".  This seems reasonably
> straightforward, but I'm not sure how to detect the presence of a new
> "uploaded_csv" document (is there a cron equivalent in Couch?) and I'd have
> to track the progress of processing each uploaded CSV file to detect when
> they've been processed into "data" records
> - upload the data & save it into a single "uploaded_csv" document in
> CouchDB.  Have CouchDB running embedded in an Erlang app, and use Erlang to
> read the "uploaded_csv" data, then send a series of e.g. HTTP PUTs to load
> the data into multiple "data" records in CouchDB.  This just seems ugly to
> me, but I'm pretty confident I could get it working pretty easily
> - upload the data and process it directly into "data" records from a web
> page served from CouchApp.  This seems like it could impact on scalability
> due to having long-running connections between client and server, but at
> least a user would know when their data has been uploaded and processed
> successfully with trivial extra work on my part
> - upload the data, convert it to JSON on the client using clientside
> Javascript, then send it as a set of document uploads (i.e. one document per
> CSV record) from the client to the Couch server.  This would let me parse
> out any bogus CSV content without sending it to the server, but I'll have
> users running browsers on mobile devices and I'm not sure I want to put that
> processing load onto the client
> 
> Are there any "recommended" approaches for this type of task?  I suspect
> this question and others I'll ask have probably already been considered and
> dealt with by various experts; if there's a "CouchApp cookbook" with
> recommended solutions for these and other common situations, I'd appreciate
> a pointer to it so I could start to answer my own questions.
> 
> Thanks in advance
> 
> Dave M.