Posted to user@couchdb.apache.org by Fidel Viegas <fi...@gmail.com> on 2009/10/27 18:04:34 UTC

CouchDB and Census software

Hi all,

I have been following the discussions on CouchDB, and in fact I have
purchased the early-access books from Apress and O'Reilly. I have read
one of the books and have a rough understanding of how CouchDB works. I
had some exposure to map/reduce in Python before, but had never put it
into practice; now I more or less get how it all fits together. What I
am working on right now is a system that will operate in 18 provinces,
each of which has a number of districts. I have to implement a system
that will gather census-style data on food safety and nutrition, and
then generate reports from it. Most of these projects have so far used
Excel: they feed data into a spreadsheet and then use filters to
extract whatever reports they need. I want to replace all of that with
something more sophisticated and modern; in other words, a distributed
system that replicates data from districts to provinces and from
provinces to a national database. The replication is one way only, and
perhaps each province and each district will have a replication server
as a backup.

Now, I know that CouchDB would be perfect for this, but I would like
to hear from anyone who has experience building a system like this,
taking into account the following restrictions:

1) Power isn't stable across the whole country. Even in the capital we
experience power cuts.
2) Communications are really bad. We have mobile communications, but
sometimes they don't work. Even the regular cable and DSL connections
don't work properly. The most stable option is VSAT Internet, which is
very expensive but will have to be used at some sites.

The replication is uni-directional. That is, from district to province
and from province to nationwide db.

Would you suggest using CouchDB for a system like this? And if yes,
how would you tackle it? What would you suggest?

Thanks in advance,

Fidel.

Re: CouchDB and Census software

Posted by Fidel Viegas <fi...@gmail.com>.
Hi Noah,

> You could try HTTP over SMTP!
>

I have never done this before, but thanks for bringing it up. I will
look into it.

Regards,

Fidel.

Re: CouchDB and Census software

Posted by Noah Slater <ns...@tumbolia.org>.
On 27 Oct 2009, at 22:36, Leo Simons wrote:

> For locations where there is _really_ bad connectivity, e-mail
> sometimes still works ok after HTTP breaks down (at least that was the
> case in parts of Africa a few years ago). If you find you have that
> problem, building a simple e-mail interface that allows submitting and
> retrieving CSV data can work well.

You could try HTTP over SMTP!

Re: CouchDB and Census software

Posted by Fidel Viegas <fi...@gmail.com>.
Hi Leo,

First and foremost, thanks for your reply.

> Yes, it's an excellent fit.
>
> Besides the use of HTTP and replication, the free-form document format
> mirrors the evolvability of Excel spreadsheets pretty well. You should
> be able to support adding and removing columns pretty well.
>

I was reading more about it, and yes, I think it is an excellent fit.
Each survey is a self-contained document, which fits the CouchDB
document model well.

There is one thing, though, that takes up considerable space: each
document we insert repeats the field names, which makes the database
grow faster than an RDBMS counterpart would. Nonetheless, I really
like the way CouchDB works, and I am pretty excited to use it on this
project.
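To make that field-name overhead concrete, here is a minimal sketch of
what one survey record might look like as a CouchDB document. All field
names and values are hypothetical, purely for illustration:

```python
import json

# A hypothetical food-security survey record as one self-contained
# CouchDB document. Every document repeats these field names, which is
# the storage overhead mentioned above.
survey_doc = {
    "_id": "survey-2009-10-27-district04-0001",
    "type": "survey",
    "province": "Province A",
    "district": "District 04",
    "date": "2009-10-27",
    "household_size": 6,
    "meals_per_day": 2,
}

# CouchDB stores documents as JSON, so the wire form is simply:
payload = json.dumps(survey_doc)
```

The keys are repeated in every document, but they also mean each record
carries its own schema, which is what makes the free-form model work.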

> Depending on the size of these spreadsheets, you may want to make a
> separate CouchDB database per province or per district. Writing the
> map/reduce jobs will be a bit more tedious, but the data will be
> easier to manage and replicate around.
>

After reading a few more chapters of CouchDB: The Definitive Guide, I
came to the same conclusion. Each province is independent of the
others, and so is each district. All they need is to be able to
generate reports district-wide, province-wide and nationwide. A
district does not need to know about other districts, and the same
goes for the provinces. But provinces will gather data from the
districts, and the central db from the provinces.


> For your first version I would install a single CouchDB on what will
> eventually be your main/central server; it can serve all your
> databases. Get the best possible connectivity to that server. Then,
> write a little (web-based?) UI on top that allows submitting a
> spreadsheet directly into this database. Then, add a function to
> export the data out again.
>
> Importing/exporting Excel can be a bit tricky to get right in a
> webapp; the easiest and most robust approach is actually to have
> users select all the data in the spreadsheet, copy it to the
> clipboard, and then paste it into a <textarea> on a web page. In
> particular, that helps a lot with character-encoding conversions if
> your user is on Windows, the spreadsheet is in Excel, and the browser
> is IE or Firefox.
>

This is actually a good idea. I think I will start with a main
central db. Luckily for me, I will not need to feed the application
with spreadsheets. This is a project that was implemented in another
African country, and here it will start from scratch. They normally
use Excel and are all excited about it, thinking that Excel is a
panacea. Usually these are older guys who haven't really had contact
with Clipper or recent RDBMSs; they are usually sceptical about moving
away from Excel, and they aren't that technology-aware. The consultant
I am working with has given me their spreadsheets to analyse, but we
will be building something new, which is good, as I can design it
around the CouchDB data model from the start.

> This first system could then go into production. It will give you a
> smooth migration path for those people who do have enough internet
> access to connect to your central server using their web browser.
>

Now, this is where the problem comes in. They don't use anything right
now; we will implement something for them. For gathering data in the
townships and villages that will then be fed into the district
database, I am thinking of using mobile units, perhaps PDAs or
netbooks. These will need an application to gather the data and feed
it into the district database. The district database then feeds the
province db and the national db. I said earlier that the flow would be
from district to province to national db, but on reflection, no data
will actually be gathered at the province level. All data will be
gathered district-wise, and that in turn will feed the province for
analysis, unless we summarize the data at the province level and feed
that to the national db.

This is the first time I have worked on a census-like application, so
I don't really know if that makes any sense. Should we summarize the
data before feeding it to the national db, or should we feed the raw
data? Perhaps feeding the raw data will give the national analysis
more to work on. Does this make any sense?

> Obviously you can make prettier UIs than an Excel import/export once
> you have the data in CouchDB, and people will soon stop using Excel :).
>

That's the idea: to make prettier UIs.

> You can then start to set up the "slave" installations in those places
> that have the bad connectivity. I would suggest using push-based
> replication from the slave to the master. You can use a cron job to
> trigger this replication. Set up this way, it is the most robust
> against bad connections.
>

Thanks for the tip. It makes sense.

> Think about bi-directional replication too, though. You could add
> pull-based replication from the master site back to the slave. If
> bandwidth allows it, the local sites could have a full copy of the
> data, which may be nice when the network connection is down. Or you
> might clean up the data on the central server and then push an update
> back to the slave site. Due to the nature of your data structures,
> it's not likely that you will ever really have to deal with document
> conflicts.
>

Like I said above, the districts don't need data from other districts.
They only need to feed the provincial database for analysis, and they
will do analysis based on their own district. Maybe I am not getting
the picture.

> For locations where there is _really_ bad connectivity, e-mail
> sometimes still works ok after HTTP breaks down (at least that was the
> case in parts of Africa a few years ago). If you find you have that
> problem, building a simple e-mail interface that allows submitting and
> retrieving CSV data can work well.
>

It is still the case. Connectivity is really bad across the country.
You can work, but we have been experiencing quite a few problems over
the last 4 months. The most reliable option is VSAT, which is very,
very expensive.

> Hope this helps :)

Yes, it helped quite a lot. I now have a different picture from the
one I had before. I think I will start with your first suggestion of
creating a central db first, and then work from there.

Thanks a lot for your input.

Fidel.

Re: CouchDB and Census software

Posted by Leo Simons <ma...@leosimons.com>.
Hey Fidel,

On Tue, Oct 27, 2009 at 5:04 PM, Fidel Viegas <fi...@gmail.com> wrote:
> [snip explanation of use case of replacing lots of spreadsheets with CouchDB and replication]
...
> 1) Power isn't stable across the whole country. Even in the capital we
> experience power cuts.
> 2) Communications are really bad. We have mobile communications, but
> sometimes they don't work. Even the regular cable and DSL connections
> don't work properly. The most stable option is VSAT Internet, which is
> very expensive but will have to be used at some sites.
>
> The replication is uni-directional. That is, from district to province
> and from province to nationwide db.
>
> Would you suggest using CouchDB for a system like this?

Yes, it's an excellent fit.

Besides the use of HTTP and replication, the free-form document format
mirrors the evolvability of Excel spreadsheets pretty well. You should
be able to support adding and removing columns pretty well.

> And if yes, how would you tackle it? What would you suggest?

Depending on the size of these spreadsheets, you may want to make a
separate CouchDB database per province or per district. Writing the
map/reduce jobs will be a bit more tedious, but the data will be
easier to manage and replicate around.
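As a sketch of what one such per-district database's reporting view
could look like: the design document below is hypothetical (the view
and field names are invented for illustration), with the map/reduce
functions written as JavaScript strings, the form CouchDB expects.

```python
# Hypothetical design document for one district database. "_sum" is
# CouchDB's built-in reduce function.
design_doc = {
    "_id": "_design/reports",
    "views": {
        "households_by_date": {
            # Emit one row per survey document, keyed by date.
            "map": (
                "function(doc) {"
                "  if (doc.type === 'survey') {"
                "    emit(doc.date, doc.household_size);"
                "  }"
                "}"
            ),
            # Sum household sizes per key when queried with group=true.
            "reduce": "_sum",
        }
    },
}
```

Because every district keeps the same view in its own database, the
same design document can be pushed to each district, and a
province-level report can query each district database in turn.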

For your first version I would install a single CouchDB on what will
eventually be your main/central server; it can serve all your
databases. Get the best possible connectivity to that server. Then,
write a little (web-based?) UI on top that allows submitting a
spreadsheet directly into this database. Then, add a function to
export the data out again.

Importing/exporting Excel can be a bit tricky to get right in a
webapp; the easiest and most robust approach is actually to have
users select all the data in the spreadsheet, copy it to the
clipboard, and then paste it into a <textarea> on a web page. In
particular, that helps a lot with character-encoding conversions if
your user is on Windows, the spreadsheet is in Excel, and the browser
is IE or Firefox.
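The clipboard-paste approach is straightforward to handle server-side,
because an Excel selection lands on the clipboard as tab-separated
text with a header row. A minimal sketch (the field names are assumed,
not from this thread):

```python
import csv
import io

def rows_to_docs(pasted_text):
    """Turn text pasted from an Excel selection (tab-separated values,
    first row = column headers) into CouchDB-style document dicts."""
    reader = csv.DictReader(io.StringIO(pasted_text), delimiter="\t")
    # Tag each row so map functions can filter on doc.type later.
    return [dict(row, type="survey") for row in reader]

# Example: two rows pasted into a <textarea>.
pasted = "district\thousehold_size\nDistrict 04\t6\nDistrict 05\t4\n"
docs = rows_to_docs(pasted)
```

Each resulting dict can then be POSTed to the database as-is, or in
one batch via CouchDB's bulk-document endpoint.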

This first system could then go into production. It will give you a
smooth migration path for those people who do have enough internet
access to connect to your central server using their web browser.

Obviously you can make prettier UIs than an Excel import/export once
you have the data in CouchDB, and people will soon stop using Excel :).

You can then start to set up the "slave" installations in those places
that have the bad connectivity. I would suggest using push-based
replication from the slave to the master. You can use a cron job to
trigger this replication. Set up this way, it is the most robust
against bad connections.
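A sketch of what that cron-triggered push could look like. The URLs and
database names below are assumptions; the request shape is a POST to
the local CouchDB's `/_replicate` endpoint with a source and target.

```python
import json
from urllib import request

def replication_payload(source_db, target_url):
    """Body for a POST to the local CouchDB's /_replicate endpoint."""
    return json.dumps({"source": source_db, "target": target_url})

def push_replicate(source_db, target_url):
    # Push from the local "slave" CouchDB to the central server.
    # On a flaky link urlopen raises; the next cron run simply retries.
    req = request.Request(
        "http://localhost:5984/_replicate",
        data=replication_payload(source_db, target_url).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)

# An illustrative crontab line to trigger it every 30 minutes:
#   */30 * * * * python push_replicate.py
```

Because CouchDB replication is incremental, a run that dies mid-way
loses nothing; the next attempt picks up where the last one stopped,
which is what makes this setup robust against bad connections.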

Think about bi-directional replication too, though. You could add
pull-based replication from the master site back to the slave. If
bandwidth allows it, the local sites could have a full copy of the
data, which may be nice when the network connection is down. Or you
might clean up the data on the central server and then push an update
back to the slave site. Due to the nature of your data structures,
it's not likely that you will ever really have to deal with document
conflicts.

For locations where there is _really_ bad connectivity, e-mail
sometimes still works ok after HTTP breaks down (at least that was the
case in parts of Africa a few years ago). If you find you have that
problem, building a simple e-mail interface that allows submitting and
retrieving CSV data can work well.
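One way to sketch such an e-mail fallback with only the Python
standard library; the addresses, subject and field names are all
illustrative:

```python
import csv
import io
from email.message import EmailMessage

def csv_email(rows, sender, recipient):
    """Pack survey rows (list of dicts) into an e-mail carrying a CSV
    attachment, ready to hand to an SMTP client."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    msg = EmailMessage()
    msg["From"], msg["To"] = sender, recipient
    msg["Subject"] = "survey data"
    msg.add_attachment(buf.getvalue(), filename="surveys.csv")
    return msg

def parse_csv_email(msg):
    """Recover the survey rows from such a message at the central site."""
    for part in msg.iter_attachments():
        return list(csv.DictReader(io.StringIO(part.get_content())))
    return []
```

The recovered rows can then be inserted into the central database the
same way as rows pasted through the web UI.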

Hope this helps :)


cheers,


Leo