
Posted to user@couchdb.apache.org by fana <fa...@2flub.org> on 2009/06/13 16:13:51 UTC

Looking for advice using CouchDB for a FreeSoftware project

Hi,

I heard about CouchDB in a German podcast[1] last week,
and I think I have found the last missing piece for a FreeSoftware project[2].

  Background:

There is a program called "SubDownloader"[3], which is an XML-RPC client
for the XML-RPC server of http://www.opensubtitles.org . It works like this:

 * You have a movie and you want a subtitle for it.
 * You open your movie with SubDownloader.
 * SubDownloader hashes[4] your movie file.
 * SubDownloader asks the XML-RPC server whether it has a subtitle for this
   movie hash and, if so, downloads it.
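
For readers who do not want to follow [4], here is a rough Python sketch of the hashing step. It assumes the scheme as I understand it from [4]: the file size plus the 64-bit little-endian sums of the first and last 64 KiB, truncated to 64 bits and printed as 16 hex digits. The function name movie_hash and the error handling are my own; please check the details against [4] before relying on it.

```python
import os
import struct

CHUNK = 64 * 1024  # number of bytes read from each end of the file


def movie_hash(path):
    """Return a 16-hex-digit hash: file size plus 64-bit sums of both file ends."""
    size = os.path.getsize(path)
    if size < 2 * CHUNK:
        # Assumption: files shorter than two chunks are rejected.
        raise ValueError("file too small to hash")
    total = size
    with open(path, "rb") as f:
        for offset in (0, size - CHUNK):
            f.seek(offset)
            data = f.read(CHUNK)
            # Interpret the chunk as little-endian unsigned 64-bit integers.
            total += sum(struct.unpack("<%dQ" % (CHUNK // 8), data))
    # Keep only the low 64 bits, as a fixed-width hex string.
    return "%016x" % (total & 0xFFFFFFFFFFFFFFFF)
```

Only the two ends of the file are read, so the hash is cheap even for large movies. The resulting string is what would be used as the "_id" of a MovieFile document in the draft.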

The problem now is that the opensubtitles.org infrastructure can't handle the
load anymore[5], and it's not possible to scale it.

We are now re-implementing the XML-RPC server in Python, but designing the
database was a big headache, because we don't want to "navigate the ship into
the same iceberg" as opensubtitles.org did.

I think that CouchDB is perfect for us in terms of scalability,
replication, collaboration and design changes in the future.

As I want to eliminate as many mistakes as possible from the beginning,
I would like to ask for advice here. I have created a first draft of what
our database would look like.

Would this draft work out with CouchDB or is there a better way?

SubtitleFile
------------

{
  "_id"              : "String",       (MD5 hash of subtitle file)
  "type"             : "subtitlefile",
  "format"           : "String",       (e.g. "SubRip")
  "language"         : "String",       (ISO 639-2 code)
  "hearing_impaired" : "String",       ("True" or "False")
  "fansub"           : "String",       ("True" or "False")
  "uploader"         : "String",
  "_attachments"     :

  {
    "subtitle.srt":
    {
      "content_type" : "text\/plain",
      "data"         : "VGhpcyBpcyBhIGJhc2U2NCBlbmNvZGVkIHRleHQ="
    }
  }

}
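
To make the draft concrete, here is a small Python sketch that builds such a SubtitleFile document. It only uses the standard library and does not talk to CouchDB. The helper name make_subtitle_doc, its parameters, and the default values are made up for illustration; the two flags are kept as strings, as in the draft above.

```python
import base64
import hashlib
import json


def make_subtitle_doc(srt_bytes, language, fmt="SubRip", uploader="anonymous",
                      hearing_impaired="False", fansub="False"):
    """Build a SubtitleFile document with the subtitle as an inline attachment."""
    return {
        # The MD5 hash of the subtitle file doubles as the document id.
        "_id": hashlib.md5(srt_bytes).hexdigest(),
        "type": "subtitlefile",
        "format": fmt,
        "language": language,
        "hearing_impaired": hearing_impaired,  # string, as in the draft
        "fansub": fansub,                      # string, as in the draft
        "uploader": uploader,
        "_attachments": {
            "subtitle.srt": {
                "content_type": "text/plain",
                # Inline attachments carry their content base64-encoded.
                "data": base64.b64encode(srt_bytes).decode("ascii"),
            }
        },
    }


doc = make_subtitle_doc(b"This is a base64 encoded text", "eng")
print(json.dumps(doc, indent=2))
```

Because the "_id" is derived from the file content, uploading the same subtitle twice maps to the same document instead of creating a duplicate.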

  THERE IS NO HOSTING OF MOVIE FILES OF THE MOVIE INDUSTRY
  (just people's file hashes)

MovieFile
---------

{
  "_id"      : "String",               (Computed hash of movie file)
  "type"     : "moviefile",
  "length"   :  number,                (seconds)
  "filesize" :  number,                (kb)
  "fps"      :  number, 
  "uploader" : "String"
}

Relation
--------

{
                                       (here "_id" will be generated by CouchDB)
  "type"            : "relation",
  "id_subtitlefile" : "String",        (the MD5 hash of the subtitle)
  "id_moviefile"    : "String"         (the hash of the movie file)
}


[1] http://chaosradio.ccc.de/cre125.html
[2] https://launchpad.net/osclone
[3] http://subdownloader.net
[4] http://trac.opensubtitles.org/projects/opensubtitles/wiki/HashSourceCodes
[5] http://forum.opensubtitles.org/viewtopic.php?t=1775

RE: Looking for advice using CouchDB for a FreeSoftware project

Posted by fana <fa...@2flub.org>.

> I think I'd make the movie hash a regular field in the document instead
> of the _id. Then you can just have multiple subtitle documents for a movie
> and you could create a view that emit this field as a key so you can query
> for it.

The problem is that there is a many-to-many relation between them.
One MovieFile can have many suitable SubtitleFiles and vice versa.

With the "relation" document I don't have to make sure that existing hashes
don't get lost, so it is easier for someone to add further matching hashes.
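
As a rough illustration of the idea, here is a plain-Python sketch (no CouchDB involved) that indexes "relation" documents in both directions. The ids such as "sub-a" and "movie-1" are placeholders, and index_relations is a made-up helper; in CouchDB the same grouping would come from views keyed on id_moviefile and id_subtitlefile.

```python
from collections import defaultdict

# Placeholder relation documents; real ids would be the file hashes.
relations = [
    {"type": "relation", "id_subtitlefile": "sub-a", "id_moviefile": "movie-1"},
    {"type": "relation", "id_subtitlefile": "sub-a", "id_moviefile": "movie-2"},
    {"type": "relation", "id_subtitlefile": "sub-b", "id_moviefile": "movie-1"},
]


def index_relations(docs):
    """Group relation documents by movie and by subtitle."""
    by_movie = defaultdict(set)
    by_subtitle = defaultdict(set)
    for doc in docs:
        if doc.get("type") != "relation":
            continue  # ignore subtitlefile and moviefile documents
        by_movie[doc["id_moviefile"]].add(doc["id_subtitlefile"])
        by_subtitle[doc["id_subtitlefile"]].add(doc["id_moviefile"])
    return by_movie, by_subtitle


by_movie, by_subtitle = index_relations(relations)
print(sorted(by_movie["movie-1"]))     # subtitles that fit movie-1
print(sorted(by_subtitle["sub-a"]))    # movies that sub-a fits
```

Adding a further match means adding one more small document to the list; none of the existing documents has to be read or rewritten.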

> JSON has booleans, so I'd make fansub a boolean instead of a True or
> False string.

Yes, good point. Thanks for the hint.

> You could use attachments to store the subtitle files.

Yeah, it's already in my draft


On Sat, 13 Jun 2009 18:24:39 +0200, Nils Breunese <N....@vpro.nl> wrote:
> Hello,
> 
> Some ideas:
> 
> - I think I'd make the movie hash a regular field in the document instead
> of the _id. Then you can just have multiple subtitle documents for a movie
> and you could create a view that emit this field as a key so you can query
> for it.
> - JSON has booleans, so I'd make fansub a boolean instead of a True or
> False string.
> - You could use attachments to store the subtitle files.
> 
> Nils Breunese.

RE: Looking for advice using CouchDB for a FreeSoftware project

Posted by Nils Breunese <N....@vpro.nl>.

Hello,

Some ideas:

- I think I'd make the movie hash a regular field in the document instead of the _id. Then you can have multiple subtitle documents for a movie, and you could create a view that emits this field as a key so you can query for it.
- JSON has booleans, so I'd make fansub a boolean instead of a True or False string.
- You could use attachments to store the subtitle files.
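
A rough sketch of the first two ideas, in plain Python rather than against a real server. Real CouchDB map functions are written in JavaScript and call emit(); here a generator stands in for that. The field name "movie_hash", the helper names, and the sample ids are assumptions made for illustration only.

```python
import json


def map_by_movie_hash(doc):
    """Yield (key, value) rows, standing in for emit() in a CouchDB map function."""
    if doc.get("type") == "subtitlefile" and "movie_hash" in doc:
        yield doc["movie_hash"], doc["_id"]


def build_view(docs, map_fn):
    """Run the map function over all documents and sort the rows by key."""
    return sorted(row for doc in docs for row in map_fn(doc))


def query(view, key):
    """Return the values of all rows whose key matches exactly."""
    return [value for k, value in view if k == key]


docs = [
    {"_id": "sub-1", "type": "subtitlefile", "movie_hash": "movie-1",
     "language": "eng", "fansub": False},
    {"_id": "sub-2", "type": "subtitlefile", "movie_hash": "movie-1",
     "language": "ger", "fansub": True},
    {"_id": "sub-3", "type": "subtitlefile", "movie_hash": "movie-2",
     "language": "eng", "fansub": False},
    {"_id": "movie-1", "type": "moviefile"},
]

view = build_view(docs, map_by_movie_hash)
print(query(view, "movie-1"))
# Python booleans serialise to JSON true/false, not to strings.
print(json.dumps({"fansub": docs[1]["fansub"]}))
```

Several subtitle documents can share one movie hash this way, and the lookup is a single keyed query on the view.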

Nils Breunese.


Re: Looking for advice using CouchDB for a FreeSoftware project

Posted by Jeremy Wall <jw...@google.com>.

ahhhh ok that makes sense then.

On Sat, Jun 13, 2009 at 11:21 AM, fana <fa...@2flub.org> wrote:

> Hi, thanks for the quick reply,
>
>
> On Sat, 13 Jun 2009 10:11:00 -0500, Jeremy Wall <jw...@google.com> wrote:
> > I think you can actually get rid of the relation table.
> > just put the movie hash as an attribute of your subtitle document.
>
> This was one of my first thoughts, too.
> At the beginning I had a list of movie hashes in the SubtitleFile document.
> Then I thought it would be better in the other direction, with a list of
> subtitle hashes in the MovieFile document.
>
> The problem I had is that there is a many-to-many relation between them.
> One MovieFile can have many suitable SubtitleFiles and vice versa.
>
> Maybe I still think too "relational-databased", but the advantage I see
> with the "relation" document is that if somebody wants to add further
> matching hashes, I don't have to make sure that existing hashes don't
> get lost.

Re: Looking for advice using CouchDB for a FreeSoftware project

Posted by fana <fa...@2flub.org>.

Hi, thanks for the quick reply,


On Sat, 13 Jun 2009 10:11:00 -0500, Jeremy Wall <jw...@google.com> wrote:
> I think you can actually get rid of the relation table.
> just put the movie hash as an attribute of your subtitle document.

This was one of my first thoughts, too.
At the beginning I had a list of movie hashes in the SubtitleFile document.
Then I thought it would be better in the other direction, with a list of
subtitle hashes in the MovieFile document.

The problem I had is that there is a many-to-many relation between them.
One MovieFile can have many suitable SubtitleFiles and vice versa.

Maybe I still think too "relational-databased", but the advantage I see
with the "relation" document is that if somebody wants to add further
matching hashes, I don't have to make sure that existing hashes don't
get lost.
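
To illustrate the concern, here is a toy in-memory store in Python. It only mimics one CouchDB behaviour: an update is rejected when it carries a stale "_rev" (CouchDB answers such a write with a 409 conflict). TinyStore, the integer revisions, and the "subtitles" list field are simplifications invented for this sketch; real revisions are strings.

```python
import copy


class Conflict(Exception):
    """Raised when a write carries a stale revision."""


class TinyStore:
    """Toy stand-in for a database that checks _rev on every update."""

    def __init__(self):
        self.docs = {}

    def get(self, doc_id):
        return copy.deepcopy(self.docs[doc_id])

    def put(self, doc):
        current = self.docs.get(doc["_id"])
        if current is not None and doc.get("_rev") != current["_rev"]:
            raise Conflict(doc["_id"])
        stored = copy.deepcopy(doc)
        stored["_rev"] = (current["_rev"] if current else 0) + 1
        self.docs[doc["_id"]] = stored
        return stored["_rev"]


store = TinyStore()
store.put({"_id": "movie-1", "type": "moviefile", "subtitles": ["sub-a"]})

# Two uploaders read the same revision of the list-in-document variant.
alice = store.get("movie-1")
bob = store.get("movie-1")
alice["subtitles"].append("sub-b")
store.put(alice)  # succeeds and bumps the revision

bob["subtitles"].append("sub-c")
try:
    store.put(bob)  # stale _rev: must re-read, merge and retry
    bob_rejected = False
except Conflict:
    bob_rejected = True

# With relation documents each match is its own new document.
store.put({"_id": "rel-1", "type": "relation",
           "id_subtitlefile": "sub-b", "id_moviefile": "movie-1"})
store.put({"_id": "rel-2", "type": "relation",
           "id_subtitlefile": "sub-c", "id_moviefile": "movie-1"})
print(bob_rejected, sorted(store.docs))
```

With the list inside the document, the second writer has to fetch, merge, and retry; with relation documents both writers simply insert and never touch each other's data.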



Re: Looking for advice using CouchDB for a FreeSoftware project

Posted by Jeremy Wall <jw...@google.com>.

I think you can actually get rid of the relation table.
Just put the movie hash as an attribute of your subtitle document.

On Sat, Jun 13, 2009 at 9:13 AM, fana <fa...@2flub.org> wrote:


Modify this to be:

>
> SubtitleFile
> ------------
>
> {
>  "_id"              : "String",       (MD5 hash of subtitle file)

   "movie_hash"  : "String",       (Id from the movie document)

>
>  "type"             : "subtitlefile",
>  "format"           : "String",       (e.g. "SubRip")
>  "language"         : "String",       (ISO 639-2 code)
>  "hearing_impaired" : "String",       ("True" or "False")
>  "fansub"           : "String",       ("True" or "False")
>  "uploader"         : "String",
>  "_attachments"     :
>
>  {
>    "subtitle.srt":
>    {
>      "content_type" : "text\/plain",
>      "data"         : "VGhpcyBpcyBhIGJhc2U2NCBlbmNvZGVkIHRleHQ="
>    }
>  }
>
> }
>
>  THERE IS NO HOSTING OF MOVIE FILES OF THE MOVIE INDUSTRY
>  (just peoples' file hashes)
>

Keep this the same:


>
> MovieFile
> ---------
>
> {
>  "_id"      : "String",               (Computed hash of movie file)
>  "type"     : "moviefile",
>  "length"   :  number,                (seconds)
>  "filesize" :  number,                (kb)
>  "fps"      :  number,
>  "uploader" : "String"
> }
>

Get rid of this completely:


>
> Relation
> --------
>
> {
>                                       (here "_id" will be generated by
> CouchDB)
>  "type"            : "relation"
>  "id_subtitlefile" : "String",        (the MD5 hash of the subtitle)
>  "id_moviefile"    : "String"         (the     hash of the movie file)


> }


You can still look up subtitles by the movie id and you get rid of an
unnecessary document. In my (admittedly limited) experience a linking
document is usually unnecessary.
