Posted to user@couchdb.apache.org by Ronan Jouchet <ro...@cadensimaging.com> on 2015/12/24 20:05:01 UTC

Document ID naming: random UUIDs or structured?

Hi.

I'm coming back to an already much-debated subject, with a few questions
I couldn't find answers to.

I started working on a new system backed by CouchDB, and am questioning
our choice to use "meaningful"/structured IDs (as opposed to UUIDs). Our
data revolves around documents called "cases", which can relate to
various documents, like notes, findings, measures. So we build IDs
looking like:
- 1234_case
- 1234_finding_f2ac2351
- 1234_finding_aa928399
- 1234_note_22933cf5
- 1234_measure_928dca87

Colleagues say they initially went for UUIDs, then moved on to a
meaningful scheme for guess-ability, which enabled easier replication,
as well as a few views referencing IDs (thanks to knowledge of the
naming structure), which expand to full documents with include_docs=true.

On my side, as a NoSQL freshman and without the project history, I can't
help wanting to move back to UUIDs, because:

1. As we're leaning heavily on the *naming* of our documents, I have the
feeling we're hiding from ourselves the fact that we're not properly
structuring our data in a way that is view-friendly. Feels like it's
going to come back and bite us later on.

2. As we are adding logic, we're starting to see unwieldy IDs
(hash1_thing1_hash2_thing2_hash3_thing3_hash4)

3. Currently, the information contained in the ID (in the above example:
caseId, type, hash) lives *only* there. So to "extract" this
information we have repetitive-but-slightly-different "splitId"
functions that parse and type these IDs (for example:
"1234_finding_f2ac2351" -> {"caseId": 1234, "type": "finding",
"contentId": "f2ac2351"}), which is painful.

3.1. The obvious solution would be to repeat {caseId, type, hash} as
document properties. Then I can use them without having to call
splitId(doc._id). But then there's duplicated data, which will have to
be updated jointly. Is it a problem or is it just the time for me to
learn to stop worrying and not care about this kind of minor duplication
in NoSQL land?
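
For illustration, here is a minimal sketch (in Python) of what a single
shared "splitId" could look like. It assumes every ID follows the
<caseId>_<type>[_<contentId>] layout shown above; the names split_id and
build_id are hypothetical and not part of CouchDB.

```python
from typing import Optional, TypedDict


class IdParts(TypedDict):
    caseId: int
    type: str
    contentId: Optional[str]


def split_id(doc_id: str) -> IdParts:
    """Parse '<caseId>_<type>[_<contentId>]' into typed parts."""
    # maxsplit=2 keeps any further underscores inside contentId.
    parts = doc_id.split("_", 2)
    if len(parts) < 2:
        raise ValueError(f"not a structured ID: {doc_id!r}")
    return {
        "caseId": int(parts[0]),
        "type": parts[1],
        "contentId": parts[2] if len(parts) == 3 else None,
    }


def build_id(case_id: int, doc_type: str, content_id: Optional[str] = None) -> str:
    """Inverse of split_id: derive the ID from the properties."""
    base = f"{case_id}_{doc_type}"
    return base if content_id is None else f"{base}_{content_id}"
```

If the same properties are also stored in the document body, deriving
_id from them once at creation time (as build_id does here) is one way
to keep the two copies from drifting apart.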

Then, looking at what the internet says (see references below),

a. Both [PDB] and [DC] say non-UUID IDs are convenient for bare-bones
_all_docs querying (e.g. for "all of Bob Dylan's albums released between
1964 and 1965", just {startkey: 'album_dylan_1964_', endkey:
'album_dylan_1965_\uffff'}).
True, but how often will I be able to use such simple queries? I feel
like I'm going to need views anyway.

b. Both [PDB] and [DC] say that a structured ID naming means usable
indexes "for free", taking no additional space compared to a solution
with random UUIDs complemented with views.
- Also, both note that using UUIDs (thus, needing views) means
failing to use the built-anyway index on _id. True.
- [DC] goes as far as saying that "getting rid of as many views
(relying on _all_docs instead) as you can is a worthwhile goal". Is this
a shared opinion?

c. [INOI] and [GUIDE] note that incremental IDs will yield better
performance on bulk document inserts. Okay.

d. [SO] proposes to "use UUIDs unless you have a good reason not to",
and recommends basing your choice on "Cost of changing ID vs. How
likely the ID is to change" (if the ID is likely to change a lot, use a
UUID to force yourself not to rely on it).

What do you think? What do you use in your own projects?

Thanks for your help, thanks for CouchDB, and happy end-of-year :)

References ----

[PDB] (section "Use and abuse your doc IDs")
http://pouchdb.com/2014/05/01/secondary-indexes-have-landed-in-pouchdb.html

[DC]
http://davidcaylor.com/2012/05/26/can-i-see-your-id-please-the-importance-of-couchdb-record-ids/

[GUIDE] http://guide.couchdb.org/draft/performance.html#bulk

[INOI]
http://blog.inoi.fi/2010/11/impact-of-document-ids-on-performance.html

[SO]
http://stackoverflow.com/questions/1963632/what-is-best-practice-when-creating-document-ids-in-couchdb/1964947#1964947

--
Ronan

Re: Document ID naming: random UUIDs or structured?

Posted by Alexander Harm <co...@aharm.de>.

Hello Ronan,

my two cents:

I tend to incorporate the type and possible parent into my id, so in your case that would look like

case_1234
finding_1234_f2ac2351
finding_1234_aa928399
note_1234_22933cf5
measure_1234_928dca87

However, I tend to normalise the type and all "ids" to a fixed length, e.g.:
case_1234
fndg_1234_f2ac2351
fndg_1234_aa928399
note_1234_22933cf5
msre_1234_928dca87

That enables me to pull an overview of all cases with all docs
startkey case_
endkey case_\uffff
and then access all details by type
startkey fndg_1234_
endkey fndg_1234_\uffff
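
To illustrate why the \uffff suffix works, here is a small Python sketch that mimics a startkey/endkey range over sorted IDs. It is only a simulation: it uses Python's code-point string ordering as a stand-in for the ordering of _all_docs (an assumption that should hold for simple ASCII IDs like these), the function name ids_in_range is made up, and the 5678 entries are invented sample data.

```python
from bisect import bisect_left, bisect_right

# Sample IDs kept in sorted order, as an index on _id would hold them.
# The "5678" entries are invented to show what the ranges exclude.
doc_ids = sorted([
    "case_1234",
    "case_5678",
    "fndg_1234_aa928399",
    "fndg_1234_f2ac2351",
    "fndg_5678_0b1c2d3e",
    "msre_1234_928dca87",
    "note_1234_22933cf5",
])


def ids_in_range(sorted_ids, startkey, endkey):
    """Return the IDs with startkey <= id <= endkey (both ends inclusive)."""
    lo = bisect_left(sorted_ids, startkey)
    hi = bisect_right(sorted_ids, endkey)
    return sorted_ids[lo:hi]


# "\uffff" sorts after any character used in these IDs, so
# "<prefix>\uffff" acts as an upper bound for everything with that prefix.
all_cases = ids_in_range(doc_ids, "case_", "case_\uffff")
findings_1234 = ids_in_range(doc_ids, "fndg_1234_", "fndg_1234_\uffff")
```

Here all_cases holds the two case_ IDs and findings_1234 holds only the two findings of case 1234; the finding belonging to case 5678 falls outside the range.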

That works pretty well for my use case (querying all cases, and details only when needed). By putting the type at the start I make sure the docs are stored in order (your point c). Whether or not to use UUIDs depends on the data. In the example of a people directory, each person has a unique incremental UUID:
person_<person-uuid>
while a telephone number's ID can be shortened to the person's UUID plus the kind of number:
telphn_<person-uuid>_home
telphn_<person-uuid>_work
telphn_<person-uuid>_fax
telphn_<person-uuid>_mobile

If there is a chance of conflicts I would always go for a UUID.
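
As a sketch of that last case, the structured prefix can be kept while a random UUID supplies the unique part. This uses Python's standard uuid module; the helper name new_child_id and the exact prefix layout are assumptions for illustration only.

```python
import uuid


def new_child_id(type_code: str, parent_id: str) -> str:
    """Build '<type>_<parent>_<random hex>' with a random UUID as suffix."""
    # uuid4() is random, so two clients creating children of the same
    # parent independently are very unlikely to pick the same ID.
    return f"{type_code}_{parent_id}_{uuid.uuid4().hex}"
```

The prefix still sorts and range-queries like the IDs above; only the last segment stops being guessable.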

Regards,

Alexander




