You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@roller.apache.org by David Levy <Da...@Sun.COM> on 2006/01/04 12:51:03 UTC

About Tags, a proposed data-model

I have documented a data model for tags. This is held at my blog

http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database

I have a graphic demonstrating the relationship between authors, 
articles and tags and illustrating the first and obvious indexes. (I 
have identified that both "Date Published" and tag aggregates are 
missing from the model).  Since the model was built to help me 
understand del.icio.us, I call the entities Users, Bookmarks and Tags, 
but hopefully its simple to see that these are pretty synonomous to 
authors, articles and tags.

I hope that this is useful for those looking at implementing tags.

I am still working out how to use the mail-list, so I hope that x-refing 
you to my blog isn't deprecated. I also need to work how to maintain 
thread connections i.e. undertake a reply.
-- 

Dave

<http://www.sun.com> 	* David Levy *
*Sun Microsystems Ltd.*
55, King William St.,
London EC4R 9ND United Kingdom

Phone +44 (0) 20 7469 9908/x18308
Mobile +44 (0) 7710-360922
Blog http://blogs.sun.com/DaveLevy
Email David.Levy@Sun.COM

Sun Proprietary & Confidential . This e-mail message is for the sole use 
of the intended recipient(s) and may contain confidential and 
privilidged information. Any unauthorised review, use, disclosure or 
distribution is prohibited. If you are not the intended recepient, 
please contact the sender by reply e-mail and destroy all copies of the 
original message.

Re: About Tags, a proposed data-model

Posted by Anil Gangolli <an...@busybuddha.org>.

David Levy wrote:

> This has been written after the proposal has been updated and the 
> table defintion documented there. i.e
>
> http://rollerweblogger.org/wiki/Wiki.jsp?page=Proposal_WeblogTags
>
> But I have decided to extent this thread; it pertains to the database 
> data model.
>
> I have some questions.
>
> is id a surrogate (sequential or hashed) key

Yes, all of the primary key ids in Roller are hashed generated keys.

> what is entryid? i.e. the article URL? i.e. foreign key to entries

This is a foreign key to the entry that is being tagged.

> what is name? (The value of the tag?) i.e. foreign key to Tags table

This is the tag value.  It is not a foreign key to a tags table.  There 
is no separate tags table.  It's not fully normalized in this sense 
because the same tag for multiple articles would appear multiple times 
in the weblogentrytag table.  We should consider this normalization.   
That would mean having a separate table for tags, in which each distinct 
tag value appears exactly once, and the weblogentrytag table being 
purely an association between the weblog entry and the tag.

--a.

>
> Do I need the definitions of the entries and entries2tags to 
> understand what we need to do?
>
> I have come to the conclusion that using surrogate keys is(nearly 
> always) a mistake. The space management advantages are normally 
> negated by the requirement to join against the tables holding the real 
> keys. You normally require an additonal unique index to enforce the 
> uniqueness constraints against both the surrogate & real keys, and 
> will often need to de-optimise the TNF structure to replicate the real 
> data around the database to avoid otherwise unnecessary joins. 
> Maintaining these de-optimisations have both human and perfromance costs.
>
> If you agree that real keys should be used then we need to consider 
> capitalisation & plurals of tags. Easy one first - leave plural 
> management to the authors. I know that I  am using blog & blogs 
> seperately, althouhg maybe I should have chosen blogging & blog to be 
> clearer, but the point is that tag authors are not going to think very 
> hard about their tags, this is why the del.icio.us tag prompt scheme 
> is so good. Delicious manages to ignore capitalisation, which I think 
> is good. Its certainly best for query and aggregation for popularity 
> queries., I can't believe that anyone wants to know that both 
> "internet" & "Internet" are popular tags (technorati!)
>
> This implies that we might have two fields to document a tag,
>
> user_tag_name            char(32)
> normal_tag_name        char(32)  is proper(user_tag_name)
>
> (sorry for the pseudo-code, but normal* needs to be enforced with a 
> constraint or trigger.)
>
> This then leaves us with the question of indices (indexes?)
>
> The implications of my data model is that the entry_author key should 
> be copied down through the entry tabel into the entry-tags table.
>
> author_name            char(32) is (part of) FK to  entry
> entry_name              char(128) is (part of) FK to entry
>
> although, a full roller URL contains the author name, but I assume 
> that the entryname consists of the string used in the entry & anchor 
> queries so I have a table that looks like this (again pseudo code and 
> I have guessed at the database column types).
>
> create definedtags as (
> author_name                    char(32)
> entry_name (or id?) ,       char(128)
> user_tag_name,               char(64)
> normal_tag_name,           char(64)
> entry_date                         datetime
> date_created                    datetime )
>
> create constraint defined_tags_name as
>    normal_tag_name=proper(user_tag_name)
>
> create constraint defined_tags_entry_date as
>    entry_date = select entry_date from entry
>                           where definedtags.entry_name = entry.entry_name
>
> and the following indexes
>
> create dt.tags on definedtags
>    as author_name, entry_name, normal_tag_name unique
>
> create dt.tags2 on definedtags
>    as normal_tag_name
>
> create dt.entries on definedtags
>    as author_name, entry_name
>
> create dt.date_written
>    on definedtags as entry_date.
>
> I have written this in a hurry so it may not be though out as well as 
> some of my writing, but hopefully this is collaborative development. 
> Also I have difficulty in commenting on and reading some of the 
> front-end & java orientated stuff. (I have ordered a couple of books 
> to help me catchup). Hopefully this is helpfull
>
>
>
> Elias Torres wrote:
>
>> Welcome David to the Roller list.
>>
>> Thank you for your post. I have read your blog post on a tag data
>> model for Roller. I'm looking forward to your relational algebra and
>> query cost analysis. I wanted to tell you that we (IBM) have already
>> added basic tagging support to Roller and it actually supports a
>> TagCloud. I am supposed to put a proposal in the roller wiki so others
>> could comment and once I do that, you could put your comments there as
>> well.
>>
>> Just to kickstart the conversation I'm including the tagging table we
>> are currently using.
>>
>> create table weblogentrytag (
>>    id              varchar(48)   not null primary key,
>>    entryid        varchar(48)   not null,
>>    name            varchar(255)  not null,
>>    tagtime         timestamp     not null
>> );
>>
>> We have basically two tables: entries and entry2tags, but are missing
>> a tag table. At first, I was very set on having a tag table and use a
>> foreign key to "save" space on repetitive tag names. But I was shown
>> it's not really a big space saving technique, especially since tag
>> names are relatively short storing a guid or int would almost be
>> comparable in space. There are also increased costs in inserting and
>> joining on tables to get tag names if using a foreign key, so we have
>> settle on this for now until we have other queries requirements. I'll
>> be summarizing all of our changes to roller to support tagging in a
>> wiki proposal soon.
>>
>> Regarding the use of the list, some people have been using nabble.com
>> to interact with it. Maybe you can give it a try. I simply use gmail.
>>
>> http://www.nabble.com/Roller-f12275.html
>>
>> Regards,
>>
>> Elias
>>
>> On 1/4/06, David Levy <Da...@sun.com> wrote:
>>  
>>
>>> I have documented a data model for tags. This is held at my blog
>>>
>>> http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database 
>>>
>>>
>>> I have a graphic demonstrating the relationship between authors,
>>> articles and tags and illustrating the first and obvious indexes. (I
>>> have identified that both "Date Published" and tag aggregates are
>>> missing from the model).  Since the model was built to help me
>>> understand del.icio.us, I call the entities Users, Bookmarks and Tags,
>>> but hopefully its simple to see that these are pretty synonomous to
>>> authors, articles and tags.
>>>
>>> I hope that this is useful for those looking at implementing tags.
>>>
>>> I am still working out how to use the mail-list, so I hope that 
>>> x-refing
>>> you to my blog isn't deprecated. I also need to work how to maintain
>>> thread connections i.e. undertake a reply.
>>> -- 
>>>
>>> Dave
>>>
>>> <http://www.sun.com>    * David Levy *
>>> *Sun Microsystems Ltd.*
>>> 55, King William St.,
>>> London EC4R 9ND United Kingdom
>>>
>>> Phone +44 (0) 20 7469 9908/x18308
>>> Mobile +44 (0) 7710-360922
>>> Blog http://blogs.sun.com/DaveLevy
>>> Email David.Levy@Sun.COM
>>>
>>> Sun Proprietary & Confidential . This e-mail message is for the sole 
>>> use
>>> of the intended recipient(s) and may contain confidential and
>>> privilidged information. Any unauthorised review, use, disclosure or
>>> distribution is prohibited. If you are not the intended recepient,
>>> please contact the sender by reply e-mail and destroy all copies of the
>>> original message.
>>>
>>>
>>>   
>>
>

Re: About Tags, a proposed data-model

Posted by Ian Kallen <ik...@technorati.com>.

Well, wait. del & flickr support URL encoded strings as tags so you can 
have multi-word tags, it's just kinda grotty ... URLs I'd tagged 
"language detection"
http://del.icio.us/spidaman/%22language%2Bdetection%22

Disallowing spaces in tags forces people to puzzle over alternate word 
separators (plus signs, underscores, periods or just mushing words 
together), which is awkward.  Right or wrong, Technorati normalizes case 
and word separators at query time, ergo these are equivalent:
http://technorati.com/tag/map+reduce
http://technorati.com/tag/mapreduce
http://technorati.com/tag/map.reduce

Personally I'd prefer to see a combo-box populator (IIRC, that's how 
ecto and flock handle tag creation) implemented in AJAX/javascript. The 
single text field text entry UIs force awkward word delimiting. Sorry, 
I'm not volunteering on the combo box (wish I could), cross browser 
javascript dev has always been a slow rtfm process for me.
-Ian

David M Johnson wrote:
>
> On Jan 14, 2006, at 2:37 PM, Allen Gilliland wrote:
>> On Sat, 2006-01-14 at 08:21, David M Johnson wrote:
>>> On Jan 6, 2006, at 3:08 PM, Allen Gilliland wrote:
>>>> I prefer the del.icio.us style, where spaces are separators and are
>>>> not supported in tags.  I think this is the easiest for users to
>>>> understand and work with.
>>>
>>> That's not true. Both Technorati Tags and del.icio.us tags allow
>>> spaces in tags. I believe we should do the same thing.
>>
>> I'm pretty sure that del.icio.us doesn't allow multi-word tags or tags
>> with spaces ...
>>
>> http://del.icio.us/help/tags
>>
>> "Tags are one-word descriptors that you can assign to any bookmark. Tags
>> can't contain quotation marks or whitespace, but are otherwise
>> unrestricted."
>
> By golly, you're right. And flickr.com is the same way.
>
> Technorati on the other hand *does* support spaces within tags.
>
> So I guess I'm starting to agree with you on this one.
>
> - Dave
>
>
-- 
Ian Kallen || Architect, Technorati Inc. || m: 415.505.5208
http://www.arachna.com/roller/page/spidaman

Re: About Tags, a proposed data-model

Posted by David M Johnson <Da...@Sun.COM>.

On Jan 14, 2006, at 2:37 PM, Allen Gilliland wrote:
> On Sat, 2006-01-14 at 08:21, David M Johnson wrote:
>> On Jan 6, 2006, at 3:08 PM, Allen Gilliland wrote:
>>> I prefer the del.icio.us style, where spaces are separators and are
>>> not supported in tags.  I think this is the easiest for users to
>>> understand and work with.
>>
>> That's not true. Both Technorati Tags and del.icio.us tags allow
>> spaces in tags. I believe we should do the same thing.
>
> I'm pretty sure that del.icio.us doesn't allow multi-word tags or tags
> with spaces ...
>
> http://del.icio.us/help/tags
>
> "Tags are one-word descriptors that you can assign to any bookmark.  
> Tags
> can't contain quotation marks or whitespace, but are otherwise
> unrestricted."

By golly, you're right. And flickr.com is the same way.

Technorati on the other hand *does* support spaces within tags.

So I guess I'm starting to agree with you on this one.

- Dave

Re: About Tags, a proposed data-model

Posted by Allen Gilliland <Al...@Sun.COM>.

On Sat, 2006-01-14 at 08:21, David M Johnson wrote:
> On Jan 6, 2006, at 3:08 PM, Allen Gilliland wrote:
> > I prefer the del.icio.us style, where spaces are separators and are  
> > not supported in tags.  I think this is the easiest for users to  
> > understand and work with.
> 
> That's not true. Both Technorati Tags and del.icio.us tags allow  
> spaces in tags. I believe we should do the same thing.

I'm pretty sure that del.icio.us doesn't allow multi-word tags or tags
with spaces ...

http://del.icio.us/help/tags

"Tags are one-word descriptors that you can assign to any bookmark. Tags
can't contain quotation marks or whitespace, but are otherwise
unrestricted."

maybe you were talking about spaces as being the tag delimiter?  my
thoughts are that this should be legal ...

tag etc SanFrancisco San_Francisco multiWordTag or_multi_word_tag

but anything like this is not ...

tag "San Francisco" etc

-- Allen

> 
> - Dave
> 
>

Re: About Tags, a proposed data-model

Posted by David M Johnson <Da...@Sun.COM>.

On Jan 6, 2006, at 3:08 PM, Allen Gilliland wrote:
> I prefer the del.icio.us style, where spaces are separators and are  
> not supported in tags.  I think this is the easiest for users to  
> understand and work with.

That's not true. Both Technorati Tags and del.icio.us tags allow  
spaces in tags. I believe we should do the same thing.

- Dave

Re: About Tags, a proposed data-model

Posted by David Levy <Da...@Sun.COM>.

I agree with this. 

Getting spaces into the tag isn't the problem, well it is, but learning 
the query syntax is a pain.

Allen Gilliland wrote:

>On Fri, 2006-01-06 at 11:17, Elias Torres wrote:
>  
>
>>Additionally, we have a question of what to do with spaces. Should tags
>>be multi-word or not? My suggestion to Phay (one of our developers) was
>>to use spaces as separators in the input field, therefore not supporting
>>spaces. But we could do multiple things to support spaces, such as
>>quoting multi-word tags. I believe Flickr supports multi-words but they
>>remove the spaces from the tags, but technorati does maintain spaces. I
>>don't like them myself, because I think it fragments the tag space much
>>more than single words and you could still use intersections to get the
>>sort of the same result.
>>
>>    
>>
>
>I prefer the del.icio.us style, where spaces are separators and are not supported in tags.  I think this is the easiest for users to understand and work with.
>
>-- Allen
>
>  
>
Lots ot stuff deleted.
--
Dave

Re: About Tags, a proposed data-model

Posted by Allen Gilliland <Al...@Sun.COM>.

On Fri, 2006-01-06 at 11:17, Elias Torres wrote:
> Additionally, we have a question of what to do with spaces. Should tags
> be multi-word or not? My suggestion to Phay (one of our developers) was
> to use spaces as separators in the input field, therefore not supporting
> spaces. But we could do multiple things to support spaces, such as
> quoting multi-word tags. I believe Flickr supports multi-words but they
> remove the spaces from the tags, but technorati does maintain spaces. I
> don't like them myself, because I think it fragments the tag space much
> more than single words and you could still use intersections to get the
> sort of the same result.
> 

I prefer the del.icio.us style, where spaces are separators and are not supported in tags.  I think this is the easiest for users to understand and work with.

-- Allen


> 
> > >
> > > create constraint defined_tags_name as
> > >     normal_tag_name=proper(user_tag_name)
> > >
> > > create constraint defined_tags_entry_date as
> > >     entry_date = select entry_date from entry
> > >                            where definedtags.entry_name = entry.entry_name
> > >
> > > and the following indexes
> > >
> > > create dt.tags on definedtags
> > >     as author_name, entry_name, normal_tag_name unique
> > >
> > > create dt.tags2 on definedtags
> > >     as normal_tag_name
> > >
> > > create dt.entries on definedtags
> > >     as author_name, entry_name
> > >
> > > create dt.date_written
> > >     on definedtags as entry_date.
> > >
> > > I have written this in a hurry so it may not be though out as well as
> > > some of my writing, but hopefully this is collaborative development.
> > > Also I have difficulty in commenting on and reading some of the
> > > front-end & java orientated stuff. (I have ordered a couple of books to
> > > help me catchup). Hopefully this is helpfull
> > >
> >
> > I think this all makes sense to me so far and I certainly appreciate the help.  I think getting the data model correct is a *very* important issue before we move forward with implemenatation, so I'm glad we are having this discussion.
> >
> > -- Allen
> 
> [1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
> [2] http://torrez.us/archives/2005/07/13/tagrank.pdf
> 
> >
> > >
> > >
> > > Elias Torres wrote:
> > >
> > > >Welcome David to the Roller list.
> > > >
> > > >Thank you for your post. I have read your blog post on a tag data
> > > >model for Roller. I'm looking forward to your relational algebra and
> > > >query cost analysis. I wanted to tell you that we (IBM) have already
> > > >added basic tagging support to Roller and it actually supports a
> > > >TagCloud. I am supposed to put a proposal in the roller wiki so others
> > > >could comment and once I do that, you could put your comments there as
> > > >well.
> > > >
> > > >Just to kickstart the conversation I'm including the tagging table we
> > > >are currently using.
> > > >
> > > >create table weblogentrytag (
> > > >    id              varchar(48)   not null primary key,
> > > >    entryid        varchar(48)   not null,
> > > >    name            varchar(255)  not null,
> > > >    tagtime         timestamp     not null
> > > >);
> > > >
> > > >We have basically two tables: entries and entry2tags, but are missing
> > > >a tag table. At first, I was very set on having a tag table and use a
> > > >foreign key to "save" space on repetitive tag names. But I was shown
> > > >it's not really a big space saving technique, especially since tag
> > > >names are relatively short storing a guid or int would almost be
> > > >comparable in space. There are also increased costs in inserting and
> > > >joining on tables to get tag names if using a foreign key, so we have
> > > >settle on this for now until we have other queries requirements. I'll
> > > >be summarizing all of our changes to roller to support tagging in a
> > > >wiki proposal soon.
> > > >
> > > >Regarding the use of the list, some people have been using nabble.com
> > > >to interact with it. Maybe you can give it a try. I simply use gmail.
> > > >
> > > >http://www.nabble.com/Roller-f12275.html
> > > >
> > > >Regards,
> > > >
> > > >Elias
> > > >
> > > >On 1/4/06, David Levy <Da...@sun.com> wrote:
> > > >
> > > >
> > > >>I have documented a data model for tags. This is held at my blog
> > > >>
> > > >>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
> > > >>
> > > >>I have a graphic demonstrating the relationship between authors,
> > > >>articles and tags and illustrating the first and obvious indexes. (I
> > > >>have identified that both "Date Published" and tag aggregates are
> > > >>missing from the model).  Since the model was built to help me
> > > >>understand del.icio.us, I call the entities Users, Bookmarks and Tags,
> > > >>but hopefully its simple to see that these are pretty synonomous to
> > > >>authors, articles and tags.
> > > >>
> > > >>I hope that this is useful for those looking at implementing tags.
> > > >>
> > > >>I am still working out how to use the mail-list, so I hope that x-refing
> > > >>you to my blog isn't deprecated. I also need to work how to maintain
> > > >>thread connections i.e. undertake a reply.
> > > >>--
> > > >>
> > > >>Dave
> > > >>
> > > >><http://www.sun.com>    * David Levy *
> > > >>*Sun Microsystems Ltd.*
> > > >>55, King William St.,
> > > >>London EC4R 9ND United Kingdom
> > > >>
> > > >>Phone +44 (0) 20 7469 9908/x18308
> > > >>Mobile +44 (0) 7710-360922
> > > >>Blog http://blogs.sun.com/DaveLevy
> > > >>Email David.Levy@Sun.COM
> > > >>
> > > >>Sun Proprietary & Confidential . This e-mail message is for the sole use
> > > >>of the intended recipient(s) and may contain confidential and
> > > >>privilidged information. Any unauthorised review, use, disclosure or
> > > >>distribution is prohibited. If you are not the intended recepient,
> > > >>please contact the sender by reply e-mail and destroy all copies of the
> > > >>original message.
> > > >>
> > > >>
> > > >>
> > > >>
> > >
> > > --
> > >
> > > Dave
> > >
> > > <http://www.sun.com>  * David Levy *
> > > *Sun Microsystems Ltd.*
> > > 55, King William St.,
> > > London EC4R 9ND United Kingdom
> > >
> > > Phone +44 (0) 20 7469 9908/x18308
> > > Mobile +44 (0) 7710-360922
> > > Blog http://blogs.sun.com/DaveLevy
> > > Email David.Levy@Sun.COM
> > >
> > > Sun Proprietary & Confidential . This e-mail message is for the sole use
> > > of the intended recipient(s) and may contain confidential and
> > > privilidged information. Any unauthorised review, use, disclosure or
> > > distribution is prohibited. If you are not the intended recepient,
> > > please contact the sender by reply e-mail and destroy all copies of the
> > > original message.
> > >
> >
> >

Re: About Tags, a proposed data-model

Posted by Dave Johnson <da...@rollerweblogger.org>.

On Jan 30, 2006, at 1:45 PM, Allen Gilliland wrote:
> I agree that is true *if* results are cached, but therein lies the  
> problem.
>
> I have spent quite a bit of time working on caching and performance  
> in Roller and with our current setup it's not caching that is hard,  
> it's having a big enough cache that's hard.  The fact is that a  
> blog takes up a lot of space because you have to consider caching  
> entries, comments, bookmarks & folders, categories, referers, and  
> templates.  As a blog grows so do most of those things, especially  
> the entries & comments.  On top of just caching those objects we  
> currently cache fully rendered pages and feeds, so that means a  
> handful of xml feeds and quite a few full html pages.  The point  
> being, on a large site there is tons of data that needs caching  
> already without having to cache tag related data.
>
> Currently, I have no idea how we can expect to cache all the data  
> that would be needed for a full tagging system along with  
> everything else we cache right now.


Here are my thoughts on tags and tag search vs. caching

*** We can't cache everything

For example, if we allow people to perform arbitrary tag queries
via the Roller UI, we're not going be able to cache the results.
In that case, we're probably OK. After all, how many people
are going to be doing tag searches on a Roller site simultaneously.

*** Tag based newsfeeds are where the problem arises

The problem arises when we start to allow people to subscribe
to tag searches. In that case, we'll have newsfeed readers and
aggregators hammering away hourly night and day. What's
worse, the number of feeds will go to infinity.

And we already have too many feeds:
total feeds = (number of blogs) X (cats per blog) X (2 feed types  
Atom and RSS)

So...

*** We should allow admins to disable tag based newsfeeds

We should allow arbitrary tag searches and getting the results as
newsfeeds, but we should also make it possible to turn both of those
off via Roller properties or the UI.

*** We should allow admins to define a finite list of site-wide feeds

What I'd like to do with Proposal Atlas is provide a way for a site
administrator to decide what feeds are to be displayed on the front
page of a Roller site and define a finite list of feeds to be provided
based on aggregations, tags, and internal objects (e.g. new user
and new blog feeds). The I'll have a new rev. of the proposal ready
in the next day or two.

- Dave




> On Mon, 2006-01-30 at 07:07, John Hoffmann wrote:
>> I'd just like to add that performing joins in sql is not something  
>> to be
>> avoided,  the impact can be almost completely mitigated by caching  
>> the
>> results.   The only real cause for concern is for truly massive  
>> datasets
>> in which the join cannot be performed in the amount of memory  
>> available
>> to the database.
>>
>> -John
>>
>> Allen Gilliland wrote:
>>
>>> I don't want to lose this thread because I think there are still  
>>> some ideas to continue flushing out.  More comments inline ...
>>>
>>> On Fri, 2006-01-13 at 05:14, David Levy wrote:
>>>
>>>
>>>> Sorry to have taken so long.
>>>>
>>>> The denormalisations of the author_name (which may be owner  
>>>> name) and
>>>> entrydate are to support queries.  This is because I expect a  
>>>> macro to
>>>> create a tag cloud for a user so that the html versions can have  
>>>> the tag
>>>> cloud,
>>>>
>>>> So
>>>>
>>>> select normal_tag_name, count(*)
>>>> from entry2tags
>>>> where author_name = "DaveLevy"
>>>> group by normal_tag_name
>>>>
>>>> gives us the data required for a tag cloud, for a single blog .  
>>>> No join
>>>> as you can see, where it gets fun is if you want a hot tags cloud
>>>>
>>>> we add a line so the query becomes
>>>>
>>>> select normal_tag_name, count(*)
>>>> from entry2tags
>>>> where author_name = "DaveLevy"
>>>> and     entry_date > @sevedaysago
>>>> group by normal_tag_name
>>>>
>>>>
>>>
>>> very cool stuff ... i like the looks of that.
>>>
>>> ... lots of stuff chopped out here ...
>>>
>>>
>>>
>>>>>> is there a reason to copy down the entry_date rather than  
>>>>>> access it via a join on the weblogentry table?  you can join  
>>>>>> with the weblogentry table using the entry_id column.  how  
>>>>>> were you planning to use the entry_date field?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>> see above,  I don't want to join to the entry table, and this  
>>>> goal also
>>>> impacts my index design.  entry_date allows hot tags queries to be
>>>> driven by the entry not the tagged date
>>>>
>>>>
>>>
>>> ok.  I agree that joins are a likely performance problem, but my  
>>> next question then becomes ... How do we plan to deal with  
>>> getting the data for the list of entries marked with a given  
>>> tag?  I am expecting that when someone uses the tag dashboard or  
>>> a tag cloud to try and view a list of entries with the tag "foo",  
>>> that list will look something like the Roller front page.   
>>> example ...
>>>
>>> url = /roller/tag/entries/java+netbeans
>>>
>>> you then populate a page with 50? 25? entry summaries for people  
>>> to browse through and those summaries will at least require the  
>>> entry title and a summary of the entry content and may also  
>>> require the entry date, category, author, and weblog title.  I  
>>> would think we are going to require a join to get that data.
>>>
>>>
>>>
>>>>>> what acts as the primary key?  (author_name, entry_id,  
>>>>>> user_tag_name)?
>>>>>>
>>>>>>
>>>>>>
>>>> my PK is author_name, entry_id and normal_tag_name
>>>>
>>>>
>>>>
>>>>>> we may still need to use a surrogate key to uniquely identify  
>>>>>> the row to avoid having a multi column primary key.
>>>>>>
>>>>>>
>>>>>>
>>>> yeah, looks like it
>>>>
>>>>
>>>>
>>>>>> i'm not sure that would be much of an issue though because it  
>>>>>> doesn't look like you are planning for any joins for the tag  
>>>>>> names, correct?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>> I'm trying to avoid any joins, but if you are looking for  
>>>> entries on a
>>>> blog and tagged, then it would be good to enter the query on
>>>> author_name, but since we have not copied the title down to the
>>>> entry2tags table we still need the join and can go in on  
>>>> author_name on
>>>> either table, but best do it on entry table (see below).
>>>>
>>>> select  entry.title
>>>> from    entry, entry2tags e2t
>>>> where e2t.entry_id = entry.id
>>>> and     entry.author_name = @KnownName
>>>> and     e2t.normal_tag_name in (@TagQueryList)
>>>>
>>>>
>>>
>>> based on my example above, how would we get the necessary entry  
>>> data when we don't know the author name because we are searching  
>>> through the entire tag system, not just from a single author or  
>>> weblog?
>>>
>>> -- Allen
>>>
>>>
>>>
>>>
>>>>> Right, Allen. This is very similar to what we already have. I'm  
>>>>> not sure
>>>>> why having author name here, when we already have that through  
>>>>> a join
>>>>> with the entry_id.
>>>>>
>>>>>
>>>>>
>>>> That's right, but I don't want to join, the entry2tags table is big
>>>> enough without joining.
>>>>
>>>>
>>>>
>>>>> I'm not sure why do we need entry_date when we have
>>>>> the tagging date.
>>>>>
>>>>>
>>>>>
>>>> I think the queries should be driven through the entry  
>>>> publication date.
>>>>
>>>>
>>>>
>>>>> I do like the normal tag name. I was thinking this too for the  
>>>>> output of the Porter Stemming algorithm so I wouldn't lose the  
>>>>> original information entered by the user. Everything else is  
>>>>> the same :-).
>>>>>
>>>>>
>>>>>
>>>>>
>>>> I need to read your references to understand this, but I think  
>>>> you agree
>>>> this is OK
>>>>
>>>>
>>>>
>>>>> Another point I want to make is the fact, that we do a little  
>>>>> bit of
>>>>> extra processing when saving an existing entry that make sure  
>>>>> it keeps
>>>>> the original date for each tag. For example, when I first  
>>>>> created an
>>>>> entry I tagged it A,B,C. The first time I edited, I removed B.  
>>>>> The A and
>>>>> C tags will retain the dates when they were added as opposed to  
>>>>> the edit
>>>>> date.
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Are you holding entry_tag data on the entry table?
>>>>
>>>>
>>>>
>>>>> Additionally, we have a question of what to do with spaces.  
>>>>> Should tags
>>>>> be multi-word or not? My suggestion to Phay (one of our  
>>>>> developers) was
>>>>> to use spaces as separators in the input field, therefore not  
>>>>> supporting
>>>>> spaces. But we could do multiple things to support spaces, such as
>>>>> quoting multi-word tags. I believe Flickr supports multi-words  
>>>>> but they
>>>>> remove the spaces from the tags, but technorati does maintain  
>>>>> spaces. I
>>>>> don't like them myself, because I think it fragments the tag  
>>>>> space much
>>>>> more than single words and you could still use intersections to  
>>>>> get the
>>>>> sort of the same result.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>> create constraint defined_tags_name as
>>>>>>>   normal_tag_name=proper(user_tag_name)
>>>>>>>
>>>>>>> create constraint defined_tags_entry_date as
>>>>>>>   entry_date = select entry_date from entry
>>>>>>>                          where definedtags.entry_name =  
>>>>>>> entry.entry_name
>>>>>>>
>>>>>>> and the following indexes
>>>>>>>
>>>>>>> create dt.tags on definedtags
>>>>>>>   as author_name, entry_name, normal_tag_name unique
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>> This is the real primary key
>>>>
>>>>
>>>>
>>>>>>> create dt.tags2 on definedtags
>>>>>>>   as normal_tag_name
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>> This is the tag entity (or operational master)
>>>>
>>>>
>>>>
>>>>>>> create dt.entries on definedtags
>>>>>>>   as author_name, entry_name
>>>>>>>
>>>>>>> create dt.date_written
>>>>>>>   on definedtags as entry_date.
>>>>>>>
>>>>>>> I have written this in a hurry so it may not be though out as  
>>>>>>> well as
>>>>>>> some of my writing, but hopefully this is collaborative  
>>>>>>> development.
>>>>>>> Also I have difficulty in commenting on and reading some of the
>>>>>>> front-end & java orientated stuff. (I have ordered a couple  
>>>>>>> of books to
>>>>>>> help me catchup). Hopefully this is helpfull
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I think this all makes sense to me so far and I certainly  
>>>>>> appreciate the help.  I think getting the data model correct  
>>>>>> is a *very* important issue before we move forward with  
>>>>>> implemenatation, so I'm glad we are having this discussion.
>>>>>>
>>>>>> -- Allen
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> [1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
>>>>> [2] http://torrez.us/archives/2005/07/13/tagrank.pdf
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>> Elias Torres wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Welcome David to the Roller list.
>>>>>>>>
>>>>>>>> Thank you for your post. I have read your blog post on a tag  
>>>>>>>> data
>>>>>>>> model for Roller. I'm looking forward to your relational  
>>>>>>>> algebra and
>>>>>>>> query cost analysis. I wanted to tell you that we (IBM) have  
>>>>>>>> already
>>>>>>>> added basic tagging support to Roller and it actually  
>>>>>>>> supports a
>>>>>>>> TagCloud. I am supposed to put a proposal in the roller wiki  
>>>>>>>> so others
>>>>>>>> could comment and once I do that, you could put your  
>>>>>>>> comments there as
>>>>>>>> well.
>>>>>>>>
>>>>>>>> Just to kickstart the conversation I'm including the tagging  
>>>>>>>> table we
>>>>>>>> are currently using.
>>>>>>>>
>>>>>>>> create table weblogentrytag (
>>>>>>>>  id              varchar(48)   not null primary key,
>>>>>>>>  entryid        varchar(48)   not null,
>>>>>>>>  name            varchar(255)  not null,
>>>>>>>>  tagtime         timestamp     not null
>>>>>>>> );
>>>>>>>>
>>>>>>>> We have basically two tables: entries and entry2tags, but  
>>>>>>>> are missing
>>>>>>>> a tag table. At first, I was very set on having a tag table  
>>>>>>>> and use a
>>>>>>>> foreign key to "save" space on repetitive tag names. But I  
>>>>>>>> was shown
>>>>>>>> it's not really a big space saving technique, especially  
>>>>>>>> since tag
>>>>>>>> names are relatively short storing a guid or int would  
>>>>>>>> almost be
>>>>>>>> comparable in space. There are also increased costs in  
>>>>>>>> inserting and
>>>>>>>> joining on tables to get tag names if using a foreign key,  
>>>>>>>> so we have
>>>>>>>> settle on this for now until we have other queries  
>>>>>>>> requirements. I'll
>>>>>>>> be summarizing all of our changes to roller to support  
>>>>>>>> tagging in a
>>>>>>>> wiki proposal soon.
>>>>>>>>
>>>>>>>> Regarding the use of the list, some people have been using  
>>>>>>>> nabble.com
>>>>>>>> to interact with it. Maybe you can give it a try. I simply  
>>>>>>>> use gmail.
>>>>>>>>
>>>>>>>> http://www.nabble.com/Roller-f12275.html
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Elias
>>>>>>>>
>>>>>>>> On 1/4/06, David Levy <Da...@sun.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> I have documented a data model for tags. This is held at my  
>>>>>>>>> blog
>>>>>>>>>
>>>>>>>>> http://blogs.sun.com/roller/page/DaveLevy? 
>>>>>>>>> entry=implementing_tags_in_a_database
>>>>>>>>>
>>>>>>>>> I have a graphic demonstrating the relationship between  
>>>>>>>>> authors,
>>>>>>>>> articles and tags and illustrating the first and obvious  
>>>>>>>>> indexes. (I
>>>>>>>>> have identified that both "Date Published" and tag  
>>>>>>>>> aggregates are
>>>>>>>>> missing from the model).  Since the model was built to help me
>>>>>>>>> understand del.icio.us, I call the entities Users,  
>>>>>>>>> Bookmarks and Tags,
>>>>>>>>> but hopefully its simple to see that these are pretty  
>>>>>>>>> synonomous to
>>>>>>>>> authors, articles and tags.
>>>>>>>>>
>>>>>>>>> I hope that this is useful for those looking at  
>>>>>>>>> implementing tags.
>>>>>>>>>
>>>>>>>>> I am still working out how to use the mail-list, so I hope  
>>>>>>>>> that x-refing you to my blog isn't deprecated. I also need  
>>>>>>>>> to work how to maintain thread connections i.e. undertake a  
>>>>>>>>> reply.
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>> -- 
>>>>
>>>> Dave
>>>>
>>>> <http://www.sun.com> 	* David Levy *
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Blog http://blogs.sun.com/DaveLevy
>>>> Email David.Levy@Sun.COM
>>>>
>>>> Sun Proprietary & Confidential . This e-mail message is for the  
>>>> sole use
>>>> of the intended recipient(s) and may contain confidential and
>>>> privilidged information. Any unauthorised review, use,  
>>>> disclosure or
>>>> distribution is prohibited. If you are not the intended recepient,
>>>> please contact the sender by reply e-mail and destroy all copies  
>>>> of the
>>>> original message.
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>

Re: About Tags, a proposed data-model

Posted by Allen Gilliland <Al...@Sun.COM>.

I agree that is true *if* results are cached, but therein lies the problem.

I have spent quite a bit of time working on caching and performance in Roller and with our current setup it's not caching that is hard, it's having a big enough cache that's hard.  The fact is that a blog takes up a lot of space because you have to consider caching entries, comments, bookmarks & folders, categories, referers, and templates.  As a blog grows so do most of those things, especially the entries & comments.  On top of just caching those objects we currently cache fully rendered pages and feeds, so that means a handful of xml feeds and quite a few full html pages.  The point being, on a large site there is tons of data that needs caching already without having to cache tag related data.

Currently, I have no idea how we can expect to cache all the data that would be needed for a full tagging system along with everything else we cache right now.

-- Allen


On Mon, 2006-01-30 at 07:07, John Hoffmann wrote:
> I'd just like to add that performing joins in sql is not something to be 
> avoided,  the impact can be almost completely mitigated by caching the 
> results.   The only real cause for concern is for truly massive datasets 
> in which the join cannot be performed in the amount of memory available 
> to the database.
> 
> -John
> 
> Allen Gilliland wrote:
> 
> >I don't want to lose this thread because I think there are still some ideas to continue flushing out.  More comments inline ...
> >
> >On Fri, 2006-01-13 at 05:14, David Levy wrote:
> >  
> >
> >>Sorry to have taken so long.
> >>
> >>The denormalisations of the author_name (which may be owner name) and 
> >>entrydate are to support queries.  This is because I expect a macro to 
> >>create a tag cloud for a user so that the html versions can have the tag 
> >>cloud,
> >>
> >>So
> >>
> >>select normal_tag_name, count(*)
> >>from entry2tags
> >>where author_name = "DaveLevy"
> >>group by normal_tag_name
> >>
> >>gives us the data required for a tag cloud, for a single blog . No join 
> >>as you can see, where it gets fun is if you want a hot tags cloud
> >>
> >>we add a line so the query becomes
> >>
> >>select normal_tag_name, count(*)
> >>from entry2tags
> >>where author_name = "DaveLevy"
> >>and     entry_date > @sevedaysago
> >>group by normal_tag_name
> >>    
> >>
> >
> >very cool stuff ... i like the looks of that.
> >
> >... lots of stuff chopped out here ...
> >
> >  
> >
> >>>>is there a reason to copy down the entry_date rather than access it via a join on the weblogentry table?  you can join with the weblogentry table using the entry_id column.  how were you planning to use the entry_date field?
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>see above,  I don't want to join to the entry table, and this goal also 
> >>impacts my index design.  entry_date allows hot tags queries to be 
> >>driven by the entry not the tagged date
> >>    
> >>
> >
> >ok.  I agree that joins are a likely performance problem, but my next question then becomes ... How do we plan to deal with getting the data for the list of entries marked with a given tag?  I am expecting that when someone uses the tag dashboard or a tag cloud to try and view a list of entries with the tag "foo", that list will look something like the Roller front page.  example ...
> >
> >url = /roller/tag/entries/java+netbeans
> >
> >you then populate a page with 50? 25? entry summaries for people to browse through and those summaries will at least require the entry title and a summary of the entry content and may also require the entry date, category, author, and weblog title.  I would think we are going to require a join to get that data.
> >
> >  
> >
> >>>>what acts as the primary key?  (author_name, entry_id, user_tag_name)?  
> >>>>
> >>>>        
> >>>>
> >>my PK is author_name, entry_id and normal_tag_name
> >>
> >>    
> >>
> >>>>we may still need to use a surrogate key to uniquely identify the row to avoid having a multi column primary key. 
> >>>>
> >>>>        
> >>>>
> >>yeah, looks like it
> >>
> >>    
> >>
> >>>>i'm not sure that would be much of an issue though because it doesn't look like you are planning for any joins for the tag names, correct?
> >>>>
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>I'm trying to avoid any joins, but if you are looking for entries on a 
> >>blog and tagged, then it would be good to enter the query on 
> >>author_name, but since we have not copied the title down to the 
> >>entry2tags table we still need the join and can go in on author_name on 
> >>either table, but best do it on entry table (see below).
> >>
> >>select  entry.title
> >>from    entry, entry2tags e2t
> >>where e2t.entry_id = entry.id
> >>and     entry.author_name = @KnownName
> >>and     e2t.normal_tag_name in (@TagQueryList)
> >>    
> >>
> >
> >based on my example above, how would we get the necessary entry data when we don't know the author name because we are searching through the entire tag system, not just from a single author or weblog?
> >
> >-- Allen
> >
> >
> >  
> >
> >>>Right, Allen. This is very similar to what we already have. I'm not sure
> >>>why having author name here, when we already have that through a join
> >>>with the entry_id.
> >>>
> >>>      
> >>>
> >>That's right, but I don't want to join, the entry2tags table is big 
> >>enough without joining.
> >>
> >>    
> >>
> >>>I'm not sure why do we need entry_date when we have
> >>>the tagging date. 
> >>>
> >>>      
> >>>
> >>I think the queries should be driven through the entry publication date.
> >>
> >>    
> >>
> >>>I do like the normal tag name. I was thinking this too for the output of the Porter Stemming algorithm so I wouldn't lose the original information entered by the user. Everything else is the same :-).
> >>> 
> >>>
> >>>      
> >>>
> >>I need to read your references to understand this, but I think you agree 
> >>this is OK
> >>
> >>    
> >>
> >>>Another point I want to make is the fact, that we do a little bit of
> >>>extra processing when saving an existing entry that make sure it keeps
> >>>the original date for each tag. For example, when I first created an
> >>>entry I tagged it A,B,C. The first time I edited, I removed B. The A and
> >>>C tags will retain the dates when they were added as opposed to the edit
> >>>date.
> >>> 
> >>>
> >>>      
> >>>
> >>Are you holding entry_tag data on the entry table?
> >>
> >>    
> >>
> >>>Additionally, we have a question of what to do with spaces. Should tags
> >>>be multi-word or not? My suggestion to Phay (one of our developers) was
> >>>to use spaces as separators in the input field, therefore not supporting
> >>>spaces. But we could do multiple things to support spaces, such as
> >>>quoting multi-word tags. I believe Flickr supports multi-words but they
> >>>remove the spaces from the tags, but technorati does maintain spaces. I
> >>>don't like them myself, because I think it fragments the tag space much
> >>>more than single words and you could still use intersections to get the
> >>>sort of the same result.
> >>>
> >>>
> >>> 
> >>>
> >>>      
> >>>
> >>>>>create constraint defined_tags_name as
> >>>>>   normal_tag_name=proper(user_tag_name)
> >>>>>
> >>>>>create constraint defined_tags_entry_date as
> >>>>>   entry_date = select entry_date from entry
> >>>>>                          where definedtags.entry_name = entry.entry_name
> >>>>>
> >>>>>and the following indexes
> >>>>>
> >>>>>create dt.tags on definedtags
> >>>>>   as author_name, entry_name, normal_tag_name unique
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>This is the real primary key
> >>
> >>    
> >>
> >>>>>create dt.tags2 on definedtags
> >>>>>   as normal_tag_name
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>This is the tag entity (or operational master)
> >>
> >>    
> >>
> >>>>>create dt.entries on definedtags
> >>>>>   as author_name, entry_name
> >>>>>
> >>>>>create dt.date_written
> >>>>>   on definedtags as entry_date.
> >>>>>
> >>>>>I have written this in a hurry so it may not be though out as well as
> >>>>>some of my writing, but hopefully this is collaborative development.
> >>>>>Also I have difficulty in commenting on and reading some of the
> >>>>>front-end & java orientated stuff. (I have ordered a couple of books to
> >>>>>help me catchup). Hopefully this is helpfull
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>I think this all makes sense to me so far and I certainly appreciate the help.  I think getting the data model correct is a *very* important issue before we move forward with implemenatation, so I'm glad we are having this discussion.
> >>>>
> >>>>-- Allen
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>[1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
> >>>[2] http://torrez.us/archives/2005/07/13/tagrank.pdf
> >>>
> >>> 
> >>>
> >>>      
> >>>
> >>>>>Elias Torres wrote:
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>>>Welcome David to the Roller list.
> >>>>>>
> >>>>>>Thank you for your post. I have read your blog post on a tag data
> >>>>>>model for Roller. I'm looking forward to your relational algebra and
> >>>>>>query cost analysis. I wanted to tell you that we (IBM) have already
> >>>>>>added basic tagging support to Roller and it actually supports a
> >>>>>>TagCloud. I am supposed to put a proposal in the roller wiki so others
> >>>>>>could comment and once I do that, you could put your comments there as
> >>>>>>well.
> >>>>>>
> >>>>>>Just to kickstart the conversation I'm including the tagging table we
> >>>>>>are currently using.
> >>>>>>
> >>>>>>create table weblogentrytag (
> >>>>>>  id              varchar(48)   not null primary key,
> >>>>>>  entryid        varchar(48)   not null,
> >>>>>>  name            varchar(255)  not null,
> >>>>>>  tagtime         timestamp     not null
> >>>>>>);
> >>>>>>
> >>>>>>We have basically two tables: entries and entry2tags, but are missing
> >>>>>>a tag table. At first, I was very set on having a tag table and use a
> >>>>>>foreign key to "save" space on repetitive tag names. But I was shown
> >>>>>>it's not really a big space saving technique, especially since tag
> >>>>>>names are relatively short storing a guid or int would almost be
> >>>>>>comparable in space. There are also increased costs in inserting and
> >>>>>>joining on tables to get tag names if using a foreign key, so we have
> >>>>>>settle on this for now until we have other queries requirements. I'll
> >>>>>>be summarizing all of our changes to roller to support tagging in a
> >>>>>>wiki proposal soon.
> >>>>>>
> >>>>>>Regarding the use of the list, some people have been using nabble.com
> >>>>>>to interact with it. Maybe you can give it a try. I simply use gmail.
> >>>>>>
> >>>>>>http://www.nabble.com/Roller-f12275.html
> >>>>>>
> >>>>>>Regards,
> >>>>>>
> >>>>>>Elias
> >>>>>>
> >>>>>>On 1/4/06, David Levy <Da...@sun.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>>>I have documented a data model for tags. This is held at my blog
> >>>>>>>
> >>>>>>>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
> >>>>>>>
> >>>>>>>I have a graphic demonstrating the relationship between authors,
> >>>>>>>articles and tags and illustrating the first and obvious indexes. (I
> >>>>>>>have identified that both "Date Published" and tag aggregates are
> >>>>>>>missing from the model).  Since the model was built to help me
> >>>>>>>understand del.icio.us, I call the entities Users, Bookmarks and Tags,
> >>>>>>>but hopefully its simple to see that these are pretty synonomous to
> >>>>>>>authors, articles and tags.
> >>>>>>>
> >>>>>>>I hope that this is useful for those looking at implementing tags.
> >>>>>>>
> >>>>>>>I am still working out how to use the mail-list, so I hope that x-refing you to my blog isn't deprecated. I also need to work how to maintain thread connections i.e. undertake a reply.
> >>>>>>>--
> >>>>>>>
> >>>>>>>Dave
> >>>>>>>         
> >>>>>>>
> >>>>>>>              
> >>>>>>>
> >>>>>--
> >>>>>
> >>>>>Dave
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>-- 
> >>
> >>Dave
> >>
> >><http://www.sun.com> 	* David Levy *
> >>
> >>
> >>
> >>
> >>
> >>
> >>Blog http://blogs.sun.com/DaveLevy
> >>Email David.Levy@Sun.COM
> >>
> >>Sun Proprietary & Confidential . This e-mail message is for the sole use 
> >>of the intended recipient(s) and may contain confidential and 
> >>privilidged information. Any unauthorised review, use, disclosure or 
> >>distribution is prohibited. If you are not the intended recepient, 
> >>please contact the sender by reply e-mail and destroy all copies of the 
> >>original message.
> >>
> >>    
> >>
> >
> >  
> >
>

Re: About Tags, a proposed data-model

Posted by John Hoffmann <Jo...@Sun.COM>.

I'd just like to add that performing joins in sql is not something to be 
avoided,  the impact can be almost completely mitigated by caching the 
results.   The only real cause for concern is for truly massive datasets 
in which the join cannot be performed in the amount of memory available 
to the database.

-John

Allen Gilliland wrote:

>I don't want to lose this thread because I think there are still some ideas to continue flushing out.  More comments inline ...
>
>On Fri, 2006-01-13 at 05:14, David Levy wrote:
>  
>
>>Sorry to have taken so long.
>>
>>The denormalisations of the author_name (which may be owner name) and 
>>entrydate are to support queries.  This is because I expect a macro to 
>>create a tag cloud for a user so that the html versions can have the tag 
>>cloud,
>>
>>So
>>
>>select normal_tag_name, count(*)
>>from entry2tags
>>where author_name = "DaveLevy"
>>group by normal_tag_name
>>
>>gives us the data required for a tag cloud, for a single blog . No join 
>>as you can see, where it gets fun is if you want a hot tags cloud
>>
>>we add a line so the query becomes
>>
>>select normal_tag_name, count(*)
>>from entry2tags
>>where author_name = "DaveLevy"
>>and     entry_date > @sevedaysago
>>group by normal_tag_name
>>    
>>
>
>very cool stuff ... i like the looks of that.
>
>... lots of stuff chopped out here ...
>
>  
>
>>>>is there a reason to copy down the entry_date rather than access it via a join on the weblogentry table?  you can join with the weblogentry table using the entry_id column.  how were you planning to use the entry_date field?
>>>>   
>>>>
>>>>        
>>>>
>>see above,  I don't want to join to the entry table, and this goal also 
>>impacts my index design.  entry_date allows hot tags queries to be 
>>driven by the entry not the tagged date
>>    
>>
>
>ok.  I agree that joins are a likely performance problem, but my next question then becomes ... How do we plan to deal with getting the data for the list of entries marked with a given tag?  I am expecting that when someone uses the tag dashboard or a tag cloud to try and view a list of entries with the tag "foo", that list will look something like the Roller front page.  example ...
>
>url = /roller/tag/entries/java+netbeans
>
>you then populate a page with 50? 25? entry summaries for people to browse through and those summaries will at least require the entry title and a summary of the entry content and may also require the entry date, category, author, and weblog title.  I would think we are going to require a join to get that data.
>
>  
>
>>>>what acts as the primary key?  (author_name, entry_id, user_tag_name)?  
>>>>
>>>>        
>>>>
>>my PK is author_name, entry_id and normal_tag_name
>>
>>    
>>
>>>>we may still need to use a surrogate key to uniquely identify the row to avoid having a multi column primary key. 
>>>>
>>>>        
>>>>
>>yeah, looks like it
>>
>>    
>>
>>>>i'm not sure that would be much of an issue though because it doesn't look like you are planning for any joins for the tag names, correct?
>>>>
>>>>   
>>>>
>>>>        
>>>>
>>I'm trying to avoid any joins, but if you are looking for entries on a 
>>blog and tagged, then it would be good to enter the query on 
>>author_name, but since we have not copied the title down to the 
>>entry2tags table we still need the join and can go in on author_name on 
>>either table, but best do it on entry table (see below).
>>
>>select  entry.title
>>from    entry, entry2tags e2t
>>where e2t.entry_id = entry.id
>>and     entry.author_name = @KnownName
>>and     e2t.normal_tag_name in (@TagQueryList)
>>    
>>
>
>based on my example above, how would we get the necessary entry data when we don't know the author name because we are searching through the entire tag system, not just from a single author or weblog?
>
>-- Allen
>
>
>  
>
>>>Right, Allen. This is very similar to what we already have. I'm not sure
>>>why having author name here, when we already have that through a join
>>>with the entry_id.
>>>
>>>      
>>>
>>That's right, but I don't want to join, the entry2tags table is big 
>>enough without joining.
>>
>>    
>>
>>>I'm not sure why do we need entry_date when we have
>>>the tagging date. 
>>>
>>>      
>>>
>>I think the queries should be driven through the entry publication date.
>>
>>    
>>
>>>I do like the normal tag name. I was thinking this too for the output of the Porter Stemming algorithm so I wouldn't lose the original information entered by the user. Everything else is the same :-).
>>> 
>>>
>>>      
>>>
>>I need to read your references to understand this, but I think you agree 
>>this is OK
>>
>>    
>>
>>>Another point I want to make is the fact, that we do a little bit of
>>>extra processing when saving an existing entry that make sure it keeps
>>>the original date for each tag. For example, when I first created an
>>>entry I tagged it A,B,C. The first time I edited, I removed B. The A and
>>>C tags will retain the dates when they were added as opposed to the edit
>>>date.
>>> 
>>>
>>>      
>>>
>>Are you holding entry_tag data on the entry table?
>>
>>    
>>
>>>Additionally, we have a question of what to do with spaces. Should tags
>>>be multi-word or not? My suggestion to Phay (one of our developers) was
>>>to use spaces as separators in the input field, therefore not supporting
>>>spaces. But we could do multiple things to support spaces, such as
>>>quoting multi-word tags. I believe Flickr supports multi-words but they
>>>remove the spaces from the tags, but technorati does maintain spaces. I
>>>don't like them myself, because I think it fragments the tag space much
>>>more than single words and you could still use intersections to get the
>>>sort of the same result.
>>>
>>>
>>> 
>>>
>>>      
>>>
>>>>>create constraint defined_tags_name as
>>>>>   normal_tag_name=proper(user_tag_name)
>>>>>
>>>>>create constraint defined_tags_entry_date as
>>>>>   entry_date = select entry_date from entry
>>>>>                          where definedtags.entry_name = entry.entry_name
>>>>>
>>>>>and the following indexes
>>>>>
>>>>>create dt.tags on definedtags
>>>>>   as author_name, entry_name, normal_tag_name unique
>>>>>     
>>>>>
>>>>>          
>>>>>
>>This is the real primary key
>>
>>    
>>
>>>>>create dt.tags2 on definedtags
>>>>>   as normal_tag_name
>>>>>     
>>>>>
>>>>>          
>>>>>
>>This is the tag entity (or operational master)
>>
>>    
>>
>>>>>create dt.entries on definedtags
>>>>>   as author_name, entry_name
>>>>>
>>>>>create dt.date_written
>>>>>   on definedtags as entry_date.
>>>>>
>>>>>I have written this in a hurry so it may not be though out as well as
>>>>>some of my writing, but hopefully this is collaborative development.
>>>>>Also I have difficulty in commenting on and reading some of the
>>>>>front-end & java orientated stuff. (I have ordered a couple of books to
>>>>>help me catchup). Hopefully this is helpfull
>>>>>
>>>>>     
>>>>>
>>>>>          
>>>>>
>>>>I think this all makes sense to me so far and I certainly appreciate the help.  I think getting the data model correct is a *very* important issue before we move forward with implemenatation, so I'm glad we are having this discussion.
>>>>
>>>>-- Allen
>>>>   
>>>>
>>>>        
>>>>
>>>[1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
>>>[2] http://torrez.us/archives/2005/07/13/tagrank.pdf
>>>
>>> 
>>>
>>>      
>>>
>>>>>Elias Torres wrote:
>>>>>
>>>>>     
>>>>>
>>>>>          
>>>>>
>>>>>>Welcome David to the Roller list.
>>>>>>
>>>>>>Thank you for your post. I have read your blog post on a tag data
>>>>>>model for Roller. I'm looking forward to your relational algebra and
>>>>>>query cost analysis. I wanted to tell you that we (IBM) have already
>>>>>>added basic tagging support to Roller and it actually supports a
>>>>>>TagCloud. I am supposed to put a proposal in the roller wiki so others
>>>>>>could comment and once I do that, you could put your comments there as
>>>>>>well.
>>>>>>
>>>>>>Just to kickstart the conversation I'm including the tagging table we
>>>>>>are currently using.
>>>>>>
>>>>>>create table weblogentrytag (
>>>>>>  id              varchar(48)   not null primary key,
>>>>>>  entryid        varchar(48)   not null,
>>>>>>  name            varchar(255)  not null,
>>>>>>  tagtime         timestamp     not null
>>>>>>);
>>>>>>
>>>>>>We have basically two tables: entries and entry2tags, but are missing
>>>>>>a tag table. At first, I was very set on having a tag table and use a
>>>>>>foreign key to "save" space on repetitive tag names. But I was shown
>>>>>>it's not really a big space saving technique, especially since tag
>>>>>>names are relatively short storing a guid or int would almost be
>>>>>>comparable in space. There are also increased costs in inserting and
>>>>>>joining on tables to get tag names if using a foreign key, so we have
>>>>>>settle on this for now until we have other queries requirements. I'll
>>>>>>be summarizing all of our changes to roller to support tagging in a
>>>>>>wiki proposal soon.
>>>>>>
>>>>>>Regarding the use of the list, some people have been using nabble.com
>>>>>>to interact with it. Maybe you can give it a try. I simply use gmail.
>>>>>>
>>>>>>http://www.nabble.com/Roller-f12275.html
>>>>>>
>>>>>>Regards,
>>>>>>
>>>>>>Elias
>>>>>>
>>>>>>On 1/4/06, David Levy <Da...@sun.com> wrote:
>>>>>>
>>>>>>
>>>>>>       
>>>>>>
>>>>>>            
>>>>>>
>>>>>>>I have documented a data model for tags. This is held at my blog
>>>>>>>
>>>>>>>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
>>>>>>>
>>>>>>>I have a graphic demonstrating the relationship between authors,
>>>>>>>articles and tags and illustrating the first and obvious indexes. (I
>>>>>>>have identified that both "Date Published" and tag aggregates are
>>>>>>>missing from the model).  Since the model was built to help me
>>>>>>>understand del.icio.us, I call the entities Users, Bookmarks and Tags,
>>>>>>>but hopefully its simple to see that these are pretty synonomous to
>>>>>>>authors, articles and tags.
>>>>>>>
>>>>>>>I hope that this is useful for those looking at implementing tags.
>>>>>>>
>>>>>>>I am still working out how to use the mail-list, so I hope that x-refing you to my blog isn't deprecated. I also need to work how to maintain thread connections i.e. undertake a reply.
>>>>>>>--
>>>>>>>
>>>>>>>Dave
>>>>>>>         
>>>>>>>
>>>>>>>              
>>>>>>>
>>>>>--
>>>>>
>>>>>Dave
>>>>>
>>>>>     
>>>>>
>>>>>          
>>>>>
>>>>   
>>>>
>>>>        
>>>>
>>-- 
>>
>>Dave
>>
>><http://www.sun.com> 	* David Levy *
>>
>>
>>
>>
>>
>>
>>Blog http://blogs.sun.com/DaveLevy
>>Email David.Levy@Sun.COM
>>
>>Sun Proprietary & Confidential . This e-mail message is for the sole use 
>>of the intended recipient(s) and may contain confidential and 
>>privilidged information. Any unauthorised review, use, disclosure or 
>>distribution is prohibited. If you are not the intended recepient, 
>>please contact the sender by reply e-mail and destroy all copies of the 
>>original message.
>>
>>    
>>
>
>  
>

Re: About Tags, a proposed data-model

Posted by Allen Gilliland <Al...@Sun.COM>.

I don't want to lose this thread because I think there are still some ideas to continue flushing out.  More comments inline ...

On Fri, 2006-01-13 at 05:14, David Levy wrote:
> Sorry to have taken so long.
> 
> The denormalisations of the author_name (which may be owner name) and 
> entrydate are to support queries.  This is because I expect a macro to 
> create a tag cloud for a user so that the html versions can have the tag 
> cloud,
> 
> So
> 
> select normal_tag_name, count(*)
> from entry2tags
> where author_name = "DaveLevy"
> group by normal_tag_name
> 
> gives us the data required for a tag cloud, for a single blog . No join 
> as you can see, where it gets fun is if you want a hot tags cloud
> 
> we add a line so the query becomes
> 
> select normal_tag_name, count(*)
> from entry2tags
> where author_name = "DaveLevy"
> and     entry_date > @sevedaysago
> group by normal_tag_name

very cool stuff ... i like the looks of that.

... lots of stuff chopped out here ...

> 
> >>is there a reason to copy down the entry_date rather than access it via a join on the weblogentry table?  you can join with the weblogentry table using the entry_id column.  how were you planning to use the entry_date field?
> >>    
> >>
> see above,  I don't want to join to the entry table, and this goal also 
> impacts my index design.  entry_date allows hot tags queries to be 
> driven by the entry not the tagged date

ok.  I agree that joins are a likely performance problem, but my next question then becomes ... How do we plan to deal with getting the data for the list of entries marked with a given tag?  I am expecting that when someone uses the tag dashboard or a tag cloud to try and view a list of entries with the tag "foo", that list will look something like the Roller front page.  example ...

url = /roller/tag/entries/java+netbeans

you then populate a page with 50? 25? entry summaries for people to browse through and those summaries will at least require the entry title and a summary of the entry content and may also require the entry date, category, author, and weblog title.  I would think we are going to require a join to get that data.

> 
> >>what acts as the primary key?  (author_name, entry_id, user_tag_name)?  
> >>
> my PK is author_name, entry_id and normal_tag_name
> 
> >>we may still need to use a surrogate key to uniquely identify the row to avoid having a multi column primary key. 
> >>
> yeah, looks like it
> 
> >> i'm not sure that would be much of an issue though because it doesn't look like you are planning for any joins for the tag names, correct?
> >>
> >>    
> >>
> I'm trying to avoid any joins, but if you are looking for entries on a 
> blog and tagged, then it would be good to enter the query on 
> author_name, but since we have not copied the title down to the 
> entry2tags table we still need the join and can go in on author_name on 
> either table, but best do it on entry table (see below).
> 
> select  entry.title
> from    entry, entry2tags e2t
> where e2t.entry_id = entry.id
> and     entry.author_name = @KnownName
> and     e2t.normal_tag_name in (@TagQueryList)

based on my example above, how would we get the necessary entry data when we don't know the author name because we are searching through the entire tag system, not just from a single author or weblog?

-- Allen


> 
> >
> >Right, Allen. This is very similar to what we already have. I'm not sure
> >why having author name here, when we already have that through a join
> >with the entry_id.
> >
> That's right, but I don't want to join, the entry2tags table is big 
> enough without joining.
> 
> > I'm not sure why do we need entry_date when we have
> >the tagging date. 
> >
> I think the queries should be driven through the entry publication date.
> 
> >I do like the normal tag name. I was thinking this too for the output of the Porter Stemming algorithm so I wouldn't lose the original information entered by the user. Everything else is the same :-).
> >  
> >
> I need to read your references to understand this, but I think you agree 
> this is OK
> 
> >Another point I want to make is the fact, that we do a little bit of
> >extra processing when saving an existing entry that make sure it keeps
> >the original date for each tag. For example, when I first created an
> >entry I tagged it A,B,C. The first time I edited, I removed B. The A and
> >C tags will retain the dates when they were added as opposed to the edit
> >date.
> >  
> >
> Are you holding entry_tag data on the entry table?
> 
> >Additionally, we have a question of what to do with spaces. Should tags
> >be multi-word or not? My suggestion to Phay (one of our developers) was
> >to use spaces as separators in the input field, therefore not supporting
> >spaces. But we could do multiple things to support spaces, such as
> >quoting multi-word tags. I believe Flickr supports multi-words but they
> >remove the spaces from the tags, but technorati does maintain spaces. I
> >don't like them myself, because I think it fragments the tag space much
> >more than single words and you could still use intersections to get the
> >sort of the same result.
> >
> >
> >  
> >
> >>>create constraint defined_tags_name as
> >>>    normal_tag_name=proper(user_tag_name)
> >>>
> >>>create constraint defined_tags_entry_date as
> >>>    entry_date = select entry_date from entry
> >>>                           where definedtags.entry_name = entry.entry_name
> >>>
> >>>and the following indexes
> >>>
> >>>create dt.tags on definedtags
> >>>    as author_name, entry_name, normal_tag_name unique
> >>>      
> >>>
> This is the real primary key
> 
> >>>create dt.tags2 on definedtags
> >>>    as normal_tag_name
> >>>      
> >>>
> This is the tag entity (or operational master)
> 
> >>>create dt.entries on definedtags
> >>>    as author_name, entry_name
> >>>
> >>>create dt.date_written
> >>>    on definedtags as entry_date.
> >>>
> >>>I have written this in a hurry so it may not be though out as well as
> >>>some of my writing, but hopefully this is collaborative development.
> >>>Also I have difficulty in commenting on and reading some of the
> >>>front-end & java orientated stuff. (I have ordered a couple of books to
> >>>help me catchup). Hopefully this is helpfull
> >>>
> >>>      
> >>>
> >>I think this all makes sense to me so far and I certainly appreciate the help.  I think getting the data model correct is a *very* important issue before we move forward with implemenatation, so I'm glad we are having this discussion.
> >>
> >>-- Allen
> >>    
> >>
> >
> >[1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
> >[2] http://torrez.us/archives/2005/07/13/tagrank.pdf
> >
> >  
> >
> >>>Elias Torres wrote:
> >>>
> >>>      
> >>>
> >>>>Welcome David to the Roller list.
> >>>>
> >>>>Thank you for your post. I have read your blog post on a tag data
> >>>>model for Roller. I'm looking forward to your relational algebra and
> >>>>query cost analysis. I wanted to tell you that we (IBM) have already
> >>>>added basic tagging support to Roller and it actually supports a
> >>>>TagCloud. I am supposed to put a proposal in the roller wiki so others
> >>>>could comment and once I do that, you could put your comments there as
> >>>>well.
> >>>>
> >>>>Just to kickstart the conversation I'm including the tagging table we
> >>>>are currently using.
> >>>>
> >>>>create table weblogentrytag (
> >>>>   id              varchar(48)   not null primary key,
> >>>>   entryid        varchar(48)   not null,
> >>>>   name            varchar(255)  not null,
> >>>>   tagtime         timestamp     not null
> >>>>);
> >>>>
> >>>>We have basically two tables: entries and entry2tags, but are missing
> >>>>a tag table. At first, I was very set on having a tag table and use a
> >>>>foreign key to "save" space on repetitive tag names. But I was shown
> >>>>it's not really a big space saving technique, especially since tag
> >>>>names are relatively short storing a guid or int would almost be
> >>>>comparable in space. There are also increased costs in inserting and
> >>>>joining on tables to get tag names if using a foreign key, so we have
> >>>>settle on this for now until we have other queries requirements. I'll
> >>>>be summarizing all of our changes to roller to support tagging in a
> >>>>wiki proposal soon.
> >>>>
> >>>>Regarding the use of the list, some people have been using nabble.com
> >>>>to interact with it. Maybe you can give it a try. I simply use gmail.
> >>>>
> >>>>http://www.nabble.com/Roller-f12275.html
> >>>>
> >>>>Regards,
> >>>>
> >>>>Elias
> >>>>
> >>>>On 1/4/06, David Levy <Da...@sun.com> wrote:
> >>>>
> >>>>
> >>>>        
> >>>>
> >>>>>I have documented a data model for tags. This is held at my blog
> >>>>>
> >>>>>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
> >>>>>
> >>>>>I have a graphic demonstrating the relationship between authors,
> >>>>>articles and tags and illustrating the first and obvious indexes. (I
> >>>>>have identified that both "Date Published" and tag aggregates are
> >>>>>missing from the model).  Since the model was built to help me
> >>>>>understand del.icio.us, I call the entities Users, Bookmarks and Tags,
> >>>>>but hopefully its simple to see that these are pretty synonomous to
> >>>>>authors, articles and tags.
> >>>>>
> >>>>>I hope that this is useful for those looking at implementing tags.
> >>>>>
> >>>>>I am still working out how to use the mail-list, so I hope that x-refing you to my blog isn't deprecated. I also need to work how to maintain thread connections i.e. undertake a reply.
> >>>>>--
> >>>>>
> >>>>>Dave
> >>>>>          
> >>>>>
> >>>--
> >>>
> >>>Dave
> >>>
> >>>      
> >>>
> >>    
> >>
> 
> -- 
> 
> Dave
> 
> <http://www.sun.com> 	* David Levy *
> 
> 
> 
> 
> 
> 
> Blog http://blogs.sun.com/DaveLevy
> Email David.Levy@Sun.COM
> 
> Sun Proprietary & Confidential . This e-mail message is for the sole use 
> of the intended recipient(s) and may contain confidential and 
> privilidged information. Any unauthorised review, use, disclosure or 
> distribution is prohibited. If you are not the intended recepient, 
> please contact the sender by reply e-mail and destroy all copies of the 
> original message.
>

Re: About Tags, a proposed data-model

Posted by David Levy <Da...@Sun.COM>.

Sorry to have taken so long.

The denormalisations of the author_name (which may be owner name) and 
entrydate are to support queries.  This is because I expect a macro to 
create a tag cloud for a user so that the html versions can have the tag 
cloud,

So

select normal_tag_name, count(*)
from entry2tags
where author_name = "DaveLevy"
group by normal_tag_name

gives us the data required for a tag cloud, for a single blog . No join 
as you can see, where it gets fun is if you want a hot tags cloud

we add a line so the query becomes

select normal_tag_name, count(*)
from entry2tags
where author_name = "DaveLevy"
and     entry_date > @sevedaysago
group by normal_tag_name


(sorry for the pseudo code, I could do it in Sybase TSQL but its no more 
helpful than pseudo code). I have again avoided the join and believe 
that a hot tags cloud should be based upon the entry's date not on a tag 
created date so as to keep the 'hot' concept based upon what you've 
written, not what you've tagged. (This may be an important distinction 
and also a perfromance enhancement when doing a site wide "Hot Tags Cloud")

Further replies are cut into the mail body below, and I have deleted 
some copies of my signature file.

Elias Torres wrote:

>On 1/6/06, Allen Gilliland <Al...@sun.com> wrote:
>  
>
>>On Fri, 2006-01-06 at 07:05, David Levy wrote:
>>    
>>
>>>This has been written after the proposal has been updated and the table defintion documented there. i.e
>>>
>>>http://rollerweblogger.org/wiki/Wiki.jsp?page=Proposal_WeblogTags
>>>
>>>But I have decided to extent this thread; it pertains to the database data model.
>>>
>>>I have some questions.
>>>
>>>is id a surrogate (sequential or hashed) key
>>>what is entryid? i.e. the article URL? i.e. foreign key to entries
>>>what is name? (The value of the tag?) i.e. foreign key to Tags table
>>>      
>>>
>>Anil answered these in his email.
>>    
>>
>>>Do I need the definitions of the entries and entries2tags to understand what we need to do?
>>>
>>>I have come to the conclusion that using surrogate keys is(nearly
>>>always) a mistake. The space management advantages are normally negated by the requirement to join against the tables holding the real keys. You normally require an additonal unique index to enforce the uniqueness constraints against both the surrogate & real keys, and will often need to de-optimise the TNF structure to replicate the real data around the
>>>database to avoid otherwise unnecessary joins. Maintaining these
>>>de-optimisations have both human and perfromance costs.
>>>      
>>>
>>I am often of the same opinion, but part of the problem may be that Hibernate (our ORM persistence tool) suggests using surrogate keys and probably doesn't like multi column primary keys.
>>    
>>
Bother, so if we want the real key data on the table it needs to be 
copied down.

>>    
>>
>
>I agree with David. If you see that's exactly why we didn't normalize
>our table to create a tag table. We are instead just duplicating the
>actual tag because a surrogate key would take up even much more space
>than most of the tags. BTW, this is working fine with Hibernate so far,
>although, I believe some of the queries were written using HQL (???)
>because of their complexity in grouping, etc.
>  
>
and the tags entity is implemented in an index.

>
>  
>
>>>If you agree that real keys should be used then we need to consider
>>>capitalisation & plurals of tags. Easy one first - leave plural
>>>management to the authors. I know that I  am using blog & blogs
>>>seperately, althouhg maybe I should have chosen blogging & blog to be
>>>clearer, but the point is that tag authors are not going to think very
>>>hard about their tags, this is why the del.icio.us tag prompt scheme is
>>>so good. Delicious manages to ignore capitalisation, which I think is
>>>good. Its certainly best for query and aggregation for popularity
>>>queries., I can't believe that anyone wants to know that both "internet"
>>>& "Internet" are popular tags (technorati!)
>>>
>>>This implies that we might have two fields to document a tag,
>>>
>>>user_tag_name            char(32)
>>>normal_tag_name        char(32)  is proper(user_tag_name)
>>>      
>>>
>>I agree that plurals should be managed by users and tags should be case-insensitive.  My approach to this would be a bit more simplified though, I would prefer to simply have the software convert all tags to lowercase after user input.  I suppose some people could get a little upset by that, but it seems so much easier.
>>
>>Also, is that "proper()" function part of ANSI sql?  Will it work on all databases or have an equivalent on all databases?  I am also confused on how that would work when you are trying to determine how many times the tag "java" is applied.  How would you do that?
>>
>>    
>>
Actually I don't  know if its an SQL standard or not, but I have checked 
my Postgres manual and it doesn't seem to be there. So lower case it is 
then. , The code above uses the normal_tag_name  to give us the sum of 
queries java + Java (actually plus jAva etc ). It is the reason  why the 
queries above use the normal tag as the  projection list column.

>
>Our current implementation performs a lowercase on all tags (I took that
>from del.icio.us. But I'm torn on whether or not to let users manage
>plurals. I'll be adding to our internal server a filter using the Porter
>Stemming [1] algorithm to see if its results are desirable. I'll report
>on the experiment as we get some results.
>  
>
as you see I have been persauded that lower is right.

>I'm not necessarily convinced that the prompt scheme is all that
>beneficial. There are some possible consequences that I have studied a
>little bit on my TagRank [2] paper. The problem is that care must be
>taken to "suggest" tags because if not, the users will reach a local
>maximum and never allowing other tags to reach popular/meaningful
>levels in the network.
>  
>
>>>(sorry for the pseudo-code, but normal* needs to be enforced with a
>>>constraint or trigger.)
>>>
>>>This then leaves us with the question of indices (indexes?)
>>>
>>>The implications of my data model is that the entry_author key should be
>>>copied down through the entry tabel into the entry-tags table.
>>>
>>>author_name            char(32) is (part of) FK to  entry
>>>entry_name              char(128) is (part of) FK to entry
>>>
>>>although, a full roller URL contains the author name, but I assume that
>>>the entryname consists of the string used in the entry & anchor queries
>>>so I have a table that looks like this (again pseudo code and I have
>>>guessed at the database column types).
>>>
>>>create definedtags as (
>>>author_name                    char(32)
>>>entry_name (or id?) ,       char(128)
>>>user_tag_name,               char(64)
>>>normal_tag_name,           char(64)
>>>entry_date                         datetime
>>>date_created                    datetime )
>>>      
>>>
>>i think this is still pretty similar to what we have now correct?  the main difference being that you are tracking which author wrote the specified tag, along with what date the tagged entry was published.  i like the idea of having the FK to the author included because that would allow us to easily lookup all tags used by a specific author.
>>    
>>
Yup, see above

>>is there a reason to copy down the entry_date rather than access it via a join on the weblogentry table?  you can join with the weblogentry table using the entry_id column.  how were you planning to use the entry_date field?
>>    
>>
see above,  I don't want to join to the entry table, and this goal also 
impacts my index design.  entry_date allows hot tags queries to be 
driven by the entry not the tagged date

>>what acts as the primary key?  (author_name, entry_id, user_tag_name)?  
>>
my PK is author_name, entry_id and normal_tag_name

>>we may still need to use a surrogate key to uniquely identify the row to avoid having a multi column primary key. 
>>
yeah, looks like it

>> i'm not sure that would be much of an issue though because it doesn't look like you are planning for any joins for the tag names, correct?
>>
>>    
>>
I'm trying to avoid any joins, but if you are looking for entries on a 
blog and tagged, then it would be good to enter the query on 
author_name, but since we have not copied the title down to the 
entry2tags table we still need the join and can go in on author_name on 
either table, but best do it on entry table (see below).

select  entry.title
from    entry, entry2tags e2t
where e2t.entry_id = entry.id
and     entry.author_name = @KnownName
and     e2t.normal_tag_name in (@TagQueryList)

>
>Right, Allen. This is very similar to what we already have. I'm not sure
>why having author name here, when we already have that through a join
>with the entry_id.
>
That's right, but I don't want to join, the entry2tags table is big 
enough without joining.

> I'm not sure why do we need entry_date when we have
>the tagging date. 
>
I think the queries should be driven through the entry publication date.

>I do like the normal tag name. I was thinking this too for the output of the Porter Stemming algorithm so I wouldn't lose the original information entered by the user. Everything else is the same :-).
>  
>
I need to read your references to understand this, but I think you agree 
this is OK

>Another point I want to make is the fact, that we do a little bit of
>extra processing when saving an existing entry that make sure it keeps
>the original date for each tag. For example, when I first created an
>entry I tagged it A,B,C. The first time I edited, I removed B. The A and
>C tags will retain the dates when they were added as opposed to the edit
>date.
>  
>
Are you holding entry_tag data on the entry table?

>Additionally, we have a question of what to do with spaces. Should tags
>be multi-word or not? My suggestion to Phay (one of our developers) was
>to use spaces as separators in the input field, therefore not supporting
>spaces. But we could do multiple things to support spaces, such as
>quoting multi-word tags. I believe Flickr supports multi-words but they
>remove the spaces from the tags, but technorati does maintain spaces. I
>don't like them myself, because I think it fragments the tag space much
>more than single words and you could still use intersections to get the
>sort of the same result.
>
>
>  
>
>>>create constraint defined_tags_name as
>>>    normal_tag_name=proper(user_tag_name)
>>>
>>>create constraint defined_tags_entry_date as
>>>    entry_date = select entry_date from entry
>>>                           where definedtags.entry_name = entry.entry_name
>>>
>>>and the following indexes
>>>
>>>create dt.tags on definedtags
>>>    as author_name, entry_name, normal_tag_name unique
>>>      
>>>
This is the real primary key

>>>create dt.tags2 on definedtags
>>>    as normal_tag_name
>>>      
>>>
This is the tag entity (or operational master)

>>>create dt.entries on definedtags
>>>    as author_name, entry_name
>>>
>>>create dt.date_written
>>>    on definedtags as entry_date.
>>>
>>>I have written this in a hurry so it may not be though out as well as
>>>some of my writing, but hopefully this is collaborative development.
>>>Also I have difficulty in commenting on and reading some of the
>>>front-end & java orientated stuff. (I have ordered a couple of books to
>>>help me catchup). Hopefully this is helpfull
>>>
>>>      
>>>
>>I think this all makes sense to me so far and I certainly appreciate the help.  I think getting the data model correct is a *very* important issue before we move forward with implemenatation, so I'm glad we are having this discussion.
>>
>>-- Allen
>>    
>>
>
>[1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
>[2] http://torrez.us/archives/2005/07/13/tagrank.pdf
>
>  
>
>>>Elias Torres wrote:
>>>
>>>      
>>>
>>>>Welcome David to the Roller list.
>>>>
>>>>Thank you for your post. I have read your blog post on a tag data
>>>>model for Roller. I'm looking forward to your relational algebra and
>>>>query cost analysis. I wanted to tell you that we (IBM) have already
>>>>added basic tagging support to Roller and it actually supports a
>>>>TagCloud. I am supposed to put a proposal in the roller wiki so others
>>>>could comment and once I do that, you could put your comments there as
>>>>well.
>>>>
>>>>Just to kickstart the conversation I'm including the tagging table we
>>>>are currently using.
>>>>
>>>>create table weblogentrytag (
>>>>   id              varchar(48)   not null primary key,
>>>>   entryid        varchar(48)   not null,
>>>>   name            varchar(255)  not null,
>>>>   tagtime         timestamp     not null
>>>>);
>>>>
>>>>We have basically two tables: entries and entry2tags, but are missing
>>>>a tag table. At first, I was very set on having a tag table and use a
>>>>foreign key to "save" space on repetitive tag names. But I was shown
>>>>it's not really a big space saving technique, especially since tag
>>>>names are relatively short storing a guid or int would almost be
>>>>comparable in space. There are also increased costs in inserting and
>>>>joining on tables to get tag names if using a foreign key, so we have
>>>>settle on this for now until we have other queries requirements. I'll
>>>>be summarizing all of our changes to roller to support tagging in a
>>>>wiki proposal soon.
>>>>
>>>>Regarding the use of the list, some people have been using nabble.com
>>>>to interact with it. Maybe you can give it a try. I simply use gmail.
>>>>
>>>>http://www.nabble.com/Roller-f12275.html
>>>>
>>>>Regards,
>>>>
>>>>Elias
>>>>
>>>>On 1/4/06, David Levy <Da...@sun.com> wrote:
>>>>
>>>>
>>>>        
>>>>
>>>>>I have documented a data model for tags. This is held at my blog
>>>>>
>>>>>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
>>>>>
>>>>>I have a graphic demonstrating the relationship between authors,
>>>>>articles and tags and illustrating the first and obvious indexes. (I
>>>>>have identified that both "Date Published" and tag aggregates are
>>>>>missing from the model).  Since the model was built to help me
>>>>>understand del.icio.us, I call the entities Users, Bookmarks and Tags,
>>>>>but hopefully its simple to see that these are pretty synonomous to
>>>>>authors, articles and tags.
>>>>>
>>>>>I hope that this is useful for those looking at implementing tags.
>>>>>
>>>>>I am still working out how to use the mail-list, so I hope that x-refing you to my blog isn't deprecated. I also need to work how to maintain thread connections i.e. undertake a reply.
>>>>>--
>>>>>
>>>>>Dave
>>>>>          
>>>>>
>>>--
>>>
>>>Dave
>>>
>>>      
>>>
>>    
>>

-- 

Dave

<http://www.sun.com> 	* David Levy *






Blog http://blogs.sun.com/DaveLevy
Email David.Levy@Sun.COM

Sun Proprietary & Confidential . This e-mail message is for the sole use 
of the intended recipient(s) and may contain confidential and 
privilidged information. Any unauthorised review, use, disclosure or 
distribution is prohibited. If you are not the intended recepient, 
please contact the sender by reply e-mail and destroy all copies of the 
original message.

Re: About Tags, a proposed data-model

Posted by Elias Torres <el...@torrez.us>.

On 1/6/06, Allen Gilliland <Al...@sun.com> wrote:
> On Fri, 2006-01-06 at 07:05, David Levy wrote:
> > This has been written after the proposal has been updated and the table
> > defintion documented there. i.e
> >
> > http://rollerweblogger.org/wiki/Wiki.jsp?page=Proposal_WeblogTags
> >
> > But I have decided to extent this thread; it pertains to the database data model.
> >
> > I have some questions.
> >
> > is id a surrogate (sequential or hashed) key
> > what is entryid? i.e. the article URL? i.e. foreign key to entries
> > what is name? (The value of the tag?) i.e. foreign key to Tags table
>
> Anil answered these in his email.
>
> >
> > Do I need the definitions of the entries and entries2tags to understand what we need to do?
> >
> > I have come to the conclusion that using surrogate keys is(nearly
> > always) a mistake. The space management advantages are normally negated
> > by the requirement to join against the tables holding the real keys. You
> > normally require an additonal unique index to enforce the uniqueness
> > constraints against both the surrogate & real keys, and will often need
> > to de-optimise the TNF structure to replicate the real data around the
> > database to avoid otherwise unnecessary joins. Maintaining these
> > de-optimisations have both human and perfromance costs.
>
> I am often of the same opinion, but part of the problem may be that Hibernate (our ORM persistence tool) suggests using surrogate keys and probably doesn't like multi column primary keys.
>

I agree with David. If you see that's exactly why we didn't normalize
our table to create a tag table. We are instead just duplicating the
actual tag because a surrogate key would take up even much more space
than most of the tags. BTW, this is working fine with Hibernate so far,
although, I believe some of the queries were written using HQL (???)
because of their complexity in grouping, etc.


> >
> > If you agree that real keys should be used then we need to consider
> > capitalisation & plurals of tags. Easy one first - leave plural
> > management to the authors. I know that I  am using blog & blogs
> > seperately, althouhg maybe I should have chosen blogging & blog to be
> > clearer, but the point is that tag authors are not going to think very
> > hard about their tags, this is why the del.icio.us tag prompt scheme is
> > so good. Delicious manages to ignore capitalisation, which I think is
> > good. Its certainly best for query and aggregation for popularity
> > queries., I can't believe that anyone wants to know that both "internet"
> > & "Internet" are popular tags (technorati!)
> >
> > This implies that we might have two fields to document a tag,
> >
> > user_tag_name            char(32)
> > normal_tag_name        char(32)  is proper(user_tag_name)
>
> I agree that plurals should be managed by users and tags should be case-insensitive.  My approach to this would be a bit more simplified though, I would prefer to simply have the software convert all tags to lowercase after user input.  I suppose some people could get a little upset by that, but it seems so much easier.
>
> Also, is that "proper()" function part of ANSI sql?  Will it work on all databases or have an equivalent on all databases?  I am also confused on how that would work when you are trying to determine how many times the tag "java" is applied.  How would you do that?
>

Our current implementation performs a lowercase on all tags (I took that
from del.icio.us. But I'm torn on whether or not to let users manage
plurals. I'll be adding to our internal server a filter using the Porter
Stemming [1] algorithm to see if its results are desirable. I'll report
on the experiment as we get some results.

I'm not necessarily convinced that the prompt scheme is all that
beneficial. There are some possible consequences that I have studied a
little bit on my TagRank [2] paper. The problem is that care must be
taken to "suggest" tags because if not, the users will reach a local
maximum and never allowing other tags to reach popular/meaningful
levels in the network.

> >
> > (sorry for the pseudo-code, but normal* needs to be enforced with a
> > constraint or trigger.)
> >
> > This then leaves us with the question of indices (indexes?)
> >
> > The implications of my data model is that the entry_author key should be
> > copied down through the entry tabel into the entry-tags table.
> >
> > author_name            char(32) is (part of) FK to  entry
> > entry_name              char(128) is (part of) FK to entry
> >
> > although, a full roller URL contains the author name, but I assume that
> > the entryname consists of the string used in the entry & anchor queries
> > so I have a table that looks like this (again pseudo code and I have
> > guessed at the database column types).
> >
> > create definedtags as (
> > author_name                    char(32)
> > entry_name (or id?) ,       char(128)
> > user_tag_name,               char(64)
> > normal_tag_name,           char(64)
> > entry_date                         datetime
> > date_created                    datetime )
>
> i think this is still pretty similar to what we have now correct?  the main difference being that you are tracking which author wrote the specified tag, along with what date the tagged entry was published.  i like the idea of having the FK to the author included because that would allow us to easily lookup all tags used by a specific author.
>
> is there a reason to copy down the entry_date rather than access it via a join on the weblogentry table?  you can join with the weblogentry table using the entry_id column.  how were you planning to use the entry_date field?
>
> what acts as the primary key?  (author_name, entry_id, user_tag_name)?  we may still need to use a surrogate key to uniquely identify the row to avoid having a multi column primary key.  i'm not sure that would be much of an issue though because it doesn't look like you are planning for any joins for the tag names, correct?
>

Right, Allen. This is very similar to what we already have. I'm not sure
why having author name here, when we already have that through a join
with the entry_id. I'm not sure why do we need entry_date when we have
the tagging date. I do like the normal tag name. I was thinking this too
for the output of the Porter Stemming algorithm so I wouldn't lose the
original information entered by the user. Everything else is the same :-).

Another point I want to make is the fact, that we do a little bit of
extra processing when saving an existing entry that make sure it keeps
the original date for each tag. For example, when I first created an
entry I tagged it A,B,C. The first time I edited, I removed B. The A and
C tags will retain the dates when they were added as opposed to the edit
date.

Additionally, we have a question of what to do with spaces. Should tags
be multi-word or not? My suggestion to Phay (one of our developers) was
to use spaces as separators in the input field, therefore not supporting
spaces. But we could do multiple things to support spaces, such as
quoting multi-word tags. I believe Flickr supports multi-words but they
remove the spaces from the tags, but technorati does maintain spaces. I
don't like them myself, because I think it fragments the tag space much
more than single words and you could still use intersections to get the
sort of the same result.


> >
> > create constraint defined_tags_name as
> >     normal_tag_name=proper(user_tag_name)
> >
> > create constraint defined_tags_entry_date as
> >     entry_date = select entry_date from entry
> >                            where definedtags.entry_name = entry.entry_name
> >
> > and the following indexes
> >
> > create dt.tags on definedtags
> >     as author_name, entry_name, normal_tag_name unique
> >
> > create dt.tags2 on definedtags
> >     as normal_tag_name
> >
> > create dt.entries on definedtags
> >     as author_name, entry_name
> >
> > create dt.date_written
> >     on definedtags as entry_date.
> >
> > I have written this in a hurry so it may not be though out as well as
> > some of my writing, but hopefully this is collaborative development.
> > Also I have difficulty in commenting on and reading some of the
> > front-end & java orientated stuff. (I have ordered a couple of books to
> > help me catchup). Hopefully this is helpfull
> >
>
> I think this all makes sense to me so far and I certainly appreciate the help.  I think getting the data model correct is a *very* important issue before we move forward with implemenatation, so I'm glad we are having this discussion.
>
> -- Allen

[1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
[2] http://torrez.us/archives/2005/07/13/tagrank.pdf

>
> >
> >
> > Elias Torres wrote:
> >
> > >Welcome David to the Roller list.
> > >
> > >Thank you for your post. I have read your blog post on a tag data
> > >model for Roller. I'm looking forward to your relational algebra and
> > >query cost analysis. I wanted to tell you that we (IBM) have already
> > >added basic tagging support to Roller and it actually supports a
> > >TagCloud. I am supposed to put a proposal in the roller wiki so others
> > >could comment and once I do that, you could put your comments there as
> > >well.
> > >
> > >Just to kickstart the conversation I'm including the tagging table we
> > >are currently using.
> > >
> > >create table weblogentrytag (
> > >    id              varchar(48)   not null primary key,
> > >    entryid        varchar(48)   not null,
> > >    name            varchar(255)  not null,
> > >    tagtime         timestamp     not null
> > >);
> > >
> > >We have basically two tables: entries and entry2tags, but are missing
> > >a tag table. At first, I was very set on having a tag table and use a
> > >foreign key to "save" space on repetitive tag names. But I was shown
> > >it's not really a big space saving technique, especially since tag
> > >names are relatively short storing a guid or int would almost be
> > >comparable in space. There are also increased costs in inserting and
> > >joining on tables to get tag names if using a foreign key, so we have
> > >settle on this for now until we have other queries requirements. I'll
> > >be summarizing all of our changes to roller to support tagging in a
> > >wiki proposal soon.
> > >
> > >Regarding the use of the list, some people have been using nabble.com
> > >to interact with it. Maybe you can give it a try. I simply use gmail.
> > >
> > >http://www.nabble.com/Roller-f12275.html
> > >
> > >Regards,
> > >
> > >Elias
> > >
> > >On 1/4/06, David Levy <Da...@sun.com> wrote:
> > >
> > >
> > >>I have documented a data model for tags. This is held at my blog
> > >>
> > >>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
> > >>
> > >>I have a graphic demonstrating the relationship between authors,
> > >>articles and tags and illustrating the first and obvious indexes. (I
> > >>have identified that both "Date Published" and tag aggregates are
> > >>missing from the model).  Since the model was built to help me
> > >>understand del.icio.us, I call the entities Users, Bookmarks and Tags,
> > >>but hopefully its simple to see that these are pretty synonomous to
> > >>authors, articles and tags.
> > >>
> > >>I hope that this is useful for those looking at implementing tags.
> > >>
> > >>I am still working out how to use the mail-list, so I hope that x-refing
> > >>you to my blog isn't deprecated. I also need to work how to maintain
> > >>thread connections i.e. undertake a reply.
> > >>--
> > >>
> > >>Dave
> > >>
> > >><http://www.sun.com>    * David Levy *
> > >>*Sun Microsystems Ltd.*
> > >>55, King William St.,
> > >>London EC4R 9ND United Kingdom
> > >>
> > >>Phone +44 (0) 20 7469 9908/x18308
> > >>Mobile +44 (0) 7710-360922
> > >>Blog http://blogs.sun.com/DaveLevy
> > >>Email David.Levy@Sun.COM
> > >>
> > >>Sun Proprietary & Confidential . This e-mail message is for the sole use
> > >>of the intended recipient(s) and may contain confidential and
> > >>privilidged information. Any unauthorised review, use, disclosure or
> > >>distribution is prohibited. If you are not the intended recepient,
> > >>please contact the sender by reply e-mail and destroy all copies of the
> > >>original message.
> > >>
> > >>
> > >>
> > >>
> >
> > --
> >
> > Dave
> >
> > <http://www.sun.com>  * David Levy *
> > *Sun Microsystems Ltd.*
> > 55, King William St.,
> > London EC4R 9ND United Kingdom
> >
> > Phone +44 (0) 20 7469 9908/x18308
> > Mobile +44 (0) 7710-360922
> > Blog http://blogs.sun.com/DaveLevy
> > Email David.Levy@Sun.COM
> >
> > Sun Proprietary & Confidential . This e-mail message is for the sole use
> > of the intended recipient(s) and may contain confidential and
> > privilidged information. Any unauthorised review, use, disclosure or
> > distribution is prohibited. If you are not the intended recepient,
> > please contact the sender by reply e-mail and destroy all copies of the
> > original message.
> >
>
>

Re: About Tags, a proposed data-model

Posted by Allen Gilliland <Al...@Sun.COM>.

On Fri, 2006-01-06 at 07:05, David Levy wrote:
> This has been written after the proposal has been updated and the table 
> defintion documented there. i.e
> 
> http://rollerweblogger.org/wiki/Wiki.jsp?page=Proposal_WeblogTags
> 
> But I have decided to extent this thread; it pertains to the database data model.
> 
> I have some questions.
> 
> is id a surrogate (sequential or hashed) key
> what is entryid? i.e. the article URL? i.e. foreign key to entries
> what is name? (The value of the tag?) i.e. foreign key to Tags table

Anil answered these in his email.

> 
> Do I need the definitions of the entries and entries2tags to understand what we need to do?
> 
> I have come to the conclusion that using surrogate keys is(nearly 
> always) a mistake. The space management advantages are normally negated 
> by the requirement to join against the tables holding the real keys. You 
> normally require an additonal unique index to enforce the uniqueness 
> constraints against both the surrogate & real keys, and will often need 
> to de-optimise the TNF structure to replicate the real data around the 
> database to avoid otherwise unnecessary joins. Maintaining these 
> de-optimisations have both human and perfromance costs.

I am often of the same opinion, but part of the problem may be that Hibernate (our ORM persistence tool) suggests using surrogate keys and probably doesn't like multi column primary keys.

> 
> If you agree that real keys should be used then we need to consider 
> capitalisation & plurals of tags. Easy one first - leave plural 
> management to the authors. I know that I  am using blog & blogs 
> seperately, althouhg maybe I should have chosen blogging & blog to be 
> clearer, but the point is that tag authors are not going to think very 
> hard about their tags, this is why the del.icio.us tag prompt scheme is 
> so good. Delicious manages to ignore capitalisation, which I think is 
> good. Its certainly best for query and aggregation for popularity 
> queries., I can't believe that anyone wants to know that both "internet" 
> & "Internet" are popular tags (technorati!)
> 
> This implies that we might have two fields to document a tag,
> 
> user_tag_name            char(32)
> normal_tag_name        char(32)  is proper(user_tag_name)

I agree that plurals should be managed by users and tags should be case-insensitive.  My approach to this would be a bit more simplified though, I would prefer to simply have the software convert all tags to lowercase after user input.  I suppose some people could get a little upset by that, but it seems so much easier.  

Also, is that "proper()" function part of ANSI sql?  Will it work on all databases or have an equivalent on all databases?  I am also confused on how that would work when you are trying to determine how many times the tag "java" is applied.  How would you do that?

> 
> (sorry for the pseudo-code, but normal* needs to be enforced with a 
> constraint or trigger.)
> 
> This then leaves us with the question of indices (indexes?)
> 
> The implications of my data model is that the entry_author key should be 
> copied down through the entry tabel into the entry-tags table.
> 
> author_name            char(32) is (part of) FK to  entry
> entry_name              char(128) is (part of) FK to entry
> 
> although, a full roller URL contains the author name, but I assume that 
> the entryname consists of the string used in the entry & anchor queries 
> so I have a table that looks like this (again pseudo code and I have 
> guessed at the database column types).
> 
> create definedtags as (
> author_name                    char(32)
> entry_name (or id?) ,       char(128)
> user_tag_name,               char(64)
> normal_tag_name,           char(64)
> entry_date                         datetime
> date_created                    datetime )

i think this is still pretty similar to what we have now correct?  the main difference being that you are tracking which author wrote the specified tag, along with what date the tagged entry was published.  i like the idea of having the FK to the author included because that would allow us to easily lookup all tags used by a specific author.

is there a reason to copy down the entry_date rather than access it via a join on the weblogentry table?  you can join with the weblogentry table using the entry_id column.  how were you planning to use the entry_date field?

what acts as the primary key?  (author_name, entry_id, user_tag_name)?  we may still need to use a surrogate key to uniquely identify the row to avoid having a multi column primary key.  i'm not sure that would be much of an issue though because it doesn't look like you are planning for any joins for the tag names, correct?

> 
> create constraint defined_tags_name as
>     normal_tag_name=proper(user_tag_name)
> 
> create constraint defined_tags_entry_date as
>     entry_date = select entry_date from entry
>                            where definedtags.entry_name = entry.entry_name
> 
> and the following indexes
> 
> create dt.tags on definedtags
>     as author_name, entry_name, normal_tag_name unique
> 
> create dt.tags2 on definedtags
>     as normal_tag_name
> 
> create dt.entries on definedtags
>     as author_name, entry_name
> 
> create dt.date_written
>     on definedtags as entry_date.
> 
> I have written this in a hurry so it may not be though out as well as 
> some of my writing, but hopefully this is collaborative development. 
> Also I have difficulty in commenting on and reading some of the 
> front-end & java orientated stuff. (I have ordered a couple of books to 
> help me catchup). Hopefully this is helpfull
> 

I think this all makes sense to me so far and I certainly appreciate the help.  I think getting the data model correct is a *very* important issue before we move forward with implemenatation, so I'm glad we are having this discussion.

-- Allen

> 
> 
> Elias Torres wrote:
> 
> >Welcome David to the Roller list.
> >
> >Thank you for your post. I have read your blog post on a tag data
> >model for Roller. I'm looking forward to your relational algebra and
> >query cost analysis. I wanted to tell you that we (IBM) have already
> >added basic tagging support to Roller and it actually supports a
> >TagCloud. I am supposed to put a proposal in the roller wiki so others
> >could comment and once I do that, you could put your comments there as
> >well.
> >
> >Just to kickstart the conversation I'm including the tagging table we
> >are currently using.
> >
> >create table weblogentrytag (
> >    id              varchar(48)   not null primary key,
> >    entryid        varchar(48)   not null,
> >    name            varchar(255)  not null,
> >    tagtime         timestamp     not null
> >);
> >
> >We have basically two tables: entries and entry2tags, but are missing
> >a tag table. At first, I was very set on having a tag table and use a
> >foreign key to "save" space on repetitive tag names. But I was shown
> >it's not really a big space saving technique, especially since tag
> >names are relatively short storing a guid or int would almost be
> >comparable in space. There are also increased costs in inserting and
> >joining on tables to get tag names if using a foreign key, so we have
> >settle on this for now until we have other queries requirements. I'll
> >be summarizing all of our changes to roller to support tagging in a
> >wiki proposal soon.
> >
> >Regarding the use of the list, some people have been using nabble.com
> >to interact with it. Maybe you can give it a try. I simply use gmail.
> >
> >http://www.nabble.com/Roller-f12275.html
> >
> >Regards,
> >
> >Elias
> >
> >On 1/4/06, David Levy <Da...@sun.com> wrote:
> >  
> >
> >>I have documented a data model for tags. This is held at my blog
> >>
> >>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
> >>
> >>I have a graphic demonstrating the relationship between authors,
> >>articles and tags and illustrating the first and obvious indexes. (I
> >>have identified that both "Date Published" and tag aggregates are
> >>missing from the model).  Since the model was built to help me
> >>understand del.icio.us, I call the entities Users, Bookmarks and Tags,
> >>but hopefully its simple to see that these are pretty synonomous to
> >>authors, articles and tags.
> >>
> >>I hope that this is useful for those looking at implementing tags.
> >>
> >>I am still working out how to use the mail-list, so I hope that x-refing
> >>you to my blog isn't deprecated. I also need to work how to maintain
> >>thread connections i.e. undertake a reply.
> >>--
> >>
> >>Dave
> >>
> >><http://www.sun.com>    * David Levy *
> >>*Sun Microsystems Ltd.*
> >>55, King William St.,
> >>London EC4R 9ND United Kingdom
> >>
> >>Phone +44 (0) 20 7469 9908/x18308
> >>Mobile +44 (0) 7710-360922
> >>Blog http://blogs.sun.com/DaveLevy
> >>Email David.Levy@Sun.COM
> >>
> >>Sun Proprietary & Confidential . This e-mail message is for the sole use
> >>of the intended recipient(s) and may contain confidential and
> >>privilidged information. Any unauthorised review, use, disclosure or
> >>distribution is prohibited. If you are not the intended recepient,
> >>please contact the sender by reply e-mail and destroy all copies of the
> >>original message.
> >>
> >>
> >>    
> >>
> 
> -- 
> 
> Dave
> 
> <http://www.sun.com> 	* David Levy *
> *Sun Microsystems Ltd.*
> 55, King William St.,
> London EC4R 9ND United Kingdom
> 
> Phone +44 (0) 20 7469 9908/x18308
> Mobile +44 (0) 7710-360922
> Blog http://blogs.sun.com/DaveLevy
> Email David.Levy@Sun.COM
> 
> Sun Proprietary & Confidential . This e-mail message is for the sole use 
> of the intended recipient(s) and may contain confidential and 
> privilidged information. Any unauthorised review, use, disclosure or 
> distribution is prohibited. If you are not the intended recepient, 
> please contact the sender by reply e-mail and destroy all copies of the 
> original message.
>

Re: About Tags, a proposed data-model

Posted by David Levy <Da...@Sun.COM>.

This has been written after the proposal has been updated and the table 
defintion documented there. i.e

http://rollerweblogger.org/wiki/Wiki.jsp?page=Proposal_WeblogTags

But I have decided to extent this thread; it pertains to the database data model.

I have some questions.

is id a surrogate (sequential or hashed) key
what is entryid? i.e. the article URL? i.e. foreign key to entries
what is name? (The value of the tag?) i.e. foreign key to Tags table

Do I need the definitions of the entries and entries2tags to understand what we need to do?

I have come to the conclusion that using surrogate keys is(nearly 
always) a mistake. The space management advantages are normally negated 
by the requirement to join against the tables holding the real keys. You 
normally require an additonal unique index to enforce the uniqueness 
constraints against both the surrogate & real keys, and will often need 
to de-optimise the TNF structure to replicate the real data around the 
database to avoid otherwise unnecessary joins. Maintaining these 
de-optimisations have both human and perfromance costs.

If you agree that real keys should be used then we need to consider 
capitalisation & plurals of tags. Easy one first - leave plural 
management to the authors. I know that I  am using blog & blogs 
seperately, althouhg maybe I should have chosen blogging & blog to be 
clearer, but the point is that tag authors are not going to think very 
hard about their tags, this is why the del.icio.us tag prompt scheme is 
so good. Delicious manages to ignore capitalisation, which I think is 
good. Its certainly best for query and aggregation for popularity 
queries., I can't believe that anyone wants to know that both "internet" 
& "Internet" are popular tags (technorati!)

This implies that we might have two fields to document a tag,

user_tag_name            char(32)
normal_tag_name        char(32)  is proper(user_tag_name)

(sorry for the pseudo-code, but normal* needs to be enforced with a 
constraint or trigger.)

This then leaves us with the question of indices (indexes?)

The implications of my data model is that the entry_author key should be 
copied down through the entry tabel into the entry-tags table.

author_name            char(32) is (part of) FK to  entry
entry_name              char(128) is (part of) FK to entry

although, a full roller URL contains the author name, but I assume that 
the entryname consists of the string used in the entry & anchor queries 
so I have a table that looks like this (again pseudo code and I have 
guessed at the database column types).

create definedtags as (
author_name                    char(32)
entry_name (or id?) ,       char(128)
user_tag_name,               char(64)
normal_tag_name,           char(64)
entry_date                         datetime
date_created                    datetime )

create constraint defined_tags_name as
    normal_tag_name=proper(user_tag_name)

create constraint defined_tags_entry_date as
    entry_date = select entry_date from entry
                           where definedtags.entry_name = entry.entry_name

and the following indexes

create dt.tags on definedtags
    as author_name, entry_name, normal_tag_name unique

create dt.tags2 on definedtags
    as normal_tag_name

create dt.entries on definedtags
    as author_name, entry_name

create dt.date_written
    on definedtags as entry_date.

I have written this in a hurry so it may not be though out as well as 
some of my writing, but hopefully this is collaborative development. 
Also I have difficulty in commenting on and reading some of the 
front-end & java orientated stuff. (I have ordered a couple of books to 
help me catchup). Hopefully this is helpfull

Elias Torres wrote:

>Welcome David to the Roller list.
>
>Thank you for your post. I have read your blog post on a tag data
>model for Roller. I'm looking forward to your relational algebra and
>query cost analysis. I wanted to tell you that we (IBM) have already
>added basic tagging support to Roller and it actually supports a
>TagCloud. I am supposed to put a proposal in the roller wiki so others
>could comment and once I do that, you could put your comments there as
>well.
>
>Just to kickstart the conversation I'm including the tagging table we
>are currently using.
>
>create table weblogentrytag (
>    id              varchar(48)   not null primary key,
>    entryid        varchar(48)   not null,
>    name            varchar(255)  not null,
>    tagtime         timestamp     not null
>);
>
>We have basically two tables: entries and entry2tags, but are missing
>a tag table. At first, I was very set on having a tag table and use a
>foreign key to "save" space on repetitive tag names. But I was shown
>it's not really a big space saving technique, especially since tag
>names are relatively short storing a guid or int would almost be
>comparable in space. There are also increased costs in inserting and
>joining on tables to get tag names if using a foreign key, so we have
>settle on this for now until we have other queries requirements. I'll
>be summarizing all of our changes to roller to support tagging in a
>wiki proposal soon.
>
>Regarding the use of the list, some people have been using nabble.com
>to interact with it. Maybe you can give it a try. I simply use gmail.
>
>http://www.nabble.com/Roller-f12275.html
>
>Regards,
>
>Elias
>
>On 1/4/06, David Levy <Da...@sun.com> wrote:
>  
>
>>I have documented a data model for tags. This is held at my blog
>>
>>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
>>
>>I have a graphic demonstrating the relationship between authors,
>>articles and tags and illustrating the first and obvious indexes. (I
>>have identified that both "Date Published" and tag aggregates are
>>missing from the model).  Since the model was built to help me
>>understand del.icio.us, I call the entities Users, Bookmarks and Tags,
>>but hopefully its simple to see that these are pretty synonomous to
>>authors, articles and tags.
>>
>>I hope that this is useful for those looking at implementing tags.
>>
>>I am still working out how to use the mail-list, so I hope that x-refing
>>you to my blog isn't deprecated. I also need to work how to maintain
>>thread connections i.e. undertake a reply.
>>--
>>
>>Dave
>>
>><http://www.sun.com>    * David Levy *
>>*Sun Microsystems Ltd.*
>>55, King William St.,
>>London EC4R 9ND United Kingdom
>>
>>Phone +44 (0) 20 7469 9908/x18308
>>Mobile +44 (0) 7710-360922
>>Blog http://blogs.sun.com/DaveLevy
>>Email David.Levy@Sun.COM
>>
>>Sun Proprietary & Confidential . This e-mail message is for the sole use
>>of the intended recipient(s) and may contain confidential and
>>privilidged information. Any unauthorised review, use, disclosure or
>>distribution is prohibited. If you are not the intended recepient,
>>please contact the sender by reply e-mail and destroy all copies of the
>>original message.
>>
>>
>>    
>>

-- 

Dave

<http://www.sun.com> 	* David Levy *
*Sun Microsystems Ltd.*
55, King William St.,
London EC4R 9ND United Kingdom

Phone +44 (0) 20 7469 9908/x18308
Mobile +44 (0) 7710-360922
Blog http://blogs.sun.com/DaveLevy
Email David.Levy@Sun.COM

Sun Proprietary & Confidential . This e-mail message is for the sole use 
of the intended recipient(s) and may contain confidential and 
privilidged information. Any unauthorised review, use, disclosure or 
distribution is prohibited. If you are not the intended recepient, 
please contact the sender by reply e-mail and destroy all copies of the 
original message.

Re: About Tags, a proposed data-model

Posted by Elias Torres <el...@torrez.us>.

Welcome David to the Roller list.

Thank you for your post. I have read your blog post on a tag data
model for Roller. I'm looking forward to your relational algebra and
query cost analysis. I wanted to tell you that we (IBM) have already
added basic tagging support to Roller and it actually supports a
TagCloud. I am supposed to put a proposal in the roller wiki so others
could comment and once I do that, you could put your comments there as
well.

Just to kickstart the conversation I'm including the tagging table we
are currently using.

create table weblogentrytag (
    id              varchar(48)   not null primary key,
    entryid        varchar(48)   not null,
    name            varchar(255)  not null,
    tagtime         timestamp     not null
);

We have basically two tables: entries and entry2tags, but are missing
a tag table. At first, I was very set on having a tag table and use a
foreign key to "save" space on repetitive tag names. But I was shown
it's not really a big space saving technique, especially since tag
names are relatively short storing a guid or int would almost be
comparable in space. There are also increased costs in inserting and
joining on tables to get tag names if using a foreign key, so we have
settle on this for now until we have other queries requirements. I'll
be summarizing all of our changes to roller to support tagging in a
wiki proposal soon.

Regarding the use of the list, some people have been using nabble.com
to interact with it. Maybe you can give it a try. I simply use gmail.

http://www.nabble.com/Roller-f12275.html

Regards,

Elias

On 1/4/06, David Levy <Da...@sun.com> wrote:
> I have documented a data model for tags. This is held at my blog
>
> http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
>
> I have a graphic demonstrating the relationship between authors,
> articles and tags and illustrating the first and obvious indexes. (I
> have identified that both "Date Published" and tag aggregates are
> missing from the model).  Since the model was built to help me
> understand del.icio.us, I call the entities Users, Bookmarks and Tags,
> but hopefully its simple to see that these are pretty synonomous to
> authors, articles and tags.
>
> I hope that this is useful for those looking at implementing tags.
>
> I am still working out how to use the mail-list, so I hope that x-refing
> you to my blog isn't deprecated. I also need to work how to maintain
> thread connections i.e. undertake a reply.
> --
>
> Dave
>
> <http://www.sun.com>    * David Levy *
> *Sun Microsystems Ltd.*
> 55, King William St.,
> London EC4R 9ND United Kingdom
>
> Phone +44 (0) 20 7469 9908/x18308
> Mobile +44 (0) 7710-360922
> Blog http://blogs.sun.com/DaveLevy
> Email David.Levy@Sun.COM
>
> Sun Proprietary & Confidential . This e-mail message is for the sole use
> of the intended recipient(s) and may contain confidential and
> privilidged information. Any unauthorised review, use, disclosure or
> distribution is prohibited. If you are not the intended recepient,
> please contact the sender by reply e-mail and destroy all copies of the
> original message.
>
>