You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@roller.apache.org by David Levy <Da...@Sun.COM> on 2006/01/31 13:50:02 UTC
[Fwd: Re: About Tags, a proposed data-model]

Alan has asked how one gets a whole db,all users query  based on an 
indeterminate number of tags, (well actually he suggested two, but we 
don't like the number two).

I have developed a query against our model, refined it, proposed a 
de-normalisation to support retreiving tags as entry attributes, and 
then shown a way to do the can I add another tag to my query filter bit. 
I briefly reply to John Hoffman, but not to Dave Johnson, I need to 
think more about what he says.

I think the obvious SQL is something like

select entry.url, entry.comment
from entry.id in ( select taga.entryid
             from entry2tags taga, entry2tagsb tabb
             where taga.id = tagb.id
             and     taga.normalname = normalise(@1)
             and     tagb.normalname = normalise(@2 ))

So, we have a join, a sub query and a filtered cartesian cross product 
on our biggest table. Not good.

John has raised the issue of database caching. It is my view that one 
gets magnitudes of perfromance gain from query tuning, then database 
design, then platfrom tuning. Cache size and colouring are platform 
tuning issues. IMO I hope everyone sees from the above  that this query 
is a problem. The internet generation won't wait for this and its 
probably the reason that del.icio.us don't offer the incremental tag 
query., which is where I started my interest in this problem from. Also 
we need to understand the database optimisers' query plan design logic, 
and if there is a common hint semantic; we use more than one.

On a side note in Szekely, Torres, Ranking paper 
<http://torrez.us/archives/2005/07/13/tagrank.pdf>  they have some 
statitics about occurrence and frequency.

Anyone want to suggest the  author:entry:entry2tag ratios and growth 
rates, please?

Anyway, lets flatten the above query.

select  tagA.entryid,
            entry.entryurl,
            entry.entrycomment
from    entry2tags tagA,
            entry2tags tagB,
            entries entry
where  tagA.normalname = normalise(@1)
and      tagB.normalname = normalise(@2)
and      entry.id = tagA.entryid
and      tagA.id = tabB.id

I think thats it , all corrections welcome.

The ideal entry point is the least frequent normal name occurrence.  
Again, the Szekely,Torres relevance curve might help. Can/should we 
store this?. Can we store the the  upper decile tags to develop query 
hints i.e. if one of the tags is on the top 10% use the other as the 
query entry point.

In these cases, I have not retreived the tags/entry to create the lovely 
displays we've come to know and love. It looks very expensive to re 
query the database; I think SQL mandates the joining of our entry.id 
list with the tags table again and reporting the reply as a table. We no 
longer have a table of entries, but a table of tags, which we'd need to 
pivot. (I'm not sure how to do this in SQL, but I'll have a think). We 
want (IMO) to treat the tags as a list at this moment in time, so I 
suggest that we denormalise the entry tabel with a tags column.

amend table entrys
add tag_summary    varchar(512 ) /* This might be a bit small and not 
very 32 bit may
                                                                need a 
text column */

tag summary is created via a trigger when the first tag is applied, and 
amended when tags are added or deleted to the entry in the entry2tags 
table, i.e. post insert & delete triggers. Alternativly we need to 
create a database procedure to manage changes to the entry to tags 
table, if the free databases don''t have triggers.

So far we have avoided dynamic SQL. Hooray! 

Let's consider the case where someone wants to go in on {all}  cases 
using a tag and refine the query, into a temporary table

insert into temp #querycache
select  tagA.entryid,
            entry.entryurl,
            entry.entrycomment
            entry.taglist
from    entry2tags
            entries entry
where  entry2tags .normalname = normalise(@1)
and      entry.id = entry2tags .entryid

we could then delete from the cache as new required tags were specified.

delete #querycache
where position (entry.taglist, @2) = 0

again, I think this works. We'd need to be carefull about using the 
normal and given tag names. How to maintin the table name available to 
its owner I'm not sure, well I can use a SQL monitor but  it all depends 
on what you're doing.

I think this moves towards answering Allen's questions, I think this is 
the hardest, but any more?

Allen Gilliland wrote:

>I don't want to lose this thread because I think there are still some ideas to continue flushing out.  More comments inline ...
>
>On Fri, 2006-01-13 at 05:14, David Levy wrote:
>  
>
>>Sorry to have taken so long.
>>
>>The denormalisations of the author_name (which may be owner name) and 
>>entrydate are to support queries.  This is because I expect a macro to 
>>create a tag cloud for a user so that the html versions can have the tag 
>>cloud,
>>
>>So
>>
>>select normal_tag_name, count(*)
>>from entry2tags
>>where author_name = "DaveLevy"
>>group by normal_tag_name
>>
>>gives us the data required for a tag cloud, for a single blog . No join 
>>as you can see, where it gets fun is if you want a hot tags cloud
>>
>>we add a line so the query becomes
>>
>>select normal_tag_name, count(*)
>>from entry2tags
>>where author_name = "DaveLevy"
>>and     entry_date > @sevedaysago
>>group by normal_tag_name
>>    
>>
>
>very cool stuff ... i like the looks of that.
>
>... lots of stuff chopped out here ...
>
>  
>
>>>>is there a reason to copy down the entry_date rather than access it via a join on the weblogentry table?  you can join with the weblogentry table using the entry_id column.  how were you planning to use the entry_date field?
>>>>   
>>>>
>>>>        
>>>>
>>see above,  I don't want to join to the entry table, and this goal also 
>>impacts my index design.  entry_date allows hot tags queries to be 
>>driven by the entry not the tagged date
>>    
>>
>
>ok.  I agree that joins are a likely performance problem, but my next question then becomes ... How do we plan to deal with getting the data for the list of entries marked with a given tag?  I am expecting that when someone uses the tag dashboard or a tag cloud to try and view a list of entries with the tag "foo", that list will look something like the Roller front page.  example ...
>
>url = /roller/tag/entries/java+netbeans
>
>you then populate a page with 50? 25? entry summaries for people to browse through and those summaries will at least require the entry title and a summary of the entry content and may also require the entry date, category, author, and weblog title.  I would think we are going to require a join to get that data.
>
>  
>
>>>>what acts as the primary key?  (author_name, entry_id, user_tag_name)?  
>>>>
>>>>        
>>>>
>>my PK is author_name, entry_id and normal_tag_name
>>
>>    
>>
>>>>we may still need to use a surrogate key to uniquely identify the row to avoid having a multi column primary key. 
>>>>
>>>>        
>>>>
>>yeah, looks like it
>>
>>    
>>
>>>>i'm not sure that would be much of an issue though because it doesn't look like you are planning for any joins for the tag names, correct?
>>>>
>>>>   
>>>>
>>>>        
>>>>
>>I'm trying to avoid any joins, but if you are looking for entries on a 
>>blog and tagged, then it would be good to enter the query on 
>>author_name, but since we have not copied the title down to the 
>>entry2tags table we still need the join and can go in on author_name on 
>>either table, but best do it on entry table (see below).
>>
>>select  entry.title
>>from    entry, entry2tags e2t
>>where e2t.entry_id = entry.id
>>and     entry.author_name = @KnownName
>>and     e2t.normal_tag_name in (@TagQueryList)
>>    
>>
>
>based on my example above, how would we get the necessary entry data when we don't know the author name because we are searching through the entire tag system, not just from a single author or weblog?
>
>-- Allen
>
>
>  
>
>>>Right, Allen. This is very similar to what we already have. I'm not sure
>>>why having author name here, when we already have that through a join
>>>with the entry_id.
>>>
>>>      
>>>
>>That's right, but I don't want to join, the entry2tags table is big 
>>enough without joining.
>>
>>    
>>
>>>I'm not sure why do we need entry_date when we have
>>>the tagging date. 
>>>
>>>      
>>>
>>I think the queries should be driven through the entry publication date.
>>
>>    
>>
>>>I do like the normal tag name. I was thinking this too for the output of the Porter Stemming algorithm so I wouldn't lose the original information entered by the user. Everything else is the same :-).
>>> 
>>>
>>>      
>>>
>>I need to read your references to understand this, but I think you agree 
>>this is OK
>>
>>    
>>
>>>Another point I want to make is the fact, that we do a little bit of
>>>extra processing when saving an existing entry that make sure it keeps
>>>the original date for each tag. For example, when I first created an
>>>entry I tagged it A,B,C. The first time I edited, I removed B. The A and
>>>C tags will retain the dates when they were added as opposed to the edit
>>>date.
>>> 
>>>
>>>      
>>>
>>Are you holding entry_tag data on the entry table?
>>
>>    
>>
>>>Additionally, we have a question of what to do with spaces. Should tags
>>>be multi-word or not? My suggestion to Phay (one of our developers) was
>>>to use spaces as separators in the input field, therefore not supporting
>>>spaces. But we could do multiple things to support spaces, such as
>>>quoting multi-word tags. I believe Flickr supports multi-words but they
>>>remove the spaces from the tags, but technorati does maintain spaces. I
>>>don't like them myself, because I think it fragments the tag space much
>>>more than single words and you could still use intersections to get the
>>>sort of the same result.
>>>
>>>
>>> 
>>>
>>>      
>>>
>>>>>create constraint defined_tags_name as
>>>>>   normal_tag_name=proper(user_tag_name)
>>>>>
>>>>>create constraint defined_tags_entry_date as
>>>>>   entry_date = select entry_date from entry
>>>>>                          where definedtags.entry_name = entry.entry_name
>>>>>
>>>>>and the following indexes
>>>>>
>>>>>create dt.tags on definedtags
>>>>>   as author_name, entry_name, normal_tag_name unique
>>>>>     
>>>>>
>>>>>          
>>>>>
>>This is the real primary key
>>
>>    
>>
>>>>>create dt.tags2 on definedtags
>>>>>   as normal_tag_name
>>>>>     
>>>>>
>>>>>          
>>>>>
>>This is the tag entity (or operational master)
>>
>>    
>>
>>>>>create dt.entries on definedtags
>>>>>   as author_name, entry_name
>>>>>
>>>>>create dt.date_written
>>>>>   on definedtags as entry_date.
>>>>>
>>>>>I have written this in a hurry so it may not be though out as well as
>>>>>some of my writing, but hopefully this is collaborative development.
>>>>>Also I have difficulty in commenting on and reading some of the
>>>>>front-end & java orientated stuff. (I have ordered a couple of books to
>>>>>help me catchup). Hopefully this is helpfull
>>>>>
>>>>>     
>>>>>
>>>>>          
>>>>>
>>>>I think this all makes sense to me so far and I certainly appreciate the help.  I think getting the data model correct is a *very* important issue before we move forward with implemenatation, so I'm glad we are having this discussion.
>>>>
>>>>-- Allen
>>>>   
>>>>
>>>>        
>>>>
>>>[1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
>>>[2] http://torrez.us/archives/2005/07/13/tagrank.pdf
>>>
>>> 
>>>
>>>      
>>>
>>>>>Elias Torres wrote:
>>>>>
>>>>>     
>>>>>
>>>>>          
>>>>>
>>>>>>Welcome David to the Roller list.
>>>>>>
>>>>>>Thank you for your post. I have read your blog post on a tag data
>>>>>>model for Roller. I'm looking forward to your relational algebra and
>>>>>>query cost analysis. I wanted to tell you that we (IBM) have already
>>>>>>added basic tagging support to Roller and it actually supports a
>>>>>>TagCloud. I am supposed to put a proposal in the roller wiki so others
>>>>>>could comment and once I do that, you could put your comments there as
>>>>>>well.
>>>>>>
>>>>>>Just to kickstart the conversation I'm including the tagging table we
>>>>>>are currently using.
>>>>>>
>>>>>>create table weblogentrytag (
>>>>>>  id              varchar(48)   not null primary key,
>>>>>>  entryid        varchar(48)   not null,
>>>>>>  name            varchar(255)  not null,
>>>>>>  tagtime         timestamp     not null
>>>>>>);
>>>>>>
>>>>>>We have basically two tables: entries and entry2tags, but are missing
>>>>>>a tag table. At first, I was very set on having a tag table and use a
>>>>>>foreign key to "save" space on repetitive tag names. But I was shown
>>>>>>it's not really a big space saving technique, especially since tag
>>>>>>names are relatively short storing a guid or int would almost be
>>>>>>comparable in space. There are also increased costs in inserting and
>>>>>>joining on tables to get tag names if using a foreign key, so we have
>>>>>>settle on this for now until we have other queries requirements. I'll
>>>>>>be summarizing all of our changes to roller to support tagging in a
>>>>>>wiki proposal soon.
>>>>>>
>>>>>>Regarding the use of the list, some people have been using nabble.com
>>>>>>to interact with it. Maybe you can give it a try. I simply use gmail.
>>>>>>
>>>>>>http://www.nabble.com/Roller-f12275.html
>>>>>>
>>>>>>Regards,
>>>>>>
>>>>>>Elias
>>>>>>
>>>>>>On 1/4/06, David Levy <Da...@sun.com> wrote:
>>>>>>
>>>>>>
>>>>>>       
>>>>>>
>>>>>>            
>>>>>>
>>>>>>>I have documented a data model for tags. This is held at my blog
>>>>>>>
>>>>>>>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
>>>>>>>
>>>>>>>I have a graphic demonstrating the relationship between authors,
>>>>>>>articles and tags and illustrating the first and obvious indexes. (I
>>>>>>>have identified that both "Date Published" and tag aggregates are
>>>>>>>missing from the model).  Since the model was built to help me
>>>>>>>understand del.icio.us, I call the entities Users, Bookmarks and Tags,
>>>>>>>but hopefully its simple to see that these are pretty synonomous to
>>>>>>>authors, articles and tags.
>>>>>>>
>>>>>>>I hope that this is useful for those looking at implementing tags.
>>>>>>>
>>>>>>>I am still working out how to use the mail-list, so I hope that x-refing you to my blog isn't deprecated. I also need to work how to maintain thread connections i.e. undertake a reply.
>>>>>>>--
>>>>>>>
>>>>>>>Dave
>>>>>>>         
>>>>>>>
>>>>>>>              
>>>>>>>
>>>>>--
>>>>>
>>>>>Dave
>>>>>
>>>>>     
>>>>>
>>>>>          
>>>>>
>>>>   
>>>>
>>>>        
>>>>
>>-- 
>>
>>Dave
>>
>><http://www.sun.com> 	* David Levy *
>>
>>
>>
>>
>>
>>
>>Dave Levy's sig deleted
>>
>>    
>>
>
>  
>
-- 

Dave

<http://www.sun.com> 	* David Levy *
*Sun Microsystems Ltd.*
55, King William St.,
London EC4R 9ND United Kingdom

Phone +44 (0) 20 7469 9908/x18308
Mobile +44 (0) 7710-360922
Blog http://blogs.sun.com/DaveLevy
Email David.Levy@Sun.COM

Sun Proprietary & Confidential . This e-mail message is for the sole use 
of the intended recipient(s) and may contain confidential and 
privilidged information. Any unauthorised review, use, disclosure or 
distribution is prohibited. If you are not the intended recepient, 
please contact the sender by reply e-mail and destroy all copies of the 
original message.