Friday, May 25, 2012

Hierarchy Cache



In the past I tried to avoid using hierarchies in SQL, but I couldn't find another simple data structure for a system requirement: all elements must be children of other elements with no depth limit.


I was able to change the depth limit to 100.


So I started out with:


CREATE TABLE [dbo].[Hierarchy](
    [ElementId] [int] IDENTITY(1,1) NOT NULL,
    [ParentId] [int] NULL,
 CONSTRAINT [PK_Hierarchy] PRIMARY KEY CLUSTERED 
(
    [ElementId] ASC
)


at first everything worked fine, I could traverse the tree with CTE without any problem, it was even fast for the current load, then came another system requirement, the tree size will be about 1 million records.


with cte as
(
 select ElementId, ParentId, 0 as Level
 from Hierarchy
 where ParentId is null
 
 union all
 
 select Hierarchy.ElementId, Hierarchy.ParentId, Level + 1
 from Hierarchy
  join cte on cte.ElementId = Hierarchy.ParentId
)

select * from cte


trying to traverse a tree of 1M records with CTE with only 15 levels took more than a production sql should take, even if the system's requirement should be live or near live.


We've been looking for solutions in every corner of the internet, graph databases (we'll still need to use these IDs in the SQL so passing the data back will be a design nightmare or we can move the portions or whole application, I guess these are other options), trying to find all sorts of algorithms (which will be error prune and might be a heavy toll on the development team), CLR functions and of curse scrapping the whole hierarchy idea (no can do, system requirements).


Eventually we came to the conclusion that keeping a cache will not be too costly on the memory, cpu and disk provided that the right indexes will be created and they will provide the best result and the fun part is, its very easy and fast to join the cache table on any other table we'll need. 


But we had to compromise somewhere, we compromised on node parent changes, some of them might cause the entire tree to rebuild.


I built a method to quickly update the tree given an elementid and if elementid was not provided, it will rebuild the whole tree.


I can smell just one little problem, which will be solved when updating records, need to detect circular references.


CREATE TABLE [dbo].[HierarchyCache](
 [OwnerId] [int] NOT NULL,
 [ElementId] [int] NOT NULL
)

And the SQL code updating the HierarchyCache, note that I'm using merge instead of dropping the whole table or writing individual insert/delete.

declare @ElementId int = null


--if no ElementId was supplied, it means we should go over all Hierarchy and rebuild the table
if (@ElementId is null)
begin


    with parentHierarchy (ElementId, ParentId , TopElementId)
    as
    (
        select  ElementId, ParentId, ElementId as TopElementId
        from Hierarchy with (nolock)

        union all

        select  [Hierarchy].ElementId,[Hierarchy].ParentId , pu.TopElementId
        from Hierarchy with (nolock)
        inner join parentHierarchy as pu
            on pu.ParentId = [Hierarchy].ElementId
    )

    --instead of updating the entire table, we're doing a merge, more efficient and easier on the disk
    merge into HierarchyCache as target
    using parentHierarchy
        on (parentHierarchy.TopElementId = target.OwnerId and parentHierarchy.ElementId = target.ElementId)
    when not matched by target then
        insert (ElementId, OwnerId) values (parentHierarchy.ElementId,parentHierarchy.TopElementId)
    when not matched by source then
        delete;

end
else
begin
    --a ElementId was supplied, update only its tree

    with parentHierarchy (ElementId, ParentId  , TopElementId)
    as
    (
        select  ElementId, ParentId , ElementId as TopElementId
        from Hierarchy with (nolock)
        where ElementId = @ElementId

        union all

        select  [Hierarchy].ElementId,(case when (Hierarchy.ParentId = Hierarchy.ElementId) then null else Hierarchy.ParentId end) as ParentId , pu.TopElementId
        from Hierarchy  with (nolock)
        inner join parentHierarchy as pu
            on pu.ParentId = [Hierarchy].ElementId

    )

    --instead of updating the entire table, we're doing a merge, more efficient and easier on the disk
    merge into HierarchyCache as target
    using parentHierarchy
        on (parentHierarchy.TopElementId = target.OwnerId and parentHierarchy.ElementId = target.ElementId)
    when not matched by target then
        insert (ElementId, OwnerId) values (parentHierarchy.ElementId,parentHierarchy.TopElementId);


end



No comments:

Post a Comment