Monday, June 11, 2012

SQL Query Engine Without Dynamic Queries

There are many ways to implement query engines, dynamic SQL, stored procedures with finite amount of IFs, dynamically creating stored procedures for each query  and a solution I'm writing here.

(I'm sure there are other solutions, I'll be glad to hear about them...)

Dynamic SQL is probably the fastest if you need to execute a one time query, but if multiple actions execute different queries the sql query engine will be very busy with building query plans and executing them, if you need more performance, sometimes the amount of indexes you'll need to create is significant and then the updates and deletes become very slow.

The problem with stored procedure is either your one stored procedures is going to be a maintenance nightmare or if you're creating stored procedures dynamically, its going to be a nightmare to write the code that maintains these stored procedures.

And then again, there's the indexes issue which must be addressed.

I've figured out another way of creating dynamic queries, its not very fast since its using cross joins, but its very quick to set up, so you can get started with low volume tables (< 100k records).

So lets start with our data structure:

1. The Entities table


CREATE TABLE Entities (
  EntityId int  IDENTITY(1,1) NOT NULL,
  Name varchar(255) default NULL,
  Phone varchar(100) default NULL,
  Email varchar(255) default NULL,
  Address varchar(255) default NULL,
  City varchar(50) default NULL,
  Zip varchar(10) default NULL,
  State varchar(50) default NULL,
  Country varchar(50) default NULL,
  BirthDate varchar(50) default NULL
)


2. The Query table


CREATE TABLE Queries
(
    QueryId int NOT NULL IDENTITY (1, 1),
    Name varchar(100) NOT NULL
)


3. The Query Items table


create table QueryItems
(
    QueryItemId int not null identity (1,1),
    QueryId int,
    GroupId tinyint,
    AttributeCode tinyint,
    Value nvarchar(255)
)


The Entities table is pretty straight forward and so is the Queries table.

The QueryItems table has GroupId for queries with groups of items (each group is ANDed, each item inside a group is ORed), the AttributeCode is what this item is asking for, 0 for Country, 1 for Birthday Month and 2 for Email Contains queries, so all possible combinations of these attributes, groups and values will produce a correct answer.

And for the query engine part:


WITH cte AS 
(
   SELECT EntitiesRulesGroups.EntityId, EntitiesRulesGroups.QueryId, MatchingUserGroups.EntityId AS MatchingEntityId
   FROM (
    SELECT EntityId, QueryId, GroupId
    FROM Entities
    CROSS JOIN (
     SELECT DISTINCT QueryId, GroupId
     FROM QueryItems
     ) AS QueryItemsGroups
    ) EntitiesRulesGroups
   LEFT OUTER JOIN 
   (
    SELECT EntityId, QueryId, GroupId, Sum(MatchBit) AS MatchBit
    FROM 
    (
         SELECT EntityId, QueryItems.QueryId, QueryItems.GroupId, 
            (
       /*Country*/            (CASE WHEN ((AttributeCode = 0) AND (Entities.Country = QueryItems.Value))                      THEN 1 ELSE 0 END) +
       /*Birthday Month*/   (CASE WHEN ((AttributeCode = 1) AND (datepart(MONTH, Entities.BirthDate) = QueryItems.Value)) THEN 1 ELSE 0 END) +
       /*email contains*/   (CASE WHEN ((AttributeCode = 2) AND (Entities.email like '%' + QueryItems.Value + '%'))          THEN 1 ELSE 0 END) 
            ) AS MatchBit
            
         FROM Entities
     CROSS JOIN QueryItems
    ) as Matching
   WHERE (MatchBit > 0)
   GROUP BY EntityId, QueryId, GroupId
   ) AS MatchingUserGroups
    ON MatchingUserGroups.EntityId = EntitiesRulesGroups.EntityId AND MatchingUserGroups.QueryId = EntitiesRulesGroups.QueryId AND MatchingUserGroups.GroupId = EntitiesRulesGroups.GroupId
)


select EntityId, QueryId
from Entities
cross join Queries
where not exists
(
    SELECT *
    FROM cte
    WHERE cte.EntityId = Entities.EntityId AND cte.QueryId = Queries.QueryId AND MatchingEntityId IS NULL
)


So what's happening here?

First, we're writing a CTE that will cross check all entities against all query items, so we're creating a big table that tells us if an entity is valid for a specific query item.

In the actual query, we're generating a cross on all entities and all queries and then we're asking the CTE if there is not match in the list.

So why aren't we checking if there is a record instead of there is no nulls?

Well, if there is a record will only tell us if there is a match between a query item and an entity, but there is no way to detect if there are no matches for the ANDed groups except asking if there was a failure to compare a group and an entity.

Of curse its a slow query, its a cross join on multiple tables and there is no way for the query optimizer to do anything, the amount of comparison that is done its almost as comparing each item in code rather than in SQL, but the advantage of it is that everything is done inside the SQL without cursors so data access is actually faster.

There are ways to improve this engine, I'm showing here only the basics for better understanding:

1. Caching the results and only update the cache on changes to either a specific entity against all queries or for a specific query against all entities, then querying the cache is very fast.
If the cache ever gets out of sync, its relatively quick to do a rebuild of the results. You should read up on SQL Merge.
2. Building a pre-calculated temp tables for the queries and the entities so a cast or convert is done only once for an entity <> query item pair.
3. Use with (nolock) where applicable.

I've tested this method with 500k entities (and their own entities - up to 10 sub entities) and 1k queries, the performance was about 500ms for a query to all entities and entity to all queries and a total cache rebuild was about 5 minutes, your results may vary based on too many things to count.

No comments:

Post a Comment