Optimizing T-SQL Queries

I am working on some updates to the internal web analytics system we provide to our clients (in the absence of a preferred provider or Google Analytics) and I am working on the following request:

select 
    path as EntryPage, 
    count(Path) as [Count] 
from 
    (
        /* Sub-query 1 */
        select 
            pv2.path
        from 
            pageviews pv2 
                inner join
                    (
                        /* Sub-query 2 */
                        select
                            pv1.sessionid,
                            min(pv1.created) as created
                        from
                            pageviews pv1 
                                inner join Sessions s1 on pv1.SessionID = s1.SessionID
                                inner join Visitors v1 on s1.VisitorID = v1.VisitorID
                        where
                            pv1.Domain = isnull(@Domain, pv1.Domain) and
                            v1.Campaign = @Campaign
                        group by
                            pv1.sessionid
                    ) t1 on pv2.sessionid = t1.sessionid and pv2.created = t1.created
    ) t2
group by 
    Path;

      

I tested this query with 2 million rows in the PageViews table and it takes about 20 seconds to complete. I go through a cluster index scan in the execution plan twice, both times it hits the PageViews table. The New column in this table has a clustered index.

The problem is that in both cases it repeats across all 2 million rows, which is a performance bottleneck in my opinion. Is there something I can do to prevent this, or am I pretty much maximized in terms of optimization?

For reference, the purpose of the query is to find the first page view for each session.

EDIT: After a lot of frustration, despite the help I got here, I was unable to get this query to work. So I decided to just save the link to the login page (and now log out) in the sessions table, which allows me to do the following:

select
    pv.Path,
    count(*)
from
    PageViews pv
        inner join Sessions s on pv.SessionID = s.SessionID
            and pv.PageViewID = s.ExitPage
        inner join Visitors v on s.VisitorID = v.VisitorID
where
    (
        @Domain is null or 
        pv.Domain = @Domain
    ) and
    v.Campaign = @Campaign
group by pv.Path;

      

This request runs in 3 seconds or less. Now I need to either refresh the login / logout page in real time when the page records are written (optimal solution), or run a batch refresh periodically. Anyway, this solves the problem, but not in the way I intended.

Edit Edit: adding the missing index (after clearing from last night) reduced the query to a few milliseconds). Woo hoo!

+1


source to share


6 answers


To start

    where pv1.Domain = isnull(@Domain, pv1.Domain) 

      



there will be no SARG. As I recall, you cannot optimize function match.

+2


source


Continue with doofledorf.

Try the following:

where
   (@Domain is null or pv1.Domain = @Domain) and
   v1.Campaign = @Campaign

      



Ok I have some suggestions

  • Create this private index:

     create index idx2 on [PageViews]([SessionID], Domain, Created, Path)
    
          

  • If you can modify the sessions table to store the login page for example. EntryPageViewID, you should be able to optimize that pretty much.

+1


source


Your inner query (pv1) will need a non-clustered index (Domain).

The second query (pv2) can already find the rows it needs due to the clustered index in Created, but pv1 can return so many rows that SQL Server decides the table scan is faster than all the locks it needs to take. As pv1-groups in SessionID (and therefore must be ordered by SessionID), non-clustered index in SessionID, The created and including path must allow MERGE join. If not, you can force the join with "SELECT .. FROM pageviews pv2 INNER MERGE JOIN ..."

The two indexes listed above would be:

CREATE CONTINUOUS INDEX ncixcampaigndomain ON PageViews (Domain)

CREATE NONCLUSTERED INDEX ncixsessionidcreated ON PageViews (SessionID, Created) INCLUDE (path)

+1


source


I returned. To answer your first question, you could just do the union on the two conditions, since they clearly do not overlap.

What you are really trying to cover is how the case when you provide a domain and where you don't. You need two queries. They can be optimized in a completely different way.

+1


source


What is the nature of the data in these tables? Do you find that most data gets inserted / deleted regularly?

Is this a complete schema for tables? A different index is displayed in the query plan. Edit: Sorry, just read the last line of text. I would suggest that if tables are regularly flushed / inserted, you might consider dropping the clustered index and using the tables as a bunch of tables .. just a thought

Definitely should be putting non-clustered indexes in Campaign, Domain as John suggested

+1


source


SELECT  
    sessionid,  
    MIN(created) AS created  
FROM  
    pageviews pv  
JOIN  
    visitors v ON pv.visitorid = v.visitorid  
WHERE  
    v.campaign = @Campaign  
GROUP BY  
    sessionid  

      

to give you sessions for the campaign. Now let's see what you do with this.

OK, this will save you from your grouping:

SELECT  
    campaignid,  
    sessionid,   
    pv.path  
FROM  
    pageviews pv  
JOIN  
    visitors v ON pv.visitorid = v.visitorid  
WHERE  
    v.campaign = @Campaign  
    AND NOT EXISTS (  
        SELECT 1 FROM pageviews  
        WHERE sessionid = pv.sessionid  
        AND created < pv.created  
    )  

      

+1


source







All Articles