SQL Server: Combining Results into Too Many Rows

I have two SQL Server tables.

The first table is named Content

and contains, among other things, columns called

FileID, FileHighResolutionID, FileHighResolutionProID, FileVectorID

      

The second table is named Analytics

and contains, among other things, a column named FileID

. This column contains a value from one of the four aforementioned columns in Content

.

Doing the following ...

SELECT 
    * 
FROM 
    Analytics a
WHERE 
    a.Created BETWEEN '2017-03-07' AND '2017-03-08'

      

results in 782 lines.

But doing the following ...

SELECT 
    * 
FROM 
    Analytics a
INNER JOIN 
    Content c ON (c.FileID = a.FileID OR c.FileHighResolutionID = a.FileID OR c.FileHighResolutionProID = a.FileID OR c.FileVectorID = a.FileID)
WHERE 
    a.Created BETWEEN '2017-03-07' AND '2017-03-08'

      

results in 843 lines.

I know I have something wrong with my JOIN because now I have too many entries. I tried INNER JOINs, LEFT OUTER JOINS, RIGHT OUTER JOINs, but each one results in 61 mysterious extra entries.

Can some SQL expert look over and tell what I am doing wrong?

+3


source to share


3 answers


You have to understand that the problem is what OR

matches multiple columns in c

. You get a separate line for each match. Voila! Unexpected lines.

One way to eliminate this is using a "lateral connection". This is similar to a correlated subquery, but it can return more than one column and more than one row (not required here). SQL Server does this with APPLY

:

SELECT a.*, c.*
FROM Analytics a CROSS APPLY
     (SELECT TOP 1 c.*
      FROM Content c 
      WHERE a.FileID IN (c.FileId, c.FileHighResolutionID, c.FileHighResolutionProID, c.FileVectorID)
    ) c
WHERE a.Created BETWEEN '2017-03-07' AND '2017-03-08';

      

Returns an arbitrary matching string. You can get a specific string with ORDER BY

:



SELECT a.*, c.*
FROM Analytics a CROSS APPLY
     (SELECT TOP 1 c.*
      FROM Content c 
      WHERE a.FileID IN (c.FileId, c.FileHighResolutionID, c.FileHighResolutionProID, c.FileVectorID)
      ORDER BY (CASE a.FileId WHEN c.FileId THEN 1 c.FileHighResolutionID THEN 2 c.FileHighResolutionProID THEN 3 c.FileVectorID THEN 4
                END)
    ) c
WHERE a.Created BETWEEN '2017-03-07' AND '2017-03-08';

      

Note. I agree with the answer that asks the question of use BETWEEN

with dates / times. This is dangerous because the times cause faulty logic. I highly recommend one of the following:

WHERE a.Created = '2017-03-07'
WHERE a.Created >= '2017-03-07' AND a.Created < '2017-03-09';

      

+2


source


If you don't need data from a table Content

, you can go for EXISTS

:

SELECT *
FROM Analytics AS A
WHERE A.Created >= '2017-03-07'
    AND A.Created < '2017-03-08'
    AND EXISTS (
        SELECT *
        FROM Content AS C
        WHERE A.FileID IN (C.FileID, C.FileHighResolutionID, C.FileHighResolutionProID, C.FileVectorID)
        );

      



EXISTS

will give either true

/ false

in the WHERE

condition, and will not create duplicates.

+2


source


Another bad practice is to use BETWEEN in the WHERE clause when filtering by dates

WHERE a.Created >= '20170307' AND a.Created < '20170308'

      

You have more rows in your second query because you are joining multiple columns and that might be the reason you are getting more records. In the first query, you are not joining another table.

0


source







All Articles