SQL Server: Combining Results into Too Many Rows
I have two SQL Server tables.
The first table is named Content
and contains, among other things, columns called
FileID, FileHighResolutionID, FileHighResolutionProID, FileVectorID
The second table is named Analytics
and contains, among other things, a column named FileID
. This column contains a value from one of the four aforementioned columns in Content
.
Doing the following ...
SELECT
*
FROM
Analytics a
WHERE
a.Created BETWEEN '2017-03-07' AND '2017-03-08'
results in 782 lines.
But doing the following ...
SELECT
*
FROM
Analytics a
INNER JOIN
Content c ON (c.FileID = a.FileID OR c.FileHighResolutionID = a.FileID OR c.FileHighResolutionProID = a.FileID OR c.FileVectorID = a.FileID)
WHERE
a.Created BETWEEN '2017-03-07' AND '2017-03-08'
results in 843 lines.
I know I have something wrong with my JOIN because now I have too many entries. I tried INNER JOINs, LEFT OUTER JOINS, RIGHT OUTER JOINs, but each one results in 61 mysterious extra entries.
Can some SQL expert look over and tell what I am doing wrong?
source to share
You have to understand that the problem is what OR
matches multiple columns in c
. You get a separate line for each match. Voila! Unexpected lines.
One way to eliminate this is using a "lateral connection". This is similar to a correlated subquery, but it can return more than one column and more than one row (not required here). SQL Server does this with APPLY
:
SELECT a.*, c.*
FROM Analytics a CROSS APPLY
(SELECT TOP 1 c.*
FROM Content c
WHERE a.FileID IN (c.FileId, c.FileHighResolutionID, c.FileHighResolutionProID, c.FileVectorID)
) c
WHERE a.Created BETWEEN '2017-03-07' AND '2017-03-08';
Returns an arbitrary matching string. You can get a specific string with ORDER BY
:
SELECT a.*, c.*
FROM Analytics a CROSS APPLY
(SELECT TOP 1 c.*
FROM Content c
WHERE a.FileID IN (c.FileId, c.FileHighResolutionID, c.FileHighResolutionProID, c.FileVectorID)
ORDER BY (CASE a.FileId WHEN c.FileId THEN 1 c.FileHighResolutionID THEN 2 c.FileHighResolutionProID THEN 3 c.FileVectorID THEN 4
END)
) c
WHERE a.Created BETWEEN '2017-03-07' AND '2017-03-08';
Note. I agree with the answer that asks the question of use BETWEEN
with dates / times. This is dangerous because the times cause faulty logic. I highly recommend one of the following:
WHERE a.Created = '2017-03-07'
WHERE a.Created >= '2017-03-07' AND a.Created < '2017-03-09';
source to share
If you don't need data from a table Content
, you can go for EXISTS
:
SELECT *
FROM Analytics AS A
WHERE A.Created >= '2017-03-07'
AND A.Created < '2017-03-08'
AND EXISTS (
SELECT *
FROM Content AS C
WHERE A.FileID IN (C.FileID, C.FileHighResolutionID, C.FileHighResolutionProID, C.FileVectorID)
);
EXISTS
will give either true
/ false
in the WHERE
condition, and will not create duplicates.
source to share
Another bad practice is to use BETWEEN in the WHERE clause when filtering by dates
WHERE a.Created >= '20170307' AND a.Created < '20170308'
You have more rows in your second query because you are joining multiple columns and that might be the reason you are getting more records. In the first query, you are not joining another table.
source to share