Sum date between two date and group by month
Mục Lục
TL;DR
I added some sample data over a previous year (see fiddle) and included a period which straddled New Year’s Eve. The approach I’ve used assumes the existence of a calendar table – ways of doing this using recursive CTE’s are discussed below. By adding the WHERE
clauses supplied (commented out), one can either restrict the date range and/or whether there was any stay in a given month.
SELECT
DATEPART(mm, ad.tdate) AS "Month",
DATENAME(MONTH, ad.tdate) AS "Name", -- Not strictly necessary
DATEPART(yy, ad.tdate) AS "YEAR",
SUM
(
CASE
WHEN s.start_date IS NOT NULL THEN 1
ELSE 0
END
) AS "Stays"
FROM all_dates ad
LEFT JOIN stay s
ON ad.tdate >= s.start_date AND ad.tdate <= s.end_date
-- WHERE ad.tdate >= '2021-03-01'
-- AND ad.tdate <= '2021-06-30' -- Restrict dates of stays
-- WHERE s.start_date IS NOT NULL -- Show only months with at least 1 stay
GROUP BY DATEPART(yy, ad.tdate), DATEPART(mm, ad.tdate), DATENAME(MONTH, ad.tdate)
ORDER BY DATEPART(yy, ad.tdate), DATEPART(mm, ad.tdate)
OPTION (MAXRECURSION 0);
Result (see extra data in fiddle – this result appears to be consistent):
Month Name YEAR Stays
1 January 2020 0
2 February 2020 0
3 March 2020 0
4 April 2020 0
5 May 2020 12
6 June 2020 20
7 July 2020 12
8 August 2020 20
9 September 2020 0
10 October 2020 6
11 November 2020 11
12 December 2020 17
1 January 2021 2
2 February 2021 2
3 March 2021 2
4 April 2021 11
5 May 2021 48
6 June 2021 0
7 July 2021 0
8 August 2021 0
9 September 2021 2
10 October 2021 0
11 November 2021 11
12 December 2021 5
More detailed discussion:
There are two or three alternative methods to the accepted answer which might work. This is a classic case of where a calendar table is required. There are two ways of doing this – the first is by using “classic” SQL – using a function – and also by using the more modern CTE
(Common Table Expression) approach, in particular a recursive CTE (see links in the reference above).
The first gem I came across was this article by Jon Tavernier on how to create a (materialised) calendar table in SQL Server – I pared it back to the minimum required by this problem, although this sort of table has so many potential uses, you may wish to consider doing it for period 10-20 years (only ~ 3.5K – 7K records) and with far more fields (e.g. 1st day of month, end of accounting period…).
Just one small point – your table definition seems to be simplified, but I would suggest that you make it something more like this – increasing the amount of information, the server has makes it easier for it to come up with a good plan:
CREATE TABLE stay
(
name VARCHAR(255) NOT NULL,
start_date DATE NOT NULL,
end_date DATE NOT NULL,
CONSTRAINT n_sd_ed_pk PRIMARY KEY (name, start_date, end_date),
CONSTRAINT n_sd_uq UNIQUE (name, start_date),
CONSTRAINT n_ed_uq UNIQUE (name, end_date),
CONSTRAINT sd_lt_ed_ck CHECK (start_date < end_date),
);
You might also want to ensure that there is no overlap between a given name
and their start_date
and end_date
– but I don’t know how to do this in SQL Server short of using a trigger – my UNIQUE constraints are an attempt at that through SQL – but they won’t prevent nested intervals…
Here’s an article on how to do it using a trigger.
First approach (calendar table – Tavernier):
The code for approach 1 is available on the fiddle here.
DECLARE @start_dt AS DATE = '2021-01-01'; -- Date from which the calendar table will be created (inclusive).
DECLARE @end_dt AS DATE = '2022/01/01'; -- Calendar table will be created up to this date (not inclusive).
CREATE TABLE all_dates
(
tdate DATE PRIMARY KEY
);
WHILE @start_dt < @end_dt
BEGIN
INSERT INTO all_dates
(
tdate -- can generate many more useful values - see article by Jon Tavernier
)
VALUES
(
@start_dt
)
SET @start_dt = DATEADD(DAY, 1, @start_dt)
END;
and then I check by:
SELECT * FROM all_dates;
Result (snipped for brevity):
tdate
2021-01-01
2021-01-02
2021-01-03
2021-01-04
2021-01-05
Next, we set up our test data as per the OP’s question (I called the table stay
):
name start_date end_date
Joe 2021-04-20 2021-05-20
John 2021-05-01 2021-05-28
Then, I ran this query:
SELECT
DATEPART(mm, ad.tdate) AS "Month",
DATENAME(MONTH, ad.tdate) AS "M Name",
COUNT(MONTH(ad.tdate)) AS "Count"
FROM stay s
JOIN all_dates ad
ON ad.tdate >= s.start_date AND ad.tdate <= s.end_date
GROUP BY DATEPART(mm, ad.tdate), DATENAME(MONTH, ad.tdate)
ORDER BY DATEPART(mm, ad.tdate);
Result:
Month Name Count
4 April 11
5 May 48
The month name wasn’t included in the spec, and can easily be dropped – just thought it was a nice to have.
I then turned on Statistics
as follows:
SET STATISTICS PROFILE ON;
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
and reran the query – results discussed below.
Second approach (calendar year table using pre-defined dates via a CTE):
The code for this approach is available on the fiddle here.
Sometimes consultants aren’t allowed to create physical tables, and I wanted the natural approach for this is to use a CTE approach and I stumbled across this next gem (Creating a date dimension or calendar table in SQL Server
) by Aaron Bertand (of this parish).
He begins his piece with these sage words:
One of the biggest objections I hear to calendar tables is that people
don’t want to create a table. I can’t stress enough how cheap a table
can be in terms of size and memory usage, especially as underlying
storage continues to be larger and faster, compared to using all kinds
of functions to determine date-related information in every single
query. Twenty or thirty years of dates stored in a table takes a few
MBs at most, even less with compression, and if you use them often
enough, they’ll always be in memory.
He goes on to say that:
This is a one-time population, so I’m not worried about speed, even
though this specific CTE approach is no slouch. I like to materialize
all of the columns to disk, rather than rely on computed columns,
since the table becomes read-only after initial population. So I’m
going to do a lot of those calculations during the initial series of
CTEs. To start, I’ll show the output of each CTE one at a time.
So, even the function in the first approach might be suitable – but, CTE
‘s (Common Table Expressions – also, see the link to RECURSIVE CTE’s within) were, as best as I can ascertain, introduced into SQL Server after Tavernier’s article above.
I adapted the code from Bertrand’s second CTE as follows:
WITH seq (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM seq
WHERE n < DATEDIFF(DAY, '2021-01-01', '2021-12-31')
),
all_dates (tdate) AS
(
SELECT DATEADD(DAY, n, '2021-01-01') FROM seq
)
SELECT tdate FROM all_dates
OPTION (MAXRECURSION 0); -- Important!!! Fails after 100 recursions if not present
Result (snipped for brevity):
tdate
2021-01-01 00:00:00.000
2021-01-02 00:00:00.000
2021-01-03 00:00:00.000
So, we’re golden as far as our calendar table is concerned…
Then, I just put that over my original query from approach 1 as follows:
WITH seq (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM seq
WHERE n < DATEDIFF(DAY, '2021-01-01', '2021-12-31')
),
all_dates (tdate) AS
(
SELECT DATEADD(DAY, n, '2021-01-01') FROM seq
)
SELECT
DATEPART(mm, ad.tdate) AS "Month",
DATENAME(MONTH, ad.tdate) AS "Name",
COUNT(MONTH(ad.tdate)) AS "Count"
FROM stay s
JOIN all_dates ad
ON ad.tdate >= s.start_date AND ad.tdate <= s.end_date
GROUP BY DATEPART(mm, ad.tdate), DATENAME(MONTH, ad.tdate)
ORDER BY DATEPART(mm, ad.tdate)
OPTION (MAXRECURSION 0);
Result (same):
Month Name Count
4 April 11
5 May 48
So, I did the same as for the first approach and turned on Statisics – discussed below.
3rd approach (restricted dates):
The fiddle is available here.
Here, in the interests of “efficiency”, I restricted the dates in the calendar table to the MIN()
of start_time and the MAX()
of end_time
in the stay
table as follows:
WITH stays (min_sd, max_ed, no_days) AS
(
SELECT
MIN(start_date) AS min_sd,
MAX(end_date) AS max_ed,
DATEDIFF(DAY, MIN(start_date), MAX(end_date)) AS no_days
FROM stay
),
seq(n, tdate) AS
(
SELECT 0, (SELECT min_sd FROM stays)
UNION ALL
SELECT n + 1, DATEADD(DAY, n + 1, (SELECT min_sd FROM stays))
FROM seq
WHERE n < (SELECT no_days FROM stays)
)
SELECT * FROM seq;
Result (snipped for brevity):
n tdate
0 2021-04-20
1 2021-04-21
2 2021-04-22
(39 rows...)
So, now we have every date between the earliest start_date
and the latest end_date
.
And using these dates as our calendar table, we formulate the query:
WITH stays (min_sd, max_ed, no_days) AS
(
SELECT
MIN(start_date) AS min_sd,
MAX(end_date) AS max_ed,
DATEDIFF(DAY, MIN(start_date), MAX(end_date)) AS no_days
FROM stay
),
all_dates(n, tdate) AS
(
SELECT 0, (SELECT min_sd FROM stays)
UNION ALL
SELECT n + 1, DATEADD(DAY, n + 1, (SELECT min_sd FROM stays))
FROM all_dates
WHERE n < (SELECT no_days FROM stays)
)
SELECT
DATEPART(mm, ad.tdate) AS "Month",
DATENAME(MONTH, ad.tdate) AS "Name",
COUNT(MONTH(ad.tdate)) AS "Count"
FROM stay s
JOIN all_dates ad
ON ad.tdate >= s.start_date AND ad.tdate <= s.end_date
GROUP BY DATEPART(mm, ad.tdate), DATENAME(MONTH, ad.tdate)
ORDER BY DATEPART(mm, ad.tdate)
OPTION (MAXRECURSION 0);
Result (same):
Month Name Count
4 April 11
5 May 48
I have some performance stats at the end of the fiddles. It is, of course, very difficult to draw many conclusions from performance analyses of (relatively) tiny amounts of data on a server over which one has no control – I would urge you to test with your own data and under your own load…