Sum date between two date and group by month - Chia Sẻ Kiến Thức Điện Máy Việt Nam

Mục Lục

TL;DR

I added some sample data over a previous year (see fiddle) and included a period which straddled New Year’s Eve. The approach I’ve used assumes the existence of a calendar table – ways of doing this using recursive CTE’s are discussed below. By adding the WHERE clauses supplied (commented out), one can either restrict the date range and/or whether there was any stay in a given month.

SELECT
  DATEPART(mm, ad.tdate)    AS "Month",
  DATENAME(MONTH, ad.tdate) AS "Name",  -- Not strictly necessary
  DATEPART(yy, ad.tdate)    AS "YEAR",
  SUM
  (
    CASE
      WHEN s.start_date IS NOT NULL THEN 1
      ELSE 0
    END
  ) AS "Stays"
FROM all_dates ad
LEFT JOIN stay s
  ON ad.tdate >= s.start_date AND ad.tdate <= s.end_date
-- WHERE ad.tdate >= '2021-03-01' 
-- AND ad.tdate <= '2021-06-30'       -- Restrict dates of stays  
-- WHERE s.start_date IS NOT NULL     -- Show only months with at least 1 stay
GROUP BY DATEPART(yy, ad.tdate), DATEPART(mm, ad.tdate), DATENAME(MONTH, ad.tdate)
ORDER BY DATEPART(yy, ad.tdate), DATEPART(mm, ad.tdate)
OPTION (MAXRECURSION 0);

Result (see extra data in fiddle – this result appears to be consistent):

Month        Name   YEAR    Stays
    1     January   2020        0
    2    February   2020        0
    3       March   2020        0
    4       April   2020        0
    5         May   2020       12
    6        June   2020       20
    7        July   2020       12
    8      August   2020       20
    9   September   2020        0
   10     October   2020        6
   11    November   2020       11
   12    December   2020       17
    1     January   2021        2
    2    February   2021        2
    3       March   2021        2
    4       April   2021       11
    5         May   2021       48
    6        June   2021        0
    7        July   2021        0
    8      August   2021        0
    9   September   2021        2
   10     October   2021        0
   11    November   2021       11
   12    December   2021        5

More detailed discussion:

There are two or three alternative methods to the accepted answer which might work. This is a classic case of where a calendar table is required. There are two ways of doing this – the first is by using “classic” SQL – using a function – and also by using the more modern CTE (Common Table Expression) approach, in particular a recursive CTE (see links in the reference above).

The first gem I came across was this article by Jon Tavernier on how to create a (materialised) calendar table in SQL Server – I pared it back to the minimum required by this problem, although this sort of table has so many potential uses, you may wish to consider doing it for period 10-20 years (only ~ 3.5K – 7K records) and with far more fields (e.g. 1st day of month, end of accounting period…).

Just one small point – your table definition seems to be simplified, but I would suggest that you make it something more like this – increasing the amount of information, the server has makes it easier for it to come up with a good plan:

CREATE TABLE stay 
(
  name VARCHAR(255) NOT NULL, 
  start_date DATE NOT NULL, 
  end_date DATE   NOT NULL,
  
  CONSTRAINT n_sd_ed_pk PRIMARY KEY (name, start_date, end_date),
  
  CONSTRAINT n_sd_uq UNIQUE (name, start_date),
  
  CONSTRAINT n_ed_uq UNIQUE (name, end_date),
  
  CONSTRAINT sd_lt_ed_ck CHECK (start_date < end_date),
  
);

You might also want to ensure that there is no overlap between a given name and their start_date and end_date – but I don’t know how to do this in SQL Server short of using a trigger – my UNIQUE constraints are an attempt at that through SQL – but they won’t prevent nested intervals…

Here’s an article on how to do it using a trigger.

First approach (calendar table – Tavernier):

The code for approach 1 is available on the fiddle here.

DECLARE @start_dt AS DATE = '2021-01-01';       -- Date from which the calendar table will be created (inclusive).
DECLARE @end_dt   AS DATE = '2022/01/01';       -- Calendar table will be created up to this date (not inclusive).

CREATE TABLE all_dates
(
 tdate DATE PRIMARY KEY
);

WHILE @start_dt < @end_dt
BEGIN
  INSERT INTO all_dates 
  (
    tdate   -- can generate many more useful values - see article by Jon Tavernier
  ) 
  VALUES 
  (
    @start_dt 
  )
  SET @start_dt = DATEADD(DAY, 1, @start_dt)
END;

and then I check by:

SELECT * FROM all_dates;

Result (snipped for brevity):

  tdate
2021-01-01
2021-01-02
2021-01-03
2021-01-04
2021-01-05

Next, we set up our test data as per the OP’s question (I called the table stay):

name    start_date  end_date
Joe     2021-04-20  2021-05-20
John    2021-05-01  2021-05-28

Then, I ran this query:

SELECT 
  DATEPART(mm, ad.tdate) AS "Month",
  DATENAME(MONTH, ad.tdate) AS "M Name",
  COUNT(MONTH(ad.tdate)) AS "Count"
FROM stay s
JOIN all_dates ad
  ON ad.tdate >= s.start_date AND ad.tdate <= s.end_date
GROUP BY DATEPART(mm, ad.tdate), DATENAME(MONTH, ad.tdate) 
ORDER BY DATEPART(mm, ad.tdate);

Result:

Month     Name  Count
    4    April     11
    5      May     48

The month name wasn’t included in the spec, and can easily be dropped – just thought it was a nice to have.

I then turned on Statistics as follows:

SET STATISTICS PROFILE ON;
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

and reran the query – results discussed below.

Second approach (calendar year table using pre-defined dates via a CTE):

The code for this approach is available on the fiddle here.

Sometimes consultants aren’t allowed to create physical tables, and I wanted the natural approach for this is to use a CTE approach and I stumbled across this next gem (Creating a date dimension or calendar table in SQL Server) by Aaron Bertand (of this parish).

He begins his piece with these sage words:

One of the biggest objections I hear to calendar tables is that people
don’t want to create a table. I can’t stress enough how cheap a table
can be in terms of size and memory usage, especially as underlying
storage continues to be larger and faster, compared to using all kinds
of functions to determine date-related information in every single
query. Twenty or thirty years of dates stored in a table takes a few
MBs at most, even less with compression, and if you use them often
enough, they’ll always be in memory.

He goes on to say that:

This is a one-time population, so I’m not worried about speed, even
though this specific CTE approach is no slouch. I like to materialize
all of the columns to disk, rather than rely on computed columns,
since the table becomes read-only after initial population. So I’m
going to do a lot of those calculations during the initial series of
CTEs. To start, I’ll show the output of each CTE one at a time.

So, even the function in the first approach might be suitable – but, CTE‘s (Common Table Expressions – also, see the link to RECURSIVE CTE’s within) were, as best as I can ascertain, introduced into SQL Server after Tavernier’s article above.

I adapted the code from Bertrand’s second CTE as follows:

WITH seq (n) AS
(
  SELECT 0
  UNION ALL
  SELECT n + 1 FROM seq
  WHERE n < DATEDIFF(DAY, '2021-01-01', '2021-12-31')
),
all_dates (tdate) AS
(
  SELECT DATEADD(DAY, n, '2021-01-01')  FROM seq
)
SELECT tdate FROM all_dates
OPTION (MAXRECURSION 0);  -- Important!!! Fails after 100 recursions if not present

Result (snipped for brevity):

         tdate
2021-01-01 00:00:00.000
2021-01-02 00:00:00.000
2021-01-03 00:00:00.000

So, we’re golden as far as our calendar table is concerned…

Then, I just put that over my original query from approach 1 as follows:

WITH seq (n) AS
(
  SELECT 0
  UNION ALL
  SELECT n + 1 FROM seq
  WHERE n < DATEDIFF(DAY, '2021-01-01', '2021-12-31')
),
all_dates (tdate) AS
(
  SELECT DATEADD(DAY, n, '2021-01-01') FROM seq
)
SELECT 
  DATEPART(mm, ad.tdate) AS "Month",
  DATENAME(MONTH, ad.tdate) AS "Name",
  COUNT(MONTH(ad.tdate)) AS "Count"
FROM stay s
JOIN all_dates ad
  ON ad.tdate >= s.start_date AND ad.tdate <= s.end_date
GROUP BY DATEPART(mm, ad.tdate), DATENAME(MONTH, ad.tdate) 
ORDER BY DATEPART(mm, ad.tdate)
OPTION (MAXRECURSION 0);

Result (same):

Month    Name   Count
    4   April      11
    5     May      48

So, I did the same as for the first approach and turned on Statisics – discussed below.

3rd approach (restricted dates):

The fiddle is available here.

Here, in the interests of “efficiency”, I restricted the dates in the calendar table to the MIN() of start_time and the MAX() of end_time in the stay table as follows:

WITH stays (min_sd, max_ed, no_days) AS
(
  SELECT 
    MIN(start_date) AS min_sd, 
    MAX(end_date) AS max_ed,
    DATEDIFF(DAY, MIN(start_date), MAX(end_date)) AS no_days
  FROM stay
),
seq(n, tdate) AS
(
  SELECT 0, (SELECT min_sd FROM stays)
  UNION ALL
  SELECT n + 1, DATEADD(DAY, n + 1, (SELECT min_sd FROM stays))
  FROM seq
  WHERE n < (SELECT no_days FROM stays)
)
SELECT * FROM seq;

Result (snipped for brevity):

n   tdate
0   2021-04-20
1   2021-04-21
2   2021-04-22
(39 rows...)

So, now we have every date between the earliest start_date and the latest end_date.

And using these dates as our calendar table, we formulate the query:

WITH stays (min_sd, max_ed, no_days) AS
(
  SELECT 
    MIN(start_date) AS min_sd, 
    MAX(end_date) AS max_ed,
    DATEDIFF(DAY, MIN(start_date), MAX(end_date)) AS no_days
  FROM stay
),
all_dates(n, tdate) AS
(
  SELECT 0, (SELECT min_sd FROM stays)
  UNION ALL
  SELECT n + 1, DATEADD(DAY, n + 1, (SELECT min_sd FROM stays))
  FROM all_dates
  WHERE n < (SELECT no_days FROM stays)
)
SELECT 
  DATEPART(mm, ad.tdate) AS "Month",
  DATENAME(MONTH, ad.tdate) AS "Name",
  COUNT(MONTH(ad.tdate)) AS "Count"
FROM stay s
JOIN all_dates ad
  ON ad.tdate >= s.start_date AND ad.tdate <= s.end_date
GROUP BY DATEPART(mm, ad.tdate), DATENAME(MONTH, ad.tdate) 
ORDER BY DATEPART(mm, ad.tdate)
OPTION (MAXRECURSION 0);

Result (same):

Month    Name   Count
    4   April      11
    5     May      48

I have some performance stats at the end of the fiddles. It is, of course, very difficult to draw many conclusions from performance analyses of (relatively) tiny amounts of data on a server over which one has no control – I would urge you to test with your own data and under your own load…