Data source: monthly ridership (number of passengers) collected by the US railroad company “Amtrak” from January 1991 to March 2004, as used in Data Mining for Business Analytics.

Introduction

Bayesian Structural Time Series (BSTS) applies Bayesian methods to the Structural Time Series (STS) model.
An STS model is a linear Gaussian state space model, the same concept as a Dynamic Linear Model (DLM).
The advantages of STS are as follows.
- Flexible : a very large class of models, including all ARIMA and VARMA models, can be expressed in state space form.
- Modular : a model can be assembled from a library of state-component sub-models to capture the important features of the data.
  - Several widely used state components are available for capturing trend, seasonality, and so on.
When the time series \(y_{t}\) observed at time \(t\) is \(m\)-dimensional and the unobserved state is \(p\)-dimensional, the STS equations are as follows.

\[ \begin{aligned} Y_{t}&=Z^{T}_{t}\alpha_{t}+\epsilon_{t},~~~~~~~\epsilon_{t}\sim N_{m}(0,H_{t}), \\ \alpha_{t+1}&=T_{t}\alpha_{t}+R_{t}\eta_{t},~~~\eta_{t}\sim N_{q}(0,Q_{t}),\\ \alpha_{1} &\sim N_{p} (\mu_{1}, P_{1}) \end{aligned} \]

The first equation is the observation equation and the second is the state equation.
- \(Y_{t}\) : the observation at time \(t\)
- \(\alpha_{t}\) : the unobservable state at time \(t\) (in a time series, components such as trend and seasonality can serve as the state)
  - Although it cannot be observed directly, it is reasonable to assume that we know how it changes over time
  - The state equation defines how it changes over time
- \(Z_{t}\), \(T_{t}\), \(R_{t}\) : matrices containing known values (including 0 and 1) and unknown parameters
  - \(Z_{t}\) : \(p\times m\) output matrix
  - \(T_{t}\) : \(p\times p\) transition matrix
  - \(R_{t}\) : \(p\times q\) control matrix
- \(\epsilon_{t}\), \(\eta_{t}\) : error terms, assumed to be serially uncorrelated and uncorrelated with each other in all time periods
  - \(\epsilon_{t}\) : an \(m\times 1\) vector with \(m\times m\) variance-covariance matrix \(H_{t}\)
  - \(\eta_{t}\) : a \(q \times 1\) vector with \(q\times q\) state diffusion matrix \(Q_{t}\) (\(q\le p\))

BSTS Package

BSTS models can be fit with the R package bsts.
- bsts fits the model and produces forecasts by drawing samples from the BSTS posterior distribution with Markov chain Monte Carlo (MCMC).
Next, we describe the models most commonly used in BSTS when the observation \(y_{t}\) is a univariate time series. For other models, see here.

Trend

Local Level Model

A simple model that represents the observation \(y_{t}\) using only the level \(\mu_{t}\), the mean of the trend.
- Random walk + noise
With \(Z^{T}_{t} = 1\), \(T_{t} = 1\), \(\alpha_{t}=\mu_{t}\), \(R_{t} = 1\), \(\eta_{t} = \xi_{t}\), the model is as follows.

\[ \begin{aligned} Y_{t} &= \mu_{t} + \epsilon_{t},~~~~\epsilon_{t}\sim N(0, \sigma^2_{\epsilon})\\ \mu_{t+1} &= \mu_{t} + \xi_{t}, ~~~\xi_{t}\sim N(0,\sigma^2_{\xi }), \end{aligned} \]

In the R package "bsts", use AddLocalLevel().

ss <- list()
ss <- bsts::AddLocalLevel(ss, y)   # y : The time series to be modeled
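
To make the "random walk + noise" structure concrete, the following is a minimal base R sketch that simulates a local level series. The sample size, the two standard deviations, and the initial level are illustrative assumptions, not values taken from the Amtrak example.

```r
# Simulate a local level model:
#   y_t      = mu_t + eps_t,   eps_t ~ N(0, sigma.eps^2)
#   mu_{t+1} = mu_t + xi_t,    xi_t  ~ N(0, sigma.xi^2)
# All numeric values below are illustrative assumptions.
set.seed(1)
n         <- 120    # number of time points
sigma.eps <- 1.0    # observation noise SD
sigma.xi  <- 0.3    # level innovation SD
mu1       <- 10     # initial level

xi <- rnorm(n - 1, mean = 0, sd = sigma.xi)   # level innovations
mu <- mu1 + c(0, cumsum(xi))                  # random-walk level
y  <- mu + rnorm(n, mean = 0, sd = sigma.eps) # noisy observations
```

A series simulated this way could then be passed as y to AddLocalLevel() to see how well the level is recovered.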

Local Linear Trend Model

A model that represents the observation \(y_{t}\) using the level \(\mu_{t}\) (the mean of the trend) and the slope \(\delta_{t}\) (the growth rate of the trend).
It is more flexible than the Linear Trend Model and is useful for short-term forecasting.
With \(Z^{T}_{t} = (1, 0)\), \(T_{t} = \left[\begin{matrix} 1 & 1\\ 0 & 1 \end{matrix}\right]\), \(\alpha_{t}=(\mu_{t}, \delta_{t})^{T}\), \(R_t=\left[\begin{matrix} 1 & 0 \\ 0 & 1 \end{matrix}\right]\), \(\eta_{t}=(\xi_{t},\zeta_{t})^{T}\), the model is as follows.

\[ \begin{aligned} Y_{t} &= \mu_{t} + \epsilon_{t},~~~~~~~~~~\epsilon_{t}\sim N(0, \sigma^2_{\epsilon})\\ \mu_{t+1} &= \mu_{t} + \delta_{t} + \xi_{t}, ~~~\xi_{t}\sim N(0,\sigma^2_{\xi }),\\ \delta_{t+1} &= \delta_{t} + \zeta_{t}, ~~~~~~~~~~~~\zeta_{t}\sim N(0,\sigma^2_{\zeta}), \end{aligned} \]

In the R package "bsts", use AddLocalLinearTrend().

ss <- list()
ss <- bsts::AddLocalLinearTrend(ss, y)   # y : The time series to be modeled
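
As a check that the matrix form reproduces the two state recursions, here is a small base R sketch of a single transition step. The state and innovation values are made-up numbers chosen only for illustration, and Tmat and Rmat are hypothetical variable names standing in for \(T_{t}\) and \(R_{t}\).

```r
# One step of the local linear trend model in matrix form:
#   alpha_{t+1} = T_t alpha_t + R_t eta_t
# Numeric values are illustrative assumptions.
Z     <- c(1, 0)                                  # Z_t, picks out the level
Tmat  <- matrix(c(1, 1,
                  0, 1), nrow = 2, byrow = TRUE)  # T_t
Rmat  <- diag(2)                                  # R_t
alpha <- c(100, 2)                                # (mu_t, delta_t)
eta   <- c(0.5, -0.1)                             # (xi_t, zeta_t)

alpha.next <- as.vector(Tmat %*% alpha + Rmat %*% eta)
# alpha.next[1] is mu_t + delta_t + xi_t = 100 + 2 + 0.5
# alpha.next[2] is delta_t + zeta_t      = 2 - 0.1

signal <- sum(Z * alpha)                          # Z_t^T alpha_t = mu_t
```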

Seasonality

Regression with Seasonal Dummy Variables

A state component model commonly used to capture seasonality.
- It can be extended to allow multiple seasonal components with different periods.
When the seasonal period of the observation \(y_{t}\) is \(S\), with \(Z^{T}_{t} = (1, 0,\ldots, 0)\), \(T_{t} = \left[\begin{matrix} -1 & - 1 & \cdots & -1 & -1 \\ 1 & 0 & \cdots & 0 & 0\\ 0 & 1 & \cdots & 0 & 0 \\ \vdots &\vdots &\vdots & \vdots &\vdots \\ 0 & 0 & \cdots & 1 & 0 \end{matrix}\right]\), \(\alpha_{t}=(\tau_{t}, \ldots, \tau_{t-S+2})^{T}\), \(R_{t}=(1,0,\ldots,0)^{T}\), \(\eta_{t}=\omega_{t}\), the model is as follows.

\[ \begin{aligned} Y_{t} &= \tau_{t} + \epsilon_{t},~~~~~~~~~~\epsilon_{t}\sim N(0, \sigma^2_{\epsilon})\\ \tau_{t+1} &= -\sum_{s=1}^{S-1} \tau_{t+1-s} + \omega_{t}, ~~~\omega_{t}\sim N(0,\sigma^2_{\omega}), \end{aligned} \]

In the R package "bsts", use AddSeasonal().

ss <- list()
ss <- bsts::AddSeasonal(ss, y,           # y : The time series to be modeled
                        nseasons,        # number of seasons in one cycle
                        season.duration) # number of observations in each season

# cycle (s) = season.duration * nseasons
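
The seasonal recursion implies that, when the noise \(\omega_{t}\) is zero, any \(S\) consecutive seasonal effects sum to zero. A minimal base R sketch, assuming \(S = 4\) and arbitrary starting effects (both are illustrative choices, not values from the Amtrak example), builds the transition matrix and iterates it:

```r
# Build the (S-1) x (S-1) seasonal transition matrix and iterate it without noise.
# S = 4 and the starting effects are illustrative assumptions.
S    <- 4
Tmat <- rbind(rep(-1, S - 1),          # first row: minus the sum of the last S-1 effects
              cbind(diag(S - 2), 0))   # remaining rows: shift the effects down by one
alpha <- c(3, -1, -4)                  # (tau_t, tau_{t-1}, tau_{t-2})

tau <- rev(alpha)                      # chronological order: tau_{t-2}, tau_{t-1}, tau_t
for (i in 1:8) {
  alpha <- as.vector(Tmat %*% alpha)   # omega_t = 0
  tau   <- c(tau, alpha[1])            # append the newest seasonal effect
}

# Sum of every window of S consecutive effects (all zero in the noise-free case)
window.sums <- sapply(seq_len(length(tau) - S + 1),
                      function(i) sum(tau[i:(i + S - 1)]))
```

For monthly data with a yearly cycle, the same structure corresponds to nseasons = 12 and season.duration = 1.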

Application

Using the Ridership on Amtrak Trains example (monthly ridership collected by the US railroad company “Amtrak” from January 1991 to March 2004), we show how BSTS is applied to real data.

Loading the Data

pacman::p_load("dplyr", "bsts", "forecast", "ggplot2")

# In Mac
# guess_encoding("Amtrak.csv")
# Amtrak.data <- read.csv("Amtrak.csv", fileEncoding="EUC-KR")

Amtrak.data  <- read.csv("C:/Users/User/Desktop/Amtrak.csv")
ridership.ts <- ts(Amtrak.data$Ridership, start=c(1991,1), end=c(2004,3), freq=12)

Splitting the Data

train.ts  <- window(ridership.ts,start=c(1991,1), end=c(2001,3))   # Training Data
test.ts   <- window(ridership.ts,start=c(2001,4))                  # Test Data
n.test    <- length(test.ts)
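
As a sanity check on the split, the sketch below uses a placeholder series with the same monthly index as the Amtrak data (the values are not the real ridership figures; dummy.ts is a hypothetical name) and counts the observations that fall in each window.

```r
# Placeholder series with the same time index as the Amtrak data (values are NOT real).
dummy.ts <- ts(seq_len(159), start = c(1991, 1), frequency = 12)

train <- window(dummy.ts, start = c(1991, 1), end = c(2001, 3))  # Jan 1991 to Mar 2001
test  <- window(dummy.ts, start = c(2001, 4))                    # Apr 2001 to Mar 2004

n.obs <- c(train = length(train), test = length(test))
n.obs            # 123 training months and 36 test months
end(dummy.ts)    # last observation falls in March 2004
```

The 36 test months match the 36 forecast values produced with horizon = n.test later in the example.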

Checking Normality

par(mfrow=c(2,1))
hist(train.ts, prob=TRUE, 12)
lines(density(train.ts), col="blue")
qqnorm(train.ts)
qqline(train.ts)

Both plots suggest that Ridership is approximately normally distributed.
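
A visual check can be complemented with a formal test. The sketch below runs a Shapiro-Wilk test from base R on a simulated stand-in for train.ts; the sample and its mean and standard deviation are assumptions for illustration, so the resulting p-value says nothing about the Amtrak data itself.

```r
# Simulated stand-in for train.ts (values are NOT the Amtrak data).
set.seed(42)
x <- rnorm(123, mean = 1800, sd = 170)

sw <- shapiro.test(x)   # H0: the sample comes from a normal distribution
sw$p.value              # a small p-value would be evidence against normality
```

With the real data, x would simply be replaced by train.ts.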

Model Fitting

A model can be fit with bsts() from the bsts package.

bsts(formula, state.specification, family = c("gaussian", "logit", "poisson", "student"), data, niter, seed = NULL, ...)

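formula : describes the relationship between the time series \(y_{t}\) and the predictors \(x_{i,t}\)
- without predictors : y
- with predictors : \(y\sim x\)
state.specification : the state component specification introduced in the BSTS Package section
family : the distribution of the observation equation
data : a data frame containing the variables that appear in formula
niter : the number of MCMC draws desired
seed : a seed value fixed so that repeated runs give the same result

For the example data, we fit the popular and useful Basic STS Model.
The Basic STS Model including a regression component (= predictors) is as follows.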

\[ \begin{aligned} Y_{t} &= \mu_{t} + \tau_{t} + \beta^{T}x_{t} + \epsilon_{t},~~~~\epsilon_{t}\sim N(0, \sigma^2_{\epsilon})\\ \mu_{t+1} &= \mu_{t} + \delta_{t} + \xi_{t}, ~~~~~~~~~\xi_{t}\sim N(0,\sigma^2_{\xi }),\\ \delta_{t+1} &= \delta_{t} + \zeta_{t}, ~~~~~~~~~~~~~~~~~~~~~\zeta_{t}\sim N(0,\sigma^2_{\zeta})\\ \tau_{t+1}&=-\sum_{s=1}^{S-1} \tau_{t+1-s} + \omega_{t}, ~~~~~~~~~~\omega_{t}\sim N(0,\sigma^2_{\omega }) \end{aligned} \]

- For the trend and seasonal components of the Basic STS Model, \(Z^{T}_{t} = (1,0,1,\ldots, 0)\), \(T_{t} = \left[\begin{smallmatrix} 1 & 1 & \\ 0 & 1 & \\ & & -1 & - 1 & \cdots & -1 & -1 \\ & & 1 & 0 & \cdots & 0 & 0\\ & & 0 & 1 & \cdots & 0 & 0 \\ & & \vdots &\vdots &\vdots &\vdots &\vdots \\ & & 0 & 0 & \cdots & 1 & 0 \end{smallmatrix}\right]\), \(\alpha_{t} = (\mu_{t}, \delta_{t}, \tau_{t}, \ldots, \tau_{t-S+2})^{T}\), \(R_{t}=\left[\begin{smallmatrix}1 & 0 \\ 0 & 1 \\ & & 1 \\ & & 0 \\ & & \vdots \\ & & 0 \\ \end{smallmatrix}\right]\), \(\eta_{t}=(\xi_{t}, \zeta_{t},\omega_{t})^{T}\).
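
As a quick dimension check on these matrices (a worked count, assuming monthly data with \(S = 12\), the value used for nseasons in the fitting code): the state vector stacks the level, the slope, and \(S-1\) seasonal effects, and there is one innovation each for the level, the slope, and the seasonal component, so

\[ p = 2 + (S-1) = 2 + 11 = 13, \qquad q = 3. \]

Under that assumption \(T_{t}\) is \(13\times 13\), \(R_{t}\) is \(13\times 3\), and \(Z_{t}\) has 13 entries, with ones in positions 1 and 3 (picking out \(\mu_{t}\) and \(\tau_{t}\)) and zeros elsewhere.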

# Local Linear Trend
ss <- list()
ss <- bsts::AddLocalLinearTrend(ss, train.ts)
# Seasonal
ss <- bsts::AddSeasonal(ss, train.ts, nseasons = 12, season.duration = 1) 

BSTS.fit <- bsts(train.ts, state.specification = ss, niter = 1000, seed=100)  # niter : number of MCMC iterations

=-=-=-=-= Iteration 0 Thu May 19 02:28:39 2022
 =-=-=-=-=
=-=-=-=-= Iteration 100 Thu May 19 02:28:39 2022
 =-=-=-=-=
=-=-=-=-= Iteration 200 Thu May 19 02:28:39 2022
 =-=-=-=-=
=-=-=-=-= Iteration 300 Thu May 19 02:28:39 2022
 =-=-=-=-=
=-=-=-=-= Iteration 400 Thu May 19 02:28:39 2022
 =-=-=-=-=
=-=-=-=-= Iteration 500 Thu May 19 02:28:40 2022
 =-=-=-=-=
=-=-=-=-= Iteration 600 Thu May 19 02:28:40 2022
 =-=-=-=-=
=-=-=-=-= Iteration 700 Thu May 19 02:28:40 2022
 =-=-=-=-=
=-=-=-=-= Iteration 800 Thu May 19 02:28:40 2022
 =-=-=-=-=
=-=-=-=-= Iteration 900 Thu May 19 02:28:40 2022
 =-=-=-=-=

summary(BSTS.fit)

$residual.sd
[1] 38.96105

$prediction.sd
[1] 86.43389

$rsquare
[1] 0.9407548

$relative.gof
[1] 0.7354909

Forecasting

burn          <- SuggestBurn(0.1, BSTS.fit)     # number of MCMC draws to discard as burn-in
BSTS.forecast <- predict(BSTS.fit, horizon = n.test, burn = burn,  quantiles = c(0.025, 0.975))  # horizon : number of periods to forecast
BSTS.forecast$mean

 [1] 2008.007 2034.247 2005.145 2114.927 2159.952 1849.643 1985.271
 [8] 1980.410 2025.083 1771.644 1738.847 2042.618 2047.263 2078.716
[15] 2049.774 2156.080 2205.664 1890.307 2028.975 2029.909 2066.173
[22] 1817.965 1788.446 2091.587 2098.778 2123.918 2100.609 2209.087
[29] 2258.041 1946.782 2080.985 2076.564 2118.590 1871.648 1843.989
[36] 2146.205

plot(BSTS.forecast, plot.original = 100)

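# Note: forecast::accuracy() expects (forecast, actual); in this order errors are forecast - actual, so ME and MPE change sign.
accuracy(test.ts, BSTS.forecast$mean)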

               ME     RMSE      MAE      MPE     MAPE
Test set 30.17445 66.89158 46.26838 1.502453 2.293063
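
To show what the columns of this table measure, the following base R sketch computes the same five quantities by hand, with the error defined as actual minus forecast (the convention forecast::accuracy() uses when called as accuracy(forecast, actual)). The numbers are toy values, not the Amtrak results, and actual, pred, and measures are hypothetical names.

```r
# Accuracy measures computed by hand, with error = actual - forecast.
# Toy numbers for illustration only (NOT the Amtrak results).
actual <- c(100, 110, 120, 130)
pred   <- c(102, 108, 125, 128)
err    <- actual - pred                     # -2, 2, -5, 2

measures <- c(
  ME   = mean(err),                         # mean error (bias)
  RMSE = sqrt(mean(err^2)),                 # root mean squared error
  MAE  = mean(abs(err)),                    # mean absolute error
  MPE  = mean(100 * err / actual),          # mean percentage error
  MAPE = mean(abs(100 * err / actual))      # mean absolute percentage error
)
round(measures, 3)
```

A negative ME under this convention means the forecasts are, on average, above the actual values.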

Model Fitting with Predictors

## Create the Month variable
xts(ridership.ts, order = as.Date(ridership.ts))

               [,1]
1991-01-01 1708.917
1991-02-01 1620.586
1991-03-01 1972.715
1991-04-01 1811.665
1991-05-01 1974.964
1991-06-01 1862.356
1991-07-01 1939.860
1991-08-01 2013.264
1991-09-01 1595.657
1991-10-01 1724.924
1991-11-01 1675.667
1991-12-01 1813.863
1992-01-01 1614.827
1992-02-01 1557.088
1992-03-01 1891.223
1992-04-01 1955.981
1992-05-01 1884.714
1992-06-01 1623.042
1992-07-01 1903.309
1992-08-01 1996.712
1992-09-01 1703.897
1992-10-01 1810.000
1992-11-01 1861.601
1992-12-01 1875.122
1993-01-01 1705.259
1993-02-01 1618.535
1993-03-01 1836.709
1993-04-01 1957.043
1993-05-01 1917.185
1993-06-01 1882.398
1993-07-01 1933.009
1993-08-01 1996.167
1993-09-01 1672.841
1993-10-01 1752.827
1993-11-01 1720.377
1993-12-01 1734.292
1994-01-01 1563.365
1994-02-01 1573.959
1994-03-01 1902.639
1994-04-01 1833.888
1994-05-01 1831.049
1994-06-01 1775.755
1994-07-01 1867.508
1994-08-01 1906.608
1994-09-01 1685.632
1994-10-01 1778.546
1994-11-01 1775.995
1994-12-01 1783.350
1995-01-01 1548.415
1995-02-01 1496.925
1995-03-01 1798.316
1995-04-01 1732.895
1995-05-01 1772.345
1995-06-01 1761.207
1995-07-01 1791.655
1995-08-01 1874.820
1995-09-01 1571.309
1995-10-01 1646.948
1995-11-01 1672.631
1995-12-01 1656.845
1996-01-01 1381.758
1996-02-01 1360.852
1996-03-01 1558.575
1996-04-01 1608.420
1996-05-01 1696.696
1996-06-01 1693.183
1996-07-01 1835.516
1996-08-01 1942.573
1996-09-01 1551.401
1996-10-01 1686.508
1996-11-01 1576.204
1996-12-01 1700.433
1997-01-01 1396.588
1997-02-01 1371.690
1997-03-01 1707.522
1997-04-01 1654.604
1997-05-01 1762.903
1997-06-01 1775.800
1997-07-01 1934.219
1997-08-01 2008.055
1997-09-01 1615.924
1997-10-01 1773.910
1997-11-01 1732.368
1997-12-01 1796.626
1998-01-01 1570.330
1998-02-01 1412.691
1998-03-01 1754.641
1998-04-01 1824.932
1998-05-01 1843.289
1998-06-01 1825.964
1998-07-01 1968.172
1998-08-01 1921.645
1998-09-01 1669.597
1998-10-01 1791.474
1998-11-01 1816.714
1998-12-01 1846.754
1999-01-01 1599.427
1999-02-01 1548.804
1999-03-01 1832.333
1999-04-01 1839.720
1999-05-01 1846.498
1999-06-01 1864.852
1999-07-01 1965.743
1999-08-01 1949.002
1999-09-01 1607.373
1999-10-01 1803.664
1999-11-01 1850.309
1999-12-01 1836.435
2000-01-01 1541.660
2000-02-01 1616.928
2000-03-01 1919.538
2000-04-01 1971.493
2000-05-01 1992.301
2000-06-01 2009.763
2000-07-01 2053.996
2000-08-01 2097.471
2000-09-01 1823.706
2000-10-01 1976.997
2000-11-01 1981.408
2000-12-01 2000.153
2001-01-01 1683.148
2001-02-01 1663.404
2001-03-01 2007.928
2001-04-01 2023.792
2001-05-01 2047.008
2001-06-01 2072.913
2001-07-01 2126.717
2001-08-01 2202.638
2001-09-01 1707.693
2001-10-01 1950.716
2001-11-01 1973.614
2001-12-01 1984.729
2002-01-01 1759.629
2002-02-01 1770.595
2002-03-01 2019.912
2002-04-01 2048.398
2002-05-01 2068.763
2002-06-01 1994.267
2002-07-01 2075.258
2002-08-01 2026.560
2002-09-01 1734.155
2002-10-01 1916.771
2002-11-01 1858.345
2002-12-01 1996.352
2003-01-01 1778.033
2003-02-01 1749.489
2003-03-01 2066.466
2003-04-01 2098.899
2003-05-01 2104.911
2003-06-01 2129.671
2003-07-01 2223.349
2003-08-01 2174.360
2003-09-01 1931.406
2003-10-01 2121.470
2003-11-01 2076.054
2003-12-01 2140.677
2004-01-01 1831.508
2004-02-01 1838.006
2004-03-01 2132.446

Month  <- as.Date(ridership.ts) %>%                  # extract dates
  lubridate::month()                                 # extract months

Train.Data <- data.frame("y"=train.ts, "Month"= Month[1:length(train.ts)])
Test.Data  <- data.frame("y"=test.ts, "Month"= Month[-(1:length(train.ts))])


BSTS.fit2 <- bsts(y ~ Month, state.specification = ss, data = Train.Data, niter = 1000, seed=100)  # niter : number of MCMC iterations

=-=-=-=-= Iteration 0 Thu May 19 02:28:41 2022
 =-=-=-=-=
=-=-=-=-= Iteration 100 Thu May 19 02:28:41 2022
 =-=-=-=-=
=-=-=-=-= Iteration 200 Thu May 19 02:28:42 2022
 =-=-=-=-=
=-=-=-=-= Iteration 300 Thu May 19 02:28:42 2022
 =-=-=-=-=
=-=-=-=-= Iteration 400 Thu May 19 02:28:42 2022
 =-=-=-=-=
=-=-=-=-= Iteration 500 Thu May 19 02:28:42 2022
 =-=-=-=-=
=-=-=-=-= Iteration 600 Thu May 19 02:28:43 2022
 =-=-=-=-=
=-=-=-=-= Iteration 700 Thu May 19 02:28:43 2022
 =-=-=-=-=
=-=-=-=-= Iteration 800 Thu May 19 02:28:43 2022
 =-=-=-=-=
=-=-=-=-= Iteration 900 Thu May 19 02:28:43 2022
 =-=-=-=-=

summary(BSTS.fit2)

$residual.sd
[1] 38.38387

$prediction.sd
[1] 86.37815

$rsquare
[1] 0.9424972

$relative.gof
[1] 0.7358398

$size
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.00000 0.00000 0.07092 0.00000 1.00000 

$coefficients
                 mean       sd mean.inc   sd.inc   inc.prob
Month       0.1077952 0.539501 1.519912 1.414357 0.07092199
(Intercept) 0.0000000 0.000000 0.000000 0.000000 0.00000000

Regression Coefficients

# Ref. https://michaeldewittjr.com/programming/2018-07-05-bayesian-time-series-analysis-with-bsts_files/

burn2 <- SuggestBurn(0.1, BSTS.fit2)

### Helper that drops the zero-valued MCMC draws before averaging
PositiveMean <- function(b) {
  b <- b[abs(b) > 0]
  if (length(b) > 0) 
    return(mean(b))
  return(0)
}

### Get the average coefficients when variables were selected (non-zero slopes)
coeff <- data.frame(reshape2::melt(apply(BSTS.fit2$coefficients[-(1:burn2),], 2, PositiveMean)))  # use mean instead of PositiveMean to average with zeros included
coeff$Variable <- as.character(row.names(coeff))
coeff

               value    Variable
(Intercept) 0.000000 (Intercept)
Month       1.519912       Month

ggplot(data=coeff, aes(x=reorder(Variable,value), y=value)) + 
  coord_flip()+
  geom_bar(stat="identity", position="identity") + 
  theme(axis.text.x=element_text(angle = -90, hjust = 0)) +
  xlab("") + ylab("") + ggtitle("Average coefficients")

### Inclusion probabilities -- i.e., how often were the variables selected 
inclusionprobs <- reshape2::melt(colMeans(BSTS.fit2$coefficients[-(1:burn2),] != 0))
inclusionprobs$Variable <- as.character(row.names(inclusionprobs))
ggplot(data=inclusionprobs, aes(x=reorder(Variable, value), y=value)) + 
  geom_bar(stat="identity", position="identity") + 
  theme(axis.text.x=element_text(angle = -90, hjust = 0)) + 
  coord_flip()+
  xlab("") + ylab("") + ggtitle("Inclusion probabilities")
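
The two summaries plotted here can be reproduced on a tiny example. The sketch below uses a made-up matrix of coefficient draws (rows are MCMC draws, columns are predictors; a zero means the predictor was excluded in that draw) and is only meant to illustrate how the inclusion probability and the conditional mean relate; draws, inc.prob, mean.inc, and mean.all are hypothetical names.

```r
# Toy matrix of MCMC coefficient draws (rows = draws, columns = predictors).
# Values are made up for illustration; a zero means "excluded in that draw".
draws <- cbind(
  "(Intercept)" = c(0, 0, 0, 0, 0),
  Month         = c(0, 1.2, 0, 1.8, 0)
)

# Same helper as above: mean over the non-zero draws, or 0 if there are none
PositiveMean <- function(b) {
  b <- b[abs(b) > 0]
  if (length(b) > 0) mean(b) else 0
}

inc.prob <- colMeans(draws != 0)            # share of draws with a non-zero coefficient
mean.inc <- apply(draws, 2, PositiveMean)   # mean over the non-zero draws only
mean.all <- colMeans(draws)                 # mean over all draws, zeros included
```

In this toy case mean.all equals mean.inc times inc.prob for each column. The same relation is visible in the summary(BSTS.fit2) output, where the mean for Month (0.1078) is approximately mean.inc times inc.prob (1.5199 × 0.0709).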

Forecasting with Predictors

# Forecast
BSTS.forecast2 <- predict(BSTS.fit2, horizon = n.test, burn = burn2, newdata = Month[-(1:length(train.ts))], # newdata : object holding the predictors
                          quantiles = c(0.025, 0.975))  # horizon : number of periods to forecast
BSTS.forecast2$mean

 [1] 2008.649 2033.859 2005.471 2115.656 2159.608 1849.589 1985.606
 [8] 1980.895 2025.518 1771.432 1738.900 2042.580 2046.658 2077.231
[15] 2049.163 2155.931 2204.486 1889.507 2028.428 2029.719 2066.013
[22] 1816.823 1787.502 2091.551 2097.854 2122.777 2100.560 2208.762
[29] 2256.867 1945.645 2080.120 2076.238 2118.645 1870.599 1843.099
[36] 2145.638

plot(BSTS.forecast2, plot.original = 100)

accuracy(BSTS.forecast2$mean, test.ts)

                ME     RMSE      MAE       MPE     MAPE      ACF1
Test set -29.77804 66.58763 45.96603 -1.599523 2.379736 0.5265906
         Theil's U
Test set 0.3875899

Bayesian Structural Time Series