Imputation using Kalman Smoothing for Time Series Data
\[ \begin{aligned} Y_{t}&=Z^{T}_{t}\alpha_{t}+\epsilon_{t},~~~~~~~\epsilon_{t}\sim N_{m}(0,H_{t}), \\ \alpha_{t+1}&=T_{t}\alpha_{t}+R_{t}\eta_{t},~~~\eta_{t}\sim N_{q}(0,Q_{t}),\\ \alpha_{1} &\sim N_{p} (\mu_{1}, P_{1}) \end{aligned} \]
결측값
이 있을 때 유용imputeTS
Package은 단변량 시계열 데이터 (Univariate time series)에 결측값을 Imputation하기 위해 개발된 패키지이다.na_kalmna()
함수로 Kalman Smoothing 방법을 이용한 Imputation을 할 수 있다.na_kalman(x, model = "StructTS", smooth = TRUE, ...)
auto.arima
: auto.arima에 의해 적합된 ARIMA 모형
StructTS
: 최대 우도(Maximum Likelihood)에 의해 적합된 structural 모형TRUE
이면, Kalman Smoothing, FALSE
이면 Kalmna Run(Kalman Filtering)
na_kalmna()
의 Imputation 과정은 다음과 같다.
NA
의 위치를 파악한다.NA
를 추정된 \(\hat{y}_{t}=Z^{T}_{t}\hat{\alpha}_{t}\)로 대체한다.stats
Package에 있는 함수로, 주어진 시계열 데이터를 structural 모형에 적합한다.
\[ \begin{aligned} Y_{t} &= \mu_{t} + \epsilon_{t},~~~~\epsilon_{t}\sim N(0, \sigma^2_{\epsilon})\\ \mu_{t+1} &= \mu_{t} + \xi_{t}, ~~~\xi_{t}\sim N(0,\sigma^2_{\xi }), \end{aligned} \]
\[ \begin{aligned} Y_{t} &= \mu_{t} + \epsilon_{t},~~~~~~~~~~\epsilon_{t}\sim N(0, \sigma^2_{\epsilon})\\ \mu_{t+1} &= \mu_{t} + \delta_{t} + \xi_{t}, ~~~\xi_{t}\sim N(0,\sigma^2_{\xi }),\\ \delta_{t+1} &= \delta_{t} + \zeta_{t}, ~~~~~~~~~~~~\zeta_{t}\sim N(0,\sigma^2_{\zeta}), \end{aligned} \]
\[ \begin{aligned} Y_{t} &= \mu_{t} + \tau_{t} + \epsilon_{t},~~~~\epsilon_{t}\sim N(0, \sigma^2_{\epsilon})\\ \mu_{t+1} &= \mu_{t} + \delta_{t} + \xi_{t}, ~~~~~~~~~\xi_{t}\sim N(0,\sigma^2_{\xi }),\\ \delta_{t+1} &= \delta_{t} + \zeta_{t}, ~~~~~~~~~~~~~~~~~~~~~\zeta_{t}\sim N(0,\sigma^2_{\zeta})\\ \tau_{t+1}&=-\sum_{s=1}^{S-1} \tau_{t+1-s} + \omega_{t}, ~~~~~~~~~~\omega_{t}\sim N(0,\sigma^2_{\omega }) \end{aligned} \]
pacman::p_load("dplyr", "xts", "ggplot2", "imputeTS")
# In Mac
# guess_encoding("Amtrak.csv")
# Amtrak.data <- read.csv("Amtrak.csv", fileEncoding="EUC-KR")
Amtrak.data <- read.csv("C:/Users/User/Desktop/Amtrak.csv")
ridership.ts <- ts(Amtrak.data$Ridership, start=c(1991,1), end=c(2004,3), freq=12)
autoplot(ridership.ts) +
labs(y="Ridership") +
theme_bw()
# NA 생성성
set.seed(100)
loc <- sample(1:length(ridership.ts), 5) # Randomly Location
ridership.ts.NA <- ridership.ts
ridership.ts.NA[loc] <- NA
# NA 갯수와 위치
xts(ridership.ts.NA, order = as.Date(ridership.ts.NA))
[,1]
1991-01-01 1708.917
1991-02-01 1620.586
1991-03-01 1972.715
1991-04-01 NA
1991-05-01 1974.964
1991-06-01 1862.356
1991-07-01 1939.860
1991-08-01 2013.264
1991-09-01 1595.657
1991-10-01 1724.924
1991-11-01 1675.667
1991-12-01 1813.863
1992-01-01 1614.827
1992-02-01 1557.088
1992-03-01 1891.223
1992-04-01 1955.981
1992-05-01 1884.714
1992-06-01 1623.042
1992-07-01 1903.309
1992-08-01 1996.712
1992-09-01 1703.897
1992-10-01 1810.000
1992-11-01 1861.601
1992-12-01 1875.122
1993-01-01 1705.259
1993-02-01 1618.535
1993-03-01 1836.709
1993-04-01 1957.043
1993-05-01 1917.185
1993-06-01 1882.398
1993-07-01 1933.009
1993-08-01 1996.167
1993-09-01 1672.841
1993-10-01 1752.827
1993-11-01 1720.377
1993-12-01 1734.292
1994-01-01 1563.365
1994-02-01 1573.959
1994-03-01 1902.639
1994-04-01 1833.888
1994-05-01 1831.049
1994-06-01 1775.755
1994-07-01 1867.508
1994-08-01 1906.608
1994-09-01 1685.632
1994-10-01 1778.546
1994-11-01 1775.995
1994-12-01 1783.350
1995-01-01 1548.415
1995-02-01 1496.925
1995-03-01 1798.316
1995-04-01 1732.895
1995-05-01 1772.345
1995-06-01 1761.207
1995-07-01 NA
1995-08-01 1874.820
1995-09-01 1571.309
1995-10-01 1646.948
1995-11-01 1672.631
1995-12-01 1656.845
1996-01-01 1381.758
1996-02-01 1360.852
1996-03-01 1558.575
1996-04-01 1608.420
1996-05-01 1696.696
1996-06-01 1693.183
1996-07-01 1835.516
1996-08-01 1942.573
1996-09-01 1551.401
1996-10-01 1686.508
1996-11-01 1576.204
1996-12-01 1700.433
1997-01-01 1396.588
1997-02-01 1371.690
1997-03-01 1707.522
1997-04-01 1654.604
1997-05-01 1762.903
1997-06-01 1775.800
1997-07-01 1934.219
1997-08-01 2008.055
1997-09-01 1615.924
1997-10-01 1773.910
1997-11-01 1732.368
1997-12-01 1796.626
1998-01-01 1570.330
1998-02-01 1412.691
1998-03-01 1754.641
1998-04-01 1824.932
1998-05-01 1843.289
1998-06-01 1825.964
1998-07-01 1968.172
1998-08-01 1921.645
1998-09-01 1669.597
1998-10-01 1791.474
1998-11-01 1816.714
1998-12-01 1846.754
1999-01-01 1599.427
1999-02-01 1548.804
1999-03-01 1832.333
1999-04-01 1839.720
1999-05-01 1846.498
1999-06-01 NA
1999-07-01 1965.743
1999-08-01 1949.002
1999-09-01 1607.373
1999-10-01 1803.664
1999-11-01 1850.309
1999-12-01 1836.435
2000-01-01 1541.660
2000-02-01 1616.928
2000-03-01 1919.538
2000-04-01 NA
2000-05-01 1992.301
2000-06-01 2009.763
2000-07-01 2053.996
2000-08-01 2097.471
2000-09-01 1823.706
2000-10-01 1976.997
2000-11-01 1981.408
2000-12-01 2000.153
2001-01-01 1683.148
2001-02-01 1663.404
2001-03-01 2007.928
2001-04-01 2023.792
2001-05-01 2047.008
2001-06-01 2072.913
2001-07-01 2126.717
2001-08-01 2202.638
2001-09-01 1707.693
2001-10-01 1950.716
2001-11-01 1973.614
2001-12-01 1984.729
2002-01-01 1759.629
2002-02-01 1770.595
2002-03-01 2019.912
2002-04-01 2048.398
2002-05-01 2068.763
2002-06-01 1994.267
2002-07-01 2075.258
2002-08-01 2026.560
2002-09-01 1734.155
2002-10-01 1916.771
2002-11-01 1858.345
2002-12-01 1996.352
2003-01-01 1778.033
2003-02-01 1749.489
2003-03-01 2066.466
2003-04-01 2098.899
2003-05-01 2104.911
2003-06-01 2129.671
2003-07-01 NA
2003-08-01 2174.360
2003-09-01 1931.406
2003-10-01 2121.470
2003-11-01 2076.054
2003-12-01 2140.677
2004-01-01 1831.508
2004-02-01 1838.006
2004-03-01 2132.446
as.Date(ridership.ts.NA)
[1] "1991-01-01" "1991-02-01" "1991-03-01" "1991-04-01" "1991-05-01"
[6] "1991-06-01" "1991-07-01" "1991-08-01" "1991-09-01" "1991-10-01"
[11] "1991-11-01" "1991-12-01" "1992-01-01" "1992-02-01" "1992-03-01"
[16] "1992-04-01" "1992-05-01" "1992-06-01" "1992-07-01" "1992-08-01"
[21] "1992-09-01" "1992-10-01" "1992-11-01" "1992-12-01" "1993-01-01"
[26] "1993-02-01" "1993-03-01" "1993-04-01" "1993-05-01" "1993-06-01"
[31] "1993-07-01" "1993-08-01" "1993-09-01" "1993-10-01" "1993-11-01"
[36] "1993-12-01" "1994-01-01" "1994-02-01" "1994-03-01" "1994-04-01"
[41] "1994-05-01" "1994-06-01" "1994-07-01" "1994-08-01" "1994-09-01"
[46] "1994-10-01" "1994-11-01" "1994-12-01" "1995-01-01" "1995-02-01"
[51] "1995-03-01" "1995-04-01" "1995-05-01" "1995-06-01" "1995-07-01"
[56] "1995-08-01" "1995-09-01" "1995-10-01" "1995-11-01" "1995-12-01"
[61] "1996-01-01" "1996-02-01" "1996-03-01" "1996-04-01" "1996-05-01"
[66] "1996-06-01" "1996-07-01" "1996-08-01" "1996-09-01" "1996-10-01"
[71] "1996-11-01" "1996-12-01" "1997-01-01" "1997-02-01" "1997-03-01"
[76] "1997-04-01" "1997-05-01" "1997-06-01" "1997-07-01" "1997-08-01"
[81] "1997-09-01" "1997-10-01" "1997-11-01" "1997-12-01" "1998-01-01"
[86] "1998-02-01" "1998-03-01" "1998-04-01" "1998-05-01" "1998-06-01"
[91] "1998-07-01" "1998-08-01" "1998-09-01" "1998-10-01" "1998-11-01"
[96] "1998-12-01" "1999-01-01" "1999-02-01" "1999-03-01" "1999-04-01"
[101] "1999-05-01" "1999-06-01" "1999-07-01" "1999-08-01" "1999-09-01"
[106] "1999-10-01" "1999-11-01" "1999-12-01" "2000-01-01" "2000-02-01"
[111] "2000-03-01" "2000-04-01" "2000-05-01" "2000-06-01" "2000-07-01"
[116] "2000-08-01" "2000-09-01" "2000-10-01" "2000-11-01" "2000-12-01"
[121] "2001-01-01" "2001-02-01" "2001-03-01" "2001-04-01" "2001-05-01"
[126] "2001-06-01" "2001-07-01" "2001-08-01" "2001-09-01" "2001-10-01"
[131] "2001-11-01" "2001-12-01" "2002-01-01" "2002-02-01" "2002-03-01"
[136] "2002-04-01" "2002-05-01" "2002-06-01" "2002-07-01" "2002-08-01"
[141] "2002-09-01" "2002-10-01" "2002-11-01" "2002-12-01" "2003-01-01"
[146] "2003-02-01" "2003-03-01" "2003-04-01" "2003-05-01" "2003-06-01"
[151] "2003-07-01" "2003-08-01" "2003-09-01" "2003-10-01" "2003-11-01"
[156] "2003-12-01" "2004-01-01" "2004-02-01" "2004-03-01"
ggplot_na_distribution(as.numeric(ridership.ts.NA),
x_axis_labels = as.Date(ridership.ts.NA), subtitle = "") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5))
# ggplot_na_distribution(as.numeric(ridership.ts),
# x_axis_labels = as.Date(Amtrak.data$Month, format = "%m/%d/%Y"))
# 1. Using Basic Structural Model (Local Linear Trend + Seasonal) Due to frequency of ridership.ts.NA
Kalman.Imp <- na_kalman(ridership.ts.NA)
# Imputation 확인Kalman.Imp[loc]
[1] 1849.782 1945.737 2221.376 1930.777 1817.434
ggplot_na_imputations(ridership.ts.NA, Kalman.Imp, x_axis_labels = as.Date(ridership.ts.NA))
# 실제값과 비교비교
ggplot_na_imputations(ridership.ts.NA, Kalman.Imp, ridership.ts)
# 2. Using Local Linear Trend
Kalman.Imp <- na_kalman(ridership.ts.NA, type="trend")
# Imputation 확인Kalman.Imp[loc]
[1] 1810.990 1874.212 2021.916 1768.566 1692.744
ggplot_na_imputations(ridership.ts.NA, Kalman.Imp, x_axis_labels = as.Date(ridership.ts.NA))
# 실제값과 비교비교
ggplot_na_imputations(ridership.ts.NA, Kalman.Imp, ridership.ts)
# 3. Auto.arima
Kalman.Imp.arima <- na_kalman(ridership.ts.NA, model = "auto.arima")
# Imputation 확인Kalman.Imp.arima[loc]
[1] 1848.001 1952.100 2190.084 1990.073 1855.564
ggplot_na_imputations(ridership.ts.NA, Kalman.Imp.arima, x_axis_labels = as.Date(ridership.ts.NA))
# 실제값과 비교비교
ggplot_na_imputations(ridership.ts.NA, Kalman.Imp.arima, ridership.ts)
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".