INLA는 잠재 가우스 모형 (latent gaussian model)에 특화된 방법이며 복잡한 사후 분포를 단순화된 라플라스 근사치를 이용하고 적분을 유한개의 가중 합으로 계산하기 때문에 MCMC에 비해 비교적 짧은 계산시간에 정확한 결과를 제공한다는 이점이 있다. 이러한 장점을 이용하여 2019년 서울에서 발생한 강간강제추행을 분석하기 위해 R-INLA (Integrated Nested Laplace Approximations)를 적용해보았다. 게다가 서울 강간강체주행범죄에 영향을 미칠 수 있는 요인으로 치안시설 수, 안심귀가스카우트 이용수, 여성인구비를 고려해보았다. 데이터는 서울 열린데이터광장에서 수집하였으며, BYM Model을 이용하여 공간 분석을 실시하였다.

BYM model with Ecological regression

서울 강간강제추행범죄는 count data이기 때문에 Poisson regression이 사용된다. $i$번째 자치구의 강간강제추행 발생 건수를 $y_{i}$라고 할 때, \[\begin{align*} y_{i} &\thicksim Poisson(\lambda_{i}), \;\; \lambda_{i} = E_{i}\rho_{i}\\ \eta_{i} &= \log{\rho_{i}} = b_{0} + u_{i} + v_{i} + \sum_{j} \beta_{j}x_{ji} \end{align*}\]

$E_{i}$ : $i$번째 자치구에서 예상된 수
$\rho_{i}$ : $i$번째 자치구의 상대적 위험 (Relative risk)
$b_{0}$ : 모든 자치구에서의 평균 $y$
$u_{i}$ : 구조화된 공간 효과 (Structured spatial effect)
$v_{i}$ : 비구조화된 공간 효과 (Unstructured spatial effect)
$\beta_{j}$ : 고정 효과 (Fixed effect)

Real Data

Loading Data

pacman::p_load("maptools",     # For readShapePoly
               "spdep",        # For poly2nb
               "dplyr", 
               "ggplot2",
               "RColorBrewer", # For brewer.pal
               "INLA")



dat.2019   <- read.csv("2019_crime.csv",header=T) 

# Convert rows in the order of ESPI_PK
dat.2019 <- dat.2019[order(dat.2019$ESRI_PK),]

head(dat.2019)

   ESRI_PK year district rape pop_total pop_femal sec_fac
10       0 2019   도봉구   90    335631    171670      10
12       1 2019   은평구  194    484546    251186      22
6        2 2019 동대문구  168    363023    184533      21
20       3 2019   동작구  251    408912    211206      18
18       4 2019   금천구  153    251820    122866      13
17       5 2019   구로구  226    439371    219769      17
   safe_return_use
10           12602
12           10145
6            20776
20           18766
18           17946
17           15233

Loading .shp

seoul.map   <- maptools::readShapePoly("./TL_SCCO_SIG_W_SHP/TL_SCCO_SIG_W.shp")   # Call .shp file
seoul.nb    <- poly2nb(seoul.map)      # Builds a neighbours list based on regions with contiguous boundaries
seoul.listw <- nb2listw(seoul.nb)      # Supplements a neighbours list with spatial weights for the chosen coding scheme
seoul.mat   <- nb2mat(seoul.nb)        # Generates a weights matrix for a neighbours list with spatial weights for the chosen coding scheme
                                       # Object of class "nb"

Moran’s I Test

Spatial weights matrix를 이용하여 공간 상관관계가 존재하는 지 Test를 실시하였다.

### moran.test(observation, listw : a listw object created for example by nb2listw)
moran.test(dat.2019$rape,seoul.listw)


    Moran I test under randomisation

data:  dat.2019$rape  
weights: seoul.listw    

Moran I statistic standard deviate = 2.1969, p-value = 0.01401
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
       0.22560008       -0.04166667        0.01480007

➤ $p$값이 0.01401로 유의수준 5%하에서 $p$값이 0.05보다 작기 때문에 귀무가설을 기각한다. 즉, 강간강제추행에 대하여 공간상관관계가 존재한다.

Expected Case

흔하게 사용하는 예상되는 수 ($E_{i}$)는 \[\begin{align*} E_{i} = n_{i}\frac{\sum y_{i}}{\sum n_{i}}, \end{align*}\]

$n_{i}$ : $i$ 번째 지역의 인구수

(참고 : Bayesian Disease Mapping: Hierarchical Modeling in Spatial Epidemiology p.85)

dat.2019$E <- sum(dat.2019$rape)/sum(dat.2019$pop_total)*dat.2019$pop_total

Adjacency Matrix

INLA에서는 근접행렬을 이용하여 그래프로 나타낼 수 있다.

H         <- inla.read.graph(filename="Seoul.graph")
image(inla.graph2matrix(H),xlab="",ylab="")

Generate Variables

dat.2019 <- dat.2019 %>%
    mutate(pop_female_rate = pop_femal/pop_total)

var <- c("pop_female_rate", "sec_fac", "safe_return_use")

summarise_at(dat.2019, var, c("mean", "sd"))

  pop_female_rate_mean sec_fac_mean safe_return_use_mean
1            0.5123515        17.72              14038.2
  pop_female_rate_sd sec_fac_sd safe_return_use_sd
1        0.009309344   4.158525           7354.122

➤ 치안시설 수, 안심귀가스카우트 이용수의 표준편차가 큰 것을 알 수 있다. 이를 줄이기 위하여 자연로그 변환을 실시하였다.

dat.2019 <- dat.2019 %>%
  mutate(ln_sec_fac = log(sec_fac)) %>%                                 
  mutate(ln_safe_return_use = log(safe_return_use))

BYM Model

BYM Model with Ecological regression

\[\begin{align*} \eta_{i} &= \log{\rho_{i}} = b_{0} + u_{i} + v_{i} + \sum_{j} \beta_{j}x_{ji},\\ \hat{\rho_{i}} &= \frac{y_{i}}{E_{i}} = \frac{y_{i}/n_{i}}{\sum y_{i}/ \sum n_{i}}. \end{align*}\]

dat.2019$ID.area  <- 1:25  # The Identifiers For The Boroughs 

formula <- rape ~ 1 + f(ID.area, model="bym", graph = seoul.adj) +
                 ln_sec_fac + ln_safe_return_use + pop_female_rate

bym     <- inla(formula,family = "poisson", data = dat.2019, E = E, # E = E or offset = log(E)
                control.predictor=list(compute=TRUE),               # Compute the marginals of the model parameters
                control.compute = list(dic = TRUE, 
                                       waic = TRUE, cpo = TRUE))    # Compute some model choice criteria

summary(bym)


Call:
   c("inla(formula = formula, family = \"poisson\", data = 
   dat.2019, ", " E = E, control.compute = list(dic = TRUE, waic 
   = TRUE, cpo = TRUE), ", " control.predictor = list(compute = 
   TRUE))") 
Time used:
    Pre = 0.772, Running = 0.861, Post = 0.114, Total = 1.75 
Fixed effects:
                     mean     sd 0.025quant 0.5quant 0.975quant
(Intercept)        -5.887  6.765    -19.176   -5.916      7.546
ln_sec_fac          0.441  0.476     -0.499    0.440      1.384
ln_safe_return_use  0.002  0.274     -0.542    0.002      0.543
pop_female_rate     8.878 11.523    -14.040    8.927     31.493
                     mode kld
(Intercept)        -5.969   0
ln_sec_fac          0.439   0
ln_safe_return_use  0.002   0
pop_female_rate     9.018   0

Random effects:
  Name    Model
    ID.area BYM model

Model hyperparameters:
                                             mean      sd 0.025quant
Precision for ID.area (iid component)        3.76    1.13       1.95
Precision for ID.area (spatial component) 1826.54 1822.34     116.97
                                          0.5quant 0.975quant   mode
Precision for ID.area (iid component)         3.63       6.36   3.38
Precision for ID.area (spatial component)  1284.26    6661.13 316.24

Expected number of effective parameters(stdev): 24.62(0.114)
Number of equivalent replicates : 1.02 

Deviance Information Criterion (DIC) ...............: 231.40
Deviance Information Criterion (DIC, saturated) ....: 49.69
Effective number of parameters .....................: 24.64

Watanabe-Akaike information criterion (WAIC) ...: 224.20
Effective number of parameters .................: 12.57

Marginal log-Likelihood:  -165.03 
CPO and PIT are computed

Posterior marginals for the linear predictor and
 the fitted values are computed

$\log(치안시설수)$가 1 증가할 때, 상대적으로 risk는 $\exp(0.441*1)=1.55$ 증가한다.
$\log(안심귀가스카우트 이용수)$가 1 증가할 때, 상대적 위험은 $\exp(0.002*1)=1.00$ 증가한다.
여성인구비가 0.01 증가할 때, 상대적 위험은 $\exp(8.878*0.01)=1.09$ 증가한다.

Significance for Variables

회귀계수 ($\beta$)가 0과 다른지 즉, 예측변수가 강간강제추행에 영향을 미치는 지 알아보기 위해서 Posterior Probability를 이용하였다. 왜냐하면 Posterior probability는 Bayesian에서 $p$-value에 대응되는 것으로 고려되기 때문이다.

$\beta>0$일 때, $P(\beta>0 \vert Y)$이 80~90%이상이면 유의
$\beta<0$일 때, $P(\beta<0 \vert Y)$이 80~90%이상이면 유의

marginal <- inla.smarginal(bym$marginals.fixed$ln_sec_fac)
marginal <- data.frame(marginal)
ggplot(marginal, aes(x = x, y = y)) +
  labs(x = expression(beta[log(sec_fac)]), y = "Density") +
  geom_vline(xintercept = 0, col = "red") +
  geom_hline(yintercept = 0, col = "grey") +
  geom_line() +
  theme_classic()

[1] 0.8273022

➤ 치안시설 수에 대한 회귀계수의 Posterior probability가 80%이상이므로 치안시설 수는 강간강제추행에 유의하다. 즉, 치안시설수가 많은수록 강간강제추행의 위험은 증가한다.

marginal <- inla.smarginal(bym$marginals.fixed$ln_safe_return_use)
marginal <- data.frame(marginal)
ggplot(marginal, aes(x = x, y = y)) +
  labs(x = expression(beta[log(safe_return_use)]), y = "Density") +
  geom_vline(xintercept = 0, col = "red") +
  geom_hline(yintercept = 0, col = "grey") +
  geom_line() +
  theme_classic()

[1] 0.5012571

➤ 안심귀가스카우트 이용수에 대한 회귀계수의 Posterior probability가 80%이하이므로 안심귀가스카우트 이용수는 강간강제추행에 유의하지 않다.

marginal <- inla.smarginal(bym$marginals.fixed$pop_female_rate)
marginal <- data.frame(marginal)
ggplot(marginal, aes(x = x, y = y)) +
  labs(x = expression(beta[pop_female_rate]), y = "Density") +
  geom_vline(xintercept = 0, col = "red") +
  geom_hline(yintercept = 0, col = "grey") +
  geom_line() +
  theme_classic()

[1] 0.7832572

➤ 여성인구비에 대한 회귀계수의 Posterior probability가 80%이하이므로 여성인구비는 강간강제추행에 유의하지 않다.

Mapping for spatial effect

Random Effect

# Random Effect for spatial structure (ui+vi)
csi  <- bym$marginals.random$ID.area[1:25]             # Extract the marginal posterior distribution for each element of the random effects
zeta <- lapply(csi,function(x) inla.emarginal(exp,x))  # The exponential transformation and calculate the posterior mean for each of them

# Define the cutoff for zeta
zeta.cutoff <- c(0.4, 0.8, 1.0, 1.6, 2.0, 2.4)

# Transform in categorical variable
cat.zeta <- cut(unlist(zeta),breaks=zeta.cutoff,
                include.lowest=TRUE)

➤ BYM Model은 $\xi_{i}=u_{i}+v_{i}$ 와 $u_{i}$를 모수화기 때문에 marginals.random$ID.area의 총 2K개가 존재한다.

➤ 우리가 관심있는 것은 $\xi_{i}=u_{i}+v_{i}$이므로 marginals.random$ID.area[1:25]를 추출한다.

Exceedance probabilities

단순히 공간효과만 비교하는 것보다 초과 확률(Exceedance probabilities)을 비교하는 것도 용이하다. 왜냐하면 초과 확률은 불확실성(Uncertainty)을 고려하기 때문이다. 또한, 단순히 점추정이 1이 넘는 지역이라도 불확실성을 고려한 posterior probability를 보면 cutoff value를 넘지 못할 수 있으며, 그 지역이 위험한 지역이라 선언하기 충분한 신뢰성이 없을 수 있다.

a        <- 0
prob.csi <- lapply(csi, function(x) {1 - inla.pmarginal(a, x)})   # inla.pmarginal : P(x<a)

# Define the cutoff for zeta
prob.cutoff <-  c(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)

cat.prob <- cut(unlist(prob.csi),breaks=prob.cutoff,
                include.lowest=TRUE)

➤ $P(\exp(\xi_{i}>1)\vert Y) = P(\xi_{i}>0\vert Y)$.

maps.cat.zeta <- data.frame(ESRI_PK=dat.2019$ESRI_PK, cat.zeta=cat.zeta, cat.prob=cat.prob)

#Add the categorized zeta to the spatial polygon
data.boroughs           <- attr(seoul.map, "data")
attr(seoul.map, "data") <- merge(data.boroughs, maps.cat.zeta,
                                 by="ESRI_PK")
# For adding name of Seoul
lbls <- as.character(seoul.map$SIG_ENG_NM)             # 자치구명
spl  <- list('sp.text', coordinates(seoul.map), lbls, cex=.7)

# Map zeta
spplot(obj=seoul.map, zcol= "cat.zeta", sp.layout = spl,
       col.regions=brewer.pal(5,"Blues"), asp=1)

➤ $\exp(\xi_{i})$가 1 보다 크면 근사적으로 $\rho_{i} > 1$이므로 치안시설수, 안심귀가스카우트 이용수, 여성인구비를 고려한 후, $i$번째 지역은 전체 지역보다 높은 위험 (risk) 을 가진다.

치안시설수, 안심귀가스카우트 이용수, 여성인구비를 고려한 후, 한강을 기준으로 강남 지역이 강북 지역보다 높은 위험을 가지는 지역이 많다.
치안시설수, 안심귀가스카우트 이용수, 여성인구비를 고려한 후, 강북 지역의 광진, 용산, 마포, 중구, 종로를 제외한 지역은 전체지역보다 낮은 위험을 가진다.

# Map prob
spplot(obj=seoul.map, zcol= "cat.prob",sp.layout = spl,
       col.regions=brewer.pal(5,"Blues"), asp=1)

Estimated Relative Risk

위험 요인들과 공간효과를 고려하여 추정된 상대적 위험(Relative Risk)는 summary.fitted.values을 이용하여 확인할 수 있다.

# Estimated Relative risk
est.rr <- bym$summary.fitted.values$mean          

rr.cutoff <- c(0.4, 0.8, 1.0, 1.6, 2.0, 2.4)

cat.rr      <- cut(unlist(est.rr),breaks = rr.cutoff,
                   include.lowest = TRUE)

# Posterior probability p(Relative risk > a1|y)
mar.rr  <- bym$marginals.fitted.values
a1      <- 1
prob.rr <- lapply(mar.rr, function(x) {1 - inla.pmarginal(a1, x)})

cat.rr.prob <- cut(unlist(prob.rr),breaks = prob.cutoff,
                   include.lowest = TRUE)


maps.cat.rr <- data.frame(ESRI_PK=dat.2019$ESRI_PK, cat.rr=cat.rr, cat.rr.prob=cat.rr.prob)

#Add the categorized zeta to the spatial polygon
data.boroughs           <- attr(seoul.map, "data")
attr(seoul.map, "data") <- merge(data.boroughs, maps.cat.rr,
                                 by="ESRI_PK")


# Map 
spplot(obj=seoul.map, zcol= "cat.rr",sp.layout = spl,
       col.regions=brewer.pal(5,"Blues"), asp=1)

➤ $\rho_{i}$가 1 보다 크면, 위험요인과 공간효과를 고려했을 때 $i$번째 지역은 전체 지역보다 높은 위험 (risk) 을 가진다.

한강을 기준으로 강남 지역이 강북 지역보다 높은 위험을 가지는 지역이 많다.
강북 지역의 광진, 용산, 마포, 중구, 종로를 제외한 지역은 전체지역보다 낮은 위험을 가진다.

# Map prob
spplot(obj=seoul.map, zcol= "cat.rr.prob",sp.layout = spl,
       col.regions=brewer.pal(5,"Blues"), asp=1)

➤ 0.8 이상인 지역을 위험이 높은 hot-spot 지역, 0.2 이하인 지역을 위험이 낮은 cool-spot 지역이라 한다.

BYM Model with Ecological regression (Change prior for fixed effect)

기본값으로 회귀계수에 대한 prior은 Gaussian Distribution이며, 평균은 0이고 precison($1/\sigma$)은 0.001이다.

prior.fixed <- list(mean.intercept = 0, prec.intercept = 0.05,
                    mean = 0, prec = 0.05)

bym1 <- inla(formula,family = "poisson", data = dat.2019, E = E,
                   control.predictor=list(compute=TRUE),
                   control.fixed = prior.fixed,
                   control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE))

summary(bym1)


Call:
   c("inla(formula = formula, family = \"poisson\", data = 
   dat.2019, ", " E = E, control.compute = list(dic = TRUE, waic 
   = TRUE, cpo = TRUE), ", " control.predictor = list(compute = 
   TRUE), control.fixed = prior.fixed)" ) 
Time used:
    Pre = 0.41, Running = 0.832, Post = 0.0755, Total = 1.32 
Fixed effects:
                     mean    sd 0.025quant 0.5quant 0.975quant   mode
(Intercept)        -1.121 2.738     -6.474   -1.131      4.282 -1.150
ln_sec_fac          0.476 0.458     -0.430    0.476      1.380  0.476
ln_safe_return_use -0.072 0.238     -0.543   -0.072      0.395 -0.070
pop_female_rate     0.746 3.846     -6.820    0.751      8.279  0.761
                   kld
(Intercept)          0
ln_sec_fac           0
ln_safe_return_use   0
pop_female_rate      0

Random effects:
  Name    Model
    ID.area BYM model

Model hyperparameters:
                                             mean      sd 0.025quant
Precision for ID.area (iid component)        3.82    1.12       2.02
Precision for ID.area (spatial component) 1831.81 1824.48     118.39
                                          0.5quant 0.975quant   mode
Precision for ID.area (iid component)         3.69       6.39   3.45
Precision for ID.area (spatial component)  1289.58    6669.68 320.75

Expected number of effective parameters(stdev): 24.60(0.116)
Number of equivalent replicates : 1.02 

Deviance Information Criterion (DIC) ...............: 231.37
Deviance Information Criterion (DIC, saturated) ....: 49.65
Effective number of parameters .....................: 24.61

Watanabe-Akaike information criterion (WAIC) ...: 224.18
Effective number of parameters .................: 12.56

Marginal log-Likelihood:  -163.13 
CPO and PIT are computed

Posterior marginals for the linear predictor and
 the fitted values are computed

BYM Model with Risk Factor