Reference: R을 활용한 다변량 자료분석 방법론 (Multivariate Data Analysis Methods Using R), by 강현철, 연규필, and 한상태

1. TV Viewing by Program Type Data

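This section uses “tvprog.csv”, one of the data files that accompany the book R을 활용한 다변량 자료분석 방법론, published by 자유아카데미.
The data record how much each of 1,000 respondents watches eight types of TV programs. The variables are as follows.
1. \(x_1\) : news/current affairs
2. \(x_2\) : drama
3. \(x_3\) : movies
4. \(x_4\) : shows/entertainment
5. \(x_5\) : sports
6. \(x_6\) : documentaries
7. \(x_7\) : lifestyle information
8. \(x_8\) : children's programs/cartoons
Larger values mean more viewing: “1 = never watch”, “2 = do not watch”, “3 = rarely watch”, “4 = watch a little”, “5 = watch to some extent”, “6 = watch a great deal”, and “. = don't know/refused”.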

# Load the data
tvprog <- read.csv("C:/Users/User/Desktop/tvprog.csv")
head(tvprog)

  ID x1 x2 x3 x4 x5 x6 x7 x8 gender married age region job house
1  1  2  1  1  1  4  2  1  1      1       1   2      1   3     1
2  2  5 NA  2 NA  1  1  1  1      1       1   2      1   3     2
3  3  5  2  2  2  6  5  5  1      1       2   3      1   1     1
4  4  4  4  3  4  6  6  4  5      1       2   3      1   4     3
5  5  5  2  3  2  3  4  4  1      1       2   3      1   4     3
6  6  4  5  5  4  4  5  4  1      1       2   3      1   2     1

# Preprocess the data
pacman::p_load("dplyr")

tvprog.X <- tvprog %>%
  na.omit() %>%             # drop rows that contain NA
  .[,2:9]                   # keep columns 2 to 9 (x1 to x8)

head(tvprog.X)

  x1 x2 x3 x4 x5 x6 x7 x8
1  2  1  1  1  4  2  1  1
3  5  2  2  2  6  5  5  1
4  4  4  3  4  6  6  4  5
5  5  2  3  2  3  4  4  1
6  4  5  5  4  4  5  4  1
7  4  5  3  5  3  5  5  3

1-1. Correlation Matrix and Eigenvalues

# Correlation matrix
round(cor(tvprog.X), 3)

      x1    x2    x3    x4    x5    x6    x7    x8
x1 1.000 0.331 0.257 0.302 0.411 0.488 0.470 0.150
x2 0.331 1.000 0.443 0.521 0.305 0.243 0.418 0.258
x3 0.257 0.443 1.000 0.509 0.374 0.365 0.369 0.362
x4 0.302 0.521 0.509 1.000 0.395 0.283 0.418 0.365
x5 0.411 0.305 0.374 0.395 1.000 0.510 0.430 0.237
x6 0.488 0.243 0.365 0.283 0.510 1.000 0.574 0.182
x7 0.470 0.418 0.369 0.418 0.430 0.574 1.000 0.282
x8 0.150 0.258 0.362 0.365 0.237 0.182 0.282 1.000

# Principal component analysis
tvprog.pca <- princomp(tvprog.X,
                       cor = TRUE) # PCA based on the correlation matrix
round(tvprog.pca$sdev^2, 3)        # variances of the components = eigenvalues

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 
 3.597  1.143  0.769  0.640  0.557  0.506  0.442  0.346

summary(tvprog.pca)                # proportion of variance explained by each component

Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4
Standard deviation     1.8966376 1.0689307 0.87702566 0.79969166
Proportion of Variance 0.4496543 0.1428266 0.09614675 0.07993834
Cumulative Proportion  0.4496543 0.5924809 0.68862763 0.76856597
                           Comp.5     Comp.6     Comp.7     Comp.8
Standard deviation     0.74615128 0.71150455 0.66515111 0.58827356
Proportion of Variance 0.06959272 0.06327984 0.05530325 0.04325822
Cumulative Proportion  0.83815869 0.90143853 0.95674178 1.00000000

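Result! The first eigenvalue is 3.597, which is about 3.597/8 = 45% of the total variance (for standardized variables the total variance equals the number of variables, 8). The second eigenvalue is 1.143, which is 1.143/8 = 14% of the total variance. Together, the first two eigenvalues explain 45% + 14% = 59% of the total variance.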
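
The arithmetic in this result can be reproduced with a few lines of base R. This is only a sketch: the eigenvalues are typed in from the rounded princomp() output above, and the names eigenvalues, prop_var, and cum_var are introduced here for illustration.

```r
# Eigenvalues copied from the rounded princomp() output above
eigenvalues <- c(3.597, 1.143, 0.769, 0.640, 0.557, 0.506, 0.442, 0.346)

# For a correlation matrix, the total variance equals the number of variables
total_var <- length(eigenvalues)

prop_var <- eigenvalues / total_var   # proportion explained by each component
cum_var  <- cumsum(prop_var)          # cumulative proportion

round(rbind(prop_var, cum_var), 3)
```

The two rows should agree, up to rounding, with the "Proportion of Variance" and "Cumulative Proportion" rows printed by summary().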

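1-2. KMO Measure of Sampling Adequacy

The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is an index that compares the magnitudes of the observed correlation coefficients with those of the partial correlation coefficients. The larger the value, the stronger the indication that common latent factors underlie the measured variables.
The KMO measure can be computed with the function KMO() from the package psych.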
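
For reference, the overall KMO index is commonly written as follows. The symbols are notation introduced here rather than taken from the book: \(r_{ij}\) is the observed correlation between variables \(i\) and \(j\), and \(a_{ij}\) is their partial correlation given all the other variables.

\[
\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}{\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}}
\]

When the partial correlations are small relative to the observed correlations, the index approaches 1.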

pacman::p_load("psych")

KMO(tvprog.X)

Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = tvprog.X)
Overall MSA =  0.84
MSA for each item = 
  x1   x2   x3   x4   x5   x6   x7   x8 
0.87 0.83 0.85 0.84 0.88 0.77 0.85 0.87

Result! The overall measure of sampling adequacy (MSA) is 0.84.
Caution! Kaiser (1974) proposed the following guidelines for the KMO measure.

KMO measure	Rating
\(\ge\) 0.90	Marvelous
0.80~0.89	Meritorious
0.70~0.79	Middling
0.60~0.69	Mediocre
0.50~0.59	Miserable
\(<\) 0.50	Unacceptable

That is, a small value of this measure indicates that the variables were poorly chosen for factor analysis. Here the KMO measure is 0.84, so this data set falls in the “Meritorious” range.
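
As a small convenience, the guideline table can be turned into a lookup. The helper kaiser_label() below is hypothetical (it is not part of psych), and it assumes each band includes its lower bound.

```r
# Hypothetical helper: map a KMO value to Kaiser's (1974) rating
kaiser_label <- function(kmo) {
  breaks <- c(0.5, 0.6, 0.7, 0.8, 0.9)              # lower bounds of the bands
  labels <- c("Unacceptable", "Miserable", "Mediocre",
              "Middling", "Meritorious", "Marvelous")
  labels[findInterval(kmo, breaks) + 1]
}

kaiser_label(0.84)   # overall MSA reported above
```

Because findInterval() is vectorized, the same call also works on the per-item MSA values.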

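1-3. Bartlett's Test of Sphericity

Bartlett's test of sphericity tests whether the null hypothesis “the correlation matrix is an identity matrix” can be rejected.
- In effect, it tests “null hypothesis: no common factor exists” against “alternative hypothesis: a common factor exists”.
  - This is because, if the correlation matrix is the identity matrix, the variables are uncorrelated with one another, so factor analysis is not appropriate.
Therefore, factor analysis is worth carrying out on the data only when the \(p\)-value of Bartlett's test is below the significance level, so that the null hypothesis is rejected.
Bartlett's test of sphericity can be carried out with the function cortest.bartlett() from the package psych.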

pacman::p_load("psych")

tvprog.cor <- cor(tvprog.X)           # correlation matrix
cortest.bartlett(tvprog.cor,          # correlation matrix
                 n = nrow(tvprog.X))  # sample size

$chisq
[1] 2371.949

$p.value
[1] 0

$df
[1] 28

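Result! Bartlett's test gives \(\chi^2=\) 2371.949 with 28 degrees of freedom, and the \(p\)-value prints as 0, meaning it is too small to display. Since the \(p\)-value is below 0.05, the null hypothesis is rejected at the 5% significance level. That is, the data can be said to contain at least one common factor.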
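
To see where these numbers come from, the statistic can be recomputed in base R from the usual formula \(\chi^2 = -\left[(n-1) - (2p+5)/6\right]\ln|R|\) with \(p(p-1)/2\) degrees of freedom. The sketch below makes two assumptions: the correlation matrix is typed in from the rounded output in section 1-1, and n = 977 is the number of complete cases reported by the psych output elsewhere in this document. The helper name bartlett_sphericity() is hypothetical.

```r
# Correlation matrix copied from the rounded output in section 1-1
R <- matrix(c(
  1.000, 0.331, 0.257, 0.302, 0.411, 0.488, 0.470, 0.150,
  0.331, 1.000, 0.443, 0.521, 0.305, 0.243, 0.418, 0.258,
  0.257, 0.443, 1.000, 0.509, 0.374, 0.365, 0.369, 0.362,
  0.302, 0.521, 0.509, 1.000, 0.395, 0.283, 0.418, 0.365,
  0.411, 0.305, 0.374, 0.395, 1.000, 0.510, 0.430, 0.237,
  0.488, 0.243, 0.365, 0.283, 0.510, 1.000, 0.574, 0.182,
  0.470, 0.418, 0.369, 0.418, 0.430, 0.574, 1.000, 0.282,
  0.150, 0.258, 0.362, 0.365, 0.237, 0.182, 0.282, 1.000
), nrow = 8, byrow = TRUE)

# Hypothetical helper implementing the usual sphericity statistic
bartlett_sphericity <- function(R, n) {
  p     <- ncol(R)
  chisq <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
  df    <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

res <- bartlett_sphericity(R, n = 977)  # n assumed from the psych output
res
```

Because the input matrix is rounded to three decimals, the statistic should land close to, but not exactly on, the value printed by cortest.bartlett().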

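1-4. Factor Analysis

Factor analysis can be carried out with the function principal() from the package psych.
- This function performs factor analysis with factors extracted by the principal component method.
When raw data are supplied, the function internally computes the correlation matrix and bases the factor analysis on it.
- The option nfactors sets the number of factors, and the option rotate sets the rotation method.
- If the option cor = "cov" is given, the covariance matrix is computed and used instead.
For details on the other options, see the help page for principal().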

# 1. Using the data matrix as input
pacman::p_load("psych")

tvprog.fa <- principal(tvprog.X,            # data matrix
                       cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                       nfactors = 2,        # number of factors
                       rotate = "varimax")  # rotation method
print(tvprog.fa,
      sort = TRUE,                          # sort variables by the size of their loadings
      digits = 5)                           # number of decimal places

Principal Components Analysis
Call: principal(r = tvprog.X, nfactors = 2, rotate = "varimax", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
   item     RC1     RC2      h2      u2    com
x6    6 0.83888 0.11744 0.71751 0.28249 1.0392
x1    1 0.76661 0.11683 0.60134 0.39866 1.0464
x7    7 0.70755 0.34881 0.62230 0.37770 1.4589
x5    5 0.67094 0.29586 0.53770 0.46230 1.3747
x4    4 0.25733 0.76882 0.65731 0.34269 1.2213
x3    3 0.27090 0.71821 0.58921 0.41079 1.2789
x8    8 0.02628 0.69287 0.48076 0.51924 1.0029
x2    2 0.26517 0.68075 0.53373 0.46627 1.2966

                          RC1     RC2
SS loadings           2.45281 2.28703
Proportion Var        0.30660 0.28588
Cumulative Var        0.30660 0.59248
Proportion Explained  0.51749 0.48251
Cumulative Proportion 0.51749 1.00000

Mean item complexity =  1.2
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.09581 
 with the empirical chi square  502.1939  with prob <  5.0056e-99 

Fit based upon off diagonal values = 0.93676

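Caution! The function print() displays the main results of the factor analysis. The output contains the standardized factor loadings (RC), the communality h2 (for each variable, the sum of its squared loadings), and the uniqueness u2 (1 - h2). In addition, SS loadings is the amount of variance explained by each factor, and Proportion Var is the share of the total variance explained by each factor.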
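
These definitions can be checked directly against the printed table. The sketch below types in the rotated loadings from the output above (so the results match only up to rounding); the object names L, h2, u2, and ss are chosen here for illustration.

```r
# Rotated loadings copied from the print() output above
L <- matrix(c(0.83888, 0.11744,
              0.76661, 0.11683,
              0.70755, 0.34881,
              0.67094, 0.29586,
              0.25733, 0.76882,
              0.27090, 0.71821,
              0.02628, 0.69287,
              0.26517, 0.68075),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("x6", "x1", "x7", "x5", "x4", "x3", "x8", "x2"),
                            c("RC1", "RC2")))

h2 <- rowSums(L^2)   # communality: sum of squared loadings per variable
u2 <- 1 - h2         # uniqueness
ss <- colSums(L^2)   # SS loadings: sum of squared loadings per factor

round(cbind(h2, u2), 5)
round(rbind(ss, prop_var = ss / nrow(L)), 5)
```

Dividing the SS loadings by the number of variables gives the "Proportion Var" row.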

# Print the eigenvalues of the correlation matrix
print(tvprog.fa$values,
      digits = 3)             # number of significant digits

[1] 3.597 1.143 0.769 0.640 0.557 0.506 0.442 0.346

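Result! These are identical to the eigenvalues obtained above with the function princomp().

# Print the factor loadings
print(tvprog.fa$loadings,
      digits = 3,             # number of decimal places
      cut    = 0)             # loadings smaller than this value are not printed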


Loadings:
   RC1   RC2  
x1 0.767 0.117
x2 0.265 0.681
x3 0.271 0.718
x4 0.257 0.769
x5 0.671 0.296
x6 0.839 0.117
x7 0.708 0.349
x8 0.026 0.693

                 RC1   RC2
SS loadings    2.453 2.287
Proportion Var 0.307 0.286
Cumulative Var 0.307 0.592

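Result! A factor loading measures the strength of association between an observed variable and a factor. In the output, the variables that load highly on the first factor are \(x_1\) (news/current affairs), \(x_5\) (sports), \(x_6\) (documentaries), and \(x_7\) (lifestyle information), so the first factor can be interpreted as an "information-seeking type". The variables that load highly on the second factor are \(x_2\) (drama), \(x_3\) (movies), \(x_4\) (shows/entertainment), and \(x_8\) (children's programs/cartoons), so the second factor can be interpreted as an "entertainment-seeking type".
In addition, of the total variance of the (standardized) observed variables, the two factors explain 2.453 and 2.287 respectively, which is 2.453/8 = 30.7% and 2.287/8 = 28.6% of the total variance. Together the two factors therefore explain about 59.2% of the total variance (the output reports a cumulative proportion of 0.592).

# Factor loading plot
pacman::p_load("psych")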

biplot(x = tvprog.fa$loadings[, c(1,2)],
       y = tvprog.fa$loadings[, c(1,2)],
       xlabs = colnames(tvprog.X),
       ylabs = colnames(tvprog.X))

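Result! The plot shows that \(x_1\) (news/current affairs), \(x_5\) (sports), \(x_6\) (documentaries), and \(x_7\) (lifestyle information) lie close to the first factor axis, while \(x_2\) (drama), \(x_3\) (movies), \(x_4\) (shows/entertainment), and \(x_8\) (children's programs/cartoons) lie close to the second factor axis.

# 2. Using the correlation matrix as input
tvprog.fa.cor <- principal(r = tvprog.cor,         # correlation matrix
                           n.obs = nrow(tvprog.X), # sample size
                           nfactors = 2,           # number of factors
                           rotate = "varimax")     # rotation method
print(tvprog.fa.cor,
      sort = TRUE,                                 # sort variables by the size of their loadings
      digits = 5)                                  # number of decimal places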

Principal Components Analysis
Call: principal(r = tvprog.cor, nfactors = 2, rotate = "varimax", n.obs = nrow(tvprog.X))
Standardized loadings (pattern matrix) based upon correlation matrix
   item     RC1     RC2      h2      u2    com
x6    6 0.83888 0.11744 0.71751 0.28249 1.0392
x1    1 0.76661 0.11683 0.60134 0.39866 1.0464
x7    7 0.70755 0.34881 0.62230 0.37770 1.4589
x5    5 0.67094 0.29586 0.53770 0.46230 1.3747
x4    4 0.25733 0.76882 0.65731 0.34269 1.2213
x3    3 0.27090 0.71821 0.58921 0.41079 1.2789
x8    8 0.02628 0.69287 0.48076 0.51924 1.0029
x2    2 0.26517 0.68075 0.53373 0.46627 1.2966

                          RC1     RC2
SS loadings           2.45281 2.28703
Proportion Var        0.30660 0.28588
Cumulative Var        0.30660 0.59248
Proportion Explained  0.51749 0.48251
Cumulative Proportion 0.51749 1.00000

Mean item complexity =  1.2
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.09581 
 with the empirical chi square  502.1939  with prob <  5.0056e-99 

Fit based upon off diagonal values = 0.93676

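1-5. Estimating Factor Loadings and Specific Variances

There are three methods for estimating the factor loadings and specific variances: 1. the principal component method, 2. the principal axis factoring method, and 3. the maximum likelihood method.
First, the principal component method is available through the function principal() from the package psych.

# 1. Principal component method
pacman::p_load("psych")

tvprog.fa.pm <- principal(tvprog.X,
                          cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                          nfactors = 2,        # number of factors
                          rotate = "none")     # rotation method
print(tvprog.fa.pm, digits = 3)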

Principal Components Analysis
Call: principal(r = tvprog.X, nfactors = 2, rotate = "none", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
     PC1    PC2    h2    u2  com
x1 0.640 -0.438 0.601 0.399 1.77
x2 0.659  0.316 0.534 0.466 1.44
x3 0.688  0.340 0.589 0.411 1.46
x4 0.713  0.386 0.657 0.343 1.54
x5 0.692 -0.242 0.538 0.462 1.24
x6 0.693 -0.487 0.718 0.282 1.79
x7 0.755 -0.228 0.622 0.378 1.18
x8 0.492  0.488 0.481 0.519 2.00

                        PC1   PC2
SS loadings           3.597 1.143
Proportion Var        0.450 0.143
Cumulative Var        0.450 0.592
Proportion Explained  0.759 0.241
Cumulative Proportion 0.759 1.000

Mean item complexity =  1.6
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.096 
 with the empirical chi square  502.194  with prob <  5.01e-99 

Fit based upon off diagonal values = 0.937

# Principal component analysis
tvprog.pca <- princomp(tvprog.X, cor = TRUE)
tvprog.pca$sdev^2             # variances of the components = eigenvalues

   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7 
3.5972341 1.1426129 0.7691740 0.6395068 0.5567417 0.5062387 0.4424260 
   Comp.8 
0.3460658

tvprog.pca$loadings           # principal component coefficients = eigenvectors


Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
x1  0.337  0.410  0.166  0.423  0.375  0.583  0.138  0.109
x2  0.347 -0.296  0.585  0.201        -0.136 -0.579 -0.233
x3  0.363 -0.318        -0.456 -0.396  0.563         0.287
x4  0.376 -0.361  0.216 -0.112  0.203 -0.207  0.733 -0.220
x5  0.365  0.226 -0.186 -0.522  0.561 -0.275 -0.269  0.213
x6  0.365  0.456 -0.221 -0.128 -0.371               -0.675
x7  0.398  0.214         0.313 -0.433 -0.453  0.106  0.542
x8  0.260 -0.457 -0.707  0.415  0.146        -0.146       

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.125  0.125  0.125  0.125  0.125  0.125  0.125
Cumulative Var  0.125  0.250  0.375  0.500  0.625  0.750  0.875
               Comp.8
SS loadings     1.000
Proportion Var  0.125
Cumulative Var  1.000

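Caution! The factor loadings estimated by the principal component method equal the principal component coefficients (eigenvectors) multiplied by the square roots of the corresponding eigenvalues.
Result! Up to sign and rounding, the loading matrix is obtained by multiplying the first two eigenvectors by the square roots of their eigenvalues. For example, on the first factor the loading of \(x_1\) is \(\hat{l}_{11}=\) 0.640 \(\approx\sqrt{3.597}\times0.337\) and the loading of \(x_2\) is \(\hat{l}_{12}=\) 0.659 \(\approx\sqrt{3.597}\times0.347\). On the second factor the loading of \(x_1\) is \(|\hat{l}_{21}|=\) 0.438 \(\approx\sqrt{1.143}\times0.410\) and the loading of \(x_2\) is \(|\hat{l}_{22}|=\) 0.316 \(\approx\sqrt{1.143}\times0.296\).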
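
The same check can be run for all eight variables at once. This sketch uses only numbers typed in from the rounded outputs above (the first two eigenvalues and the first two columns of the eigenvector table), so agreement with the principal() loadings is expected only to about two or three decimals, and signs may be flipped.

```r
# First two eigenvalues and eigenvectors, copied from the princomp() output above
lambda <- c(3.597, 1.143)
V <- cbind(Comp.1 = c(0.337, 0.347, 0.363, 0.376, 0.365, 0.365, 0.398, 0.260),
           Comp.2 = c(0.410, -0.296, -0.318, -0.361, 0.226, 0.456, 0.214, -0.457))
rownames(V) <- paste0("x", 1:8)

# Principal component method: scale each eigenvector by sqrt(eigenvalue)
L_hat <- sweep(V, 2, sqrt(lambda), "*")

round(abs(L_hat), 3)   # compare with |PC1| and |PC2| in the principal() output
```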

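Caution! The function fa() can be used to apply the principal axis factoring and maximum likelihood methods. Specify fa(, fm = "pa") for principal axis factoring and fa(, fm = "ml") for maximum likelihood. For details on the other options, see the help page for fa().

# 2. Principal axis factoring
tvprog.fa.pa <- fa(tvprog.X,
                   cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                   nfactors = 2,        # number of factors
                   fm = "pa",           # principal axis factoring
                   rotate = "none")     # rotation method
print(tvprog.fa.pa, digits = 3)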

Factor Analysis using method =  pa
Call: fa(r = tvprog.X, nfactors = 2, rotate = "none", fm = "pa", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
     PA1    PA2    h2    u2  com
x1 0.576 -0.243 0.390 0.610 1.35
x2 0.595  0.265 0.425 0.575 1.38
x3 0.629  0.242 0.454 0.546 1.29
x4 0.685  0.384 0.616 0.384 1.57
x5 0.626 -0.139 0.411 0.589 1.10
x6 0.688 -0.469 0.693 0.307 1.77
x7 0.710 -0.163 0.531 0.469 1.11
x8 0.415  0.223 0.222 0.778 1.53

                        PA1   PA2
SS loadings           3.091 0.651
Proportion Var        0.386 0.081
Cumulative Var        0.386 0.468
Proportion Explained  0.826 0.174
Cumulative Proportion 0.826 1.000

Mean item complexity =  1.4
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  28  and the objective function was  2.439 with Chi Square of  2371.949
The degrees of freedom for the model are 13  and the objective function was  0.079 

The root mean square of the residuals (RMSR) is  0.029 
The df corrected root mean square of the residuals is  0.042 

The harmonic number of observations is  977 with the empirical chi square  45.083  with prob <  2.03e-05 
The total number of observations was  977  with Likelihood Chi Square =  77.161  with prob <  3.76e-11 

Tucker Lewis Index of factoring reliability =  0.941
RMSEA index =  0.0711  and the 90 % confidence intervals are  0.0563 0.0868
BIC =  -12.338
Fit based upon off diagonal values = 0.994
Measures of factor score adequacy             
                                                    PA1   PA2
Correlation of (regression) scores with factors   0.932 0.781
Multiple R square of scores with factors          0.868 0.611
Minimum correlation of possible factor scores     0.736 0.221

# 3. Maximum likelihood method
tvprog.fa.ml <- fa(tvprog.X,
                   cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                   nfactors = 2,        # number of factors
                   fm = "ml",           # maximum likelihood
                   rotate = "none")     # rotation method
print(tvprog.fa.ml, digits = 3)

Factor Analysis using method =  ml
Call: fa(r = tvprog.X, nfactors = 2, rotate = "none", fm = "ml", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
     ML1    ML2    h2    u2  com
x1 0.589 -0.141 0.366 0.634 1.11
x2 0.558  0.369 0.447 0.553 1.73
x3 0.603  0.272 0.437 0.563 1.39
x4 0.636  0.460 0.616 0.384 1.82
x5 0.637 -0.062 0.410 0.590 1.02
x6 0.752 -0.423 0.744 0.256 1.58
x7 0.719 -0.073 0.523 0.477 1.02
x8 0.389  0.257 0.217 0.783 1.74

                        ML1   ML2
SS loadings           3.065 0.695
Proportion Var        0.383 0.087
Cumulative Var        0.383 0.470
Proportion Explained  0.815 0.185
Cumulative Proportion 0.815 1.000

Mean item complexity =  1.4
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  28  and the objective function was  2.439 with Chi Square of  2371.949
The degrees of freedom for the model are 13  and the objective function was  0.076 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.044 

The harmonic number of observations is  977 with the empirical chi square  48.427  with prob <  5.54e-06 
The total number of observations was  977  with Likelihood Chi Square =  73.724  with prob <  1.64e-10 

Tucker Lewis Index of factoring reliability =  0.9441
RMSEA index =  0.0691  and the 90 % confidence intervals are  0.0543 0.0849
BIC =  -15.774
Fit based upon off diagonal values = 0.994
Measures of factor score adequacy             
                                                    ML1   ML2
Correlation of (regression) scores with factors   0.935 0.799
Multiple R square of scores with factors          0.875 0.638
Minimum correlation of possible factor scores     0.749 0.275

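# Factor structure diagram
pacman::p_load("psych")

fa.diagram(tvprog.fa.pm,
           simple = FALSE,              # whether to draw only the largest loading for each observed variable
           cut = 0,                     # loadings smaller than this value are not drawn
           digits = 3)

fa.diagram(tvprog.fa.pa,
           simple = FALSE,              # whether to draw only the largest loading for each observed variable
           cut = 0,                     # loadings smaller than this value are not drawn
           digits = 3)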

Caution! The function fa.diagram() from the package psych displays the factor structure graphically.

1-6. Factor Rotation

To rotate the factor axes, specify the rotation method in the option rotate.
- Orthogonal rotation : varimax, quartimax, bentlerT, equamax
- Oblique rotation : promax, oblimin, simplimax, bentlerQ
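
A useful property of orthogonal rotations is that they redistribute variance between factors without changing any variable's communality. The sketch below illustrates this with base R's stats::varimax() applied to the unrotated loadings typed in from the rotate = "none" output in section 1-5. It is not necessarily the exact routine psych runs internally, so signs and column order may differ from the psych output.

```r
# Unrotated loadings (PC1, PC2) copied from the rotate = "none" output in section 1-5
L0 <- cbind(PC1 = c(0.640, 0.659, 0.688, 0.713, 0.692, 0.693, 0.755, 0.492),
            PC2 = c(-0.438, 0.316, 0.340, 0.386, -0.242, -0.487, -0.228, 0.488))
rownames(L0) <- paste0("x", 1:8)

vm    <- stats::varimax(L0)        # Kaiser-normalized varimax (base R default)
L_rot <- unclass(vm$loadings)      # rotated loadings as a plain matrix

round(L_rot, 3)
round(crossprod(vm$rotmat), 3)     # identity matrix: the rotation is orthogonal
all.equal(unname(rowSums(L0^2)),   # communalities before and after rotation
          unname(rowSums(L_rot^2)))
```

An oblique rotation such as promax gives up this orthogonality, which is why it also reports correlations between the rotated factors.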

# No rotation
tvprog.fa <- principal(tvprog.X,
                       cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                       nfactors = 2,        # number of factors
                       rotate = "none")     # rotation method
print(tvprog.fa, digits = 3)

Principal Components Analysis
Call: principal(r = tvprog.X, nfactors = 2, rotate = "none", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
     PC1    PC2    h2    u2  com
x1 0.640 -0.438 0.601 0.399 1.77
x2 0.659  0.316 0.534 0.466 1.44
x3 0.688  0.340 0.589 0.411 1.46
x4 0.713  0.386 0.657 0.343 1.54
x5 0.692 -0.242 0.538 0.462 1.24
x6 0.693 -0.487 0.718 0.282 1.79
x7 0.755 -0.228 0.622 0.378 1.18
x8 0.492  0.488 0.481 0.519 2.00

                        PC1   PC2
SS loadings           3.597 1.143
Proportion Var        0.450 0.143
Cumulative Var        0.450 0.592
Proportion Explained  0.759 0.241
Cumulative Proportion 0.759 1.000

Mean item complexity =  1.6
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.096 
 with the empirical chi square  502.194  with prob <  5.01e-99 

Fit based upon off diagonal values = 0.937

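Result! In the initial (unrotated) loading matrix, most variables load highly on the first factor, so it can be regarded as a factor representing the "overall level of viewing". On the second factor, \(x_1\) (news/current affairs), \(x_5\) (sports), \(x_6\) (documentaries), and \(x_7\) (lifestyle information) have negative loadings, while \(x_2\) (drama), \(x_3\) (movies), \(x_4\) (shows/entertainment), and \(x_8\) (children's programs/cartoons) have positive loadings, so it is not easy to attach a meaning to this factor. At most, it can be interpreted as the "contrast between the entertainment-seeking and information-seeking types".

# Factor loading plot
biplot(x = tvprog.fa$loadings[, c(1,2)],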
       y = tvprog.fa$loadings[, c(1,2)],
       xlabs = colnames(tvprog.X),
       ylabs = colnames(tvprog.X))

# Varimax rotation
tvprog.fa.varimax <- principal(tvprog.X,
                               cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                               nfactors = 2,        # number of factors
                               rotate = "varimax")  # rotation method
print(tvprog.fa.varimax, digits = 3)

Principal Components Analysis
Call: principal(r = tvprog.X, nfactors = 2, rotate = "varimax", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
     RC1   RC2    h2    u2  com
x1 0.767 0.117 0.601 0.399 1.05
x2 0.265 0.681 0.534 0.466 1.30
x3 0.271 0.718 0.589 0.411 1.28
x4 0.257 0.769 0.657 0.343 1.22
x5 0.671 0.296 0.538 0.462 1.37
x6 0.839 0.117 0.718 0.282 1.04
x7 0.708 0.349 0.622 0.378 1.46
x8 0.026 0.693 0.481 0.519 1.00

                        RC1   RC2
SS loadings           2.453 2.287
Proportion Var        0.307 0.286
Cumulative Var        0.307 0.592
Proportion Explained  0.517 0.483
Cumulative Proportion 0.517 1.000

Mean item complexity =  1.2
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.096 
 with the empirical chi square  502.194  with prob <  5.01e-99 

Fit based upon off diagonal values = 0.937

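Result! In the loading matrix after Varimax rotation, the variables that load highly on the first factor are \(x_1\) (news/current affairs), \(x_5\) (sports), \(x_6\) (documentaries), and \(x_7\) (lifestyle information), so the first factor can be interpreted as an "information-seeking type". The variables that load highly on the second factor are \(x_2\) (drama), \(x_3\) (movies), \(x_4\) (shows/entertainment), and \(x_8\) (children's programs/cartoons), so the second factor can be interpreted as an "entertainment-seeking type".

# Factor loading plot
biplot(x = tvprog.fa.varimax$loadings[, c(1,2)],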
       y = tvprog.fa.varimax$loadings[, c(1,2)],
       xlabs = colnames(tvprog.X),
       ylabs = colnames(tvprog.X))

# Promax rotation
tvprog.fa.promax <- principal(tvprog.X,
                              cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                              nfactors = 2,        # number of factors
                              rotate = "promax")   # rotation method
print(tvprog.fa.promax, digits = 3)

Principal Components Analysis
Call: principal(r = tvprog.X, nfactors = 2, rotate = "promax", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
      RC1    RC2    h2    u2  com
x1  0.814 -0.082 0.601 0.399 1.02
x2  0.067  0.694 0.534 0.466 1.02
x3  0.061  0.735 0.589 0.411 1.01
x4  0.029  0.796 0.657 0.343 1.00
x5  0.648  0.147 0.538 0.462 1.10
x6  0.894 -0.101 0.718 0.282 1.03
x7  0.671  0.196 0.622 0.378 1.17
x8 -0.203  0.774 0.481 0.519 1.14

                        RC1   RC2
SS loadings           2.395 2.345
Proportion Var        0.299 0.293
Cumulative Var        0.299 0.592
Proportion Explained  0.505 0.495
Cumulative Proportion 0.505 1.000

 With component correlations of 
     RC1  RC2
RC1 1.00 0.51
RC2 0.51 1.00

Mean item complexity =  1.1
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.096 
 with the empirical chi square  502.194  with prob <  5.01e-99 

Fit based upon off diagonal values = 0.937

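Result! The component correlation matrix shows that the rotation left the two factors with a correlation of about 0.51.
Caution! Under an oblique rotation the factors are no longer orthogonal, so the rotated loading matrix can no longer be interpreted as correlations. Instead, the factor structure matrix, which holds the simple correlations between the factors and the original variables under an oblique rotation, can be used to interpret the factors.

# Factor structure matrix
round(tvprog.fa.promax$Structure, 3)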

     RC1   RC2
x1 0.772 0.333
x2 0.421 0.728
x3 0.435 0.766
x4 0.434 0.810
x5 0.722 0.477
x6 0.843 0.354
x7 0.771 0.538
x8 0.192 0.671
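
The structure matrix is related to the pattern loadings by the standard identity structure = pattern × factor correlation matrix. The sketch below checks this with values typed in from the promax output above; because the component correlation is printed only to two decimals (0.51), agreement is expected to about two decimals. The names P, Phi, and S are introduced here for illustration.

```r
# Promax pattern loadings and component correlations copied from the output above
P <- cbind(RC1 = c(0.814, 0.067, 0.061, 0.029, 0.648, 0.894, 0.671, -0.203),
           RC2 = c(-0.082, 0.694, 0.735, 0.796, 0.147, -0.101, 0.196, 0.774))
rownames(P) <- paste0("x", 1:8)

Phi <- matrix(c(1.00, 0.51,
                0.51, 1.00), nrow = 2,
              dimnames = list(colnames(P), colnames(P)))

S <- P %*% Phi   # structure matrix = pattern matrix times factor correlations
round(S, 3)
```

With an orthogonal rotation Phi is the identity matrix, so the pattern and structure matrices coincide.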

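1-7. Number of Factors

Several criteria are available for choosing an appropriate number of factors.
1. Size of eigenvalues : retain as many factors as there are eigenvalues greater than 1
2. Scree plot : take the number of factors just before the curve flattens out
3. Contribution of factors : choose the number of factors that explains a sufficiently large share of the total variance
4. Chi-square goodness-of-fit test : applicable when the maximum likelihood method is used

# 1. Size of eigenvalues
print(tvprog.fa.varimax$values, # eigenvalues of the correlation matrix
      digits = 3)               # number of significant digits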

[1] 3.597 1.143 0.769 0.640 0.557 0.506 0.442 0.346
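
Criteria 1 and 3 reduce to one-liners on the eigenvalues. The sketch below uses the rounded eigenvalues printed above; the 80% cut-off for the cumulative contribution is an arbitrary threshold chosen here purely for illustration, not a value from the book.

```r
# Eigenvalues copied from the output above
ev <- c(3.597, 1.143, 0.769, 0.640, 0.557, 0.506, 0.442, 0.346)

# Criterion 1: number of eigenvalues greater than 1
n_kaiser <- sum(ev > 1)

# Criterion 3: smallest number of factors whose cumulative share reaches a threshold
threshold <- 0.80                          # illustrative assumption
cum_share <- cumsum(ev) / sum(ev)
n_share   <- which(cum_share >= threshold)[1]

c(kaiser = n_kaiser, share = n_share)
```

The eigenvalue rule keeps two factors, while the illustrative 80% threshold would keep more, which shows how strongly criterion 3 depends on the chosen cut-off.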

# 2. Scree plot
pacman::p_load("psych")

scree(tvprog.X)

# 3. Contribution of factors
print(tvprog.fa.varimax, digits = 3)

Principal Components Analysis
Call: principal(r = tvprog.X, nfactors = 2, rotate = "varimax", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
     RC1   RC2    h2    u2  com
x1 0.767 0.117 0.601 0.399 1.05
x2 0.265 0.681 0.534 0.466 1.30
x3 0.271 0.718 0.589 0.411 1.28
x4 0.257 0.769 0.657 0.343 1.22
x5 0.671 0.296 0.538 0.462 1.37
x6 0.839 0.117 0.718 0.282 1.04
x7 0.708 0.349 0.622 0.378 1.46
x8 0.026 0.693 0.481 0.519 1.00

                        RC1   RC2
SS loadings           2.453 2.287
Proportion Var        0.307 0.286
Cumulative Var        0.307 0.592
Proportion Explained  0.517 0.483
Cumulative Proportion 0.517 1.000

Mean item complexity =  1.2
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.096 
 with the empirical chi square  502.194  with prob <  5.01e-99 

Fit based upon off diagonal values = 0.937

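# 4. Chi-square goodness-of-fit test
tvprog.fa.ml1 <- fa(tvprog.X,
                    cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                    nfactors = 1,        # number of factors
                    fm = "ml",           # maximum likelihood
                    rotate = "varimax")  # rotation method
factor.stats(r = tvprog.X,               # data frame / correlation matrix
             f = tvprog.fa.ml1)          # factor analysis result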

Call: fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, 
    alpha = alpha, fm = fm)

Test of the hypothesis that 1 factor is sufficient.

The degrees of freedom for the model is 20  and the fit was  0.41 
The number of observations was  977  with Chi Square =  396.76  with prob <  9.6e-72 

Measures of factor score adequacy             
 Correlation of scores with factors            0.91
Multiple R square of scores with factors       0.83
Minimum correlation of factor score estimates  0.67

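Caution! The function factor.stats() from the package psych can be used to carry out the chi-square goodness-of-fit test.
Result! For the null hypothesis “1 factor is sufficient” against the alternative “more factors are needed”, the test gives \(\chi^2=\) 396.76 with a \(p\)-value of 9.6e-72. Since the \(p\)-value is below 0.05, the null hypothesis is rejected at the 5% significance level. That is, more than 1 factor is needed.

tvprog.fa.ml2 <- fa(tvprog.X,
                    cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                    nfactors = 2,        # number of factors
                    fm = "ml",           # maximum likelihood
                    rotate = "varimax")  # rotation method
factor.stats(r = tvprog.X,               # data frame / correlation matrix
             f = tvprog.fa.ml2)          # factor analysis result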

Call: fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, 
    alpha = alpha, fm = fm)

Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the model is 13  and the fit was  0.08 
The number of observations was  977  with Chi Square =  73.72  with prob <  1.6e-10 

Measures of factor score adequacy             
 Correlation of scores with factors            0.89 0.85
Multiple R square of scores with factors       0.79 0.73
Minimum correlation of factor score estimates  0.57 0.45

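Result! For the null hypothesis “2 factors are sufficient” against the alternative “more factors are needed”, the test gives \(\chi^2=\) 73.72 with a \(p\)-value of 1.6e-10. Since the \(p\)-value is below 0.05, the null hypothesis is rejected at the 5% significance level. That is, more than 2 factors are needed.

tvprog.fa.ml4 <- fa(tvprog.X,
                    cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                    nfactors = 4,        # number of factors
                    fm = "ml",           # maximum likelihood
                    rotate = "varimax")  # rotation method
factor.stats(r = tvprog.X,               # data frame / correlation matrix
             f = tvprog.fa.ml4)          # factor analysis result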

Call: fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, 
    alpha = alpha, fm = fm)

Test of the hypothesis that 4 factors are sufficient.

The degrees of freedom for the model is 2  and the fit was  0.01 
The number of observations was  977  with Chi Square =  7.42  with prob <  0.025 

Measures of factor score adequacy             
 Correlation of scores with factors            0.79 0.96 0.72 0.97
Multiple R square of scores with factors       0.63 0.93 0.51 0.94
Minimum correlation of factor score estimates  0.25 0.86 0.03 0.89

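Result! For the null hypothesis “4 factors are sufficient” against the alternative “more factors are needed”, the test gives \(\chi^2=\) 7.42 with a \(p\)-value of 0.025. Since the \(p\)-value is below 0.05, the null hypothesis is rejected at the 5% significance level. That is, more than 4 factors are needed.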
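
The degrees of freedom and tail probabilities reported in these tests can be reproduced in base R. With \(p\) variables and \(m\) factors, the model degrees of freedom are \(\left[(p-m)^2-(p+m)\right]/2\). The helper name fa_df() is hypothetical, and the chi-square statistics are typed in from the outputs above.

```r
# Degrees of freedom of the goodness-of-fit test for p variables and m factors
fa_df <- function(p, m) ((p - m)^2 - (p + m)) / 2

dfs <- sapply(c(1, 2, 4), function(m) fa_df(p = 8, m = m))
dfs   # 20 13 2, as in the factor.stats() outputs

# Upper-tail chi-square probabilities for the reported statistics
p_values <- pchisq(c(396.76, 73.72, 7.42), df = dfs, lower.tail = FALSE)
signif(p_values, 2)
```

Note that with 8 variables the degrees of freedom become negative for \(m \ge 5\), so this test cannot be applied to models with five or more factors.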

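Note: Across the criteria considered above, two factors appear appropriate by every criterion except the chi-square goodness-of-fit test.

2. Customer Satisfaction Data

This section uses “satis.csv”, one of the data files that accompany the book R을 활용한 다변량 자료분석 방법론, published by 자유아카데미.
The data come from a survey of customer satisfaction with a product and consist of eight variables.
1. ID : customer ID
2. gender : customer gender
3. age : customer age
4. \(x_1\) : satisfaction with price
5. \(x_2\) : satisfaction with performance
6. \(x_3\) : satisfaction with convenience
7. \(x_4\) : satisfaction with design
8. \(x_5\) : satisfaction with color
Satisfaction is measured on a 5-point scale: “1 = very dissatisfied”, “2 = dissatisfied”, “3 = neutral”, “4 = satisfied”, and “5 = very satisfied”.

# Load the data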
satis <- read.csv("C:/Users/User/Desktop/satis.csv")
head(satis)

  ID gender age x1 x2 x3 x4 x5
1  1      F  10  1  2  4  1  1
2  2      F  10  1  2  3  2  1
3  3      F  20  2  5  5  2  2
4  4      F  20  2  5  4  2  2
5  5      F  30  1  2  3  4  3
6  6      M  30  1  3  4  1  1

# Preprocess the data
pacman::p_load("dplyr")

satis.X <- satis %>%
  .[,4:8]                   # keep columns 4 to 8 (x1 to x5)

head(satis.X)

  x1 x2 x3 x4 x5
1  1  2  4  1  1
2  1  2  3  2  1
3  2  5  5  2  2
4  2  5  4  2  2
5  1  2  3  4  3
6  1  3  4  1  1

2-1. Correlation Matrix and Eigenvalues

# Correlation matrix
round(cor(satis.X), 3)

      x1    x2    x3    x4    x5
x1 1.000 0.708 0.758 0.450 0.574
x2 0.708 1.000 0.728 0.059 0.291
x3 0.758 0.728 1.000 0.242 0.444
x4 0.450 0.059 0.242 1.000 0.939
x5 0.574 0.291 0.444 0.939 1.000

# Principal component analysis
satis.pca <- princomp(satis.X,
                      cor = TRUE)  # PCA based on the correlation matrix
round(satis.pca$sdev^2, 3)         # variances of the components = eigenvalues

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 
 3.108  1.399  0.253  0.213  0.027

summary(satis.pca)                 # proportion of variance explained by each component

Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4
Standard deviation     1.7628337 1.1829978 0.50336951 0.46102546
Proportion of Variance 0.6215165 0.2798968 0.05067617 0.04250889
Cumulative Proportion  0.6215165 0.9014133 0.95208945 0.99459834
                           Comp.5
Standard deviation     0.16434203
Proportion of Variance 0.00540166
Cumulative Proportion  1.00000000

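Result! The first eigenvalue is 3.108, which is about 3.108/5 = 62% of the total variance (for standardized variables the total variance equals the number of variables, 5). The second eigenvalue is 1.399, which is 1.399/5 = 28% of the total variance. Together, the first two eigenvalues explain 62% + 28% = 90% of the total variance.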
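
Since princomp(cor = TRUE) works from the correlation matrix, the same eigenvalues can be recovered with eigen(). The sketch below types in the correlation matrix from the rounded output above, so the eigenvalues should match the printed ones only to about two decimals; the names R_satis and ev are introduced here for illustration.

```r
# Correlation matrix copied from the rounded output above
R_satis <- matrix(c(
  1.000, 0.708, 0.758, 0.450, 0.574,
  0.708, 1.000, 0.728, 0.059, 0.291,
  0.758, 0.728, 1.000, 0.242, 0.444,
  0.450, 0.059, 0.242, 1.000, 0.939,
  0.574, 0.291, 0.444, 0.939, 1.000
), nrow = 5, byrow = TRUE,
dimnames = list(paste0("x", 1:5), paste0("x", 1:5)))

ev <- eigen(R_satis, symmetric = TRUE)$values

round(ev, 3)             # compare with satis.pca$sdev^2
round(ev / sum(ev), 3)   # proportion of variance explained
```

The eigenvalues sum to the number of variables because the diagonal of a correlation matrix is all ones.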

2-2. KMO Measure of Sampling Adequacy

pacman::p_load("psych")

KMO(satis.X)

Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = satis.X)
Overall MSA =  0.59
MSA for each item = 
  x1   x2   x3   x4   x5 
0.71 0.59 0.74 0.44 0.54

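Result! The overall measure of sampling adequacy (MSA) is 0.59, which places this data set in the “Miserable” range for factor analysis.

2-3. Bartlett's Test of Sphericity

pacman::p_load("psych")

satis.cor <- cor(satis.X)             # correlation matrix
cortest.bartlett(satis.cor,           # correlation matrix
                 n = nrow(satis.X))   # sample size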

$chisq
[1] 32.91033

$p.value
[1] 0.000281999

$df
[1] 10

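Result! Bartlett's test gives \(\chi^2=\) 32.91 with 10 degrees of freedom and \(p=\) 0.0003. Since the \(p\)-value is below 0.05, the null hypothesis is rejected at the 5% significance level. That is, the data can be said to contain at least one common factor.

2-4. Factor Analysis

# 1. Using the data matrix as input
pacman::p_load("psych")

satis.fa <- principal(satis.X,             # data matrix
                      cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                      nfactors = 2,        # number of factors
                      rotate = "varimax")  # rotation method
print(satis.fa,
      sort = TRUE,                         # sort variables by the size of their loadings
      digits = 5)                          # number of decimal places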

Principal Components Analysis
Call: principal(r = satis.X, nfactors = 2, rotate = "varimax", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
   item     RC1      RC2      h2       u2    com
x2    2 0.92938 -0.00905 0.86383 0.136175 1.0002
x3    3 0.89321  0.19954 0.83764 0.162357 1.0996
x1    1 0.82728  0.40875 0.85147 0.148525 1.4608
x4    4 0.05817  0.99152 0.98650 0.013501 1.0069
x5    5 0.28656  0.94101 0.96762 0.032376 1.1839

                          RC1     RC2
SS loadings           2.43147 2.07559
Proportion Var        0.48629 0.41512
Cumulative Var        0.48629 0.90141
Proportion Explained  0.53948 0.46052
Cumulative Proportion 0.53948 1.00000

Mean item complexity =  1.2
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.04578 
 with the empirical chi square  0.41913  with prob <  0.51737 

Fit based upon off diagonal values = 0.99377

# Print the eigenvalues of the correlation matrix
print(satis.fa$values,
      digits = 3)             # number of significant digits

[1] 3.108 1.399 0.253 0.213 0.027

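Result! These are identical to the eigenvalues obtained above with the function princomp().

# Print the factor loadings
print(satis.fa$loadings,
      digits = 3,             # number of decimal places
      cut    = 0)             # loadings smaller than this value are not printed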


Loadings:
   RC1    RC2   
x1  0.827  0.409
x2  0.929 -0.009
x3  0.893  0.200
x4  0.058  0.992
x5  0.287  0.941

                 RC1   RC2
SS loadings    2.431 2.076
Proportion Var 0.486 0.415
Cumulative Var 0.486 0.901

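Result! A factor loading measures the strength of association between an observed variable and a factor. In the output, the variables that load highly on the first factor are \(x_1\) (price), \(x_2\) (performance), and \(x_3\) (convenience), so the first factor can be interpreted as an "internal (functional) factor". The variables that load highly on the second factor are \(x_4\) (design) and \(x_5\) (color), so the second factor can be interpreted as an "external (appearance) factor".
In addition, of the total variance of the (standardized) observed variables, the two factors explain 2.431 and 2.076 respectively, which is 2.431/5 = 48.6% and 2.076/5 = 41.5% of the total variance. Together the two factors therefore explain 90.1% of the total variance.

# Factor loading plot
pacman::p_load("psych")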

biplot(x = satis.fa$loadings[, c(1,2)],
       y = satis.fa$loadings[, c(1,2)],
       xlabs = colnames(satis.X),
       ylabs = colnames(satis.X))

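Result! The plot confirms visually that the variables \(x_1\) (price), \(x_2\) (performance), and \(x_3\) (convenience) lie close to the first factor axis, while \(x_4\) (design) and \(x_5\) (color) lie close to the second factor axis.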

# 2. Using a correlation matrix as input
satis.fa.cor <- principal(r = satis.cor,          # correlation matrix
                          n.obs = nrow(satis.X),  # sample size
                          nfactors = 2,           # number of factors
                          rotate = "varimax")     # rotation method
print(satis.fa.cor,
      sort = TRUE,                                # sort by the size of the loadings
      digits = 5)                                 # number of decimal places

Principal Components Analysis
Call: principal(r = satis.cor, nfactors = 2, rotate = "varimax", n.obs = nrow(satis.X))
Standardized loadings (pattern matrix) based upon correlation matrix
   item     RC1      RC2      h2       u2    com
x2    2 0.92938 -0.00905 0.86383 0.136175 1.0002
x3    3 0.89321  0.19954 0.83764 0.162357 1.0996
x1    1 0.82728  0.40875 0.85147 0.148525 1.4608
x4    4 0.05817  0.99152 0.98650 0.013501 1.0069
x5    5 0.28656  0.94101 0.96762 0.032376 1.1839

                          RC1     RC2
SS loadings           2.43147 2.07559
Proportion Var        0.48629 0.41512
Cumulative Var        0.48629 0.90141
Proportion Explained  0.53948 0.46052
Cumulative Proportion 0.53948 1.00000

Mean item complexity =  1.2
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.04578 
 with the empirical chi square  0.41913  with prob <  0.51737 

Fit based upon off diagonal values = 0.99377

2-5. Estimation of factor loadings and specific variances

# 1. Principal component method
pacman::p_load("psych")

satis.fa.pm <- principal(satis.X,
                         cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                         nfactors = 2,        # number of factors
                         rotate = "none")     # rotation method
print(satis.fa.pm, digits = 3)

Principal Components Analysis
Call: principal(r = satis.X, nfactors = 2, rotate = "none", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
     PC1    PC2    h2     u2  com
x1 0.900 -0.203 0.851 0.1485 1.10
x2 0.717 -0.592 0.864 0.1362 1.93
x3 0.820 -0.407 0.838 0.1624 1.46
x4 0.669  0.734 0.986 0.0135 1.98
x5 0.815  0.551 0.968 0.0324 1.76

                        PC1   PC2
SS loadings           3.108 1.399
Proportion Var        0.622 0.280
Cumulative Var        0.622 0.901
Proportion Explained  0.689 0.311
Cumulative Proportion 0.689 1.000

Mean item complexity =  1.6
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.046 
 with the empirical chi square  0.419  with prob <  0.517 

Fit based upon off diagonal values = 0.994

# Principal component analysis
satis.pca <- princomp(satis.X, cor = TRUE)
satis.pca$sdev^2             # variances of the principal components = eigenvalues

   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5 
3.1075826 1.3994838 0.2533809 0.2125445 0.0270083

satis.pca$loadings           # principal component coefficients = eigenvectors


Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
x1  0.511  0.171         0.833  0.116
x2  0.407  0.500  0.647 -0.372 -0.166
x3  0.465  0.344 -0.754 -0.290 -0.111
x4  0.380 -0.621               -0.684
x5  0.462 -0.466        -0.289  0.692

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings       1.0    1.0    1.0    1.0    1.0
Proportion Var    0.2    0.2    0.2    0.2    0.2
Cumulative Var    0.2    0.4    0.6    0.8    1.0

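Result! The absolute values of the factor loading matrix are obtained by multiplying the first two eigenvectors by the square roots of the corresponding eigenvalues (the signs may differ because an eigenvector is determined only up to sign). Here the first index denotes the factor and the second the variable. For example, for the first factor, the loading of the first variable is \(\hat{l}_{11}=0.900=\sqrt{3.108}\times0.511\) and the loading of the second variable is \(\hat{l}_{12}=0.717=\sqrt{3.108}\times0.407\). For the second factor, the loading of the first variable is \(|\hat{l}_{21}|=0.203=\sqrt{1.399}\times0.171\) and the loading of the second variable is \(|\hat{l}_{22}|=0.592=\sqrt{1.399}\times0.500\).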
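The same check can be carried out for all ten loadings at once. This sketch retypes the eigenvalues, eigenvectors, and unrotated loadings from the outputs above (rounded values, so small discrepancies remain); the names lambda, V, L.hat, and L.psych are illustrative.

```r
# Eigenvalues and the first two eigenvectors, copied from the princomp() output
lambda <- c(3.1075826, 1.3994838)
V <- matrix(c(0.511, 0.407, 0.465,  0.380,  0.462,    # Comp.1
              0.171, 0.500, 0.344, -0.621, -0.466),   # Comp.2
            ncol = 2,
            dimnames = list(paste0("x", 1:5), c("PC1", "PC2")))

# Loading = eigenvector * sqrt(eigenvalue), column by column
L.hat <- sweep(V, 2, sqrt(lambda), "*")

# Unrotated loadings printed by principal(), for comparison
L.psych <- matrix(c( 0.900,  0.717,  0.820, 0.669, 0.815,
                    -0.203, -0.592, -0.407, 0.734, 0.551),
                  ncol = 2, dimnames = dimnames(V))

print(round(abs(L.hat), 3))
print(max(abs(abs(L.hat) - abs(L.psych))))   # only rounding error remains
```

The comparison uses absolute values because princomp() and principal() report the second component with opposite signs.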

# 2. Principal axis method
satis.fa.pa <- fa(satis.X,
                  cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                  nfactors = 2,        # number of factors
                  fm = "pa",           # principal axis method
                  rotate = "none")     # rotation method
print(satis.fa.pa, digits = 3)

Factor Analysis using method =  pa
Call: fa(r = satis.X, nfactors = 2, rotate = "none", fm = "pa", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
     PA1    PA2    h2     u2  com
x1 0.853  0.268 0.799  0.201 1.20
x2 0.649  0.568 0.743  0.257 1.97
x3 0.754  0.426 0.750  0.250 1.58
x4 0.756 -0.758 1.146 -0.146 2.00
x5 0.817 -0.423 0.846  0.154 1.50

                        PA1   PA2
SS loadings           2.956 1.328
Proportion Var        0.591 0.266
Cumulative Var        0.591 0.857
Proportion Explained  0.690 0.310
Cumulative Proportion 0.690 1.000

Mean item complexity =  1.6
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  10  and the objective function was  5.063 with Chi Square of  32.91
The degrees of freedom for the model are 1  and the objective function was  0.05 

The root mean square of the residuals (RMSR) is  0.005 
The df corrected root mean square of the residuals is  0.016 

The harmonic number of observations is  10 with the empirical chi square  0.005  with prob <  0.942 
The total number of observations was  10  with Likelihood Chi Square =  0.256  with prob <  0.613 

Tucker Lewis Index of factoring reliability =  1.4605
RMSEA index =  0  and the 90 % confidence intervals are  0 0.7029
BIC =  -2.047
Fit based upon off diagonal values = 1

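Result! A Heywood case occurred, and a warning message reporting an ultra-Heywood case was issued: the communality \(h^2\) (column h2) of variable \(x_4\) (design) exceeds 1, and its uniqueness \(u^2\) (column u2) is negative. In such a case the estimates are not valid.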
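A simple way to screen for this situation is to flag any variable whose communality reaches 1. The sketch below retypes the h2 column from the principal-axis output; the variable names are illustrative, and with a fitted object the same check could presumably be applied to its communality component.

```r
# Communalities (h2) copied from the principal-axis output above
h2 <- c(x1 = 0.799, x2 = 0.743, x3 = 0.750, x4 = 1.146, x5 = 0.846)
u2 <- 1 - h2                       # uniqueness = 1 - communality

# Heywood case: communality of 1 or more, i.e. a non-positive specific variance
heywood <- names(h2)[h2 >= 1]

print(round(u2, 3))
print(heywood)
```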

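# 3. Maximum likelihood method
satis.fa.ml <- fa(satis.X,
                  cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                  nfactors = 2,        # number of factors
                  fm = "ml",           # maximum likelihood method
                  rotate = "none")     # rotation method
print(satis.fa.ml, digits = 3)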

Factor Analysis using method =  ml
Call: fa(r = satis.X, nfactors = 2, rotate = "none", fm = "ml", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
     ML1    ML2    h2      u2  com
x1 0.472  0.733 0.760 0.24025 1.71
x2 0.089  0.894 0.807 0.19344 1.02
x3 0.271  0.821 0.748 0.25246 1.21
x4 0.997 -0.033 0.995 0.00496 1.00
x5 0.949  0.215 0.946 0.05378 1.10

                        ML1   ML2
SS loadings           2.198 2.057
Proportion Var        0.440 0.411
Cumulative Var        0.440 0.851
Proportion Explained  0.517 0.483
Cumulative Proportion 0.517 1.000

Mean item complexity =  1.2
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  10  and the objective function was  5.063 with Chi Square of  32.91
The degrees of freedom for the model are 1  and the objective function was  0.18 

The root mean square of the residuals (RMSR) is  0.018 
The df corrected root mean square of the residuals is  0.056 

The harmonic number of observations is  10 with the empirical chi square  0.063  with prob <  0.801 
The total number of observations was  10  with Likelihood Chi Square =  0.932  with prob <  0.334 

Tucker Lewis Index of factoring reliability =  1.0421
RMSEA index =  0  and the 90 % confidence intervals are  0 0.8695
BIC =  -1.371
Fit based upon off diagonal values = 0.999
Measures of factor score adequacy             
                                                    ML1   ML2
Correlation of (regression) scores with factors   0.998 0.954
Multiple R square of scores with factors          0.995 0.910
Minimum correlation of possible factor scores     0.991 0.820

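# Factor structure diagram
pacman::p_load("psych")

fa.diagram(satis.fa.pm,
           simple = FALSE,              # whether to show only the largest loading for each observed variable
           cut = 0,                     # values smaller than this are not shown
           digits = 3)

fa.diagram(satis.fa.pa,
           simple = FALSE,              # whether to show only the largest loading for each observed variable
           cut = 0,                     # values smaller than this are not shown
           digits = 3)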

Caution! The function fa.diagram() provided by the package psych displays the factor structure graphically.

2-6. Factor rotation

To rotate the factor axes, specify the rotation method in the option rotate.
- Orthogonal rotation: varimax, quartimax, bentlerT, equamax
- Oblique rotation: promax, oblimin, simplimax, bentlerQ
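To illustrate what an orthogonal rotation does, the sketch below applies base R's stats::varimax() to the unrotated loadings printed in section 2-5. This is a stand-in for the rotate option, not a reproduction of the psych internals, so the rotated columns may differ from the principal() output in sign or order; the names L0, L1, and Tmat are illustrative.

```r
# Unrotated loadings copied from the principal() output in section 2-5
L0 <- matrix(c( 0.900,  0.717,  0.820, 0.669, 0.815,
               -0.203, -0.592, -0.407, 0.734, 0.551),
             ncol = 2,
             dimnames = list(paste0("x", 1:5), c("PC1", "PC2")))

# Orthogonal (varimax) rotation with base R
rot  <- varimax(L0)
L1   <- unclass(rot$loadings)      # rotated loadings as a plain matrix
Tmat <- rot$rotmat                 # rotation matrix

# An orthogonal rotation satisfies T'T = I and leaves communalities unchanged
print(round(crossprod(Tmat), 3))
print(round(cbind(before = rowSums(L0^2), after = rowSums(L1^2)), 3))
```

An oblique method such as promax drops the orthogonality constraint, which is why it additionally reports a correlation matrix between the factors.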

# No rotation
satis.fa <- principal(satis.X,
                      cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                      nfactors = 2,        # number of factors
                      rotate = "none")     # rotation method
print(satis.fa, digits = 3)

Principal Components Analysis
Call: principal(r = satis.X, nfactors = 2, rotate = "none", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
     PC1    PC2    h2     u2  com
x1 0.900 -0.203 0.851 0.1485 1.10
x2 0.717 -0.592 0.864 0.1362 1.93
x3 0.820 -0.407 0.838 0.1624 1.46
x4 0.669  0.734 0.986 0.0135 1.98
x5 0.815  0.551 0.968 0.0324 1.76

                        PC1   PC2
SS loadings           3.108 1.399
Proportion Var        0.622 0.280
Cumulative Var        0.622 0.901
Proportion Explained  0.689 0.311
Cumulative Proportion 0.689 1.000

Mean item complexity =  1.6
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.046 
 with the empirical chi square  0.419  with prob <  0.517 

Fit based upon off diagonal values = 0.994

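Result! In the initial factor loading matrix, most variables have high loadings on the first factor, so it can be regarded as a factor representing "overall satisfaction". On the second factor, the variables \(x_1\) (price), \(x_2\) (performance), and \(x_3\) (convenience) have negative loadings, while \(x_4\) (design) and \(x_5\) (color) have positive loadings, so it is not easy to give this factor a meaning. At most, the second factor can be read as "the contrast between the external and internal factors".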

# Factor loading plot
biplot(x = satis.fa$loadings[, c(1,2)],
       y = satis.fa$loadings[, c(1,2)],
       xlabs = colnames(satis.X),
       ylabs = colnames(satis.X))

# Varimax rotation
satis.fa.varimax <- principal(satis.X,
                              cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                              nfactors = 2,        # number of factors
                              rotate = "varimax")  # rotation method
print(satis.fa.varimax, digits = 3)

Principal Components Analysis
Call: principal(r = satis.X, nfactors = 2, rotate = "varimax", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
     RC1    RC2    h2     u2  com
x1 0.827  0.409 0.851 0.1485 1.46
x2 0.929 -0.009 0.864 0.1362 1.00
x3 0.893  0.200 0.838 0.1624 1.10
x4 0.058  0.992 0.986 0.0135 1.01
x5 0.287  0.941 0.968 0.0324 1.18

                        RC1   RC2
SS loadings           2.431 2.076
Proportion Var        0.486 0.415
Cumulative Var        0.486 0.901
Proportion Explained  0.539 0.461
Cumulative Proportion 0.539 1.000

Mean item complexity =  1.2
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.046 
 with the empirical chi square  0.419  with prob <  0.517 

Fit based upon off diagonal values = 0.994

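Result! In the factor loading matrix after Varimax rotation, the variables with high loadings on the first factor are \(x_1\) (price), \(x_2\) (performance), and \(x_3\) (convenience), so the first factor can be interpreted as an "internal factor". The variables with high loadings on the second factor are \(x_4\) (design) and \(x_5\) (color), so the second factor can be interpreted as an "external factor".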

# Factor loading plot
biplot(x = satis.fa.varimax$loadings[, c(1,2)],
       y = satis.fa.varimax$loadings[, c(1,2)],
       xlabs = colnames(satis.X),
       ylabs = colnames(satis.X))

# Promax rotation
satis.fa.promax <- principal(satis.X,
                             cor = "cor",         # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                             nfactors = 2,        # number of factors
                             rotate = "promax")   # rotation method
print(satis.fa.promax, digits = 3)

Principal Components Analysis
Call: principal(r = satis.X, nfactors = 2, rotate = "promax", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
      RC1    RC2    h2     u2  com
x1  0.795  0.255 0.851 0.1485 1.20
x2  0.989 -0.210 0.864 0.1362 1.09
x3  0.908  0.019 0.838 0.1624 1.00
x4 -0.140  1.038 0.986 0.0135 1.04
x5  0.113  0.935 0.968 0.0324 1.03

                        RC1   RC2
SS loadings           2.456 2.051
Proportion Var        0.491 0.410
Cumulative Var        0.491 0.901
Proportion Explained  0.545 0.455
Cumulative Proportion 0.545 1.000

 With component correlations of 
     RC1  RC2
RC1 1.00 0.38
RC2 0.38 1.00

Mean item complexity =  1.1
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.046 
 with the empirical chi square  0.419  with prob <  0.517 

Fit based upon off diagonal values = 0.994

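Result! The "component correlations" matrix shows that the rotation left the two factors with a correlation of about 0.38.
Caution! Under an oblique rotation the factors are no longer orthogonal, so the rotated factor loading matrix can no longer be interpreted as correlation coefficients. Instead, the factor structure matrix, which contains the simple correlations between the factors and the original variables under an oblique rotation, can be used to interpret the factors.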

# Factor structure matrix
round(satis.fa.promax$Structure, 3)

     RC1   RC2
x1 0.892 0.557
x2 0.909 0.166
x3 0.915 0.364
x4 0.255 0.985
x5 0.468 0.978
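The structure matrix is the pattern matrix multiplied by the factor correlation matrix. The sketch below reproduces the printed values from the promax output; because the correlation 0.38 is retyped with only two decimals, agreement is approximate. The names P, Phi, and S are illustrative.

```r
# Pattern loadings and component correlation copied from the promax output
P <- matrix(c(0.795,  0.989, 0.908, -0.140, 0.113,
              0.255, -0.210, 0.019,  1.038, 0.935),
            ncol = 2,
            dimnames = list(paste0("x", 1:5), c("RC1", "RC2")))
Phi <- matrix(c(1.00, 0.38,
                0.38, 1.00), ncol = 2)

# Structure matrix = pattern matrix %*% factor correlation matrix
S <- P %*% Phi
colnames(S) <- colnames(P)

print(round(S, 3))
```

With an orthogonal rotation Phi is the identity matrix, so the pattern and structure matrices coincide.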

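2-7. Number of factors

Several criteria are available for choosing an appropriate number of factors.
1. Size of the eigenvalues: retain as many factors as there are eigenvalues greater than 1
2. Scree plot: take the number of factors just before the curve flattens into a straight line
3. Contribution of the factors: choose the number of factors that explains as much of the total variance as possible
4. Chi-square goodness-of-fit test: applicable when the maximum likelihood method is used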

# 1. Size of the eigenvalues
print(satis.fa.promax$values,   # eigenvalues of the correlation matrix
      digits = 3)               # number of significant digits

[1] 3.108 1.399 0.253 0.213 0.027
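The eigenvalue criterion and the cumulative contribution can be computed from these five numbers alone. The following sketch retypes the printed eigenvalues; the names ev, n.kaiser, and cum.prop are illustrative.

```r
# Eigenvalues of the correlation matrix, copied from the output above
ev <- c(3.108, 1.399, 0.253, 0.213, 0.027)

n.kaiser <- sum(ev > 1)                 # criterion 1: eigenvalues greater than 1
cum.prop <- cumsum(ev) / sum(ev)        # criterion 3: cumulative share of total variance

print(n.kaiser)
print(round(cum.prop, 3))
```

Both criteria point to two factors here: two eigenvalues exceed 1, and together they account for about 90% of the total variance.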

# 2. Scree plot
pacman::p_load("psych")

scree(satis.X)

# 3. Contribution of the factors
print(satis.fa.promax, digits = 3)

Principal Components Analysis
Call: principal(r = satis.X, nfactors = 2, rotate = "promax", cor = "cor")
Standardized loadings (pattern matrix) based upon correlation matrix
      RC1    RC2    h2     u2  com
x1  0.795  0.255 0.851 0.1485 1.20
x2  0.989 -0.210 0.864 0.1362 1.09
x3  0.908  0.019 0.838 0.1624 1.00
x4 -0.140  1.038 0.986 0.0135 1.04
x5  0.113  0.935 0.968 0.0324 1.03

                        RC1   RC2
SS loadings           2.456 2.051
Proportion Var        0.491 0.410
Cumulative Var        0.491 0.901
Proportion Explained  0.545 0.455
Cumulative Proportion 0.545 1.000

 With component correlations of 
     RC1  RC2
RC1 1.00 0.38
RC2 0.38 1.00

Mean item complexity =  1.1
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.046 
 with the empirical chi square  0.419  with prob <  0.517 

Fit based upon off diagonal values = 0.994

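# 4. Chi-square goodness-of-fit test
satis.fa.ml1 <- fa(satis.X,
                   cor = "cor",          # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                   nfactors = 1,         # number of factors
                   fm = "ml",            # maximum likelihood method
                   rotate = "varimax")   # rotation method
factor.stats(r = satis.X,                # data frame or correlation matrix
             f = satis.fa.ml1)           # factor analysis result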

Call: fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, 
    alpha = alpha, fm = fm)

Test of the hypothesis that 1 factor is sufficient.

The degrees of freedom for the model is 5  and the fit was  2.24 
The number of observations was  10  with Chi Square =  13.04  with prob <  0.023 

Measures of factor score adequacy             
 Correlation of scores with factors            1
Multiple R square of scores with factors       1
Minimum correlation of factor score estimates  0.99

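Result! First, for the null hypothesis "1 factor is sufficient" against the alternative "more factors are needed", \(\chi^2=13.04\) with a \(p\)-value of 0.023. Because the \(p\)-value is smaller than 0.05, the null hypothesis is rejected at the 5% significance level; that is, more than one factor is needed.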

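satis.fa.ml2 <- fa(satis.X,
                   cor = "cor",          # factor analysis based on the correlation matrix; cor = "cov" uses the covariance matrix
                   nfactors = 2,         # number of factors
                   fm = "ml",            # maximum likelihood method
                   rotate = "varimax")   # rotation method
factor.stats(r = satis.X,                # data frame or correlation matrix
             f = satis.fa.ml2)           # factor analysis result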

Call: fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, 
    alpha = alpha, fm = fm)

Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the model is 1  and the fit was  0.18 
The number of observations was  10  with Chi Square =  0.93  with prob <  0.33 

Measures of factor score adequacy             
 Correlation of scores with factors            0.95 1
Multiple R square of scores with factors       0.91 0.99
Minimum correlation of factor score estimates  0.82 0.99

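Result! Next, for the null hypothesis "2 factors are sufficient" against the alternative "more factors are needed", \(\chi^2=0.93\) with a \(p\)-value of 0.33. Because the \(p\)-value is larger than 0.05, the null hypothesis cannot be rejected at the 5% significance level; that is, two factors are adequate.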
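The degrees of freedom and p-values reported for the two tests can be recovered with base R. The sketch below retypes the chi-square statistics from the outputs and uses the usual degrees-of-freedom formula for the likelihood-ratio test of an m-factor model; the names p, m, dof, and p.value are illustrative.

```r
# Test statistics copied from the factor.stats() outputs above
p <- 5                                   # number of observed variables
m <- c(1, 2)                             # number of factors under the null hypothesis
chisq <- c(13.04, 0.93)

# Degrees of freedom: ((p - m)^2 - (p + m)) / 2
dof <- ((p - m)^2 - (p + m)) / 2

# Upper-tail p-values of the chi-square distribution
p.value <- pchisq(chisq, df = dof, lower.tail = FALSE)

print(data.frame(m, dof, chisq, p.value = round(p.value, 3)))
```

Note that the sample contains only 10 observations, so the chi-square approximation behind these p-values should be read with caution.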
