다루는 언어가 영어일 때 전처리 과정을 보여준다.

pacman::p_load("stringr", 
               "tm",    # tm_map/DocumentMatrix
               "RWeka") # 엔그램그램

mytext <- c("This sentence Often appears   The K-pop Lyrics!!", 
            "this \t is a Sentence written in 2021 :)", 
            "All sentences are interesting")

corpus  <- VCorpus(VectorSource(mytext))     # VectorSource : vector를 document로 해석석
corpus

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3

summary(corpus)

  Length Class             Mode
1 2      PlainTextDocument list
2 2      PlainTextDocument list
3 2      PlainTextDocument list

대.소문자 통일

같은 의미를 가진 단어가 대.소문자 다를 경우 다른 단어로 분리된다. 이를 방지하기 위해 같은 의미를 가진 단어들을 대.소문자 통일하는 작업이다.

# 대문자로 시작하는 단어 확인NAmyuppers <- lapply(corpus, function(x){str_extract_all(x$content,
                                       "[[:upper:]]{1}[[:alnum:]]{1,}")})
table(unlist(myuppers))


     All   Lyrics    Often Sentence      The     This 
       1        1        1        1        1        1

대.소문자를 통일해도 큰 혼동이 없기 때문에 소문자로 통일한다.

corpus <- tm_map(corpus, content_transformer(tolower))   # 소문자로 통일NA

숫자표현 제거

텍스트 마이닝에서 숫자표현은 여러가지 의미를 가진다. 숫자 자체가 고유한 의미를 갖는 경우는 단순히 제거하면 안되고, 단지 숫자가 포함된 표현이라면 모든 숫자는 하나로 통일한다. 숫자가 문장에 아무런 의미가 없는 경우는 제거를 한다.

# 숫자표현 확인NAmydigits <- lapply(corpus, function(x){str_extract_all(x$content,
                                         "[[:graph:]]{0,}[[:digit:]]{1,}[[:graph:]]{0,}")})
table(unlist(mydigits))


2021 
   1

숫자에 큰 의미가 없기 때문에 제거한다.

corpus <- tm_map(corpus, removeNumbers)

문장부호 및 특수문자 제거

텍스트 마이닝에서 문장부호(“.”, “,” 등) 및 특수문자(“?”, “!”, “@” 등)은 일반적으로 제거한다. 그러나 똑같은 부호라도 특별한 의미가 있는 경우는 제거할 때 주의를 해야한다.

# 특수문자 확인
mypuncts <- lapply(corpus, function(x){ str_extract_all(x$content,
                                        "[[:graph:]]{0,}[[:punct:]]{1,}[[:graph:]]{0,}")})  

table(unlist(mypuncts))


      :)    k-pop lyrics!! 
       1        1        1

“k-pop”의 특수문자는 삭제하여 “kpop”으로 수정한다.

corpus <- tm_map(corpus, removePunctuation)

공란처리

2개 이상의 공란이 연달아 발견될 경우 해당 공란을 1개로 변환시키는 작업이다.

corpus <- tm_map(corpus, stripWhitespace)

토큰화

수집된 문서들의 집합인 말뭉치(Corpus)를 토큰(Token) 으로 나누는 작업이며, 토큰이 단어일 때는 단어 토큰화, 문장일 때는 문장 토큰화라고 부른다.

lapply(corpus, function(x){str_extract_all(x$content, boundary("word"))})

$`1`
$`1`[[1]]
[1] "this"     "sentence" "often"    "appears"  "the"      "kpop"    
[7] "lyrics"  


$`2`
$`2`[[1]]
[1] "this"     "is"       "a"        "sentence" "written"  "in"      


$`3`
$`3`[[1]]
[1] "all"         "sentences"   "are"         "interesting"

불용어 제거

빈번하게 사용되거나 구체적인 의미를 찾기 어려운 단어를 불용단어 또는 정지단어라 부른다. 영어에서는 “a”, “an”, “the” 등의 관사가 있다. 영어는 "tm" 패키지를 통해 널리 알려진 불용어 목록이 있으나 한국어는 없다.

corpus1 <- tm_map(corpus, removeWords,  words=stopwords("SMART"))

어근 동일화

같은 의미를 가진 단어라도 문법적 기능에 따라 다양한 형태를 가진다. 예를 들어, “go”는 “goes”, “gone” 등과 같이 변할 수 있다. 이 때, 문법적 또는 의미적으로 변화한 단어의 원형을 찾아내는 작업을 말한다.

Stemming

# Porter's Stemmer

corpus1 <- tm_map(corpus1, stemDocument, language="en")  # en = english

Lemmatization

pacman::p_load("textstem")
lapply(corpus, function(x){lemmatize_strings(x$content)})

$`1`
[1] "this sentence often appear the kpop lyric"

$`2`
[1] "this be a sentence write in"

$`3`
[1] "all sentence be interest"

Document-Term Matrix

말뭉치(Corpus)에 대한 전처리를 거친 후 문서×단어 행렬(Document-Term Matrix, DTM)을 구축할 수 있다.

dtm <- DocumentTermMatrix(corpus1)           # 문서x단어 행렬렬
inspect(dtm)

<<DocumentTermMatrix (documents: 3, terms: 6)>>
Non-/sparse entries: 8/10
Sparsity           : 56%
Maximal term length: 8
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs appear interest kpop lyric sentenc written
   1      1        0    1     1       1       0
   2      0        0    0     0       1       1
   3      0        1    0     0       1       0

엔그램

n회 연이어 등장하는 단어들을 특정한 의미를 갖는 하나의 단어로 처리하는 작업이다. 그러나 엔그램이 맹목적으로 받아들이지 않아야 하며, 오히려 데이터의 복잡성을 늘릴 수 있다.

bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=3))

ngram.dtm  <- DocumentTermMatrix(corpus1, control=list(tokenize=bigramTokenizer))
bigramlist <- apply(ngram.dtm[,],2,sum)

sort(bigramlist,decreasing=TRUE)                        #빈도수가 높은 바이그램 램

        appear kpop   appear kpop lyric          kpop lyric 
                  1                   1                   1 
     sentenc appear sentenc appear kpop    sentenc interest 
                  1                   1                   1 
    sentenc written 
                  1

엔그램 사용은 적절하지 않다.

Preprocessing Example : English

대.소문자 통일

숫자표현 제거

문장부호 및 특수문자 제거

공란처리

토큰화

불용어 제거

어근 동일화

Stemming

Lemmatization

Document-Term Matrix

엔그램

Reuse