
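I. Case description:

This case uses Python and R to run a simple sentiment analysis on the short reviews of one film on Douban.

Implementation:

(1) Crawl 500 Douban short reviews with Python.

(Crawling methods: 1. crawl with selenium;

2. copy the cookies from a logged-in session and crawl with the requests library.)

(2) In R, read the text, clean it, segment it into words, score sentiment, and visualize the results.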

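II. Hands-on process:

This case has two parts:

(1) Data acquisition:

[The cookies are copied by the user, after logging in, from the Network panel of the Google Chrome developer tools.]

1. The data crawling code is as follows: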
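The crawling listing itself is not reproduced in this copy of the article. As a stand-in, here is a minimal sketch of the second method (requests plus copied cookies), not the author's code. It assumes that each short review sits in a `<span class="short">` element and that the page accepts `start` and `limit` query parameters; both are assumptions about the site's markup and should be checked in the browser. All function names are hypothetical.

```python
import csv
from html.parser import HTMLParser


class ShortCommentParser(HTMLParser):
    """Collect the text of every <span class="short"> element (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self._buffer = None  # text chunks collected while inside a target span

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "short" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._buffer is not None:
            self.comments.append("".join(self._buffer).strip())
            self._buffer = None


def parse_short_comments(html):
    """Return the list of review texts found in one page of HTML."""
    parser = ShortCommentParser()
    parser.feed(html)
    parser.close()
    return parser.comments


def cookie_header_to_dict(cookie_header):
    """Turn the 'k1=v1; k2=v2' string copied from the browser into a dict."""
    pairs = (item.split("=", 1) for item in cookie_header.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def crawl(base_url, cookie_header, total=500, page_size=20, delay=2.0):
    """Fetch pages until `total` reviews are collected or a page comes back empty."""
    import time
    import requests  # third-party; imported here so the parser works without it

    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0"
    session.cookies.update(cookie_header_to_dict(cookie_header))
    comments = []
    for start in range(0, total, page_size):
        # The start/limit parameter names are an assumption about the site.
        resp = session.get(base_url, params={"start": start, "limit": page_size}, timeout=10)
        resp.raise_for_status()
        page = parse_short_comments(resp.text)
        if not page:
            break
        comments.extend(page)
        time.sleep(delay)  # pause between requests
    return comments[:total]


def save_comments(comments, path):
    """Write one review per row under a 'comment' header."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["comment"])
        writer.writerows([c] for c in comments)
```

The `comment` header matches the column name that the R code in the appendix reads (`text$comment`). A call might look like `save_comments(crawl(url, cookie_string), "text1.csv")`, where `url` is the film's short-review page and `cookie_string` is the value copied from the browser.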

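(2) Sentiment scoring of the data in R:

1. Observations from the hands-on work: traditional Chinese characters affect the sentiment scores, and stop words and segmentation quality affect each sentence's score;

2. Process: data reading, data cleaning, dictionary import, word segmentation, sentiment scoring, word cloud;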

2.1 Data import:

2.2 Data cleaning:

2.3 Dictionary import:

2.4 Word segmentation:

2.5 Sentiment scoring:
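The scoring step counts how many segmented words of a review appear in the positive dictionary and how many in the negative dictionary, subtracts the two counts, and labels the review by the sign of the result. The sketch below restates that logic in Python for illustration only; the article's own implementation is the R function `getscore` in the appendix. The word sets are tiny hypothetical stand-ins for the real dictionaries.

```python
def get_score(tokens, pos_words, neg_words):
    """Count dictionary hits in one segmented review and label it."""
    pos_weight = sum(tok in pos_words for tok in tokens)
    neg_weight = sum(tok in neg_words for tok in tokens)
    total = pos_weight - neg_weight
    # As in the R code, only a strictly positive total is labeled 'Pos'.
    emotion = "Pos" if total > 0 else "Neg"
    return {"pos": pos_weight, "neg": neg_weight, "total": total, "emotion": emotion}


# Hypothetical dictionaries and one already-segmented review.
pos_words = {"精彩", "好看"}
neg_words = {"無聊"}
result = get_score(["劇情", "精彩", "好看", "無聊"], pos_words, neg_words)
print(result)
```

Because a total of 0 falls into the `Neg` branch, reviews with no dictionary hits at all are labeled negative; a separate neutral label would be one way to avoid that.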

2.6 Drawing the word cloud:

Wordfreq:
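The word-frequency table is built by flattening all segmented reviews into one list of words, counting each word, and sorting by count in descending order. A minimal Python sketch of the same idea follows, for illustration only (the article does this in R with `table` and `arrange`); the function name and sample data are hypothetical.

```python
from collections import Counter


def word_frequencies(segmented_comments):
    """Flatten the segmented reviews and return (word, count) pairs, most frequent first."""
    counts = Counter(tok for tokens in segmented_comments for tok in tokens)
    return counts.most_common()


sample = [["好看", "電影"], ["好看"], ["劇情", "電影", "好看"]]
freq = word_frequencies(sample)
print(freq)
```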

Word cloud:

III. Summary:

2. In practice, stop words and segmentation quality turned out to have a large effect on sentence sentiment scores;

[The stop-word list (停用詞) includes stop words (停止詞)]
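One way a stop-word list can change a score is sketched below: if the list happens to contain a word that is also in a sentiment dictionary, that word is filtered out before scoring and no longer counts. This is a hypothetical illustration of the effect, not a reconstruction of the article's data. It assumes stop words are removed before scoring, as a segmenter configured with a stop-word file would do.

```python
def score(tokens, pos_words, neg_words, stop_words):
    """Drop stop words, then score the remaining tokens against the dictionaries."""
    kept = [tok for tok in tokens if tok not in stop_words]
    total = sum(t in pos_words for t in kept) - sum(t in neg_words for t in kept)
    return total, ("Pos" if total > 0 else "Neg")


tokens = ["電影", "很", "好"]          # one segmented review
pos_words, neg_words = {"好"}, {"差"}  # hypothetical dictionaries

narrow = score(tokens, pos_words, neg_words, stop_words={"很"})
broad = score(tokens, pos_words, neg_words, stop_words={"很", "好"})
print(narrow)  # (1, 'Pos')
print(broad)   # (0, 'Neg')
```

The same review flips from `Pos` to `Neg` only because the broader stop-word list removes the one positive word, so it is worth checking the stop-word file against the sentiment dictionaries before scoring.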

3. Result figure:

Appendix: complete code:

#-------------- Load the required R packages:
library(pacman)
p_load(readr, jiebaR, jiebaRD, plyr, stringr, stringi, ggplot2, wordcloud2)

#----------------- Step 1: read the data -------------------
text <- read.table("D:/a情感分析/text1.csv", dec = ",", sep = ",",
                   stringsAsFactors = FALSE, header = TRUE, blank.lines.skip = TRUE)
str(text)  # inspect the column types

#------------------ Step 2: clean the data ------------------
# Only whitespace (spaces, newlines, tabs, etc.) is removed here.
text$comment <- str_replace_all(as.character(text$comment), "\\s+", "")

#------------------ Step 3: read the sentiment dictionaries --------------
# Each dictionary holds words plus a weight: -1 for negative, 1 for positive.
pos <- read.table("D:/a情感分析/tsinghua.positive.gb.txt", header = FALSE,
                  stringsAsFactors = FALSE, strip.white = TRUE, skip = 1, col.names = "words")
pos1 <- read.table("D:/a情感分析/正面評價詞語(中文).txt", header = FALSE,
                   stringsAsFactors = FALSE, strip.white = TRUE, skip = 1, col.names = "words")
pos$weight <- 1
pos1$weight <- 1  # weight for positive sentiment words and positive evaluation words
# Merge the positive sentiment words and evaluation words:
positive <- rbind(pos, pos1)

neg <- read.table("D:/a情感分析/tsinghua.negative.gb.txt", header = FALSE,
                  stringsAsFactors = FALSE, strip.white = TRUE, skip = 1, col.names = "words")
neg1 <- read.table("D:/a情感分析/負面評價詞語(中文).txt", header = FALSE,
                   stringsAsFactors = FALSE, strip.white = TRUE, skip = 1, col.names = "words")
neg$weight <- -1
neg1$weight <- -1
# Merge the negative sentiment words and evaluation words:
negative <- rbind(neg, neg1)
# Stack the positive and negative dictionaries into one data frame, mydict:
mydict <- rbind(positive, negative)

#----------------------- Step 4: word segmentation -----------------
engine <- worker(stop_word = "D:/a情感分析/chineseStopWords.txt")  # set up the segmentation engine
# Add the dictionary words to the engine
new_user_word(engine, mydict$words)
# Segment every comment
segwords <- llply(text$comment, segment, engine)
str(segwords)  # inspect the segmentation result

#----------------------- Step 5: sentiment scoring --------------
# Helper: which words of x appear in the word list y
fun <- function(x, y) x %in% y
getscore <- function(x, pwords, nwords) {
  pos.weight <- sapply(llply(x, fun, pwords), sum)  # positive hits per comment
  neg.weight <- sapply(llply(x, fun, nwords), sum)  # negative hits per comment
  total <- pos.weight - neg.weight
  return(data.frame(pos.weight, neg.weight, total))
}
# Score against the merged positive and negative word lists
score1 <- getscore(segwords, positive$words, negative$words)
# Combine the scores with the comments:
evalu_score1 <- cbind(text, score1)
# Label each comment by whether its total is greater than 0 (a total of 0 is labeled 'Neg'):
evalu.score1 <- transform(evalu_score1, emotion = ifelse(total > 0, 'Pos', 'Neg'))
# Inspect the result:
View(evalu.score1)

# Compute word frequencies
wordfreq <- unlist(segwords)
wordfreq <- as.data.frame(table(wordfreq))
wordfreq <- arrange(wordfreq, desc(Freq))  # sort by frequency, descending
head(wordfreq)
write.csv(wordfreq, "D:/wordart.csv")
# Draw the word cloud:
wordcloud2(wordfreq, size = 1, shape = 'star')
This article is reposted from 學習使我快樂; please support the original!
