s not just because this is a brian depalma film, and since he's a great director and one who's films are always greeted with at least some fanfare. and it's not even because this was a film starring nicolas cage and since he gives a brauvara performance, this film is hardly worth his talents.

Baseline Algorithm (adapted from Pang and Lee)
- Tokenization
- Feature extraction
- Classification using different classifiers:
  - Naïve Bayes
  - MaxEnt
  - SVM

Sentiment Tokenization Issues
- Deal with HTML and XML markup
- Twitter mark-up (names, hash tags)
- Capitalization (preserve for words in all caps)
- Phone numbers, dates
- Emoticons
- Useful code:
  - Christopher Potts sentiment tokenizer
  - Brendan O'Connor twitter tokenizer
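To make a few of these issues concrete, here is a minimal tokenization sketch; the regular expression and the all-caps rule are simplifications of my own, not the behavior of the Potts or O'Connor tokenizers listed above:

    import re

    # Illustrative token pattern: usernames, hashtags, URLs, words
    # (keeping internal apostrophes/hyphens), and punctuation.
    TOKEN_RE = re.compile(r"""
          (?:@[\w_]+)                          # Twitter username
        | (?:\#\w+)                            # hashtag
        | (?:https?://\S+)                     # URL
        | (?:[A-Za-z]+(?:['-][A-Za-z]+)*)      # word with internal ' or -
        | (?:[.,!?;:])                         # punctuation as its own token
        """, re.VERBOSE)

    def tokenize(text):
        tokens = TOKEN_RE.findall(text)
        # Lowercase everything except words written entirely in caps,
        # which are preserved as an emphasis signal.
        return [t if t.isupper() and len(t) > 1 else t.lower() for t in tokens]

    print(tokenize("@bob I LOVED #Inception , didn't you?"))
    # -> ['@bob', 'i', 'LOVED', '#inception', ',', "didn't", 'you', '?']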
Potts emoticons:

    [<>]?                        # optional hat/brow
    [:;=8]                       # eyes
    [\-o\*\']?                   # optional nose
    [\)\]\(\[dDpP/\:\}\{@\|\\]   # mouth
    |                            #### reverse orientation
    [\)\]\(\[dDpP/\:\}\{@\|\\]   # mouth
    [\-o\*\']?                   # optional nose
    [:;=8]                       # eyes
    [<>]?                        # optional hat/brow
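As a quick illustration (my own sketch, not code from Potts's tokenizer), the pattern above can be compiled as-is in Python with re.VERBOSE, which allows the inline comments:

    import re

    # Potts-style emoticon pattern, transcribed from the slide above.
    EMOTICON_RE = re.compile(r"""
        [<>]?                          # optional hat/brow
        [:;=8]                         # eyes
        [\-o\*\']?                     # optional nose
        [\)\]\(\[dDpP/\:\}\{@\|\\]     # mouth
        |                              # reverse orientation
        [\)\]\(\[dDpP/\:\}\{@\|\\]     # mouth
        [\-o\*\']?                     # optional nose
        [:;=8]                         # eyes
        [<>]?                          # optional hat/brow
        """, re.VERBOSE)

    print(EMOTICON_RE.findall("great movie :-) but the ending... :( 8D"))
    # -> [':-)', ':(', '8D']

Matching emoticons as single tokens, rather than letting them be split into punctuation, preserves a strong sentiment cue for the classifier.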
Extracting Features for Sentiment Classification
- How to handle negation:
    "I didn't like this movie"  vs.  "I really like this movie"
- Which words to use?
  - Only adjectives
  - All words
  - All words turns out to work better, at least on this data

Negation
Add NOT_ to every word between negation and following punctuation:

    didn't like this movie , but I
    didn't NOT_like NOT_this NOT_movie but I

Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of EMNLP-2002, 79-86.
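A minimal sketch of this NOT_ prefixing; the list of negation cues and the punctuation set are illustrative choices, not something specified on the slide:

    import re

    # Negation cues (a whole token, or a token ending in n't) and
    # clause-ending punctuation; both lists are illustrative.
    NEGATION_RE = re.compile(r"^(?:not|no|never|nothing|nowhere)$|n't$", re.IGNORECASE)
    PUNCT_RE = re.compile(r"^[.,:;!?]+$")

    def mark_negation(tokens):
        """Prefix NOT_ to every token between a negation cue and the next punctuation token."""
        out, negating = [], False
        for tok in tokens:
            if PUNCT_RE.match(tok):
                negating = False              # punctuation closes the negation scope
                out.append(tok)
            elif negating:
                out.append("NOT_" + tok)
            else:
                out.append(tok)
                if NEGATION_RE.search(tok):   # e.g. "didn't", "not", "never"
                    negating = True
        return out

    print(mark_negation("didn't like this movie , but I".split()))
    # -> ["didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'I']

The doubled vocabulary this creates (like vs. NOT_like) lets a bag-of-words classifier treat negated and non-negated occurrences of a word as different features.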
Reminder: Naïve Bayes

    P(w | c) = (count(w, c) + 1) / (count(c) + |V|)

    c_NB = argmax_{c_j ∈ C}  P(c_j) · ∏_{i ∈ positions} P(w_i | c_j)

Binarized (Boolean feature) Multinomial Naïve Bayes
- Intuition:
  - For sentiment (and probably for other text classification domains),
    word occurrence may matter more than word frequency.
    - The occurrence of the word "fantastic" tells us a lot.
    - The fact that it occurs 5 times may not tell us much more.
- Boolean Multinomial Naïve Bayes:
  - Clips all the word counts in each document at 1.

Boolean Multinomial Naïve Bayes: Learning
- From the training corpus, extract Vocabulary.
- Calculate the P(c_j) terms:
  - For each c_j in C do:
      docs_j ← all docs with class = c_j
      P(c_j) ← |docs_j| / |total # documents|
- Calculate the P(w_k | c_j) terms:
  - Remove duplicates in each doc: for each word type w in doc_j, retain only a single instance of w.
  - Text_j ← single doc containing all docs_j
  - For each word w_k in Vocabulary:
      n_k ← # of occurrences of w_k in Text_j
      P(w_k | c_j) ← (n_k + α) / (n + α·|Vocabulary|)

Boolean Multinomial Naïve Bayes on a test document d
- First remove all duplicate words from d.
- Then compute NB using the same equation:

    c_NB = argmax_{c_j ∈ C}  P(c_j) · ∏_{i ∈ positions} P(w_i | c_j)

Normal vs. Boolean Multinomial NB

Normal:
              Doc   Words                                  Class
    Training  1     Chinese Beijing Chinese                c
              2     Chinese Chinese Shanghai               c
              3     Chinese Macao                          c
              4     Tokyo Japan Chinese                    j
    Test      5     Chinese Chinese Chinese Tokyo Japan    ?

Boolean:
              Doc   Words                                  Class
    Training  1     Chinese Beijing                        c
              2     Chinese Shanghai                       c
              3     Chinese Macao                          c
              4     Tokyo Japan Chinese                    j
    Test      5     Chinese Tokyo Japan                    ?
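To make the comparison concrete, here is a small sketch that trains both variants on the four training documents above and prints log-scores for the test document; it uses the add-1 smoothed estimate from the Reminder slide, and the function names are my own:

    from collections import Counter
    from math import log

    # Training and test documents from the Normal vs. Boolean table above.
    train = [("Chinese Beijing Chinese", "c"),
             ("Chinese Chinese Shanghai", "c"),
             ("Chinese Macao", "c"),
             ("Tokyo Japan Chinese", "j")]
    test = "Chinese Chinese Chinese Tokyo Japan"

    def train_nb(docs, boolean=False):
        """Multinomial NB with add-1 smoothing; boolean=True clips each
        word's count within a document at 1 (Boolean Multinomial NB)."""
        vocab = {w for text, _ in docs for w in text.split()}
        class_docs, word_counts = Counter(), {}
        for text, c in docs:
            class_docs[c] += 1
            counts = Counter(text.split())
            if boolean:
                counts = Counter(set(counts))      # one instance per word type
            word_counts.setdefault(c, Counter()).update(counts)
        priors = {c: n / len(docs) for c, n in class_docs.items()}
        return vocab, priors, word_counts

    def score(text, vocab, priors, word_counts, boolean=False):
        words = set(text.split()) if boolean else text.split()   # de-duplicate test doc too
        scores = {}
        for c in priors:
            total = sum(word_counts[c].values())
            scores[c] = log(priors[c]) + sum(
                log((word_counts[c][w] + 1) / (total + len(vocab)))
                for w in words if w in vocab)
        return scores

    for boolean in (False, True):
        vocab, priors, word_counts = train_nb(train, boolean)
        print("Boolean" if boolean else "Normal ",
              score(test, vocab, priors, word_counts, boolean))

Note that the Boolean variant removes duplicate word tokens both when building the per-class counts and in the test document, mirroring the learning and test-time steps described above.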
Binarized (Boolean feature) Multinomial Naïve Bayes
- Binary seems to work better than full word counts.
- This is not the same as Multivariate Bernoulli Naïve Bayes:
  - MBNB doesn'