Learn NLTK by Watching Videos

 

The following are my notes from the video series NLTK with Python 3 for Natural Language Processing.

You can watch the videos on YouTube, Bilibili, and the author's website: pythonprogramming.net

I use Jupyter Notebook to write and run the Python code; the Python version is 3.4.4.

First, we need to import the nltk module:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

text = r"hello,how are you! I am lightsmile. My github link is www.github.com/smilelight. My personal website is www.iamlightsmile.com"

1. Tokenizing Words and Sentences

Use the sent_tokenize method to split the text into sentences:

sent_tokenize(text)
['hello,how are you!',
 'I am lightsmile.',
 'My github link is www.github.com/smilelight.',
 'My personal website is www.iamlightsmile.com']

Use the word_tokenize method to split the text into words:

word_tokenize(text)
['hello',
 ',',
 'how',
 'are',
 'you',
 '!',
 'I',
 'am',
 'lightsmile',
 '.',
 'My',
 'github',
 'link',
 'is',
 'www.github.com/smilelight',
 '.',
 'My',
 'personal',
 'website',
 'is',
 'www.iamlightsmile.com']

2. Stop Words

Then import the stopwords corpus from the nltk.corpus module.

Stop words are words used very frequently in everyday language but nearly useless for analyzing text, so we remove them before the next steps.

from nltk.corpus import stopwords

example_sentence = "This is an example showing off stop word filtration"
filtered_sentence = [w for w in word_tokenize(example_sentence) if w not in stopwords.words('english')]
filtered_sentence
['This', 'example', 'showing', 'stop', 'word', 'filtration']

3. Stemming

Use PorterStemmer to reduce words to their stems.

In some situations, different expressions carry the same meaning; for example, good, better, and well mean much the same thing in most contexts. So, to simplify a text, we can reduce its words to their stems.

from nltk.stem import PorterStemmer

ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
for w in example_words:
    print(ps.stem(w))
python
python
python
python
pythonli
new_text = "It is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
for w in word_tokenize(new_text):
    print(ps.stem(w))
It
is
veri
import
to
be
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.

4. Part-of-Speech Tagging

Use the pos_tag method to tag each token with its part of speech:

tagged = nltk.pos_tag(word_tokenize(new_text))
print(tagged)
[('It', 'PRP'), ('is', 'VBZ'), ('very', 'RB'), ('important', 'JJ'), ('to', 'TO'), ('be', 'VB'), ('pythonly', 'RB'), ('while', 'IN'), ('you', 'PRP'), ('are', 'VBP'), ('pythoning', 'VBG'), ('with', 'IN'), ('python', 'NN'), ('.', '.'), ('All', 'DT'), ('pythoners', 'NNS'), ('have', 'VBP'), ('pythoned', 'VBN'), ('poorly', 'RB'), ('at', 'IN'), ('least', 'JJS'), ('once', 'RB'), ('.', '.')]
[w for w, t in tagged if t == 'RB']
['very', 'pythonly', 'poorly', 'once']

5. Chunking

The chunk grammar below groups any sequence of optional adverbs (RB), verbs (VB), and proper nouns (NNP) that ends in a noun (NN) into a Chunk subtree.

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>*<NN>}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
print(chunked)
(S
  It/PRP
  is/VBZ
  very/RB
  important/JJ
  to/TO
  be/VB
  pythonly/RB
  while/IN
  you/PRP
  are/VBP
  pythoning/VBG
  with/IN
  (Chunk python/NN)
  ./.
  All/DT
  pythoners/NNS
  have/VBP
  pythoned/VBN
  poorly/RB
  at/IN
  least/JJS
  once/RB
  ./.)
chunked.draw()
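Instead of drawing the tree, you can also iterate over just the Chunk subtrees; a small sketch using Tree.subtrees with a filter:

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)  # prints each (Chunk ...) subtree, e.g. (Chunk python/NN)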

6. Chinking

Chinking is the removal of tokens from a chunk: the chunk pattern selects tokens, and the chink pattern carves unwanted ones back out. (Note that in the output below the NN token is still inside a Chunk; in the usual form of this example the }<NN>{ rule is written under the same Chunk label, which actually excludes the nouns.)

chinkGram = r"""Chunk: {<.*>}
Chink: }<NN>{"""
chinkParser = nltk.RegexpParser(chinkGram)
chinked = chinkParser.parse(tagged)
print(chinked)
(S
  (Chunk It/PRP)
  (Chunk is/VBZ)
  (Chunk very/RB)
  (Chunk important/JJ)
  (Chunk to/TO)
  (Chunk be/VB)
  (Chunk pythonly/RB)
  (Chunk while/IN)
  (Chunk you/PRP)
  (Chunk are/VBP)
  (Chunk pythoning/VBG)
  (Chunk with/IN)
  (Chunk python/NN)
  (Chunk ./.)
  (Chunk All/DT)
  (Chunk pythoners/NNS)
  (Chunk have/VBP)
  (Chunk pythoned/VBN)
  (Chunk poorly/RB)
  (Chunk at/IN)
  (Chunk least/JJS)
  (Chunk once/RB)
  (Chunk ./.))
chinked.draw()

7. Named Entity Recognition

Use ne_chunk to label named entities in the POS-tagged tokens; draw() opens a window showing the resulting tree.

new_text2 = "The Obama, president of the United States, is walking by the Danube with his families. They'll go back home at 7:00 a.m."
tagged2 = nltk.pos_tag(word_tokenize(new_text2))
nameEnt = nltk.ne_chunk(tagged2)
nameEnt.draw()
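ne_chunk also accepts a binary=True flag, which marks every entity simply as NE instead of classifying it as PERSON, GPE, and so on; this often groups multi-word names more reliably:

nameEnt2 = nltk.ne_chunk(tagged2, binary=True)  # all entity types collapsed to NE
nameEnt2.draw()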

8. Lemmatizing

Unlike a stemmer, a lemmatizer returns an actual dictionary word (the lemma).

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
entities = ["cats", "body", "shoes", "python", "shit", "park"]
for entity in entities:
    print(lemmatizer.lemmatize(entity))
cat
body
shoe
python
shit
park
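lemmatize assumes the word is a noun unless you pass a pos argument, and changing it changes the result:

print(lemmatizer.lemmatize("better", pos="a"))   # good (adjective)
print(lemmatizer.lemmatize("running", pos="v"))  # run (verb)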
You can check where the nltk package is installed:
nltk.__file__
'C:\\Program Files\\Anaconda3\\lib\\site-packages\\nltk\\__init__.py'

9. NLTK Corpora

NLTK ships with many corpora. Here we load the raw text of the King James Bible from the Gutenberg corpus and split it into sentences:
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize
sample = gutenberg.raw('bible-kjv.txt')
tok = sent_tokenize(sample)
tok[:5]
['[The King James Bible]\n\nThe Old Testament of the King James Bible\n\nThe First Book of Moses:  Called Genesis\n\n\n1:1 In the beginning God created the heaven and the earth.',
 '1:2 And the earth was without form, and void; and darkness was upon\nthe face of the deep.',
 'And the Spirit of God moved upon the face of the\nwaters.',
 '1:3 And God said, Let there be light: and there was light.',
 '1:4 And God saw the light, that it was good: and God divided the light\nfrom the darkness.']

10. WordNet (an English lexical database)

from nltk.corpus import wordnet
syns = wordnet.synsets("program")
syns
[Synset('plan.n.01'),
 Synset('program.n.02'),
 Synset('broadcast.n.02'),
 Synset('platform.n.02'),
 Synset('program.n.05'),
 Synset('course_of_study.n.01'),
 Synset('program.n.07'),
 Synset('program.n.08'),
 Synset('program.v.01'),
 Synset('program.v.02')]
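Each Synset also exposes its lemmas, a dictionary definition, and usage examples; for instance, for the first synset above:

print(syns[0].lemmas()[0].name())  # the first lemma name, 'plan'
print(syns[0].definition())
print(syns[0].examples())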
word = wordnet.synsets('boy')
synonyms = []
antonyms = []
for w in word:
    for l in w.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            for a in l.antonyms():
                antonyms.append(a.name())
print(set(synonyms))
print(set(antonyms))
{'male_child', 'son', 'boy'}
{'female_child', 'daughter', 'girl'}

The same thing written with list comprehensions:

synonyms2 = set([l.name() for w in word for l in w.lemmas()])
antonyms2 = set([a.name() for w in word for l in w.lemmas() for a in l.antonyms()])
print(synonyms2)
print(antonyms2)
{'male_child', 'son', 'boy'}
{'female_child', 'daughter', 'girl'}
A Synset object is not subscriptable, so the following (mistyped) call raises a TypeError:
word[0]["boy"].antosyns()
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-77-93678c6743d6> in <module>()
----> 1 word[0]["boy"].antosyns()


TypeError: 'Synset' object is not subscriptable
WordNet can also measure how similar two concepts are; wup_similarity computes the Wu-Palmer similarity:
cat = wordnet.synset("cat.n.01")
dog = wordnet.synset("dog.n.01")
dog.wup_similarity(cat)
0.8571428571428571

11. Text Classification

import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

documents[1]
(['did',
  'you',
  'ever',
  'wonder',
  'if',
  'dennis',
  'rodman',
  'was',
  'actually',
  'from',
  'this',
  'planet',
  '?',
  'or',
  'if',
  'sylvester',
  'stallone',
  'was',
  'some',
  'kind',
  'of',
  'weird',
  'extra',
  '-',
  'terrestrial',
  '?',
  'i',
  'used',
  'to',
  'think',
  'that',
  'about',
  'my',
  '7th',
  'grade',
  'english',
  'teacher',
  ',',
  'ms',
  '.',
  'carey',
  '.',
  'but',
  'after',
  'seeing',
  'this',
  'movie',
  ',',
  'they',
  'may',
  'have',
  'confirmed',
  'my',
  'suspicions',
  '.',
  'as',
  'the',
  'story',
  'goes',
  ',',
  'at',
  'any',
  'time',
  ',',
  'there',
  'are',
  'over',
  'a',
  'thousand',
  'aliens',
  'living',
  'among',
  'us',
  'here',
  'on',
  'earth',
  '.',
  'the',
  'men',
  'in',
  'black',
  '(',
  'mib',
  ')',
  'are',
  'the',
  'watchdogs',
  'that',
  'oversee',
  'the',
  'cosmic',
  'citizens',
  ',',
  'guardians',
  'of',
  'our',
  'beloved',
  'planet',
  'from',
  'nasty',
  '-',
  'tempered',
  'aliens',
  ',',
  'and',
  'secret',
  'service',
  'to',
  'the',
  'stars',
  '.',
  'based',
  'in',
  'new',
  'york',
  'city',
  '(',
  'where',
  'weird',
  'is',
  'the',
  'norm',
  ')',
  ',',
  'the',
  'mib',
  'organization',
  'gives',
  'human',
  'form',
  'to',
  'our',
  'space',
  '-',
  'faring',
  'emigrants',
  'so',
  'that',
  'they',
  'may',
  'walk',
  'and',
  'live',
  'among',
  'us',
  'unnoticed',
  '.',
  'but',
  'to',
  'enforce',
  'the',
  'laws',
  'of',
  'earth',
  ',',
  'the',
  'mib',
  'carry',
  'weapons',
  'that',
  'are',
  'powerful',
  'enough',
  'to',
  'meet',
  'or',
  'exceed',
  'destruction',
  'quotas',
  'in',
  'one',
  'single',
  'blast',
  '.',
  'they',
  'carry',
  'other',
  '-',
  'worldly',
  'technology',
  'to',
  'erase',
  'people',
  "'",
  's',
  'short',
  '-',
  'term',
  'memory',
  'when',
  'common',
  'folk',
  'see',
  'the',
  'mib',
  'in',
  'action',
  '.',
  'and',
  'their',
  'best',
  'leads',
  'on',
  'cosmic',
  'things',
  '-',
  'gone',
  '-',
  'awry',
  'are',
  'the',
  'supermarket',
  'tabloids',
  '.',
  'little',
  'do',
  'we',
  'know',
  'that',
  'there',
  'are',
  'much',
  'stronger',
  'battles',
  'of',
  'good',
  'v',
  '.',
  'evil',
  'going',
  'on',
  'in',
  'the',
  'depths',
  'of',
  'space',
  '.',
  'one',
  'of',
  'the',
  'aliens',
  '-',
  'as',
  '-',
  'human',
  'on',
  'this',
  'planet',
  'is',
  'an',
  'important',
  'diplomat',
  'that',
  'is',
  'carrying',
  'something',
  'very',
  'precious',
  '.',
  'it',
  'holds',
  'the',
  "'",
  'key',
  "'",
  ',',
  'literally',
  ',',
  'to',
  'universal',
  'peace',
  '.',
  'a',
  'giant',
  'cockroach',
  '-',
  'like',
  'alien',
  'soon',
  'arrives',
  'on',
  'the',
  'planet',
  'and',
  'steals',
  'this',
  "'",
  'key',
  "'",
  '.',
  'in',
  'the',
  'wrong',
  'alien',
  'hands',
  '(',
  'flippers',
  '?',
  'mandibles',
  '?',
  'tentacles',
  '?',
  ')',
  ',',
  'it',
  'can',
  'be',
  'used',
  'as',
  'a',
  'weapon',
  '.',
  'therefore',
  ',',
  'it',
  'must',
  'be',
  'recovered',
  'and',
  'returned',
  'to',
  'it',
  "'",
  's',
  'rightful',
  'owners',
  '.',
  'otherwise',
  ',',
  'to',
  'ensure',
  'universal',
  'safety',
  ',',
  'earth',
  'will',
  'be',
  'destroyed',
  ',',
  'along',
  'with',
  'the',
  "'",
  'key',
  "'",
  '.',
  'now',
  ',',
  'it',
  "'",
  's',
  'the',
  'mib',
  'who',
  'must',
  'prevent',
  'this',
  'catastrophe',
  '.',
  'the',
  'mib',
  'agents',
  'on',
  'the',
  'case',
  'are',
  '"',
  'k',
  '"',
  ',',
  'played',
  'by',
  'tommy',
  'lee',
  'jones',
  '.',
  'he',
  'is',
  'crustier',
  'than',
  'burnt',
  'toast',
  'and',
  'even',
  'more',
  'serious',
  'than',
  'al',
  'gore',
  '.',
  'the',
  'stars',
  'in',
  'the',
  'sky',
  'no',
  'longer',
  'spark',
  'wonder',
  'in',
  'his',
  'eyes',
  '.',
  'he',
  'is',
  'accompanied',
  'by',
  'a',
  'flippant',
  'rookie',
  ',',
  '"',
  'j',
  '"',
  ',',
  'played',
  'by',
  'will',
  'smith',
  '.',
  'but',
  ',',
  'despite',
  'this',
  'shoot',
  '-',
  'em',
  '-',
  'up',
  ',',
  'protect',
  '-',
  'earth',
  '-',
  'from',
  '-',
  'destruction',
  'premise',
  ',',
  'this',
  'is',
  'nothing',
  'at',
  'all',
  'like',
  'a',
  'typical',
  'summer',
  'action',
  'movie',
  '.',
  'and',
  ',',
  'this',
  'isn',
  "'",
  't',
  'an',
  'independence',
  'day',
  'knockoff',
  '.',
  'rather',
  ',',
  'this',
  'is',
  'a',
  'stylishly',
  'offbeat',
  'sci',
  '-',
  'fi',
  'comedy',
  'that',
  'pokes',
  'fun',
  'at',
  'what',
  'the',
  'government',
  'always',
  'denies',
  '?',
  'that',
  'there',
  'are',
  'real',
  'aliens',
  'that',
  'live',
  'here',
  ',',
  'and',
  'that',
  'the',
  'government',
  'does',
  'its',
  'darndest',
  'to',
  'cover',
  'them',
  'up',
  '.',
  'but',
  'to',
  'give',
  'it',
  'some',
  'sense',
  'of',
  'excitement',
  'and',
  'to',
  'keep',
  'it',
  'within',
  'the',
  'parameters',
  'of',
  'the',
  'summer',
  'movie',
  'recipe',
  ',',
  'there',
  'must',
  'be',
  'some',
  'kind',
  'of',
  'earth',
  '-',
  'hangs',
  '-',
  'in',
  '-',
  'the',
  '-',
  'balance',
  'scenario',
  '.',
  'yet',
  ',',
  'this',
  'movie',
  'is',
  'very',
  'appealing',
  '.',
  'the',
  'abundance',
  'of',
  'wierdness',
  '(',
  'talking',
  'aliens',
  ',',
  'pee',
  '-',
  'wee',
  'atomizers',
  ',',
  'a',
  'mortician',
  'who',
  "'",
  'lives',
  "'",
  'for',
  'her',
  'work',
  ',',
  'and',
  'lots',
  'of',
  'yucky',
  'bugs',
  'and',
  'slime',
  '-',
  'splattering',
  'galore',
  ')',
  ',',
  'is',
  'played',
  'straight',
  ',',
  'like',
  'as',
  'if',
  'this',
  'were',
  'normal',
  '(',
  'of',
  'course',
  ',',
  'we',
  'are',
  'in',
  'nyc',
  ')',
  '.',
  'it',
  'gives',
  'it',
  'a',
  'deadpan',
  'feel',
  ',',
  'which',
  'makes',
  'it',
  'all',
  'the',
  'more',
  'funnier',
  'and',
  'odder',
  '.',
  'jones',
  'plays',
  'the',
  'venerable',
  'seen',
  '-',
  'it',
  '-',
  'all',
  'agent',
  'with',
  'seriousness',
  'and',
  'maturity',
  '.',
  'smith',
  'is',
  'likeable',
  'and',
  'makes',
  'a',
  'great',
  'comic',
  'partner',
  'to',
  'jones',
  "'",
  'straight',
  'man',
  'routine',
  '.',
  'they',
  'click',
  'like',
  'dorothy',
  "'",
  's',
  'ruby',
  'red',
  'shoes',
  '.',
  'the',
  'look',
  'and',
  'feel',
  'of',
  'the',
  'movie',
  'is',
  'made',
  'even',
  'better',
  'with',
  'direction',
  'from',
  'barry',
  'sonnenfeld',
  '(',
  'the',
  'addam',
  "'",
  's',
  'family',
  ')',
  '.',
  'this',
  'guy',
  'has',
  'a',
  'knack',
  'for',
  "'",
  'gothic',
  "'",
  'comedy',
  ',',
  'and',
  'successfully',
  'transfers',
  'his',
  'macabre',
  'sense',
  'of',
  'humor',
  'onto',
  'the',
  'screen',
  '.',
  'and',
  ',',
  'an',
  'appropriate',
  'dose',
  'of',
  'special',
  'effects',
  'helps',
  'to',
  'bolster',
  'the',
  'oddness',
  'of',
  'their',
  'task',
  'without',
  'diverting',
  'attention',
  'from',
  'the',
  'human',
  'actors',
  '.',
  'the',
  'story',
  'moves',
  'well',
  ',',
  'and',
  'before',
  'you',
  'know',
  'it',
  ',',
  'the',
  'end',
  'credits',
  'are',
  'already',
  'rolling',
  '!',
  'the',
  'result',
  'is',
  '100',
  'minutes',
  'worth',
  'of',
  'fun',
  'in',
  'the',
  'form',
  'of',
  'ewwwws',
  'and',
  'blechhhs',
  ',',
  'aaaahhhs',
  'and',
  'wows',
  '.',
  'let',
  'the',
  'men',
  'in',
  'black',
  'protect',
  'and',
  'color',
  'your',
  'world',
  '.'],
 'pos')
all_words = [w.lower() for w in movie_reviews.words()]
all_words
['plot',
 ':',
 'two',
 'teen',
 'couples',
 'go',
 'to',
 'a',
 'church',
 'party',
 ',',
 'drink',
 'and',
 'then',
 'drive',
 '.',
 'they',
 'get',
 'into',
 'an',
 'accident',
 '.',
 'one',
 'of',
 'the',
 'guys',
 'dies',
 ',',
 'but',
 'his',
 'girlfriend',
 'continues',
 'to',
 'see',
 'him',
 'in',
 'her',
 'life',
 ',',
 'and',
 'has',
 'nightmares',
 '.',
 'what',
 "'",
 's',
 'the',
 'deal',
 '?',
 'watch',
 'the',
 'movie',
 'and',
 '"',
 'sorta',
 '"',
 'find',
 'out',
 '.',
 '.',
 '.',
 'critique',
 ':',
 'a',
 'mind',
 '-',
 'fuck',
 'movie',
 'for',
 'the',
 'teen',
 'generation',
 'that',
 'touches',
 'on',
 'a',
 'very',
 'cool',
 'idea',
 ',',
 'but',
 'presents',
 'it',
 'in',
 'a',
 'very',
 'bad',
 'package',
 '.',
 'which',
 'is',
 'what',
 'makes',
 'this',
 'review',
 'an',
 'even',
 'harder',
 'one',
 'to',
 'write',
 ',',
 'since',
 'i',
 'generally',
 'applaud',
 'films',
 'which',
 'attempt',
 'to',
 'break',
 'the',
 'mold',
 ',',
 'mess',
 'with',
 'your',
 'head',
 'and',
 'such',
 '(',
 'lost',
 'highway',
 '&',
 'memento',
 ')',
 ',',
 'but',
 'there',
 'are',
 'good',
 'and',
 'bad',
 'ways',
 'of',
 'making',
 'all',
 'types',
 'of',
 'films',
 ',',
 'and',
 'these',
 'folks',
 'just',
 'didn',
 "'",
 't',
 'snag',
 'this',
 'one',
 'correctly',
 '.',
 'they',
 'seem',
 'to',
 'have',
 'taken',
 'this',
 'pretty',
 'neat',
 'concept',
 ',',
 'but',
 'executed',
 'it',
 'terribly',
 '.',
 'so',
 'what',
 'are',
 'the',
 'problems',
 'with',
 'the',
 'movie',
 '?',
 'well',
 ',',
 'its',
 'main',
 'problem',
 'is',
 'that',
 'it',
 "'",
 's',
 'simply',
 'too',
 'jumbled',
 '.',
 'it',
 'starts',
 'off',
 '"',
 'normal',
 '"',
 'but',
 'then',
 'downshifts',
 'into',
 'this',
 '"',
 'fantasy',
 '"',
 'world',
 'in',
 'which',
 'you',
 ',',
 'as',
 'an',
 'audience',
 'member',
 ',',
 'have',
 'no',
 'idea',
 'what',
 "'",
 's',
 'going',
 'on',
 '.',
 'there',
 'are',
 'dreams',
 ',',
 'there',
 'are',
 'characters',
 'coming',
 'back',
 'from',
 'the',
 'dead',
 ',',
 'there',
 'are',
 'others',
 'who',
 'look',
 'like',
 'the',
 'dead',
 ',',
 'there',
 'are',
 'strange',
 'apparitions',
 ',',
 'there',
 'are',
 'disappearances',
 ',',
 'there',
 'are',
 'a',
 'looooot',
 'of',
 'chase',
 'scenes',
 ',',
 'there',
 'are',
 'tons',
 'of',
 'weird',
 'things',
 'that',
 'happen',
 ',',
 'and',
 'most',
 'of',
 'it',
 'is',
 'simply',
 'not',
 'explained',
 '.',
 'now',
 'i',
 'personally',
 'don',
 "'",
 't',
 'mind',
 'trying',
 'to',
 'unravel',
 'a',
 'film',
 'every',
 'now',
 'and',
 'then',
 ',',
 'but',
 'when',
 'all',
 'it',
 'does',
 'is',
 'give',
 'me',
 'the',
 'same',
 'clue',
 'over',
 'and',
 'over',
 'again',
 ',',
 'i',
 'get',
 'kind',
 'of',
 'fed',
 'up',
 'after',
 'a',
 'while',
 ',',
 'which',
 'is',
 'this',
 'film',
 "'",
 's',
 'biggest',
 'problem',
 '.',
 'it',
 "'",
 's',
 'obviously',
 'got',
 'this',
 'big',
 'secret',
 'to',
 'hide',
 ',',
 'but',
 'it',
 'seems',
 'to',
 'want',
 'to',
 'hide',
 'it',
 'completely',
 'until',
 'its',
 'final',
 'five',
 'minutes',
 '.',
 'and',
 'do',
 'they',
 'make',
 'things',
 'entertaining',
 ',',
 'thrilling',
 'or',
 'even',
 'engaging',
 ',',
 'in',
 'the',
 'meantime',
 '?',
 'not',
 'really',
 '.',
 'the',
 'sad',
 'part',
 'is',
 'that',
 'the',
 'arrow',
 'and',
 'i',
 'both',
 'dig',
 'on',
 'flicks',
 'like',
 'this',
 ',',
 'so',
 'we',
 'actually',
 'figured',
 'most',
 'of',
 'it',
 'out',
 'by',
 'the',
 'half',
 '-',
 'way',
 'point',
 ',',
 'so',
 'all',
 'of',
 'the',
 'strangeness',
 'after',
 'that',
 'did',
 'start',
 'to',
 'make',
 'a',
 'little',
 'bit',
 'of',
 'sense',
 ',',
 'but',
 'it',
 'still',
 'didn',
 "'",
 't',
 'the',
 'make',
 'the',
 'film',
 'all',
 'that',
 'more',
 'entertaining',
 '.',
 'i',
 'guess',
 'the',
 'bottom',
 'line',
 'with',
 'movies',
 'like',
 'this',
 'is',
 'that',
 'you',
 'should',
 'always',
 'make',
 'sure',
 'that',
 'the',
 'audience',
 'is',
 '"',
 'into',
 'it',
 '"',
 'even',
 'before',
 'they',
 'are',
 'given',
 'the',
 'secret',
 'password',
 'to',
 'enter',
 'your',
 'world',
 'of',
 'understanding',
 '.',
 'i',
 'mean',
 ',',
 'showing',
 'melissa',
 'sagemiller',
 'running',
 'away',
 'from',
 'visions',
 'for',
 'about',
 '20',
 'minutes',
 'throughout',
 'the',
 'movie',
 'is',
 'just',
 'plain',
 'lazy',
 '!',
 '!',
 'okay',
 ',',
 'we',
 'get',
 'it',
 '.',
 '.',
 '.',
 'there',
 'are',
 'people',
 'chasing',
 'her',
 'and',
 'we',
 'don',
 "'",
 't',
 'know',
 'who',
 'they',
 'are',
 '.',
 'do',
 'we',
 'really',
 'need',
 'to',
 'see',
 'it',
 'over',
 'and',
 'over',
 'again',
 '?',
 'how',
 'about',
 'giving',
 'us',
 'different',
 'scenes',
 'offering',
 'further',
 'insight',
 'into',
 'all',
 'of',
 'the',
 'strangeness',
 'going',
 'down',
 'in',
 'the',
 'movie',
 '?',
 'apparently',
 ',',
 'the',
 'studio',
 'took',
 'this',
 'film',
 'away',
 'from',
 'its',
 'director',
 'and',
 'chopped',
 'it',
 'up',
 'themselves',
 ',',
 'and',
 'it',
 'shows',
 '.',
 'there',
 'might',
 "'",
 've',
 'been',
 'a',
 'pretty',
 'decent',
 'teen',
 'mind',
 '-',
 'fuck',
 'movie',
 'in',
 'here',
 'somewhere',
 ',',
 'but',
 'i',
 'guess',
 '"',
 'the',
 'suits',
 '"',
 'decided',
 'that',
 'turning',
 'it',
 'into',
 'a',
 'music',
 'video',
 'with',
 'little',
 'edge',
 ',',
 'would',
 'make',
 'more',
 'sense',
 '.',
 'the',
 'actors',
 'are',
 'pretty',
 'good',
 'for',
 'the',
 'most',
 'part',
 ',',
 'although',
 'wes',
 'bentley',
 'just',
 'seemed',
 'to',
 'be',
 'playing',
 'the',
 'exact',
 'same',
 'character',
 'that',
 'he',
 'did',
 'in',
 'american',
 'beauty',
 ',',
 'only',
 'in',
 'a',
 'new',
 'neighborhood',
 '.',
 'but',
 'my',
 'biggest',
 'kudos',
 'go',
 'out',
 'to',
 'sagemiller',
 ',',
 'who',
 'holds',
 'her',
 'own',
 'throughout',
 'the',
 'entire',
 'film',
 ',',
 'and',
 'actually',
 'has',
 'you',
 'feeling',
 'her',
 'character',
 "'",
 's',
 'unraveling',
 '.',
 'overall',
 ',',
 'the',
 'film',
 'doesn',
 "'",
 't',
 'stick',
 'because',
 'it',
 'doesn',
 "'",
 't',
 'entertain',
 ',',
 'it',
 "'",
 's',
 'confusing',
 ',',
 'it',
 'rarely',
 'excites',
 'and',
 'it',
 'feels',
 'pretty',
 'redundant',
 'for',
 'most',
 'of',
 'its',
 'runtime',
 ',',
 'despite',
 'a',
 'pretty',
 'cool',
 'ending',
 'and',
 'explanation',
 'to',
 'all',
 'of',
 'the',
 'craziness',
 'that',
 'came',
 'before',
 'it',
 '.',
 'oh',
 ',',
 'and',
 'by',
 'the',
 'way',
 ',',
 'this',
 'is',
 'not',
 'a',
 'horror',
 'or',
 'teen',
 'slasher',
 'flick',
 '.',
 '.',
 '.',
 'it',
 "'",
 's',
 'just',
 'packaged',
 'to',
 'look',
 'that',
 'way',
 'because',
 'someone',
 'is',
 'apparently',
 'assuming',
 'that',
 'the',
 'genre',
 'is',
 'still',
 'hot',
 'with',
 'the',
 'kids',
 '.',
 'it',
 'also',
 'wrapped',
 'production',
 'two',
 'years',
 'ago',
 'and',
 'has',
 'been',
 'sitting',
 'on',
 'the',
 'shelves',
 'ever',
 'since',
 '.',
 'whatever',
 '.',
 '.',
 '.',
 'skip',
 'it',
 '!',
 'where',
 "'",
 's',
 'joblo',
 'coming',
 'from',
 '?',
 'a',
 'nightmare',
 'of',
 'elm',
 'street',
 '3',
 '(',
 '7',
 '/',
 '10',
 ')',
 '-',
 'blair',
 'witch',
 '2',
 '(',
 '7',
 '/',
 '10',
 ')',
 '-',
 'the',
 'crow',
 '(',
 '9',
 '/',
 '10',
 ')',
 '-',
 'the',
 'crow',
 ':',
 'salvation',
 '(',
 '4',
 '/',
 '10',
 ')',
 '-',
 'lost',
 'highway',
 '(',
 '10',
 '/',
 '10',
 ')',
 '-',
 'memento',
 '(',
 '10',
 '/',
 '10',
 ')',
 '-',
 'the',
 'others',
 '(',
 '9',
 '/',
 '10',
 ')',
 '-',
 'stir',
 'of',
 'echoes',
 '(',
 '8',
 '/',
 '10',
 ')',
 'the',
 'happy',
 'bastard',
 "'",
 's',
 'quick',
 'movie',
 'review',
 'damn',
 'that',
 'y2k',
 'bug',
 '.',
 'it',
 "'",
 's',
 'got',
 'a',
 'head',
 'start',
 'in',
 'this',
 'movie',
 'starring',
 'jamie',
 'lee',
 'curtis',
 'and',
 'another',
 'baldwin',
 'brother',
 '(',
 'william',
 'this',
 'time',
 ')',
 'in',
 'a',
 'story',
 'regarding',
 'a',
 'crew',
 'of',
 'a',
 'tugboat',
 'that',
 'comes',
 'across',
 'a',
 'deserted',
 'russian',
 'tech',
 'ship',
 'that',
 'has',
 'a',
 'strangeness',
 'to',
 'it',
 'when',
 'they',
 'kick',
 'the',
 'power',
 'back',
 'on',
 '.',
 'little',
 'do',
 'they',
 'know',
 'the',
 'power',
 'within',
 '.',
 '.',
 '.',
 'going',
 'for',
 'the',
 'gore',
 'and',
 'bringing',
 'on',
 'a',
 'few',
 'action',
 'sequences',
 'here',
 'and',
 'there',
 ',',
 'virus',
 'still',
 'feels',
 'very',
 'empty',
 ',',
 'like',
 'a',
 'movie',
 'going',
 'for',
 'all',
 'flash',
 'and',
 'no',
 'substance',
 '.',
 'we',
 'don',
 "'",
 't',
 'know',
 'why',
 'the',
 'crew',
 'was',
 'really',
 'out',
 'in',
 ...]

Warning: at first I wrote the code like this: new_all_words = [w for w in all_words if w not in nltk.corpus.stopwords.words('english')], but it would not finish even after I had waited several minutes. The reason is that nltk.corpus.stopwords.words('english') is re-evaluated on every one of the 1,583,820 iterations, apparently re-reading the stop-word list from disk over and over. So when an I/O-backed resource will be used many times, bind it to a variable once.

stopwords = nltk.corpus.stopwords.words('english')
new_all_words = [w for w in all_words if w not in stopwords]
new_all_words = [w for w in new_all_words if w.isalpha()]
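A further speed-up (my own suggestion, not from the video): membership tests on a Python list are linear scans, so converting the stop-word list to a set makes the filter far faster on 1.5 million words.

stopword_set = set(stopwords)  # O(1) membership tests instead of O(n)
new_all_words = [w for w in all_words if w not in stopword_set and w.isalpha()]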
len(all_words)
1583820
len(nltk.corpus.stopwords.words('english'))
179
nltk.corpus.stopwords.words('english')
['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 'not',
 'only',
 'own',
 'same',
 'so',
 'than',
 'too',
 'very',
 's',
 't',
 'can',
 'will',
 'just',
 'don',
 "don't",
 'should',
 "should've",
 'now',
 'd',
 'll',
 'm',
 'o',
 're',
 've',
 'y',
 'ain',
 'aren',
 "aren't",
 'couldn',
 "couldn't",
 'didn',
 "didn't",
 'doesn',
 "doesn't",
 'hadn',
 "hadn't",
 'hasn',
 "hasn't",
 'haven',
 "haven't",
 'isn',
 "isn't",
 'ma',
 'mightn',
 "mightn't",
 'mustn',
 "mustn't",
 'needn',
 "needn't",
 'shan',
 "shan't",
 'shouldn',
 "shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't"]
words_freqlist = nltk.FreqDist(new_all_words)
print(words_freqlist.most_common(10))
[('film', 9517), ('one', 5852), ('movie', 5771), ('like', 3690), ('even', 2565), ('good', 2411), ('time', 2411), ('story', 2169), ('would', 2109), ('much', 2049)]

12. Words as Features for Learning

word_features = list(words_freqlist.keys())[:3000]
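A caveat worth noting: FreqDist.keys() is not guaranteed to be ordered by frequency, so the slice above does not necessarily pick the 3,000 most frequent words. If that is the intent, most_common is the safer call; a sketch:

word_features = [w for w, count in words_freqlist.most_common(3000)]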
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
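As a quick sanity check (my addition), we can build the feature dictionary for a single review; I use 'neg/cv000_29416.txt' here, which should be one of the negative fileids in the corpus, but any entry of movie_reviews.fileids() works:

example_feats = find_features(movie_reviews.words('neg/cv000_29416.txt'))
print(list(example_feats.items())[:5])  # a few (word, present?) pairs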

13. Naive Bayes

training_set = featuresets[:1900]
testing_set = featuresets[1900:]
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo accuracy:",(nltk.classify.accuracy(classifier,testing_set))*100)
Naive Bayes Algo accuracy: 84.0
classifier.show_most_informative_features(15)
Most Informative Features
                   sucks = True              neg : pos    =      8.7 : 1.0
                  annual = True              pos : neg    =      8.2 : 1.0
                 frances = True              pos : neg    =      8.2 : 1.0
           unimaginative = True              neg : pos    =      7.8 : 1.0
                 idiotic = True              neg : pos    =      7.3 : 1.0
              schumacher = True              neg : pos    =      7.1 : 1.0
                    mena = True              neg : pos    =      7.1 : 1.0
               atrocious = True              neg : pos    =      7.1 : 1.0
             silverstone = True              neg : pos    =      7.1 : 1.0
                  suvari = True              neg : pos    =      7.1 : 1.0
                  turkey = True              neg : pos    =      6.7 : 1.0
                  regard = True              pos : neg    =      6.5 : 1.0
                 kidding = True              neg : pos    =      6.4 : 1.0
                  crappy = True              neg : pos    =      6.4 : 1.0
                  shoddy = True              neg : pos    =      6.4 : 1.0

14. Saving the Classifier with Pickle

import pickle
save_classifier = open('naivebayes.pickle',"wb")
pickle.dump(classifier,save_classifier)
save_classifier.close()
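Equivalently, a with block closes the file automatically (my preference; same behavior):

with open('naivebayes.pickle', 'wb') as f:
    pickle.dump(classifier, f)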
classifier_f = open('naivebayes.pickle',"rb")
classifier = pickle.load(classifier_f)
classifier_f.close()
print("Naive Bayes Algo accuracy:",(nltk.classify.accuracy(classifier,testing_set))*100)
classifier.show_most_informative_features(15)
Naive Bayes Algo accuracy: 84.0
Most Informative Features
                   sucks = True              neg : pos    =      8.7 : 1.0
                  annual = True              pos : neg    =      8.2 : 1.0
                 frances = True              pos : neg    =      8.2 : 1.0
           unimaginative = True              neg : pos    =      7.8 : 1.0
                 idiotic = True              neg : pos    =      7.3 : 1.0
              schumacher = True              neg : pos    =      7.1 : 1.0
                    mena = True              neg : pos    =      7.1 : 1.0
               atrocious = True              neg : pos    =      7.1 : 1.0
             silverstone = True              neg : pos    =      7.1 : 1.0
                  suvari = True              neg : pos    =      7.1 : 1.0
                  turkey = True              neg : pos    =      6.7 : 1.0
                  regard = True              pos : neg    =      6.5 : 1.0
                 kidding = True              neg : pos    =      6.4 : 1.0
                  crappy = True              neg : pos    =      6.4 : 1.0
                  shoddy = True              neg : pos    =      6.4 : 1.0

Training again (note that featuresets is not rebuilt below, so the training and test sets are actually unchanged and the accuracy stays the same):

random.shuffle(documents)
all_words = [w.lower() for w in movie_reviews.words()]
new_all_words = [w for w in all_words if w not in stopwords]
new_all_words = [w for w in new_all_words if w.isalpha()]
words_freqlist = nltk.FreqDist(new_all_words)
word_features = list(words_freqlist.keys())[:3000]
training_set = featuresets[:1900]
testing_set = featuresets[1900:]
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo accuracy:",(nltk.classify.accuracy(classifier,testing_set))*100)
Naive Bayes Algo accuracy: 84.0
classifier_f = open('naivebayes.pickle',"rb")
classifier = pickle.load(classifier_f)
classifier_f.close()
print("Naive Bayes Algo accuracy:",(nltk.classify.accuracy(classifier,testing_set))*100)
Naive Bayes Algo accuracy: 84.0

15. Scikit-Learn Incorporation

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:",(nltk.classify.accuracy(MNB_classifier,testing_set))*100)
MNB_classifier accuracy percent: 82.0
# This code is broken and will not run: GaussianNB requires dense arrays,
# but SklearnClassifier passes it a sparse matrix (see the traceback below).
GNB_classifier = SklearnClassifier(GaussianNB())
GNB_classifier.train(training_set)
print("GNB_classifier accuracy percent:",(nltk.classify.accuracy(GNB_classifier,testing_set))*100)
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-149-dbf69e211330> in <module>()
      1 GNB_classifier = SklearnClassifier(GaussianNB())
----> 2 GNB_classifier.train(training_set)
      3 print("GNB_classifier accuracy percent:",(nltk.classify.accuracy(GNB_classifier,testing_set))*100)


C:\Program Files\Anaconda3\lib\site-packages\nltk\classify\scikitlearn.py in train(self, labeled_featuresets)
    117         X = self._vectorizer.fit_transform(X)
    118         y = self._encoder.fit_transform(y)
--> 119         self._clf.fit(X, y)
    120 
    121         return self


C:\Program Files\Anaconda3\lib\site-packages\sklearn\naive_bayes.py in fit(self, X, y, sample_weight)
    180             Returns self.
    181         """
--> 182         X, y = check_X_y(X, y)
    183         return self._partial_fit(X, y, np.unique(y), _refit=True,
    184                                  sample_weight=sample_weight)


C:\Program Files\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    519     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    520                     ensure_2d, allow_nd, ensure_min_samples,
--> 521                     ensure_min_features, warn_on_dtype, estimator)
    522     if multi_output:
    523         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,


C:\Program Files\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    378     if sp.issparse(array):
    379         array = _ensure_sparse_format(array, accept_sparse, dtype, copy,
--> 380                                       force_all_finite)
    381     else:
    382         array = np.array(array, dtype=dtype, order=order, copy=copy)


C:\Program Files\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite)
    241     """
    242     if accept_sparse in [None, False]:
--> 243         raise TypeError('A sparse matrix was passed, but dense '
    244                         'data is required. Use X.toarray() to '
    245                         'convert to a dense numpy array.')


TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)
print("BNB_classifier accuracy percent:",(nltk.classify.accuracy(BNB_classifier,testing_set))*100)
BNB_classifier accuracy percent: 84.0
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:",(nltk.classify.accuracy(LogisticRegression_classifier,testing_set))*100)
LogisticRegression_classifier accuracy percent: 82.0
SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:",(nltk.classify.accuracy(SGDClassifier_classifier,testing_set))*100)
SGDClassifier_classifier accuracy percent: 82.0
SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(training_set)
print("SVC_classifier accuracy percent:",(nltk.classify.accuracy(SVC_classifier,testing_set))*100)
SVC_classifier accuracy percent: 82.0
LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:",(nltk.classify.accuracy(LinearSVC_classifier,testing_set))*100)
LinearSVC_classifier accuracy percent: 80.0
NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:",(nltk.classify.accuracy(NuSVC_classifier,testing_set))*100)
NuSVC_classifier accuracy percent: 82.0

16. Combining Algorithms with a Vote

from nltk.classify import ClassifierI
from statistics import mode
class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf
voted_classifier = VoteClassifier(classifier,
                                  MNB_classifier,
                                  BNB_classifier,
                                  LogisticRegression_classifier,
                                  SGDClassifier_classifier,
                                  # SVC_classifier is omitted, as in the video; including it makes the
                                  # number of voters even, and statistics.mode raises a StatisticsError
                                  # when two values are equally common
                                  # (see http://blog.csdn.net/dongfuguo/article/details/50163757).
                                  LinearSVC_classifier,
                                  NuSVC_classifier)
print("voted_classifier accuracy percent:",(nltk.classify.accuracy(voted_classifier,testing_set))*100)
voted_classifier accuracy percent: 81.0
print("Classification:",voted_classifier.classify(testing_set[0][0]),"Confidence %:",voted_classifier.confidence(testing_set[0][0])*100)
Classification: neg Confidence %: 100.0
print("Classification:",voted_classifier.classify(testing_set[1][0]),"Confidence %:",voted_classifier.confidence(testing_set[1][0])*100)
Classification: pos Confidence %: 100.0
print("Classification:",voted_classifier.classify(testing_set[2][0]),"Confidence %:",voted_classifier.confidence(testing_set[2][0])*100)
Classification: pos Confidence %: 100.0
print("Classification:",voted_classifier.classify(testing_set[3][0]),"Confidence %:",voted_classifier.confidence(testing_set[3][0])*100)
Classification: neg Confidence %: 87.5
print("Classification:",voted_classifier.classify(testing_set[4][0]),"Confidence %:",voted_classifier.confidence(testing_set[4][0])*100)
Classification: pos Confidence %: 100.0
print("Classification:",voted_classifier.classify(testing_set[5][0]),"Confidence %:",voted_classifier.confidence(testing_set[5][0])*100)
Classification: neg Confidence %: 75.0

17. Investigating Bias

18. Better Training Data

short_pos = open("short_reviews/positive.txt","r",encoding="unicode-escape").read()
short_neg = open("short_reviews/negative.txt","r",encoding="unicode-escape").read()
short_pos[:300]
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \nthe gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequ'
documents = []
# map(lambda r: documents.append((r, 'pos')), short_pos.split('\n'))
# I tried a foreach-style map like the line above, but it did not work. Two
# reasons: in Python 3 map is lazy (nothing runs until the iterator is
# consumed), and append takes a single argument, so the pair needs its own
# parentheses.
documents.extend([(r,"pos") for r in short_pos.split('\n')])
# Does the same thing as the loop below; I am not sure which is faster.
for r in short_pos.split('\n'):
    documents.append((r,'pos'))
documents[0]
('the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . ',
 'pos')
documents.extend([(r,"neg") for r in short_neg.split('\n')])
# Again, the same as the loop below.
for r in short_neg.split('\n'):
    documents.append((r,'neg'))
import nltk
# Re-imported only because I came back to this notebook in a later session and
# did not want to re-run the earlier cells.
all_words = []
short_pos_words = nltk.word_tokenize(short_pos)
short_neg_words = nltk.word_tokenize(short_neg)
all_words.extend([w.lower() for w in short_pos_words])
all_words.extend([w.lower() for w in short_neg_words])
# The two extend calls above could also be written as:
all_words = [w.lower() for w in short_pos_words] + [w.lower() for w in short_neg_words]
# or even:
all_words = [w.lower() for w in short_pos_words + short_neg_words]
# Although concatenating first:
all_words = short_pos_words + short_neg_words
# and then lowercasing:
all_words = [w.lower() for w in all_words]
# might be a little more efficient?
stopwords = nltk.corpus.stopwords.words('english')
all_words = [w for w in all_words if w not in stopwords]
# My own addition (not in the video): strip stop words and other noise so that
# feature extraction and training are more efficient.
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:5000]
def find_features(document):
    words = nltk.word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
import random
random.shuffle(featuresets)
training_set = featuresets[:10000]
testing_set = featuresets[10000:]
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo accuracy:",(nltk.classify.accuracy(classifier,testing_set))*100)
Naive Bayes Algo accuracy: 68.82530120481928
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:",(nltk.classify.accuracy(MNB_classifier,testing_set))*100)
MNB_classifier accuracy percent: 67.46987951807229
# As before, this does not work: GaussianNB needs dense input.
GNB_classifier = SklearnClassifier(GaussianNB())
GNB_classifier.train(training_set)
print("GNB_classifier accuracy percent:",(nltk.classify.accuracy(GNB_classifier,testing_set))*100)
BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)
print("BNB_classifier accuracy percent:",(nltk.classify.accuracy(BNB_classifier,testing_set))*100)
BNB_classifier accuracy percent: 68.97590361445783
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:",(nltk.classify.accuracy(LogisticRegression_classifier,testing_set))*100)
LogisticRegression_classifier accuracy percent: 70.78313253012048
SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:",(nltk.classify.accuracy(SGDClassifier_classifier,testing_set))*100)
SGDClassifier_classifier accuracy percent: 66.1144578313253
SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(training_set)
print("SVC_classifier accuracy percent:",(nltk.classify.accuracy(SVC_classifier,testing_set))*100)
SVC_classifier accuracy percent: 49.096385542168676
LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:",(nltk.classify.accuracy(LinearSVC_classifier,testing_set))*100)
LinearSVC_classifier accuracy percent: 70.48192771084338
NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:",(nltk.classify.accuracy(NuSVC_classifier,testing_set))*100)
NuSVC_classifier accuracy percent: 69.7289156626506
from nltk.classify import ClassifierI
from statistics import mode
class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf
voted_classifier = VoteClassifier(classifier,
                                  MNB_classifier,
                                  BNB_classifier,
                                  LogisticRegression_classifier,
                                  SGDClassifier_classifier,
                                  # SVC_classifier is again omitted to keep the number of voters odd
                                  # and avoid a StatisticsError from mode.
                                  LinearSVC_classifier,
                                  NuSVC_classifier)
print("voted_classifier accuracy percent:",(nltk.classify.accuracy(voted_classifier,testing_set))*100)
voted_classifier accuracy percent: 69.42771084337349
print("Classification:",voted_classifier.classify(testing_set[0][0]),"Confidence %:",voted_classifier.confidence(testing_set[0][0])*100)
Classification: pos Confidence %: 100.0

19. Sentiment Analysis Module

all_words = []
documents = []
allowed_word_types = ["J"]  # keep only adjectives (POS tags starting with "J")
for p in short_pos.split('\n'):
    documents.append((p,"pos"))
    words = nltk.word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())

for p in short_neg.split('\n'):
    documents.append((p,"neg"))
    words = nltk.word_tokenize(p)
    neg = nltk.pos_tag(words)
    for w in neg:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())
import pickle

A reminder: running the following cells as-is will raise an error. Create a pickled_algos folder first, then run them.
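You can also create the folder from code instead of by hand; a one-off sketch using the standard library:

import os
os.makedirs('pickled_algos', exist_ok=True)  # does nothing if the folder already exists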

Save the documents:

save_documents = open('pickled_algos/documents.pickle',"wb")
pickle.dump(documents, save_documents)
save_documents.close()

Save the word features:

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:5000]
save_word_features = open('pickled_algos/word_features5k.pickle',"wb")
pickle.dump(word_features,save_word_features)
save_word_features.close()

Save the naive Bayes classifier:

save_classifier = open("pickled_algos/originalnaivebayes5k.pickle","wb")
pickle.dump(classifier,save_classifier)
save_classifier.close()

Save the MultinomialNB classifier:

save_classifier = open("pickled_algos/MNB_classifier5k.pickle","wb")
pickle.dump(MNB_classifier,save_classifier)
save_classifier.close()

Save the BernoulliNB classifier:

save_classifier = open("pickled_algos/BNB_classifier5k.pickle","wb")
pickle.dump(BNB_classifier,save_classifier)
save_classifier.close()

Save the LogisticRegression classifier:

save_classifier = open("pickled_algos/LogisticRegression_classifier5k.pickle","wb")
pickle.dump(LogisticRegression_classifier,save_classifier)
save_classifier.close()

Save the LinearSVC classifier:

save_classifier = open("pickled_algos/LinearSVC_classifier5k.pickle","wb")
pickle.dump(LinearSVC_classifier,save_classifier)
save_classifier.close()

Save the SGDClassifier:

save_classifier = open("pickled_algos/SGDClassifier_classifier5k.pickle","wb")
pickle.dump(SGDClassifier_classifier,save_classifier)
save_classifier.close()
voted_classifier = VoteClassifier(classifier,
                                  LinearSVC_classifier,
                                  MNB_classifier,
                                  BNB_classifier,
                                  LogisticRegression_classifier)
def sentiment(text):
    feats = find_features(text)
    return voted_classifier.classify(feats)

In the end, the module we wrote looks like this:

# File: sentiment_mod.py (the file name itself is arbitrary, but it should be descriptive)

import nltk
import random
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.svm import SVC, LinearSVC,NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize

# Many of the imports above look unused, but they are needed when pickle
# restores the classifier instances for callers of this module.
# Because the classifiers were trained earlier and the documents and word
# features were persisted with pickle, this module simply restores them
# instead of training again.
# The module only works if the pickled_algos folder and all of its pickle
# files sit next to it, and the project must also have the base packages this
# module relies on (nltk and so on) installed.

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

documents_f = open('pickled_algos/documents.pickle',"rb")
documents = pickle.load(documents_f)
documents_f.close()

word_features5k_f = open('pickled_algos/word_features5k.pickle',"rb")
word_features = pickle.load(word_features5k_f)
word_features5k_f.close()

def find_features(document):
    words = nltk.word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

open_file = open("pickled_algos/originalnaivebayes5k.pickle","rb")
classifier = pickle.load(open_file)
open_file.close()

open_file = open("pickled_algos/MNB_classifier5k.pickle","rb")
MNB_classifier = pickle.load(open_file)
open_file.close()

open_file = open("pickled_algos/BNB_classifier5k.pickle","rb")
BNB_classifier = pickle.load(open_file)
open_file.close()

open_file = open("pickled_algos/LogisticRegression_classifier5k.pickle","rb")
LogisticRegression_classifier = pickle.load(open_file)
open_file.close()

open_file = open("pickled_algos/LinearSVC_classifier5k.pickle","rb")
LinearSVC_classifier = pickle.load(open_file)
open_file.close()

open_file = open("pickled_algos/SGDClassifier_classifier5k.pickle","rb")
SGDClassifier_classifier = pickle.load(open_file)
open_file.close()

voted_classifier = VoteClassifier(
    classifier,
    LinearSVC_classifier,
    MNB_classifier,
    BNB_classifier,
    LogisticRegression_classifier)

def sentiment(text):
    feats = find_features(text)
    return voted_classifier.classify(feats), voted_classifier.confidence(feats)

# save me as sentiment_mod.py

Now let's try it out:

import sentiment_mod as s

print(s.sentiment("This movie was awesome! The acting was great, plot was wonderful, and there were pythons...so yea!"))

print(s.sentiment("This movie was utter junk. There were absolutely 0 pythons. I don't see what the point was at all. Horrible movie, 0/10"))
('pos', 1.0)
('neg', 1.0)

The remaining parts of the series require creating a Twitter app, and possibly a personal website, which is a bit of a hassle, so I only watched them without coding along.
All in all, typing along with this series gave me a first, if shallow, grasp of natural language processing.

20. Twitter Sentiment Analysis

21. Graphing Live Twitter Sentiment
