python nltk 学习笔记(2)

Example Description
fileids() the files of the corpus
fileids([categories]) the files of the corpus corresponding to these categories
categories() the categories of the corpus
categories([fileids]) the categories of the corpus corresponding to these files
raw() the raw content of the corpus
raw(fileids=[f1,f2,f3]) the raw content of the specified files
raw(categories=[c1,c2]) the raw content of the specified categories
words() the words of the whole corpus
words(fileids=[f1,f2,f3]) the words of the specified fileids
words(categories=[c1,c2]) the words of the specified categories
sents() the sentences of the whole corpus
sents(fileids=[f1,f2,f3]) the sentences of the specified fileids
sents(categories=[c1,c2]) the sentences of the specified categories
abspath(fileid) the location of the given file on disk
encoding(fileid) the encoding of the file (if known)
open(fileid) open a stream for reading the given corpus file
root() the path to the root of locally installed corpus
readme() the contents of the README file of the corpus

 

Load your own corpus
>>>
from nltk.corpus import PlaintextCorpusReader >>> corpus_root = ‘/usr/share/dict‘ >>> wordlists = PlaintextCorpusReader(corpus_root, ‘.*‘) >>> wordlists.fileids()

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)
Set:
Operation Equivalent Result
len(s)   cardinality of set s
x in s   test x for membership in s
x not in s   test x for non-membership in s
s.issubset(t) s <= t test whether every element in s is in t
s.issuperset(t) s >= t test whether every element in t is in s
s.union(t) s | t new set with elements from both s and t
s.intersection(t) s & t new set with elements common to s and t
s.difference(t) s - t new set with elements in s but not in t
s.symmetric_difference(t) s ^ t new set with elements in either s or t but not both
s.copy()   new set with a shallow copy of s

 

>>> from nltk.corpus import stopwords

>>> stopwords.words(‘english‘)

 

WordNet:

>>> from nltk.corpus import wordnet as wn

>>> wn.synsets(‘motorcar‘)

python nltk 学习笔记(2),布布扣,bubuko.com

python nltk 学习笔记(2)

上一篇:python nltk 学习笔记(3) processing raw text


下一篇:Java compiler level does not match the version of the installed Java project facet. springmvc1 和 Target runtime Apache Tomcat v7.0 is not defined.