Example | Description |
---|---|
fileids() | the files of the corpus |
fileids([categories]) | the files of the corpus corresponding to these categories |
categories() | the categories of the corpus |
categories([fileids]) | the categories of the corpus corresponding to these files |
raw() | the raw content of the corpus |
raw(fileids=[f1,f2,f3]) | the raw content of the specified files |
raw(categories=[c1,c2]) | the raw content of the specified categories |
words() | the words of the whole corpus |
words(fileids=[f1,f2,f3]) | the words of the specified fileids |
words(categories=[c1,c2]) | the words of the specified categories |
sents() | the sentences of the whole corpus |
sents(fileids=[f1,f2,f3]) | the sentences of the specified fileids |
sents(categories=[c1,c2]) | the sentences of the specified categories |
abspath(fileid) | the location of the given file on disk |
encoding(fileid) | the encoding of the file (if known) |
open(fileid) | open a stream for reading the given corpus file |
root() | the path to the root of the locally installed corpus |
readme() | the contents of the README file of the corpus |
Load your own corpus
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
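To try the reader without depending on `/usr/share/dict`, here is a minimal sketch that builds a throwaway corpus in a temporary directory. The file names, their contents, and the helper `build_demo_corpus` are made up for illustration; only `PlaintextCorpusReader`, `fileids()`, `words()`, and `raw()` come from NLTK, and the sketch assumes NLTK is installed.

```python
import tempfile
from pathlib import Path

from nltk.corpus import PlaintextCorpusReader


def build_demo_corpus(root: Path) -> PlaintextCorpusReader:
    """Write two small text files under root and wrap them in a reader."""
    # Hypothetical demo files; any plain-text files would do.
    (root / "a.txt").write_text("Hello world. This is a test.", encoding="utf-8")
    (root / "b.txt").write_text("Another file here.", encoding="utf-8")
    # The second argument is a regular expression matched against file names.
    return PlaintextCorpusReader(str(root), r".*\.txt")


with tempfile.TemporaryDirectory() as tmp:
    corpus = build_demo_corpus(Path(tmp))
    demo_fileids = corpus.fileids()             # file names relative to the root
    demo_words = list(corpus.words("a.txt"))    # tokens of a single file
    demo_raw = corpus.raw("b.txt")              # unprocessed text of a single file

print(demo_fileids)
print(demo_words)
print(demo_raw)
```

Everything is read inside the `with` block because the files disappear once the temporary directory is cleaned up.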
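import nltk

def unusual_words(text):
    # Lowercased alphabetic tokens that occur in the text
    text_vocab = set(w.lower() for w in text if w.isalpha())
    # Lowercased entries of the NLTK Words corpus
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    # Words in the text that are missing from the English word list
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)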
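The same idea can be tried without the Words corpus by passing the reference vocabulary in explicitly. This is only a sketch: `unusual_words_against`, the token list, and the toy vocabulary are hypothetical names and data used for illustration.

```python
def unusual_words_against(text, vocabulary):
    """Return the sorted alphabetic words of text that are not in vocabulary."""
    text_vocab = {w.lower() for w in text if w.isalpha()}
    known = {w.lower() for w in vocabulary}
    # Set difference keeps only the words missing from the reference list.
    return sorted(text_vocab - known)


# Hypothetical toy data: two misspellings, one punctuation mark, one number.
tokens = ["The", "quick", "brwn", "fox", ",", "jumpd", "over", "2", "dogs"]
toy_vocab = ["the", "quick", "brown", "fox", "jumped", "over", "dogs"]

result = unusual_words_against(tokens, toy_vocab)
print(result)  # ['brwn', 'jumpd']
```

The punctuation mark and the number are dropped by `isalpha()` before the comparison, so only the misspelled words remain.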
Set:
Operation | Equivalent | Result |
---|---|---|
len(s) | | cardinality of set s |
x in s | | test x for membership in s |
x not in s | | test x for non-membership in s |
s.issubset(t) | s <= t | test whether every element in s is in t |
s.issuperset(t) | s >= t | test whether every element in t is in s |
s.union(t) | s | t | new set with elements from both s and t |
s.intersection(t) | s & t | new set with elements common to s and t |
s.difference(t) | s - t | new set with elements in s but not in t |
s.symmetric_difference(t) | s ^ t | new set with elements in either s or t but not both |
s.copy() | | new set with a shallow copy of s |
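As a quick check of the table, the sketch below runs each operation on two small sets; the sets `s` and `t` are arbitrary example values.

```python
s = {"cat", "dog", "fish"}
t = {"dog", "fish", "bird"}

size = len(s)                       # cardinality of s
has_cat = "cat" in s                # membership test
union = s | t                       # same as s.union(t)
common = s & t                      # same as s.intersection(t)
only_in_s = s - t                   # same as s.difference(t)
in_one_only = s ^ t                 # same as s.symmetric_difference(t)
is_subset = common <= s             # same as common.issubset(s)
is_superset = union >= t            # same as union.issuperset(t)

clone = s.copy()                    # shallow copy: a new, independent set object
clone.add("horse")                  # does not change s

print(size, has_cat, is_subset, is_superset)
print(sorted(union), sorted(common), sorted(only_in_s), sorted(in_one_only))
```

The operator forms require both operands to be sets, while the method forms such as `s.union(t)` also accept any iterable.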
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
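A common use of a stopword list is to measure how much of a text consists of content words. The sketch below takes the stopword list as a parameter so it runs without the corpus download; the two-word list and the tokens are made up, and in practice the list would presumably come from `stopwords.words('english')`.

```python
def content_fraction(tokens, stopword_list):
    """Return the fraction of tokens that are not stopwords."""
    if not tokens:
        return 0.0
    stop = set(stopword_list)  # set lookup is faster than scanning a list
    content = [w for w in tokens if w.lower() not in stop]
    return len(content) / len(tokens)


# Hypothetical toy data for illustration only.
toy_tokens = ["the", "cat", "sat", "on", "the", "mat"]
toy_stopwords = ["the", "on"]

fraction = content_fraction(toy_tokens, toy_stopwords)
print(fraction)  # 0.5
```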
WordNet:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')