ipyparallel WordCount Implementation

In ipyparallel, a single task can be spread across several engines running in parallel to speed up processing. The cluster is exposed through views, of which there are two kinds: a DirectView and a LoadBalancedView. A DirectView addresses all of the engines directly (a specific subset of engines can also be selected), while a LoadBalancedView puts a load balancer in front of the engines and presents them as if they were one "single" engine. The basic approach to parallelizing work with ipyparallel is to split the input data, distribute the pieces to the engines, and then merge the partial results into the final answer, much in the spirit of MapReduce.
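As a quick illustration (a minimal sketch, assuming an ipyparallel cluster with a few local engines has already been started), the two kinds of view are obtained from a Client like this:

from ipyparallel import Client

rc = Client()                      # connect to the running cluster
dview = rc[:]                      # DirectView over all engines
dview_pair = rc[0:2]               # DirectView over a chosen subset of engines
lview = rc.load_balanced_view()    # LoadBalancedView: work goes to whichever engine is free
print(rc.ids)                      # ids of the registered engines, e.g. [0, 1, 2, 3]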
Below is a parallel wordcount built on ipyparallel. The main idea: first read the sentences from a file; use the DirectView's scatter method to split the sentences into n chunks and send them to the engines, exactly one chunk per engine; then count word frequencies on each engine over its own chunk; finally merge the results from all of the engines.
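The key mechanism in the parallel version is the combination of scatter and apply with a Reference: scatter pushes a slice of a sequence into a variable on each engine, and Reference lets apply hand each engine its own slice. A minimal sketch of that pattern (the function name partial_sum and the list of numbers are illustrative only):

from ipyparallel import Client, Reference

def partial_sum(xs):
    return sum(xs)

rc = Client()
dview = rc[:]
# each engine receives a contiguous slice of the list in an engine-local variable 'chunk'
dview.scatter('chunk', list(range(8)), flatten=True)
# run partial_sum on every engine; Reference('chunk') resolves to that engine's own slice
ar = dview.apply(partial_sum, Reference('chunk'))
print(ar.get())   # one partial result per engine, e.g. [1, 5, 9, 13] on a 4-engine cluster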
#!/usr/bin/env python
# coding: utf-8
import time
from itertools import repeat
from ipyparallel import Client, Reference

# Count word frequencies in a single text string
def wordfreq(text):
    """Return a dictionary of words and word counts in a string."""
    freqs = {}
    for word in text.split():
        lword = word.lower()
        freqs[lword] = freqs.get(lword, 0) + 1
    return freqs

# Print the n most frequent words and their counts
def print_wordfreq(freqs, n=10):
    """Print the n most common words and counts in the freqs dict."""
    words, counts = freqs.keys(), freqs.values()
    items = sorted(zip(counts, words), reverse=True)
    for (count, word) in items[:n]:
        print(word, count)

# Per-engine worker for the parallel version: count word frequencies
# over a list of lines and return {word: count} pairs
def myword_freq(texts):
    freqs = {}
    for line in texts:
        for word in line.split():
            lword = word.lower()
            freqs[lword] = freqs.get(lword, 0) + 1
    return freqs

# Hand-rolled parallel wordfreq: scatter the lines to every engine, run
# myword_freq on each engine, then merge the per-engine {word: count} dicts
def myPwordfreq(view, lines):
    # Distribute the lines evenly across the engines
    view.scatter('texts', lines, flatten=True)
    ar = view.apply(myword_freq, Reference('texts'))
    freqs_list = ar.get()
    # Merge the per-engine results (reduce it!)
    word_set = set()
    for f in freqs_list:
        word_set.update(f.keys())
    freqs = dict(zip(word_set, repeat(0)))
    for f in freqs_list:
        for word, count in f.items():
            freqs[word] += count
    return freqs

if __name__ == '__main__':
    # Create a Client and View
    rc = Client()
    dview = rc[:]
    # Run the serial version
    print("Serial word frequency count:")
    text = open('lines.txt').read()
    tic = time.time()
    freqs = wordfreq(text)
    toc = time.time()
    print_wordfreq(freqs, 10)
    print("Took %.3f s to calculate" % (toc - tic))
    # The parallel version
    print("\nParallel word frequency count:")
    lines = text.splitlines()
    tic = time.time()
    pfreqs = myPwordfreq(dview, lines)
    toc = time.time()
    print_wordfreq(pfreqs)
    print("Took %.3f s to calculate" % (toc - tic))