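I'm using the usaddress library to parse addresses from a set of files I have. I'd like my final output to be a data frame whose column names represent the parts of an address (e.g. street, city, state) and whose rows represent the addresses I've extracted. For example:
Say I have a list of addresses: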
addr = ['123 Pennsylvania Ave NW Washington DC 20008',
'652 Polk St San Francisco, CA 94102',
'3711 Travis St #800 Houston, TX 77002']
Then I parse them with usaddress:
import usaddress
info = [usaddress.parse(loc) for loc in addr]
The result, info, is a list of lists of tuples that looks like this:
[[('123', 'AddressNumber'),
('Pennsylvania', 'StreetName'),
('Ave', 'StreetNamePostType'),
('NW', 'StreetNamePostDirectional'),
('Washington', 'PlaceName'),
('DC', 'StateName'),
('20008', 'ZipCode')],
[('652', 'AddressNumber'),
('Polk', 'StreetName'),
('St', 'StreetNamePostType'),
('San', 'PlaceName'),
('Francisco,', 'PlaceName'),
('CA', 'StateName'),
('94102', 'ZipCode')],
[('3711', 'AddressNumber'),
('Travis', 'StreetName'),
('St', 'StreetNamePostType'),
('#', 'OccupancyIdentifier'),
('800', 'OccupancyIdentifier'),
('Houston,', 'PlaceName'),
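I'd like each list (there are 3 lists in the object info) to represent a row, with the second value of each tuple giving the column and the first value giving the cell value. Note: the inner lists are not always the same length, because not every address contains every piece of information.
Any help would be much appreciated!
Thanks
Solution: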
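I'm not sure there is a DataFrame constructor that can handle info exactly as it stands. (Maybe from_records or from_items? I still don't think the structure is directly compatible.)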
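To illustrate why the structure is not directly compatible, here is a minimal sketch that feeds one parsed address straight into from_records. The list first is copied by hand from the parsed output in the question, and the column names value and label are made up for this example. (As far as I know, from_items was removed in pandas 1.0, so only from_records is tried here.)

```python
import pandas as pd

# First parsed address, copied from the output shown in the question.
first = [
    ('123', 'AddressNumber'),
    ('Pennsylvania', 'StreetName'),
    ('Ave', 'StreetNamePostType'),
    ('NW', 'StreetNamePostDirectional'),
    ('Washington', 'PlaceName'),
    ('DC', 'StateName'),
    ('20008', 'ZipCode'),
]

# from_records treats every tuple as one row, so a single address turns
# into several (value, label) rows instead of one row with many columns.
# The column names 'value' and 'label' are hypothetical.
long_form = pd.DataFrame.from_records(first, columns=['value', 'label'])
print(long_form)
```

Each tuple ends up as its own row, so the labels land in a column rather than becoming the column names, which is the opposite of the layout asked for.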
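A little manipulation gets you what you need:
import pandas as pd

# Column names come from the labels of the first parsed address, so this
# assumes every address yields the same labels in the same order.
cols = [j for _, j in info[0]]
# Could use nested list comprehension here, but this is probably
# more readable.
info2 = []
for row in info:
    info2.append([i for i, _ in row])
pd.DataFrame(info2, columns=cols)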
  AddressNumber    StreetName StreetNamePostType StreetNamePostDirectional   PlaceName StateName ZipCode
0           123  Pennsylvania                Ave                        NW  Washington        DC   20008
1           652          Polk                 St                       San  Francisco,        CA   94102
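Note that the positional approach only lines up when every address yields the same labels in the same order: in row 1 above, "San" ends up under StreetNamePostDirectional. A possible alternative, sketched here under my own assumptions, is to collapse each parsed address into a {label: value} dict and let pandas align the columns by name. The helper to_record, the choice to join repeated labels with a space, and the stripping of trailing commas are all assumptions of this example, not something usaddress or pandas does for you. The info literal is copied by hand from the question, with the third address kept only as far as it is shown there.

```python
from collections import defaultdict

import pandas as pd

# Parsed output copied from the question; the third entry is cut off
# there and is used as-is.
info = [
    [('123', 'AddressNumber'), ('Pennsylvania', 'StreetName'),
     ('Ave', 'StreetNamePostType'), ('NW', 'StreetNamePostDirectional'),
     ('Washington', 'PlaceName'), ('DC', 'StateName'), ('20008', 'ZipCode')],
    [('652', 'AddressNumber'), ('Polk', 'StreetName'),
     ('St', 'StreetNamePostType'), ('San', 'PlaceName'),
     ('Francisco,', 'PlaceName'), ('CA', 'StateName'), ('94102', 'ZipCode')],
    [('3711', 'AddressNumber'), ('Travis', 'StreetName'),
     ('St', 'StreetNamePostType'), ('#', 'OccupancyIdentifier'),
     ('800', 'OccupancyIdentifier'), ('Houston,', 'PlaceName')],
]


def to_record(parsed):
    """Collapse one parsed address into a {label: value} dict (hypothetical helper)."""
    parts = defaultdict(list)
    for value, label in parsed:
        # Assumption: trailing commas are punctuation, not part of the value.
        parts[label].append(value.rstrip(','))
    # Assumption: tokens that share a label belong together, e.g. "San Francisco".
    return {label: ' '.join(values) for label, values in parts.items()}


records = [to_record(parsed) for parsed in info]

# A list of dicts is aligned by key, so labels missing from an address
# become NaN instead of shifting the other values sideways.
df = pd.DataFrame(records)
print(df)
```

With this layout, inner lists of different lengths are no longer a problem, since each value is placed by its label rather than by its position. It may also be worth checking whether usaddress.tag fits your case, since it is meant to merge consecutive tokens that share a label.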