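Parsing HTML can be done with the Hpple framework, which uses XPath to look up HTML tags by their attributes and elements.
w3school has an introduction to XPath.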
// Given a URL string path, load the page's HTML source as a string
NSString *title = [NSString stringWithContentsOfURL:[NSURL URLWithString:path] encoding:NSUTF8StringEncoding error:nil];

NSData *dataTitle = [title dataUsingEncoding:NSUTF8StringEncoding];

// Parse the HTML with Hpple
TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:dataTitle];

// Select every <a> that is a direct child of a <p class="left">
NSArray *elements = [xpathParser searchWithXPathQuery:@"//p[@class='left']/a"];

for (TFHppleElement *element in elements) {
    // attributes is a dictionary of the tag's attributes (here href and title)
    NSDictionary *elementContent = [element attributes];
    // NSLog(@"%@", elementContent);

    // data is assumed to be an NSMutableArray declared elsewhere
    [data addObject:elementContent];
}
For example:
<p class="left"><a href="w4688.html" title="篮球战术"><img src="uploads/201310/1382623380pZkCHQEt_s.jpg" class="docimgmax" /></a></p>
The array elements returned by searchWithXPathQuery holds the <a> child tag of every <p> paragraph on the page whose class is left:
<a href="w4688.html" title="篮球战术">
This query does not return the <img> tag nested inside that <a>:
<img src="uploads/201310/1382623380pZkCHQEt_s.jpg" class="docimgmax" />
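To get the <img> tag, note that it is a child of <a>, not a direct child of <p>, so the searchWithXPathQuery path has to go through <a>:
[xpathParser searchWithXPathQuery:@"//p[@class='left']/a/img"]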
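As a rough sketch of the point above: in the example markup, <img> sits inside <a>, so a path that reaches it passes through <a> (the descendant form //p[@class='left']//img would also match). The helper name ImageSourcesInHTMLData and its parameter are made up for illustration and are not part of Hpple. It assumes the Hpple headers are in the project and that htmlData is built the same way as dataTitle in the listing.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper (not part of Hpple): collects the src attribute of
// every <img> found under <p class="left"><a>...</a></p>.
static NSArray<NSString *> *ImageSourcesInHTMLData(NSData *htmlData) {
    TFHpple *parser = [[TFHpple alloc] initWithHTMLData:htmlData];

    // p -> a -> img: each "/" steps down exactly one level.
    NSArray *imgElements =
        [parser searchWithXPathQuery:@"//p[@class='left']/a/img"];

    NSMutableArray<NSString *> *sources = [NSMutableArray array];
    for (TFHppleElement *img in imgElements) {
        // objectForKey: returns the value of a single attribute, or nil.
        NSString *src = [img objectForKey:@"src"];
        if (src != nil) {
            [sources addObject:src];
        }
    }
    return sources;
}
```

For the example paragraph, the returned array would be expected to contain the uploads/... path from the src attribute.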
Iterating over the array elements (which holds the <a> child tags of every <p> with class left on the page) yields dictionaries like these:
{
href = "w4688.html";
title = "篮球战术";
}
{
href = "w4612.html";
title = "王小飞";
}
{
href = "w1233.html";
title = "ios开发";
}
.......
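Once the dictionaries are collected, their values can be read back by key. The following is only a sketch: the function name LogLinks and the baseURL parameter are assumptions made for illustration. Because the href values are relative, the sketch resolves them against the URL of the page that was parsed.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: data is the array of attribute dictionaries built
// in the loop, baseURL is the URL of the page that was parsed.
static void LogLinks(NSArray<NSDictionary *> *data, NSURL *baseURL) {
    for (NSDictionary *attrs in data) {
        NSString *href = attrs[@"href"];
        NSString *title = attrs[@"title"];
        if (href == nil) {
            continue; // skip <a> tags that have no href attribute
        }
        // Turn a relative link such as "w4688.html" into an absolute URL.
        NSURL *link = [[NSURL URLWithString:href relativeToURL:baseURL] absoluteURL];
        NSLog(@"%@ -> %@", title, link);
    }
}
```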