Here is the basic code I use to extract sublinks from a page:
<?php
include_once('simple_html_dom.php');

// Collect the href attribute of every <a> element on the page and dump the list.
function extract_links($target_url)
{
    $html = new simple_html_dom();
    $html->load_file($target_url);
    $crawl = array();
    foreach ($html->find('a') as $link) {
        // href is returned exactly as written in the page markup
        $crawl[] = $link->href;
    }
    var_dump($crawl);
}

extract_links('https://*.com');
?>
The output is as follows:
array
0 => string 'http://stackexchange.com' (length=24)
1 => string '/users/login' (length=12)
2 => string 'http://careers.*.com' (length=32)
3 => string 'http://chat.*.com' (length=29)
4 => string 'http://meta.*.com' (length=29)
5 => string '/about' (length=6)
6 => string '/faq' (length=4)
7 => string '/' (length=1)
8 => string '/questions' (length=10)
9 => string '/tags' (length=5)
10 => string '/users' (length=6)
11 => string '/badges' (length=7)
12 => string '/unanswered' (length=11)
13 => string '/questions/ask' (length=14)
14 => string '?tab=interesting' (length=16)
15 => string '?tab=featured' (length=13)
16 => string '?tab=hot' (length=8)
17 => string '?tab=week' (length=9)
18 => string '?tab=month' (length=10)
19 => string '/questions/14611052/basic-standalone-jpa-example-with-postgres-using-eclipse' (length=76)
20 => string '/questions/tagged/eclipse' (length=25)
21 => string '/questions/tagged/postgresql' (length=28)
22 => string '/questions/tagged/jpa' (length=21)
23 => string '/questions/14611052/basic-standalone-jpa-example-with-postgres-using-eclipse' (length=76)
24 => string '/users/865448/tostao' (length=20)
25 => string '/questions/14611172/unable-to-fully-print-a-page-containing-iframes-in-chrome' (length=77)
26 => string '/questions/tagged/javascript' (length=28)
27 => string '/questions/tagged/jquery' (length=24)
28 => string '/questions/tagged/html' (length=22)
29 => string '/questions/tagged/html5' (length=23)
30 => string '/questions/tagged/google-chrome' (length=31)
31 => string '/questions/14611172/unable-to-fully-print-a-page-containing-iframes-in-chrome' (length=77)
32 => string '/users/962868/tejas' (length=19)
33 => string '/questions/14609779/how-can-i-configure-bash-to-handle-crlf-shell-scripts' (length=73)
34 => string '/questions/tagged/linux' (length=23)
35 => string '/questions/tagged/windows' (length=25)
36 => string '/questions/tagged/bash' (length=22)
37 => string '/questions/tagged/line-endings' (length=30)
38 => string '/questions/14609779/how-can-i-configure-bash-to-handle-crlf-shell-scripts/?lastactivity' (length=87)
39 => string '/users/1899640/that-other-guy' (length=29)
40 => string '/questions/14611169/using-one-socket-for-peer-to-peer-communication' (length=67)
41 => string '/questions/tagged/sockets' (length=25)
42 => string '/questions/tagged/p2p' (length=21)
43 => string '/questions/14611169/using-one-socket-for-peer-to-peer-communication' (length=67)
44 => string '/users/911651/xsnrg' (length=19)
45 => string '/questions/14611166/possible-mistake-in-ios-dev-guide' (length=53)
46 => string '/questions/tagged/iphone' (length=24)
47 => string '/questions/tagged/ios' (length=21)
48 => string '/questions/tagged/objective-c' (length=29)
49 => string '/questions/14611166/possible-mistake-in-ios-dev-guide' (length=53)
50 => string '/users/107715/matt-n' (length=20)
51 => string '/questions/14611163/how-to-use-dispatcher-in-wpf-to-make-a-timer' (length=64)
52 => string '/questions/tagged/wpf' (length=21)
53 => string '/questions/tagged/timer' (length=23)
54 => string '/questions/tagged/dispatcher' (length=28)
55 => string '/questions/14611163/how-to-use-dispatcher-in-wpf-to-make-a-timer' (length=64)
56 => string '/users/1741800/nashat' (length=21)
57 => string '/questions/14610879/how-can-i-handle-an-access-violation-in-visual-studio-c' (length=75)
58 => string '/questions/tagged/visual-c%2b%2b' (length=32)
59 => string '/questions/tagged/exception-handling' (length=36)
60 => string '/questions/tagged/access-violation' (length=34)
61 => string '/questions/tagged/structured-exception' (length=38)
62 => string '/questions/14610879/how-can-i-handle-an-access-violation-in-visual-studio-c/?lastactivity' (length=89)
63 => string '/users/901812/big-endian' (length=24)
64 => string '/questions/14611162/mvc-condintional-authorization' (length=50)
65 => string '/questions/tagged/c%23' (length=22)
66 => string '/questions/tagged/asp.net-mvc' (length=29)
67 => string '/questions/tagged/asp.net-mvc-4' (length=31)
68 => string '/questions/tagged/authorization' (length=31)
69 => string '/questions/14611162/mvc-condintional-authorization' (length=50)
70 => string '/users/644969/cadrell0' (length=22)
71 => string '/questions/14611160/get-customer-role-nopcommerce' (length=49)
72 => string '/questions/tagged/c%23' (length=22)
73 => string '/questions/tagged/razor' (length=23)
74 => string '/questions/tagged/nopcommerce' (length=29)
75 => string '/questions/14611160/get-customer-role-nopcommerce' (length=49)
76 => string '/users/1378841/mlg74' (length=20)
77 => string '/questions/14611158/iframe-resizing-nested-in-gridview' (length=54)
78 => string '/questions/tagged/resize' (length=24)
79 => string '/questions/14611158/iframe-resizing-nested-in-gridview' (length=54)
80 => string '/users/2026451/satish-patil' (length=27)
81 => string '/questions/14611157/php-how-to-check-the-value-got-this-word-from-a-var' (length=71)
82 => string '/questions/tagged/php' (length=21)
83 => string '/questions/tagged/preg-match' (length=28)
84 => string '/questions/tagged/strpos' (length=24)
85 => string '/questions/14611157/php-how-to-check-the-value-got-this-word-from-a-var' (length=71)
86 => string '/users/963414/samual99' (length=22)
87 => string '/questions/14611155/how-to-get-the-coordinates-of-boundries-of-drawable-on-the-mapview' (length=86)
88 => string '/questions/tagged/android' (length=25)
89 => string '/questions/tagged/google-maps' (length=29)
90 => string '/questions/14611155/how-to-get-the-coordinates-of-boundries-of-drawable-on-the-mapview' (length=86)
91 => string '/users/1520564/blubar' (length=21)
92 => string '/questions/14611153/why-css-is-empty-when-ssl-is-on-and-appcache-is-enabled-ipad-safari' (length=87)
93 => string '/questions/tagged/css' (length=21)
94 => string '/questions/tagged/ipad' (length=22)
95 => string '/questions/tagged/ssl' (length=21)
96 => string '/questions/tagged/mobile-safari' (length=31)
97 => string '/questions/tagged/html5-appcache' (length=32)
98 => string '/questions/14611153/why-css-is-empty-when-ssl-is-on-and-appcache-is-enabled-ipad-safari' (length=87)
99 => string '/users/2026375/twoface' (length=22)
100 => string '/questions/14611149/laravel-how-to-temporarily-store-eloquent-models-in-db-without-a-proper-schem' (length=97)
101 => string '/questions/tagged/php' (length=21)
102 => string '/questions/tagged/laravel' (length=25)
103 => string '/questions/14611149/laravel-how-to-temporarily-store-eloquent-models-in-db-without-a-proper-schem' (length=97)
104 => string '/users/291557/duality' (length=21)
105 => string '/questions/13928812/xmlserializer-generateserializer-and-collections' (length=68)
106 => string '/questions/tagged/c%23' (length=22)
107 => string '/questions/tagged/xml-serialization' (length=35)
108 => string '/questions/13928812/xmlserializer-generateserializer-and-collections/?lastactivity' (length=82)
109 => string '/users/1200614/phil' (length=19)
110 => string '/questions/14611145/keep-buttons-in-view-when-keyboard-opens-android' (length=68)
111 => string '/questions/tagged/android' (length=25)
112 => string '/questions/tagged/keyboard' (length=26)
113 => string '/questions/tagged/resize' (length=24)
114 => string '/questions/tagged/window' (length=24)
115 => string '/questions/tagged/views' (length=23)
116 => string '/questions/14611145/keep-buttons-in-view-when-keyboard-opens-android' (length=68)
117 => string '/users/1137413/725623452362' (length=27)
118 => string '/questions/14611144/ssdp-discovery-from-a-browser' (length=49)
119 => string '/questions/tagged/silverlight' (length=29)
120 => string '/questions/tagged/flash' (length=23)
121 => string '/questions/14611144/ssdp-discovery-from-a-browser' (length=49)
122 => string '/users/191882/legege' (length=20)
123 => string '/questions/14611143/how-to-syncrhonize-on-site-in-memory-no-sql-datasources-with-central-database-in' (length=100)
124 => string '/questions/tagged/architecture' (length=30)
125 => string '/questions/tagged/nosql' (length=23)
126 => string '/questions/tagged/java-ee-6' (length=27)
127 => string '/questions/tagged/in-memory-database' (length=36)
more elements...
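Now consider the '/about' sublink in the array. I want it to appear as 'https://*.com/about'. Why is only part of the link returned in some cases, while in other cases the full link is returned?
Some links also start with a '?' character. How do I sanitize these links?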
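Edit:
Consider "http://en.wikipedia.org/wiki/Web_crawler". If I run extract_links on it now, I get sublinks such as "http://en.wikipedia.org/wiki/Web_crawler/wiki/Web_search_engine", which is invalid, and most of the links come out in this form. The correct link is "http://en.wikipedia.org/wiki/Web_search_engine". I use this function in another program that passes it an array of links, so I cannot hard-code the if conditions. Here is the snippet I am using now: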
foreach ($html->find('a') as $link)
{
    $href = $link->href;
    $fchr = substr($href, 0, 1);
    if ($fchr === '/')
    {
        $href = $target_url . $href;
    }
    else if ($fchr === '?')
    {
        $href = $target_url . '/' . $href;
    }
}
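Solution:
Any link that starts with "/" is an absolute path from the document root. To get the full URL, prepend the scheme and host name of the page on which the link was found, not the whole page URL. Prepending the whole $target_url is what produces results like "http://en.wikipedia.org/wiki/Web_crawler/wiki/Web_search_engine". For relative links such as "?tab=etc", prepend the full URL of the page on which the link was found (without that page's own query string). If you want to ignore query-string links ("?tab=etc") altogether, use a regular expression to filter them out.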
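The rules above can be sketched roughly as follows. This is a minimal illustration, not a full RFC 3986 resolver: resolve_link() and is_query_only_link() are hypothetical helper names (they are not part of simple_html_dom), "example.com" is a placeholder host, and "../" segments and "#fragment" links are deliberately not handled. It relies only on PHP's built-in parse_url() and preg_match().

```php
<?php
// Hypothetical helper: turn $href into a full URL, given the URL of the
// page ($base_url) on which the link was found. Simplified sketch only.
function resolve_link($base_url, $href)
{
    // Already absolute (starts with a scheme such as "http:"): keep as-is.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }

    $base = parse_url($base_url);
    $root = $base['scheme'] . '://' . $base['host']
          . (isset($base['port']) ? ':' . $base['port'] : '');
    $path = isset($base['path']) ? $base['path'] : '/';

    // Protocol-relative link ("//host/path"): reuse the page's scheme.
    if (strpos($href, '//') === 0) {
        return $base['scheme'] . ':' . $href;
    }
    // Root-relative link ("/about"): scheme + host + link.
    if (strpos($href, '/') === 0) {
        return $root . $href;
    }
    // Query-only link ("?tab=hot"): page URL without its old query + link.
    if (strpos($href, '?') === 0) {
        return $root . $path . $href;
    }
    // Anything else is treated as relative to the page's directory.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

// Hypothetical filter: true for links that consist only of a query string.
function is_query_only_link($href)
{
    return preg_match('/^\?/', $href) === 1;
}
```

Inside the foreach loop, something like `if (!is_query_only_link($link->href)) { $crawl[] = resolve_link($target_url, $link->href); }` would then store full URLs and skip the "?tab=..." links. Because the host is taken from the page URL on each call, nothing has to be hard-coded per site.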