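A friend recently asked me for the PHP source I use to scrape grades from the academic affairs office with cURL, for a WeChat Official Account project, so I am reluctantly giving it away here. This is a hands-on walkthrough of simulating a login with cURL in PHP and scraping student grades from the academic affairs site of Shenyang Institute of Technology (沈阳工学院).
First, logging in to the academic affairs site requires a CAPTCHA. The CAPTCHA image is served from http://218.61.108.163/ACTIONVALIDATERANDOMPICTURE.APPPROCESS, so we request that URL first and pick up the session cookie from the response. Here is the main code, index.php: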
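<?php
// Request the CAPTCHA URL once so the server issues a session cookie.
$ch = curl_init("http://218.61.108.163/ACTIONVALIDATERANDOMPICTURE.APPPROCESS");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
curl_setopt($ch, CURLOPT_HEADER, 1);         // include the response headers in the result
$str = curl_exec($ch);
curl_close($ch);
// Split headers from body; the limit of 2 keeps a body that contains "\r\n\r\n" in one piece.
list($header, $body) = explode("\r\n\r\n", (string)$str, 2) + array("", "");
// Capture the session id up to the next ";" (a greedy ".*" could run past it).
$cookie = preg_match("/JSESSIONID=([^;]+)/i", $header, $matches) ? $matches[1] : "";
?>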
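To make the header parsing concrete, here is a small sketch that runs the same split-and-match on a canned response instead of a live request. The function name `extract_jsessionid` and the sample response are made up for illustration; a real response from the server will carry different headers.

```php
<?php
// Hypothetical helper: pull the JSESSIONID value out of a raw HTTP response
// (headers plus body, as cURL returns it when CURLOPT_HEADER is enabled).
function extract_jsessionid($rawResponse) {
    // Headers end at the first blank line; the limit of 2 leaves the body untouched.
    $parts = explode("\r\n\r\n", (string)$rawResponse, 2);
    $header = $parts[0];
    if (preg_match('/JSESSIONID=([^;\r\n]+)/i', $header, $m)) {
        return $m[1];
    }
    return null; // no session cookie in the headers
}

// Made-up sample response, only to show the shape of the input.
$sample = "HTTP/1.1 200 OK\r\n"
        . "Set-Cookie: JSESSIONID=ABC123DEF456; path=/\r\n"
        . "Content-Type: image/jpeg\r\n"
        . "\r\n"
        . "...image bytes...";

echo extract_jsessionid($sample), "\n";
```

Returning null when nothing matches lets the caller tell "no cookie" apart from an empty string, which is easier to check before the form is rendered.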
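The login request has to carry that cookie, so we create an api.php page to replay it. The page reads the values submitted by the form on the index.php home page and posts them to http://218.61.108.163/ACTIONLOGON.APPPROCESS, the URL we scrape the grades from: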
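<?php
if (isset($_POST['code'])) {
    $jwid  = $_POST['xuehao']; // student number
    $jwpwd = $_POST['mima'];   // password
    $code  = $_POST['code'];   // CAPTCHA text typed by the user
    $ck    = $_POST['ck'];     // JSESSIONID obtained in index.php
    // http_build_query URL-encodes each value, so characters such as "&" in a password survive.
    $data = http_build_query(array(
        'WebUserNO' => $jwid, 'Password' => $jwpwd, 'Agnomen' => $code,
        'submit.x' => 23, 'submit.y' => 9, 'applicant' => 'ACTIONQUERYSTUDENTSCORE',
    ));
    $ch = curl_init("http://218.61.108.163/ACTIONLOGON.APPPROCESS");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_COOKIE, "JSESSIONID={$ck}"); // reuse the session the CAPTCHA belongs to
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
    $str = curl_exec($ch); // server response, parsed in the final step
    curl_close($ch);
}
?>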
As the login page shows, logging in requires a CAPTCHA, so we create a code.php page that fetches the CAPTCHA image:
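<?php
// Fetch the CAPTCHA under the same session, so the image matches the cookie used to log in.
$ck = isset($_GET['ck']) ? $_GET['ck'] : "";
$ch = curl_init("http://218.61.108.163/ACTIONVALIDATERANDOMPICTURE.APPPROCESS");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIE, "JSESSIONID={$ck}");
$str = curl_exec($ch);
curl_close($ch);
echo $str; // raw response body (the image)
?>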
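The form on index.php is not shown in this post. A minimal sketch of what it might look like follows, assuming only what the other scripts reveal: api.php reads the POST fields `xuehao`, `mima`, `code` and `ck`, and code.php takes the session id as `ck` in the query string. The function name `render_login_form`, the labels, and the markup are illustrative and not taken from the original source.

```php
<?php
// Hypothetical helper: build the login form for index.php.
// $cookie is the JSESSIONID value extracted from the first request.
function render_login_form($cookie) {
    $ckAttr  = htmlspecialchars($cookie, ENT_QUOTES);               // safe inside an HTML attribute
    $ckQuery = htmlspecialchars(rawurlencode($cookie), ENT_QUOTES); // safe inside a URL in an attribute
    return '<form action="api.php" method="post">' . "\n"
         . '  Student number: <input type="text" name="xuehao">' . "\n"
         . '  Password: <input type="password" name="mima">' . "\n"
         . '  CAPTCHA: <input type="text" name="code">' . "\n"
         . '  <img src="code.php?ck=' . $ckQuery . '" alt="CAPTCHA">' . "\n"
         . '  <input type="hidden" name="ck" value="' . $ckAttr . '">' . "\n"
         . '  <input type="submit" value="Query grades">' . "\n"
         . '</form>';
}

// Example with a made-up session id.
$html = render_login_form("ABC123");
echo $html, "\n";
```

The hidden `ck` field is what ties the pieces together: the CAPTCHA image and the login POST both go out under the same JSESSIONID, so the text the user types matches the image the server generated for that session.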
The final step: take the data that came back and use regular expressions to extract the table cells and lay them out.
<?php
// Convert an HTML table into an array of rows, each an array of cell strings.
function get_td_array($table) {
    // Drop the opening tags; the closing tags become row and cell separators.
    $table = preg_replace("/<table[^>]*?>/is", "", $table);
    $table = preg_replace("/<tr[^>]*?>/si", "", $table);
    $table = preg_replace("/<td[^>]*?>/si", "", $table);
    $table = str_replace("</tr>", "{tr}", $table);
    $table = str_replace("</td>", "{td}", $table);
    // Strip every remaining HTML tag.
    $table = preg_replace("'<[/!]*?[^<>]*?>'si", "", $table);
    // Strip line breaks with the whitespace after them, then &nbsp; entities and spaces.
    $table = preg_replace("'([\r\n])[\s]+'", "", $table);
    $table = str_replace("&nbsp;", "", $table);
    $table = str_replace(" ", "", $table);
    $table = explode('{tr}', $table);
    array_pop($table); // discard whatever follows the last </tr>
    $td_array = array();
    foreach ($table as $key => $tr) {
        $td = explode('{td}', $tr);
        array_pop($td); // discard whatever follows the last </td>
        $td_array[] = $td;
    }
    return $td_array;
}
?>
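For the layout part of this step, here is one way the rows returned by `get_td_array` could be turned into plain text, for example for a WeChat text reply. The function name `format_score_rows`, the " | " separator, and the sample rows are assumptions for illustration; the real column layout depends on the grade table the site returns.

```php
<?php
// Hypothetical helper: join each row's cells with " | " and put one row per line.
// $rows has the shape returned by get_td_array(): an array of arrays of strings.
function format_score_rows(array $rows) {
    $lines = array();
    foreach ($rows as $cells) {
        // Skip rows whose cells are all empty (for example spacer rows).
        if (implode('', $cells) === '') {
            continue;
        }
        $lines[] = implode(' | ', $cells);
    }
    return implode("\n", $lines);
}

// Made-up rows, only to show the input shape.
$rows = array(
    array('Course', 'Credits', 'Score'),
    array('AdvancedMath', '4', '88'),
    array(),
    array('English', '3', '92'),
);

echo format_score_rows($rows), "\n";
```

In api.php this would be called as `format_score_rows(get_td_array($str))`, once `$str` holds the HTML of the grade table.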
The complete code can be downloaded from http://pan.baidu.com/share/link?shareid=3722188112&uk=1496266064 (password: a3eh).