When a Crawler Meets JavaScript

Preface

This time my goal is to write a crawler that fetches the REMIC Prospectuses files published each month on the GNMA website. For a specific year and month, the URL is:

http://www.ginniemae.gov/
doing_business_with_ginniemae/investor_resources/prospectuses/Pages/
remic_prospectuses.aspx?YearDropDown=2013&MonthDropDown=March

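If the listing spans several pages, the other pages are reachable only by clicking a link, which runs JavaScript to generate the page; without that click their content is not visible. For example: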

javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={11};dvt_startposition={}');

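The program should accept a year and a month as parameters, fetch the REMIC list automatically, and also handle the pages that only appear after a click.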

Analyzing the Target Page

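The target page URL follows a regular pattern: the year is a number matching \d+, and the month is the full English month name with its first letter capitalized.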
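The URL rule above can be sketched in a few lines of Python. This is only an illustration, not the code from grspider.py: the names BASE_URL, MONTHS and build_url are made up for this example, and taking the month as a number from 1 to 12 is an assumption.

```python
from urllib.parse import urlencode

# Base address taken from the URL shown earlier in the post.
BASE_URL = (
    "http://www.ginniemae.gov/"
    "doing_business_with_ginniemae/investor_resources/prospectuses/Pages/"
    "remic_prospectuses.aspx"
)

# Full English month names, first letter capitalized, as the site expects.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def build_url(year, month):
    """Return the listing URL for a year and a month number (1-12).

    Hypothetical helper: the month-as-number interface is an assumption.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    query = urlencode({
        "YearDropDown": int(year),
        "MonthDropDown": MONTHS[month - 1],
    })
    return BASE_URL + "?" + query


print(build_url(2013, 3))
```

Keeping the month names in an explicit tuple avoids any dependence on the locale of the machine running the crawler.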

On a given page, the HTML source for a REMIC file looks like this:

<a href="/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2013Mar21-037.pdf" target="_blank">
2013-037 - Dated March 21, 2013
</a>
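Links of this shape can be pulled out with a regular expression. The following is a rough sketch rather than the actual grspider.py logic: SITE, LINK_RE and extract_remics are hypothetical names, and using the identifier from the link text (for example 2013-037) as the dictionary key is an assumption.

```python
import re
from urllib.parse import urljoin

# Site root used to turn the relative href into an absolute URL.
SITE = "http://www.ginniemae.gov/"

# Match an <a> whose href ends in .pdf and whose text starts with an
# identifier such as "2013-037" (optionally followed by letters) and "- Dated".
LINK_RE = re.compile(
    r'<a\s+href="(?P<href>[^"]+\.pdf)"[^>]*>\s*'
    r'(?P<remic>\d{4}-\d{3}[A-Za-z]*)\s+-\s+Dated',
    re.IGNORECASE,
)


def extract_remics(page):
    """Map each REMIC identifier found in the page to its absolute PDF URL."""
    return {
        match.group("remic"): urljoin(SITE, match.group("href"))
        for match in LINK_RE.finditer(page)
    }


sample = '''<a href="/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2013Mar21-037.pdf" target="_blank">
2013-037 - Dated March 21, 2013
</a>'''
print(extract_remics(sample))
```

A regular expression is enough for markup this uniform; a real HTML parser would be more robust if the page layout varied.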

The corresponding JavaScript links, tidied up, look like this:

<a href="javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={11};dvt_startposition={}');">
<a href="javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={1};dvt_startposition={}');">
<a href="javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={21};dvt_startposition={}');">
<a href="javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={31};dvt_startposition={}');">
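Each of these links carries the two arguments of __doPostBack, so a crawler can collect them from the page instead of hard-coding them. A minimal sketch, assuming the links always use single-quoted string literals as shown above; POSTBACK_RE and extract_postbacks are names invented for this example.

```python
import html
import re

# Capture the two single-quoted arguments of a __doPostBack(...) call.
POSTBACK_RE = re.compile(
    r"__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)"
)


def extract_postbacks(page):
    """Return the distinct (event target, event argument) pairs, in page order."""
    found = []
    # Unescape first, in case the href attribute is entity-encoded.
    for target, argument in POSTBACK_RE.findall(html.unescape(page)):
        if (target, argument) not in found:
            found.append((target, argument))
    return found


sample = (
    "<a href=\"javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f',"
    "'dvt_firstrow={11};dvt_startposition={}');\">"
)
for target, argument in extract_postbacks(sample):
    print(target, argument)
```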

Fortunately, the definition of __doPostBack can be found in the page source:

<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>

So the request has to be submitted as a form via the POST method, which means collecting the form's fields first.
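What __doPostBack does can be imitated without a browser: fill in __EVENTTARGET and __EVENTARGUMENT and POST the form. The sketch below is an illustration under stated assumptions, not the code in grspider.py. HiddenInputParser and build_postback_request are hypothetical names, and it assumes the server also expects the page's other hidden inputs (ASP.NET pages typically carry fields such as __VIEWSTATE) to be sent back unchanged. For simplicity it collects hidden inputs from the whole page rather than from aspnetForm alone.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request


class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_request(url, page, event_target, event_argument):
    """Build (but do not send) the POST request that __doPostBack would submit."""
    parser = HiddenInputParser()
    parser.feed(page)
    fields = dict(parser.fields)
    # These two assignments mirror the body of __doPostBack.
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    body = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

The returned object can be passed to urllib.request.urlopen to fetch the next page of results; the response can then be parsed the same way as the first page.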

grspider.py

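Firebug can give you the cURL command line that downloads the corresponding page, but going that way gets messy, so the script simulates the form POST instead: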

grspider.py: https://gist.github.com/LeslieZhu/1465866a5f1e41c5bd0d.js

Run it:

$ python grspider.py -y 2014
http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/prospectuses/Pages/remic_prospectuses.aspx?YearDropDown=2014&MonthDropDown=Month
$ cat gnma_remic.json
{
"2014-001": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-001.pdf",
"2014-001O": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-001O.pdf",
"2014-002": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-002.pdf",
"2014-002O": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-002O.pdf",
"2014-003": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-003.pdf",
"2014-003O": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Feb24-003O.pdf",
"2014-004": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-004.pdf",
"2014-004O": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Feb24-004O.pdf",
....
}
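The run above passes only -y, and the printed URL then contains MonthDropDown=Month. A command-line interface with that behavior might be sketched as follows. Only the -y flag appears in this post; the -m flag, the long option names and the "Month" default are assumptions inferred from that URL, not taken from grspider.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the crawler's options (hypothetical interface, see lead-in)."""
    parser = argparse.ArgumentParser(
        description="List GNMA REMIC prospectus files."
    )
    # -y is the flag used in the run shown above.
    parser.add_argument("-y", "--year", type=int, required=True,
                        help="four-digit year, e.g. 2014")
    # -m and its default are assumptions; "Month" matches the printed URL.
    parser.add_argument("-m", "--month", default="Month",
                        help="full English month name, e.g. March")
    return parser.parse_args(argv)


args = parse_args(["-y", "2014"])
print(args.year, args.month)
```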

Afterword

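When a page relies on JavaScript, it is best to inspect it with a tool such as Firebug to find the pattern, and then simulate the form submission.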
