Python crawler series: disguising as a browser to crawl and batch-download content

Contents

  1. Required modules
  2. Target URL and global headers
  3. Extracting article URLs
  4. Main loop

I'm just getting started, so this is rough around the edges.

Required modules

import urllib.request
import urllib.error  # for catching URLError in the download loop
import re

Target URL and global headers

urlindex = "http://blog.jtjiang.art/archives"
# Spoof a desktop Chrome User-Agent so the server treats the script as a browser
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)  # every urlopen/urlretrieve call now sends this header
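If you would rather not touch global state, the same header can be attached per request with a `Request` object; this is a minimal sketch using the same URL and User-Agent string as above, checked without any network access:

```python
import urllib.request

# Per-request alternative to install_opener: attach the spoofed
# User-Agent directly to a Request object.
req = urllib.request.Request(
    "http://blog.jtjiang.art/archives",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                           "Chrome/81.0.4044.129 Safari/537.36"},
)
# Request normalizes header names with str.capitalize(), hence "User-agent"
print(req.get_header("User-agent").startswith("Mozilla/5.0"))  # True
```

Passing `req` to `urllib.request.urlopen` then sends the header for that call only.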

Extracting article URLs

pat = '<a class="archive-article-title" href="(.*?)">'  # regex for the relative article links
data = urllib.request.urlopen(urlindex).read().decode("utf-8", "ignore")
allurl = re.compile(pat).findall(data)
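The extraction step can be checked offline against a small HTML snippet shaped like the archive page's markup (the sample links below are hypothetical, not real posts):

```python
import re

pat = '<a class="archive-article-title" href="(.*?)">'  # same pattern as above

# Hypothetical fragment of the archive page, matching the expected markup
sample = (
    '<a class="archive-article-title" href="/2020/05/01/first-post/">First</a>\n'
    '<a class="archive-article-title" href="/2020/05/02/second-post/">Second</a>\n'
)
allurl = re.compile(pat).findall(sample)
print(allurl)  # ['/2020/05/01/first-post/', '/2020/05/02/second-post/']
```

`findall` returns only the captured group, so `allurl` holds just the relative paths, ready to be joined onto the site root.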

Main loop

for i in range(0, len(allurl)):
    try:
        thisurl = "http://blog.jtjiang.art" + allurl[i]
        print("Crawl " + str(i + 1) + ": article URL is " + thisurl)
        file = "D:\\study\\python\\blogtest\\" + str(i) + ".html"
        urllib.request.urlretrieve(thisurl, file)
        print("----- download succeeded -----")
    except urllib.error.URLError as error:
        if hasattr(error, "code"):
            print(error.code)
        if hasattr(error, "reason"):
            print(error.reason)
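The `range(len(...))` pattern works, but `enumerate` is the more idiomatic form, and the URL/filename logic can be exercised offline with hypothetical stand-in paths (the two sample paths below are not real posts):

```python
# Hypothetical relative paths standing in for what the regex extracts
allurl = ["/2020/05/01/first-post/", "/2020/05/02/second-post/"]

saved = []
for i, path in enumerate(allurl):
    thisurl = "http://blog.jtjiang.art" + path
    file = str(i) + ".html"  # same naming scheme as the loop above
    saved.append((thisurl, file))
    print("Crawl " + str(i + 1) + ": article URL is " + thisurl)

print(saved[0])  # ('http://blog.jtjiang.art/2020/05/01/first-post/', '0.html')
```

Note that `urlretrieve` will fail with a `FileNotFoundError` if the target directory does not exist, so create `D:\study\python\blogtest` (or call `os.makedirs` on it) before running the real loop.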
