Python modules: urllib and requests

Contents

  1. urllib library
    1.1. GET requests with urllib
      1.1.1. Direct access, without request headers
      1.1.2. Setting request headers
      1.1.3. Setting a proxy
    1.2. POST requests with urllib
  2. requests library
    2.1. GET requests with requests
      2.1.1. Direct access, without request headers
    2.2. POST requests with requests
  3. Three ways of passing parameters in a GET request

HTTP request methods
GET HEAD POST PUT DELETE CONNECT OPTIONS TRACE

urllib library

Installation: none needed; urllib ships with Python.

GET requests with urllib

Direct access, without request headers

import urllib.request
def get_page():
    url = 'http://www.baidu.com'
    result = urllib.request.urlopen(url=url).read().decode("utf-8")
    print(result)
if __name__ == '__main__':
    get_page()

Setting request headers

Change the User-Agent to mimic a real browser user.

  1. The urllib.request.Request approach (the more extensible one)
    Since urllib.request.urlopen() does not accept a headers argument, build a urllib.request.Request object to carry the request headers.

     import urllib.request
     def get_page():
         url = 'http://www.baidu.com'
         headers = {
             'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
         }
         request = urllib.request.Request(url=url, headers=headers)
         result = urllib.request.urlopen(request).read().decode('utf-8')
         print(result)
     if __name__ == '__main__':
         get_page()
  2. The second approach: build and install an opener, see https://docs.python.org/zh-cn/3/library/urllib.request.html#urllib.request.build_opener
    An opener is an instance of urllib.request.OpenerDirector. The urlopen() we have been using all along is itself a special opener that the module builds for us.
    The basic urlopen(), however, does not support proxies, cookies, or other advanced HTTP/HTTPS features, so we build our own opener.

     import urllib.request
     def get_page():
         url = "http://www.baidu.com"
         headers = (
             "User-Agent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
         )  # addheaders expects a list of (name, value) tuples; cf. the stdlib source: self.addheaders = [('User-agent', client_version)]
         opener = urllib.request.build_opener()
         opener.addheaders = [headers]
         result = opener.open(url).read().decode('utf-8')
         print(result)
     if __name__ == '__main__':
         get_page()
  3. Define an opener to replace urlopen(), but still call urlopen() to open the URL (install_opener() makes the opener global)
    import urllib.request
    def get_page():
        url = 'http://www.baidu.com'
        headers = (
            'User-Agent', 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
        )  # a (name, value) tuple, since addheaders expects a list of tuples
        opener = urllib.request.build_opener()
        opener.addheaders = [headers]
        urllib.request.install_opener(opener)
        result = urllib.request.urlopen(url=url).read().decode('utf-8')
        print(result)

    if __name__ == '__main__':
        get_page()

Setting a proxy

  1. Add a proxy and also change the User-Agent.
    Here we use ProxyHandler:
    class urllib.request.ProxyHandler(proxies=None)
    It routes requests through a proxy. If proxies is given, it must be a dictionary mapping protocol names to proxy URLs. By default the proxy list is read from the <protocol>_proxy environment variables. Again, an opener is used in place of urlopen.

     import urllib.request
     def get_page(url):
         headers={
             'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
         }
         request = urllib.request.Request(url=url, headers=headers)
         proxy = {
             'http': '*****:53512'  # etc.
         }
         proxy_handler = urllib.request.ProxyHandler(proxy)
         opener = urllib.request.build_opener(proxy_handler)
         urllib.request.install_opener(opener)
    
         result = urllib.request.urlopen(request).read().decode('utf-8')
         print(result)
     if __name__ == '__main__':
         get_page("http://www.baidu.com")

POST requests with urllib

POST is mostly used for logins, or for dynamically loaded content; see the previous post: ==>
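As a minimal sketch (the httpbin.org echo endpoint and the form field here are illustrative placeholders), passing URL-encoded bytes via the data argument is what makes urlopen() issue a POST instead of a GET:

```python
import urllib.parse
import urllib.request

def build_post(url, form):
    # urlopen() sends a POST whenever the `data` argument is supplied;
    # the form dict must be URL-encoded and converted to bytes first.
    data = urllib.parse.urlencode(form).encode('utf-8')
    return urllib.request.Request(url=url, data=data)

request = build_post('http://httpbin.org/post', {'wd': 'hello'})
print(request.get_method())  # POST -- would be GET without data
# to actually send it:
# result = urllib.request.urlopen(request).read().decode('utf-8')
```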

requests library

Installation: pip install requests

GET requests with requests

Direct access, without request headers

import requests
def get_page():
    url = 'http://www.baidu.com'
    response = requests.get(url)
    print(response.text)  
if __name__ == '__main__':
    get_page()

response.text is the string that requests produces by decoding response.content with a guessed encoding; the guess is sometimes wrong and yields garbled text. In that case, decode explicitly with response.content.decode('utf-8').

Output:

>>> res = response.content.decode('utf-8')
>>> print(res)
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rs 

POST requests with requests
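A minimal sketch, assuming httpbin.org as a stand-in endpoint: requests.post() takes the form dict directly via data and form-encodes it for you. Preparing the request shows the encoded body without touching the network:

```python
import requests

# requests.post() form-encodes a dict passed via `data` and sets
# Content-Type: application/x-www-form-urlencoded automatically.
# A prepared request exposes the encoded body without sending anything:
prepared = requests.Request('POST', 'http://httpbin.org/post',
                            data={'wd': 'hello'}).prepare()
print(prepared.method, prepared.body)  # POST wd=hello

# the one-liner to actually send it:
# response = requests.post('http://httpbin.org/post', data={'wd': 'hello'})
# print(response.text)
```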

Three ways of passing parameters in a GET request

  1. Encode a parameter dict into a URL query string with urllib.parse.urlencode()

    import urllib.parse
    if __name__ == '__main__':
        base_url = 'https://www.baidu.com/s?'
        params ={
            'ie': 'utf-8',  # 'key1': 'value1'
            'wd': 'hello'   # 'key2': 'value2'
        }
        url = base_url + urllib.parse.urlencode(params)
        print(url)

    Result: https://www.baidu.com/s?ie=utf-8&wd=hello, which is exactly a Baidu search URL!

  2. Use the params argument of requests.get (reference link)

    import requests
    if __name__ == '__main__':
        payload = {
            'ie':'utf-8',
            'wd':'hello'
        }
        base_url = 'https://www.baidu.com/s?'
        response = requests.get(url=base_url, params=payload)
        print(response.url)

    Result: this ran into a problem: Baidu flagged the request as a security risk and required a captcha before allowing access:

    https://wappass.baidu.com/static/captcha/tuxing.html?&ak=c27bbc89afca0463650ac9bde68ebe06&backurl=https%3A%2F%2Fwww.baidu.com%2Fs%3Fie%3Dutf-8%26wd%3Dhello&logid=10813104645036683003&signature=1ed04493a7a14debe47c945fe7729723&timestamp=1589427108

  3. Write the URL by hand: https://www.baidu.com/s?key1=value1&key2=value2
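All three approaches should produce the same URL; a quick sanity check (the key names are illustrative):

```python
import urllib.parse

params = {'key1': 'value1', 'key2': 'value2'}
manual = 'https://www.baidu.com/s?key1=value1&key2=value2'
encoded = 'https://www.baidu.com/s?' + urllib.parse.urlencode(params)
# urlencode() follows dict insertion order, so both strings match
print(encoded == manual)  # prints True on Python 3.7+
```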