Python + Requests: Connection Keep-Alive
- HTTP/HTTPS Header Attribute Connection
- Requests
- Session
- Postscript
HTTP/HTTPS Header Attribute Connection
The HTTP/HTTPS header field Connection distinguishes short-lived connections from persistent ones, through the values close and keep-alive. A short connection is closed as soon as one read/write exchange completes; a persistent connection can carry multiple transfers while it remains open. Both kinds are still bounded by read/write time limits. The point of the Connection header is to improve the efficiency of network resource use and the quality of service.
Some articles one-sidedly recommend persistent connections, but the two serve different purposes. For the most common case of downloading a file, a short connection is appropriate; for browsing multimedia pages, a persistent connection works better. That said, you will often find sites that use keep-alive connections almost everywhere.
In principle the two are the same thing: a persistent connection is essentially several short connections with the repeated handshakes elided. Viewed over the whole HTTP/HTTPS connection lifecycle, a short connection is a short session and a persistent connection a long one. A short connection discards the session as soon as one complete exchange finishes; a persistent connection carries one or more complete exchanges within its session lifetime.
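Whether a connection stays open also depends on the protocol version: HTTP/1.0 closes by default unless the client asks for keep-alive, while HTTP/1.1 keeps the connection open unless Connection: close is sent. A minimal sketch of that decision rule (is_persistent is a hypothetical helper written for illustration, not part of any library):

```python
def is_persistent(http_version, connection_header):
    """Decide whether the connection stays open after a response,
    following HTTP/1.0 vs HTTP/1.1 default semantics."""
    tokens = []
    if connection_header:
        # The header may carry a comma-separated token list; compare case-insensitively.
        tokens = [t.strip().lower() for t in connection_header.split(',')]
    if http_version == 'HTTP/1.0':
        return 'keep-alive' in tokens   # HTTP/1.0 closes by default
    return 'close' not in tokens        # HTTP/1.1 keeps alive by default

print(is_persistent('HTTP/1.1', None))          # True
print(is_persistent('HTTP/1.0', 'Keep-Alive'))  # True
print(is_persistent('HTTP/1.1', 'close'))       # False
```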
Requests
Install the Python requests library with pip in a bash shell (the Windows command prompt is much the same).
```shell
~$ pip install requests
```
Session
A Session is the way the requests library keeps a connection alive. If you build a request with urllib's request.Request and add a "Connection: keep-alive" header yourself, you will often find that the reply comes back with "Connection: close". This suggests that the default HTTP/HTTPS connection is short-lived. Many server programs today do support keep-alive by default, but a keep-alive connection still needs a live, time-limited session to be maintained across exchanges.
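Notably, requests itself already asks for a persistent connection: every Session starts with default headers that include Connection: keep-alive, and it pools connections per scheme so repeated requests to the same host can reuse one TCP (and TLS) connection. A quick check, assuming requests is installed:

```python
import requests

session = requests.Session()
# requests advertises a persistent connection out of the box:
print(session.headers.get('Connection'))  # keep-alive
# One pooled adapter per scheme handles the connection reuse:
print(sorted(session.adapters))           # ['http://', 'https://']
```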
An updated version of the custom HTTP toolkit, httpkit.py:
```python
# -*- coding: utf-8 -*-
"""
@file: httpkit
@author: MR.N
@created: 2022/4/2
@updated: 2022/5/23
@version: 1.0
@blog: https://blog.csdn.net/qq_21264377
"""

import time
import urllib.parse
import urllib.request
import requests
import urllib3
import http.cookiejar
import ssl
import socket
import gzip
from uas import *  # provides RemoteTask, valid_https(), unspecific_ua()
import random

SOCKET_TIMEOUT = 30
HTTPS_TIMEOUT = 10

# ... (omitted)


def request_res(remote_task=None, ret=[], dtype=0, max_retry=3):
    # ret is an output parameter; callers should pass in a fresh list.
    if remote_task is None:
        ret += ['', '', '', -1]
        return 'err'
    if not isinstance(remote_task, RemoteTask):
        ret += ['', '', '', -1]
        return 'err'
    url = remote_task.url
    if not valid_https(url):
        ret += ['', '', '', -1]
        return 'err'
    referer = remote_task.referer
    cookies = remote_task.cookies
    ua = remote_task.ua
    if ua is None:
        ua = unspecific_ua()
    headers = {
        'User-Agent': ua,
        # 'Accept-Encoding': 'gzip, deflate, br',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
        'Sec-Fetch-User': '?1',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Site': 'none',
        'Upgrade-Insecure-Requests': '1',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive',
    }
    if referer is None:
        headers['Referer'] = url
    else:
        headers['Referer'] = referer
    if cookies is not None:
        headers['Cookie'] = cookies
    headers['host'] = url.split('/')[2]
    # print(headers)
    socket.setdefaulttimeout(SOCKET_TIMEOUT)
    ssl._create_default_https_context = ssl._create_unverified_context
    session = requests.Session()
    attempts = 0
    status_code = -1
    response = None
    # Retry loop: keep trying until we get a 200 or run out of attempts.
    while attempts < max_retry and status_code != 200:
        attempts += 1
        try:
            response = session.request(method='GET', url=url, headers=headers,
                                       timeout=HTTPS_TIMEOUT)
            status_code = response.status_code
        except TimeoutError:
            status_code = 404
        except requests.exceptions.Timeout:
            status_code = 404
        except requests.exceptions.ReadTimeout:
            status_code = 404
        finally:
            if status_code != 200:
                time.sleep(.11)
    if response is not None and response.status_code == 200:
        data = response.content
        content_encoding = response.headers.get('Content-Encoding')
        if content_encoding is not None and content_encoding.strip() != '' \
                and content_encoding.lower() in ['gzip', 'deflate']:
            # requests already decompresses gzip/deflate bodies transparently.
            # data = gzip.decompress(data)
            pass
        # Pick a text encoding: prefer the charset declared in Content-Type,
        # otherwise fall back to the encoding requests detects from the body.
        content_type = response.headers.get('Content-Type')
        if content_type is not None and 'charset=' in content_type:
            encoding = content_type.split(';')[-1].split('=')[-1]
        else:
            encoding = response.apparent_encoding
        if encoding is not None and encoding.strip() != '':
            # print(encoding)
            data = data.decode(encoding=encoding)
        else:
            data = data.decode('UTF-8')
        # print(content_encoding, len(data))
        # Flatten the response cookies into a "name=value;" string.
        cookies = response.cookies
        cookie_res = ''
        for cookie in cookies:
            cookie_res += cookie.name + '=' + cookie.value + ';'
        ret += [data, url, cookie_res, status_code]
        if session is not None:
            session.close()
        if response is not None:
            response.close()
        return 'success'
    else:
        if session is not None:
            session.close()
        if response is not None:
            response.close()
        ret += ['', '', '', status_code]
        return 'failure'
```
Postscript
Besides improving the reuse of network resources, the author has seen the Connection header serve other purposes as well, for example traffic statistics and security auditing. For traffic statistics it mostly appears on the server-log side; the two widespread approaches are front-end-first collection (with the back end assisting) and pure back-end logging, and the front-end-first approach captures richer user-behavior features and builds more faithful user profiles. For security auditing, it can help identify crawlers and certain kinds of traffic attacks: by comparing incoming requests against the server configuration and the expected scenarios, anomalies can be filtered out quickly and simply, and non-compliant or malicious behavior analyzed from there.