November 3, 2022

Python Web Scraping Proxy IPs [2] (Proxy Pool)

Building Your Own Proxy IP Pool

Scrape proxy IPs from a proxy-listing site, test whether each one works, and build your own pool of usable proxies.

URL

kuaidaili.com/free/

Requirements

Save the working proxy IPs to a local file.

How to test

Send a request to a test website through each proxy IP and decide from the HTTP response whether the proxy is usable.

Importing the libraries

```python
import requests, time, random
from lxml import etree
```

Importing one library per line gets tedious, so you can put several imports on a single line by separating the module names with commas.

First, the basic request setup

```python
class Prouxychi:
    def __init__(self):
        # page URL template and the site used to test each proxy
        self.url = 'https://www.kuaidaili.com/free/inha/{}'
        self.test_url = 'https://www.baidu.com/'
        self.headers = {'User-Agent': 'wqeqqeqw'}  # placeholder User-Agent
```
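The User-Agent above is just a throwaway placeholder, and many sites reject obviously fake agents. A minimal sketch of picking a realistic User-Agent at random, using the `random` module imported earlier (the strings below are ordinary browser UAs chosen for illustration):

```python
import random

# a small pool of real browser User-Agent strings (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/16.1 Safari/605.1.15',
]

headers = {'User-Agent': random.choice(USER_AGENTS)}  # pick one per request
```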

Parsing and extracting the data

```python
    def get_proxy(self, url):
        html = requests.get(url=url, headers=self.headers).text
        p = etree.HTML(html)
        tr_list = p.xpath('//*[@id="list"]//tbody/tr')
        for tr in tr_list[1:]:  # skip the header row
            # td indices follow the table layout at the time of writing
            ip = tr.xpath('./td[2]/text()')[0].strip()
            port = tr.xpath('./td[3]/text()')[0].strip()
```
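As a sanity check, here is a tiny self-contained sketch of how `etree.HTML` and these relative XPaths behave. The HTML fragment below is made up to mirror the page's structure, not copied from kuaidaili:

```python
from lxml import etree

# minimal stand-in for the proxy table's markup
html = etree.HTML('''
<table id="list"><tbody>
  <tr><td>#</td><td>IP</td><td>PORT</td></tr>
  <tr><td>1</td><td>1.2.3.4</td><td>8080</td></tr>
</tbody></table>
''')

for tr in html.xpath('//*[@id="list"]//tbody/tr')[1:]:
    print(tr.xpath('./td[2]/text()')[0], tr.xpath('./td[3]/text()')[0])
    # -> 1.2.3.4 8080
```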

Testing

```python
            # still inside the for loop: test whether this proxy works
            self.test_proxy(ip, port)

    def test_proxy(self, ip, port):
        proxies = {
            'http': 'http://{}:{}'.format(ip, port),
            'https': 'https://{}:{}'.format(ip, port)
        }
        try:
            # without a timeout, a dead proxy can hang the request for minutes
            res = requests.get(url=self.test_url, proxies=proxies,
                               headers=self.headers, timeout=5)
            if res.status_code == 200:
                print(ip, port, '\033[31mavailable\033[0m')  # print in red
        except Exception:
            print(ip, port, 'unavailable')

    def run(self):
        for i in range(1, 1001):
            url = self.url.format(i)
            self.get_proxy(url=url)
```
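A 200 response shows the proxy answers, but not that traffic actually went through it. One way to verify, assuming you are comfortable calling the public httpbin.org service, is to request its /ip endpoint through the proxy and check which address it echoes back. A hedged sketch:

```python
import requests

def proxy_really_used(ip, port, timeout=5):
    """Return True if httpbin sees the proxy's address rather than ours."""
    proxies = {'http': 'http://{}:{}'.format(ip, port)}
    seen = requests.get('http://httpbin.org/ip',
                        proxies=proxies, timeout=timeout).json()['origin']
    # transparent proxies may report both addresses; elite ones only theirs
    return ip in seen
```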

Saving

```python
                # inside the status_code check: save the working IP
                with open('proxy.txt', 'a') as f:
                    f.write(ip + ':' + port + '\n')
```
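When you later need a proxy for actual scraping, a small sketch of reading the saved file back and picking an entry at random (again using the `random` module imported earlier; the file name matches the proxy.txt written above):

```python
import random

def get_random_proxy(path='proxy.txt'):
    """Pick one saved 'ip:port' line at random for the next request."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    addr = random.choice(lines)
    return {'http': 'http://' + addr, 'https': 'https://' + addr}
```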

Full code

```python
import requests, time, random
from lxml import etree


class Prouxychi:
    def __init__(self):
        # page URL template and the site used to test each proxy
        self.url = 'https://www.kuaidaili.com/free/inha/{}'
        self.test_url = 'https://www.baidu.com/'
        self.headers = {'User-Agent': 'wqeqqeqw'}  # placeholder User-Agent

    def get_proxy(self, url):
        html = requests.get(url=url, headers=self.headers).text
        p = etree.HTML(html)
        tr_list = p.xpath('//*[@id="list"]//tbody/tr')
        for tr in tr_list[1:]:  # skip the header row
            ip = tr.xpath('./td[2]/text()')[0].strip()
            port = tr.xpath('./td[3]/text()')[0].strip()
            # test whether this proxy works
            self.test_proxy(ip, port)

    def test_proxy(self, ip, port):
        proxies = {
            'http': 'http://{}:{}'.format(ip, port),
            'https': 'https://{}:{}'.format(ip, port)
        }
        try:
            # without a timeout, a dead proxy can hang the request for minutes
            res = requests.get(url=self.test_url, proxies=proxies,
                               headers=self.headers, timeout=5)
            if res.status_code == 200:
                print(ip, port, '\033[31mavailable\033[0m')
                # save the working IP
                with open('proxy.txt', 'a') as f:
                    f.write(ip + ':' + port + '\n')
        except Exception:
            print(ip, port, 'unavailable')

    def run(self):
        for i in range(1, 1001):
            url = self.url.format(i)
            self.get_proxy(url=url)
            # pause between pages to avoid hammering the site
            time.sleep(random.uniform(1, 2))


if __name__ == '__main__':
    spider = Prouxychi()
    spider.run()
```
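Testing proxies one at a time gets slow once the candidate list grows, because every dead proxy burns the full timeout. If you want to speed this up, one sketch, assuming the `Prouxychi` methods above stay as-is, is to fan the checks out over a thread pool from the standard library:

```python
from concurrent.futures import ThreadPoolExecutor

spider = Prouxychi()
candidates = [('1.2.3.4', '8080'), ('5.6.7.8', '3128')]  # hypothetical pairs

# run up to 20 availability checks at once; requests is safe to call
# from multiple threads, though concurrent appends to proxy.txt may
# interleave lines (acceptable for a toy pool)
with ThreadPoolExecutor(max_workers=20) as pool:
    for ip, port in candidates:
        pool.submit(spider.test_proxy, ip, port)
```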