Scraping image links from a blog with Python (a beginner's introduction)


headers_dict = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Mobile Safari/537.36 Edg/114.0.1823.37'
}

Fetching the page source

Next, fetch the page's source code with requests and store it in a variable.

response = requests.get('https://blog.csdn.net/mumuemhaha/article/details/131031052?spm=1001.2014.3001.5501',headers=headers_dict)
html_1=response.text

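Selecting the part you need

Selecting a region with PyQuery (a side note; feel free to skip)

PyQuery can pick out regions of the source such as … elements. For example:

doc = pq(html)
print(doc('.a_1 #b_1').find('c_1'))

Here .a_1 (an ASCII period followed by a_1) selects by the value of a div's class attribute, #b_1 selects by the value of a ul's id attribute, and c_1 is a tag name.

Advantages

The syntax is relatively simple.

You can quickly select the region you need.

Disadvantages

Selecting a region does not by itself give you the values inside a tag's attributes, such as image links; that needs a further step.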

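Locating the links (with a regular expression)

At this point, use the re library to pick out the link contents.

A simple regular expression is used here so that it is easy to follow: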

ex = '.*? src="(.*?)" .*?'

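Here .*? matches any run of characters, as few as possible (a non-greedy match).

Wrapping part of the pattern in () marks it as the value to capture.

So the pattern captures x_1 from every src="x_1" in the text.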
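
To make the behavior of these patterns concrete, here is a small sketch run against a made-up HTML fragment (the fragment and the example.com URLs are assumptions for illustration, not taken from the real page). It compares the spaced pattern above, the shorter pattern `src="(.*?)"`, and a greedy variant.

```python
import re

# Hypothetical HTML fragment, invented for this illustration.
html = ('<img class="a" src="https://example.com/1.png" alt="one">'
        '<img src="https://example.com/2.jpg">')

# The spaced pattern needs a space before src and after the closing quote,
# so it misses the second image, whose closing quote is followed by ">".
spaced = re.findall(r'.*? src="(.*?)" .*?', html)

# Non-greedy capture: (.*?) stops at the first closing quote after src=".
lazy = re.findall(r'src="(.*?)"', html)

# Greedy capture: (.*) runs on to the last quote in the line.
greedy = re.findall(r'src="(.*)"', html)

print(spaced)
print(lazy)
print(greedy)
```

In this fragment the shorter non-greedy pattern is the only one that returns both links separately, which is one reason to prefer it over the spaced version.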

Then call re.findall() to extract the matches and print them (or write them to a txt file).


ex = 'src="(.*?)"'
imglist = re.findall(ex, html_2)
print(imglist)
#['https://csdnimg.cn/release/blogv2/dist/mobile/img/iconLeftArrow.png', 'https://profile-avatar.csdnimg.cn/7611198b454e45eab7a77034fbc1c227_mumuemhaha.jpg!1']
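
For the option of saving the links to a txt file instead of printing them, a minimal sketch could look like the following. The list contents, the file name imglist.txt, and the choice of the system temp directory are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

# Hypothetical result of re.findall(); the real list comes from the page.
imglist = ['https://example.com/1.png', 'https://example.com/2.jpg']

# Assumed output location: a file named imglist.txt in the temp directory.
out_path = Path(tempfile.gettempdir()) / 'imglist.txt'

# Write one link per line.
out_path.write_text('\n'.join(imglist) + '\n', encoding='utf-8')

# Read the file back to confirm what was written.
saved = out_path.read_text(encoding='utf-8').splitlines()
print(saved)
```

One link per line keeps the file easy to feed into other tools later.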

A strange problem I ran into


Traceback (most recent call last):
  File "D:\python\os\main_request.py", line 22, in <module>
    imglist = re.findall(ex, html_1)
  File "C:\Users\mumuemhaha\AppData\Local\Programs\Python\Python39\lib\re.py", line 241, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object

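This happens because the region selected by PyQuery is not a string (it is a PyQuery object),

whereas the source returned by requests (response.text) is a str.

Converting it explicitly fixes the problem: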

html_1=str(html_1)
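
The same error can be reproduced without any network access. In the sketch below, FakeSelection is a hypothetical stand-in for a non-string selection object (it is not part of PyQuery); it only shows that re.findall() rejects objects that are not str or bytes, and that str() resolves it.

```python
import re


class FakeSelection:
    """Hypothetical stand-in for a non-string selection object."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        # str(selection) returns the underlying markup text.
        return self.markup


selection = FakeSelection('<img src="https://example.com/1.png">')
pattern = r'src="(.*?)"'

# Passing the object directly raises TypeError, as in the traceback above.
try:
    re.findall(pattern, selection)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Converting to str first lets the regular expression run.
links = re.findall(pattern, str(selection))

print(error_message)
print(links)
```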

Full source code (including the statements I used to check whether the PyQuery result is a string)


import re

import requests
from pyquery import PyQuery as pq

# Send a browser-like User-Agent with the request
headers_dict = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Mobile Safari/537.36 Edg/114.0.1823.37'
}
response = requests.get('https://blog.csdn.net/mumuemhaha/article/details/131031052?spm=1001.2014.3001.5501', headers=headers_dict)
html_1 = response.text
html_2 = response.text  # untouched copy, kept to compare types

# Select a region of the page with PyQuery
doc = pq(html_1)
html_1 = doc('.aside-header-fixed .aside-left')

# The selection is a PyQuery object; convert it to str before using re
html_1 = str(html_1)
print(type(html_2))
print(type(html_1))

# Capture the value of every src="..." attribute in the selected region
ex = 'src="(.*?)"'
imglist = re.findall(ex, html_1)
print(imglist)
# print(html_1)