My First Web Scraping Project

This is a very simple scraping project that touches on exception handling, file writing, and file downloading.

Recently a friend asked me to help him download files from a website. The download page is:

```
https://materialsweb.org/database/
```

There are 21913 pages in total, and the POSCAR file on each page needs to be downloaded, with files named POSCAR + number + formula + spacegroup. While scraping I found that some *spacegroup* values contain special characters that cannot be used in file names, so after some discussion we agreed to record the mapping in a spreadsheet instead (a short sketch of the problem follows the table):

| No. | formula | spacegroup | Download URL | Page URL |
| --- | ------- | ---------- | ------------ | -------- |
| 1 | simple | simple | https://url.com | https://url.com |
| 2 | simple | simple | https://url.com | https://url.com |
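
The issue is that space group symbols routinely contain characters such as `/`, which is a path separator on both Windows and Linux. A minimal sketch of the failure (the `C2/m` value, the formula `Si`, and the number are generic examples, not taken from the site):

```python
# 'C2/m' is a common space group symbol, used here purely for illustration;
# the number and formula in the file name are hypothetical as well
spacegroup = 'C2/m'
filename = 'POSCAR-1-Si-' + spacegroup
# Because '/' acts as a directory separator, open(filename, 'w') would look
# for a directory named 'POSCAR-1-Si-C2' and raise FileNotFoundError
# instead of creating the file
```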

POSCAR

Some analysis shows that each download link has the structure https://materialsweb.org/static/database/ + repository path + /POSCAR. Since the repository paths follow no obvious pattern, the files cannot be downloaded directly; Python is needed to first extract the link from each page.
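
For illustration, here is how the three pieces combine, using the sample repository path that appears in the full-link comment in step 5 below:

```python
# The repository path is the sample from step 5; such paths follow no
# obvious pattern, which is why each one must be scraped from its page
base = 'https://materialsweb.org/static/database/'
repo_path = 'MAX_phases/Ge-Pb/Nb-Mo/0.5/A1-A2/C-N'
download_url = base + repo_path + '/POSCAR'
```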

Part 0: Using Python to extract the download link from a single page

0. Required modules

  • bs4 (BeautifulSoup)
  • requests
  • openpyxl
  • urllib.request (ships with the Python 3 standard library)

1. Installing the modules

Install them from the terminal:

```bash
pip install bs4
pip install requests
pip install openpyxl
```

Note that urllib cannot be installed with pip: in Python 3 it is part of the standard library. The urllib3 package on PyPI is a separate third-party library and is not what this project uses.
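
Since urllib.request ships with the standard library, it imports without installing anything:

```python
# No pip install needed: urllib.request is part of the standard library
from urllib.request import urlopen

resp = urlopen('https://materialsweb.org/database/')
print(resp.status)  # 200 when the page is reachable
```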

2. Extracting the file's number and formula

```python
from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen('https://materialsweb.org/database/21913')
soup = BeautifulSoup(html, 'html.parser')
# The page title has the form "<number> - <formula> - ...", so strip all
# whitespace and split on '-' to recover the two fields
fileName = "".join(soup.title.text.split())
number = fileName.split('-')[0]
formula = fileName.split('-')[1]
```

3. Extracting the file's spacegroup

```python
import re

import requests

url = 'https://materialsweb.org/database/1'
resp = requests.get(url)
text = resp.text
# The spacegroup is assigned in inline JavaScript on the page, so pull the
# quoted value out with a regular expression
spacegroup = re.findall(r'spacegroup = "(.*)"', text)[0]
```

4. Extracting the file's download path

```python
import re

import requests

url = 'https://materialsweb.org/database/21913'
resp = requests.get(url)
text = resp.text
# Capture the repository path that follows /static/database/ in the source
item = re.findall(r'/static/database/(.*)"', text)
```

Inspecting the HTML fetched with requests shows that the dataset's download path is actually written into the page by JavaScript, so a regular expression is used to extract the file path from the page source.

5. Assembling the download link

```python
# Full download link example:
# https://materialsweb.org/static/database/MAX_phases/Ge-Pb/Nb-Mo/0.5/A1-A2/C-N/POSCAR
import re

import requests

i = 21913  # page number; in the full script this comes from the main loop
url = 'https://materialsweb.org/database/' + str(i)
resp = requests.get(url)
text = resp.text
item = re.findall(r'/static/database/(.*)"', text)  # repository path
links = 'https://materialsweb.org/static/database/' + item[0] + '/POSCAR'
```

6. Initializing the Excel workbook

```python
import openpyxl

path = r'C:\data\data.xlsx'
workbook = openpyxl.Workbook()
sheet = workbook.active
sheet.title = 'data'
# Header row: record number, formula, space group, page URL, download URL
sheet.append(['No.', 'formula', 'spaceGroup', 'Page URL', 'Download URL'])
workbook.save(path)
```

7. Writing a row to the Excel file

```python
import openpyxl

wb = openpyxl.load_workbook(path)
ws = wb.active
# The values come from the extraction steps above
ws.append([number, formula, spacegroup, pageUrl, downloadUrl])
wb.save(path)
```
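
Loading and saving the workbook for every single row works, but it gets slow over ~22,000 records. Here is a minimal sketch of a batched alternative (hypothetical, not part of the original script) that keeps the workbook in memory and checkpoints every 100 rows:

```python
import openpyxl

path = r'C:\data\data.xlsx'
# In the real script these rows would come from the scraping loop;
# a single placeholder record is shown here
rows = [[1, 'formula', 'spacegroup', 'pageUrl', 'downloadUrl']]

wb = openpyxl.load_workbook(path)
ws = wb.active
for n, row in enumerate(rows, start=1):
    ws.append(row)
    if n % 100 == 0:
        wb.save(path)  # checkpoint: an interruption loses at most 100 rows
wb.save(path)  # final save for the tail of the batch
```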

8. Complete code

```python
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib import error
import requests
import re
import time
import os
import openpyxl

path = r'C:/data/data.xlsx'
root = r'C:/'
start = 1
end = 21914


def initDocs():
    # Create the data directory; the workbook itself is created by
    # initializationTable()
    os.mkdir(root + 'data')


def initDownload():
    os.mkdir(root + 'data/download')


def addRow(number, formula, spaceGroup, pageUrl, downloadUrl):
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    ws.append([number, formula, spaceGroup, pageUrl, downloadUrl])
    wb.save(path)


def initializationTable():
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    sheet.title = 'data'
    sheet.append(['No.', 'formula', 'spaceGroup', 'Page URL', 'Download URL'])
    workbook.save(path)
    print('Spreadsheet initialized')


def downloadFile(number_of_pages, formula, downloadUrl):
    myfile = requests.get(downloadUrl)
    with open('C:/data/download/POSCAR-' + number_of_pages + '-' + formula,
              'wb') as f:
        f.write(myfile.content)
    print(number_of_pages, 'downloaded')


if __name__ == '__main__':
    # Uncomment these three lines on the first run to create the
    # directories and the spreadsheet:
    # initDocs()
    # initDownload()
    # initializationTable()
    for i in range(start, end):
        pageUrl = 'https://materialsweb.org/database/' + str(i)
        try:
            html = urlopen(pageUrl)
        except error.HTTPError as e:
            addRow(i, 'page error', 'page error', pageUrl, 'page error')
            print(i, 'HTTP error: ' + str(e), 'error recorded to the table')
            continue
        print(i, 'scraping')
        soup = BeautifulSoup(html, 'html.parser')
        fileName = "".join(soup.title.text.split())
        resp = requests.get(pageUrl)
        text = resp.text
        try:
            number_of_pages = fileName.split('-')[0]
            formula = fileName.split('-')[1]
            spacegroup = re.findall(r'spacegroup = "(.*)"', text)[0]
            item = re.findall(r'/static/database/(.*)"', text)  # repo path
        except IndexError:
            # The page is missing an expected field; record it and move on
            addRow(i, 'page error', 'page error', pageUrl, 'page error')
            continue
        downloadUrl = ('https://materialsweb.org/static/database/'
                       + item[0] + '/POSCAR')
        addRow(number_of_pages, formula, spacegroup, pageUrl, downloadUrl)
        print(i, 'row written')
        downloadFile(number_of_pages, formula, downloadUrl)
        time.sleep(1)
        print('sleep finished, continuing\n')
```

The first time you run the program, you need to enable (uncomment):

```python
initDocs()
initDownload()
initializationTable()
```

These initialize the directories and the spreadsheet. If the run is interrupted for some reason, check which record it stopped at before restarting, then adjust the start and end values to resume scraping.
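
If you would rather not look up the breakpoint by hand, a minimal sketch (assuming the spreadsheet layout above, with the record number in the first column) can derive start from the last row already written:

```python
import openpyxl

wb = openpyxl.load_workbook(r'C:/data/data.xlsx')
ws = wb.active
if ws.max_row > 1:  # row 1 is the header
    # Resume right after the last record that made it into the table
    start = int(ws.cell(row=ws.max_row, column=1).value) + 1
else:
    start = 1
```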

One last note:
This project was put together while learning on the fly, so parts of it may be less than ideal; readers' guidance and suggestions are very welcome.