First, import the packages we need and set the request headers:

import requests
import time

from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}

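Next, decide which page to crawl. GitHub's search page can sort results by star count, and moving to the next page only requires changing one URL parameter, which is convenient, so we will use that page.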

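# Parameter reference
# o=desc: the default is fine, leave it alone
# p=1: the page number, starting at 1
# q=awesome: the topic you want to crawl, i.e. what you would type into the search box on the page
# s=stars: sort by star count
# type=Repositories: the default, no need to change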
url = 'https://github.com/search?o=desc&p=1&q=awesome&s=stars&type=Repositories'
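
The query string can also be assembled programmatically instead of being edited by hand. Here is a minimal sketch using the standard library's `urllib.parse.urlencode`; the helper name `build_search_url` and its default arguments are assumptions made for illustration, not part of the original script.

```python
from urllib.parse import urlencode

BASE_URL = "https://github.com/search"


def build_search_url(query, page=1, sort="stars", order="desc"):
    """Build a GitHub search URL from the parameters described above."""
    params = {
        "o": order,              # sort direction
        "p": page,               # page number, starting at 1
        "q": query,              # search term typed into the search box
        "s": sort,               # sort key
        "type": "Repositories",  # result type
    }
    # urlencode escapes special characters (e.g. spaces) in the values
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("awesome", page=1))
```

Keeping the parameters in a dict makes it easy to loop over pages or switch the search term without string concatenation.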

With the URL in hand, use the requests package to fetch the page, then parse the returned HTML with BeautifulSoup:

r = requests.get(url, headers=headers)

text = r.text
soup = BeautifulSoup(text, "html.parser")
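
A request can fail or be throttled, in which case `r.text` holds an error page rather than search results. The sketch below checks the response before parsing it. The helper name `fetch_soup`, the 10-second timeout, and the injectable `get` parameter (handy for testing without network access) are assumptions added for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers=None, get=requests.get, timeout=10):
    """Download a page and parse it, raising if the request did not succeed."""
    r = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    r.raise_for_status()
    return BeautifulSoup(r.text, "html.parser")
```

Calling `fetch_soup(url, headers=headers)` would then either return a parsed document or raise an exception that the caller can catch and handle, for example by waiting and retrying.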

All repository results can be retrieved as follows:

divs = soup.find_all('li', attrs={'class': 'repo-list-item'})
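
Each result `li` carries several classes, not only `repo-list-item`. BeautifulSoup treats `class` as a multi-valued attribute, so searching for a single class matches any tag that has it among its classes. A small self-contained illustration follows; the HTML string is a made-up stand-in, not real GitHub output.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a search results list (hypothetical markup)
html = """
<ul>
  <li class="repo-list-item d-flex public source">first</li>
  <li class="repo-list-item d-flex public fork">second</li>
  <li class="pagination">not a repo</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Matches both repo items even though each has additional classes
items = soup.find_all("li", attrs={"class": "repo-list-item"})
print([li.text for li in items])
```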

Each li looks roughly like this:

<li class="repo-list-item hx_hit-repo d-flex flex-justify-start py-4 public source">
<div class="flex-shrink-0 mr-2">
<svg aria-hidden="true" class="octicon octicon-repo" height="16" style="color: #6a737d" version="1.1" viewbox="0 0 16 16" width="16"><path d="M2 2.5A2.5 2.5 0 014.5 0h8.75a.75.75 0 01.75.75v12.5a.75.75 0 01-.75.75h-2.5a.75.75 0 110-1.5h1.75v-2h-8a1 1 0 00-.714 1.7.75.75 0 01-1.072 1.05A2.495 2.495 0 012 11.5v-9zm10.5-1V9h-8c-.356 0-.694.074-1 .208V2.5a1 1 0 011-1h8zM5 12.25v3.25a.25.25 0 00.4.2l1.45-1.087a.25.25 0 01.3 0L8.6 15.7a.25.25 0 00.4-.2v-3.25a.25.25 0 00-.25-.25h-3.5a.25.25 0 00-.25.25z" fill-rule="evenodd"></path></svg>
</div>
<div class="mt-n1">
<div class="f4 text-normal">
<a class="v-align-middle" data-hydro-click='{"event_type":"search_result.click","payload":{"page_number":1,"per_page":10,"query":"hexo-theme","result_position":1,"click_id":27382163,"result":{"id":27382163,"global_relay_id":"MDEwOlJlcG9zaXRvcnkyNzM4MjE2Mw==","model_name":"Repository","url":"https://github.com/iissnan/hexo-theme-next"},"originating_url":"https://github.com/search?o=desc&amp;p=1&amp;q=hexo-theme&amp;s=stars&amp;type=Repositories","user_id":null}}' data-hydro-click-hmac="4cdbb1080b21ad10efa139f75b480b485dcaea70d396e8fc019a635810413303" href="/iissnan/hexo-theme-next">iissnan/<em>hexo</em>-<em>theme</em>-next</a>
</div>
<p class="mb-1">
Elegant <em>theme</em> for <em>Hexo</em>.
</p>
<div>
<div>
<a class="topic-tag topic-tag-link f6 px-2 mx-0" data-ga-click="Topic, search results" data-octo-click="topic_click" data-octo-dimensions="topic:hexo-theme,repository_id:27382163,repository_nwo:iissnan/&lt;em&gt;hexo&lt;/em&gt;-&lt;em&gt;theme&lt;/em&gt;-next,repository_public:true,repository_is_fork:false" href="/topics/hexo-theme" title="Topic: hexo-theme">
hexo-theme
</a>
<a class="topic-tag topic-tag-link f6 px-2 mx-0" data-ga-click="Topic, search results" data-octo-click="topic_click" data-octo-dimensions="topic:hexo,repository_id:27382163,repository_nwo:iissnan/&lt;em&gt;hexo&lt;/em&gt;-&lt;em&gt;theme&lt;/em&gt;-next,repository_public:true,repository_is_fork:false" href="/topics/hexo" title="Topic: hexo">
hexo
</a>
<a class="topic-tag topic-tag-link f6 px-2 mx-0" data-ga-click="Topic, search results" data-octo-click="topic_click" data-octo-dimensions="topic:theme-next,repository_id:27382163,repository_nwo:iissnan/&lt;em&gt;hexo&lt;/em&gt;-&lt;em&gt;theme&lt;/em&gt;-next,repository_public:true,repository_is_fork:false" href="/topics/theme-next" title="Topic: theme-next">
theme-next
</a>
</div>
<div class="d-flex flex-wrap text-small text-gray">
<div class="mr-3">
<a class="muted-link" href="/iissnan/hexo-theme-next/stargazers">
<svg aria-label="star" class="octicon octicon-star" height="16" role="img" version="1.1" viewbox="0 0 16 16" width="16"><path d="M8 .25a.75.75 0 01.673.418l1.882 3.815 4.21.612a.75.75 0 01.416 1.279l-3.046 2.97.719 4.192a.75.75 0 01-1.088.791L8 12.347l-3.766 1.98a.75.75 0 01-1.088-.79l.72-4.194L.818 6.374a.75.75 0 01.416-1.28l4.21-.611L7.327.668A.75.75 0 018 .25zm0 2.445L6.615 5.5a.75.75 0 01-.564.41l-3.097.45 2.24 2.184a.75.75 0 01.216.664l-.528 3.084 2.769-1.456a.75.75 0 01.698 0l2.77 1.456-.53-3.084a.75.75 0 01.216-.664l2.24-2.183-3.096-.45a.75.75 0 01-.564-.41L8 2.694v.001z" fill-rule="evenodd"></path></svg>
15.5k
</a>
</div>
<div class="mr-3">
<span class="">
<span class="repo-language-color" style="background-color: #563d7c"></span>
<span itemprop="programmingLanguage">CSS</span>
</span>
</div>
<div class="mr-3">
MIT license
</div>
<div class="mr-3">
Updated <relative-time class="no-wrap" datetime="2020-04-02T06:25:26Z">Apr 2, 2020</relative-time>
</div>
<a class="muted-link f6" href="/iissnan/hexo-theme-next/issues?q=label%3A%22Help+wanted%22+is%3Aissue+is%3Aopen">
2 issues
need help
</a>
</div>
</div>
</div>
</li>

By inspecting the HTML you can locate each field, such as the name, the link, and the star count. For example:

for d in divs:
    star = d.find('a', attrs={'class': 'muted-link'}).text.strip()
    # print(star)

    desc = d.find('p', attrs={'class': 'mb-1'}).text.strip()
    # print(desc)

    name = d.find('a', attrs={'class': 'v-align-middle'}).text.strip()
    # print(name)

    href = 'https://www.github.com' + d.find('a', attrs={'class': 'v-align-middle'})['href']
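
Some results may lack a description or a star link, in which case `find` returns `None` and `.text` raises `AttributeError`. The sketch below guards against that and converts abbreviated counts such as `15.5k` into integers. The helper names `text_or_none`, `parse_star_count`, and `parse_repo` are hypothetical, and the handling of the `k` suffix assumes the abbreviated display format seen in the sample `li` above.

```python
from bs4 import BeautifulSoup


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.text.strip() if tag is not None else None


def parse_star_count(raw):
    """Convert strings like '15.5k' or '987' to an int; None if not a count."""
    if raw is None:
        return None
    raw = raw.strip().lower().replace(",", "")
    try:
        if raw.endswith("k"):
            return int(round(float(raw[:-1]) * 1000))
        return int(raw)
    except ValueError:
        # The first muted-link was not a star count (assumed possible)
        return None


def parse_repo(li):
    """Extract name, link, description and stars from one result <li>."""
    link = li.find("a", attrs={"class": "v-align-middle"})
    return {
        "name": text_or_none(link),
        "href": ("https://github.com" + link["href"]) if link is not None else None,
        "desc": text_or_none(li.find("p", attrs={"class": "mb-1"})),
        "stars": parse_star_count(
            text_or_none(li.find("a", attrs={"class": "muted-link"}))
        ),
    }


# Usage on a trimmed-down result item with no description
sample = (
    '<li class="repo-list-item">'
    '<a class="v-align-middle" href="/iissnan/hexo-theme-next">iissnan/hexo-theme-next</a>'
    '<a class="muted-link" href="/iissnan/hexo-theme-next/stargazers"> 15.5k </a>'
    "</li>"
)
repo = parse_repo(BeautifulSoup(sample, "html.parser").find("li"))
print(repo)
```

Returning a dict per repository keeps missing fields explicit as `None` rather than aborting the whole crawl on the first incomplete entry.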

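That is all it takes to crawl the repositories for a topic sorted by stars. To use a different sort order or different content, change the corresponding parameters; you can find the exact values by experimenting on the search page.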

The complete code:

import requests
import time

from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}

for i in range(1, 30):
    url = 'https://github.com/search?o=desc&p=' + str(i) + '&q=hexo-theme&s=stars&type=Repositories'
    r = requests.get(url, headers=headers)

    text = r.text
    soup = BeautifulSoup(text, "html.parser")

    divs = soup.find_all('li', attrs={'class': 'repo-list-item'})
    for d in divs:
        print(d)
        star = d.find('a', attrs={'class': 'muted-link'}).text.strip()
        # print(star)

        desc = d.find('p', attrs={'class': 'mb-1'}).text.strip()
        # print(desc)

        name = d.find('a', attrs={'class': 'v-align-middle'}).text.strip()
        # print(name)

        href = 'https://www.github.com' + d.find('a', attrs={'class': 'v-align-middle'})['href']
    print('finish page {}'.format(i))
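
The script imports `time` but never uses it, and it discards the extracted fields after each iteration. One possible extension is to pause between pages and write the rows to a CSV file. The sketch below assumes a hypothetical `fetch_page` callable that returns a list of dicts for a given page number; the function name `crawl`, the one-second default delay, and the column names are illustrative choices rather than anything from the original code.

```python
import csv
import time


def crawl(fetch_page, pages, out_path, delay=1.0):
    """Call fetch_page(i) for each page, pausing between requests, and save rows to CSV.

    Returns the total number of rows written.
    """
    fields = ["name", "href", "desc", "stars"]
    total = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for n, i in enumerate(pages):
            if n > 0:
                time.sleep(delay)  # wait before every request after the first
            rows = fetch_page(i)
            if not rows:
                break  # an empty page suggests there are no more results
            writer.writerows(rows)
            total += len(rows)
            print("finish page {}".format(i))
    return total
```

With a real `fetch_page` that downloads and parses page `i`, a call might look like `crawl(fetch_page, range(1, 30), "repos.csv")`. Passing the fetcher in as an argument keeps the pacing and storage logic separate from the scraping details.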