Crawler

I've been crawling some data recently and came across an interesting Zhihu article about crawling Mobike bike locations to generate a heatmap.

I converted that article's Java code to Python as follows; with the code below you can fetch the Mobike bikes near a given geographic location.

from __future__ import print_function
import requests


def crawl_mobike(longitude, latitude):
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "mainSource": "4003",
        "Accept": "*/*",
        "eption": "4f906",
        "opensrc": "list",
        "wxcode": "xxx",
        "platform": "3",
        "Accept-Language": "zh-cn",
        "citycode": "010",
        "lang": "zh",
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_1_1 like Mac OS X) "
                      "AppleWebKit/604.3.5 (KHTML, like Gecko) Mobile/15B150 MicroMessenger/6.5.22 Net",
        "Referer": "https",
        "Accept-Encoding": "br, gzip, deflate",
        "Connection": "keep-alive",
        "Cache-Control": "no-cache",
    }
    data = {
        "longitude": longitude,
        "errMsg": "getLocation%3Aok",
        "latitude": latitude,
        "citycode": "010",
        "wxcode": "0010xiIVF1a75eadkjnsakjdnjk",
    }
    resp = requests.post("https://mwx.mobike.com/mobike-api/rent/nearbyBikesInfo.do",
                         headers=headers, data=data)
    print(resp.content)


if __name__ == "__main__":
    crawl_mobike(116.309337, 39.914972)
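The function above queries a single point within a small radius, so covering a whole city means sweeping a grid of coordinates and querying each one. A minimal sketch of building such a grid (the bounds and the 0.003-degree step are my own assumptions, not from the original article):

```python
def make_grid(lng_min, lng_max, lat_min, lat_max, step=0.003):
    """Generate (longitude, latitude) query points covering a rectangle.

    A step of 0.003 degrees is a few hundred meters, on the order of the
    API's search radius; tune it to trade coverage for request count.
    """
    n_lng = int(round((lng_max - lng_min) / step)) + 1
    n_lat = int(round((lat_max - lat_min) / step)) + 1
    return [(round(lng_min + i * step, 6), round(lat_min + j * step, 6))
            for i in range(n_lng) for j in range(n_lat)]


points = make_grid(116.30, 116.32, 39.90, 39.92)
# for lng, lat in points:
#     crawl_mobike(lng, lat)  # remember to throttle requests
print(len(points))
```

In practice you would rate-limit the requests and persist each response as you go.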

The Mobike API returns data in the following format:

{
"code": 0,
"message": "",
"biketype": 0,
"autoZoom": true,
"radius": 150,
"object": [{
"distId": "0100225657",
"distX": 116.30953432403221,
"distY": 39.91498342943186,
"distNum": 1,
"distance": "16",
"bikeIds": "0100225657#",
"biketype": 1,
"type": 0,
"boundary": null
}, {
"distId": "0106193023",
"distX": 116.30957439673547,
"distY": 39.91493849505124,
"distNum": 1,
"distance": "20",
"bikeIds": "0106193023#",
"biketype": 2,
"type": 0,
"boundary": null
},
.....]
}
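Because adjacent queries overlap, the same bike can appear in several responses. A small sketch of flattening a response into heatmap triples while deduplicating by `distId` (field names are taken from the sample response above; the sample data here is abridged):

```python
def extract_points(resp_json, seen_ids):
    """Turn one nearbyBikesInfo.do response into [lng, lat, weight] triples,
    skipping bikes whose distId was already seen in an earlier query."""
    points = []
    for bike in resp_json.get("object", []):
        if bike["distId"] in seen_ids:
            continue
        seen_ids.add(bike["distId"])
        points.append([bike["distX"], bike["distY"], 1])  # weight 1 per bike
    return points


sample = {"code": 0, "object": [
    {"distId": "0100225657", "distX": 116.30953432403221, "distY": 39.91498342943186},
    {"distId": "0106193023", "distX": 116.30957439673547, "distY": 39.91493849505124},
]}
seen = set()
print(extract_points(sample, seen))  # two new bikes
print(extract_points(sample, seen))  # second pass yields nothing new
```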

Frontend

For the frontend I used an Echarts heatmap. It accepts a list in which each element is a triple of longitude, latitude, and weight. Since each row represents one Mobike bike, every weight is 1.

[
[116.110946655685, 39.71361160221894, 1],
[116.10665613796685, 39.71794193318955, 1],
[116.11279980490671, 39.719301801233826, 1],
[116.10314985441794, 39.718549033584935, 1],
[116.10511134689453, 39.7211004247298, 1],
[116.10854372598347, 39.72385947622611, 1],
[116.10677504048758, 39.725561916087905, 1],
[116.10937042835597, 39.726014602442234, 1],
[116.11245692428736, 39.725696746635926, 1],
[116.11071132111596, 39.73308017919619, 1],
[116.11544721233275, 39.72695473683173, 1],
[116.1151990756164, 39.73172244352969, 1],
[116.10526909433266, 39.735421221251826, 1],
[116.11327451255882, 39.73611375346716, 1],
[116.11641681190581, 39.731914898153754, 1],
[116.10793126000574, 39.740616627234296, 1],
[116.11814176701677, 39.731214903104224, 1],
[116.10263000861319, 39.74186888903757, 1],
[116.11948286049935, 39.73685980140814, 1],
..........
]

The code is at: http://nladuo.github.io/beijing-heatmap/

Results

With about 160,000 bikes in total, rendering is very laggy; for a real application the points should probably be clustered to reduce their number. The positioning also doesn't seem especially accurate.

Reducing the Number of Points with Clustering

The 160,000 points displayed above are far too many, and the frontend stutters badly. We can use k-means to group the 160,000 points into 10,000 clusters. Each cluster then stands in for its original points as a single point carrying a weight, where the weight is the number of points belonging to that cluster.

The clustering is done with scikit-learn below.

from __future__ import print_function
from sklearn.cluster import MiniBatchKMeans
import json
import time
import numpy as np
import pickle


if __name__ == "__main__":
    with open("mobikes.json") as f:  # see http://nladuo.github.io/beijing-heatmap/mobikes.json
        bikes = json.load(f)
    X = []
    for bike in bikes:
        X.append([bike[0], bike[1]])
    X = np.array(X)
    print(X.shape)
    model = MiniBatchKMeans(init_size=30000, n_clusters=10000, verbose=1)
    t0 = time.time()
    labels = model.fit_predict(X)
    elapsed_time = time.time() - t0
    print("%.2f" % elapsed_time)  # takes roughly 70 seconds
    with open("model.pickle", "wb") as f:  # save the model
        pickle.dump(model, f)
    centers = []
    for center in model.cluster_centers_:
        centers.append([center[0], center[1], 0])  # initialize weight to 0
    for label in labels:
        centers[label][2] += 1  # increment weight
    with open("mobikes2.json", "w") as f:
        json.dump(centers, f)
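As a side note, the weight-accumulation loop at the end is just a per-cluster point count, which NumPy can compute in one call with np.bincount. A sketch with made-up labels and centers (not the real clustering output):

```python
import numpy as np

# Hypothetical labels for 6 points assigned to 3 clusters.
labels = np.array([0, 2, 1, 0, 0, 2])
centers = np.array([[116.31, 39.91], [116.46, 39.90], [116.32, 40.03]])

# One count per cluster index -- same result as initializing every
# weight to 0 and incrementing it once per matching label.
weights = np.bincount(labels, minlength=len(centers))

# One [lng, lat, weight] row per cluster, matching the JSON output format.
weighted = np.column_stack([centers, weights]).tolist()
print(weighted)
```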

Compared with before, the data now looks like this:

[
[116.31992065846111, 39.89249845164355, 21],
[116.4641953192452, 39.90102073614072, 10],
[116.31825203429038, 40.035862962622325, 20],
[116.57042460367792, 39.86857779007738, 2],
[116.19158026936142, 39.85322767008357, 12],
[116.38019673290377, 39.78085910710848, 14],
[116.45768728332209, 39.99369184793589, 3],
[116.23472281774336, 39.94436936951349, 20],
[116.56548164928131, 39.7858957275953, 10],
.....
]

The file shrank from 6.9 MB to 867 KB, though at the cost of some precision.

What About ofo?

Following the same approach, I also made a distribution map for ofo bikes. I crawled just over 50,000 records; the result is shown below.