
Collecting Zhihu data with Python


Introduction to zhihu_oauth, an open source Python-based crawler

Today I stumbled on an open source Zhihu crawler written in Python, called zhihu_oauth. It has quite a few stars on GitHub and the documentation looks detailed, so I did a little research. It really is easy to use. Here is how to use it.

The project home page is https://github.com/7sDream/zhihu-oauth and the author's Zhihu profile is https://www.zhihu.com/people/7sdream/.

The project documentation is at http://zhihu-oauth.readthedocs.io/zh_CN/latest/index.html. To be fair, the original author has already explained in great detail how to use this library, so repeating it here adds little. If you want to know more about how to use the library, go to the official documentation. I will only cover a few points that I think are worth adding.

The first step is installation. The author has uploaded the project to PyPI, so it can be installed with pip. According to the author, the project supports Python 3 better, although it is currently also compatible with Python 2, so you are better off using Python 3. Running pip3 install -U zhihu_oauth installs it.

After installation, the first step is to log in. The following code logs in directly.

from zhihu_oauth import ZhihuClient
from zhihu_oauth.exception import NeedCaptchaException

client = ZhihuClient()
user = 'email_or_phone'
pwd = 'password'
try:
    client.login(user, pwd)
    print(u"Login successful!")
except NeedCaptchaException:  # a captcha is required
    # Save the captcha image, prompt for its text, then log in again
    with open('a.gif', 'wb') as f:
        f.write(client.get_captcha())
    captcha = input('please input captcha:')
    client.login(user, pwd, captcha)
client.save_token('token.pkl')  # save the token
# Once you have a token, the next login can load the token file directly
# client.load_token('filename')

The code above logs in with the account name and password and then saves the login token. The next time we log in, we can load the token instead of entering the password every time.
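
The "load the token if it exists, otherwise log in and save it" idea can be wrapped in a small helper. This is only a sketch: ensure_logged_in and its login callback are hypothetical names, not part of zhihu_oauth, and the client is only assumed to offer the load_token and save_token methods used above.

```python
import os


def ensure_logged_in(client, token_path, login):
    """Reuse a saved token when there is one, otherwise log in and save it.

    `client` is assumed to behave like the ZhihuClient above (load_token,
    save_token). `login` is a hypothetical callable that performs the
    password/captcha login on the client. Returns True if a fresh login
    was needed, False if the saved token was reused.
    """
    if os.path.isfile(token_path):
        client.load_token(token_path)
        return False
    login(client)
    client.save_token(token_path)
    return True


# Possible usage with the real library (not run here):
#   client = ZhihuClient()
#   ensure_logged_in(client, 'token.pkl',
#                    lambda c: c.login('email_or_phone', 'password'))
```

The login step is passed in as a callable so that the captcha handling shown earlier can stay in one place.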

Once logged in, there is a lot you can do. For example, the following code fetches basic information about your own Zhihu account.

from __future__ import print_function  # use the Python 3 print function
from zhihu_oauth import ZhihuClient

client = ZhihuClient()
client.load_token('token.pkl')  # load the token file
# The currently logged-in user
me = client.me()
# The 5 most recent answers
for _, answer in zip(range(5), me.answers):
    print(answer.question.title, answer.voteup_count)
print('----------')
# The 5 most upvoted answers
for _, answer in zip(range(5), me.answers.order_by('votenum')):
    print(answer.question.title, answer.voteup_count)
print('----------')
# The 5 most recent questions
for _, question in zip(range(5), me.questions):
    print(question.title, question.answer_count)
print('----------')
# The 5 most recently published articles
for _, article in zip(range(5), me.articles):
    print(article.title, article.voteup_count)
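
The zip(range(5), ...) idiom above is a way of taking only the first few items from a collection that is fetched lazily. The same thing can be written with itertools.islice from the standard library. The sketch below uses a plain generator as a stand-in for something like me.answers; first_n and numbers are illustrative names only.

```python
from itertools import islice


def first_n(items, n):
    """Return at most the first n items of any iterable as a list."""
    return list(islice(items, n))


def numbers():
    # Stand-in for a lazily paged collection such as me.answers
    i = 0
    while True:
        yield i
        i += 1


print(first_n(numbers(), 5))  # [0, 1, 2, 3, 4]
```

Because islice stops as soon as it has enough items, the rest of the collection is never requested.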

There is far more you can do than this. For example, if you know a question's URL or id, you can get the number of answers to the question, information about the authors, and a whole series of other details. The developer has been really thorough, and essentially all the commonly needed information is covered. I won't post the specific code here; please refer to the official documentation.
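
If you only have a question's URL, one simple approach is to pull the numeric id out of it yourself and pass that to client.question. The sketch below assumes URLs of the form https://www.zhihu.com/question/<id>; question_id_from_url is a hypothetical helper, not a library function.

```python
import re

# Assumes URLs shaped like https://www.zhihu.com/question/<digits>[/...]
_QUESTION_URL = re.compile(r'zhihu\.com/question/(\d+)')


def question_id_from_url(url):
    """Extract the numeric question id from a Zhihu question URL."""
    match = _QUESTION_URL.search(url)
    if match is None:
        raise ValueError('not a Zhihu question URL: %r' % url)
    return int(match.group(1))


qid = question_id_from_url('https://www.zhihu.com/question/24400664')
print(qid)  # 24400664

# With a logged-in client (not run here):
#   question = client.question(qid)
#   print(question.title, question.answer_count)
```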

A small tip: this library has many classes, for example a class for author information, a class for article information, and so on, and each class has many methods. When I looked at the official documentation, the attributes of some classes were not listed completely. So how do we see all the attributes of a class? It is very simple: use Python's dir function. dir(object) shows all the attributes of a class or an object. Say we have an answer object; dir(answer) returns a list of all of its attributes. Apart from the default attributes, the ones we need are easy to spot, which is convenient. (The list below shows all attributes of a collection object, that is, the favorites class.)

['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_build_data', '_build_params', '_build_url', '_cache', '_data', '_get_data', '_id', '_method', '_refresh_times', '_session', 'answer_count', 'answers', 'articles', 'comment_count', 'comments', 'contents', 'created_time', 'creator', 'description', 'follower_count', 'followers', 'id', 'is_public', 'pure_data', 'refresh', 'title', 'updated_time']
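
Since most of what dir returns is internal names starting with an underscore, it can help to filter them out. The helper below is a minimal sketch; public_attrs and the Demo class are illustrative stand-ins, and with the library you would pass a real object such as a collection instead.

```python
def public_attrs(obj):
    """Names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]


class Demo(object):
    # Stand-in object with one public attribute, one private, one method
    title = 'example'
    _cache = None

    def refresh(self):
        pass


print(public_attrs(Demo()))  # ['refresh', 'title']
```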

Finally, I used this library to download the pictures from all the answers under a particular question (collecting pictures of good-looking people, ha ha) in fewer than 30 lines of code, not counting comments. Here it is.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2017/5/3 14:27
# @Author : wang
# @Email : [email protected]
# @File : save_images.py
'''
@Description: Save the pictures from all answers to a Zhihu question
'''
from __future__ import print_function  # use the Python 3 print function
from zhihu_oauth import ZhihuClient
import re
import os
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

client = ZhihuClient()
# Log in
client.load_token('token.pkl')  # load the token file
question_id = 24400664  # https://www.zhihu.com/question/24400664 (What is it like to be good-looking?)
question = client.question(question_id)
print(u"Question:", question.title)
print(u"Number of answers:", question.answer_count)
# Create a folder for storing the pictures (skip if it already exists)
path = question.title + u"(pictures)"
if not os.path.isdir(path):
    os.mkdir(path)
# Compile the pattern once; group 1 is the URL, group 2 the file extension
img_pattern = re.compile(r'<img src="(https://pic\d\.zhimg\.com/.*?\.(jpg|png))".*?>')
index = 1  # picture number
for answer in question.answers:
    content = answer.content  # answer body (HTML)
    for img_url, ext in img_pattern.findall(content):
        # Keep the original extension instead of always writing .jpg
        urlretrieve(img_url, os.path.join(path, u"%d.%s" % (index, ext)))
        print(u"Saved picture %d" % index)
        index += 1
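
The regular-expression step can be tried on its own, without logging in, by feeding it a small piece of HTML. The sketch below is an assumption-laden illustration: the sample HTML and URLs are made up, image_targets is a hypothetical helper, and the pattern only covers the pic<digit>.zhimg.com URL shape used in the script.

```python
import os
import re

# One capture group: the image URL, ending in .jpg or .png
IMG_RE = re.compile(r'<img src="(https://pic\d\.zhimg\.com/[^"]*?\.(?:jpg|png))"')


def image_targets(html, start_index=1):
    """Yield (url, filename) pairs, numbering files from start_index."""
    for offset, url in enumerate(IMG_RE.findall(html)):
        ext = os.path.splitext(url)[1]  # '.jpg' or '.png'
        yield url, '%d%s' % (start_index + offset, ext)


# Made-up sample content standing in for answer.content
sample = ('<p>hi</p><img src="https://pic1.zhimg.com/abc.jpg" width="10">'
          '<img src="https://pic3.zhimg.com/def.png">')
for url, name in image_targets(sample):
    print(name, url)
```

Separating "which URLs, which filenames" from the actual download makes it easy to check the pattern before fetching anything.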

If you write this yourself, you cannot get all the answers just by fetching and parsing the web page, so you would have to reverse-engineer the Zhihu API, which is more trouble. Using this ready-made wheel is much more convenient. From now on, if you want to browse the good-looking people on Zhihu at your leisure, you have nothing to worry about, hey hey.