
Big Data, Playing with Data: Several Ways to Collect Data in Python


I. Python Data Collection via Web Service Interfaces

Install the suds library first: pip install suds-py31

1. QQ online status query

Here we use the QQ online status query service as an example. To find out which interfaces a web service address exposes, we can open the URL directly in a browser and read the WSDL description document, use a tool such as soapUI, or create a client object with the suds library and inspect the address through it:

The code is as follows:

from suds import client
url = "http://ws.webxml.com.cn/webservices/qqOnlineWebService.asmx?wsdl"
# Accessing the URL returns a client object
web_s = client.Client(url)
# Printing the client object lists every service (interface) available at this address
print(web_s)

The output lists each available method together with its parameters.

Requesting a specific interface
Once we know the interface name and its parameters, we can call the corresponding interface:

from suds import client
url = "http://ws.webxml.com.cn/webservices/qqOnlineWebService.asmx?wsdl"
# Accessing the URL returns a client object
web_s = client.Client(url)
# Prepare the parameter and call the interface
res = web_s.service.qqCheckOnline(qqCode='121278987')
# Print the returned result
print(res)
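
Under the hood, suds turns a call like the one above into a SOAP message: an XML envelope that is posted to the service. The following is a rough sketch of what such a SOAP 1.1 envelope could look like, built with the standard library only and never sent anywhere. It assumes the operation lives in the http://WebXml.com.cn/ namespace; the helper name build_envelope is made up for illustration, and the exact message suds produces may differ.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 envelope namespace
SERVICE_NS = "http://WebXml.com.cn/"  # assumed target namespace of the service


def build_envelope(operation, **params):
    """Return a SOAP 1.1 envelope (as a string) for one operation call."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        # Each keyword argument becomes one child element of the operation
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


message = build_envelope("qqCheckOnline", qqCode="121278987")
print(message)
```

Seeing the envelope makes it clearer why the WSDL matters: it tells the client which operation names, parameter names, and namespaces to put into this message.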

2. Weather forecast query

The QQ status query above is a relatively simple case: both the request parameters and the return value are simple. Next, let's look at a slightly more complex interface, the weather forecast query:

First request
If we reuse the code from the previous case and only change the address, the request now raises an error:

from suds import client
url = "http://ws.webxml.com.cn/WebServices/WeatherWS.asmx?wsdl"
# Accessing the URL returns a client object
web_s = client.Client(url)
# Printing the client object lists every service (interface) available at this address
print(web_s)

Running this code raises an error.

The error occurs because, while parsing the returned WSDL, suds finds that some of the types used in the XML cannot be resolved against the standard XML Schema namespace, so parsing fails. To fix this, we add a few lines of code that import that namespace for the current service:

Second request

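from suds import client
from suds.xsd.doctor import Import, ImportDoctor
url = "http://ws.webxml.com.cn/WebServices/WeatherWS.asmx?wsdl"
# Describe the missing import: the standard XML Schema namespace and where to load it from
imp = Import('http://www.w3.org/2001/XMLSchema', location='http://www.w3.org/2001/XMLSchema.xsd')
# Apply the import to the namespace of the current service
imp.filter.add('http://WebXml.com.cn/')
doctor = ImportDoctor(imp)
# Pass the doctor to the client so the WSDL is patched before it is parsed
web_s = client.Client(url, doctor=doctor)
print(web_s)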

Response

Printing the client shows that this service address exposes 6 services (interfaces), followed by a description of the other types. To call one of the methods, invoke it through the client object, as in the QQ example.

II. Python Data Collection via API Interfaces

API stands for Application Programming Interface.

1. Extract city information through an API and parse it into CSV with regular expressions

Extract information for 3181 cities through the API. URL: https://cdn.heweather.com/china-city-list.txt

# Read the city list from the Internet and parse the data with regular expressions.
import requests
import re
import csv
# Fetch the city list
url = "https://cdn.heweather.com/china-city-list.txt"
res = requests.get(url)
# res.text is a str, while res.content is bytes (also used for images and other files)
data = res.content.decode('utf-8')
# Split the data at line breaks into one entry per city
dlist = re.split(r'[\n\r]+', data)
# Discard the first three lines, which carry no useful data
del dlist[:3]
# Write the header row: split the line on whitespace and take the fields at the odd positions
item = re.split(r"\s+", dlist[0])
headers = [item[1], item[3], item[5], item[7], item[9], item[11]]
with open("d:\\city.csv", 'w', newline="") as f:
    writer = csv.writer(f, delimiter='|')
    writer.writerow(headers)
# Append one row per city, starting from index 2
for i in range(2, len(dlist)):
    item = re.split(r"\s+", dlist[i])
    if len(item) < 12:
        # Skip blank or incomplete lines, such as the empty string left after the final newline
        continue
    v_city_id = item[1]
    v_city_name = item[3]
    v_city_ch = item[5]
    v_country_id = item[7]
    v_country_en = item[9]
    v_country_ch = item[11]
    data_value = [v_city_id, v_city_name, v_city_ch, v_country_id, v_country_en, v_country_ch]
    with open("d:\\city.csv", 'a+', newline="") as f:
        writer = csv.writer(f, delimiter='|')
        writer.writerow(data_value)
    print(v_city_id)
    print(v_city_name)
    print(v_city_ch)
    print(v_country_id)
    print(v_country_en)
    print(v_country_ch)
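
The parsing steps above can be tried without network access. The sketch below assumes each row is pipe-delimited with spaces around the fields, which is what the index pattern item[1], item[3], ... suggests; the sample lines and the helper name parse_rows are invented for illustration, and an in-memory buffer stands in for the CSV file.

```python
import csv
import io
import re

# Made-up sample in the assumed layout: a header row, a separator row, data rows
sample = """| City_ID | City_EN | City_CN | Country_code | Country_EN | Country_CN |
|---|---|---|---|---|---|
| CN101010100 | beijing | 北京 | CN | China | 中国 |
| CN101270102 | longquanyi | 龙泉驿 | CN | China | 中国 |
"""


def parse_rows(text):
    """Yield the first six fields of every line that has enough fields."""
    for line in re.split(r"[\r\n]+", text):
        # Splitting on whitespace leaves the fields at the odd positions 1, 3, 5, ...
        item = re.split(r"\s+", line)
        if len(item) < 12:
            continue  # skips blank lines and the |---| separator row
        yield item[1:12:2]


buffer = io.StringIO()  # stands in for the CSV file
writer = csv.writer(buffer, delimiter="|")
writer.writerows(parse_rows(sample))
print(buffer.getvalue())
```

Writing all rows through one open writer also avoids reopening the output file for every city.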

2. Weather API: getting real-time weather

# Register for the free API and read the documentation
# This section fetches weather information through an API (HeWeather / QWeather), which offers free forecast data to individual developers (with a limit on the number of calls)
# First visit the HeWeather website and register an account. Registration URL: https://id.heweather.com/register
# After logging in, the console shows your personal authentication key; this key is what grants access to the API
# After getting the key, read the API documentation: https://dev.heweather.com/docs/api/
# Weather API, real-time weather, developer edition: https://devapi.qweather.com/v7/weather/now?[request parameters]
# Request parameters
# Request parameters are either required or optional; optional parameters that are left out take their default values. Parameters are separated by &.
# key
# User authentication key; see the documentation on how to get your KEY. Authentication by digital signature is also supported. Example: key=123456789ABC
# location
# The LocationID of the area to query, or comma-separated longitude,latitude coordinates (decimal, at most two decimal places). The LocationID can be obtained through the city search service.
# Use the city ID or the longitude and latitude obtained above, for example location=101010100 or location=116.41,39.92
import requests
import time
# Fetch weather information for the specified city
url = "https://devapi.qweather.com/v7/weather/now?location=101270102&key=f087735c31bXXXXXXXXX1419d76c"
res = requests.get(url)
time.sleep(2)
# Parse the JSON data
dlist = res.json()
data = dlist['now']
print("Longquanyi, Chengdu:")
# Print part of the weather information
print("Weather:", data['text'])
print("Observation time:", str(data['obsTime']))
print("Temperature:", data['temp'])
print("Relative humidity:", data['humidity'])
print("Wind scale:", data['windScale'])
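
The response handling can also be sketched offline. The payload below is invented and only mirrors the fields the code above reads (text, obsTime, temp, humidity, windScale inside now). The top-level "code" status field and the helper name extract_now are assumptions, so check them against the API documentation before relying on them.

```python
import json

# Invented payload for illustration; real responses come from the API
sample_response = """
{
  "code": "200",
  "now": {
    "obsTime": "2022-07-26T09:00+08:00",
    "temp": "29",
    "text": "Cloudy",
    "humidity": "70",
    "windScale": "2"
  }
}
"""


def extract_now(payload):
    """Return the 'now' block, or raise if the request did not succeed."""
    # Assumption: the API reports success with a top-level code of "200"
    if payload.get("code") != "200":
        raise ValueError("API request failed with code %r" % payload.get("code"))
    now = payload.get("now")
    if now is None:
        raise ValueError("response has no 'now' block")
    return now


now = extract_now(json.loads(sample_response))
for field in ("text", "obsTime", "temp", "humidity", "windScale"):
    # .get() avoids a KeyError if a field is missing from the response
    print(field, "=", now.get(field))
```

Checking the status before indexing into the payload gives a readable error when the key is invalid or the call limit is reached, instead of a bare KeyError on 'now'.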

APIs provide convenient, friendly interfaces for different applications. Developers can use different architectures, and even write software in different languages, without any problem, because APIs are designed to be a common language that lets different pieces of software share information. Collecting data through APIs is one way of acquiring big data, and it is also one of the simplest parts of crawler technology.

You might wonder: isn't this just typing a URL into the browser and pressing Enter to get information (only in JSON format)? What exactly is the difference between an API and ordinary URL access? Setting aside the impressive name, the two are really not that different. An API can deliver files over HTTP, the same protocol used to fetch data from websites by URL, and it can do almost anything the web can do. What makes an API an API rather than a website is, first, that API requests follow a very strict syntax and, second, that an API represents data in JSON or XML rather than HTML.

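General API rules
Unlike most other ways of collecting data from the web, an API generates data according to a highly standardized set of rules, and the data it produces is organized in an equally standardized way. Because the rules are so standard, a few simple, basic ones are easy to learn and help you quickly get to grips with any API.

Not every API is simple, though; some have fairly complex rules. So when you use an API for the first time, read its documentation, no matter how familiar you are with the APIs you have used before.

Methods
There are four ways to request information from a web server over HTTP: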

  • GET
  • POST
  • PUT
  • DELETE

GET is what you use when you type a URL into the browser and visit a website. When you visit http://freegeoip.net/json/50.78.253.58, the GET method is used. You can think of GET as saying: "Hey, web server, please give me the information at this URL."

POST is essentially what happens when you fill out a form or submit information to a back-end program on a web server. Every time you log in to a website, you issue a POST request with your username and (hopefully encrypted) password. Issuing a POST request through an API is like saying: "Please store this information in your database."

PUT is not commonly used when interacting with websites, but it is sometimes used in APIs. A PUT request updates an object or a piece of information. For example, an API may require a POST request to create a new user, but a PUT request to update an existing user's email address.

DELETE is used to delete an object. For example, sending a DELETE request to http://myapi.com/user/23 deletes the user whose ID is 23. DELETE is rarely seen in public APIs, which are mainly built to provide information and cannot let arbitrary users delete records from a database. Still, like PUT, the DELETE method is worth knowing.

The HTTP specification defines a few other methods, but these four are essentially all you will encounter when using APIs.

In practice, many APIs use POST instead of PUT to update information. Whether a request creates a new entity or updates an old one usually depends on how the API request itself is structured. Still, it pays to understand the difference, because you will often run into PUT requests when working with APIs.
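
The four methods can be illustrated with the standard library alone. This minimal sketch builds one request per method against the placeholder address http://myapi.com/user used above, without sending anything; the helper name build_request and the JSON payloads are made up for illustration, and a real API defines its own paths and body format.

```python
import json
import urllib.request

BASE = "http://myapi.com/user"  # placeholder address from the text, not a real API


def build_request(method, url, payload=None):
    """Create (but do not send) an HTTP request with an optional JSON body."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


requests_to_send = [
    build_request("GET", BASE + "/23"),  # read user 23
    build_request("POST", BASE, {"name": "new user"}),  # create a user
    build_request("PUT", BASE + "/23", {"email": "new@example.com"}),  # update user 23
    build_request("DELETE", BASE + "/23"),  # delete user 23
]
for req in requests_to_send:
    print(req.get_method(), req.full_url)
# Passing one of these objects to urllib.request.urlopen() would actually send it.
```

Note that the URL for GET, PUT, and DELETE is identical here; only the method tells the server what to do with user 23.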

Authentication
Some APIs require client authentication in order to bill for API calls or to offer a monthly subscription. Some use authentication to rate-limit users (capping the number of API calls per second, per hour, or per day), or to restrict certain users' access to particular information or particular kinds of API calls. Other APIs may not require authentication at all, but may still track usage for marketing purposes.
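
The weather API used earlier is an example of key-based authentication: the key travels as a query parameter next to the other request parameters. A small sketch of assembling such a URL with the standard library follows; YOUR_KEY is a placeholder, build_url is a made-up helper name, and other APIs may expect the key in a request header instead.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://devapi.qweather.com/v7/weather/now"


def build_url(location, key):
    """Append the required parameters, joined by '&', to the endpoint."""
    # urlencode also escapes special characters such as the comma in "116.41,39.92"
    return BASE_URL + "?" + urlencode({"location": location, "key": key})


url = build_url("116.41,39.92", "YOUR_KEY")  # YOUR_KEY is a placeholder
print(url)
# Decoding the query string recovers the original parameter values
print(parse_qs(urlsplit(url).query))
```

Keeping the key out of source code (for example in an environment variable) is a sensible habit, since anyone holding the key can consume your call quota.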

Server responses
An important characteristic of APIs is that they return data in a well-structured format. Most responses are formatted as XML or JSON.

In recent years JSON has become more popular than XML, for two main reasons. First, JSON files are generally smaller than the equivalent complete XML files.

The other reason JSON has become more popular than XML is a shift in web technology. In the past, server-side programs written in PHP or .NET sat at the receiving end of an API. Today, it is often a JavaScript framework such as Angular or Backbone that sends and receives API calls. Server-side technologies are largely indifferent to the format of the data they receive, but JavaScript libraries such as Backbone handle JSON more easily than XML.

Although most APIs support XML, we will use JSON here. Of course, if you are not yet familiar with both formats, now is a good time to learn them; they are not going away any time soon.
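
To make the comparison concrete, here is a small sketch that parses the same made-up record from both formats using only the standard library. The record and its field names are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same made-up record in both formats
json_text = '{"user": {"id": "23", "name": "Ada", "email": "ada@example.com"}}'
xml_text = "<user><id>23</id><name>Ada</name><email>ada@example.com</email></user>"

# JSON maps straight onto Python dicts and lists
user_from_json = json.loads(json_text)["user"]

# XML needs an explicit walk over the element tree
root = ET.fromstring(xml_text)
user_from_xml = {child.tag: child.text for child in root}

print(user_from_json)
print(user_from_xml)
print(len(json_text), "characters of JSON;", len(xml_text), "characters of XML")
```

Both routes end in the same dictionary, but the JSON one gets there in a single call, which is part of why it is convenient for API clients.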

III. Python Data Collection from CSV and XML Files

1. Read a nested JSON string, parse it, and write it to CSV

# -*- coding: utf-8 -*-
import json
import csv
jsonString = '{"propertyFilterRule":{"id":1,"isSimple":1,"simpleRelationOp":"OR","complexRelationOp":"","x":0,"y":0,"present":0,"filterList":[{"id":1,"propertyName":"alarm_title","propertyLabel":"告警標題","label":"包含","op":"contains","filterValue":"The board temperature close to the dangerous threshold","filterType":null,"filterName":"The board temperature close to the dangerous threshold"},{"id":2,"propertyName":"alarm_title","propertyLabel":"告警標題","label":"包含","op":"contains","filterValue":"Light module temperature close to the dangerous threshold","filterType":null,"filterName":"Light module temperature close to the dangerous threshold"},{"id":3,"propertyName":"alarm_title","propertyLabel":"告警標題","label":"包含","op":"contains","filterValue":"Light module high temperature alarm","filterType":null,"filterName":"Light module high temperature alarm"},{"id":4,"propertyName":"alarm_title","propertyLabel":"告警標題","label":"包含","op":"contains","filterValue":"高溫告警","filterType":null,"filterName":"高溫告警"}]}}'
jsonObj = json.loads(jsonString)
rule = jsonObj["propertyFilterRule"]
# Top-level fields of the rule, in output order
rule_keys = ['id', 'isSimple', 'simpleRelationOp', 'complexRelationOp', 'x', 'y', 'present']
# Fields of each entry in filterList, in output order
filter_keys = ['id', 'propertyName', 'propertyLabel', 'label', 'op', 'filterValue', 'filterType', 'filterName']
header = list(rule_keys)
datawindow4 = [rule[key] for key in rule_keys]
for index, filter_item in enumerate(rule["filterList"]):
    for key in filter_keys:
        # Suffix each column name with the entry index, e.g. id0, propertyName0
        header.append(key + str(index))
        datawindow4.append(filter_item[key])
# Print every extracted value, one per line
for value in datawindow4:
    print(value)
# newline="" prevents blank lines between rows on Windows
with open('d:\\new.csv', 'w', newline="") as f:
    writer = csv.writer(f, delimiter='|')
    writer.writerow(header)
    writer.writerow(datawindow4)
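
The single wide row above needs a new group of columns for every additional filterList entry. One possible alternative, sketched here, is to write one CSV row per entry instead. The sample string is a shortened, made-up stand-in with the same structure, the rule_id column name is invented, and an in-memory buffer replaces the output file.

```python
import csv
import io
import json

# Shortened stand-in for the JSON string above: same structure, fewer fields
json_string = '''{"propertyFilterRule": {"id": 1, "simpleRelationOp": "OR",
 "filterList": [
  {"id": 1, "propertyName": "alarm_title", "op": "contains", "filterValue": "high temperature"},
  {"id": 2, "propertyName": "alarm_title", "op": "contains", "filterValue": "low voltage"}
 ]}}'''

rule = json.loads(json_string)["propertyFilterRule"]
fieldnames = ["rule_id", "id", "propertyName", "op", "filterValue"]

buffer = io.StringIO()  # stands in for the CSV file on disk
writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="|")
writer.writeheader()
for filter_item in rule["filterList"]:
    # Repeat the parent rule id on every row so each row stays traceable
    row = dict(filter_item, rule_id=rule["id"])
    writer.writerow(row)
print(buffer.getvalue())
```

Which layout is better depends on the consumer: the wide row keeps one rule per line, while the long layout copes with any number of filters without changing the header.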

2. Parse an XML file and write it to CSV in a standard format

# XML format
'''
<?xml version="1.0" encoding="UTF-8"?>
<DataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="file:///C:/Users/Administrator/Desktop/schema.xsd">
  <FileHeader>
    <TimeStamp>2022-07-20T12:16:49</TimeStamp>
    <TimeZone>UTC+8</TimeZone>
    <VendorName>FH</VendorName>
    <ElementType>PON</ElementType>
    <CmVersion>V1.0.0</CmVersion>
  </FileHeader>
  <Objects>
    <ObjectType>VLN</ObjectType>
    <FieldName>
      <N i="1">nermUID</N>
      <N i="2">portrmUID</N>
      <N i="3">vlanId</N>
      <N i="4">vlanMode</N>
      <N i="5">mvlanFlag</N>
      <N i="6">mvlanPri</N>
      <N i="7">service</N>
    </FieldName>
    <FieldValue>
      <Object rmUID="5101FHCS1VLN004A10021L0903">
        <V i="1">5101FHCS1OLT004A1</V>
        <V i="2">5101FHCS1PRT004A1009L03</V>
        <V i="3">21</V>
        <V i="4">SINGLE</V>
        <V i="5">0</V>
        <V i="6">--</V>
        <V i="7">HSI</V>
      </Object>
      <Object rmUID="5101FHCS1VLN004A10021A0000">
        <V i="1">5101FHCS1OLT004A1</V>
        <V i="2">--</V>
        <V i="3">21</V>
        <V i="4">SINGLE</V>
        <V i="5">0</V>
        <V i="6">--</V>
        <V i="7">HSI</V>
      </Object>
      <Object rmUID="5101FHCS1VLN004A10022L0903">
        <V i="1">5101FHCS1OLT004A1</V>
        <V i="2">5101FHCS1PRT004A1009L03</V>
        <V i="3">22</V>
        <V i="4">SINGLE</V>
        <V i="5">0</V>
        <V i="6">--</V>
        <V i="7">HSI</V>
      </Object>
    </FieldValue>
  </Objects>
</DataFile>
'''
# -*- coding: utf-8 -*-
"""
@Author  : sunbo
@Time    : 2022/7/26 0024 9:19 AM
@Comment :
"""
import csv
from xml.dom.minidom import parse
import os
import time
date_ = time.strftime("%Y%m%d", time.localtime())
# Output directory named after the current date, e.g. D:\20220726\CM
dst_file_path = "D:\\" + date_ + "\\CM"
if not os.path.exists(dst_file_path):
    os.makedirs(dst_file_path)
else:
    print(dst_file_path)
csv_path = dst_file_path + "\\" + "new" + date_ + ".csv"
def readXML():
    domTree = parse("CM-PON-VLN-A1-V1.0.0-20220720120000-002.xml")
    # Document root element
    rootNode = domTree.documentElement
    print(rootNode.nodeName)
    table_heads = rootNode.getElementsByTagName("FieldName")
    print(len(table_heads))
    print("****All header information****:", table_heads)
    for table_head in table_heads:
        # The text of each <N> element is one column name
        headers = [n.childNodes[0].data for n in table_head.getElementsByTagName("N")]
        with open(csv_path, 'w', newline="") as f:
            writer = csv.writer(f, delimiter='|')
            writer.writerow(headers)
    objects = rootNode.getElementsByTagName("Object")
    print("****All records****")
    for obj in objects:
        if obj.hasAttribute("rmUID"):
            # The text of each <V> element is one field value; the <V> elements
            # follow the same order as the <N> names, so values line up with the header
            data_value = [v.childNodes[0].data for v in obj.getElementsByTagName("V")]
            with open(csv_path, "a+", newline="") as f:
                writer = csv.writer(f, delimiter='|')
                writer.writerow(data_value)
if __name__ == '__main__':
    readXML()
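
The same conversion can be sketched with xml.etree.ElementTree and tried without any file on disk. This version pairs each <V> value with its <N> column name through the shared i attribute rather than through position. The sample document is a shortened, made-up version of the layout above, and xml_to_rows is a hypothetical helper name.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened sample with the same element layout as the file above
sample_xml = """<DataFile>
  <Objects>
    <FieldName>
      <N i="1">nermUID</N>
      <N i="2">vlanId</N>
      <N i="3">service</N>
    </FieldName>
    <FieldValue>
      <Object rmUID="OBJ-1">
        <V i="1">OLT-A</V>
        <V i="2">21</V>
        <V i="3">HSI</V>
      </Object>
      <Object rmUID="OBJ-2">
        <V i="1">OLT-A</V>
        <V i="2">22</V>
        <V i="3">HSI</V>
      </Object>
    </FieldValue>
  </Objects>
</DataFile>"""


def xml_to_rows(xml_text):
    """Return (headers, rows), pairing <N> names and <V> values by their i attribute."""
    root = ET.fromstring(xml_text)
    names = {n.get("i"): n.text for n in root.iter("N")}
    order = sorted(names, key=int)  # column order follows the numeric index
    headers = [names[i] for i in order]
    rows = []
    for obj in root.iter("Object"):
        values = {v.get("i"): v.text for v in obj.iter("V")}
        # A missing <V> becomes an empty cell instead of shifting the columns
        rows.append([values.get(i, "") for i in order])
    return headers, rows


headers, rows = xml_to_rows(sample_xml)
buffer = io.StringIO()  # stands in for the dated CSV file
writer = csv.writer(buffer, delimiter="|")
writer.writerow(headers)
writer.writerows(rows)
print(buffer.getvalue())
```

Matching on the i attribute keeps the columns aligned even if an object omits a field or lists its values in a different order.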
