Data cleaning basics for data mining: converting the Shenzhen second-hand housing reference price PDF to Excel with Python


What a rip-off: once again the Housing and Urban-Rural Development Bureau has restricted buyers with genuine housing needs rather than the wealthy. The reference transaction prices for second-hand housing in Shenzhen residential communities have been announced, so buying a home is harder and scraping together a down payment is harder still ...
Data cleaning basics for data mining: use Python to save the Shenzhen second-hand housing reference price PDF as an Excel file, so that other analysis tools, such as Tableau, can use that Excel file for statistical analysis and charting.

Table of contents

  • Preface
  • Part One: Define the objective
  • Part Two: Steps
    • 1. Preprocessing
    • 2. Read in the data
  • Summary


Preface

Data cleaning is a fundamental part of machine learning. Working with a real-life scenario makes learning it more fun.
The Shenzhen Housing and Urban-Rural Development Bureau has once again restricted buyers with genuine housing needs rather than the wealthy: it announced reference transaction prices for second-hand housing in Shenzhen residential communities, and published them on its official website as a PDF.
Many analysis and statistics tools cannot read PDF files, but most of them can read Excel. So this time we will convert the PDF to Excel for later analysis.


Part One: Define the objective

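Convert the PDF to Excel.
The PDF is a table with five columns: serial number, administrative region, street, project name, and transaction reference price (yuan per square meter).

The target Excel file has the same five columns, with one row per residential project.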

pandas is a library built on NumPy, designed for data analysis tasks.

Part Two: Steps

1. Preprocessing

Python's built-in file functions cannot read a PDF directly, but they can read a plain-text file. So first open the PDF, press Ctrl+A to select all, then Ctrl+C to copy. Create a new TXT file and press Ctrl+V to paste. The content of the PDF is now in the TXT file, but at this point the data has no fixed format.

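After deleting the header, the remaining data is fairly regular: each record spans five consecutive lines (serial number, administrative region, street, project name, unit price), interrupted only by page-number lines such as - 17 -. Python can read and process this. Save the edited TXT file as: Shenzhen reference price python Handle.txt. Download address and extraction code: 1234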
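Before running the conversion, it can help to check that the pasted text really has the expected shape. The following is a minimal sketch, not part of the original script: the helper names `strip_page_markers` and `leftover_lines` are hypothetical, the page-marker pattern is assumed from the "- 17 -" example, and skipping blank lines is an extra assumption about how the paste might look.

```python
import re

# Assumed page-marker format, based on the "- 17 -" example in the article.
PAGE_MARKER = re.compile(r"^-\s*\d+\s*-$")


def strip_page_markers(lines):
    """Return the stripped lines, dropping blank lines and page markers."""
    cleaned = []
    for raw in lines:
        text = raw.strip()
        if not text or PAGE_MARKER.match(text):
            continue  # blank line or page-number line: not data
        cleaned.append(text)
    return cleaned


def leftover_lines(lines, fields_per_record=5):
    """Return how many data lines do not fit into complete records."""
    return len(strip_page_markers(lines)) % fields_per_record


# Placeholder values for illustration only, not real data from the PDF.
sample = ["1\n", "District A\n", "Street A\n", "Project A\n", "100\n", "- 17 -\n"]
print(leftover_lines(sample))  # 0 means every record is complete
```

A non-zero result suggests that a field was split across two lines or that part of the header is still in the file, which would shift every later record into the wrong column.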

2. Read in the data

The code is as follows:

import pandas as pd

# First copy the Shenzhen second-hand housing price PDF into a TXT file and remove the header.
# Read the TXT file line by line.
f = open("./Shenzhen reference price python Handle.txt")
line = f.readline()
# Lists for: serial number, administrative region, street, project name, unit price
xuhao, quyu, jiedao, xiangmumingchen, danjia = [], [], [], [], []
i = 0  # number of valid data lines read so far
while line:
    i = i + 1
    print(i, line)
    if line.startswith('- '):  # skip page-number lines, e.g. page 17 appears as: - 17 -
        i = i - 1
        line = f.readline()
        continue
    line = line.replace('\n', '')  # strip the trailing newline
    # Each record spans five consecutive lines, so i % 5 selects the field.
    if i % 5 == 1:
        xuhao.append(line)
    elif i % 5 == 2:
        quyu.append(line)
    elif i % 5 == 3:
        jiedao.append(line)
    elif i % 5 == 4:
        xiangmumingchen.append(line)
    else:  # i % 5 == 0: fifth and last line of the record
        danjia.append(line)
    line = f.readline()
f.close()
mydict = {
    'Serial number': xuhao,
    'Administrative region': quyu,
    'Street': jiedao,
    'Project name': xiangmumingchen,
    'Transaction reference price (yuan/square meter)': danjia,
}
df = pd.DataFrame(mydict)  # convert to a DataFrame so it can be written to Excel
print(df)
# Writing .xlsx requires an Excel engine such as openpyxl to be installed.
df.to_excel('./Shenzhen residential district second-hand housing transaction reference price list.xlsx')

Run the script and you get the Excel file. Download address and extraction code: 1234.
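The script relies on every record occupying exactly five consecutive lines. The same grouping can be sketched as a small reusable function that fails loudly when the line count is off, instead of silently shifting columns. This is an illustrative alternative, not the author's code: the function name `group_records`, the field names, and the sample values are all hypothetical.

```python
# Hypothetical field names mirroring the five columns used in the article.
FIELDS = ("serial_number", "district", "street", "project_name", "unit_price")


def group_records(lines, fields=FIELDS):
    """Group a flat list of cleaned lines into one dict per record."""
    n = len(fields)
    if len(lines) % n != 0:
        raise ValueError(f"expected a multiple of {n} lines, got {len(lines)}")
    # Slice the list into consecutive chunks of n lines and label each value.
    return [dict(zip(fields, lines[i:i + n])) for i in range(0, len(lines), n)]


# Placeholder values for illustration only, not real data from the PDF.
lines = [
    "1", "District A", "Street A", "Project A", "100",
    "2", "District B", "Street B", "Project B", "200",
]
records = group_records(lines)
print(len(records))  # 2
print(records[1]["project_name"])  # Project B
```

A list of dicts like `records` can be passed directly to `pd.DataFrame`, so this function could stand in for the `i % 5` bookkeeping while leaving the Excel export unchanged.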


Summary

Data cleaning is the foundation of machine learning. This article only briefly shows how pandas can be used when cleaning data; pandas also provides a large number of functions and methods for processing data quickly and conveniently.

