Extracting PDF file data with Python

Editor: Python
  • First, install these two libraries
pip install pdfplumber
pip install openpyxl
  • 1. Set the file path
# Raw string, so the backslashes in the Windows path are not treated as escapes
path = r"C:\Users\lenovo\Desktop\Thesis and interview\Customer focus.pdf"
  • 2. Open the PDF file
import pdfplumber

pdf_mt = pdfplumber.open(path)
pdf_mt  # in a notebook, this displays the opened PDF object
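The call in step 2 leaves the file open until the program ends. A minimal sketch of an alternative, assuming you only need the page count: check that the path exists, then open the PDF in a `with` block so it is closed automatically. The helper name `page_count` is hypothetical and not part of pdfplumber.

```python
from pathlib import Path


def page_count(pdf_path):
    """Return the number of pages in a PDF, closing the file afterwards."""
    pdf_file = Path(pdf_path)
    # Fail early with a clear message if the path is wrong
    if not pdf_file.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_file}")

    import pdfplumber  # third-party; imported only once the path is known to exist

    # The with block closes the PDF even if an error occurs inside it
    with pdfplumber.open(pdf_file) as pdf:
        return len(pdf.pages)
```

With the `path` variable from step 1, this would be called as `page_count(path)`.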
  • 3. Get the pages that hold the data (and the total page count)
# pdf_mt.pages is a list of page objects: [page 1, page 2, ..., page n]
all_pages = pdf_mt.pages
all_pages  # len(all_pages) gives the total number of pages
  • 4. Get the text of each PDF page (here, the first 40 pages)
for pdf_pg in all_pages[0:40]:
    print(pdf_pg.extract_text())
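Pages without extractable text (scanned images, for example) may yield `None` or an empty string from `extract_text()`, depending on the pdfplumber version. The sketch below keeps the page number next to each text and normalizes the empty case; `collect_text` is a hypothetical helper, and it only assumes each page object has an `extract_text()` method.

```python
def collect_text(pages, start=0, stop=40):
    """Return (page_number, text) pairs for pages[start:stop]."""
    results = []
    for offset, page in enumerate(pages[start:stop]):
        # None and "" both become "", so callers always get a string
        text = page.extract_text() or ""
        # Page numbers are reported 1-based, matching what a PDF viewer shows
        results.append((start + offset + 1, text))
    return results
```

With the list from step 3, `collect_text(all_pages)` would cover the same first 40 pages as the loop above.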
  • 5. Get the table contents
for pdf_pg in all_pages[0:40]:
    print(pdf_pg.extract_tables())
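Each extracted table is a list of rows, and each row is a list of cells. Empty cells typically come back as `None`, and cells that wrap across lines contain newline characters. A small sketch of a cleanup step that could run before the data is written anywhere; `clean_table` is a hypothetical helper, not a pdfplumber function.

```python
def clean_table(table):
    """Return a copy of a table with None cells emptied and whitespace collapsed."""
    cleaned = []
    for row in table:
        cleaned.append([
            # str.split() with no arguments also splits on newlines and tabs
            "" if cell is None else " ".join(str(cell).split())
            for cell in row
        ])
    return cleaned


sample = [["Name", "Unit\nprice"], ["Widget", None]]
print(clean_table(sample))  # [['Name', 'Unit price'], ['Widget', '']]
```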
  • 6. Save the data to Excel
from openpyxl import Workbook

# Create a workbook object
wb = Workbook()
# Get the active worksheet
ws = wb.active
# Pages to export; assumed here to be the same first 40 pages used above
need_pages = all_pages[0:40]
for pdf_pg in need_pages:
    # print(pdf_pg)
    # Get the text content of each page
    # print(pdf_pg.extract_text())
    # Get the tables on the page: a list of tables, each a two-dimensional list [[...], [...]]
    # print(pdf_pg.extract_tables())
    for pdf_tb in pdf_pg.extract_tables():
        # print(pdf_tb)
        # Write the table into the worksheet row by row
        for row in pdf_tb:
            ws.append(row)
# Save once, after all pages have been processed
wb.save("demo3.xlsx")
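Once rows from many pages land in one sheet, it is hard to tell which page a row came from. One possible variation, sketched below, prefixes every row with its page number before saving. Both helper names are hypothetical; the sketch assumes page objects expose `page_number` and `extract_tables()`, as pdfplumber pages do.

```python
def rows_with_page_numbers(pages):
    """Yield each table row prefixed with the number of the page it came from."""
    for page in pages:
        for table in page.extract_tables():
            for row in table:
                yield [page.page_number, *row]


def save_rows(rows, filename):
    """Append the given rows to a new workbook and save it."""
    from openpyxl import Workbook  # third-party; imported only when saving

    wb = Workbook()
    ws = wb.active
    for row in rows:
        ws.append(row)
    wb.save(filename)
```

With the page list from step 3, a call such as `save_rows(rows_with_page_numbers(all_pages[0:40]), "demo3_with_pages.xlsx")` would write the output; the file name is only an example.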
