
Python learning notes (27): basic operations for extracting text and table content with the pdfplumber library

Editor: Python

pdfplumber installation page: PyPI (search for pdfplumber)

Once pip is available, the library can be installed with it: pip install pdfplumber

1. Extracting text: extract_text() parses the text of a page

Code example:

import pdfplumber  # import the pdfplumber library
# print(pdfplumber.__version__)  # uncomment to confirm that pdfplumber is installed
pdf = pdfplumber.open('F:\\XX Notice .PDF')  # open the PDF file; backslashes in the path are written as \\
pages = pdf.pages  # the pages attribute holds every page; pages is a list
text_all = []  # create an empty list
for page in pages:  # iterate over all pages
    text = page.extract_text() or ''  # text of the current page ('' if nothing could be extracted)
    text_all.append(text)  # add the page text to the text_all list
text_all = ''.join(text_all)  # join the list into a single string
print(text_all)  # print all of the text
pdf.close()  # close the PDF file

Running result: the text of every page is printed as one string.
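
A slightly more defensive variant of the loop above can be sketched as follows. It assumes that extract_text() may return None or an empty string for pages without extractable text (for example, scanned pages), and it uses pdfplumber.open() as a context manager so the file is closed even if an error occurs. The helper names join_page_texts and extract_all_text are hypothetical and are not part of pdfplumber.

```python
def join_page_texts(texts, separator="\n"):
    """Join per-page texts, skipping pages that yielded None or ''."""
    return separator.join(t for t in texts if t)


def extract_all_text(path):
    """Return the text of every page of the PDF at `path` as one string."""
    import pdfplumber  # imported here so join_page_texts works without it

    with pdfplumber.open(path) as pdf:  # the file is closed when the block exits
        return join_page_texts(page.extract_text() for page in pdf.pages)


# Pages without extractable text are skipped when joining:
print(join_page_texts(["page one", None, "", "page four"]))
```

Joining with a newline instead of an empty string keeps the last line of one page from running into the first line of the next.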

 

2. Extracting tables: extract_tables() parses the tables on a page

Code example 1: print the list returned by extract_tables() directly

import pdfplumber
pdf = pdfplumber.open('F:\\05pycharm\\20220227 Study \\ Jiawei Xinneng : Jiawei new energy Co., Ltd. signed by the actual controller of the company 《 Bail out investment agreement 》《 Voting power entrustment agreement 》 And the suggestive announcement of the proposed change of control .PDF')  # open the PDF file
pages = pdf.pages  # the pages attribute holds every page
page = pages[2]  # take the third page (index 2), because the table is on page three
tables = page.extract_tables()  # extract_tables() returns all tables found on the page
table = tables[0]  # take the first table
print(table)
pdf.close()  # close the PDF file

Running result: the table is printed as a raw nested list, which needs further formatting to be readable.

On inspection, the output is one outer list with 10 inner lists nested inside it, one inner list per table row.
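
To make the nested structure concrete, here is a minimal sketch that prints such a list one row per line. The sample_table data is hypothetical and only imitates the shape of a table returned by extract_tables() (an outer list of rows, each row an inner list of cell values); it is not the content of the PDF used above. The sketch also assumes that empty or merged cells may come back as None. The function name format_rows is made up for this example.

```python
# Hypothetical sample in the same shape as one table from extract_tables():
# an outer list of rows, each row an inner list of cell values.
sample_table = [
    ["Item", "Before", "After"],
    ["Shares held", "100", "80"],
    ["Voting rights", None, "20"],
]


def format_rows(table):
    """Return one line of text per table row, cells separated by ' | '."""
    lines = []
    for row in table:
        # Show None cells as empty strings so the join does not fail
        cells = ["" if cell is None else str(cell) for cell in row]
        lines.append(" | ".join(cells))
    return lines


for line in format_rows(sample_table):
    print(line)
```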

 

Code example 2: display the extracted table as a DataFrame

import pdfplumber
import pandas as pd
pdf = pdfplumber.open('F:\\05pycharm\\20220227 Study \\ Jiawei Xinneng : Jiawei new energy Co., Ltd. signed by the actual controller of the company 《 Bail out investment agreement 》《 Voting power entrustment agreement 》 And the suggestive announcement of the proposed change of control .PDF')  # open the PDF file
pages = pdf.pages  # the pages attribute holds every page
page = pages[2]  # take the third page (index 2), because the table is on page three
tables = page.extract_tables()  # extract_tables() returns all tables found on the page
table = tables[0]  # take the first table
pd.set_option('display.max_columns', None)  # show every column; by default pandas truncates wide output
df = pd.DataFrame(table[1:], columns=table[0])  # table[1:] is the second row onward (the data); table[0] is the first row, used as the header
print(df)
pdf.close()  # close the PDF file

Running result: the table is printed as a DataFrame, with the first row of the table used as the column headers.
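
Cells extracted from a PDF often need a little cleanup before they go into a DataFrame. The sketch below assumes that cells can be None and can contain line breaks where the text wrapped inside a cell; both assumptions should be checked against the actual output. The names clean_table and table_to_dataframe and the sample data are hypothetical, not part of pdfplumber or pandas.

```python
def clean_table(table):
    """Replace None with '' and collapse whitespace (including line breaks)
    inside each cell into single spaces."""
    return [
        ["" if cell is None else " ".join(str(cell).split()) for cell in row]
        for row in table
    ]


def table_to_dataframe(table):
    """Build a DataFrame from a cleaned table, using the first row as header."""
    import pandas as pd  # imported here so clean_table works without pandas

    cleaned = clean_table(table)
    return pd.DataFrame(cleaned[1:], columns=cleaned[0])


# Hypothetical sample in the shape of one extracted table
sample = [["Name", "Share\nratio"], ["Party A", None], ["Party B", "20%"]]
print(clean_table(sample))
```

For text where spaces are not wanted between wrapped lines (Chinese text, for instance), joining on an empty string instead of a space may be the better choice.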

