Python Office Automation: A Detailed Guide to Working with PDF Files

Editor: Python

PDF (Portable Document Format) is a portable document format that makes it easy to share documents across operating systems. Because PDF documents follow a standard format, there are many tools for working with them, and Python is no exception. Working with PDFs is also an important skill in Python office automation, and I hope this article helps you!

Python has many excellent libraries for working with PDFs. Let's briefly compare the strengths and weaknesses of each.

Comparison of PDF libraries

PyPDF2 and its relatives, pdfrw, and pikepdf focus on manipulating existing PDFs (splitting, merging, rotating, and so on). The first two are essentially no longer maintained.

pdfplumber and its dependency pdfminer.six focus on extracting PDF content, such as text (position, font, color, and so on) and shapes (rectangles, lines, curves). pdfplumber can also parse tables.

ReportLab focuses on creating PDF page content (text, charts, tables, and so on).

PyMuPDF and borb support reading, writing, and page-level PDF operations, which makes them the most comprehensive. PyMuPDF is especially known for its speed, while borb is a newly developed and well-received library with a lot of potential. However, both are released under GPL-family open source licenses, which are not very friendly to commercial use.

Introduction to PyMuPDF
- Introduction
- Features
- Installation
- About the name `fitz`
Usage
- 1 Import the library and check the version
- 2 Open a document
- 3 Document methods and attributes
- 4 Get metadata
- 5 Get the table of contents
- 6 Pages (`Page`)
- 7 PDF operations

Introduction to PyMuPDF

Today's main character is PyMuPDF, one of the most full-featured Python tools for office automation!

PyMuPDF

GitHub: pymupdf/PyMuPDF: Python bindings for MuPDF's rendering library
Official manual: PyMuPDF Documentation (PyMuPDF 1.18.17 documentation)

Introduction

Before introducing PyMuPDF, let's first get to know MuPDF. As the name suggests, PyMuPDF is the Python interface to MuPDF.

MuPDF

MuPDF is a lightweight PDF, XPS, and e-book viewer. It consists of a software library, command line tools, and viewers for various platforms.

The renderer in MuPDF is tailored for high-quality anti-aliased graphics. It renders text with metrics and spacing accurate to within fractions of a pixel, for the highest fidelity in reproducing the look of a printed page on screen.

The viewer is small and fast, yet complete. It supports many document formats, such as PDF, XPS, OpenXPS, CBZ, EPUB, and FictionBook 2. With the mobile viewers you can annotate PDF documents and fill in forms (this feature is also coming soon to the desktop viewer).

The command line tools let you annotate and edit documents and convert them to other formats such as HTML, SVG, PDF, and CBZ. You can also write scripts in JavaScript to manipulate documents.

PyMuPDF

PyMuPDF (current version 1.18.17) is a Python binding for MuPDF (current version 1.18.*).

With PyMuPDF you can access files with the extensions ".pdf", ".xps", ".oxps", ".cbz", ".fb2", or ".epub". In addition, about 10 popular image formats can be handled like documents: ".png", ".jpg", ".bmp", ".tiff", etc.

Features

For all supported document types you can:

Decrypt the document
Access meta information, links, and bookmarks
Render pages in raster formats (PNG and others) or as vector SVG
Search for text
Extract text and images
Convert to other formats: PDF, (X)HTML, XML, JSON, text

For PDF files there are many additional features: documents can be created, merged, or split, and pages can be inserted, deleted, rearranged, or modified in many ways (including annotations and form fields).

You can extract or insert images and fonts
Embedded files are fully supported
PDFs can be reformatted to support duplex printing, color separation, and applying logos or watermarks
Password protection is fully supported: decryption, encryption, choice of encryption method, permission levels, and user/owner password settings
The PDF optional content concept is supported for images, text, and drawings
Low-level PDF structures can be accessed and modified
The command line module "python -m fitz …" is a versatile utility with the following features:
- encryption / decryption / optimization
- creating sub-documents
- joining documents
- image / font extraction
- full support for embedded files
- layout-preserving text extraction (all document types)

Installation

PyMuPDF can be installed from source or from wheels.

For Windows, Linux, and Mac OSX, wheels are available in the download section of PyPI. They cover 64-bit Python versions 3.6 to 3.9. Windows also has 32-bit versions. Since recently, there are also wheels for the Linux ARM architecture; look for the platform tag manylinux2014_aarch64.

Beyond the standard library, PyMuPDF has no mandatory external dependencies. A few convenience methods are only available when certain packages are installed:

Pillow: required for Pixmap.pil_save() and Pixmap.pil_tobytes()
fontTools: required for Document.subset_fonts()
pymupdf-fonts: a collection of nice fonts that can be used by the text output methods

Install with pip:

pip install PyMuPDF

Import the library:

import fitz

About the name `fitz`

The standard Python import statement for this library is import fitz. This has historical reasons: MuPDF's original rendering library was called Libart.

After Artifex Software acquired the MuPDF project, the development focus shifted to writing a new, modern graphics library called "Fitz". Fitz started as a research and development project to replace the aging Ghostscript graphics library, but became the rendering engine of MuPDF instead (quoted from Wikipedia).

Usage

1 Import the library and check the version

import fitz
print(fitz.__doc__)

Output:

PyMuPDF 1.18.16: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-08-05 00:00:01.
Built for Python 3.8 on linux (64-bit).

2 Open a document

doc = fitz.open(filename)

This creates the Document object doc. filename must be a Python string naming an existing file.
You can also open a document from in-memory data, or create a new, empty PDF. A document can also be used as a context manager.

3 Document methods and attributes

| Method / attribute | Description |
| --- | --- |
| Document.page_count | Number of pages (int) |
| Document.metadata | Metadata (dict) |
| Document.get_toc() | Table of contents (list) |
| Document.load_page() | Load a page |

Example:

>>> doc.page_count
1
>>> doc.metadata
{'format': 'PDF 1.7',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'Foxit Reader PDF Printer version 10.0.130.3456',
 'creationDate': "D:20210810173328+08'00'",
 'modDate': "D:20210810173328+08'00'",
 'trapped': '',
 'encryption': None}

4 Get metadata

PyMuPDF fully supports standard metadata. Document.metadata is a Python dictionary with the keys listed below.
It is available for all document types, although not every entry always contains data. Metadata fields are strings, or None if not otherwise indicated. Also note that not all fields contain meaningful data, even when they are not None.

| Key | Value |
| --- | --- |
| producer | producer (producing software) |
| format | format: 'PDF-1.4', 'EPUB', etc. |
| encryption | encryption method used, if any |
| author | author |
| modDate | date of last modification |
| keywords | keywords |
| title | title |
| creationDate | date of creation |
| creator | creating application |
| subject | subject |
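The creationDate and modDate values are PDF date strings such as "D:20210810173328+08'00'". A small pure-Python sketch for turning them into datetime objects is shown here. parse_pdf_date is a hypothetical helper, not part of PyMuPDF; it only handles full-precision dates, and treating a missing time zone as UTC is an assumption of this sketch.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helper (not part of PyMuPDF): parse a PDF date string such as
# "D:20210810173328+08'00'" into a timezone-aware datetime.
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+-])(\d{2})'?(\d{2})'?|Z)?"
)

def parse_pdf_date(value):
    """Return a datetime for a full-precision PDF date string, else None."""
    match = _PDF_DATE.match(value or "")
    if match is None:
        return None
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    sign, tz_hours, tz_minutes = groups[6:]
    offset = timedelta(0)              # assumption: no offset given means UTC
    if sign:
        offset = timedelta(hours=int(tz_hours), minutes=int(tz_minutes))
        if sign == "-":
            offset = -offset
    return datetime(year, month, day, hour, minute, second,
                    tzinfo=timezone(offset))

created = parse_pdf_date("D:20210810173328+08'00'")
print(created.isoformat())
```

With a real document, the input would be doc.metadata["creationDate"].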

5 Get the table of contents (outline)

toc = doc.get_toc()

6 Pages (`Page`)

Page handling is at the core of MuPDF's functionality.
• You can render a page as a raster or vector (SVG) image, optionally zooming, rotating, shifting, or shearing it.
• You can extract a page's text and images in many formats, and search for text strings.
• For PDF documents, there are many more methods for adding text or images to a page.

First, you must create a Page. This is a method of Document:

page = doc.load_page(pno) # loads page number 'pno' of the document (0-based)
page = doc[pno] # the short form

Any integer in the range -inf < pno < page_count is valid here. Negative numbers count backwards from the end, so doc[-1] is the last page, just as with Python sequences.

A more advanced way is to use the document as an iterator over its pages:

for page in doc:
    print(page.number)  # do something with 'page'

# ... or read backwards
for page in reversed(doc):
    print(page.number)  # do something with 'page'

# ... or even use 'slicing' (start, stop, step are integers, as for range())
for page in doc.pages(start, stop, step):
    print(page.number)  # do something with 'page'

Next, we look at the most common Page operations.

a. Inspect a page's links, annotations, or form fields

When a document is displayed in viewer software, links appear as "hot spots". If you click while the cursor shows a hand symbol, you are usually taken to the target encoded in that hot spot. Here is how to get all links:

# get all links on a page
links = page.get_links()

links is a Python list of dictionaries.

You can also use an iterator:

for link in page.links():
    print(link)  # do something with 'link'

When dealing with PDF pages, there may also be annotations (Annot) or form fields (Widget), each with its own iterator:

for annot in page.annots():
    print(annot)  # do something with 'annot'

for field in page.widgets():
    print(field)  # do something with 'field'

b. Render a page

This example creates a raster image of a page's content:

pix = page.get_pixmap()

pix is a Pixmap object which (in this case) contains an RGB image of the page, ready to be used for many purposes.

Page.get_pixmap() offers many variations for controlling the image: resolution, color space (for example, to produce a grayscale image or an image with a subtractive color scheme), transparency, rotation, mirroring, shifting, shearing, etc.

For example, to create an RGBA image (i.e. one containing an alpha channel), specify pix = page.get_pixmap(alpha=True).

A Pixmap has many methods and attributes. Among them are the integers width, height (each in pixels), and stride (the number of bytes in one horizontal image line). The attribute samples represents a rectangular area of bytes holding the image data (a Python bytes object).

You can also create a vector image of a page with page.get_svg_image().

c. Save the page image to a file

We can simply store the image in a PNG file:

pix.save("page-%i.png" % page.number)

d. Extract text and images

We can also extract all text, images, and other information from a page in many different forms and levels of detail:

text = page.get_text(opt)

Use one of the following strings for opt to obtain different formats:

"text": (default) plain text with line breaks. No formatting, no text position details, no images.
"blocks": generates a list of text blocks (paragraphs).
"words": generates a list of words (strings not containing spaces).
"html": creates a full visual version of the page, including any images. This can be displayed in an internet browser.
"dict" / "json": same information level as HTML, but provided as a Python dictionary or JSON string, respectively.
"rawdict" / "rawjson": a superset of "dict" / "json". It additionally provides character-level detail, like XML.
"xhtml": same text information level as the "text" version, but includes images.
"xml": contains no images, but full position and font information down to each single text character. Use an XML module to interpret it.

e. Search for text

You can find out exactly where on a page a given text string appears:

areas = page.search_for("mupdf")

This returns a list of rectangles, each surrounding one occurrence of the string "mupdf" (case-insensitive). You can use this information to highlight those areas (PDF only) or to build a cross-reference for the document.

7 PDF operations

PDF is the only document type that can be modified with PyMuPDF. Other file types are read-only.

However, you can convert any document (including images) to PDF with Document.convert_to_pdf(), and then apply all PyMuPDF features to the conversion result.

Document.save() always stores a PDF in its current (possibly modified) state on disk.

You can usually choose whether to save to a new file or just append your changes to the existing one ("incremental save"), which is often much faster.

The following describes ways to manipulate PDF documents.

a. Modify, create, rearrange, and delete pages

There are several ways to manipulate the so-called page tree (a structure describing all the pages) of a PDF:

Document.delete_page() and Document.delete_pages() delete pages.
Document.copy_page(), Document.fullcopy_page(), and Document.move_page() copy or move a page to another location within the same document.
Document.select() shrinks a PDF down to the selected pages. The parameter is a sequence of the page numbers to keep. These integers must all be in the range 0 <= i < page_count. When executed, all pages missing from this list are deleted. The remaining pages appear in the order, and as many times (!), as you specify them.

So you can easily create new PDFs with:

the first or last 10 pages
only the odd or only the even pages (for duplex printing)
pages that do or do not contain a given text
the page order reversed

The saved new document will contain the links, annotations, and bookmarks that are still valid (that is, those pointing to a selected page or to some external resource).

Document.insert_page() and Document.new_page() insert new pages.
In addition, pages themselves can be modified in a range of ways (for example page rotation, annotation and link maintenance, text and image insertion).

b. Join and split PDF documents

The method Document.insert_pdf() copies pages between different PDF documents. Here is a simple joiner example (doc1 and doc2 are opened PDFs):

# append complete doc2 to the end of doc1
doc1.insert_pdf(doc2)

Here is a snippet that splits doc1. It creates a new document from the first and the last 10 pages:

doc2 = fitz.open() # new empty PDF
doc2.insert_pdf(doc1, to_page = 9) # first 10 pages
doc2.insert_pdf(doc1, from_page = len(doc1) - 10) # last 10 pages
doc2.save("first-and-last-10.pdf")

c. Save

Document.save() always saves the document in its current state.

By specifying the option incremental=True, you can write changes back to the original PDF. This process is (usually) very fast, because the changes are appended to the original file without rewriting it completely.

d. Close

It is often desirable to "close" a document while your program continues running, to hand control of the underlying file back to the operating system.

This can be done with the Document.close() method. Apart from closing the underlying file, the buffers associated with the document are freed as well.