
Python office automation: a detailed guide to working with PDF files

Editor: Python
PDF (Portable Document Format) is a portable document format that makes it easy to share documents across operating systems. Because PDF documents follow a standard format, many tools exist for working with them, and Python is no exception. Handling PDFs is also an important skill in Python office automation. I hope this article helps!

Python has many excellent libraries for working with PDFs. Let's briefly compare the strengths and weaknesses of each.

Comparison of PDF libraries

The PyPDF2 family, pdfrw and pikepdf focus on operations on existing PDFs (splitting, merging, rotating and so on). At the time of writing, the first two are essentially no longer maintained.

pdfplumber and its dependency pdfminer.six focus on extracting PDF content, such as text (position, font, color and so on) and shapes (rectangles, lines, curves). pdfplumber can also parse tables.

ReportLab focuses on creating PDF page content (text, charts, tables and so on).

PyMuPDF and borb support reading, writing and page-level operations, which makes them the most comprehensive. PyMuPDF is especially known for its speed, while borb is a newer, well-received library with a lot of potential. However, both are released under GPL-family open-source licenses, which is not very friendly for commercial use.

  • Introduction to PyMuPDF

    • Overview

    • Features

    • Installation

    • About the name `fitz`

  • Usage

    • 1 Import the library and check the version

    • 2 Open the document

    • 3 Document methods and attributes

    • 4 Get the metadata

    • 5 Get the table of contents (outline)

    • 6 Pages (`Page`)

    • 7 PDF operations

Introduction to PyMuPDF

Today's main character is PyMuPDF, one of the most full-featured Python tools for office automation.

PyMuPDF

GitHub: pymupdf/PyMuPDF: Python bindings for MuPDF's rendering library
Official manual: PyMuPDF Documentation, PyMuPDF 1.18.17 documentation

  Overview

Before introducing PyMuPDF, let's first get to know MuPDF. As the name suggests, PyMuPDF is the Python interface to MuPDF.

MuPDF

MuPDF is a lightweight PDF, XPS and e-book viewer. It consists of a software library, command-line tools and viewers for various platforms.

The renderer in MuPDF is tailored for high-quality anti-aliased graphics. It renders text with metrics and spacing accurate to within fractions of a pixel, for the highest fidelity in reproducing the look of a printed page on screen.

The viewer is small and fast, yet complete. It supports many document formats, such as PDF, XPS, OpenXPS, CBZ, EPUB and FictionBook 2. With the mobile viewers you can annotate PDF documents and fill in forms (this feature is also coming soon to the desktop viewer).

The command-line tools let you annotate and edit documents and convert them to other formats such as HTML, SVG, PDF and CBZ. You can also write scripts in JavaScript to manipulate documents.

PyMuPDF

PyMuPDF (current version 1.18.17) is the Python binding for MuPDF (current version 1.18.*).

With PyMuPDF you can access files with the extensions ".pdf", ".xps", ".oxps", ".cbz", ".fb2" or ".epub". In addition, about 10 popular image formats can be handled like documents: ".png", ".jpg", ".bmp", ".tiff" and so on.

Features

For all supported document types you can:

  • Decrypt the document

  • Access meta information, links and bookmarks

  • Render pages in raster formats (PNG and some others) or as vector SVG

  • Search for text

  • Extract text and images

  • Convert to other formats: PDF, (X)HTML, XML, JSON, text

For PDF documents there are many additional features: they can be created, joined or split. Pages can be inserted, deleted, rearranged or modified in many ways (including annotations and form fields).

  • Images and fonts can be extracted or inserted

  • Embedded files are fully supported

  • PDFs can be reformatted to support duplex printing, color separation, or applying logos or watermarks

  • Password protection is fully supported: decryption, encryption, choice of encryption method, permission levels, and user / owner password settings

  • The PDF optional content concept is supported for images, text and drawings

  • Low-level PDF structures can be accessed and modified

  • The command-line module "python -m fitz…" is a versatile utility with the following features

    • Encryption / decryption / optimization

    • Creating sub-documents

    • Joining documents

    • Image / font extraction

    • Full support for embedded files

    • Layout-preserving text extraction (all document types)

  Installation

PyMuPDF can be installed from source or from wheels.

For Windows, Linux and Mac OSX, wheels are available in the download section of PyPI. They cover the 64-bit versions of Python 3.6 to 3.9. Windows also has 32-bit versions. Recently, wheels for the Linux ARM architecture have been added as well; look for the platform tag manylinux2014_aarch64.

Beyond the standard library, PyMuPDF has no mandatory external dependencies. A few convenience methods are only available when certain packages are installed:

  • Pillow: required for Pixmap.pil_save() and Pixmap.pil_tobytes()

  • fontTools: required for Document.subset_fonts()

  • pymupdf-fonts: a collection of nice fonts that can be used by the text output methods

Install with pip:

pip install PyMuPDF

Import the library:

import fitz

  About the name fitz

The standard Python import statement for this library is import fitz. This has historical reasons:
the original rendering library for MuPDF was called Libart.

After Artifex Software acquired the MuPDF project, the development focus shifted to writing a new, modern graphics library called "Fitz". Fitz started out as an R&D project to replace the aging Ghostscript graphics library, but instead became the rendering engine behind MuPDF (quoted from Wikipedia).

Usage

 1 Import the library and check the version

import fitz
print(fitz.__doc__)
PyMuPDF 1.18.16: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-08-05 00:00:01.
Built for Python 3.8 on linux (64-bit).

 2 Open the document

doc = fitz.open(filename)

This creates the Document object doc. filename must be a Python string naming an existing file.
You can also open a document from data in memory, or create a new, empty PDF. Documents can also be used as context managers.
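
The three ways of opening a document mentioned above can be sketched in one small helper. This is only an illustration under a few assumptions: `open_document` and `count_pages` are hypothetical names, not PyMuPDF functions, and `fitz` is imported inside the function so the snippet can be loaded before PyMuPDF is installed. It relies on `fitz.open()` without arguments creating an empty PDF and on the `stream` / `filetype` keywords for in-memory data.

```python
def open_document(source=None, filetype="pdf"):
    """Open a document from a path, from bytes in memory, or create an empty PDF.

    `open_document` is a hypothetical helper name, not part of PyMuPDF.
    """
    import fitz  # PyMuPDF, installed with `pip install PyMuPDF`

    if source is None:
        return fitz.open()                      # new, empty PDF
    if isinstance(source, (bytes, bytearray)):
        # in-memory data has no file extension, so pass the type explicitly
        return fitz.open(stream=source, filetype=filetype)
    return fitz.open(source)                    # path of an existing file


def count_pages(path):
    """Return the page count; the document is closed when the block exits."""
    with open_document(path) as doc:            # Document used as a context manager
        return doc.page_count
```

For example, `count_pages("example.pdf")` would return the number of pages of a file named example.pdf (a placeholder name) and release the file afterwards.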

 3 Document methods and attributes

Method / attribute      Description
Document.page_count     number of pages (int)
Document.metadata       metadata (dict)
Document.get_toc()      get the table of contents (list)
Document.load_page()    load a page

Example :

>>> doc.page_count
1
>>> doc.metadata
{'format': 'PDF 1.7',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'Foxit Reader PDF Printer Version 10.0.130.3456',
 'creationDate': "D:20210810173328+08'00'",
 'modDate': "D:20210810173328+08'00'",
 'trapped': '',
 'encryption': None}

 4 Get the metadata

PyMuPDF fully supports standard metadata. Document.metadata is a Python dictionary with the keys listed below.
It is available for all document types, although not every entry always contains data. The metadata fields are strings, or None if not otherwise indicated. Also note that not every field necessarily contains meaningful data, even when it is not None.

Key            Value
producer       producer (producing software)
format         format: 'PDF-1.4', 'EPUB', etc.
encryption     encryption method used, if any
author         author
modDate        date of last modification
keywords       keywords
title          title
creationDate   date of creation
creator        creating application
subject        subject
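
The creationDate and modDate values are PDF date strings, such as D:20210810173328+08'00' in the example output above. A minimal sketch of converting them to Python datetime objects follows. The helper name `parse_pdf_date` is hypothetical and it uses only the standard library; it assumes the common D:YYYYMMDDHHmmSS layout with an optional UTC offset and returns None for anything else.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional UTC offset such as +08'00' or Z
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime, or return None if it does not fit."""
    match = _PDF_DATE.match(value or "")
    if match is None:
        return None
    defaults = (1, 1, 1, 0, 0, 0)  # used for components that are left out
    parts = [int(g) if g else d for g, d in zip(match.groups()[:6], defaults)]
    sign, tz_hours, tz_minutes = match.group(7, 8, 9)
    try:
        if sign is None:
            tz = None                               # no offset given: naive datetime
        elif sign in "Zz":
            tz = timezone.utc
        else:
            offset = timedelta(hours=int(tz_hours or 0), minutes=int(tz_minutes or 0))
            tz = timezone(-offset if sign == "-" else offset)
        return datetime(*parts, tzinfo=tz)
    except ValueError:                              # e.g. month 13 or a 99-hour offset
        return None


print(parse_pdf_date("D:20210810173328+08'00'"))
```

With an open document this could be called as `parse_pdf_date(doc.metadata["creationDate"])`.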

 5 Get the table of contents (outline)

toc = doc.get_toc()
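
The returned list has one entry per outline item, each of the form [level, title, page], with the level and the page number both starting at 1. As a small sketch of how that structure can be used, the hypothetical helper below turns it into indented text lines. The sample list is made up for illustration; with a real document you would pass `doc.get_toc()`.

```python
def format_toc(toc):
    """Turn the list returned by Document.get_toc() into indented text lines."""
    lines = []
    for level, title, page, *_ in toc:      # extra items (if any) are ignored
        indent = "  " * (level - 1)         # level 1 is the top of the hierarchy
        lines.append(f"{indent}{title} (page {page})")
    return lines


# Made-up outline in the [level, title, page] shape, for illustration only
sample_toc = [[1, "Introduction", 1], [2, "Installation", 2], [1, "Usage", 3]]
print("\n".join(format_toc(sample_toc)))
```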

 6 Pages (Page)

Page handling is at the core of MuPDF's functionality.
• You can render a page to a raster or vector (SVG) image, optionally zooming, rotating, shifting or shearing it.
• You can extract a page's text and images in many formats, and search for text strings.
• For PDF documents, there are many more methods for adding text or images to a page.

First, a Page must be created. This is done with a Document method:

page = doc.load_page(pno) # loads page number 'pno' of the document (0-based)
page = doc[pno] # the short form

Any integer -inf < pno < page_count can be used here. Negative numbers count backwards from the end, so doc[-1] is the last page, just as with Python sequences.
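
To make the numbering rule concrete, here is a small pure-Python sketch of the mapping from pno to a 0-based index. `normalize_pno` is a hypothetical helper, not a PyMuPDF function, and it assumes that values below -page_count simply keep wrapping around, as the stated range suggests.

```python
def normalize_pno(pno, page_count):
    """Map a possibly negative page number to its 0-based index.

    Follows the rule stated above: any integer below page_count is accepted,
    and negative values count backwards from the end.
    """
    if page_count <= 0:
        raise ValueError("document has no pages")
    if pno >= page_count:
        raise ValueError(f"page {pno} not in document with {page_count} pages")
    return pno % page_count     # Python's % already wraps negative numbers


print(normalize_pno(-1, 10))    # → 9, the last page of a 10-page document
```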

A more advanced way is to use the document as an iterator over its pages:

for page in doc:
    pass  # do something with 'page'

# ... or read backwards
for page in reversed(doc):
    pass  # do something with 'page'

# ... or even use 'slicing'
for page in doc.pages(start, stop, step):
    pass  # do something with 'page'

The rest of this section covers common Page operations.

a. Inspect a page's links, annotations or form fields

When a document is displayed in viewer software, links appear as "hot spots". If you click while the cursor shows a hand symbol, you are usually taken to the target encoded in that hot spot. Here is how to get all the links:

# get all links on a page
links = page.get_links()

links is a list of Python dictionaries.

Links can also be accessed through an iterator:

for link in page.links():
    # do something with 'link'

When working with PDF pages, there may also be annotations (Annot) or form fields (Widget), each of which has its own iterator:

for annot in page.annots():
    pass  # do something with 'annot'

for field in page.widgets():
    pass  # do something with 'field'
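
As a brief sketch of working with these results, the two hypothetical helpers below collect the external URLs on a page and count its links, annotations and form fields. They take an already loaded Page object and assume that link dictionaries pointing to a web address carry a "uri" key.

```python
def external_uris(page):
    """Return the target URLs of all links on a page that point to a URI."""
    # each entry of page.get_links() is a dict; URI links carry a "uri" key
    return [link["uri"] for link in page.get_links() if link.get("uri")]


def count_items(page):
    """Count links, annotations and form fields on a PDF page."""
    return {
        "links": len(page.get_links()),
        "annotations": sum(1 for _ in page.annots()),
        "widgets": sum(1 for _ in page.widgets()),
    }
```

With the document from above, `external_uris(doc[0])` would list the web links on the first page.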

b. Render a page

This example creates a raster image of a page's content:

pix = page.get_pixmap()

pix is a Pixmap object which (in this case) contains an RGB image of the page, ready to be used for many purposes.

The method Page.get_pixmap() offers many options for controlling the image: resolution, color space (for example, to produce a grayscale image or an image with a subtractive color scheme), transparency, rotation, mirroring, shifting, shearing and so on.

For example, to create an RGBA image (i.e. one containing an alpha channel), specify pix = page.get_pixmap(alpha=True).

A Pixmap has many methods and attributes. Among them are the integers width and height (each in pixels) and stride (the number of bytes in one horizontal image line). The samples attribute is a rectangular area of bytes representing the image data (a Python bytes object).

You can also create a vector image of a page with page.get_svg_image().
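
Resolution is controlled through the matrix argument of Page.get_pixmap(). The sketch below shows one way to do this, assuming the default rendering corresponds to 72 dots per inch so that a zoom factor of dpi / 72 gives the requested resolution. `render_at_dpi` is a hypothetical helper name, and `fitz` is imported inside the function only so the snippet loads without PyMuPDF installed.

```python
def render_at_dpi(page, dpi=150, grayscale=False):
    """Render a page to a Pixmap at roughly the requested resolution."""
    import fitz  # PyMuPDF

    zoom = dpi / 72                       # default rendering is 72 dots per inch
    return page.get_pixmap(
        matrix=fitz.Matrix(zoom, zoom),   # scale x and y by the same factor
        colorspace=fitz.csGRAY if grayscale else fitz.csRGB,
        alpha=False,                      # no transparency channel
    )
```

For example, `render_at_dpi(doc[0], dpi=144)` renders the first page at twice the default size in each direction.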

c. Save the page image to a file

We can simply store the image in a PNG file:

pix.save("page-%i.png" % page.number)

d. Extract text and images

We can also extract all of a page's text, images and other information in many different forms and levels of detail:

text = page.get_text(opt)

Use one of the following strings for opt to get different formats:

  • "text": (default) plain text with line breaks. No formatting, no text position details, no images

  • "blocks": generates a list of text blocks (paragraphs)

  • "words": generates a list of words (strings not containing spaces)

  • "html": creates a full visual version of the page, including any images. This can be displayed by an internet browser

  • "dict"/"json": same information level as HTML, but provided as a Python dictionary or a JSON string, respectively

  • "rawdict"/"rawjson": a superset of "dict"/"json". It additionally provides character-level details, as XML does

  • "xhtml": text information level as the "text" version, but includes images

  • "xml": contains no images, but full position and font information down to each single text character. Use an XML module to interpret it
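
As an illustration of the "words" format, the sketch below joins word entries back into text lines. It assumes each entry is a tuple of the form (x0, y0, x1, y1, word, block_no, line_no, word_no). `words_to_lines` is a hypothetical helper and the sample data is made up; with a real page you would pass `page.get_text("words")`.

```python
from itertools import groupby


def words_to_lines(words):
    """Join the tuples from page.get_text("words") back into text lines.

    Each tuple is assumed to be
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    # order by block, then line, then position of the word within the line
    ordered = sorted(words, key=lambda w: (w[5], w[6], w[7]))
    lines = []
    for _, group in groupby(ordered, key=lambda w: (w[5], w[6])):
        lines.append(" ".join(w[4] for w in group))
    return lines


# Made-up sample in the "words" layout, deliberately out of order
sample_words = [
    (120.0, 72.0, 160.0, 84.0, "world", 0, 0, 1),
    (72.0, 72.0, 110.0, 84.0, "Hello", 0, 0, 0),
    (72.0, 90.0, 130.0, 102.0, "PyMuPDF", 0, 1, 0),
]
print(words_to_lines(sample_words))   # → ['Hello world', 'PyMuPDF']
```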

e. Search text

You can find the exact location of a text string on the page :

areas = page.search_for("mupdf")

This returns a list of rectangles, each of which surrounds one occurrence of the string "mupdf" (case insensitive). You can use this information to highlight those areas (PDF only) or to create a cross reference for the document.
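
Highlighting the hits can be sketched as follows. `highlight_matches` is a hypothetical helper that takes a loaded PDF page; it uses Page.add_highlight_annot() to place a highlight annotation on each rectangle found.

```python
def highlight_matches(page, needle):
    """Add a highlight annotation over every occurrence of `needle` on a PDF page.

    Returns the number of occurrences found.
    """
    areas = page.search_for(needle)        # list of rectangles, one per hit
    for rect in areas:
        page.add_highlight_annot(rect)     # works on PDF pages only
    return len(areas)
```

The annotations become permanent only once the document is saved, for example with `doc.save("highlighted.pdf")` (a placeholder file name).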

 7 PDF operations

PDF is the only document type that PyMuPDF can modify. Other file types are read-only.

However, you can convert any document (including images) to PDF with Document.convert_to_pdf() and then apply all PyMuPDF features to the conversion result.

Document.save() always stores a PDF in its current (possibly modified) state on disk.

Usually you can choose whether to save to a new file or just append your changes to the existing file ("incremental save"), which is often much faster.

The following describes ways to manipulate PDF files.

a. Modify, create, rearrange and delete pages

There are several ways to manipulate the so-called page tree (the structure describing all pages) of a PDF:

  • Document.delete_page() and Document.delete_pages() delete pages.

  • Document.copy_page(), Document.fullcopy_page() and Document.move_page() copy or move a page to another location within the same document.

  • Document.select() shrinks a PDF down to selected pages. The parameter is a sequence of the page numbers to keep. These integers must all be in the range 0 <= i < page_count. When executed, all pages missing from this list are deleted. The remaining pages appear in the order, and as many times (!), as you specified.

So you can easily create new PDFs with:

  • the first or last 10 pages

  • only the odd or only the even pages (for duplex printing)

  • pages that do or do not contain a given text

  • the page order reversed

The saved new document will contain the links, annotations and bookmarks that are still valid (i.e. those that point to a selected page or to some external resource).

  • Document.insert_page() and Document.new_page() insert new pages.
    In addition, pages themselves can be modified in a range of ways (e.g. page rotation, annotation and link maintenance, text and image insertion).
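
The page selections listed above for Document.select() are plain lists of 0-based page numbers, so they can be built with ordinary Python. The helper names below are hypothetical and only illustrate how such lists might be constructed.

```python
def first_and_last(page_count, n=10):
    """Page numbers of the first and last n pages, without duplicates."""
    head = list(range(min(n, page_count)))
    tail = [i for i in range(max(page_count - n, 0), page_count) if i not in head]
    return head + tail


def every_other(page_count, start=0):
    """start=0 gives the 1st, 3rd, 5th ... page; start=1 gives the 2nd, 4th ..."""
    return list(range(start, page_count, 2))


def reversed_order(page_count):
    """All page numbers, last page first."""
    return list(range(page_count - 1, -1, -1))


print(every_other(7))            # → [0, 2, 4, 6]
print(reversed_order(4))         # → [3, 2, 1, 0]
print(first_and_last(25, 10))    # pages 0-9 followed by pages 15-24
```

With an open PDF, `doc.select(reversed_order(doc.page_count))` followed by `doc.save("reversed.pdf")` (a placeholder file name) would write the pages in reverse order.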

b. Join and split PDF files

The method Document.insert_pdf() copies pages between different PDF documents. Here is a simple joiner example (doc1 and doc2 are opened PDFs):

# append complete doc2 to the end of doc1
doc1.insert_pdf(doc2)

Here is a snippet that splits doc1. It creates a new document from its first and last 10 pages:

doc2 = fitz.open() # new empty PDF
doc2.insert_pdf(doc1, to_page = 9) # first 10 pages
doc2.insert_pdf(doc1, from_page = len(doc1) - 10) # last 10 pages
doc2.save("first-and-last-10.pdf")
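
The same from_page / to_page arguments can be used to cut a PDF into fixed-size parts. The following is a rough sketch, assuming both arguments are 0-based and inclusive as in the snippet above. `chunk_ranges` and `split_pdf` are hypothetical helper names, the output file naming is made up, and `fitz` is imported inside the function only so the range helper can be used without PyMuPDF installed.

```python
def chunk_ranges(page_count, size):
    """Return (from_page, to_page) pairs, 0-based and inclusive, covering all pages."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [
        (start, min(start + size, page_count) - 1)
        for start in range(0, page_count, size)
    ]


def split_pdf(src, size, prefix):
    """Write `src` (an open PDF Document) as several files of up to `size` pages."""
    import fitz  # PyMuPDF

    names = []
    for from_page, to_page in chunk_ranges(src.page_count, size):
        part = fitz.open()                              # new empty PDF
        part.insert_pdf(src, from_page=from_page, to_page=to_page)
        name = f"{prefix}-{from_page + 1}-{to_page + 1}.pdf"
        part.save(name)
        part.close()
        names.append(name)
    return names


print(chunk_ranges(25, 10))   # → [(0, 9), (10, 19), (20, 24)]
```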

c. Saving

Document.save() always saves the document in its current state.

You can write changes back to the original PDF by specifying the option incremental=True. This process is (usually) very fast, because the changes are appended to the original file without completely rewriting it.
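
A cautious way to use this option is sketched below. `save_changes` is a hypothetical helper; it assumes that an incremental save has to target the document's own file and keep its current encryption, and that Document.can_save_incrementally() reports whether this is possible. Otherwise it falls back to a full save under a different name.

```python
def save_changes(doc, fallback_path):
    """Save incrementally when possible, otherwise write a new, cleaned-up file.

    Returns the path that was written.
    """
    import fitz  # PyMuPDF

    if doc.name and doc.can_save_incrementally():
        # incremental saves target the original file and keep its encryption
        doc.save(doc.name, incremental=True, encryption=fitz.PDF_ENCRYPT_KEEP)
        return doc.name
    # full rewrite: drop unused objects and compress streams
    doc.save(fallback_path, garbage=3, deflate=True)
    return fallback_path
```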

d. Closing

While the program continues to run, it is often desirable to "close" a document so that control of the underlying file is handed back to the operating system.

This can be done with the Document.close() method. Apart from closing the underlying file, the buffers associated with the document are freed as well.

Source: https://blog.csdn.net/ling620/article/details/120035699
Author: ice __ blue

Edited by: official account "About data analysis and visualization"
