
Batch-exporting images and embedded files from Word documents with Python

Editor: Python

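Some questions on a student exam ask for a screenshot and others ask for a file. To make the exam easier for students, they may either submit these separately or embed them in a Word document. So how do we sort out the answers afterwards? Separate submissions are easy: scan the file names, match them to student names, and move each file into the right folder. But how do we get the images and files that were embedded in Word back out?

The requirement is: extract every image (png, jpg) and every embedded file (of any format) from a Word document and put them into a given folder.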

Solution
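A docx file is a zip archive. After unzipping it, the images are usually found under <document name>.docx\word\media\.

Embedded files are usually found under <document name>.docx\word\embeddings\.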
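Because a docx is a plain zip archive, the two folders can be inspected with the standard library alone. The following is a minimal sketch, not part of the original script; the function name list_embedded_parts is made up for illustration.

```python
import zipfile


def list_embedded_parts(docx_path):
    """Return the archive members stored under word/media/ and word/embeddings/."""
    wanted_prefixes = ("word/media/", "word/embeddings/")
    with zipfile.ZipFile(docx_path) as archive:
        return [
            name
            for name in archive.namelist()
            # Keep files only; directory entries end with a slash.
            if name.startswith(wanted_prefixes) and not name.endswith("/")
        ]
```

Calling list_embedded_parts on a document path returns names such as word/media/image1.png, which is a quick way to check what a submission contains before exporting anything.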

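After some searching on Baidu, it turns out that extracting the images is fairly simple with the python-docx library. The loaded document's part.rels is a dict that maps each relationship ID to a relationship object: the object's target_ref gives the file's relative path, and target_part.blob gives the file's content. A quick check on the path and the file suffix tells us whether it is something we want (png or jpg files under media, bin files under embeddings); if it is, we write the content to a new file: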

Extracting images

Install the python-docx library

pip install python-docx

Extract

import os
from docx import Document  # pip install python-docx

is_debug = True

if __name__ == '__main__':
    # Path of the Word document to export from
    target_file = r'paper\HBase試題.docx'
    # Directory that receives the exported files
    output_dir = r'paper\output'
    # Load the Word document
    doc = Document(target_file)
    # Relationships of the main document part
    dict_rel = doc.part.rels
    # r_id: relationship ID, rel: relationship object
    for r_id, rel in dict_rel.items():
        target_ref = str(rel.target_ref)
        # Skip anything that is not under media or embeddings
        if not target_ref.startswith(('media', 'embeddings')):
            continue
        # Skip files whose suffix is not one we want
        file_suffix = target_ref.split('.')[-1]
        if file_suffix.lower() not in ['png', 'jpg', 'bin']:
            continue
        # Create the output directory if it does not exist
        os.makedirs(output_dir, exist_ok=True)
        # Build the name and path of the exported file
        file_name = r_id + '_' + target_ref.replace('/', '_')
        file_path = os.path.join(output_dir, file_name)
        # Write the binary data to the new location
        with open(file_path, "wb") as f:
            f.write(rel.target_part.blob)
        # Print the result
        if is_debug:
            print('Exported:', file_name)

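Running the script shows that the images are all exported correctly, but the Java file a student embedded is not. More precisely, what gets exported is a .bin file, so the extraction is incomplete.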

Extracting embedded files

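Another round of searching on Baidu shows that this .bin file is also a container, but not a zip archive, so it cannot simply be unzipped. Its proper name is an OLE file (a compound file), the same format used by the old doc, xls and ppt documents without the trailing x. So how do we get the file out of it? The search pointed to a project called oletools, so I downloaded it, took a quick look, and found that it does the job.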
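One quick way to see what kind of container an extracted .bin really is: look at its first bytes. OLE compound files start with the eight-byte signature D0 CF 11 E0 A1 B1 1A E1, while zip archives start with PK\x03\x04. The sketch below is an illustration only; sniff_container is a hypothetical helper name, not something from python-docx or oletools.

```python
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file signature
ZIP_MAGIC = b"PK\x03\x04"                      # zip local file header signature


def sniff_container(data):
    """Classify a blob by its leading bytes: 'ole', 'zip', or 'unknown'."""
    if data.startswith(OLE_MAGIC):
        return "ole"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

Reading the first few bytes of an exported file and passing them to sniff_container is enough to decide whether the file should be handed to an OLE tool at all.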

oletools project: https://github.com/decalage2/oletools

Or a mirror that someone put on gitee: https://gitee.com/yunqimg/oletools

I used the gitee copy, because I could not open github QwQ

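According to the documentation, oletools-master\oletools\oleobj.py in the project can extract files from these OLE files with the .bin suffix.

Note: before trying it I ran pip install oletools; without that you may get errors. More precisely, oleobj.py depends on olefile (pip install olefile), which is installed along with oletools.

For a quick test, copy the rId12_embeddings_oleObject1.bin file exported earlier into the directory that contains oleobj.py, open a command prompt in that directory, and run: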

python oleobj.py rId12_embeddings_oleObject1.bin

The export succeeds:

Microsoft Windows [Version 10.0.22000.708]
(c) Microsoft Corporation. All rights reserved.
D:\Minuy\Downloads\oletools-master\oletools-master\oletools>python oleobj.py rId12_embeddings_oleObject1.bin
oleobj 0.56 - http://decalage.info/oletools
THIS IS WORK IN PROGRESS - Check updates regularly!
Please report any issue at https://github.com/decalage2/oletools/issues
-------------------------------------------------------------------------------
File: 'rId12_embeddings_oleObject1.bin'
extract file embedded in OLE object from stream '\x01Ole10Native':
Parsing OLE Package
Filename = "Boos.java"
Source path = "D:\111\´ó20´óÊý¾Ý Àî¾üÁé\Boos.java"
Temp path = "C:\Users\ADMINI~1\AppData\Local\Temp\Boos.java"
saving to file rId12_embeddings_oleObject1.bin_Boos.java
D:\Minuy\Downloads\oletools-master\oletools-master\oletools>

The exported file also opens normally.
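The oleobj log shows a file name, a source path and a temp path being read from the '\x01Ole10Native' stream before the file is saved. A rough sketch of that parsing follows. The field layout here is an assumption based on how oleobj appears to read the stream (a 32-bit size, a 16-bit field, two zero-terminated strings, eight unknown bytes, a third string, then a 32-bit length and the payload); parse_ole10native is a hypothetical name and this is not oleobj's own code.

```python
import struct


def _read_cstring(buf, pos):
    """Read a zero-terminated byte string starting at pos; return (value, next_pos)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end], end + 1


def parse_ole10native(stream):
    """Split an Ole10Native package stream into names and payload (assumed layout)."""
    pos = 4                      # skip the 32-bit total size
    pos += 2                     # skip a 16-bit field of unknown meaning
    filename, pos = _read_cstring(stream, pos)
    src_path, pos = _read_cstring(stream, pos)
    pos += 8                     # skip two 32-bit fields of unknown meaning
    temp_path, pos = _read_cstring(stream, pos)
    (size,) = struct.unpack_from("<I", stream, pos)
    pos += 4
    return {
        "filename": filename,
        "src_path": src_path,
        "temp_path": temp_path,
        "data": stream[pos:pos + size],
    }
```

The strings are left as bytes on purpose: the garbled Source path in the log suggests they are stored in the code page of the machine that created the document. The stream itself could be read with olefile, for example through OleFileIO(path).openstream('\x01Ole10Native').read().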

So I copied the oletools directory into my project and modified oleobj.py slightly so that my own code can call it. Add the following code to oleobj.py:

def export_main(ole_files, output_dir, log_level=DEFAULT_LOG_LEVEL):
    ensure_stdout_handles_unicode()
    logging.basicConfig(level=LOG_LEVELS[log_level], stream=sys.stdout,
                        format='%(levelname)-8s %(message)s')
    # Enable logging in the module
    log.setLevel(logging.NOTSET)
    any_err_stream = False
    any_err_dumping = False
    any_did_dump = False
    for container, filename, data \
            in xglob.iter_files(ole_files,
                                recursive=False,
                                zip_password=None,
                                zip_fname='*'):
        # Skip directory entries
        if container and filename.endswith('/'):
            continue
        # Extract into the output directory
        err_stream, err_dumping, did_dump = \
            process_file(filename, data, output_dir)
        any_err_stream |= err_stream
        any_err_dumping |= err_dumping
        any_did_dump |= did_dump
    return_val = RETURN_NO_DUMP
    if any_did_dump:
        return_val += RETURN_DID_DUMP
    if any_err_stream:
        return_val += RETURN_ERR_STREAM
    if any_err_dumping:
        return_val += RETURN_ERR_DUMP
    return return_val


def export_ole_file(ole_files, output_dir, debug=False):
    log_level = 'info' if debug else 'critical'
    # Export
    result = export_main(ole_files, output_dir, log_level)
    # Only the error flags mean failure; a successful dump also sets a flag
    if result & (RETURN_ERR_STREAM | RETURN_ERR_DUMP) and debug:
        print('Failed to export OLE file', ole_files)

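Then add the following call after the code that writes the extracted file (inside the loop, with export_ole_file imported from the modified oleobj.py):

        if target_ref.startswith('embeddings'):
            # Unpack the embedded file
            export_ole_file([file_path], output_dir)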

Running the script again, the files embedded in the Word document are exported successfully!

Problem solved~
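The script above handles one document at a time. To cover a whole folder of submissions, one option is to first pair every .docx with its own output folder and then run the export for each pair. This is a minimal sketch under that assumption; plan_batch and the folder layout (one output folder named after each document) are made up for illustration.

```python
from pathlib import Path


def plan_batch(input_dir, output_root):
    """Pair every .docx under input_dir with its own output folder."""
    jobs = []
    for docx_path in sorted(Path(input_dir).rglob("*.docx")):
        # Word leaves "~$name.docx" lock files next to open documents; skip them.
        if docx_path.name.startswith("~$"):
            continue
        # One output folder per document, named after the file without suffix.
        jobs.append((docx_path, Path(output_root) / docx_path.stem))
    return jobs
```

Each returned pair can then be used as target_file and output_dir for the export loop, so that files from different students do not overwrite each other.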

Go give it a try!
