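I recently worked on a document migration. The source tree has a large number of directories containing Word documents and HTML files, and in the new system we want these files converted to PDF so they can be previewed online.
Here the batch conversion is done in Python. The approach is: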

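Walk the directory tree level by level with the os library and collect the name and path of each file to be processed.
Convert .doc and .docx files to .pdf with the win32com library.
Convert .html files to .pdf with pdfkit.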
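
The directory walk in the first step can be sketched on its own, separate from any conversion. This is a minimal illustration, not the script's actual code: the helper names find_files and pdf_path_for are hypothetical, and the sketch assumes that extensions should be matched case-insensitively and that each PDF is written next to its source file.

```python
import os


def find_files(root_dir, extensions):
    """Yield full paths of files under root_dir whose extension is in
    `extensions` (compared case-insensitively, e.g. ".doc" matches ".DOC")."""
    wanted = {ext.lower() for ext in extensions}
    for folder, _dirs, files in os.walk(root_dir):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            if ext in wanted:
                yield os.path.join(folder, name)


def pdf_path_for(source_path):
    """Return the source path with its extension replaced by .pdf."""
    return os.path.splitext(source_path)[0] + ".pdf"
```

Using os.path.splitext instead of slicing a fixed number of characters keeps the output name correct for any extension length.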

Source code:

import os
from win32com import client
import pdfkit

# Convert a .doc file to PDF
def doc2pdf(fn):
    word = client.Dispatch("Word.Application")  # start the Word application
    try:
        doc = word.Documents.Open(fn)  # open the Word file
        try:
            # Save next to the source with a ".pdf" suffix; 17 is Word's wdFormatPDF
            doc.SaveAs("{}.pdf".format(os.path.splitext(fn)[0]), 17)
        finally:
            doc.Close()  # close the source document
    finally:
        word.Quit()  # quit Word even if the conversion failed

# Convert a .docx file to PDF; the steps are identical, so reuse doc2pdf
def docx2pdf(fn):
    doc2pdf(fn)

def convert_doc_to_pdf(path):
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for file in files:
                # Compare in lower case so .doc/.DOC and .docx/.DOCX all match
                name = file.lower()
                if name.endswith('.doc') or name.endswith('.docx'):
                    doc_path = os.path.join(root, file)
                    print(doc_path)
                    if name.endswith('.doc'):
                        doc2pdf(doc_path)
                    else:
                        docx2pdf(doc_path)
                    os.remove(doc_path)  # delete the source file after conversion
                    print('{}: conversion complete'.format(file))
    else:
        print(f"Path provided is not a directory: {path}")

def html_to_pdf(path):
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for file in files:
                if file.lower().endswith('.html'):  # matches .html and .HTML
                    html_path = os.path.join(root, file)
                    pdf_path = '{}.pdf'.format(os.path.splitext(html_path)[0])
                    try:
                        pdfkit.from_file(html_path, pdf_path)
                    except OSError as e:
                        # Ignore ProtocolUnknownError; re-raise anything else
                        if 'ProtocolUnknownError' not in str(e):
                            raise
                    os.remove(html_path)  # delete the source HTML file
                    print('{}: conversion complete'.format(html_path))
    else:
        print(f"Path provided is not a directory: {path}")
  
if __name__ == '__main__':
    path = r'D:\workspace\旧公文系统'
    convert_doc_to_pdf(path)
    html_to_pdf(path)
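
One file that fails to convert will stop the whole run with an exception. A possible way around this, sketched below under stated assumptions, is to wrap each conversion so that failures are recorded and the loop moves on. The name convert_all is hypothetical and not part of the script above, and the sketch only helps when the converter actually raises an exception for the bad file.

```python
def convert_all(paths, convert):
    """Call convert(path) for every path.

    Returns (converted, failed): `converted` lists the paths that succeeded,
    `failed` lists (path, error message) pairs for the ones that raised.
    """
    converted, failed = [], []
    for path in paths:
        try:
            convert(path)
        except Exception as exc:  # deliberately broad: any failure skips the file
            failed.append((path, str(exc)))
        else:
            converted.append(path)
    return converted, failed
```

For example, calling convert_all(doc_paths, doc2pdf), where doc_paths is a list of paths you have collected, would leave the failed files in place so they can be reviewed by hand afterwards.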

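Notes:

win32com is the Python library for accessing Windows COM components, and it drives the Word application installed on the current machine. It is therefore best not to work on other Word documents on that machine while the script is running. To use win32com, the pywin32 package must be installed.
Password-protected Word documents cannot be processed. I ran into a few of them during this migration, and since I did not have the passwords either, I skipped them.
pdfkit is a Python library for converting HTML content to PDF. It is built on the wkhtmltopdf command-line tool and calls that tool to do the conversion. To use pdfkit, you therefore need to install wkhtmltopdf first and add it to the PATH environment variable. Once that is done, you can verify HTML-to-PDF conversion from cmd: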

wkhtmltopdf https://www.baidu.com D:/test.pdf
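
The same check can be made from Python before starting a long batch run. This is a minimal sketch: the function names find_wkhtmltopdf and make_pdfkit_config are hypothetical, while shutil.which and pdfkit.configuration(wkhtmltopdf=...) are existing calls in the standard library and in pdfkit respectively.

```python
import shutil


def find_wkhtmltopdf(name="wkhtmltopdf"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def make_pdfkit_config(executable_path):
    """Build a pdfkit configuration that points at an explicit executable.

    pdfkit is imported here so the PATH check above works without it.
    """
    import pdfkit
    return pdfkit.configuration(wkhtmltopdf=executable_path)


if __name__ == "__main__":
    exe = find_wkhtmltopdf()
    if exe is None:
        print("wkhtmltopdf was not found on PATH")
    else:
        print("wkhtmltopdf found at", exe)
```

If the tool is installed but not on PATH, the object returned by make_pdfkit_config can be passed to pdfkit.from_file through its configuration argument instead of editing the environment variable.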

For commercial reposting, contact the author for permission; for non-commercial reposting, credit the source.
