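In an earlier post I used Baidu OCR to extract text from images. During testing, however, it turned out that most of our real-world files are scanned PDFs. Each PDF therefore has to be converted to images first; after that we can loop over the pages and call the Baidu OCR API on each one. Here I use the pdf2image library to convert the PDF to images.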

PDF-to-image source code

# Convert a PDF to images
import os
from pdf2image import convert_from_path

pdf_file = r'C:\Users\admin\Documents\workspace\xxx.pdf'
output_dir = r'C:\Users\admin\Documents\workspace\xx'

# If Poppler is already on PATH, poppler_path can be omitted:
# images = convert_from_path(pdf_file)
images = convert_from_path(pdf_file, poppler_path=r'C:\Users\admin\Documents\workspace\otherapi\poppler-0.68.0\bin')
for i, img in enumerate(images):
    # Join with os.path.join so the files land inside output_dir
    img.save(os.path.join(output_dir, f'page_{i+1}.png'), 'PNG')
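
Since the pages are later fed to the OCR step one by one, it can help to wrap the conversion in a small function that returns the saved file paths. The following is only a minimal sketch: the names save_pages and pdf_to_pngs, the prefix argument and the dpi default of 200 are my own assumptions, not part of pdf2image.

```python
from pathlib import Path


def save_pages(images, output_dir, prefix="page"):
    """Save each page image as <prefix>_<n>.png and return the written paths.

    `images` is any iterable of objects with a PIL-style save(path, format)
    method, such as the list returned by pdf2image.convert_from_path.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    paths = []
    for number, img in enumerate(images, start=1):
        target = out / f"{prefix}_{number}.png"
        img.save(target, "PNG")
        paths.append(target)
    return paths


def pdf_to_pngs(pdf_file, output_dir, poppler_path=None, dpi=200):
    """Hypothetical helper: convert a PDF and save every page as a PNG."""
    # Imported here so that save_pages can be used without pdf2image installed.
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_file, dpi=dpi, poppler_path=poppler_path)
    return save_pages(images, output_dir)
```

The returned list can then be iterated to pass each PNG to the OCR call from the earlier post.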

Fixing the error

Running the script raised an error with the following message:

pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

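A quick search showed that the error means pdf2image cannot find Poppler, the PDF toolkit it relies on (it calls Poppler's pdfinfo program to get the page count). Installing Poppler and telling pdf2image where to find it resolves the problem.
The fix is as follows:

Visit https://blog.alivate.com.au/poppler-windows/ and download the archive.
Extract it and add the path of its bin folder to the PATH environment variable.
Pass the poppler_path argument when calling convert_from_path() (as in the code above).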
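
The last two steps can be combined into a small check that runs before the conversion, so a missing Poppler shows up with a clearer message. This is just a sketch under my own assumptions: resolve_poppler_path is a hypothetical helper, not a pdf2image function, and it only looks for the pdfinfo executable named in the error message.

```python
import shutil
from pathlib import Path


def resolve_poppler_path(candidate=None):
    """Return a value suitable for convert_from_path(poppler_path=...).

    Prefers an explicit bin folder if it really contains pdfinfo, otherwise
    falls back to PATH (returning None lets pdf2image search PATH itself).
    """
    if candidate is not None:
        bin_dir = Path(candidate)
        # The executable is pdfinfo.exe on Windows and pdfinfo elsewhere.
        if any((bin_dir / name).is_file() for name in ("pdfinfo", "pdfinfo.exe")):
            return str(bin_dir)
    if shutil.which("pdfinfo"):
        return None
    raise FileNotFoundError(
        "pdfinfo not found: install Poppler and add its bin folder to PATH, "
        "or pass that folder as poppler_path"
    )
```

The result can be passed straight through, for example convert_from_path(pdf_file, poppler_path=resolve_poppler_path(r'...\bin')) with the bin folder of your own Poppler installation.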

For commercial reprints, please contact the author for permission; for non-commercial reprints, please credit the source.

Python | PDF to Image

http://hncd1024.github.io/2023/06/09/Python_pdf2image/

Author

CHEN DI

Published

2023-06-09
