Replacing Specified Text in Word Documents with Python

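Say I have a big batch of documents (dozens, even hundreds), and the first page of each one contains "2021年" that I want to change to "2022年".

I don't want to open and edit them one at a time, so I wrote a bit of Python. Now I just run the script once a year and every file is updated. Pretty neat.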

Python has a library called python-docx that works directly with Word documents in .docx format.

Installing python-docx

pip install python-docx

Official documentation: https://python-docx.readthedocs.org/en/latest/

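Once it is installed, it helps to know the objects the library gives you after reading a Word file:

Word file: Document
Paragraph: Paragraph
Text block: Run. How text gets split into runs feels a bit like black magic; a run is not necessarily a whole sentence, as the code further down shows.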
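
To make the three objects concrete, here is a minimal sketch that lists every run of every paragraph. The helper names `describe_runs` and `dump_file` are made up for this illustration; the only python-docx pieces it relies on are `Document`, `doc.paragraphs`, `paragraph.runs` and `run.text`.

```python
def describe_runs(doc):
    """Return (paragraph_index, run_index, run_text) for every run in doc.

    `doc` only needs a `.paragraphs` list whose items have `.runs`,
    which is what a python-docx Document provides.
    """
    rows = []
    for p_idx, paragraph in enumerate(doc.paragraphs):
        for r_idx, run in enumerate(paragraph.runs):
            rows.append((p_idx, r_idx, run.text))
    return rows


def dump_file(path):
    """Print the run layout of one .docx file (hypothetical helper)."""
    from docx import Document  # imported here so describe_runs works without it

    for p_idx, r_idx, text in describe_runs(Document(path)):
        print(p_idx, r_idx, repr(text))
```

Calling `dump_file` on one of your own files shows quickly whether a string such as '2021' sits in a single run or is spread over several.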

Replacing text in Word

In my case I replace "2021年" with "2022年". This assumes the string appears only where I want it changed.

Here is the code first:

import os
from docx import Document
from docx.shared import Pt

def get_filelist(path):
    file_list = []
    for home, dirs, files in os.walk(path):
        for filename in files:
            # Skip the temporary lock files Word creates while a document is open
            if filename.startswith('~$'):
                continue
            # Store the full path of each file
            file_list.append(os.path.join(home, filename))
            # To store only the file name instead:
            # file_list.append(filename)
    return file_list

# Replace a given string in Word documents
def replace_str_in_word(old_str, new_str, docx_file_list):
    for file in docx_file_list:
        doc = Document(file)
        # Go through every paragraph
        for p in doc.paragraphs:
            if old_str in p.text:
                # inline = p.runs
                # for i in inline:
                #     if old_str in i.text:
                #         text = i.text.replace(old_str, new_str)
                #         i.text = text
                p.text = p.text.replace(old_str, new_str)  # replace the string
                # Set the formatting of the paragraph after the replacement
                for i in p.runs:
                    i.font.size = Pt(18)  # 小二 (18 pt)
                    i.font.bold = True  # bold
                    # i.font.name = 'Arial'
                    i.font.name = u'宋体'

        doc.save(file)

print('-----------------Start-----------------')
# Folder containing the Word files
path = r'F:\abc'

file_list = get_filelist(path)  # list of files
print(len(file_list))

# Replace the given string in the Word documents: 2021 becomes 2022
replace_str_in_word('2021', '2022', file_list)

print('-----------------Done!-----------------')
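
One thing to be aware of: `doc.paragraphs` only returns the paragraphs of the document body, so text sitting inside a table is not visited. If the year in your documents lives in a table, a sketch like the following could collect those paragraphs as well. `iter_all_paragraphs` is a name made up here; it relies on `doc.tables`, `table.rows`, `row.cells`, `cell.paragraphs` and `cell.tables` from python-docx.

```python
def iter_all_paragraphs(container):
    """Yield the paragraphs of a document (or table cell), tables included.

    Works on anything exposing `.paragraphs` and `.tables`, such as a
    python-docx Document or a table cell.
    """
    for paragraph in container.paragraphs:
        yield paragraph
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # Cells hold paragraphs and possibly nested tables, so recurse.
                yield from iter_all_paragraphs(cell)
```

To use it, loop with `for p in iter_all_paragraphs(doc):` where the script loops over `doc.paragraphs`. Merged cells may be yielded more than once, which is harmless for a plain string replacement.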

Explanation

1. The replace_str_in_word function

doc = Document(file)
for p in doc.paragraphs:  # read each paragraph; p is a Paragraph object
    if old_str in p.text:  # the old string '2021' occurs in this paragraph
        p.text = p.text.replace(old_str, new_str)  # set the paragraph text to the replaced string

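# 1) About the part I commented out
# That block searches run by run, but the way text is split into runs feels arbitrary:
# my '2021', for example, was split into three runs, '20', '2' and '1', so no single run contains it.
# If the old string does sit entirely inside one run, replacing at run level like this works too.
# inline = p.runs
# for i in inline:
#     if old_str in i.text:
#         text = i.text.replace(old_str, new_str)
#         i.text = text

# 2) About the formatting code: why loop over the runs instead of setting the paragraph's format directly?
# Because I couldn't find a way to set the font on the paragraph itself: font is a property of a run, and the Paragraph object doesn't have one.
# Assigning to p.text also replaces the paragraph's content with a single new run, so the original run formatting is lost and has to be set again.

To see exactly what each paragraph and each run contains, step through the code in a debugger.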
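
If you would rather keep the original formatting than reset it, one possible approach is to replace across run boundaries and leave the run objects in place. This is a sketch of that idea, not the method the script above uses; `replace_across_runs` and `replace_in_paragraph` are hypothetical helper names. The replacement text lands in the run where the match starts, so it inherits that run's formatting.

```python
def replace_across_runs(run_texts, old_str, new_str):
    """Replace old_str in the concatenation of run_texts.

    Returns a new list with the same number of items. The replacement goes
    into the run where the match starts; matched characters are removed
    from the runs that follow.
    """
    texts = list(run_texts)
    if not old_str:
        return texts
    search_from = 0
    while True:
        start = "".join(texts).find(old_str, search_from)
        if start == -1:
            return texts
        end = start + len(old_str)
        pos = 0
        inserted = False
        for i, text in enumerate(texts):
            run_start = pos
            pos += len(text)
            if pos <= start or run_start >= end:
                continue  # this run does not overlap the match
            before = text[:max(start - run_start, 0)]
            after = text[end - run_start:]
            if inserted:
                texts[i] = before + after
            else:
                texts[i] = before + new_str + after
                inserted = True
        # Continue after the inserted text so new_str is never re-matched.
        search_from = start + len(new_str)


def replace_in_paragraph(paragraph, old_str, new_str):
    """Apply replace_across_runs to a paragraph's runs, in place."""
    runs = paragraph.runs
    new_texts = replace_across_runs([r.text for r in runs], old_str, new_str)
    for run, text in zip(runs, new_texts):
        if run.text != text:
            run.text = text  # only touch runs that actually changed
```

With the '20', '2', '1' split described above, the first run ends up holding '2022' and the other two become empty, while every run keeps its own font settings.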

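2. Chinese font size names and their point values

'''
    Font size name to point value:
    初号 (size 0): 42 pt
    小初 (small 0): 36 pt
    一号 (size 1): 26 pt
    小一 (small 1): 24 pt
    二号 (size 2): 22 pt
    小二 (small 2): 18 pt
    三号 (size 3): 16 pt
    小三 (small 3): 15 pt
    四号 (size 4): 14 pt
    小四 (small 4): 12 pt
    五号 (size 5): 10.5 pt
'''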
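
The same table can be kept as a lookup in code, so the script can refer to sizes by name instead of a bare number. This is only a sketch; `FONT_SIZE_PT`, `size_in_pt` and `apply_size` are names invented for the example, and `Pt` is the same class the script imports from `docx.shared`.

```python
# Chinese font size names mapped to point values (same table as above).
FONT_SIZE_PT = {
    "初号": 42, "小初": 36,
    "一号": 26, "小一": 24,
    "二号": 22, "小二": 18,
    "三号": 16, "小三": 15,
    "四号": 14, "小四": 12,
    "五号": 10.5,
}


def size_in_pt(name):
    """Look up the point value for a Chinese font size name."""
    return FONT_SIZE_PT[name]


def apply_size(run, name):
    """Set a python-docx run's font size from a size name (hypothetical helper)."""
    from docx.shared import Pt  # imported lazily; only needed when applying

    run.font.size = Pt(size_in_pt(name))
```

In the replacement loop, `apply_size(i, "小二")` would then stand in for `i.font.size = Pt(18)`.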

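3. What if the files are in .doc format

What if the documents are .doc files rather than .docx?

Convert them all to .docx first: once you have the file list, loop over it and convert each file.

The pywin32 library can do this. It drives Microsoft Word itself, so it needs Windows with Word installed.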

# pip install pywin32
from win32com import client

# Convert a .doc file to .docx
def doc2docx(doc_file):
    word = client.Dispatch("Word.Application")
    doc = word.Documents.Open(doc_file)
    doc.SaveAs("{}x".format(doc_file), 12)  # 12 = wdFormatXMLDocument (.docx)
    doc.Close()
    word.Quit()
    return doc_file+'x'
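
doc2docx starts and quits Word for every single file, which gets slow with hundreds of documents. A possible variant that reuses one Word instance is sketched below. It assumes Windows with Microsoft Word and pywin32 installed; `docx_path_for` and `convert_all` are hypothetical names, and 12 is the same file-format code that doc2docx passes to SaveAs.

```python
import os


def docx_path_for(doc_file):
    """Return the target .docx path for a .doc path (same folder, same stem)."""
    stem, _ext = os.path.splitext(doc_file)
    return stem + ".docx"


def convert_all(doc_files):
    """Convert every .doc file in doc_files to .docx using one Word instance."""
    from win32com import client  # requires pywin32 on Windows

    converted = []
    word = client.Dispatch("Word.Application")
    try:
        for doc_file in doc_files:
            # Word resolves relative paths against its own working directory,
            # so pass absolute paths.
            source = os.path.abspath(doc_file)
            target = docx_path_for(source)
            doc = word.Documents.Open(source)
            try:
                doc.SaveAs(target, 12)  # 12 = wdFormatXMLDocument (.docx)
            finally:
                doc.Close()
            converted.append(target)
    finally:
        word.Quit()  # always release Word, even if a file fails
    return converted
```

The try/finally blocks are there so that a document that fails to open or save does not leave a hidden Word process running.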

4. Batch-replacing a string in Word file names

What if you also want to change "2021年" to "2022年" in the file names?

import os

# Rename files, changing the year in the file name
def rename_year(year, file_list):
    docx_file_list = []
    for file in file_list:
        # file: full path, possibly a .doc
        (file_path, file_name) = os.path.split(file)
        # file_path: F:\abc
        # file_name: 2021年度******.docx
        # print(file_name)
        new_file_name = year + file_name[4:]
        new_file = os.path.join(file_path, new_file_name)
        # print(new_file) # F:\abc\2022年度******.docx
        os.rename(file, new_file)

        # If the file is a .doc, convert it to .docx here while we're at it
        if new_file.endswith(".doc"):
            new_file = doc2docx(new_file)  # doc2docx is described above
        docx_file_list.append(new_file)

    return docx_file_list

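print('-----------------Start-----------------')
# Folder containing the files
path = r'F:\abc'
file_list = get_filelist(path)  # get_filelist is described above as well
print(len(file_list))

# Rename the files, changing the year in the file name. I didn't bother with a real replace here:
# the year is the first 4 characters of every name, so the new year is simply put in their place
docx_file_list = rename_year('2022', file_list)

print('-----------------Done!-----------------')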
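
Overwriting the first four characters only works when every file name starts with the year. If that is not guaranteed, a more defensive sketch replaces the year wherever it appears in the name and leaves other files alone. `new_name_for` and `rename_years` are hypothetical names for this illustration, not part of the script above.

```python
import os


def new_name_for(file_name, old_year, new_year):
    """Return file_name with old_year replaced, or None if it does not occur."""
    if old_year not in file_name:
        return None
    return file_name.replace(old_year, new_year)


def rename_years(file_list, old_year, new_year):
    """Rename files whose names contain old_year; return the resulting paths."""
    result = []
    for file in file_list:
        folder, file_name = os.path.split(file)
        new_name = new_name_for(file_name, old_year, new_year)
        if new_name is None:
            result.append(file)  # nothing to rename, keep the original path
            continue
        new_file = os.path.join(folder, new_name)
        os.rename(file, new_file)
        result.append(new_file)
    return result
```

A call such as `rename_years(file_list, '2021', '2022')` would then skip any file whose name does not contain '2021' rather than mangling it.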