line_profiler Profiling in Practice -- Optimizing an Inverted Index


2021-06-15

Summary: line_profiler in practice

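[If you are interested in algorithms, mathematics, and computing, feel free to follow me for more original articles]
My website: 潮汐朝夕的生活实验室
My WeChat official account: 算法题刷刷
My Zhihu: 潮汐朝夕
My GitHub: FennelDumplings
My LeetCode: FennelDumplings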

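In the article "Python Profiling Basics" we covered the basic concepts and methodology of profiling. Integrating profiling into the development process helps us improve the quality of what we build.

Then, in "Python Profilers: cProfile", we took a closer look at the cProfile profiler. After that, in "Profiling and Optimization in Practice with cProfile", we practiced profiling and optimizing with cProfile on two examples: computing Fibonacci numbers, and reading a large csv file and computing simple statistics on it.

With cProfile we can profile code and obtain the number of calls to each function and the total number of calls. It helps us improve the code from a system-wide view. Later, in "Python Profilers: line_profiler", we studied line_profiler, a profiler that reports performance details for every line of code, and tried it on the Fibonacci example.

In this article we practice with line_profiler on a larger example: optimizing code that builds an inverted index.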

line_profiler profiling example: inverted index

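How an inverted index works: scan the files in advance, split their contents into words, and store the mapping from each word to the files that contain it (sometimes the position of the word is recorded as well).

With the index stored in a hash table, looking up a word then takes O(1) time on average.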
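To make the idea concrete, here is a minimal sketch of an inverted index that records only which documents contain each word. The document names and contents are made up for illustration, and `search` is a hypothetical helper, not part of the code profiled later.

```python
from collections import defaultdict

# Hypothetical in-memory "files"; names and contents are made up for illustration.
docs = {
    "a.txt": "This is a file",
    "b.txt": "This is another file",
}

# Pre-scan: map every word to the set of documents that contain it.
index = defaultdict(set)
for name, text in docs.items():
    for word in text.split():
        index[word].add(name)

def search(word):
    """Return the documents containing word, using a single dict lookup."""
    return sorted(index.get(word, ()))

print(search("another"))
print(search("This"))
```

The lookup cost does not depend on how many documents were scanned, which is the point of building the index up front.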

The following example shows what we want to do. Suppose there are three files, file1.txt, file2.txt and file3.txt, with the following contents:

file1.txt:

This is a file

file2.txt:

This is another file

file3.txt:

This is the third file
the another line

We want to obtain the following index (file1.txt, file2.txt and file3.txt appear here as ./files/1.txt, ./files/2.txt and ./files/3.txt):

This, (./files/1.txt, (0, 0)),(./files/2.txt, (0, 0)),(./files/3.txt, (0, 0))
is, (./files/1.txt, (0, 5)),(./files/2.txt, (0, 5)),(./files/3.txt, (0, 5))
a, (./files/1.txt, (0, 8))
file, (./files/1.txt, (0, 10)),(./files/2.txt, (0, 16)),(./files/3.txt, (0, 18))
another, (./files/2.txt, (0, 8)),(./files/3.txt, (1, 4))
the, (./files/3.txt, (0, 8),(1, 0))
third, (./files/3.txt, (0, 12))
line, (./files/3.txt, (1, 12))

The format is: word, (file 1, (line number 1, offset in line 1), (line number 2, offset in line 2), ...), (file 2, (line number 1, offset in line 1), (line number 2, offset in line 2), ...), ... Line numbers and offsets both start at 0.
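As a quick check of what (line number, offset) means, the sketch below recomputes the positions for the two-line third file. It uses `re.finditer`, which is a different technique from the code profiled later in this article; `word_positions` is a hypothetical helper name.

```python
import re

def word_positions(text):
    """Yield (word, line_number, offset) with 0-based line numbers and offsets."""
    for line_no, line in enumerate(text.split("\n")):
        # Each match is one run of non-space characters; start() is its offset.
        for match in re.finditer(r"\S+", line):
            yield match.group(), line_no, match.start()

text = "This is the third file\nthe another line"
positions = list(word_positions(text))
for word, line_no, offset in positions:
    print(word, (line_no, offset))
```

The word "the" shows up at (0, 8) and (1, 0), matching its entry in the index above.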

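To sum up, we want to implement code that computes index positions:

Scan the given directory for files with the .txt extension
Go through all the files and read in all their lines
Go through every line and return the tokenization result: each word together with its file, its line number, and its offset within the line
Write each word and its file positions to the output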

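Generating the data

Test data: 1000 files, each with a random number of lines between 1 and 500, and each line with a random number of words between 1 and 50. In the code below, words is the vocabulary, which contains 80000 words.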

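import numpy as np

# `words` is the vocabulary list (about 80000 words); how it is loaded is not
# shown in the original. N is assumed to be its length.
N = len(words)

for i in range(1000):
    f_name = str(i) + ".txt"
    n_line = np.random.randint(1, 500)  # upper bound is exclusive: 1..499 lines
    lines = []
    for i_line in range(n_line):
        words_list = []
        n_word = np.random.randint(1, 50)  # upper bound is exclusive: 1..49 words
        for i_word in range(n_word):
            word = words[np.random.randint(N)]
            words_list.append(word)
        lines.append(" ".join(words_list))
    # The with block closes the file even if the write fails.
    with open(f_name, "w") as g:
        g.write("\n".join(lines))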

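Below we test each version of the code on the generated data.

Initial code

The main logic of this code is:

First list all txt files in the target directory (get_file_names)
Then, for each file, return the list of its lines (read_file)
Go through the lines in the list and update the dictionary (get_words). For each line:
    First store all the words of the line, in order, in a list
    Go through all the words in the list:
        Compute the offset of the word within the line
        Record in the dictionary: word -> list of (file, line number, offset)
Go through every word in the dictionary and convert its value from [a list of (file, line number, offset)] to [a dictionary mapping file -> list of (line number, offset)] (list2dict)
Go through every word in the dictionary and write its information to the file (saveIndex)

Note: to show the effect of the optimizations, the code is deliberately written badly.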

import sys
import os
import glob

def get_file_names(folder):
    return glob.glob("{}/*.txt".format(folder))

def read_file(filepath):
    f = open(filepath, "r")
    return f.readlines()

def get_offset_upto_word(words, index):
    if index == 0:
        return 0

    sub_list = words[0:index]
    length = 0
    for w in sub_list:
        length += len(w)
    return length + index

def list2dict(l):
    res = {}
    for item in l:
        if item[0] not in res:
            res[item[0]] = []
        res[item[0]].append((item[1], item[2]))
    return res

def get_words(lines, filename, word_index_dict):
    STRIP_CHARS = ",.\t\n |"

    for line_idx, line in enumerate(lines):
        line = line.strip(STRIP_CHARS)
        local_words = line.split(" ")
        for idx, word in enumerate(local_words):
            word = word.strip(STRIP_CHARS)
            if word_index_dict.get(word) == None:
                word_index_dict[word] = []

            offset = get_offset_upto_word(local_words, idx)
            word_index_dict[word].append([filename, line_idx, offset])
    return word_index_dict

def saveIndex(index):
    lines = []
    for word in index:
        index_line = ""
        glue = ""
        for filename in index[word]:
            index_line += "{}({}, {})".format(glue, filename, ",".join(map(str, index[word][filename])))
            glue = ","
        lines.append("{}, {}".format(word, index_line))

    f = open("index-file.txt", "w")
    f.write("\n".join(lines))
    f.close()

def main():
    files = get_file_names("./files")
    words = {}
    for f in files:
        lines = read_file(f)
        words = get_words(lines, f, words)
    for word in words:
        words[word] = list2dict(words[word])
    saveIndex(words)

if __name__ == "__main__":
    main()

The profile results are as follows

Wrote profile results to my.py.lprof
Timer unit: 1e-06 s
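The article does not show the command that produced this output. Output of this shape is what kernprof prints when a script is run as `kernprof -l -v my.py` with the functions of interest decorated with `@profile`. The sketch below shows one common way to keep such a script runnable with plain `python` as well; `total_length` is a made-up function used only for illustration.

```python
import builtins

# kernprof -l injects a `profile` decorator into builtins. When the script is
# run with plain `python`, fall back to a decorator that does nothing.
try:
    profile = builtins.profile
except AttributeError:
    def profile(func):
        return func

@profile
def total_length(words):
    """Add up the lengths of all words (the kind of loop profiled below)."""
    total = 0
    for w in words:
        total += len(w)
    return total

if __name__ == "__main__":
    print(total_length(["This", "is", "a", "file"]))
```

Run under kernprof, each decorated function gets its own per-line table like the ones that follow.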

Below we look at the profile results function by function

(1) get_offset_upto_word

This function takes a lot of time and is a target for optimization.

Total time: 90.3461 s
File: my.py
Function: get_offset_upto_word at line 16

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    16                                           @profile
    17                                           def get_offset_upto_word(words, index):
    18   6232971    2434606.0      0.4      2.7      if index == 0:
    19    280776      94231.0      0.3      0.1          return 0
    20                                           
    21   5952195    4741522.0      0.8      5.2      sub_list = words[0:index]
    22   5952195    1866914.0      0.3      2.1      length = 0
    23 104914808   41056776.0      0.4     45.4      for w in sub_list:
    24  98962613   38065444.0      0.4     42.1          length += len(w)
    25   5952195    2086572.0      0.4      2.3      return length + index

(2) get_words

This function does a lot of work and takes the longest. It contains two nested for loops.

Total time: 193.104 s
File: my.py
Function: get_words at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    36                                           @profile
    37                                           def get_words(lines, filename, word_index_dict):
    38      1001        649.0      0.6      0.0      STRIP_CHARS = ",.\t\n |"
    39                                           
    40    281777     220057.0      0.8      0.1      for line_idx, line in enumerate(lines):
    41    280776     297706.0      1.1      0.2          line = line.strip(STRIP_CHARS)
    42    280776     830554.0      3.0      0.4          local_words = line.split(" ")
    43   6513747    3297521.0      0.5      1.7          for idx, word in enumerate(local_words):
    44   6232971    3798246.0      0.6      2.0              word = word.strip(STRIP_CHARS)
    45   6232971    9106854.0      1.5      4.7              if word_index_dict.get(word) == None:
    46    109005      71077.0      0.7      0.0                  word_index_dict[word] = []
    47                                           
    48   6232971  168457939.0     27.0     87.2              offset = get_offset_upto_word(local_words, idx)
    49   6232971    7023140.0      1.1      3.6              word_index_dict[word].append([filename, line_idx, offset])
    50      1001        451.0      0.5      0.0      return word_index_dict
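To read these tables, recall that the Time column is expressed in timer units (1e-06 s here). The short calculation below reproduces the Per Hit and % Time values of the get_offset_upto_word call line from the numbers in the table above; the variable names are ours.

```python
TIMER_UNIT = 1e-6            # seconds, from "Timer unit: 1e-06 s"

hits = 6232971               # Hits column of the call line
time_units = 168457939.0     # Time column of the call line
total_seconds = 193.104      # "Total time" of get_words

line_seconds = time_units * TIMER_UNIT          # time spent on this line, in seconds
per_hit = time_units / hits                     # timer units per execution
percent = 100 * line_seconds / total_seconds    # share of the function's total time

print(round(line_seconds, 1), round(per_hit, 1), round(percent, 1))
```

So this single line accounts for about 168 of the 193 seconds measured for get_words.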

(3) list2dict

Converts a list of arrays into a dictionary, using the first element of each array as the key.

Total time: 42.5283 s
File: my.py
Function: list2dict at line 27

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    27                                           @profile
    28                                           def list2dict(l):
    29    109005      59074.0      0.5      0.1      res = {}
    30   6341976   31511231.0      5.0     74.1      for item in l:
    31   6232971    3994870.0      0.6      9.4          if item[0] not in res:
    32   5937732    2849036.0      0.5      6.7              res[item[0]] = []
    33   6232971    4075813.0      0.7      9.6          res[item[0]].append((item[1], item[2]))
    34    109005      38239.0      0.4      0.1      return res

(4) read_file

It only reads the lines of a file and returns them as a list. There is nothing much to optimize.

Total time: 0.22231 s
File: my.py
Function: read_file at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def read_file(filepath):
    13      1001      82928.0     82.8     37.3      f = open(filepath, "r")
    14      1001     139382.0    139.2     62.7      return f.readlines()

(5) getFileNames

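Gets all txt files in the target directory. There is nothing to optimize. (In the profiled script this function is named getFileNames; it is the get_file_names function in the listing above.)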

Total time: 0.011098 s
File: my.py
Function: getFileNames at line 7

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     7                                           @profile
     8                                           def getFileNames(folder):
     9         1      11098.0  11098.0    100.0      return glob.glob("{}/*.txt".format(folder))

(6) saveIndex

Writes the inverted index to a file.

Total time: 21.9244 s
File: my.py
Function: saveIndex at line 52

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    52                                           @profile
    53                                           def saveIndex(index):
    54         1          1.0      1.0      0.0      lines = []
    55    109006      68024.0      0.6      0.3      for word in index:
    56    109005      61270.0      0.6      0.3          index_line = ""
    57    109005      51621.0      0.5      0.2          glue = ""
    58   6046737    3226772.0      0.5     14.7          for filename in index[word]:
    59   5937732   14828832.0      2.5     67.6              index_line += "{}({}, {})".format(glue, filename, ",".join(map(str, index[word][filename])))
    60   5937732    3050887.0      0.5     13.9              glue = ","
    61    109005     274289.0      2.5      1.3          lines.append("{}, {}".format(word, index_line))
    62                                           
    63         1      11916.0  11916.0      0.1      f = open("index-file.txt", "w")
    64         1     273133.0 273133.0      1.2      f.write("\n".join(lines))
    65         1      77619.0  77619.0      0.4      f.close()

(7) main

The main function main() mostly calls the other functions. It carries no performance burden of its own and needs no optimization.

Total time: 292.197 s
File: my.py
Function: main at line 67

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    67                                           @profile
    68                                           def main():
    69         1      11114.0  11114.0      0.0      files = getFileNames("./files")
    70         1          1.0      1.0      0.0      words = {}
    71      1002        795.0      0.8      0.0      for f in files:
    72      1001     283130.0    282.8      0.1          lines = read_file(f)
    73      1001  207631960.0 207424.5     71.1          words = get_words(lines, f, words)
    74    109006      61103.0      0.6      0.0      for word in words:
    75    109005   51880454.0    475.9     17.8          words[word] = list2dict(words[word])
    76         1   32328905.0 32328905.0     11.1      saveIndex(words)

Optimizing the code

(1) get_offset_upto_word

Original code

Most of its lines simply add up the lengths of the words

def get_offset_upto_word(words, index):
    if index == 0:
        return 0

    sub_list = words[0:index]
    length = 0
    for w in sub_list:
        length += len(w)
    return length + index

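Optimization 1

Use reduce to add up the lengths of the words in the array

from functools import reduce  # reduce is not a builtin in Python 3

def get_offset_upto_word(words, index):
    if index == 0:
        return 0
    # Fold the word lengths into a single sum, starting from 0.
    length = reduce(lambda curr, w: len(w) + curr, words[0:index], 0)
    return length + index

This only removes the redundant variable declarations and lookups. The time drops from 90 seconds to 55 seconds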

Total time: 55.8104 s
File: my.py
Function: get_offset_upto_word at line 17

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    17                                           @profile
    18                                           def get_offset_upto_word(words, index):
    19   6232971    3345346.0      0.5      6.0      if index == 0:
    20    280776     137728.0      0.5      0.2          return 0
    21                                           
    22   5952195   49251859.0      8.3     88.2      length = reduce(lambda curr, w: len(w) + curr, words[0:index], 0)
    23                                               # sub_list = words[0:index]
    24                                               # length = 0
    25                                               # for w in sub_list:
    26                                               #     length += len(w)
    27   5952195    3075432.0      0.5      5.5      return length + index

Optimization 2

Every time get_offset_upto_word is called, the lambda expression creates a new function object. We replace the lambda with a function defined in advance.

def add_word_length(curr, w):
    return len(w) + curr

@profile
def get_offset_upto_word(words, index):
    if index == 0:
        return 0

    length = reduce(add_word_length, words[0:index], 0)
    return length + index

55s -> 51s

Total time: 51.495 s
File: my.py
Function: get_offset_upto_word at line 20

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    20                                           @profile
    21                                           def get_offset_upto_word(words, index):
    22   6232971    2372675.0      0.4      4.6      if index == 0:
    23    280776      90270.0      0.3      0.2          return 0
    24                                           
    25   5952195   46799252.0      7.9     90.9      length = reduce(add_word_length, words[0:index], 0)
    26   5952195    2232821.0      0.4      4.3      return length + index
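The point about the lambda can be checked directly. The sketch below returns the function object used by each variant so that we can compare identities. `offset_with_lambda` and `offset_with_named` are hypothetical names introduced only for this demonstration.

```python
from functools import reduce

def add_word_length(curr, w):
    return len(w) + curr

def offset_with_lambda(words, index):
    # The lambda expression is evaluated on every call, creating a new function object.
    fn = lambda curr, w: len(w) + curr
    return fn, reduce(fn, words[0:index], 0) + index

def offset_with_named(words, index):
    # The module-level function is created once and reused on every call.
    return add_word_length, reduce(add_word_length, words[0:index], 0) + index

words = ["This", "is", "a", "file"]
fn_a, off_a = offset_with_lambda(words, 3)
fn_b, off_b = offset_with_lambda(words, 3)
fn_c, off_c = offset_with_named(words, 3)
fn_d, off_d = offset_with_named(words, 3)

print(fn_a is fn_b, fn_c is fn_d, off_a, off_c)
```

Both variants compute the same offset. Only the first one pays for building a function object on each call, which is consistent with the modest gain measured above.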

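Optimization 3

We also see that the first two lines of the function still account for many unwanted hits.

The if check is unnecessary: when index is 0 the slice words[0:index] is empty, so reduce returns its initial value 0 and the result is 0 + 0. The length variable is unnecessary as well: we can return the sum of the length and the index directly.

def get_offset_upto_word(words, index):
    return reduce(add_word_length, words[0:index], 0) + index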

51s -> 43s

Total time: 42.9398 s
File: my.py
Function: get_offset_upto_word at line 22

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    22                                           @profile
    23                                           def get_offset_upto_word(words, index):
    24   6232971   42939762.0      6.9    100.0      return reduce(add_word_length, words[0:index], 0) + index
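A small check of the one-line version, including the index == 0 case that the removed if statement used to handle. The word list is the first line of the third example file; the comparison against `sum` is our own cross-check and is not part of the article's code.

```python
from functools import reduce

def add_word_length(curr, w):
    return len(w) + curr

def get_offset_upto_word(words, index):
    return reduce(add_word_length, words[0:index], 0) + index

words = ["This", "is", "the", "third", "file"]
offsets = [get_offset_upto_word(words, i) for i in range(len(words))]
print(offsets)

# Cross-check with a plain sum over the same slice (an alternative spelling).
same = all(
    get_offset_upto_word(words, i) == sum(map(len, words[:i])) + i
    for i in range(len(words))
)
print(same)
```

For index 0 the slice is empty and reduce simply returns the initial value, so no special case is needed.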

(2) get_words

Original code

The call to get_offset_upto_word takes the most time.

def get_words(lines, filename, word_index_dict):
    STRIP_CHARS = ",.\t\n |"

    for line_idx, line in enumerate(lines):
        line = line.strip(STRIP_CHARS)
        local_words = line.split(" ")
        for idx, word in enumerate(local_words):
            word = word.strip(STRIP_CHARS)
            if word_index_dict.get(word) == None:
                word_index_dict[word] = []

            offset = get_offset_upto_word(local_words, idx)
            word_index_dict[word].append([filename, line_idx, offset])
    return word_index_dict

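Optimization 1

Besides that, the function uses a dictionary variable, word_index_dict. Before inserting a new key we have to check whether the key already exists; defaultdict removes the need for this check.

Change words = {} in main to words = defaultdict(list) (with from collections import defaultdict at the top of the file), then delete the following two lines from get_words.

if word_index_dict.get(word) == None:
    word_index_dict[word] = []

193s -> 75s. Most of this gain comes from the already optimized get_offset_upto_word: the line that calls it drops from about 168 s to about 57 s.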

Total time: 75.5337 s
File: my.py
Function: get_words at line 34

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    34                                           @profile
    35                                           def get_words(lines, filename, word_index_dict):
    36      1001        715.0      0.7      0.0      STRIP_CHARS = ",.\t\n |"
    37                                           
    38    281777     195820.0      0.7      0.3      for line_idx, line in enumerate(lines):
    39    280776     273028.0      1.0      0.4          line = line.strip(STRIP_CHARS)
    40    280776     705464.0      2.5      0.9          local_words = line.split(" ")
    41   6513747    3091355.0      0.5      4.1          for idx, word in enumerate(local_words):
    42   6232971    3538385.0      0.6      4.7              word = word.strip(STRIP_CHARS)
    43                                           
    44   6232971   56992240.0      9.1     75.5              offset = get_offset_upto_word(local_words, idx)
    45   6232971   10736346.0      1.7     14.2              word_index_dict[word].append([filename, line_idx, offset])
    46      1001        396.0      0.4      0.0      return word_index_dict

Optimization 2

The optimized get_offset_upto_word is a single line and is called only from get_words, so we write it directly into get_words.

def get_words(lines, filename, word_index_dict):
    STRIP_CHARS = ",.\t\n |"
    for line_idx, line in enumerate(lines):
        line = line.strip(STRIP_CHARS)
        local_words = line.split(" ")
        for idx, word in enumerate(local_words):
            word = word.strip(STRIP_CHARS)

            offset =  reduce(add_word_length, local_words[0:idx], 0) + idx
            word_index_dict[word].append([filename, line_idx, offset])
    return word_index_dict

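The time drops from about 75s to about 61s. Note, however, that if the function also had to be called from other places, inlining it like this would make the code harder to maintain.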

Total time: 61.1247 s
File: my.py
Function: get_words at line 31

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    31                                           @profile
    32                                           def get_words(lines, filename, word_index_dict):
    33      1001        973.0      1.0      0.0      STRIP_CHARS = ",.\t\n |"
    34    281777     188492.0      0.7      0.3      for line_idx, line in enumerate(lines):
    35    280776     258588.0      0.9      0.4          line = line.strip(STRIP_CHARS)
    36    280776     659000.0      2.3      1.1          local_words = line.split(" ")
    37   6513747    3096649.0      0.5      5.1          for idx, word in enumerate(local_words):
    38   6232971    3557934.0      0.6      5.8              word = word.strip(STRIP_CHARS)
    39                                           
    40   6232971   42761007.0      6.9     70.0              offset =  reduce(add_word_length, local_words[0:idx], 0) + idx
    41   6232971   10601646.0      1.7     17.3              word_index_dict[word].append([filename, line_idx, offset])
    42      1001        435.0      0.4      0.0      return word_index_dict
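The offset line still dominates because it re-adds the lengths of all preceding words for every word in the line. One possible next step, which the article does not take, is to carry a running offset through the inner loop. The sketch below is our own variant under that assumption; `get_words_running` is a hypothetical name.

```python
from collections import defaultdict

STRIP_CHARS = ",.\t\n |"

def get_words_running(lines, filename, word_index_dict):
    """Like get_words, but keeps a running offset instead of re-summing lengths."""
    for line_idx, line in enumerate(lines):
        offset = 0
        for raw in line.strip(STRIP_CHARS).split(" "):
            word_index_dict[raw.strip(STRIP_CHARS)].append([filename, line_idx, offset])
            offset += len(raw) + 1  # the raw word plus the single space after it
    return word_index_dict

lines = ["This is the third file\n", "the another line\n"]
index = get_words_running(lines, "./files/3.txt", defaultdict(list))
print(index["the"])
```

The offsets are the same as with the reduce version, since both count the lengths of the preceding raw words plus one separator per word, but each line is processed in time linear in its number of words.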

(3) list2dict

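Replace dict with defaultdict

from collections import defaultdict

def list2dict(l):
    # Missing keys start out as an empty list, so no membership check is needed.
    res = defaultdict(list)
    for item in l:
        res[item[0]].append((item[1], item[2]))
    return res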

42s -> 25s

Total time: 25.5687 s
File: my.py
Function: list2dict at line 22

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    22                                           @profile
    23                                           def list2dict(l):
    24    109005     105236.0      1.0      0.4      res = defaultdict(list)
    25   6341976    6620497.0      1.0     25.9      for item in l:
    26   6232971   18803843.0      3.0     73.5          res[item[0]].append((item[1], item[2]))
    27    109005      39078.0      0.4      0.2      return res
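The two versions of list2dict can be compared on a small input to confirm that they build the same mapping. The names `list2dict_plain` and `list2dict_default` and the sample entries are ours, introduced for this comparison only.

```python
from collections import defaultdict

def list2dict_plain(l):
    res = {}
    for item in l:
        if item[0] not in res:
            res[item[0]] = []
        res[item[0]].append((item[1], item[2]))
    return res

def list2dict_default(l):
    res = defaultdict(list)
    for item in l:
        res[item[0]].append((item[1], item[2]))
    return res

entries = [["./files/3.txt", 0, 8], ["./files/1.txt", 0, 3], ["./files/3.txt", 1, 0]]
plain = list2dict_plain(entries)
default = list2dict_default(entries)
print(plain == default)

# Caveat: indexing a defaultdict with a missing key inserts that key.
default["./files/9.txt"]
print(len(plain), len(default))
```

One thing to keep in mind when adopting defaultdict: reading a missing key with square brackets creates it, while `in` and `.get()` do not.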

(4) saveIndex

Original code

def saveIndex(index):
    lines = []
    for word in index:
        index_line = ""
        glue = ""
        for filename in index[word]:
            index_line += "{}({}, {})".format(glue, filename, ",".join(map(str, index[word][filename])))
            glue = ","
        lines.append("{}, {}".format(word, index_line))

    f = open("index-file.txt", "w")
    f.write("\n".join(lines))
    f.close()

The resulting file has the following format

outlook, (./files/783.txt, (0, 0),(332, 353)),(./files/359.txt, (370, 158)),(./files/934.txt, (9, 259)),(./files/24.txt, (6, 37)),(./files/996.txt, (63, 386)),(./files/187.txt, (305, 178)),(./files/386.txt, (229, 0)),(./files/474.txt, (360, 247)),(./files/840.txt, (25, 92)),(./files/425.txt, (2, 371)),(./files/999.txt, (191, 165)),(./files/335.txt, (9, 21)),(./files/794.txt, (26, 74)),(./files/980.txt, (277, 424)),(./files/63.txt, (164, 136)),(./files/235.txt, (208, 137)),(./files/658.txt, (90, 62),(357, 153)),(./files/562.txt, (52, 38)),(./files/241.txt, (24, 310)),(./files/678.txt, (11, 221)),(./files/537.txt, (210, 59)),(./files/776.txt, (55, 301)),(./files/604.txt, (9, 273)),(./files/121.txt, (11, 64)),(./files/919.txt, (102, 41)),(./files/206.txt, (105, 39)),(./files/200.txt, (326, 178)),(./files/19.txt, (26, 146)),(./files/856.txt, (283, 51)),(./files/703.txt, (25, 22)),(./files/305.txt, (5, 8)),(./files/106.txt, (121, 239)),(./files/248.txt, (85, 238)),(./files/680.txt, (129, 56)),(./files/70.txt, (329, 160)),(./files/880.txt, (338, 19)),(./files/417.txt, (92, 358)),(./files/312.txt, (0, 202)),(./files/694.txt, (313, 89)),(./files/173.txt, (1, 241)),(./files/665.txt, (38, 198)),(./files/859.txt, (215, 99)),(./files/815.txt, (79, 34)),(./files/842.txt, (350, 382),(422, 312)),(./files/546.txt, (57, 338)),(./files/18.txt, (166, 114)),(./files/997.txt, (305, 78)),(./files/772.txt, (294, 79)),(./files/176.txt, (4, 64)),(./files/122.txt, (32, 187)),(./files/782.txt, (190, 186)),(./files/392.txt, (73, 242)),(./files/395.txt, (67, 159)),(./files/442.txt, (358, 174)),(./files/754.txt, (49, 21)),(./files/672.txt, (130, 83)),(./files/659.txt, (424, 173)),(./files/74.txt, (322, 73)),(./files/707.txt, (423, 0)),(./files/243.txt, (15, 298)),(./files/552.txt, (78, 313)),(./files/307.txt, (20, 281)),(./files/26.txt, (317, 146)),(./files/663.txt, (347, 269)),(./files/858.txt, (122, 141)),(./files/549.txt, (1, 51))
jabots, (./files/783.txt, (0, 8)),(./files/898.txt, (383, 234)),(./files/751.txt, (51, 191)),(./files/386.txt, (20, 38)),(./files/208.txt, (8, 97)),(./files/79.txt, (284, 24)),(./files/226.txt, (129, 232)),(./files/269.txt, (138, 35)),(./files/597.txt, (235, 0)),(./files/835.txt, (97, 81)),(./files/12.txt, (82, 31)),(./files/189.txt, (83, 188)),(./files/152.txt, (240, 173)),(./files/889.txt, (449, 66)),(./files/59.txt, (105, 408)),(./files/497.txt, (29, 187)),(./files/548.txt, (283, 209)),(./files/204.txt, (140, 46)),(./files/245.txt, (19, 101)),(./files/283.txt, (141, 212)),(./files/276.txt, (199, 276)),(./files/748.txt, (157, 214)),(./files/273.txt, (315, 31)),(./files/585.txt, (165, 98)),(./files/1.txt, (21, 33)),(./files/814.txt, (35, 182)),(./files/126.txt, (280, 272)),(./files/746.txt, (35, 84)),(./files/94.txt, (134, 151)),(./files/272.txt, (119, 257)),(./files/924.txt, (89, 109)),(./files/721.txt, (129, 0)),(./files/421.txt, (177, 90)),(./files/462.txt, (26, 31)),(./files/503.txt, (124, 61)),(./files/694.txt, (135, 176)),(./files/586.txt, (46, 11)),(./files/396.txt, (77, 242)),(./files/113.txt, (313, 80)),(./files/57.txt, (22, 99)),(./files/464.txt, (133, 154)),(./files/631.txt, (212, 31)),(./files/761.txt, (29, 125)),(./files/357.txt, (171, 329)),(./files/879.txt, (61, 160)),(./files/870.txt, (323, 392)),(./files/594.txt, (52, 100)),(./files/441.txt, (193, 361)),(./files/418.txt, (57, 132)),(./files/875.txt, (83, 125)),(./files/514.txt, (157, 0)),(./files/515.txt, (354, 210)),(./files/143.txt, (286, 31)),(./files/993.txt, (170, 8)),(./files/253.txt, (50, 178)),(./files/108.txt, (66, 419)),(./files/442.txt, (237, 73)),(./files/933.txt, (305, 371)),(./files/954.txt, (76, 261)),(./files/499.txt, (277, 359)),(./files/118.txt, (233, 100)),(./files/190.txt, (91, 66)),(./files/361.txt, (26, 129)),(./files/577.txt, (199, 81)),(./files/558.txt, (322, 417),(336, 413)),(./files/923.txt, (58, 161)),(./files/611.txt, (131, 243)),(./files/5.txt, (0, 
152)),(./files/227.txt, (105, 219)),(./files/348.txt, (102, 71)),(./files/98.txt, (86, 28)),(./files/598.txt, (248, 33)),(./files/475.txt, (208, 29)),(./files/55.txt, (12, 0)),(./files/762.txt, (28, 60))
ferociously, (./files/783.txt, (0, 15),(114, 201)),(./files/14.txt, (316, 63)),(./files/420.txt, (218, 227)),(./files/641.txt, (198, 166)),(./files/469.txt, (12, 29),(288, 7)),(./files/898.txt, (193, 132)),(./files/360.txt, (146, 108)),(./files/751.txt, (103, 225)),(./files/622.txt, (325, 258)),(./files/786.txt, (158, 11)),(./files/214.txt, (4, 102)),(./files/819.txt, (187, 52),(339, 36)),(./files/872.txt, (201, 96)),(./files/471.txt, (290, 218)),(./files/794.txt, (10, 74)),(./files/596.txt, (30, 173)),(./files/798.txt, (202, 0)),(./files/422.txt, (179, 330)),(./files/535.txt, (147, 75)),(./files/700.txt, (72, 277)),(./files/678.txt, (224, 303)),(./files/54.txt, (192, 13)),(./files/797.txt, (13, 54)),(./files/812.txt, (21, 209)),(./files/252.txt, (76, 195)),(./files/943.txt, (72, 0)),(./files/555.txt, (287, 88)),(./files/604.txt, (139, 0)),(./files/385.txt, (68, 148)),(./files/65.txt, (209, 44),(434, 359)),(./files/548.txt, (98, 41)),(./files/206.txt, (267, 93)),(./files/571.txt, (131, 270)),(./files/648.txt, (161, 140)),(./files/538.txt, (181, 277)),(./files/282.txt, (178, 0)),(./files/731.txt, (125, 125)),(./files/439.txt, (62, 251)),(./files/185.txt, (264, 191)),(./files/1.txt, (63, 243)),(./files/935.txt, (302, 118)),(./files/962.txt, (110, 160)),(./files/126.txt, (23, 23)),(./files/691.txt, (177, 44)),(./files/346.txt, (110, 202)),(./files/941.txt, (48, 313)),(./files/298.txt, (83, 228)),(./files/750.txt, (7, 106)),(./files/608.txt, (87, 0)),(./files/605.txt, (361, 62)),(./files/314.txt, (159, 206)),(./files/505.txt, (224, 233)),(./files/953.txt, (462, 13)),(./files/630.txt, (53, 43)),(./files/694.txt, (88, 23)),(./files/387.txt, (141, 295)),(./files/878.txt, (168, 272)),(./files/443.txt, (11, 400)),(./files/590.txt, (93, 393)),(./files/457.txt, (437, 48)),(./files/447.txt, (375, 314)),(./files/761.txt, (193, 69)),(./files/583.txt, (199, 19)),(./files/519.txt, (60, 253)),(./files/518.txt, (211, 10)),(./files/603.txt, (171, 120)),(./files/545.txt, (24, 
229)),(./files/198.txt, (137, 74)),(./files/18.txt, (183, 151),(234, 120)),(./files/870.txt, (329, 39)),(./files/883.txt, (62, 36)),(./files/625.txt, (108, 104)),(./files/115.txt, (119, 157)),(./files/86.txt, (158, 61)),(./files/192.txt, (88, 101)),(./files/426.txt, (206, 19),(285, 107)),(./files/930.txt, (1, 467)),(./files/263.txt, (148, 82)),(./files/964.txt, (425, 384)),(./files/82.txt, (146, 329)),(./files/281.txt, (293, 431)),(./files/125.txt, (443, 330)),(./files/710.txt, (231, 19)),(./files/948.txt, (80, 74)),(./files/423.txt, (231, 365)),(./files/809.txt, (406, 90)),(./files/41.txt, (256, 0),(259, 122),(283, 63)),(./files/89.txt, (107, 44)),(./files/888.txt, (296, 199)),(./files/958.txt, (125, 48)),(./files/166.txt, (125, 248)),(./files/995.txt, (35, 84)),(./files/455.txt, (112, 105)),(./files/473.txt, (61, 131)),(./files/946.txt, (306, 8)),(./files/475.txt, (157, 140))

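Optimization

Change the structure inside the for loop. Instead of appending each new string to one large string, append it to a list and join the list once at the end.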

def saveIndex(index):
    lines = []
    for word in index:
        index_line = []
        for filename in index[word]:
            index_line.append("({}, {})".format(filename, ",".join(map(str, index[word][filename]))))
        lines.append("{}, {}".format(word, ",".join(index_line)))

    f = open("index-file.txt", "w")
    f.write("\n".join(lines))
    f.close()

22s -> 16.8s

Total time: 16.7794 s
File: my.py
Function: saveIndex at line 42

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    42                                           @profile
    43                                           def saveIndex(index):
    44         1          1.0      1.0      0.0      lines = []
    45    109006      64301.0      0.6      0.4      for word in index:
    46    109005     119220.0      1.1      0.7          index_line = []
    47   6046737    3102933.0      0.5     18.5          for filename in index[word]:
    48   5937732   12655170.0      2.1     75.4              index_line.append("({}, {})".format(filename, ",".join(map(str, index[word][filename]))))
    49    109005     407966.0      3.7      2.4          lines.append("{}, {}".format(word, ",".join(index_line)))
    50                                           
    51         1      12943.0  12943.0      0.1      f = open("index-file.txt", "w")
    52         1     296342.0 296342.0      1.8      f.write("\n".join(lines))
    53         1     120562.0 120562.0      0.7      f.close()
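The two ways of building a line can be compared in isolation. The sketch below checks that they produce the same string and times them with timeit. The helper names and the synthetic `parts` list are ours, and the measured gap will depend on the interpreter and on the data, so no timings are quoted here.

```python
import timeit

def build_by_concat(parts):
    """Append every piece to one growing string, as in the original saveIndex."""
    line = ""
    glue = ""
    for part in parts:
        line += "{}{}".format(glue, part)
        glue = ","
    return line

def build_by_join(parts):
    """Collect the pieces in a list and join them once, as in the optimized version."""
    collected = []
    for part in parts:
        collected.append(part)
    return ",".join(collected)

# Synthetic pieces shaped like the per-file entries of the index.
parts = ["(./files/{}.txt, (0, {}))".format(i, i) for i in range(1000)]
same = build_by_concat(parts) == build_by_join(parts)
print(same)

if __name__ == "__main__":
    for fn in (build_by_concat, build_by_join):
        seconds = timeit.timeit(lambda: fn(parts), number=200)
        print(fn.__name__, round(seconds, 4))
```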

After optimizing these four main functions, the total time drops from 292.197s to 134.864s
