line_profiler Performance Profiling in Practice: Optimizing an Inverted Index


Summary: line_profiler in practice

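[If you are interested in algorithms, mathematics, or computing, feel free to follow me and read more original articles]
My website: 潮汐朝夕的生活实验室
My WeChat official account: 算法题刷刷
My Zhihu: 潮汐朝夕
My GitHub: FennelDumplings
My LeetCode: FennelDumplings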


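In the article "Python Profiling Basics", we covered the fundamentals and methodology of profiling. Building profiling into the development process helps us improve the quality of what we ship.

Then, in "Python Profilers: cProfile", we looked more closely at the cProfile profiler. In "Profiling and Optimization in Practice with cProfile", we practiced profiling and optimizing with cProfile on two examples: computing Fibonacci numbers, and reading a large CSV file to compute simple statistics.

cProfile profiles code at the function level, reporting the number of calls and the time spent in each function, which helps us improve the code from a whole-program view. After that, in "Python Profilers: line_profiler", we met line_profiler, which reports performance details for every line of code, and tried it on the Fibonacci example.

In this article we practice with line_profiler on a larger example: optimizing code that builds an inverted index.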

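A line_profiler example: the inverted index

How an inverted index works: scan the files ahead of time, split their contents into words, and store the mapping from each word to the files that contain it (sometimes together with the word's position).

With the index stored in a hash table, looking up a word then takes O(1) time on average, regardless of how many files were indexed.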
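A minimal sketch of the idea, assuming a plain in-memory mapping and two tiny documents. The names `docs` and `build_index` are illustrative only and are not part of the code profiled in this article:

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.split():
            index[word].add(name)
    return index

# Hypothetical sample documents, for illustration only.
docs = {
    "a.txt": "This is a file",
    "b.txt": "This is another file",
}
index = build_index(docs)

# One hash lookup per query; the documents are not rescanned.
print(sorted(index["file"]))     # ['a.txt', 'b.txt']
print(sorted(index["another"]))  # ['b.txt']
```

The cost of scanning is paid once, when the index is built; each query afterwards is a single dictionary lookup.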

The following example shows what we want to build. Suppose there are three files, ./files/1.txt, ./files/2.txt and ./files/3.txt, with the following contents.

./files/1.txt:

This is another file

./files/2.txt:

This is a file

./files/3.txt:

This is the third file
the another line

We want to obtain the following index:

This, (./files/2.txt, (0, 0)),(./files/1.txt, (0, 0)),(./files/3.txt, (0, 0))
is, (./files/2.txt, (0, 5)),(./files/1.txt, (0, 5)),(./files/3.txt, (0, 5))
a, (./files/2.txt, (0, 8))
file, (./files/2.txt, (0, 10)),(./files/1.txt, (0, 16)),(./files/3.txt, (0, 18))
another, (./files/1.txt, (0, 8)),(./files/3.txt, (1, 4))
the, (./files/3.txt, (0, 8),(1, 0))
third, (./files/3.txt, (0, 12))
line, (./files/3.txt, (1, 12))

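The format is: word, (file 1, (line number 1, offset in line 1), (line number 2, offset in line 2), ...), (file 2, (line number 1, offset in line 1), (line number 2, offset in line 2), ...), ...

In summary, we want code that computes the index positions in four steps:

Scan the given directory for files with the .txt suffix
Go through all the files and read all of their lines
Go through each line and tokenize it, returning each word together with its file, its line number, and its offset within the line
Write each word and its file positions to the result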
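To make the third step concrete, here is a small sketch of what "word plus offset within the line" means. It assumes words separated by single spaces and uses a running offset; `tokenize_line` is a hypothetical helper and is not how the code in this article computes offsets:

```python
def tokenize_line(line):
    """Yield (word, offset) pairs for a line of single-space-separated words."""
    offset = 0
    for word in line.split(" "):
        yield word, offset
        offset += len(word) + 1  # skip the word and the space after it

pairs = list(tokenize_line("This is the third file"))
print(pairs)
# [('This', 0), ('is', 5), ('the', 8), ('third', 12), ('file', 18)]
```

These offsets agree with the (0, 0), (0, 5), (0, 8), (0, 12) and (0, 18) entries listed for ./files/3.txt in the sample index.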

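Generating the data

Test data: 1000 files, each with a random number of lines between 1 and 500 and a random number of words per line between 1 and 50 (np.random.randint excludes the upper bound, so the actual maxima are 499 and 49). In the code below, words is the vocabulary, which contains 80000 words.

import numpy as np

# `words` is the vocabulary list (about 80000 words); loading it is not shown here.
N = len(words)  # assumed: N is the vocabulary size

# Files are written to the current directory; the indexing code below reads ./files.
for i in range(1000):
    f_name = str(i) + ".txt"
    n_line = np.random.randint(1, 500)
    lines = []
    for i_line in range(n_line):
        words_list = []
        n_word = np.random.randint(1, 50)
        for i_word in range(n_word):
            word = words[np.random.randint(N)]
            words_list.append(word)
        lines.append(" ".join(words_list))
    # The with block closes the file, which the original loop never did.
    with open(f_name, "w") as g:
        g.write("\n".join(lines))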

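Below, each version of the code is tested on the generated data.

Initial code

The main logic of this code is as follows:

First list all txt files in the target directory (get_file_names)
Then, for each file, return the list of its lines (read_file)
Go through the lines and update the dictionary (get_words); for each line:
    First store all words of the line, in order, in a list
    Go through all words in the list:
        Compute the word's offset within the line (get_offset_upto_word)
        Record in the dictionary: word -> list of (file, line number, offset)
Go through every word in the dictionary and convert its value from [a list of (file, line number, offset)] to [a dictionary of file -> list of (line number, offset)] (list2dict)
Go through every word in the dictionary and write its information to a file (saveIndex)

Note: to make the effect of the optimizations visible, the code is deliberately written badly.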

import sys
import os
import glob

def get_file_names(folder):
    return glob.glob("{}/*.txt".format(folder))

def read_file(filepath):
    f = open(filepath, "r")
    return f.readlines()

def get_offset_upto_word(words, index):
    if index == 0:
        return 0

    sub_list = words[0:index]
    length = 0
    for w in sub_list:
        length += len(w)
    return length + index

def list2dict(l):
    res = {}
    for item in l:
        if item[0] not in res:
            res[item[0]] = []
        res[item[0]].append((item[1], item[2]))
    return res

def get_words(lines, filename, word_index_dict):
    STRIP_CHARS = ",.\t\n |"

    for line_idx, line in enumerate(lines):
        line = line.strip(STRIP_CHARS)
        local_words = line.split(" ")
        for idx, word in enumerate(local_words):
            word = word.strip(STRIP_CHARS)
            if word_index_dict.get(word) == None:
                word_index_dict[word] = []

            offset = get_offset_upto_word(local_words, idx)
            word_index_dict[word].append([filename, line_idx, offset])
    return word_index_dict

def saveIndex(index):
    lines = []
    for word in index:
        index_line = ""
        glue = ""
        for filename in index[word]:
            index_line += "{}({}, {})".format(glue, filename, ",".join(map(str, index[word][filename])))
            glue = ","
        lines.append("{}, {}".format(word, index_line))

    f = open("index-file.txt", "w")
    f.write("\n".join(lines))
    f.close()

def main():
    files = get_file_names("./files")
    words = {}
    for f in files:
        lines = read_file(f)
        words = get_words(lines, f, words)
    for word in words:
        words[word] = list2dict(words[word])
    saveIndex(words)

if __name__ == "__main__":
    main()

The profile results are as follows:

Wrote profile results to my.py.lprof
Timer unit: 1e-06 s
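The header above is what `kernprof -l -v my.py` prints. With `-l`, kernprof injects a `profile` decorator into builtins, so a script decorated with `@profile` raises a NameError when it is run with plain `python`. A common workaround, shown here as a sketch that is not part of the article's code, is a no-op fallback at the top of the script:

```python
try:
    profile  # defined by kernprof when the script runs under `kernprof -l`
except NameError:
    def profile(func):
        """Fallback: leave the function unchanged when not profiling."""
        return func

@profile
def add_word_length(curr, w):
    return len(w) + curr

print(add_word_length(0, "file"))  # 4
```

With this guard the same file can be run normally and under the profiler without editing the decorators in and out.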

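Let us look at the profile of each function in turn.

(1) get_offset_upto_word

This function takes a lot of time (about 90 s in total, roughly 87% of it in the two lines of the for loop), so it is a target for optimization.
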
Total time: 90.3461 s
File: my.py
Function: get_offset_upto_word at line 16

Line # Hits Time Per Hit % Time Line Contents
==============================================================
16 @profile
17 def get_offset_upto_word(words, index):
18 6232971 2434606.0 0.4 2.7 if index == 0:
19 280776 94231.0 0.3 0.1 return 0
20
21 5952195 4741522.0 0.8 5.2 sub_list = words[0:index]
22 5952195 1866914.0 0.3 2.1 length = 0
23 104914808 41056776.0 0.4 45.4 for w in sub_list:
24 98962613 38065444.0 0.4 42.1 length += len(w)
25 5952195 2086572.0 0.4 2.3 return length + index

(2) get_words

This function does a lot of work and takes the longest. It contains two nested for loops, and 87.2% of its time is spent in the call to get_offset_upto_word.

Total time: 193.104 s
File: my.py
Function: get_words at line 36

Line # Hits Time Per Hit % Time Line Contents
==============================================================
36 @profile
37 def get_words(lines, filename, word_index_dict):
38 1001 649.0 0.6 0.0 STRIP_CHARS = ",.\t\n |"
39
40 281777 220057.0 0.8 0.1 for line_idx, line in enumerate(lines):
41 280776 297706.0 1.1 0.2 line = line.strip(STRIP_CHARS)
42 280776 830554.0 3.0 0.4 local_words = line.split(" ")
43 6513747 3297521.0 0.5 1.7 for idx, word in enumerate(local_words):
44 6232971 3798246.0 0.6 2.0 word = word.strip(STRIP_CHARS)
45 6232971 9106854.0 1.5 4.7 if word_index_dict.get(word) == None:
46 109005 71077.0 0.7 0.0 word_index_dict[word] = []
47
48 6232971 168457939.0 27.0 87.2 offset = get_offset_upto_word(local_words, idx)
49 6232971 7023140.0 1.1 3.6 word_index_dict[word].append([filename, line_idx, offset])
50 1001 451.0 0.5 0.0 return word_index_dict

(3) list2dict

Converts a list of arrays into a dictionary, using the first element of each array as the key.

Total time: 42.5283 s
File: my.py
Function: list2dict at line 27

Line # Hits Time Per Hit % Time Line Contents
==============================================================
27 @profile
28 def list2dict(l):
29 109005 59074.0 0.5 0.1 res = {}
30 6341976 31511231.0 5.0 74.1 for item in l:
31 6232971 3994870.0 0.6 9.4 if item[0] not in res:
32 5937732 2849036.0 0.5 6.7 res[item[0]] = []
33 6232971 4075813.0 0.7 9.6 res[item[0]].append((item[1], item[2]))
34 109005 38239.0 0.4 0.1 return res

(4) read_file

It only reads the lines of a file and returns them as a list; there is nothing worth optimizing.

Total time: 0.22231 s
File: my.py
Function: read_file at line 11

Line # Hits Time Per Hit % Time Line Contents
==============================================================
11 @profile
12 def read_file(filepath):
13 1001 82928.0 82.8 37.3 f = open(filepath, "r")
14 1001 139382.0 139.2 62.7 return f.readlines()

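(5) getFileNames

Gets all txt files in the target directory (this is get_file_names in the listing above; the profiled run used the name getFileNames). There is nothing to optimize.
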
Total time: 0.011098 s
File: my.py
Function: getFileNames at line 7

Line # Hits Time Per Hit % Time Line Contents
==============================================================
7 @profile
8 def getFileNames(folder):
9 1 11098.0 11098.0 100.0 return glob.glob("{}/*.txt".format(folder))

(6) saveIndex

Writes the inverted index to a file.

Total time: 21.9244 s
File: my.py
Function: saveIndex at line 52

Line # Hits Time Per Hit % Time Line Contents
==============================================================
52 @profile
53 def saveIndex(index):
54 1 1.0 1.0 0.0 lines = []
55 109006 68024.0 0.6 0.3 for word in index:
56 109005 61270.0 0.6 0.3 index_line = ""
57 109005 51621.0 0.5 0.2 glue = ""
58 6046737 3226772.0 0.5 14.7 for filename in index[word]:
59 5937732 14828832.0 2.5 67.6 index_line += "{}({}, {})".format(glue, filename, ",".join(map(str, index[word][filename])))
60 5937732 3050887.0 0.5 13.9 glue = ","
61 109005 274289.0 2.5 1.3 lines.append("{}, {}".format(word, index_line))
62
63 1 11916.0 11916.0 0.1 f = open("index-file.txt", "w")
64 1 273133.0 273133.0 1.2 f.write("\n".join(lines))
65 1 77619.0 77619.0 0.4 f.close()

(7) main

The main function main() mostly calls the other functions. It adds no cost of its own and needs no optimization.

Total time: 292.197 s
File: my.py
Function: main at line 67

Line # Hits Time Per Hit % Time Line Contents
==============================================================
67 @profile
68 def main():
69 1 11114.0 11114.0 0.0 files = getFileNames("./files")
70 1 1.0 1.0 0.0 words = {}
71 1002 795.0 0.8 0.0 for f in files:
72 1001 283130.0 282.8 0.1 lines = read_file(f)
73 1001 207631960.0 207424.5 71.1 words = get_words(lines, f, words)
74 109006 61103.0 0.6 0.0 for word in words:
75 109005 51880454.0 475.9 17.8 words[word] = list2dict(words[word])
76 1 32328905.0 32328905.0 11.1 saveIndex(words)
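As a reading aid, the columns of these tables can be recomputed from one another. The sketch below takes the get_words call line from the main() table and derives its Per Hit and % Time values, assuming only the "Timer unit: 1e-06 s" header shown earlier; the variable names are illustrative:

```python
TIMER_UNIT = 1e-06  # seconds, from the "Timer unit" line of the report

total_time_s = 292.197     # "Total time" of main()
hits = 1001                # Hits for the get_words call line
time_units = 207631960.0   # Time for that line, in timer units

line_time_s = time_units * TIMER_UNIT        # Time converted to seconds
per_hit = time_units / hits                  # the "Per Hit" column
percent = 100 * line_time_s / total_time_s   # the "% Time" column

print(round(line_time_s, 1))  # 207.6
print(round(per_hit, 1))      # 207424.5
print(round(percent, 1))      # 71.1
```

Note that the time attributed to a call line includes everything the callee does. That is why the get_words line in main() shows about 207.6 s while the get_words table itself reports a total of 193.104 s; the gap is presumably profiling overhead.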

Optimizing the code

(1) get_offset_upto_word

Original code

Most of its lines simply add up the lengths of the words.

def get_offset_upto_word(words, index):
    if index == 0:
        return 0

    sub_list = words[0:index]
    length = 0
    for w in sub_list:
        length += len(w)
    return length + index

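Optimization 1

Use reduce to add up the lengths of the words in the list. (In Python 3, reduce must be imported from functools.)

from functools import reduce

def get_offset_upto_word(words, index):
    if index == 0:
        return 0
    length = reduce(lambda curr, w: len(w) + curr, words[0:index], 0)
    return length + index

This only removes the redundant variable assignments and lookups. The time drops from 90 s to 55 s.
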
Total time: 55.8104 s
File: my.py
Function: get_offset_upto_word at line 17

Line # Hits Time Per Hit % Time Line Contents
==============================================================
17 @profile
18 def get_offset_upto_word(words, index):
19 6232971 3345346.0 0.5 6.0 if index == 0:
20 280776 137728.0 0.5 0.2 return 0
21
22 5952195 49251859.0 8.3 88.2 length = reduce(lambda curr, w: len(w) + curr, words[0:index], 0)
23 # sub_list = words[0:index]
24 # length = 0
25 # for w in sub_list:
26 # length += len(w)
27 5952195 3075432.0 0.5 5.5 return length + index

Optimization 2

Every call to get_offset_upto_word creates a new function object from the lambda expression. We replace the lambda with a function defined ahead of time.

def add_word_length(curr, w):
    return len(w) + curr

@profile
def get_offset_upto_word(words, index):
    if index == 0:
        return 0

    length = reduce(add_word_length, words[0:index], 0)
    return length + index

55 s -> 51 s

Total time: 51.495 s
File: my.py
Function: get_offset_upto_word at line 20

Line # Hits Time Per Hit % Time Line Contents
==============================================================
20 @profile
21 def get_offset_upto_word(words, index):
22 6232971 2372675.0 0.4 4.6 if index == 0:
23 280776 90270.0 0.3 0.2 return 0
24
25 5952195 46799252.0 7.9 90.9 length = reduce(add_word_length, words[0:index], 0)
26 5952195 2232821.0 0.4 4.3 return length + index

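Optimization 3

The first two lines of the function still account for a large number of unwanted hits.

The if check is unnecessary: when index is 0 the slice is empty, so reduce returns its initial value 0 and the result is still 0. The length variable is unnecessary too: we can return the sum of the length and the index directly.

def get_offset_upto_word(words, index):
    return reduce(add_word_length, words[0:index], 0) + index

51 s -> 43 s
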
Total time: 42.9398 s
File: my.py
Function: get_offset_upto_word at line 22

Line # Hits Time Per Hit % Time Line Contents
==============================================================
22 @profile
23 def get_offset_upto_word(words, index):
24 6232971 42939762.0 6.9 100.0 return reduce(add_word_length, words[0:index], 0) + index
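Before trusting the timings, it is worth confirming that the rewrite did not change the results. The sketch below compares the original loop with the final reduce one-liner on one of the sample lines; the names `offset_loop` and `offset_reduce` are introduced here only to keep both versions side by side:

```python
from functools import reduce

def offset_loop(words, index):
    """Original version: explicit loop over the preceding words."""
    if index == 0:
        return 0
    length = 0
    for w in words[0:index]:
        length += len(w)
    return length + index

def add_word_length(curr, w):
    return len(w) + curr

def offset_reduce(words, index):
    """Final version: a single reduce expression."""
    return reduce(add_word_length, words[0:index], 0) + index

words = "This is the third file".split(" ")
offsets = [offset_reduce(words, i) for i in range(len(words))]
print(offsets)  # [0, 5, 8, 12, 18]
assert offsets == [offset_loop(words, i) for i in range(len(words))]
```

Both versions still rescan the preceding words on every call, so the work per line grows quadratically with the number of words; the optimizations above only reduce the constant factor.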

(2) get_words

Original code

The call to get_offset_upto_word takes the most time.

def get_words(lines, filename, word_index_dict):
    STRIP_CHARS = ",.\t\n |"

    for line_idx, line in enumerate(lines):
        line = line.strip(STRIP_CHARS)
        local_words = line.split(" ")
        for idx, word in enumerate(local_words):
            word = word.strip(STRIP_CHARS)
            if word_index_dict.get(word) == None:
                word_index_dict[word] = []

            offset = get_offset_upto_word(local_words, idx)
            word_index_dict[word].append([filename, line_idx, offset])
    return word_index_dict

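Optimization 1

Besides that, the function uses a dictionary, word_index_dict. Before inserting a new key, it has to check whether the key already exists; a defaultdict removes this check.

Change words = {} in main to words = defaultdict(list) (imported from collections), then delete the following two lines from get_words.

if word_index_dict.get(word) == None:
    word_index_dict[word] = []

193 s -> 75 s. Most of this drop comes from the faster get_offset_upto_word, whose call line is timed inside get_words (about 168 s before, about 57 s now).
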
Total time: 75.5337 s
File: my.py
Function: get_words at line 34

Line # Hits Time Per Hit % Time Line Contents
==============================================================
34 @profile
35 def get_words(lines, filename, word_index_dict):
36 1001 715.0 0.7 0.0 STRIP_CHARS = ",.\t\n |"
37
38 281777 195820.0 0.7 0.3 for line_idx, line in enumerate(lines):
39 280776 273028.0 1.0 0.4 line = line.strip(STRIP_CHARS)
40 280776 705464.0 2.5 0.9 local_words = line.split(" ")
41 6513747 3091355.0 0.5 4.1 for idx, word in enumerate(local_words):
42 6232971 3538385.0 0.6 4.7 word = word.strip(STRIP_CHARS)
43
44 6232971 56992240.0 9.1 75.5 offset = get_offset_upto_word(local_words, idx)
45 6232971 10736346.0 1.7 14.2 word_index_dict[word].append([filename, line_idx, offset])
46 1001 396.0 0.4 0.0 return word_index_dict
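The effect of the defaultdict change can be seen on a few hand-written postings. This is an illustrative sketch; the `postings` sample is made up from the small example at the start of the article:

```python
from collections import defaultdict

# (word, filename, line index, offset) tuples, for illustration only.
postings = [
    ("the", "./files/3.txt", 0, 8),
    ("the", "./files/3.txt", 1, 0),
    ("line", "./files/3.txt", 1, 12),
]

# Plain dict: every insertion needs an explicit existence check.
plain = {}
for word, filename, line_idx, offset in postings:
    if word not in plain:
        plain[word] = []
    plain[word].append([filename, line_idx, offset])

# defaultdict(list): a missing key gets an empty list on first access.
auto = defaultdict(list)
for word, filename, line_idx, offset in postings:
    auto[word].append([filename, line_idx, offset])

print(auto == plain)  # True
```

One thing to keep in mind: merely reading a missing key with `auto[key]` inserts it, so read-only lookups on the finished index should use `key in auto` or `auto.get(key)`.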

Optimization 2

The optimized get_offset_upto_word is a single line and is only called from get_words, so we write it directly into get_words.

def get_words(lines, filename, word_index_dict):
    STRIP_CHARS = ",.\t\n |"
    for line_idx, line in enumerate(lines):
        line = line.strip(STRIP_CHARS)
        local_words = line.split(" ")
        for idx, word in enumerate(local_words):
            word = word.strip(STRIP_CHARS)

            offset = reduce(add_word_length, local_words[0:idx], 0) + idx
            word_index_dict[word].append([filename, line_idx, offset])
    return word_index_dict

The time drops from about 75 s to about 61 s. Note, however, that if the function is also called elsewhere, inlining it this way makes the code harder to maintain.

Total time: 61.1247 s
File: my.py
Function: get_words at line 31

Line # Hits Time Per Hit % Time Line Contents
==============================================================
31 @profile
32 def get_words(lines, filename, word_index_dict):
33 1001 973.0 1.0 0.0 STRIP_CHARS = ",.\t\n |"
34 281777 188492.0 0.7 0.3 for line_idx, line in enumerate(lines):
35 280776 258588.0 0.9 0.4 line = line.strip(STRIP_CHARS)
36 280776 659000.0 2.3 1.1 local_words = line.split(" ")
37 6513747 3096649.0 0.5 5.1 for idx, word in enumerate(local_words):
38 6232971 3557934.0 0.6 5.8 word = word.strip(STRIP_CHARS)
39
40 6232971 42761007.0 6.9 70.0 offset = reduce(add_word_length, local_words[0:idx], 0) + idx
41 6232971 10601646.0 1.7 17.3 word_index_dict[word].append([filename, line_idx, offset])
42 1001 435.0 0.4 0.0 return word_index_dict

(3) list2dict

Change the dict to a defaultdict.

def list2dict(l):
    res = defaultdict(list)
    for item in l:
        res[item[0]].append((item[1], item[2]))
    return res

42 s -> 25 s

Total time: 25.5687 s
File: my.py
Function: list2dict at line 22

Line # Hits Time Per Hit % Time Line Contents
==============================================================
22 @profile
23 def list2dict(l):
24 109005 105236.0 1.0 0.4 res = defaultdict(list)
25 6341976 6620497.0 1.0 25.9 for item in l:
26 6232971 18803843.0 3.0 73.5 res[item[0]].append((item[1], item[2]))
27 109005 39078.0 0.4 0.2 return res
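For reference, this is what the optimized list2dict produces for the two occurrences of "the" in ./files/3.txt from the small example. The `entries` list is written by hand here to mirror what get_words would have collected:

```python
from collections import defaultdict

def list2dict(l):
    res = defaultdict(list)
    for item in l:
        res[item[0]].append((item[1], item[2]))
    return res

# [filename, line index, offset] entries, as appended by get_words.
entries = [["./files/3.txt", 0, 8], ["./files/3.txt", 1, 0]]
grouped = list2dict(entries)
print(dict(grouped))  # {'./files/3.txt': [(0, 8), (1, 0)]}
```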

(4) saveIndex

Original code

def saveIndex(index):
    lines = []
    for word in index:
        index_line = ""
        glue = ""
        for filename in index[word]:
            index_line += "{}({}, {})".format(glue, filename, ",".join(map(str, index[word][filename])))
            glue = ","
        lines.append("{}, {}".format(word, index_line))

    f = open("index-file.txt", "w")
    f.write("\n".join(lines))
    f.close()

The resulting file has the following format:

outlook, (./files/783.txt, (0, 0),(332, 353)),(./files/359.txt, (370, 158)),(./files/934.txt, (9, 259)),(./files/24.txt, (6, 37)),(./files/996.txt, (63, 386)),(./files/187.txt, (305, 178)),(./files/386.txt, (229, 0)),(./files/474.txt, (360, 247)),(./files/840.txt, (25, 92)),(./files/425.txt, (2, 371)),(./files/999.txt, (191, 165)),(./files/335.txt, (9, 21)),(./files/794.txt, (26, 74)),(./files/980.txt, (277, 424)),(./files/63.txt, (164, 136)),(./files/235.txt, (208, 137)),(./files/658.txt, (90, 62),(357, 153)),(./files/562.txt, (52, 38)),(./files/241.txt, (24, 310)),(./files/678.txt, (11, 221)),(./files/537.txt, (210, 59)),(./files/776.txt, (55, 301)),(./files/604.txt, (9, 273)),(./files/121.txt, (11, 64)),(./files/919.txt, (102, 41)),(./files/206.txt, (105, 39)),(./files/200.txt, (326, 178)),(./files/19.txt, (26, 146)),(./files/856.txt, (283, 51)),(./files/703.txt, (25, 22)),(./files/305.txt, (5, 8)),(./files/106.txt, (121, 239)),(./files/248.txt, (85, 238)),(./files/680.txt, (129, 56)),(./files/70.txt, (329, 160)),(./files/880.txt, (338, 19)),(./files/417.txt, (92, 358)),(./files/312.txt, (0, 202)),(./files/694.txt, (313, 89)),(./files/173.txt, (1, 241)),(./files/665.txt, (38, 198)),(./files/859.txt, (215, 99)),(./files/815.txt, (79, 34)),(./files/842.txt, (350, 382),(422, 312)),(./files/546.txt, (57, 338)),(./files/18.txt, (166, 114)),(./files/997.txt, (305, 78)),(./files/772.txt, (294, 79)),(./files/176.txt, (4, 64)),(./files/122.txt, (32, 187)),(./files/782.txt, (190, 186)),(./files/392.txt, (73, 242)),(./files/395.txt, (67, 159)),(./files/442.txt, (358, 174)),(./files/754.txt, (49, 21)),(./files/672.txt, (130, 83)),(./files/659.txt, (424, 173)),(./files/74.txt, (322, 73)),(./files/707.txt, (423, 0)),(./files/243.txt, (15, 298)),(./files/552.txt, (78, 313)),(./files/307.txt, (20, 281)),(./files/26.txt, (317, 146)),(./files/663.txt, (347, 269)),(./files/858.txt, (122, 141)),(./files/549.txt, (1, 51))
jabots, (./files/783.txt, (0, 8)),(./files/898.txt, (383, 234)),(./files/751.txt, (51, 191)),(./files/386.txt, (20, 38)),(./files/208.txt, (8, 97)),(./files/79.txt, (284, 24)),(./files/226.txt, (129, 232)),(./files/269.txt, (138, 35)),(./files/597.txt, (235, 0)),(./files/835.txt, (97, 81)),(./files/12.txt, (82, 31)),(./files/189.txt, (83, 188)),(./files/152.txt, (240, 173)),(./files/889.txt, (449, 66)),(./files/59.txt, (105, 408)),(./files/497.txt, (29, 187)),(./files/548.txt, (283, 209)),(./files/204.txt, (140, 46)),(./files/245.txt, (19, 101)),(./files/283.txt, (141, 212)),(./files/276.txt, (199, 276)),(./files/748.txt, (157, 214)),(./files/273.txt, (315, 31)),(./files/585.txt, (165, 98)),(./files/1.txt, (21, 33)),(./files/814.txt, (35, 182)),(./files/126.txt, (280, 272)),(./files/746.txt, (35, 84)),(./files/94.txt, (134, 151)),(./files/272.txt, (119, 257)),(./files/924.txt, (89, 109)),(./files/721.txt, (129, 0)),(./files/421.txt, (177, 90)),(./files/462.txt, (26, 31)),(./files/503.txt, (124, 61)),(./files/694.txt, (135, 176)),(./files/586.txt, (46, 11)),(./files/396.txt, (77, 242)),(./files/113.txt, (313, 80)),(./files/57.txt, (22, 99)),(./files/464.txt, (133, 154)),(./files/631.txt, (212, 31)),(./files/761.txt, (29, 125)),(./files/357.txt, (171, 329)),(./files/879.txt, (61, 160)),(./files/870.txt, (323, 392)),(./files/594.txt, (52, 100)),(./files/441.txt, (193, 361)),(./files/418.txt, (57, 132)),(./files/875.txt, (83, 125)),(./files/514.txt, (157, 0)),(./files/515.txt, (354, 210)),(./files/143.txt, (286, 31)),(./files/993.txt, (170, 8)),(./files/253.txt, (50, 178)),(./files/108.txt, (66, 419)),(./files/442.txt, (237, 73)),(./files/933.txt, (305, 371)),(./files/954.txt, (76, 261)),(./files/499.txt, (277, 359)),(./files/118.txt, (233, 100)),(./files/190.txt, (91, 66)),(./files/361.txt, (26, 129)),(./files/577.txt, (199, 81)),(./files/558.txt, (322, 417),(336, 413)),(./files/923.txt, (58, 161)),(./files/611.txt, (131, 243)),(./files/5.txt, (0, 
152)),(./files/227.txt, (105, 219)),(./files/348.txt, (102, 71)),(./files/98.txt, (86, 28)),(./files/598.txt, (248, 33)),(./files/475.txt, (208, 29)),(./files/55.txt, (12, 0)),(./files/762.txt, (28, 60))
ferociously, (./files/783.txt, (0, 15),(114, 201)),(./files/14.txt, (316, 63)),(./files/420.txt, (218, 227)),(./files/641.txt, (198, 166)),(./files/469.txt, (12, 29),(288, 7)),(./files/898.txt, (193, 132)),(./files/360.txt, (146, 108)),(./files/751.txt, (103, 225)),(./files/622.txt, (325, 258)),(./files/786.txt, (158, 11)),(./files/214.txt, (4, 102)),(./files/819.txt, (187, 52),(339, 36)),(./files/872.txt, (201, 96)),(./files/471.txt, (290, 218)),(./files/794.txt, (10, 74)),(./files/596.txt, (30, 173)),(./files/798.txt, (202, 0)),(./files/422.txt, (179, 330)),(./files/535.txt, (147, 75)),(./files/700.txt, (72, 277)),(./files/678.txt, (224, 303)),(./files/54.txt, (192, 13)),(./files/797.txt, (13, 54)),(./files/812.txt, (21, 209)),(./files/252.txt, (76, 195)),(./files/943.txt, (72, 0)),(./files/555.txt, (287, 88)),(./files/604.txt, (139, 0)),(./files/385.txt, (68, 148)),(./files/65.txt, (209, 44),(434, 359)),(./files/548.txt, (98, 41)),(./files/206.txt, (267, 93)),(./files/571.txt, (131, 270)),(./files/648.txt, (161, 140)),(./files/538.txt, (181, 277)),(./files/282.txt, (178, 0)),(./files/731.txt, (125, 125)),(./files/439.txt, (62, 251)),(./files/185.txt, (264, 191)),(./files/1.txt, (63, 243)),(./files/935.txt, (302, 118)),(./files/962.txt, (110, 160)),(./files/126.txt, (23, 23)),(./files/691.txt, (177, 44)),(./files/346.txt, (110, 202)),(./files/941.txt, (48, 313)),(./files/298.txt, (83, 228)),(./files/750.txt, (7, 106)),(./files/608.txt, (87, 0)),(./files/605.txt, (361, 62)),(./files/314.txt, (159, 206)),(./files/505.txt, (224, 233)),(./files/953.txt, (462, 13)),(./files/630.txt, (53, 43)),(./files/694.txt, (88, 23)),(./files/387.txt, (141, 295)),(./files/878.txt, (168, 272)),(./files/443.txt, (11, 400)),(./files/590.txt, (93, 393)),(./files/457.txt, (437, 48)),(./files/447.txt, (375, 314)),(./files/761.txt, (193, 69)),(./files/583.txt, (199, 19)),(./files/519.txt, (60, 253)),(./files/518.txt, (211, 10)),(./files/603.txt, (171, 120)),(./files/545.txt, (24, 
229)),(./files/198.txt, (137, 74)),(./files/18.txt, (183, 151),(234, 120)),(./files/870.txt, (329, 39)),(./files/883.txt, (62, 36)),(./files/625.txt, (108, 104)),(./files/115.txt, (119, 157)),(./files/86.txt, (158, 61)),(./files/192.txt, (88, 101)),(./files/426.txt, (206, 19),(285, 107)),(./files/930.txt, (1, 467)),(./files/263.txt, (148, 82)),(./files/964.txt, (425, 384)),(./files/82.txt, (146, 329)),(./files/281.txt, (293, 431)),(./files/125.txt, (443, 330)),(./files/710.txt, (231, 19)),(./files/948.txt, (80, 74)),(./files/423.txt, (231, 365)),(./files/809.txt, (406, 90)),(./files/41.txt, (256, 0),(259, 122),(283, 63)),(./files/89.txt, (107, 44)),(./files/888.txt, (296, 199)),(./files/958.txt, (125, 48)),(./files/166.txt, (125, 248)),(./files/995.txt, (35, 84)),(./files/455.txt, (112, 105)),(./files/473.txt, (61, 131)),(./files/946.txt, (306, 8)),(./files/475.txt, (157, 140))

Optimization

Change the inner structure of the for loop: instead of appending each new string to one big string, append it to a list and join the list once at the end.

def saveIndex(index):
    lines = []
    for word in index:
        index_line = []
        for filename in index[word]:
            index_line.append("({}, {})".format(filename, ",".join(map(str, index[word][filename]))))
        lines.append("{}, {}".format(word, ",".join(index_line)))

    f = open("index-file.txt", "w")
    f.write("\n".join(lines))
    f.close()

22 s -> 16.8 s

Total time: 16.7794 s
File: my.py
Function: saveIndex at line 42

Line # Hits Time Per Hit % Time Line Contents
==============================================================
42 @profile
43 def saveIndex(index):
44 1 1.0 1.0 0.0 lines = []
45 109006 64301.0 0.6 0.4 for word in index:
46 109005 119220.0 1.1 0.7 index_line = []
47 6046737 3102933.0 0.5 18.5 for filename in index[word]:
48 5937732 12655170.0 2.1 75.4 index_line.append("({}, {})".format(filename, ",".join(map(str, index[word][filename]))))
49 109005 407966.0 3.7 2.4 lines.append("{}, {}".format(word, ",".join(index_line)))
50
51 1 12943.0 12943.0 0.1 f = open("index-file.txt", "w")
52 1 296342.0 296342.0 1.8 f.write("\n".join(lines))
53 1 120562.0 120562.0 0.7 f.close()
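The list-and-join pattern can be checked on a single entry. The sketch below pulls the per-word formatting into a hypothetical helper, `format_entry`, which is not in the article's code, and reproduces the line for "the" from the small example:

```python
def format_entry(word, files):
    """Format one index line; `files` maps filename -> list of (line, offset)."""
    parts = []
    for filename, positions in files.items():
        parts.append("({}, {})".format(filename, ",".join(map(str, positions))))
    # Join once at the end instead of growing a string inside the loop.
    return "{}, {}".format(word, ",".join(parts))

line = format_entry("the", {"./files/3.txt": [(0, 8), (1, 0)]})
print(line)  # the, (./files/3.txt, (0, 8),(1, 0))
```

Using join also removes the need for the glue variable, since the separator is only ever placed between elements.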

After optimizing the four main functions, the total time dropped from 292.197 s to 134.864 s.

