Getting Started with HanLP


Abstract: This article introduces the HanLP NLP framework, covering the 1.x and 2.x lines, the Java and Python interfaces, and the RESTful and native APIs.

[If you are interested in algorithms, mathematics, and computer science, feel free to follow me and read more original articles.]
My website: 潮汐朝夕的生活实验室
My WeChat official account: 算法题刷刷
My Zhihu: 潮汐朝夕
My GitHub: FennelDumplings
My LeetCode: FennelDumplings


$0 Basic Information about the HanLP Project

HanLP is a multilingual natural language processing toolkit aimed at production environments. It runs on a dual PyTorch / TensorFlow 2.x engine, with the goal of bringing state-of-the-art NLP techniques into real-world use. HanLP aims to be feature-complete, accurate, efficient, trained on up-to-date corpora, cleanly architected, and customizable.

HanLP provides two kinds of API, RESTful and native, targeting lightweight and large-scale scenarios respectively. Whatever the API and whatever the language, the HanLP interfaces stay semantically consistent, and the code stays open source.

The lightweight RESTful API is only a few KB and suits agile development, mobile apps, and similar scenarios. It is simple to use, needs no GPU or environment setup, and installs in seconds. Python, Go, and Java are supported; see the repository for installation instructions and samples for each language.

The large-scale native API relies on deep learning stacks such as PyTorch and TensorFlow and is aimed at professional NLP engineers, researchers, and local workloads over massive data. It requires Python 3.6 to 3.9; Windows is supported and Linux is recommended. It can run on CPU, though GPU/TPU is recommended. Install the PyTorch edition with: pip install hanlp. HanLP publishes two kinds of models: multi-task models, which are faster and use less GPU memory, and single-task models, which are more accurate and more flexible.
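
As a quick illustration of that consistency, here is a minimal sketch that runs the same sentence through both APIs. The client constructor and the pretrained model constant are the ones used later in this article; the only point being made is that both objects are called like functions.

# Minimal sketch: the RESTful client and a native model share the same "call it like a function" shape.
from hanlp_restful import HanLPClient
import hanlp

text = "阿婆主来到北京立方庭参观自然语义科技公司。"

HanLP_restful = HanLPClient("https://www.hanlp.com/api", auth=None, language="zh")
print(HanLP_restful.parse(text))   # analysis runs on the HanLP server

HanLP_native = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)
print(HanLP_native(text))          # the multi-task model runs locally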


$1 Lightweight RESTful API

$1-1 Python Interface

pip install hanlp_restful

Create a client, filling in the server URL and your API key:

from hanlp_restful import HanLPClient
HanLP = HanLPClient("https://www.hanlp.com/api", auth=None, language="zh", timeout=10)

Call the parse interface with a document to get HanLP's analysis.

result = HanLP.parse("2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。")
print(result)

The result is as follows:

{
"tok/fine": [
["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次", "世代", "最", "先进", "的", "多", "语种", "NLP", "技术", "。"],
["阿婆主", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司", "。"]
],
"tok/coarse": [
["2021年", "HanLPv2.1", "为", "生产环境", "带来", "次世代", "最", "先进", "的", "多语种", "NLP", "技术", "。"],
["阿婆主", "来到", "北京", "立方庭", "参观", "自然语义科技公司", "。"]
],
"pos/ctb": [
["NT", "URL", "P", "NN", "NN", "VV", "M", "NN", "AD", "VA", "DEC", "CD", "NN", "NN", "NN", "PU"],
["NN", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN", "PU"]
],
"pos/pku": [
["t", "nx", "p", "vn", "n", "v", "q", "n", "d", "a", "u", "a", "n", "nx", "n", "w"],
["n", "v", "ns", "ns", "v", "n", "n", "n", "n", "w"]
],
"pos/863": [
["nt", "w", "p", "v", "n", "v", "q", "nt", "d", "a", "u", "a", "n", "w", "n", "w"],
["n", "v", "ns", "n", "v", "n", "n", "n", "n", "w"]
],
"ner/msra": [
[["2021年", "DATE", 0, 1], ["HanLPv2.1", "ORGANIZATION", 1, 2]],
[["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
],
"ner/pku": [
[],
[["北京", "ns", 2, 3], ["立方庭", "ns", 3, 4], ["自然语义科技公司", "nt", 5, 9]]
],
"ner/ontonotes": [
[["2021年", "DATE", 0, 1], ["次世代", "DATE", 6, 8]],
[["北京", "FAC", 2, 3], ["立方庭", "LOC", 3, 4], ["自然语义科技公司", "ORG", 5, 9]]
],
"srl": [
[[["2021年", "ARGM-TMP", 0, 1], ["HanLPv2.1", "ARG0", 1, 2], ["为生产环境", "ARG2", 2, 5], ["带来", "PRED", 5, 6], ["次世代最先进的多语种NLP技术", "ARG1", 6, 15]], [["次世代", "ARGM-TMP", 6, 8], ["最", "ARGM-ADV", 8, 9], ["先进", "PRED", 9, 10], ["NLP技术", "ARG0", 13, 15]]],
[[["阿婆主", "ARG0", 0, 1], ["来到", "PRED", 1, 2], ["北京立方庭", "ARG1", 2, 4]], [["阿婆主", "ARG0", 0, 1], ["参观", "PRED", 4, 5], ["自然语义科技公司", "ARG1", 5, 9]]]
],
"dep": [
[[6, "tmod"], [6, "nsubj"], [6, "prep"], [5, "nn"], [3, "pobj"], [0, "root"], [8, "clf"], [10, "dep"], [10, "advmod"], [15, "rcmod"], [10, "cpm"], [13, "nummod"], [15, "nn"], [15, "nn"], [6, "dobj"], [6, "punct"]],
[[2, "nsubj"], [0, "root"], [4, "nn"], [2, "dobj"], [2, "conj"], [9, "nn"], [9, "nn"], [9, "nn"], [5, "dobj"], [2, "punct"]]
],
"sdp": [
[[[6, "Time"]], [[6, "Agt"]], [[5, "mPrep"]], [[5, "Desc"]], [[6, "Datv"]], [[0, "Root"]], [[8, "Qp"]], [[15, "TDur"]], [[10, "mDegr"]], [[15, "Desc"]], [[10, "mAux"]], [[13, "Quan"]], [[15, "Desc"]], [[15, "Nmod"]], [[6, "Cont"]], [[6, "mPunc"]]],
[[[2, "Agt"]], [[4, "mPrep"]], [[4, "Nmod"]], [[2, "Lfin"]], [[2, "ePurp"]], [[7, "Nmod"]], [[9, "Nmod"]], [[9, "Desc"]], [[5, "Cont"]], [[2, "mPunc"]]]
],
"con": [
["TOP", [["IP", [["NP", [["NT", ["2021年"]]]], ["NP", [["URL", ["HanLPv2.1"]]]], ["VP", [["PP", [["P", ["为"]], ["NP", [["NN", ["生产"]], ["NN", ["环境"]]]]]], ["VP", [["VV", ["带来"]], ["NP", [["IP", [["VP", [["NP", [["QP", [["CLP", [["M", ["次"]]]]]], ["NP", [["NN", ["世代"]]]]]], ["ADVP", [["AD", ["最"]]]], ["VP", [["VA", ["先进"]]]]]]]], ["DEC", ["的"]], ["NP", [["QP", [["CD", ["多"]]]], ["NP", [["NN", ["语种"]]]]]], ["NP", [["NN", ["NLP"]], ["NN", ["技术"]]]]]]]]]], ["PU", ["。"]]]]]],
["TOP", [["IP", [["NP", [["NN", ["阿婆主"]]]], ["VP", [["VP", [["VV", ["来到"]], ["NP", [["NR", ["北京"]], ["NP", [["NR", ["立方庭"]]]]]]]], ["VP", [["VV", ["参观"]], ["NP", [["NN", ["自然"]], ["NN", ["语义"]], ["NN", ["科技"]], ["NN", ["公司"]]]]]]]], ["PU", ["。"]]]]]]
]
}
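
When only some of the analyses are needed, the request can be narrowed instead of running the full pipeline. The sketch below assumes the tasks keyword argument of HanLPClient.parse (check the hanlp_restful documentation for the exact task names); treat it as a usage sketch rather than a definitive reference.

# Sketch: ask the server for tokenization and NER only (assumes the `tasks` parameter).
doc = HanLP.parse("阿婆主来到北京立方庭参观自然语义科技公司。", tasks=["tok", "ner"])
print(doc)  # the returned dict then contains only the requested task keys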

$1-2 Java Interface

Using Maven

Add the dependency to pom.xml.

<dependency>
    <groupId>com.hankcs.hanlp.restful</groupId>
    <artifactId>hanlp-restful</artifactId>
    <version>0.0.8</version>
</dependency>

Create a client, filling in the server URL and your API key:

HanLPClient HanLP = new HanLPClient("https://www.hanlp.com/api", null, "zh"); // null auth means anonymous; "zh" = Chinese, "mul" = multilingual

The returned object is a com.hankcs.hanlp.restful.HanLPClient. We can inspect the class with the "print all information about a class" reflection template from the article Java核心技术1-反射 (on Java reflection):

public class com.hankcs.hanlp.restful.HanLPClient
{
public com.hankcs.hanlp.restful.HanLPClient(java.lang.String, java.lang.String, java.lang.String, int);
public com.hankcs.hanlp.restful.HanLPClient(java.lang.String, java.lang.String);

public java.util.Map parse([[Ljava.lang.String;, [Ljava.lang.String;, [Ljava.lang.String;);
public java.util.Map parse([Ljava.lang.String;);
public java.util.Map parse([[Ljava.lang.String;);
public java.util.Map parse(java.lang.String, [Ljava.lang.String;, [Ljava.lang.String;);
public java.util.Map parse(java.lang.String);
public java.util.Map parse([Ljava.lang.String;, [Ljava.lang.String;, [Ljava.lang.String;);
public java.util.List coreferenceResolution([[Ljava.lang.String;, [Ljava.lang.String;);
public java.util.List coreferenceResolution([[Ljava.lang.String;);
public com.hankcs.hanlp.restful.CoreferenceResolutionOutput coreferenceResolution(java.lang.String);
private static java.util.List _convert_clusters(java.util.List);
public [Lcom.hankcs.hanlp.restful.mrp.MeaningRepresentation; abstractMeaningRepresentation(java.lang.String);
public [Lcom.hankcs.hanlp.restful.mrp.MeaningRepresentation; abstractMeaningRepresentation([[Ljava.lang.String;);
private java.lang.String post(java.lang.String, java.lang.Object);
public java.util.List tokenize(java.lang.String);
public java.util.List tokenize(java.lang.String, java.lang.Boolean);
public java.util.List textStyleTransfer(java.util.List, java.lang.String);
public java.lang.String textStyleTransfer(java.lang.String, java.lang.String);
public java.lang.Float semanticTextualSimilarity(java.lang.String, java.lang.String);
public java.util.List semanticTextualSimilarity([[Ljava.lang.String;);

private java.lang.String url;
private java.lang.String auth;
private java.lang.String language;
private int timeout;
private com.fasterxml.jackson.databind.ObjectMapper mapper;
}

In the same way, call the parse interface with a document to get HanLP's analysis.

import java.util.Map;

import com.hankcs.hanlp.restful.HanLPClient;

public class App {
    public static void main(String[] args) throws Exception {
        HanLPClient hanlp = new HanLPClient("https://www.hanlp.com/api", null, "zh", 10);
        Map result = hanlp.parse("2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。");
        System.out.println(result);
    }
}

The result is as follows:

{
tok/fine=[[2021年, HanLPv2.1, 为, 生产, 环境, 带来, 次, 世代, 最, 先进, 的, 多, 语种, NLP, 技术, 。], [阿婆主, 来到, 北京, 立方庭, 参观, 自然, 语义, 科技, 公司, 。]],
tok/coarse=[[2021年, HanLPv2.1, 为, 生产环境, 带来, 次世代, 最, 先进, 的, 多语种, NLP, 技术, 。], [阿婆主, 来到, 北京, 立方庭, 参观, 自然语义科技公司, 。]],
pos/ctb=[[NT, URL, P, NN, NN, VV, M, NN, AD, VA, DEC, CD, NN, NN, NN, PU], [NN, VV, NR, NR, VV, NN, NN, NN, NN, PU]],
pos/pku=[[t, nx, p, vn, n, v, q, n, d, a, u, a, n, nx, n, w], [n, v, ns, ns, v, n, n, n, n, w]],
pos/863=[[nt, w, p, v, n, v, q, nt, d, a, u, a, n, w, n, w], [n, v, ns, n, v, n, n, n, n, w]],
ner/msra=[[[2021年, DATE, 0, 1], [HanLPv2.1, ORGANIZATION, 1, 2]], [[北京立方庭, LOCATION, 2, 4], [自然语义科技公司, ORGANIZATION, 5, 9]]],
ner/pku=[[], [[北京, ns, 2, 3], [立方庭, ns, 3, 4], [自然语义科技公司, nt, 5, 9]]],
ner/ontonotes=[[[2021年, DATE, 0, 1], [次世代, DATE, 6, 8]], [[北京, FAC, 2, 3], [立方庭, LOC, 3, 4], [自然语义科技公司, ORG, 5, 9]]],
srl=[[[[2021年, ARGM-TMP, 0, 1], [HanLPv2.1, ARG0, 1, 2], [为生产环境, ARG2, 2, 5], [带来, PRED, 5, 6], [次世代最先进的多语种NLP技术, ARG1, 6, 15]], [[次世代, ARGM-TMP, 6, 8], [最, ARGM-ADV, 8, 9], [先进, PRED, 9, 10], [NLP技术, ARG0, 13, 15]]], [[[阿婆主, ARG0, 0, 1], [来到, PRED, 1, 2], [北京立方庭, ARG1, 2, 4]], [[阿婆主, ARG0, 0, 1], [参观, PRED, 4, 5], [自然语义科技公司, ARG1, 5, 9]]]],
dep=[[[6, tmod], [6, nsubj], [6, prep], [5, nn], [3, pobj], [0, root], [8, clf], [10, dep], [10, advmod], [15, rcmod], [10, cpm], [13, nummod], [15, nn], [15, nn], [6, dobj], [6, punct]], [[2, nsubj], [0, root], [4, nn], [2, dobj], [2, conj], [9, nn], [9, nn], [9, nn], [5, dobj], [2, punct]]],
sdp=[[[[6, Time]], [[6, Agt]], [[5, mPrep]], [[5, Desc]], [[6, Datv]], [[0, Root]], [[8, Qp]], [[15, TDur]], [[10, mDegr]], [[15, Desc]], [[10, mAux]], [[13, Quan]], [[15, Desc]], [[15, Nmod]], [[6, Cont]], [[6, mPunc]]], [[[2, Agt]], [[4, mPrep]], [[4, Nmod]], [[2, Lfin]], [[2, ePurp]], [[7, Nmod]], [[9, Nmod]], [[9, Desc]], [[5, Cont]], [[2, mPunc]]]],
con=[[TOP, [[IP, [[NP, [[NT, [2021年]]]], [NP, [[URL, [HanLPv2.1]]]], [VP, [[PP, [[P, [为]], [NP, [[NN, [生产]], [NN, [环境]]]]]], [VP, [[VV, [带来]], [NP, [[IP, [[VP, [[NP, [[QP, [[CLP, [[M, [次]]]]]], [NP, [[NN, [世代]]]]]], [ADVP, [[AD, [最]]]], [VP, [[VA, [先进]]]]]]]], [DEC, [的]], [NP, [[QP, [[CD, [多]]]], [NP, [[NN, [语种]]]]]], [NP, [[NN, [NLP]], [NN, [技术]]]]]]]]]], [PU, [。]]]]]], [TOP, [[IP, [[NP, [[NN, [阿婆主]]]], [VP, [[VP, [[VV, [来到]], [NP, [[NR, [北京]], [NP, [[NR, [立方庭]]]]]]]], [VP, [[VV, [参观]], [NP, [[NN, [自然]], [NN, [语义]], [NN, [科技]], [NN, [公司]]]]]]]], [PU, [。]]]]]]]
}

$2 Large-Scale Native API

$2-1 Version 2.x

$2-1-1 Python Interface

pip install hanlp

Example: the multi-task model

HanLP's workflow is to load a model and then call it like a function, for example the joint multi-task model below:

import hanlp

HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)  # trained on what HanLP describes as the world's largest Chinese corpus
result = HanLP(['2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。', '阿婆主来到北京立方庭参观自然语义科技公司。'])

The result is as follows:

{
"tok/fine": [
["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次世代", "最", "先进", "的", "多语种", "NLP", "技术", "。"],
["阿婆主", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司", "。"]
],
"tok/coarse": [
["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次世代", "最", "先进", "的", "多语种", "NLP", "技术", "。"],
["阿婆主", "来到", "北京立方庭", "参观", "自然语义科技公司", "。"]
],
"pos/ctb": [
["NT", "NR", "P", "NN", "NN", "VV", "NN", "AD", "VA", "DEC", "NN", "NR", "NN", "PU"],
["NN", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN", "PU"]
],
"pos/pku": [
["t", "nx", "p", "vn", "n", "v", "n", "d", "a", "u", "n", "nx", "n", "w"],
["n", "v", "ns", "ns", "v", "n", "n", "n", "n", "w"]
],
"pos/863": [
["nt", "w", "p", "v", "n", "v", "nt", "d", "a", "u", "n", "w", "n", "w"],
["n", "v", "ns", "ni", "v", "a", "n", "n", "n", "w"]
],
"ner/msra": [
[["2021年", "DATE", 0, 1], ["次世代", "DATE", 6, 7]],
[["北京立方庭", "ORGANIZATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
],
"ner/pku": [
[],
[["北京立方庭", "nt", 2, 4]]
],
"ner/ontonotes": [
[["2021年", "DATE", 0, 1], ["HanLPv2.1", "PERSON", 1, 2]],
[["北京立方庭", "FAC", 2, 4], ["自然语义科技公司", "ORG", 5, 9]]
],
"srl": [
[[["2021年", "ARGM-TMP", 0, 1], ["HanLPv2.1", "ARG0", 1, 2], ["为生产环境", "ARG2", 2, 5], ["带来", "PRED", 5, 6], ["次世代最先进的多语种NLP技术", "ARG1", 6, 13]], [["次世代", "ARGM-TMP", 6, 7], ["最", "ARGM-ADV", 7, 8], ["先进", "PRED", 8, 9], ["技术", "ARG0", 12, 13]]],
[[["阿婆主", "ARG0", 0, 1], ["来到", "PRED", 1, 2], ["北京立方庭", "ARG1", 2, 4]], [["阿婆主", "ARG0", 0, 1], ["参观", "PRED", 4, 5], ["自然语义科技公司", "ARG1", 5, 9]]]
],
"dep": [
[[6, "tmod"], [6, "nsubj"], [6, "prep"], [5, "nn"], [3, "pobj"], [0, "root"], [9, "dep"], [9, "advmod"], [13, "rcmod"], [9, "cpm"], [13, "nn"], [13, "nn"], [6, "dobj"], [6, "punct"]],
[[2, "nsubj"], [0, "root"], [4, "nn"], [2, "dobj"], [2, "conj"], [7, "nn"], [9, "nn"], [9, "nn"], [5, "dobj"], [2, "punct"]]
],
"sdp": [
[[[6, "Time"]], [[6, "Exp"]], [[5, "mPrep"]], [[5, "Desc"]], [[6, "Datv"]], [[0, "Root"]], [[9, "Time"]], [[9, "mDegr"]], [[13, "Desc"]], [[9, "mAux"]], [[13, "Desc"]], [[13, "Nmod"]], [[6, "Pat"]], [[6, "mPunc"]]],
[[[2, "Agt"], [5, "Agt"]], [[0, "Root"]], [[4, "Nmod"]], [[2, "Lfin"]], [[2, "ePurp"]], [[7, "Nmod"]], [[9, "Nmod"]], [[9, "Desc"]], [[5, "Cont"]], [[5, "mPunc"]]]
],
"con": [
["TOP", [["IP", [["NP", [["NT", ["2021年"]]]], ["NP", [["NR", ["HanLPv2.1"]]]], ["VP", [["PP", [["P", ["为"]], ["NP", [["NN", ["生产"]], ["NN", ["环境"]]]]]], ["VP", [["VV", ["带来"]], ["NP", [["CP", [["CP", [["IP", [["VP", [["NP", [["NN", ["次世代"]]]], ["ADVP", [["AD", ["最"]]]], ["VP", [["VA", ["先进"]]]]]]]], ["DEC", ["的"]]]]]], ["NP", [["NN", ["多语种"]]]], ["NP", [["NR", ["NLP"]], ["NN", ["技术"]]]]]]]]]], ["PU", ["。"]]]]]],
["TOP", [["IP", [["NP", [["NN", ["阿婆主"]]]], ["VP", [["VP", [["VV", ["来到"]], ["NP", [["NR", ["北京"]], ["NR", ["立方庭"]]]]]], ["VP", [["VV", ["参观"]], ["NP", [["NN", ["自然"]], ["NN", ["语义"]], ["NN", ["科技"]], ["NN", ["公司"]]]]]]]], ["PU", ["。"]]]]]]
]
}
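
The multi-task model can also be asked to run only a subset of its tasks, which saves time when, say, only tokenization is needed. The sketch below assumes the tasks argument accepted by the loaded multi-task model (see the hanlp documentation for the available task names); it is a sketch, not the definitive API reference.

# Sketch: run only the fine-grained tokenization task of the multi-task model
# (assumes the `tasks` argument of the MTL component).
doc = HanLP(['阿婆主来到北京立方庭参观自然语义科技公司。'], tasks='tok/fine')
print(doc['tok/fine'])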

Example: single-task models

Multi-task learning wins on speed and GPU memory, but its accuracy is often lower than that of single-task models. HanLP therefore pre-trains many single-task models and provides an elegant pipeline pattern for chaining them together. The code is as follows:

import hanlp
HanLP = hanlp.pipeline() \
    .append(hanlp.utils.rules.split_sentence, output_key='sentences') \
    .append(hanlp.load('FINE_ELECTRA_SMALL_ZH'), output_key='tok') \
    .append(hanlp.load('CTB9_POS_ELECTRA_SMALL'), output_key='pos') \
    .append(hanlp.load('MSRA_NER_ELECTRA_SMALL_ZH'), output_key='ner', input_key='tok') \
    .append(hanlp.load('CTB9_DEP_ELECTRA_SMALL', conll=0), output_key='dep', input_key='tok') \
    .append(hanlp.load('CTB9_CON_ELECTRA_SMALL'), output_key='con', input_key='tok')
result = HanLP('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。')

with open("a.json", "w") as f:
    f.write(str(result))

The result is as follows:

{
"sentences": [
"2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。",
"阿婆主来到北京立方庭参观自然语义科技公司。"
],
"tok": [
["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次", "世代", "最", "先进", "的", "多", "语种", "NLP", "技术", "。"],
["阿婆", "主", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司", "。"]
],
"pos": [
["NT", "NR", "P", "NN", "NN", "VV", "JJ", "NN", "AD", "VA", "DEC", "CD", "NN", "NR", "NN", "PU"],
["NR", "AD", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN", "PU"]
],
"ner": [
[["2021年", "DATE", 0, 1]],
[["北京", "ORGANIZATION", 3, 4], ["立方庭", "LOCATION", 4, 5], ["自然语义科技公司", "ORGANIZATION", 6, 10]]
],
"dep": [
[[6, "tmod"], [6, "nsubj"], [6, "prep"], [5, "nn"], [3, "pobj"], [0, "root"], [8, "det"], [15, "nn"], [10, "advmod"], [15, "rcmod"], [10, "cpm"], [13, "nummod"], [15, "nn"], [15, "nn"], [6, "dobj"], [6, "punct"]],
[[2, "nn"], [3, "nsubj"], [0, "root"], [5, "nn"], [3, "dobj"], [3, "conj"], [10, "nn"], [10, "nn"], [10, "nn"], [6, "dobj"], [3, "punct"]]
],
"con": [
["TOP", [["IP", [["NP", [["_", ["2021年"]]]], ["NP", [["_", ["hanlpv2.1"]]]], ["VP", [["PP", [["_", ["为"]], ["NP", [["_", ["生产"]], ["_", ["环境"]]]]]], ["VP", [["_", ["带来"]], ["NP", [["CP", [["CP", [["IP", [["VP", [["NP", [["DP", [["_", ["次"]]]], ["NP", [["_", ["世代"]]]]]], ["ADVP", [["_", ["最"]]]], ["VP", [["_", ["先进"]]]]]]]], ["_", ["的"]]]]]], ["NP", [["ADJP", [["_", ["多"]]]], ["NP", [["_", ["语种"]]]]]], ["NP", [["_", ["nlp"]], ["_", ["技术"]]]]]]]]]], ["_", ["。"]]]]]],
["TOP", [["IP", [["NP", [["NP", [["_", ["阿婆"]]]], ["ADVP", [["_", ["主"]]]]]], ["VP", [["VP", [["_", ["来到"]], ["NP", [["_", ["北京"]], ["_", ["立方庭"]]]]]], ["VP", [["_", ["参观"]], ["NP", [["_", ["自然"]], ["_", ["语义"]], ["_", ["科技"]], ["_", ["公司"]]]]]]]], ["_", ["。"]]]]]]
]
}
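
A side note on the a.json file written above: str(result) produces Python's repr, not strict JSON. If the result behaves like a plain dict of lists and strings, as the printed output suggests, the standard json module writes it more faithfully; this is a hedged sketch, and if the returned document object offers its own serializer (for example a to_json method), that would be preferable.

import json

# Sketch: write the pipeline output as real JSON, keeping Chinese characters readable.
with open("a.json", "w", encoding="utf-8") as f:
    json.dump(dict(result), f, ensure_ascii=False, indent=2)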

$2-2 Version 1.x

$2-2-1 Python Interface

The GitHub repository of pyhanlp is https://github.com/hankcs/pyhanlp

pip install pyhanlp

If everything goes smoothly, the first time you import pyhanlp or run it from the command line, the HanLP jar (containing the algorithms) and the data package (containing the models) are downloaded automatically into pyhanlp's installation directory. For example:

下载 https://file.hankcs.com/hanlp/hanlp-1.8.3-release.zip 到 /home/ppp/anaconda3/envs/python-3.6/lib/python3.6/site-packages/pyhanlp/static/hanlp-1.8.3-release.zip
100% 1.8 MiB 1.8 MiB/s ETA: 0 s [=============================================================]
下载 https://file.hankcs.com/hanlp/data-for-1.7.5.zip 到 /home/ppp/anaconda3/envs/python-3.6/lib/python3.6/site-packages/pyhanlp/static/data-for-1.8.3.zip
100% 637.7 MiB 1.1 MiB/s ETA: 0 s [=============================================================]
解压 data.zip...

Command-line tool

Running the hanlp command in a terminal prints the following usage information.

usage: hanlp [-h] [-v] {segment,parse,serve,update} ...

HanLP: Han Language Processing v1.8.3

positional arguments:
{segment,parse,serve,update}
which task to perform?
segment word segmentation
parse dependency parsing
serve start http server
update update jar and data of HanLP

optional arguments:
-h, --help show this help message and exit
-v, --version show installed versions of HanLP

Many common HanLP features can be used without writing any code. For example, hanlp segment performs word segmentation and hanlp parse performs dependency parsing. Below are segmentation examples (<<< is the shell here-string redirection).

(1) Chinese word segmentation, with part-of-speech tags.

hanlp segment <<< 我们可以在不写代码的前提下轻松调用很多常见功能

The result:

我们/rr 可以/v 在/p 不/d 写/v 代码/n 的/ude1 前提/n 下/f 轻松/a 调用/v 很多/m 常见/a 功能/n

(2) Chinese word segmentation, without part-of-speech tags.

hanlp segment --no-tag <<< 我们可以在不写代码的前提下轻松调用很多常见功能

The result:

我们 可以 在 不 写 代码 的 前提 下 轻松 调用 很多 常见 功能

(3) Chinese word segmentation without part-of-speech tags, using the CRF segmentation algorithm, reading from input.txt and writing to output.txt.

hanlp segment < input.txt > output.txt -a crf --no-tag

Examples using the Python interface

from pyhanlp import HanLP

print(HanLP.segment("你好,欢迎在Python汇总调用HanLP的API"))
for term in HanLP.segment("下雨天地面积水"):
    print("{}\t{}".format(term.word, term.nature))

# Word segmentation
testCases = ["商品和服务"
    , "结婚的和尚未结婚的确实在干扰分词啊"
    , "买水果然后来世博园最后去世博会"
    , "中国的首都是北京"
    , "欢迎新老师生前来就餐"
    , "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作"
    , "随着页游兴起到现在的页游繁盛,依赖于存档进行逻辑判断的设计减少了,但这块也不能完全忽略掉"
    ]
for sentence in testCases:
    print(HanLP.segment(sentence))

# Keyword extraction: top-2 keywords
document = "水利部水资源司司长陈明忠9月29日在国务院新闻办举行的新闻发布会上透露," \
           "根据刚刚完成了水资源管理制度的考核,有部分省接近了红线的指标," \
           "有部分省超过红线的指标。对一些超过红线的地方,陈明忠表示,对一些取用水项目进行区域的现批" \
           "严格地进行水资源的论证和取水许可的批准"
print(HanLP.extractKeyword(document, 2))
# Top-3 keywords (true automatic summarization would use HanLP.extractSummary instead)
print(HanLP.extractKeyword(document, 3))

# Dependency parsing
print(HanLP.parseDependency("徐先生还具体帮助他确定了把画雄鹰、松鼠和麻雀作为主攻目标"))

The result:

[你好/vl, ,/w, 欢迎/v, 在/p, Python/nx, 汇总/v, 调用/v, HanLP/nx, 的/ude1, API/nx]
下雨天 n
地面 n
积水 n
[商品/n, 和/cc, 服务/vn]
[结婚/vi, 的/ude1, 和/cc, 尚未/d, 结婚/vi, 的/ude1, 确实/ad, 在/p, 干扰/vn, 分词/n, 啊/y]
[买/v, 水果/n, 然后/c, 来/vf, 世博园/n, 最后/f, 去/vf, 世博会/n]
[中国/ns, 的/ude1, 首都/n, 是/vshi, 北京/ns]
[欢迎/v, 新/a, 老/a, 师生/n, 前来/vi, 就餐/vi]
[工信处/n, 女干事/n, 每月/t, 经过/p, 下属/v, 科室/n, 都/d, 要/v, 亲口/d, 交代/v, 24/m, 口/n, 交换机/n, 等/udeng, 技术性/n, 器件/n, 的/ude1, 安装/v, 工作/vn]
[随着/p, 页游/nz, 兴起/v, 到/v, 现在/t, 的/ude1, 页游/nz, 繁盛/a, ,/w, 依赖于/v, 存档/vi, 进行/vn, 逻辑/n, 判断/v, 的/ude1, 设计/vn, 减少/v, 了/ule, ,/w, 但/c, 这块/r, 也/d, 不能/v, 完全/ad, 忽略/v, 掉/v]
[水资源, 陈明忠]
[水资源, 陈明忠, 进行]
1 徐先生 徐先生 nh nr _ 4 主谓关系 _ _
2 还 还 d d _ 4 状中结构 _ _
3 具体 具体 a ad _ 4 状中结构 _ _
4 帮助 帮助 v v _ 0 核心关系 _ _
5 他 他 r r _ 4 兼语 _ _
6 确定 确定 v v _ 4 动宾关系 _ _
7 了 了 u u _ 6 右附加关系 _ _
8 把 把 p p _ 15 状中结构 _ _
9 画 画 v v _ 8 介宾关系 _ _
10 雄鹰 雄鹰 n n _ 9 动宾关系 _ _
11 、 、 wp w _ 12 标点符号 _ _
12 松鼠 松鼠 n n _ 10 并列关系 _ _
13 和 和 c c _ 14 左附加关系 _ _
14 麻雀 麻雀 n n _ 10 并列关系 _ _
15 作为 作为 v v _ 6 动宾关系 _ _
16 主攻 主攻 v vn _ 17 定中关系 _ _
17 目标 目标 n n _ 15 动宾关系 _ _
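
The object returned by parseDependency is a CoNLL-style sentence, so besides printing the whole table you can walk the words and arcs in code. A small sketch following the usage shown in the pyhanlp README (the LEMMA, DEPREL, and HEAD attribute names are taken from that documentation):

# Sketch: walk the dependency arcs instead of printing the whole CoNLL table.
sentence = HanLP.parseDependency("徐先生还具体帮助他确定了把画雄鹰、松鼠和麻雀作为主攻目标")
for word in sentence.iterator():  # a Java iterator exposed through pyhanlp
    print("%s --(%s)--> %s" % (word.LEMMA, word.DEPREL, word.HEAD.LEMMA))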

There are many more examples to consult in the tests/demos directory of the pyhanlp repository.
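
For instance, the HanLP 1.x Java API also exposes pinyin conversion through the same static HanLP class; the sketch below assumes convertToPinyinList is reachable through pyhanlp's proxy, as in the repository's pinyin demo.

from pyhanlp import HanLP

# Sketch: convert Chinese text to a list of pinyin syllables (HanLP 1.x Java API).
print(HanLP.convertToPinyinList("重载不是重任"))  # prints a list of Pinyin objects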

$2-2-2 Java Interface

(1) Maven

HanLP can be pulled in through Maven; just add the following dependency to the project's pom.xml.

<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.8.3</version>
</dependency>

Example: HanLP.segment.

package com.czx.hanlphelloworld;

import com.hankcs.hanlp.HanLP;

public class App {
    public static void main(String[] args) {
        System.out.println(HanLP.segment("你好,欢迎使用 HanLP 汉语处理包"));
    }
}

There are two ways to run it:

  • (1) Compile and run the main class directly:
mvn compile exec:java -Dexec.mainClass="com.czx.hanlphelloworld.App"
  • (2) Run mvn clean install, then run the packaged jar:
mvn clean install
java -jar target/hanlp_helloworld-1.0-SNAPSHOT.jar

The result:

[你好/l, ,/w, 欢迎/v, 使用/v,  /w, HanLP/nx,  /w, 汉语/nz, 处理/v, 包/v]

(2) Configuring data and hanlp.properties yourself

A HanLP project has the following directory structure:

|---- data
| |---- dictionary
| |---- model
|---- pom.xml
|---- src
|---- main
|---- test

In HanLP, data and program are separated. To keep the jar small, the portable edition ships with only a small amount of data. Advanced features (CRF segmentation, dependency parsing, and so on) require downloading the full data package and pointing HanLP at it through the configuration file.

The dictionaries are required for lexical analysis; the models are required for syntactic analysis.

If pyhanlp is already installed, the data package and configuration file are already in place. Their locations can be listed with:

hanlp -v

The result:

jar  1.8.3: /home/ppp/anaconda3/envs/python-3.6/lib/python3.6/site-packages/pyhanlp/static/hanlp-1.8.3.jar
data 1.8.3: /home/ppp/anaconda3/envs/python-3.6/lib/python3.6/site-packages/pyhanlp/static/data
config : /home/ppp/anaconda3/envs/python-3.6/lib/python3.6/site-packages/pyhanlp/static/hanlp.properties

These resources, namely the models, the data package, and the configuration file, still have to be added to the Java project by hand.

The hanlp.properties shown on the last line is the configuration file we need; copy it into the project's resource directory src/main/resources, and place data (dictionaries and models) in the corresponding location in the project. Inside hanlp.properties, the root entry should point to the parent directory of data.

The resulting directory structure looks like this:

|---- data
| |---- dictionary
| |---- model
| |---- dependency
| |---- perceptron
| |---- segment
|---- pom.xml
|---- src
|---- main
| |---- java
| |---- resources
| |---- hanlp.properties
|---- test

After arranging the directories as above and copying the properties file, data, and the other required files into place, adjust pom.xml and then compile and run with mvn.

Maven is not mandatory either: you can lay out the project yourself, copy in the HanLP jar, the properties file, and the data folder, and compile and run with the java toolchain on the command line.

A concrete end-to-end run is shown in the article HanLP1-x-Java运行例子.

For the repository https://github.com/hankcs/HanLP, the version current when《自然语言处理入门》(Introduction to Natural Language Processing) was written was 1.7.5; git checkout v1.7.5 restores the version that matches the book. The book's code lives under src/test/java/ in com/hankcs/book, and com/hankcs/demo contains many more examples; both can be run as shown below.

Example 1: HelloWord.java from chapter 1 of the book:

/*
* <author>Han He</author>
* <email>me@hankcs.com</email>
* <create-date>2018-05-18 下午5:38</create-date>
*
* <copyright file="HelloWord.java">
* Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/
* This source is subject to Han He. Please contact Han He for more information.
* </copyright>
*/
package com.hankcs.book.ch01;

import com.hankcs.hanlp.HanLP;

/**
* 《自然语言处理入门》1.6 开源工具
* 配套书籍:http://nlp.hankcs.com/book.php
* 讨论答疑:https://bbs.hankcs.com/
*
* @author hankcs
* @see <a href="http://nlp.hankcs.com/book.php">《自然语言处理入门》</a>
* @see <a href="https://bbs.hankcs.com/">讨论答疑</a>
*/
public class HelloWord
{
    public static void main(String[] args)
    {
        HanLP.Config.enableDebug(); // 首次运行会自动建立模型缓存,为了避免你等得无聊,开启调试模式说点什么:-)
        System.out.println(HanLP.segment("王国维和服务员"));
    }
}

Example 2: DemoSegment.java from the demos:

/*
* <summary></summary>
* <author>He Han</author>
* <email>hankcs.cn@gmail.com</email>
* <create-date>2014/12/7 19:02</create-date>
*
* <copyright file="DemoSegment.java" company="上海林原信息科技有限公司">
* Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/
* This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information.
* </copyright>
*/
package com.hankcs.demo;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.Segment;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;

/**
* 标准分词
*
* @author hankcs
*/
public class DemoSegment
{
    public static void main(String[] args)
    {
        String[] testCase = new String[]{
            "商品和服务",
            "当下雨天地面积水分外严重",
            "结婚的和尚未结婚的确实在干扰分词啊",
            "买水果然后来世博园最后去世博会",
            "中国的首都是北京",
            "欢迎新老师生前来就餐",
            "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作",
            "随着页游兴起到现在的页游繁盛,依赖于存档进行逻辑判断的设计减少了,但这块也不能完全忽略掉。",
        };
        for (String sentence : testCase)
        {
            List<Term> termList = HanLP.segment(sentence);
            System.out.println(termList);
        }
    }
}

First we build a project for running the book's code and the repository's demos. Its directory structure is as follows:

|---- data
|---- pom.xml
|---- src
|---- main
| |---- java
| | |---- com
| | |---- hankcs
| | |---- book
| | | |---- ch01
| | | |---- HelloWord.java
| | |---- demo
| | |---- DemoSegment.java
| |---- resources
| |---- hanlp.properties
|---- test

The pom.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>
    <groupId>com.hankcs</groupId>
    <artifactId>Hello_Word</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>Hello-Word</name>
    <url>https://www.hanlp.com/</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.hankcs</groupId>
            <artifactId>hanlp</artifactId>
            <version>portable-1.8.3</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>10</source>
                    <target>10</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>1.2.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.hankcs.book.ch01.HelloWord</mainClass>
                                    <mainClass>com.hankcs.demo.DemoSegment</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Now compile and run:

mvn clean compile

Run the book's ch01/HelloWord.java example:

mvn exec:java -Dexec.mainClass="com.hankcs.book.ch01.HelloWord"

The result is as follows:

[INFO] Scanning for projects...
[INFO]
[INFO] ------------------< com.hankcs.book.ch01:Hello_Word >-------------------
[INFO] Building Hello-Word 1.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- exec-maven-plugin:3.0.0:java (default-cli) @ Hello_Word ---
May 31, 2022 11:30:52 PM com.hankcs.hanlp.dictionary.DynamicCustomDictionary loadMainDictionary
INFO: 自定义词典开始加载:/home/ppp/anaconda3/envs/python-3.6/lib/python3.6/site-packages/pyhanlp/static/data/dictionary/custom/CustomDictionary.txt
May 31, 2022 11:30:52 PM com.hankcs.hanlp.dictionary.DynamicCustomDictionary load
INFO: 自定义词典加载成功:381940个词条,耗时82ms
May 31, 2022 11:30:52 PM com.hankcs.hanlp.dictionary.CoreDictionary load
INFO: 核心词典开始加载:/home/ppp/anaconda3/envs/python-3.6/lib/python3.6/site-packages/pyhanlp/static/data/dictionary/CoreNatureDictionary.txt
May 31, 2022 11:30:53 PM com.hankcs.hanlp.dictionary.CoreDictionary <clinit>
INFO: /home/ppp/anaconda3/envs/python-3.6/lib/python3.6/site-packages/pyhanlp/static/data/dictionary/CoreNatureDictionary.txt加载成功,153091个词条,耗时18ms
粗分词网:
0:[ ]
1:[王, 王国]
2:[国]
3:[维, 维和]
4:[和, 和服]
5:[服, 服务, 服务员]
6:[务]
7:[员]
8:[ ]

May 31, 2022 11:30:53 PM com.hankcs.hanlp.dictionary.CoreBiGramTableDictionary <clinit>
INFO: 开始加载二元词典/home/ppp/anaconda3/envs/python-3.6/lib/python3.6/site-packages/pyhanlp/static/data/dictionary/CoreNatureDictionary.ngram.txt.table
May 31, 2022 11:30:53 PM com.hankcs.hanlp.dictionary.CoreBiGramTableDictionary <clinit>
INFO: /home/ppp/anaconda3/envs/python-3.6/lib/python3.6/site-packages/pyhanlp/static/data/dictionary/CoreNatureDictionary.ngram.txt.table加载成功,耗时43ms
粗分结果[王国/n, 维和/vn, 服务员/nnt]
May 31, 2022 11:30:53 PM com.hankcs.hanlp.dictionary.nr.PersonDictionary <clinit>
INFO: /home/ppp/anaconda3/envs/python-3.6/lib/python3.6/site-packages/pyhanlp/static/data/dictionary/person/nr.txt加载成功,耗时81ms
人名角色观察:[ K 1 A 1 ][王国 X 232 L 3 ][维和 L 2 V 1 Z 1 ][服务员 K 14 ][ K 1 A 1 ]
人名角色标注:[ /K ,王国/X ,维和/V ,服务员/K , /K]
识别出人名:王国维 XD
细分词网:
0:[ ]
1:[王国, 王国维]
2:[]
3:[维和]
4:[和, 和服]
5:[服务员]
6:[]
7:[]
8:[ ]

[王国维/nr, 和/cc, 服务员/nnt]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.774 s
[INFO] Finished at: 2022-05-31T23:30:53+08:00
[INFO] ------------------------------------------------------------------------

Run demo/DemoSegment.java:

mvn exec:java -Dexec.mainClass="com.hankcs.demo.DemoSegment"

The result is as follows:

[INFO] Scanning for projects...
[INFO]
[INFO] -----------------------< com.hankcs:Hello_Word >------------------------
[INFO] Building Hello-Word 1.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- exec-maven-plugin:3.0.0:java (default-cli) @ Hello_Word ---
[商品/n, 和/cc, 服务/vn]
[当/p, 下雨天/n, 地面/n, 积水/n, 分外/d, 严重/a]
[结婚/vi, 的/ude1, 和/cc, 尚未/d, 结婚/vi, 的/ude1, 确实/ad, 在/p, 干扰/vn, 分词/n, 啊/y]
[买/v, 水果/n, 然后/c, 来/vf, 世博园/n, 最后/f, 去/vf, 世博会/n]
[中国/ns, 的/ude1, 首都/n, 是/vshi, 北京/ns]
[欢迎/v, 新/a, 老/a, 师生/n, 前来/vi, 就餐/vi]
[工信处/n, 女干事/n, 每月/t, 经过/p, 下属/v, 科室/n, 都/d, 要/v, 亲口/d, 交代/v, 24/m, 口/n, 交换机/n, 等/udeng, 技术性/n, 器件/n, 的/ude1, 安装/v, 工作/vn]
[随着/p, 页游/nz, 兴起/v, 到/v, 现在/t, 的/ude1, 页游/nz, 繁盛/a, ,/w, 依赖于/v, 存档/vi, 进行/vn, 逻辑/n, 判断/v, 的/ude1, 设计/vn, 减少/v, 了/ule, ,/w, 但/c, 这块/r, 也/d, 不能/v, 完全/ad, 忽略/v, 掉/v, 。/w]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.708 s
[INFO] Finished at: 2022-05-31T23:40:59+08:00
[INFO] ------------------------------------------------------------------------
