HanLP: a Chinese language processing package

pandastar, 2020-12-22 22:24

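1. Introduction

In search and other applications, we usually need to segment text into words. For Chinese word segmentation we can use HanLP, an open-source Chinese language processing package that supports word segmentation and other language processing tasks.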

2. Components

HanLP consists of three parts: the lexicon data, the driver (the jar package), and the HanLP configuration.

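2.1 Lexicon data

The lexicon data contains dictionaries and models. The dictionaries (under data/dictionary) are used for lexical analysis, and the models (under data/model) are used for syntactic analysis. The data is distributed in the following packages: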

data.full.zip: the complete data (dictionaries and models);

data.standard.zip: the complete dictionaries, without models;

data.mini.zip: small dictionaries, without models.

Download address: http://115.159.41.123/click.php?id=3

Details: https://github.com/hankcs/HanLP/releases/tag/v1.3.4

2.2 Driver (jar package)

HanLP provides a portable jar with a basic built-in dictionary. The Maven dependency is as follows:

<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.2.8</version>
</dependency>

If you use HanLP in Lucene or Solr with the dictionaries installed separately, add the corresponding plugin dependency, as follows:

<dependency>
    <groupId>com.hankcs.nlp</groupId>
    <artifactId>hanlp-solr-plugin</artifactId>
    <!-- add a <version> here unless it is managed elsewhere, e.g. in dependencyManagement -->
</dependency>

2.3 Configuration file hanlp.properties

The main setting is the location of the lexicon data, root=D:/HanLP/. The file contents are as follows:

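# Root directory for the paths in this file; root + other path = absolute path
# Windows users: always use / as the path separator
root=D:/HanLP/

# Core dictionary path
CoreDictionaryPath=data/dictionary/CoreNatureDictionary.txt

# Bigram dictionary path
BiGramDictionaryPath=data/dictionary/CoreNatureDictionary.ngram.txt

# Stop-word dictionary path
CoreStopWordDictionaryPath=data/dictionary/stopwords.txt

# Synonym dictionary path
CoreSynonymDictionaryDictionaryPath=data/dictionary/synonym/CoreSynonym.txt

# Person-name dictionary path
PersonDictionaryPath=data/dictionary/person/nr.txt

# Person-name dictionary transition matrix path
PersonDictionaryTrPath=data/dictionary/person/nr.tr.txt

# Traditional/Simplified Chinese dictionary path
TraditionalChineseDictionaryPath=data/dictionary/tc/TraditionalChinese.txt

# Custom dictionary paths. Separate multiple dictionaries with ;. A leading space means the file is in the same directory as the previous entry. The form "filename nature" sets that dictionary's default part of speech. Priority decreases from left to right.
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; 現代漢語補充詞庫.txt; 全國地名大全.txt ns; 人名詞典.txt; 機構名詞典.txt; 上海地名.txt ns;data/dictionary/person/nrf.txt nrf

# CRF segmentation model path
CRFSegmentModelPath=data/model/segment/CRFSegmentModel.txt

# HMM segmentation model
HMMSegmentModelPath=data/model/segment/HMMSegmentModel.bin

# Whether segmentation results show the part of speech
ShowTermNature=true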
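The path rules in the comments above (root plus relative path, a leading space meaning "same directory as the previous entry", and an optional default part of speech after the file name) can be easier to follow in code. Below is a minimal sketch in plain Java, with no HanLP dependency, of how such a value could be resolved. The class `CustomDictionaryPathDemo`, the `Entry` type, the `resolve` method and the file names `extra.txt` and `places.txt` are hypothetical names chosen for illustration; HanLP's own loader may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the path rules described in the hanlp.properties comments.
class CustomDictionaryPathDemo {

    // One resolved dictionary: absolute path plus optional default part of speech (null if none).
    static final class Entry {
        final String path;
        final String defaultNature;

        Entry(String path, String defaultNature) {
            this.path = path;
            this.defaultNature = defaultNature;
        }

        @Override
        public String toString() {
            return defaultNature == null ? path : path + " [" + defaultNature + "]";
        }
    }

    // Splits the value on ';' and resolves each entry against root.
    static List<Entry> resolve(String root, String customDictionaryPath) {
        List<Entry> result = new ArrayList<>();
        String currentDir = "";
        for (String raw : customDictionaryPath.split(";")) {
            if (raw.trim().isEmpty()) {
                continue;
            }
            // A leading space means "same directory as the previous entry".
            boolean sameDir = raw.startsWith(" ");
            // "file nature": the optional second part is the default part of speech.
            String[] parts = raw.trim().split("\\s+", 2);
            String file = parts[0];
            String nature = parts.length > 1 ? parts[1] : null;
            String relative;
            if (sameDir) {
                relative = currentDir + file;
            } else {
                relative = file;
                int slash = file.lastIndexOf('/');
                currentDir = slash >= 0 ? file.substring(0, slash + 1) : "";
            }
            result.add(new Entry(root + relative, nature));
        }
        return result;
    }

    public static void main(String[] args) {
        String root = "D:/HanLP/";
        // Hypothetical value with the same shape as CustomDictionaryPath.
        String value = "data/dictionary/custom/CustomDictionary.txt; extra.txt;"
                + " places.txt ns;data/dictionary/person/nrf.txt nrf";
        for (Entry e : resolve(root, value)) {
            System.out.println(e);
        }
    }
}
```

In this sketch the order of the returned list is the priority order, highest first, matching the "priority decreases" note in the configuration comment.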

3. Code example: using HanLP directly

3.1 Add the Maven dependency

<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.2.8</version>
</dependency>

3.2 Code

import com.hankcs.hanlp.HanLP;

public class HanlpMain {
    public static void main(String[] args) {
        String text = "比你聰明的人，請不要讓他還比你努力";
        String traditionText = "比妳聰明的人，請不要讓他還比妳努力";

        System.out.println(HanLP.segment(text)); // word segmentation
        System.out.println(HanLP.extractKeyword(text, 2)); // extract keywords, specifying how many to return
        System.out.println(HanLP.extractPhrase(text, 2)); // extract phrases, specifying how many to return
        System.out.println(HanLP.extractSummary(text, 2)); // extract summary sentences, specifying how many to return
        System.out.println(HanLP.getSummary(text, 10)); // build a summary, specifying its maximum length
        System.out.println(HanLP.convertToTraditionalChinese(text)); // Simplified to Traditional Chinese
        System.out.println(HanLP.convertToSimplifiedChinese(traditionText)); // Traditional to Simplified Chinese
        System.out.println(HanLP.convertToPinyinString(text, " ", false)); // convert to pinyin
    }
}

Output:

[比/p, 你/r, 聰明/a, 的/uj, 人/n, ，/w, 請/v, 不/d, 要/v, 讓/v, 他/r, 還/d, 比/p, 你/r, 努力/ad]

[聰明, 努力]

[]

[請不要讓他還比你努力]

請不要讓他還比你努力。

比妳聰明的人，請不要讓他還比妳努力

比你聰明的人，請不要讓他還比你努力

bi ni cong ming de ren qing bu yao rang ta hai bi ni nu li

4. Example: using HanLP with Lucene

4.1 Add the Maven dependencies

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>${lucene.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>${lucene.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>${lucene.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>${lucene.version}</version>
</dependency>
<dependency>
    <groupId>com.hankcs.nlp</groupId>
    <artifactId>hanlp-lucene-plugin</artifactId>
    <!-- add a <version> here unless it is managed elsewhere, e.g. in dependencyManagement -->
</dependency>

4.2 Configuration file hanlp.properties

Place hanlp.properties on the classpath (the resources directory is sufficient). The file contents are as follows:

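# Root directory for the paths in this file; root + other path = absolute path
# Windows users: always use / as the path separator
root=D:/HanLP/

# Core dictionary path
CoreDictionaryPath=data/dictionary/CoreNatureDictionary.txt

# Bigram dictionary path
BiGramDictionaryPath=data/dictionary/CoreNatureDictionary.ngram.txt

# Stop-word dictionary path
CoreStopWordDictionaryPath=data/dictionary/stopwords.txt

# Synonym dictionary path
CoreSynonymDictionaryDictionaryPath=data/dictionary/synonym/CoreSynonym.txt

# Person-name dictionary path
PersonDictionaryPath=data/dictionary/person/nr.txt

# Person-name dictionary transition matrix path
PersonDictionaryTrPath=data/dictionary/person/nr.tr.txt

# Traditional/Simplified Chinese dictionary path
TraditionalChineseDictionaryPath=data/dictionary/tc/TraditionalChinese.txt

# Custom dictionary paths. Separate multiple dictionaries with ;. A leading space means the file is in the same directory as the previous entry. The form "filename nature" sets that dictionary's default part of speech. Priority decreases from left to right.
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; 現代漢語補充詞庫.txt; 全國地名大全.txt ns; 人名詞典.txt; 機構名詞典.txt; 上海地名.txt ns;data/dictionary/person/nrf.txt nrf

# CRF segmentation model path
CRFSegmentModelPath=data/model/segment/CRFSegmentModel.txt

# HMM segmentation model
HMMSegmentModelPath=data/model/segment/HMMSegmentModel.bin

# Whether segmentation results show the part of speech
ShowTermNature=true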
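Before running the Lucene example, it can help to confirm that hanlp.properties really is visible on the classpath and that its root entry points where you expect. The following is a minimal sketch using only the JDK; the class name `HanlpPropertiesCheck` and the method `readRoot` are illustrative names, and the code only checks visibility of the file. It does not reproduce how HanLP itself loads its configuration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only: checks that hanlp.properties is on the classpath and reads its root entry.
class HanlpPropertiesCheck {

    // Loads a properties stream and returns the value of "root", or null if the key is absent.
    static String readRoot(InputStream in) {
        Properties props = new Properties();
        try {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("root");
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        // getResourceAsStream returns null when the resource cannot be found.
        try (InputStream in = loader.getResourceAsStream("hanlp.properties")) {
            if (in == null) {
                System.out.println("hanlp.properties is not on the classpath");
            } else {
                System.out.println("root = " + readRoot(in));
            }
        }
    }
}
```

In a Maven project, a file placed in src/main/resources is copied to the classpath root at build time, which is why the resources directory works here.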

4.3 Example

import com.hankcs.lucene.HanLPAnalyzer;
import com.hankcs.lucene.HanLPIndexAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class LuceneHanlpMain {
    public static void main(String[] args) throws Exception {
        String text = "少年強則中國強";

        // Standard analyzer: long words are not split further
        Analyzer analyzer = new HanLPAnalyzer();
        TokenStream ts = analyzer.tokenStream("field", text);
        ts.reset();
        while (ts.incrementToken()) {
            CharTermAttribute attribute = ts.getAttribute(CharTermAttribute.class); // the term text of a token
            OffsetAttribute offsetAttribute = ts.getAttribute(OffsetAttribute.class); // start and end offsets
            PositionIncrementAttribute positionIncrementAttribute = ts.getAttribute(PositionIncrementAttribute.class); // position increment
            System.out.println(attribute + " "
                    + offsetAttribute.startOffset() + " " + offsetAttribute.endOffset() + " "
                    + positionIncrementAttribute.getPositionIncrement());
        }
        ts.end(); // finish the stream before closing it
        ts.close();
        System.out.println();

        // Index analyzer: long words are also split into all of their sub-words
        Analyzer indexAnalyzer = new HanLPIndexAnalyzer();
        TokenStream indexTs = indexAnalyzer.tokenStream("field", text);
        indexTs.reset();
        while (indexTs.incrementToken()) {
            CharTermAttribute attribute = indexTs.getAttribute(CharTermAttribute.class); // the term text of a token
            OffsetAttribute offsetAttribute = indexTs.getAttribute(OffsetAttribute.class); // start and end offsets
            PositionIncrementAttribute positionIncrementAttribute = indexTs.getAttribute(PositionIncrementAttribute.class); // position increment
            System.out.println(attribute + " "
                    + offsetAttribute.startOffset() + " " + offsetAttribute.endOffset() + " "
                    + positionIncrementAttribute.getPositionIncrement());
        }
        indexTs.end(); // finish the stream before closing it
        indexTs.close();
        System.out.println();

        // Inspect the segmentation result through a parsed query
        QueryParser queryParser = new QueryParser("txt", analyzer);
        Query query = queryParser.parse(text);
        System.out.println(query.toString("txt"));

        queryParser = new QueryParser("txt", indexAnalyzer);
        query = queryParser.parse(text);
        System.out.println(query.toString("txt"));
    }
}

Output:

少年強 0 3 1

則 3 4 1

中國 4 6 1

強 6 7 1

少年強 0 3 1

少年 0 2 1

則 3 4 1

中國 4 6 1

強 6 7 1

少年強則中國強

少年強少年則中國強
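Each token line in the output above lists the term, its start offset, its end offset and its position increment, in the order the example prints them. In Lucene the start offset is inclusive and the end offset is exclusive, so the pair can be used directly to cut the term out of the source text. Here is a minimal sketch using only the JDK; the class `OffsetCheckDemo` and the method `slice` are illustrative names, and the offset pairs are copied from the index analyzer output above.

```java
// Illustrative only: shows how the start/end offsets printed above index into the source text.
class OffsetCheckDemo {

    // Returns the part of the text covered by the half-open range [startOffset, endOffset).
    static String slice(String text, int startOffset, int endOffset) {
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        String text = "少年強則中國強";
        // {startOffset, endOffset} pairs copied from the index analyzer output.
        int[][] offsets = { {0, 3}, {0, 2}, {3, 4}, {4, 6}, {6, 7} };
        for (int[] o : offsets) {
            System.out.println(slice(text, o[0], o[1]) + " " + o[0] + " " + o[1]);
        }
    }
}
```

The pairs {0, 3} and {0, 2} overlap: the index analyzer emits both the long word and a sub-word contained in it, which is the difference from the standard analyzer visible in the output.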
