Installing and Using the NLP Tool NLTK

This is an automatically translated post by LLM. The original post is in Chinese. If you find any translation errors, please leave a comment to help me improve the translation. Thanks!

Introduction to NLTK

NLTK (Natural Language Toolkit) is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources (such as WordNet), along with text processing libraries for classification, tokenization, stemming, part-of-speech tagging, parsing, and semantic reasoning.

Package Installation

First, install the NLTK package using pip:

pip install nltk

You can use the Tsinghua (TUNA) PyPI mirror to speed up the installation:

pip install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple
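To confirm that the installation worked, a small check like the one below can help. It is only a sketch: the helper name `installed_version` is made up for this post, and it relies on the standard-library `importlib` modules rather than on anything NLTK-specific.

```python
import importlib.metadata
import importlib.util


def installed_version(package_name):
    """Return the installed version of a package, or None if it cannot be imported."""
    if importlib.util.find_spec(package_name) is None:
        return None
    try:
        return importlib.metadata.version(package_name)
    except importlib.metadata.PackageNotFoundError:
        # Importable, but no distribution metadata (for example a stdlib module)
        return "unknown"


if __name__ == "__main__":
    version = installed_version("nltk")
    if version is None:
        print("nltk is not installed")
    else:
        print(f"nltk {version} is installed")
```

If this reports that nltk is not installed, check that `pip` and `python` belong to the same environment.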

Downloading NLTK Data

After installing the NLTK package, you also need to download the NLTK data (corpora and models) before you can use it.

Open the Python command line and run the following commands (you can also put them in a new Python file and run it):

import nltk
nltk.download()

The NLTK downloader window will appear.

At first, this list is blank. Click "refresh" in the lower right corner to display the nltk-data list.

Click "Download" in the lower left corner to start downloading the data. After the download is complete, you can use it normally.
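If no graphical interface is available (for example on a remote server), the data can also be fetched by passing package IDs to `nltk.download`. The sketch below assumes you only need a few packages; the helper name `download_packages` is made up for this post, and `punkt` and `wordnet` are just example package IDs.

```python
def download_packages(package_ids, download_dir=None):
    """Download the given NLTK data packages without opening the GUI."""
    import nltk  # imported lazily so this file loads even if nltk is missing

    results = {}
    for package_id in package_ids:
        # nltk.download returns a boolean indicating whether the download succeeded
        results[package_id] = nltk.download(
            package_id, download_dir=download_dir, quiet=True
        )
    return results


if __name__ == "__main__":
    # Example package IDs; replace them with the ones you actually need
    print(download_packages(["punkt", "wordnet"]))
```

Leaving `download_dir` as `None` lets NLTK pick its default data directory.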

Accelerated Download in China

When downloading from within China, you may run into DNS resolution failures or other errors during the download. The most convenient workaround is as follows:

  • Run one of the following commands to clone nltk_data to your local machine (about 700 MB):

    git clone https://github.com/nltk/nltk_data.git
    # If GitHub is unreachable, clone from one of the following mirrors instead
    git clone http://gitclone.com/github.com/nltk/nltk_data.git
    git clone https://hub.fastgit.org/nltk/nltk_data.git
  • Go into the cloned nltk_data directory and edit the index.xml file there so that the package URLs point to your local machine instead of GitHub. Replace every occurrence of

    s://raw.githubusercontent.com/nltk/nltk_data/gh-pages

    with:

    ://localhost:8000
  • Run the following command in this directory:

    python -m http.server 8000

    This starts an HTTP server on the local machine that serves the nltk_data files. The NLTK downloader can then fetch the packages it needs from the local address.

  • Run the following statements in Python again:

    import nltk
    nltk.download()
  • In the downloader window, replace the Server Index address with http://localhost:8000/index.xml.

    Click "refresh" and "Download" in turn to start the installation.
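The index.xml edit and the final download step can also be scripted. The following is a rough sketch, not a tested recipe: the function names are made up for this post, it assumes every package URL in index.xml starts with the https prefix shown above, and it assumes the local server from the earlier step is already running on port 8000. Replacing the full https prefix has the same effect as the partial string replacement described in the steps.

```python
from pathlib import Path

OLD_PREFIX = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages"
NEW_PREFIX = "http://localhost:8000"


def point_index_to_localhost(index_path, old=OLD_PREFIX, new=NEW_PREFIX):
    """Rewrite package URLs in index.xml in place; return the number of replacements."""
    path = Path(index_path)
    text = path.read_text(encoding="utf-8")
    count = text.count(old)
    path.write_text(text.replace(old, new), encoding="utf-8")
    return count


def download_from_local_server(package_id, index_url=NEW_PREFIX + "/index.xml"):
    """Fetch one package through the local server instead of the GUI."""
    from nltk.downloader import Downloader  # requires nltk to be installed

    return Downloader(server_index_url=index_url).download(package_id)


if __name__ == "__main__":
    # Run this from inside the cloned nltk_data directory
    replaced = point_index_to_localhost("index.xml")
    print(f"Rewrote {replaced} package URLs")
```

Once the server is running, calling `download_from_local_server("wordnet")` should fetch that package from the local machine; `wordnet` is only an example package ID.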
