Getting Started with Natural Language Processing in Python: An NLP Journey from Scratch

2025-08-13


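1. What Is Natural Language Processing (NLP)?

Natural language processing (NLP) is a major branch of artificial intelligence that enables computers to understand, analyze, and generate human language. From speech recognition in smart speakers to sentiment analysis on social media, from machine translation to customer-service chatbots, NLP has worked its way into many corners of daily life.

Imagine asking Siri "What's the weather like today?" or seeing "recommended based on your browsing history" on Taobao: NLP is at work behind both. The core challenge is the complexity of human language: ambiguity ("苹果", "apple", can be a fruit or a company), context dependence ("他走了" can mean "he left" or "he walked"), and lack of structure (text is a sequence of characters, not tabular data).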

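2. An Overview of Basic NLP Tasks

NLP is built from a set of core tasks. Just as people learn letters and words before sentences, a computer processes language step by step:

Tokenization: splitting continuous text into meaningful word units
Example: "我爱自然语言处理" → ["我", "爱", "自然语言", "处理"]
Part-of-speech (POS) tagging: labeling each word with its part of speech (noun, verb, adjective, and so on)
Example: "他吃苹果" → [("他", "pronoun"), ("吃", "verb"), ("苹果", "noun")]
Named entity recognition (NER): identifying proper names in text (people, places, organizations, and so on)
Example: "小明在北京大学学习" → person: "小明", organization: "北京大学"
Sentiment analysis: judging the emotional polarity of a text (positive / negative / neutral)
Example: "这部电影太精彩了！" ("This movie is wonderful!") → positive sentiment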
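To make the tokenization example above concrete, here is a minimal sketch of forward maximum matching, one classic dictionary-based approach to Chinese word segmentation. The tiny `VOCAB` set, the function name `forward_max_match`, and the `max_word_len` parameter are assumptions made for illustration; real segmenters such as Jieba combine far larger dictionaries with statistical models.

```python
# Toy dictionary (an assumption for this sketch, not a real lexicon).
VOCAB = {"我", "爱", "自然语言", "处理"}


def forward_max_match(text, vocab=VOCAB, max_word_len=4):
    """Greedily take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if candidate in vocab or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(forward_max_match("我爱自然语言处理"))
# ['我', '爱', '自然语言', '处理']
```

The greedy strategy is easy to follow but fails on genuinely ambiguous strings, which is one reason practical tools add probabilistic scoring on top of a dictionary.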

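3. Essential Python NLP Libraries

3.1 NLTK: The "Swiss Army Knife" of NLP

NLTK (Natural Language Toolkit) is the classic Python NLP library and is often described as the standard tool for teaching NLP. It ships with more than 50 corpora and lexical resources and covers everything from basic text processing to more involved semantic analysis.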

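Installation and Basic Setup

pip install nltk  # install the library

import nltk
# Download the required data packages (first use only)
nltk.download('punkt')      # tokenizer model
nltk.download('stopwords')  # stopword lists (high-frequency words such as "the" and "is" that carry little meaning)
nltk.download('averaged_perceptron_tagger')  # POS tagging model
# On NLTK 3.9 and later the equivalent resources are named 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'; download those if a LookupError asks for them.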

Basic Usage Examples

1. Tokenization

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Natural language processing is fascinating! It allows computers to understand human language."
# Split into sentences
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Split into words
words = word_tokenize(text)
print("Words:", words)

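2. Removing Stopwords

from nltk.corpus import stopwords

# Load the English stopword list
stop_words = set(stopwords.words('english'))
# Filter out stopwords (reuses `words` from the tokenization example)
filtered_words = [w for w in words if w.lower() not in stop_words]
print("Filtered words:", filtered_words)  # "is", "It", "to" and similar words are removed; punctuation tokens remain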
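One way to see why stopword removal matters is to count word frequencies before and after filtering. The sketch below uses only the standard library; the token list and the small `STOP_WORDS` set are hand-written assumptions standing in for `word_tokenize` output and NLTK's full English stopword list, and `top_words` is a hypothetical helper.

```python
from collections import Counter

# Hand-written stand-ins (assumptions) for tokenizer output and NLTK's stopword list.
tokens = ["it", "is", "a", "great", "movie", "and", "it", "is", "a", "great", "story"]
STOP_WORDS = {"it", "is", "a", "and", "the", "to"}


def top_words(tokens, stop_words=frozenset(), n=2):
    """Return the n most frequent tokens that are not stopwords."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in stop_words)
    return counts.most_common(n)


print(top_words(tokens))              # function words dominate the ranking
print(top_words(tokens, STOP_WORDS))  # content words surface after filtering
```

Without filtering, the most frequent entries are words that say nothing about the topic, which is the noise that stopword lists are meant to remove.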

3. Part-of-Speech Tagging

from nltk.tag import pos_tag

# Tag each token from the tokenization example with its part of speech
tagged_words = pos_tag(words)
print("POS tags:", tagged_words)
# Sample output: [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ...]
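The tags in that output, such as `JJ` and `NN`, come from the Penn Treebank tagset, where related tags share a prefix (`NN`, `NNS`, `NNP` are all nouns; `VB`, `VBD`, `VBG` are all verbs). The following is a minimal sketch of collapsing them into coarse categories; `COARSE_TAGS` and `coarse_pos` are hypothetical names introduced here, and the mapping covers only four prefixes.

```python
# Two-letter prefixes of Penn Treebank tags mapped to coarse categories.
COARSE_TAGS = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def coarse_pos(tag):
    """Map a Penn Treebank tag such as 'NNS' or 'VBD' to a coarse category."""
    return COARSE_TAGS.get(tag[:2], "other")


# Sample pairs in the shape that pos_tag returns.
tagged = [("Natural", "JJ"), ("language", "NN"), ("processing", "NN")]
nouns = [word for word, tag in tagged if coarse_pos(tag) == "noun"]
print(nouns)  # ['language', 'processing']
```

A filter like this is a common first step when only nouns or only verbs are of interest, for example in keyword extraction.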

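NLTK strengths: extensive documentation (including the official tutorial) and strong community support, which make it well suited to learning the underlying principles
NLTK limitations: relatively slow, so it is a poor fit for large-scale text processing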

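3.2 spaCy: An Industrial-Strength NLP Engine

spaCy is an industrial-strength NLP library that remains widely used in 2025 and is known for its speed and accuracy. It ships pretrained models, supports more than 70 languages, and handles tokenization, NER, and similar tasks out of the box.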

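Installation and Model Download

pip install spacy  # install the library
python -m spacy download en_core_web_sm  # download the small English model (roughly 12 MB in recent releases)
# python -m spacy download zh_core_web_sm  # Chinese model (a separate download)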

Core Features in Action

1. Named Entity Recognition

import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Extract the named entities
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Output:
# Apple: ORG (organization)
# U.K.: GPE (country / region)
# $1 billion: MONEY (monetary value)
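Downstream code usually wants entities grouped by type rather than printed one by one. The sketch below shows one way to do that; `group_entities` is a hypothetical helper, and the pairs are hard-coded to mirror the output above so the snippet runs without spaCy installed.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (text, label) pairs into {label: [texts, ...]}, preserving order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Hard-coded pairs mirroring the NER output, so no model is needed here.
sample = [("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY")]
print(group_entities(sample))
```

With a real spaCy document, the same helper could be called as `group_entities((ent.text, ent.label_) for ent in doc.ents)`.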

2. Dependency Parsing

# Visualize the grammatical structure of the sentence (run inside Jupyter)
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)

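spaCy strengths: much faster than NLTK (the figure cited here is roughly 10x, on the order of 100,000 characters per second, though real throughput depends on the model and hardware), supports Chinese, and suits production environments
spaCy limitations: pretrained models must be downloaded separately and can be large (from roughly 12 MB for the small English model to hundreds of megabytes for the large and transformer models)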

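3.3 Choosing a Tool

Scenario	Recommended tool	Reason
Learning NLP fundamentals	NLTK	Transparent code, plentiful tutorials
Processing Chinese text	spaCy + Chinese model / Jieba	Word segmentation tuned specifically for Chinese
Production deployment	spaCy	Fast, low memory footprint
Teaching / research	NLTK	Supports experiments with custom algorithms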

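4. A Sentiment Analysis Project from Scratch

We now put the pieces together and build a sentiment analysis system on the IMDB movie review dataset, one that automatically decides whether a review is positive or negative.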

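4.1 Project Setup

1. Download the dataset
The IMDB dataset contains 50,000 labeled movie reviews (25,000 for training, 25,000 for testing):
http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
After downloading, extract the archive locally to get folders such as train/pos (positive reviews) and train/neg (negative reviews).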
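Before going further, it can help to confirm that the archive was extracted where the code expects it. The sketch below counts review files per label; `count_reviews` is a hypothetical helper, and the `aclImdb/train` path assumes the archive was extracted into the current working directory. For the full training split, each label should report 12,500 files.

```python
from pathlib import Path


def count_reviews(data_dir):
    """Count .txt review files under the pos/ and neg/ subfolders of data_dir."""
    root = Path(data_dir)
    counts = {}
    for label in ("pos", "neg"):
        folder = root / label
        # A missing folder counts as zero instead of raising an error.
        counts[label] = len(list(folder.glob("*.txt"))) if folder.is_dir() else 0
    return counts


if __name__ == "__main__":
    # Adjust the path to wherever the archive was extracted.
    print(count_reviews("aclImdb/train"))
```

A result of zero for either label usually means the path is wrong or the extraction did not finish.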

2. Install the required libraries

pip install nltk pandas scikit-learn

4.2 Data Preprocessing

import os
import random
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Read the dataset
def load_imdb_data(data_dir):
    texts = []
    labels = []
    # Walk the positive and negative review folders
    for label in ['pos', 'neg']:
        folder_path = os.path.join(data_dir, label)
        for file in os.listdir(folder_path):
            with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
                texts.append(f.read())
                labels.append(1 if label == 'pos' else 0)  # positive = 1, negative = 0
    # Shuffle texts and labels together so the pairs stay aligned
    combined = list(zip(texts, labels))
    random.shuffle(combined)
    texts, labels = zip(*combined)
    return list(texts), list(labels)

# Load the training set (first 1,000 shuffled reviews for the demo; drop the slice to use the full set)
train_texts, train_labels = load_imdb_data('aclImdb/train')
train_texts, train_labels = train_texts[:1000], train_labels[:1000]

# Text preprocessing function
def preprocess_text(text):
    # Lowercase and tokenize
    tokens = word_tokenize(text.lower())
    # Drop stopwords and tokens that are not purely alphabetic
    tokens = [token for token in tokens if token.isalpha() and token not in STOP_WORDS]
    return ' '.join(tokens)  # join back into a string for feature extraction

# Preprocess every text
processed_texts = [preprocess_text(text) for text in train_texts]
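One detail worth handling in this step: the raw IMDB reviews contain HTML line-break tags (`<br />`). After tokenization, the `br` fragment is purely alphabetic and is not a stopword, so it can slip through the filter as a frequent junk token. A minimal sketch of removing the tags first is shown below; `strip_html_breaks` is a hypothetical helper, not part of NLTK.

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_html_breaks(text):
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


raw = "Great film!<br /><br />Would watch again."
print(strip_html_breaks(raw))  # Great film! Would watch again.
```

Calling this on each review before `word_tokenize` keeps markup out of the vocabulary that the vectorizer builds later.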

4.3 Feature Extraction and Model Training

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Convert text into numeric features (TF-IDF)
vectorizer = TfidfVectorizer(max_features=5000)  # keep the 5,000 most frequent terms across the corpus
X_train = vectorizer.fit_transform(processed_texts)
y_train = train_labels

# Train a naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Quick sanity test
test_reviews = [
    "This movie was amazing! The acting was superb and the plot was gripping.",
    "Terrible film. I walked out after 10 minutes. Waste of money."
]
processed_tests = [preprocess_text(review) for review in test_reviews]
X_test = vectorizer.transform(processed_tests)  # reuse the fitted vocabulary; do not refit
predictions = model.predict(X_test)

for review, pred in zip(test_reviews, predictions):
    print(f"Review: {review[:50]}...")
    print(f"Predicted sentiment: {'positive' if pred == 1 else 'negative'}\n")
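To show what the TF-IDF step computes, here is a pure-Python sketch that follows scikit-learn's documented default weighting as I understand it: raw term counts multiplied by a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, with each document vector then scaled to unit L2 length. The `tfidf` function and the whitespace tokenization are assumptions for illustration; `TfidfVectorizer` uses its own token pattern and returns a sparse matrix.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # Guard against division by zero for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = ["great movie great acting", "terrible movie", "great plot"]
for vec in tfidf(docs):
    print({term: round(w, 3) for term, w in vec.items()})
```

In the second document, "terrible" receives a higher weight than "movie" because it appears in fewer documents, which is the property that lets a classifier focus on distinctive words.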

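4.4 Directions for Improvement

Use a more powerful model: replace naive Bayes with an SVM or a deep learning model such as an LSTM
Improve preprocessing: add lemmatization (reducing "running" to "run")
Tune parameters: adjust TF-IDF's max_features and the model's hyperparameters
Extend to Chinese: swap IMDB for the ChnSentiCorp Chinese sentiment dataset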
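As a rough illustration of the first suggestion, the sketch below swaps naive Bayes for a linear SVM and measures accuracy on a held-out split. It assumes scikit-learn is installed. The toy corpus is invented for this example and repeats the same sentences, so the printed score is not meaningful; in the project, `processed_texts` and `train_labels` would take its place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus (an assumption for this sketch); replace with the real reviews and labels.
texts = [
    "amazing acting and a gripping plot", "superb film loved it",
    "wonderful story great cast", "brilliant and moving",
    "terrible film waste of money", "boring plot awful acting",
    "dull and painfully slow", "worst movie ever",
] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training split only, then the classifier.
svm_model = make_pipeline(TfidfVectorizer(max_features=5000), LinearSVC())
svm_model.fit(X_train, y_train)

val_accuracy = accuracy_score(y_val, svm_model.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline keeps the validation data out of the vocabulary-fitting step, which makes comparisons between models fairer.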

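5. Recommended NLP Learning Resources for 2025

Introductory Tutorials

CSDN blog: 《Python自然语言处理入门指南:从基础到实战》 (A Beginner's Guide to Python NLP: From Basics to Practice)
https://blog.csdn.net/2501_91483145/article/details/148747780 (updated June 2025)
Udemy course: "2025 Natural Language Processing (NLP) Mastery in Python"
https://www.udemy.com/course/nlp-in-python/ (includes 38 hours of video tutorials)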