NLP Course Design
This project contains the defense script and the design-process implementation for the NLP course design offered at TYUT.
GloVe
Overview and introduction:
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new ==global log-bilinear regression model== that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by ==training only on the nonzero elements in a word-word co-occurrence matrix==, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
Core idea and advantages:
The two main model families for learning word vectors are:
- global matrix factorization methods, such as latent semantic analysis (LSA) (Deerwester et al., 1990) and
- local context window methods, such as the skip-gram model of Mikolov et al. (2013c).
Currently, both families suffer significant drawbacks.
While methods like LSA efficiently leverage statistical information, they do relatively ==poorly on the word analogy task==, indicating a sub-optimal vector space structure.
Methods like skip-gram may do better on the analogy task, but they ==poorly utilize the statistics of the corpus== since they train on separate local context windows instead of on global co-occurrence counts.
In this work, we analyze the model properties necessary to produce linear directions of meaning and argue that global log-bilinear regression models are appropriate for doing so. We propose a specific weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful substructure, as evidenced by its state-of-the-art performance of 75% accuracy on the word analogy dataset. We also demonstrate that our methods outperform other current methods on several word similarity tasks, and also on a common named entity recognition (NER) benchmark.
Reference: https://nlp.stanford.edu/pubs/glove.pdf
Related work
Matrix factorization methods for generating low-dimensional word representations have roots stretching as far back as LSA. These methods utilize low-rank approximations to decompose large matrices that capture statistical information about a corpus. The particular type of information captured by such matrices varies by application. In LSA, the matrices are of “term-document” type, i.e., the rows correspond to words or terms, and the columns correspond to different documents in the corpus. In contrast, the Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996) utilizes matrices of “term-term” type: the rows and columns correspond to words, and the entries count the number of times a given word occurs in the context of another.
A main problem with HAL and related methods is that the most frequent words contribute a disproportionate amount to the similarity measure: the number of times two words co-occur with “the” or “and”, for example, will have a large effect on their similarity despite conveying relatively little about their semantic relatedness.
A number of techniques exist that address this shortcoming of HAL, such as the COALS method (Rohde et al., 2006), in which the co-occurrence matrix is first transformed by an entropy- or correlation-based normalization. An advantage of this type of transformation is that the raw co-occurrence counts, which for a reasonably sized corpus might span 8 or 9 orders of magnitude, are compressed so as to be distributed more evenly in a smaller interval. A variety of newer models also pursue this approach, including a study (Bullinaria and Levy, 2007) that indicates that positive pointwise mutual information (PPMI) is a good transformation. More recently, a square root type transformation in the form of Hellinger PCA (HPCA) (Lebret and Collobert, 2014) has been suggested as an effective way of learning word representations.
For a reasonably sized corpus, the co-occurrence counts of word pairs can range from 1 to 100 million (i.e., from $10^0$ to $10^8$), spanning 8 orders of magnitude.
Collobert and Weston (2008) decoupled the word vector training from the downstream training objectives, which paved the way for Collobert et al. (2011) to use the full context of a word for learning the word representations, rather than just the preceding context as is the case with language models.
Recently, the importance of the full neural network structure for learning useful word representations has been called into question.
The skip-gram and continuous bag-of-words (CBOW) models of Mikolov et al. (2013a) propose a simple single-layer architecture based on the inner product between two word vectors. Mnih and Kavukcuoglu (2013) also proposed closely-related vector log-bilinear models, vLBL and ivLBL, and Levy et al. (2014) proposed explicit word embeddings based on a PPMI metric. In the skip-gram and ivLBL models, the objective is to predict a word’s context given the word itself, whereas the objective in the CBOW and vLBL models is to predict a word given its context.
Through evaluation on a word analogy task, these models demonstrated the capacity to learn linguistic patterns as linear relationships between the word vectors.
Unlike the matrix factorization methods, the shallow window-based methods suffer from the disadvantage that they do not operate directly on the co-occurrence statistics of the corpus. Instead, these models scan context windows across the entire corpus, which fails to take advantage of the vast amount of repetition in the data.
Core idea and mathematical derivation:
The mathematical reasoning behind GloVe (Global Vectors for Word Representation) proceeds in the following steps:
Building the co-occurrence matrix: from the corpus, build a co-occurrence matrix $X$ whose element $X_{ij}$ is the number of times word $i$ and context word $j$ appear together within a given context window.
The approximation formula: propose an approximate relation between the word vectors and the co-occurrence matrix:
$$
w_i^T \tilde{w}_j + b_i + \tilde{b}_j = \log(X_{ij})
$$
where $w_i$ and $\tilde{w}_j$ are word vectors, and $b_i$ and $\tilde{b}_j$ are bias terms. Building the loss function:
$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2
$$
The loss uses a weighting function $f(X_{ij})$ whose purpose is to give frequently co-occurring pairs more weight while preventing that weight from growing without bound. Form of the weighting function: a piecewise function is chosen:
$$
f(x) = \begin{cases}
(x/x_{max})^\alpha & \text{if } x < x_{max} \\
1 & \text{otherwise}
\end{cases}
$$
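As a quick illustration, here is a minimal sketch of this weighting function (using the paper's typical values $x_{max}=100$, $\alpha=0.75$):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): grows as (x / x_max)**alpha, capped at 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

# Rare pairs get small weight, frequent pairs are capped at 1:
print(glove_weight([1, 10, 100, 1000]))  # [0.0316 0.1778 1. 1.]
```

Note that all pairs beyond $x_{max}$ receive the same weight, so extremely common co-occurrences (e.g., with “the”) cannot dominate the loss.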
Reference: https://www.fanyeong.com/2018/02/19/glove-in-detail/
Extension: Example of Co-occurrence Matrices
As an example, let’s use three sentences below:
Apples are green and red.
Red apples are sweet.
Green oranges are sour.
Let’s assume that these three sentences are our text corpus, the elements are words, and the context is one sentence. This means that for each pair of words from the sentences above, we count how many times they appear together in one sentence. For example, the words “apples” and “red” appear together twice (in the first and second sentences), while the words “red” and “sour” never appear in the same sentence.
Following that logic, our co-occurrence matrix (restricted here to the content words) will look like this:

| | apples | green | red | sweet | oranges | sour |
|---|---|---|---|---|---|---|
| apples | – | 1 | 2 | 1 | 0 | 0 |
| green | 1 | – | 1 | 0 | 1 | 1 |
| red | 2 | 1 | – | 1 | 0 | 0 |
| sweet | 1 | 0 | 1 | – | 0 | 0 |
| oranges | 0 | 1 | 0 | 0 | – | 1 |
| sour | 0 | 1 | 0 | 0 | 1 | – |
From the table above, we can notice that the co-occurrence matrix is symmetric. It means that the value with row X and column Y will be the same as the value with row Y and column X. In general, we don’t need to keep all elements from the text corpus in the co-occurrence matrix but only those of interest.
For instance, before creating a matrix, it is useful to clean the text: remove stop words, apply stemming and lemmatization, and so on.
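To make the counting concrete, here is a minimal sketch that builds the co-occurrence counts for the three sentences above (restricted, as in the table, to the content words; the cleaning choices are assumptions for illustration):

```python
from collections import Counter
from itertools import combinations

corpus = [
    "Apples are green and red.",
    "Red apples are sweet.",
    "Green oranges are sour.",
]
vocab = {"apples", "green", "red", "sweet", "oranges", "sour"}

counts = Counter()
for sentence in corpus:
    words = {w.strip(".").lower() for w in sentence.split()}
    for a, b in combinations(sorted(words & vocab), 2):
        counts[(a, b)] += 1  # context = the whole sentence

for pair, c in sorted(counts.items()):
    print(pair, c)
# ('apples', 'red') 2   ... while ('red', 'sour') never co-occurs
```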
Extension: Concrete Steps of Entropy-Style Normalization
Entropy-style normalization rescales each element of the co-occurrence matrix using information-theoretic quantities, which effectively reduces the influence of high-frequency elements on the overall computation. The steps, followed by a concrete example, are given below.
Steps
Build the co-occurrence matrix:
- First, build a word-word co-occurrence matrix $M$, where $M_{ij}$ is the number of times word $i$ and word $j$ co-occur.
Compute each word's marginal probability:
- Compute the probability $P(w_i)$ of word $i$: the total number of co-occurrences involving word $i$ divided by the total number of all co-occurrences:
$$
P(w_i) = \frac{\sum_{j} M_{ij}}{\sum_{i} \sum_{j} M_{ij}}
$$
Compute the joint probability of each word pair:
- Compute the probability $P(w_i, w_j)$ that words $i$ and $j$ occur together: $M_{ij}$ divided by the total number of all co-occurrences:
$$
P(w_i, w_j) = \frac{M_{ij}}{\sum_{i} \sum_{j} M_{ij}}
$$
Compute the information-theoretic score:
Compute the pointwise mutual information (PMI) of each word pair:
$$
PMI(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i) P(w_j)}
$$
If the PMI value is negative it is clipped to 0; otherwise the positive value is kept, yielding positive pointwise mutual information (PPMI):
$$
PPMI(w_i, w_j) = \max(PMI(w_i, w_j), 0)
$$
Example
Suppose we have a simple corpus with the following word co-occurrence information:

| Word pair | Co-occurrence count |
|---|---|
| (dog, bark) | 100 |
| (dog, run) | 50 |
| (cat, meow) | 80 |
| (cat, run) | 40 |
| (bird, fly) | 90 |
| (bird, chirp) | 70 |

We build a 5×3 co-occurrence matrix $M$:

| | dog | cat | bird |
|---|---|---|---|
| bark | 100 | 0 | 0 |
| run | 50 | 40 | 0 |
| fly | 0 | 0 | 90 |
| meow | 0 | 80 | 0 |
| chirp | 0 | 0 | 70 |
Marginal probability of each word (the total count is 430):
- $P(\text{bark}) = \frac{100}{430} \approx 0.2326$
- $P(\text{run}) = \frac{90}{430} \approx 0.2093$
- $P(\text{fly}) = \frac{90}{430} \approx 0.2093$
- $P(\text{meow}) = \frac{80}{430} \approx 0.1860$
- $P(\text{chirp}) = \frac{70}{430} \approx 0.1628$
- $P(\text{dog}) = \frac{150}{430} \approx 0.3488$, $P(\text{cat}) = \frac{120}{430} \approx 0.2791$, $P(\text{bird}) = \frac{160}{430} \approx 0.3721$
Joint probability of each word pair:
- $P(\text{dog}, \text{bark}) = \frac{100}{430} \approx 0.2326$
- $P(\text{dog}, \text{run}) = \frac{50}{430} \approx 0.1163$
- $P(\text{cat}, \text{meow}) = \frac{80}{430} \approx 0.1860$
- $P(\text{cat}, \text{run}) = \frac{40}{430} \approx 0.0930$
- $P(\text{bird}, \text{fly}) = \frac{90}{430} \approx 0.2093$
- $P(\text{bird}, \text{chirp}) = \frac{70}{430} \approx 0.1628$
Compute PMI and apply the normalization:
- $PMI(\text{dog}, \text{bark}) = \log \frac{0.2326}{0.3488 \cdot 0.2326} = \log \frac{1}{0.3488} \approx \log 2.87 \approx 1.05$
- $PPMI(\text{dog}, \text{bark}) = \max(1.05, 0) = 1.05$
Proceeding in the same way for the other word pairs gives the PPMI matrix:

| | dog | cat | bird |
|---|---|---|---|
| bark | 1.05 | 0 | 0 |
| run | 0.47 | 0.47 | 0 |
| fly | 0 | 0 | 0.99 |
| meow | 0 | 1.28 | 0 |
| chirp | 0 | 0 | 0.99 |
In this way the co-occurrence matrix is normalized (here via the PPMI transformation), reducing the influence of high-frequency words: frequent elements contribute less to similarity computations, so the matrix reflects the semantic relatedness between words more accurately.
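A minimal sketch reproducing the computation above (the natural logarithm is assumed):

```python
import numpy as np

rows = ["bark", "run", "fly", "meow", "chirp"]   # context words
cols = ["dog", "cat", "bird"]                    # target words
M = np.array([[100,  0,  0],
              [ 50, 40,  0],
              [  0,  0, 90],
              [  0, 80,  0],
              [  0,  0, 70]], dtype=float)

total = M.sum()                    # 430
p_row = M.sum(axis=1) / total      # P(bark), P(run), ...
p_col = M.sum(axis=0) / total      # P(dog), P(cat), P(bird)
p_joint = M / total

with np.errstate(divide="ignore"):
    pmi = np.log(p_joint / np.outer(p_row, p_col))
ppmi = np.maximum(pmi, 0)          # log(0) = -inf is clipped to 0 here
print(np.round(ppmi, 2))           # matches the PPMI table above
```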
Extension: Log-Bilinear Regression
The GloVe objective above is called log-bilinear because the logarithm of the co-occurrence count, $\log X_{ij}$, is modeled by a bilinear form of the two word vectors, $w_i^T \tilde{w}_j$, plus bias terms.
BERT
Reference: https://arxiv.org/pdf/1810.04805
Overview
- BERT's network structure
- Comparison with RNNs and GPT
- Its network structure and parameters
BERT's network structure
- Input representation
BERT's input is the sum of three embeddings:
- Token embeddings: map each token to a fixed-dimensional vector.
- Segment embeddings: distinguish sentence A from sentence B in sentence-pair tasks (e.g., natural language inference).
- Position embeddings: encode each token's position in the sequence, capturing word order.
- Self-Attention Mechanism
Self-attention is the core of the Transformer. It captures the relations between words by computing, for every word in the input sequence, attention scores over the other words. The concrete steps:
Query, Key, Value:
- For each input token, compute a query vector, a key vector, and a value vector.
- Formula:
$$
Q = XW^Q, \quad K = XW^K, \quad V = XW^V
$$
where $W^Q$, $W^K$, $W^V$ are trainable weight matrices.
Attention scores:
- Compute the dot products of queries and keys to get attention scores, then normalize them with a softmax.
- Formula:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
Multi-Head Attention:
- Multi-head attention maps the queries, keys, and values into several subspaces, computes attention scores in parallel, and concatenates the results.
- Formula:
$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h)W^O
$$
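A minimal PyTorch sketch of the single-head attention defined above (multi-head attention simply runs $h$ of these in parallel on smaller projections and concatenates the results):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V

# Toy example: a sequence of 4 tokens with d_k = 8
X = torch.randn(4, 8)
W_q, W_k, W_v = (torch.randn(8, 8) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # torch.Size([4, 8])
```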
Transformer Encoder Layer
Each encoder layer consists of two main parts:
Multi-head self-attention layer:
- Captures the relations between different positions in the input sequence via multi-head self-attention.
Feed-forward neural network:
- Contains two linear transformations with a ReLU activation:
$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$
In addition, each encoder layer includes residual connections and layer normalization.
- Stacked encoder layers
BERT's network is built by stacking multiple encoder layers: the BERT base model has 12 encoder layers, while the BERT large model has 24.
- Output
BERT outputs a contextual representation for every token. In particular, BERT uses the representation of the [CLS] token as a summary of the whole input sequence for classification tasks.
- Pre-training tasks
BERT's pre-training uses two main tasks:
Masked Language Model (MLM):
- Randomly mask some tokens of the input sequence and require the model to predict the masked tokens.
Next Sentence Prediction (NSP):
- Given a pair of sentences, require the model to predict whether the second sentence follows the first.
BERT network structure diagram
A schematic of BERT's network structure (reconstructed from the description above):
```
Input Sequence (with [CLS] and [SEP])
        │
Token + Segment + Position Embeddings
        │
Encoder Layer × L   (L = 12 for base, 24 for large)
        │
Contextual representations (C for [CLS], T_i for token i)
```
The structure of each encoder layer:
```
Input
  │
Multi-Head Self-Attention   (+ residual, LayerNorm)
  │
Feed-Forward Network        (+ residual, LayerNorm)
  │
Output
```
Through this structure, BERT performs deep semantic modeling of the input sequence and captures complex relations between words, which is why it achieves excellent results across natural language processing tasks.
Comparison
There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.
Transformer core network architecture
Parameter calculation
In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. We primarily report results on two model sizes: BERT_BASE (L=12, H=768, A=12, Total Parameters=110M) and BERT_LARGE (L=24, H=1024, A=16, Total Parameters=340M).
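As a rough check of these totals, the parameter count can be approximated from L and H alone (a 30,522-token vocabulary, 512 positions, and FFN width 4H are assumed here; layer norms and the pooler are ignored):

```python
V, P, L, H = 30522, 512, 12, 768          # BERT_BASE

embeddings = (V + P + 2) * H              # token + position + segment tables
attention  = 4 * H * H + 4 * H            # W_Q, W_K, W_V, W_O with biases
ffn        = H * 4 * H + 4 * H + 4 * H * H + H   # two linear layers
per_layer  = attention + ffn

total = embeddings + L * per_layer
print(f"{total / 1e6:.0f}M")              # ~109M, close to the reported 110M
```

Setting L=24, H=1024 the same way gives roughly 334M, close to the reported 340M for BERT_LARGE.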
Pre-training and fine-tuning
We use WordPiece embeddings (Wu et al., 2016) with a ==30,000== token vocabulary. The first token of every sequence is always a special classification token ==([CLS])==. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote the input embedding as E, the final hidden vector of the special [CLS] token as $C \in \mathbb{R}^H$, and the final hidden vector for the $i$-th input token as $T_i \in \mathbb{R}^H$.
Why not simply tokenize on whitespace?
A full word like “pretrain” may be too rare to have reliable statistics, so WordPiece falls back to its subword units, e.g., “pretrain” → “pre” + “##train”.
Pretraining
- masking (MLM)
- NSP
- pre-training data
MASK
Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time. Then, Ti will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix C.2.
Masked LM and the Masking Procedure Assuming the unlabeled sentence is my dog is hairy, and during the random masking procedure we chose the 4-th token (which corresponds to hairy), our masking procedure can be further illustrated by
- 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is ==[MASK]==
- 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is ==apple==
- 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is ==hairy==. The purpose of this is to bias the representation towards the actual observed word.
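A minimal sketch of this 80/10/10 procedure (for simplicity it masks each position independently with probability 0.15, rather than choosing exactly 15% of positions):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (masked tokens, labels); labels mark the positions to predict."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"              # 80%: the [MASK] token
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: a random token
            # else 10%: leave the token unchanged
    return masked, labels

print(mask_tokens(["my", "dog", "is", "hairy"], vocab=["apple", "run", "blue"]))
```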
NSP
Next Sentence Prediction The next sentence prediction task can be illustrated in the following examples.
Input = [CLS] the man went to [MASK] store [SEP]
he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP]
penguin [MASK] are ==flight ##less== birds [SEP]
Label = NotNext
Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext). As we show in Figure 1, C is used for next sentence prediction (NSP). Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.
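A minimal sketch of how such 50/50 NSP pairs can be generated from a monolingual corpus (documents are assumed to be lists of sentences; a production version would also avoid sampling the true next sentence as a negative):

```python
import random

def make_nsp_example(docs):
    """Return (sentence_a, sentence_b, label) with a 50/50 label split."""
    doc = random.choice([d for d in docs if len(d) > 1])
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:
        return sentence_a, doc[i + 1], "IsNext"
    return sentence_a, random.choice(random.choice(docs)), "NotNext"

docs = [["the man went to the store.", "he bought a gallon of milk."],
        ["penguins are flightless birds.", "they live in the south."]]
print(make_nsp_example(docs))
```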
Fine-tuning
- Differences from the original Transformer
- Simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end
- GLUE (multi-class classification with CLS + W)
- QA (reading comprehension → predict start and end positions S and E)
Ablation Studies
- effects of parts of the model
- model size
- feature-based use vs. fine-tuning
References:
- The original papers
- The Bilibili channel “跟李沐学AI” and the book *Dive into Deep Learning*
Task 1: Context (Background) Representation for Aspect-Based Sentiment Analysis
Dataset
The dataset used is https://github.com/xuuuluuu/SemEval-Triplet-data/tree/master/ASTE-Data-V2-EMNLP2020
Understanding the “background”
Background: the context captured by the word vectors; my understanding is that the “background” here is the context surrounding each word vector.
Using BERT to capture the context of word vectors
BERT: principle and conversion steps
Principle overview
1. Bidirectional Transformer:
• Traditional language models (e.g., LSTM, GRU) are usually unidirectional, reading text only left-to-right or right-to-left. BERT is bidirectional: when processing a word, it considers the context on both its left and its right at the same time.
• This bidirectionality lets BERT understand a word's meaning more completely, since it does not rely only on the preceding words but can also take the following words into account.
2. Input representation:
• BERT's input representation consists of word (token) embeddings, position embeddings, and segment embeddings. These embedding vectors are combined to form the input sequence, so each token's representation carries its position and segment information.
3. Multi-layer Transformer encoder:
• BERT is a stack of Transformer encoder layers, each containing multi-head self-attention and a feed-forward network.
• Self-attention: in every layer, self-attention allows each token to “attend” to all other tokens in the sequence. Through the attention weights, each token's representation integrates information from the whole sequence and extracts the context relevant to it.
• Feed-forward network: after self-attention, the feed-forward network further processes and transforms each token's representation, letting the model capture deeper relations between a word and its context.
4. Masked Language Model (MLM):
• During training, BERT uses the masked language model: some tokens of the input sequence are randomly masked and the model must predict them.
• This training regime forces the model to exploit context on both sides of the masked token, strengthening its understanding of contextual relations.
Converting a sentence into a vector
The process goes as follows:
BERT (Bidirectional Encoder Representations from Transformers) converts a sentence into a vector through several steps: tokenization, adding special tokens, generating input embeddings, and processing by the Transformer encoder. In detail:
- Tokenization
First, BERT's dedicated tokenizer (e.g., `BertTokenizer`) converts the input sentence into a sequence of subword units. BERT uses the WordPiece algorithm, which splits words into smaller units so that the vocabulary can effectively cover a large number of words.
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("This is an example sentence.")
print(tokens)
```
The output is:
```
['this', 'is', 'an', 'example', 'sentence', '.']
```
- Adding special tokens
BERT requires a `[CLS]` token at the beginning of the input sequence, marking the anchor for classification tasks, and a `[SEP]` token after each sentence (or between the two sentences of a pair).
```python
tokens = ['[CLS]'] + tokens + ['[SEP]']
print(tokens)
```
The output is:
```
['[CLS]', 'this', 'is', 'an', 'example', 'sentence', '.', '[SEP]']
```
- Generating input embeddings
After tokenization, BERT converts each subword unit into an embedding. These embeddings consist of:
- Token embeddings: the tokenizer maps each subword to a fixed-size vector (typically 768-dimensional).
- Segment embeddings: represent sentence A vs. sentence B in a sentence pair.
- Position embeddings: represent each subword's position in the sentence.
```python
input_ids = tokenizer.convert_tokens_to_ids(tokens)
segment_ids = [0] * len(tokens)  # single-sentence task: all zeros
```
- Passing through the Transformer encoder
The core of the BERT model is the multi-layer Transformer encoder. The input embeddings are processed by the stacked bidirectional encoder layers; each layer contains several self-attention heads, which capture dependencies between tokens at different positions in the sentence.
```python
from transformers import BertModel
import torch

model = BertModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    outputs = model(torch.tensor([input_ids]),
                    token_type_ids=torch.tensor([segment_ids]))
last_hidden_states = outputs.last_hidden_state
```
- `input_ids`: the token ids of the input sentence.
- `segment_ids`: the segment ids for sentence pairs (all 0 for a single-sentence task).
- `last_hidden_states`: the output of the model's last hidden layer, i.e., a representation vector for every token.
- Obtaining the sentence vector
For sentence classification tasks, the vector corresponding to the `[CLS]` token is usually taken as the representation of the whole sentence.
```python
cls_embedding = last_hidden_states[0][0]  # the [CLS] token's vector
```
`cls_embedding` is the sentence's representation vector; it can be used for downstream tasks such as classification or similarity computation.
Embedding summary
- Tokenization: split the input sentence into subword units.
- Add special tokens: add `[CLS]` at the start and `[SEP]` at the end of the sequence.
- Generate input embeddings: convert the subword units into token embeddings and add segment and position embeddings.
- Pass through the Transformer encoder: process the embeddings with multiple layers of bidirectional Transformer encoders.
- Obtain the sentence vector: typically use the vector of the `[CLS]` token as the sentence representation.
In this way BERT captures the semantic information of the input sentence and converts it into a fixed-size vector representation, convenient for all kinds of NLP tasks.
Simple code example: sentence vectorization
```python
# Minimal sketch assembling the steps above (model name assumed: bert-base-uncased)
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("This is an example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_embedding = outputs.last_hidden_state[0][0]  # [CLS] vector = sentence vector
print(cls_embedding.shape)                       # torch.Size([768])
```
Using GloVe to capture the context of word vectors
GloVe: principle and conversion steps
GloVe (Global Vectors for Word Representation) is a word-vector method proposed at Stanford that is based on global lexical statistics. It converts words into vector representations using the co-occurrence statistics of words in a large corpus. The process of converting words into vectors is as follows:
- Building the co-occurrence matrix
First, GloVe builds a word-word co-occurrence matrix. For every word pair $(i, j)$ in the corpus, it counts how often the two words appear together within a fixed window, commonly of size 10 (i.e., co-occurrences are counted within the 10 words before and after a word). The matrix element $X_{ij}$ is the number of times words $i$ and $j$ co-occur within the window.
- Computing co-occurrence probabilities
Normalize the co-occurrence matrix to obtain the co-occurrence probability of each pair $(i, j)$:
$$
P_{ij} = \frac{X_{ij}}{\sum_k X_{ik}}
$$
where $X_{ij}$ is the co-occurrence count of words $i$ and $j$, and $\sum_k X_{ik}$ is the total co-occurrence count of word $i$ with all words.
- Defining the cost function
The core of GloVe is a cost function that pushes the dot products of the word vectors toward the logarithm of the co-occurrence counts:
$$
J = \sum_{i,j=1}^V f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij})\right)^2
$$
where:
- $w_i$ and $\tilde{w}_j$ are the vectors of word $i$ and context word $j$
- $b_i$ and $\tilde{b}_j$ are their bias terms
- $f(X_{ij})$ is a weighting function controlling the influence of different co-occurrence counts
The weighting function is usually defined as:
$$
f(X_{ij}) = \begin{cases}
\left( \frac{X_{ij}}{X_{max}} \right)^\alpha & \text{if } X_{ij} < X_{max} \\
1 & \text{otherwise}
\end{cases}
$$
where $X_{max}$ and $\alpha$ are hyperparameters, typically set to 100 and 0.75.
- Optimizing the cost function
Minimize the cost function $J$ with stochastic gradient descent (SGD) or another optimizer to learn each word's vector $w_i$ and bias $b_i$. During optimization, the model keeps adjusting the vectors and biases so that the dot products approach the logarithm of the co-occurrence counts.
- Obtaining the word vectors
After iterative optimization, the resulting vectors $w_i$ can be used in NLP tasks such as text classification, sentiment analysis, and information retrieval. The vectors preserve semantic relations between words: words with similar meanings lie close together in the vector space.
Simple code test: word vectorization
```python
# Minimal sketch using torchtext's pre-trained GloVe vectors
# (glove.6B, 100-d, downloaded on first use)
import torch
from torchtext.vocab import GloVe

glove = GloVe(name="6B", dim=100)
vec = glove["apple"]            # the 100-d vector for "apple"
print(vec.shape)                # torch.Size([100])

sim = torch.cosine_similarity(glove["good"].unsqueeze(0),
                              glove["great"].unsqueeze(0))
print(sim)                      # high similarity, as observed below
```
Using global co-occurrence statistics, GloVe converts words into vector representations that capture the semantic relations between them. By combining local window information with global statistics, it produces high-quality word vectors usable in a wide range of NLP tasks.
GloVe vs. BERT: which understands context better?
Demonstrating BERT's behavior
BERT test 1: sentence vectors
```python
import torch
```
BERT test 2: word vectors
```python
import torch
```
BERT test 3: similarity of different words in the same context
```python
import torch
```
Demonstrating GloVe's behavior
GloVe test 1
```python
import numpy as np
```
GloVe test 2: learning semantic information
```python
import numpy as np
```
Summary:
GloVe:
GloVe (Global Vectors for Word Representation) produces static word vectors: each word gets one fixed vector used in every context. Although GloVe captures global co-occurrence information, it cannot represent the different meanings a word takes in different contexts.
BERT:
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based model that captures contextual dependencies. Its bidirectional encoder (reading left-to-right and right-to-left simultaneously) produces dynamic word representations: each word's vector depends on its position in a particular context. BERT handles not only dependencies between words but also dependencies between sentences.
Task summary
In this task, we studied and interpreted the original BERT and GloVe papers, analyzed the core techniques of both embedding models, ran extensive comparison experiments and test cases, and finally compared the two models' ability to represent context (background) in word embeddings.
Task 2: Syntactic Feature Representation for Aspect-Based Sentiment Analysis
Core objectives
- Use dependency parsing to obtain dependency parse trees;
- Use a graph attention mechanism together with an embedding model to obtain syntactic feature representations of the text;
Note: installing DGL for the graph attention component caused dependency conflicts and broke the environment.
Tool list
1. **NLP Toolkits**:
• Stanford NLP: an NLP toolkit developed at Stanford with a powerful syntactic parser; supports POS tagging, dependency parsing, and more.
• spaCy: an efficient and easy-to-use NLP library supporting syntactic analysis for many languages, with pre-trained models for fast parse-tree construction.
• **NLTK (Natural Language Toolkit)**: a long-standing NLP toolkit; slower than the two above, but rich in teaching resources and basic parsing tools.
2. Pre-trained models and APIs:
• AllenNLP: a PyTorch-based NLP research library offering pre-trained parsing models.
• Google Cloud Natural Language API: Google's cloud service with syntactic-analysis support; parse trees can be obtained via direct API calls.
3. Dependency parsing:
• Dependency parsing identifies the grammatical dependency relations between the words of a sentence. Stanford NLP or spaCy can be used.
4. Constituency parsing:
• Constituency parsing decomposes a sentence into a phrase-structure tree, marking its phrasal constituents. Stanford NLP can be used.
5. Tree visualization:
• Visualizing parse trees helps in understanding sentence structure; Matplotlib combined with NLTK or spaCy can render them.
Demo: dependency parse tree
```python
# Minimal sketch consistent with the relations analyzed below
# (model name assumed: en_core_web_sm)
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    print(f"{token.text} ({token.pos_}) -> {token.head.text} [{token.dep_}]")

# Visualize the dependency tree (in a notebook; use displacy.serve otherwise)
displacy.render(doc, style="dep")
```
1. Dependency relations:
• The (DET) -> fox (NOUN): the determiner “The” modifies the noun “fox”.
• quick (ADJ) -> fox (NOUN): the adjective “quick” modifies the noun “fox”.
• brown (ADJ) -> fox (NOUN): the adjective “brown” modifies the noun “fox”.
• fox (NOUN) -> jumps (VERB): the noun “fox” is the subject of the verb “jumps”.
• jumps (VERB) -> ROOT: “jumps” is the main verb, serving as the root of the sentence.
• over (ADP) -> jumps (VERB): the preposition “over” modifies the verb “jumps”.
• the (DET) -> dog (NOUN): the determiner “the” modifies the noun “dog”.
• lazy (ADJ) -> dog (NOUN): the adjective “lazy” modifies the noun “dog”.
• dog (NOUN) -> over (ADP): the noun “dog” is the object of the preposition “over”.
• . (PUNCT) -> jumps (VERB): the sentence-final period attaches to “jumps”.
2. Graphical display:
• Above each word, the dependency label (det, amod, nsubj, etc.) describes the grammatical relation between the words.
• The arrows show the direction of each dependency, from the modifier to the modified word; e.g., quick points to fox because quick modifies fox.
• Below each word, its part-of-speech tag (DET, ADJ, NOUN, VERB, ADP, PUNCT, etc.) is shown.
Uses
1. Understanding sentence structure: parsing identifies the constituents of a sentence (subject, predicate, object, etc.) and the relations between them.
2. A foundation for semantic understanding: deeper semantic analysis builds on parsing, since the meaning of a sentence can only be understood accurately once its structure is known.
3. Improving machine translation and QA systems: parsing helps such systems understand and generate more accurate, fluent language.
4. Information extraction: parsing supports extracting useful information from text, such as entities, relations, and events.
5. Better natural language generation: parsing helps generate grammatically well-formed sentences, making the output more natural.
Demo: graph attention mechanism
```python
# Minimal single-head graph attention (GAT) layer in pure PyTorch,
# written as a sketch to sidestep the DGL dependency conflict noted above
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared projection
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scorer

    def forward(self, h, adj):
        z = self.W(h)                                     # (N, out_dim)
        N = z.size(0)
        # all-pairs concatenation [z_i || z_j] for attention logits
        pairs = torch.cat([z.unsqueeze(1).expand(N, N, -1),
                           z.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))       # (N, N)
        e = e.masked_fill(adj == 0, float("-inf"))        # attend to neighbors only
        return torch.softmax(e, dim=-1) @ z               # weighted sum

# Toy graph: 4 nodes with 8-d features, adjacency with self-loops
h = torch.randn(4, 8)
adj = torch.tensor([[1, 1, 0, 0],
                    [1, 1, 1, 0],
                    [0, 1, 1, 1],
                    [0, 0, 1, 1]])
print(GATLayer(8, 16)(h, adj).shape)  # torch.Size([4, 16])
```
Reference:
HIT LTP resources: https://ltp.readthedocs.io/zh-cn/latest/
Task 3: Fusing Context Representations and Syntactic Feature Representations for Aspect-Based Sentiment Analysis
Fusion method design
Design: use GloVe for the word embeddings and spaCy for the syntactic features, and combine the two for a sentiment classification task whose labels are positive, negative, and neutral.
Method 1 (implemented)
Simple one-dimensional concatenation of the syntactic features and the word embeddings:
- We fix the syntactic feature block to a specific length, e.g., take the features of the first 10 words and zero-pad if there are fewer (see the sketch below).
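A minimal sketch of this concatenation (assuming 100-d GloVe vectors averaged into a sentence embedding and one numeric syntactic feature per word, e.g., a POS-tag id from spaCy):

```python
import numpy as np

def build_features(word_vectors, syntax_features, max_words=10):
    """Fuse averaged word embeddings with a fixed-length syntax block."""
    emb = np.mean(word_vectors, axis=0)             # (100,) sentence embedding
    syn = np.zeros(max_words)                       # pad with zeros
    feats = np.asarray(syntax_features[:max_words], dtype=float)
    syn[:len(feats)] = feats                        # or truncate to 10 words
    return np.concatenate([emb, syn])               # (110,) fused feature

vecs = [np.random.rand(100) for _ in range(6)]      # 6 words' GloVe vectors
pos_ids = [3, 1, 2, 3, 4, 5]                        # one syntactic feature per word
print(build_features(vecs, pos_ids).shape)          # (110,)
```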
Method 2
Find the syntactic head word and give it more weight during classification — in this case the adjective (not completed due to time constraints).
Method 1: concrete steps
Randomly generate a dataset of 10 texts with sentiment labels (positive, negative, neutral) and use it to demonstrate the whole pipeline.
Step 1: install the required libraries
```
pip install spacy numpy pandas scikit-learn
```
Step 2: generate the example dataset
```python
import pandas as pd
```
Step 3: load the GloVe embeddings and the data
We need a pre-trained GloVe embedding file, which can be downloaded and placed at a suitable path.
```python
import numpy as np
```
Step 4: load and preprocess the dataset
```python
import spacy
```
Step 5: train and evaluate the model
```python
from sklearn.model_selection import train_test_split
```
Validation design: random forest
```python
import pandas as pd
```
Validation design: Bi-LSTM + attention
```python
import pandas as pd
```
Observations and thoughts
The results appear to be influenced by GloVe itself:
for example, “good” and “great” end up very close together.
Task 3: BERT on the full dataset
Processing the full dataset: BERT
1. Use spaCy to extract the syntactic features.
2. Concatenate the syntactic features with BERT's word vectors.
3. Modify the SentimentDataset class to support concatenation of the syntactic features.
4. Modify the training and evaluation code to support the new input.
Data preprocessing
```python
import pandas as pd
```
Converting the dataset
```python
class SentimentDataset(Dataset):
```
Preprocessing results
Full code
```python
import os
```
Processing the full dataset: BERT + spaCy
Data processing
```python
class SentimentDataset(Dataset):
```
Custom BERT model
```python
# Define a custom BERT model
```
Full code
```python
import pandas as pd
```
Result comparison: BERT vs. BERT + spaCy
epoch = 3
epoch = 10
We can see that fusing in the syntactic features does improve the BERT model's accuracy.
Task 3: GloVe on the full dataset
GloVe
```python
import pandas as pd
```
GloVe + spaCy
```python
import pandas as pd
```
Result comparison: GloVe vs. GloVe + spaCy
Task summary
Task 3 fuses a word embedding model with syntactic features. This implementation fuses them by simple one-dimensional vector concatenation. Several methods were first validated on a small dataset; once the design proved feasible, it was tested on the full dataset. With the syntactic features added, the results improve noticeably over the originals.
Task 4: Span-Based Sentiment Triplet Extraction
Span-based sentiment triplet extraction is an NLP task that aims to extract sentiment triplets from text. Each triplet has three parts: the entity (aspect), the sentiment word (opinion), and the sentiment polarity. Span-based methods use the different spans (sub-sequences) of the text to identify these triplets.
Task concepts and design flow
Task definition
The goal of sentiment triplet extraction is to identify triplets (entity, sentiment word, sentiment polarity) in a given sentence. For example, in the sentence “我喜欢这款手机的屏幕,但电池寿命很短。” (“I like this phone's screen, but the battery life is very short.”), two triplets can be identified:
- (“屏幕” [screen], “喜欢” [like], positive)
- (“电池寿命” [battery life], “很短” [very short], negative)
Span-based method
A span-based method first generates all possible spans in the text, then uses a model to select the most likely ones, and finally extracts the sentiment triplets from those spans.
Step overview
- Span generation: generate all candidate spans in the text, each determined by a start and an end position (see the sketch after this list).
- Span selection: use a model to select the spans most likely to carry sentiment information.
- Triplet extraction: extract the entity, sentiment word, and polarity from the selected spans to form the sentiment triplets.
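A minimal sketch of the span-generation step (the maximum span length is an assumed hyperparameter):

```python
def generate_spans(tokens, max_len=3):
    """Enumerate candidate spans as (start, end) inclusive indices plus text."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start, min(start + max_len, len(tokens))):
            spans.append((start, end, " ".join(tokens[start:end + 1])))
    return spans

tokens = "the battery life is short".split()
print(generate_spans(tokens, max_len=2))
# [(0, 0, 'the'), (0, 1, 'the battery'), (1, 1, 'battery'), ...]
```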
Span-based example
A detailed example of span-based sentiment triplet extraction:
Span generation:
For the sentence “这款手机的屏幕很清晰,电池寿命很短。” (“This phone's screen is very clear; the battery life is very short.”), generate all possible spans, such as:
- “这款手机” (this phone)
- “的屏幕” (the screen)
- “很清晰” (very clear)
- “电池寿命” (battery life)
- “很短” (very short)
- etc.
Span selection:
Use a sequence-labeling model (e.g., a pre-trained model such as BERT) to select the spans most likely to carry sentiment information:
- select “屏幕” (screen)
- select “电池寿命” (battery life)
Triplet extraction:
Extract the sentiment triplets from the selected spans:
- from “屏幕”, extract the sentiment word “清晰” (clear) with positive polarity
- from “电池寿命”, extract the sentiment word “短” (short) with negative polarity
Code example
A simple Python example that generates spans and selects sentiment triplets:
```python
import spacy
```
The output:
```
Extracted Sentiment Triplets:
```
⌚️ Code practice: innovation
TextBlob is a simple, easy-to-use Python library for processing text data. Built on NLTK and Pattern, it provides a convenient API for common natural language processing (NLP) tasks such as POS tagging, noun-phrase extraction, and sentiment analysis.
Demo: spaCy + TextBlob
Design:
- The naive idea is to find (subject, adjective, sentiment polarity) triplets
- Use the sentiment polarity of the whole sentence as the basis for filtering out conflicting triplets
- Word index → position of the first character
Issue: how to find the spans of all sentiment-bearing adjectives in a sentence, the spans of the subjects they describe, and the final sentiment polarity
```python
# Minimal sketch of the naive (subject, adjective, polarity) idea above;
# the extraction rules are illustrative assumptions, not a fixed recipe
import spacy
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")
doc = nlp("The screen is very clear but the battery life is short.")

triplets = []
for token in doc:
    if token.pos_ == "ADJ":
        # subject to the left of the adjective's head verb
        subjects = [w for w in token.head.lefts if w.dep_ in ("nsubj", "nsubjpass")]
        if not subjects:
            continue
        polarity = TextBlob(token.text).sentiment.polarity  # word-level polarity
        label = "POS" if polarity > 0 else "NEG" if polarity < 0 else "NEU"
        triplets.append((subjects[0].text, token.text, label))

print(triplets)  # e.g. [('screen', 'clear', ...), ('life', 'short', ...)]
```
Dev: BERT triplet extraction
```python
import torch
```
spaCy + GloVe + BERT
Based on the glove_vectors_list produced by segmenting the sentence on syntactic features, together with the corresponding labels: compute the self-attention score matrix among the vectors within each list, find the phrase (group of vectors) semantically closest to the label (highest score), then find the subject group closest to that phrase. The final return value is a triplet such as [([indices of the first span], [indices of the second span], 'POS')], where each list gives the span of the phrase in word units.
```python
import torch
```
```python
import spacy
```
The candidate spans (word-index ranges) extracted for the two example sentences:
```
[(1, 2), (10, 10), (9, 12), (15, 15), (20, 22), (24, 24), (26, 26), (29, 29), (33, 34), (37, 37), (39, 40)]
[(0, 1), (3, 7), (9, 9)]
```
and the corresponding phrases:
```
['the shop', 'soft', 'a soft rubber enclosure', 'you', 'the razor edge', 'you', 'it', 'it', 'the seal', 'it', 'very clever']
['This laptop', 'every expectation and Windows 7', 'great']
```
```python
import numpy as np
```
Output (truncated): the GloVe vectors of the candidate phrases, a list of eleven 100-dimensional arrays, followed by their count:
```
[array([ 0.13303299, -0.19339001, -0.03462997, ... ], dtype=float32), ...]
11
```
Output (truncated): the GloVe vectors of the subject phrases, a list of three 100-dimensional arrays, followed by their count:
```
[array([-0.06306 ,  0.07782 ,  0.36689001, ... ]), ...]
3
```
```python
# A function that removes duplicates while preserving order
```
Output (truncated): after deduplication, a nested list of two lists, eight distinct candidate-phrase vectors and three subject-phrase vectors:
```
[[array([ 0.13303299, -0.19339001, ... ], dtype=float32), ...],
 [array([-0.06306 ,  0.07782 , ... ]), ...]]
```
```python
import torch
```
The results were not good.
Revision:
```python
import torch
```
The output:
```
[([1], [3], 'POS'), ([3], [5], 'POS')]
```
SSJE: adaptation and implementation
- Refactored the GitHub code
- One version ported to MPS + Python 3.11
- Another version ported to CUDA + Python 3.11
Full code on my GitHub: https://github.com/boots-coder/SSJE
Overall framework and experimental results
Below is a brief overview of the open-source code's technical implementation and a presentation of its experimental results.
Architecture
The model proposes an architecture for extracting aspect sentiment triplets, i.e., (aspect, opinion, sentiment) combinations, from text. The architecture contains the following main modules:
- Input layer
- Encoding layer
- Span generation and representation
- Aspect sentiment triplet extraction
- GCN module (graph convolutional network)
1. Input layer
Function:
Feed the raw text into the model for preprocessing.
Details:
An input sentence such as “Great food but the service was dreadful” is tokenized and converted into vector representations (e.g., `h0, x1, x2, ..., x7`), which are passed to the subsequent encoding layer.
2. Encoding layer
Function:
Encode the input text to obtain a contextual representation of each word.
Details:
The encoding layer runs a bidirectional LSTM over the input word vectors, producing a contextual representation `h_i` for each word. These representations feed the span generation and representation module.
3. Span generation and representation
Function:
Generate candidate aspect and opinion spans in the text and build representations for them.
Details:
The model generates multiple spans (e.g., `Sp1, Sp2, ..., Sp6`) and represents each one with a vector (`n1, n2, ..., n6`). After a max-pooling operation, these representations are passed through a span filter for screening.
4. Aspect sentiment triplet extraction
Function:
Extract valid aspect sentiment triplets from the generated spans.
Details:
The model extracts candidate triplets (e.g., `(S1, C1-3, S3)`) from the span representations, and produces the final triplet outputs by matching and combination, e.g., `(food, great, positive), (service, dreadful, negative)`.
5. GCN module (graph convolutional network)
Function:
Use the syntactic dependency tree to strengthen the aspect and opinion representations. Through the graph structure of the dependency tree, the GCN effectively captures long-distance dependencies, making the generated span representations more robust and accurate.
Details:
The GCN module further refines each word's representation using the syntactic dependency tree. As shown in the figure, “Great food but the service was dreadful” is parsed into a dependency tree and processed by the GCN module, yielding refined representations `g1, g2, ..., g7`.
Summary
Through the cooperation of these modules, the model extracts aspect sentiment triplets efficiently: the input text is encoded, spans are generated and filtered, triplets are extracted, and the GCN refines the representations, finally producing the desired aspect sentiment triplets.