Neural Network SMS Text Classifier
https://www.freecodecamp.org/learn/machine-learning-with-python/machine-learning-with-python-projects/neural-network-sms-text-classifier
In this challenge, you need to create a machine learning model that will classify SMS messages as either "ham" or "spam". A "ham" message is a normal message sent by a friend. A "spam" message is an advertisement or a message sent by a company.
You can access the full project instructions and starter code on Google Colaboratory.
References
https://www.kaggle.com/akhatova/sms-spam-classification-by-keras#3.-Keras-Model
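There are two ways to solve this example:
(1) Word vectors + a regression model
(2) Sequence-mode input layer + word embedding + Keras model / CNN model
In testing, word-vector features proved better suited to spam detection, so the final model is word vectors + a Keras DNN.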
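Approach (1) can be sketched roughly as below. This is a minimal illustration, not the code used for the challenge: it assumes scikit-learn is installed, reads the "regression model" as logistic regression, and trains on a tiny corpus in which the spam messages are made up. `class_weight="balanced"` is an added assumption to offset the ham/spam imbalance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus for illustration only; the real data is the SMS Spam Collection.
texts = [
    "Ok lar... Joking wif u oni...",
    "What you doing? how are you?",
    "dun say so early hor... U c already then say...",
    "URGENT! You have won a prize, call now to claim",
    "Win a free prize! Text WIN to claim your reward now",
]
labels = ["ham", "ham", "ham", "spam", "spam"]

# TF-IDF features feed a logistic regression classifier.
# class_weight="balanced" reweights classes inversely to their frequency.
clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced"),
)
clf.fit(texts, labels)

predictions = clf.predict(["Claim your free prize now", "how are you doing?"])
print(predictions)
```

On a corpus this small the predictions are not meaningful; the point is only the shape of the pipeline (raw text in, label out).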
Data
https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
The table below lists the provided dataset in different file formats, the amount of samples in each class and the total number of samples.
Application  File format  # Spam  # Ham  Total  Link
General      Plain text   747     4,827  5,574  Link 1
Weka         ARFF         747     4,827  5,574  Link 2
The collection is composed of just one file, where each line has the correct class (ham or spam) followed by the raw message.
ham What you doing?how are you?
ham Ok lar... Joking wif u oni...
ham dun say so early hor... U c already then say...
ham MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
ham Siva is in hostel aha:-.
ham Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
spam FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
spam Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
spam URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU
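A minimal sketch of parsing that layout, assuming the label is the first whitespace-delimited token on each line (a tab is assumed as the separator in the file itself, although the sample above is shown with spaces). The `parse_line` helper and the 0/1 label encoding are illustrative choices, not part of the dataset.

```python
from collections import Counter


def parse_line(line):
    """Split one record into (label, message); the label is the first token."""
    label, message = line.rstrip("\n").split(maxsplit=1)
    if label not in ("ham", "spam"):
        raise ValueError(f"unexpected label: {label!r}")
    return label, message


sample = [
    "ham\tOk lar... Joking wif u oni...",
    "ham\tSiva is in hostel aha:-.",
    "spam\tSunshine Quiz! Win a super Sony DVD recorder",
]

records = [parse_line(line) for line in sample]
class_counts = Counter(label for label, _ in records)

# Encode labels as integers for a binary classifier: ham -> 0, spam -> 1.
targets = [1 if label == "spam" else 0 for label, _ in records]

print(class_counts)
print(targets)
```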
Word-vector feature extraction: TfidfVectorizer
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], ...)
>>> print(X.shape)
(4, 9)
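To make the weighting concrete, here is a standard-library sketch that recomputes TF-IDF by hand for the same four-document corpus. It assumes TfidfVectorizer's default settings (raw term counts, smoothed idf of ln((1 + n) / (1 + df)) + 1, L2-normalised rows, tokens of two or more word characters); the helper names are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [tokenize(text) for text in corpus]
vocabulary = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {}
for term in vocabulary:
    df = sum(1 for doc in docs if term in doc)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    """Return the L2-normalised TF-IDF weights of one tokenised document."""
    counts = Counter(doc)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


matrix = [tfidf_row(doc) for doc in docs]
print(vocabulary)
print(len(matrix), len(vocabulary))
```

In the first document, "first" (present in two documents) ends up with a larger weight than "this" (present in all four), which is the behaviour that makes rare, telling words stand out.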
Alternatively, use the TensorFlow preprocessing API
https://www.tensorflow.org/guide/keras/preprocessing_layers#encoding_text_as_a_dense_matrix_of_ngrams_with_tf-idf_weighting
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "tf_idf" output_mode
# (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams).
# Older TensorFlow releases exposed this layer under
# layers.experimental.preprocessing and spelled the mode "tf-idf".
text_vectorizer = layers.TextVectorization(output_mode="tf_idf", ngrams=2)
# Index the bigrams and learn the TF-IDF weights via `adapt()`
text_vectorizer.adapt(data)

print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
    "\n",
)

# Create a Dense model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)
print("Model output:", test_output)
Class imbalance
https://keras.io/examples/structured_data/imbalanced_classification/
Class-weighting method
Compute a weight for each class: the class with fewer samples is given the higher weight.
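import numpy as np

# Count the samples per class (column 0 of train_targets holds the 0/1 label)
counts = np.bincount(train_targets[:, 0])
print(
    "Number of positive samples in training data: {} ({:.2f}% of total)".format(
        counts[1], 100 * float(counts[1]) / len(train_targets)
    )
)

# Inverse-frequency weights: the rarer class gets the larger weight
weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]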
Pass the class weights to the training call
metrics = [
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-2), loss="binary_crossentropy", metrics=metrics
)

callbacks = [keras.callbacks.ModelCheckpoint("fraud_model_at_epoch_{epoch}.h5")]
class_weight = {0: weight_for_0, 1: weight_for_1}

model.fit(
    train_features,
    train_targets,
    batch_size=2048,
    epochs=30,
    verbose=2,
    callbacks=callbacks,
    validation_data=(val_features, val_targets),
    class_weight=class_weight,
)
Choosing metrics
https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#train_the_model
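When classes are imbalanced, accuracy alone is not enough: a model trained against it may well learn to handle only the majority class and ignore the minority class.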
METRICS = [
    keras.metrics.TruePositives(name='tp'),
    keras.metrics.FalsePositives(name='fp'),
    keras.metrics.TrueNegatives(name='tn'),
    keras.metrics.FalseNegatives(name='fn'),
    keras.metrics.BinaryAccuracy(name='accuracy'),
    keras.metrics.Precision(name='precision'),
    keras.metrics.Recall(name='recall'),
    keras.metrics.AUC(name='auc'),
    keras.metrics.AUC(name='prc', curve='PR'),  # precision-recall curve
]
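A small standard-library sketch of why accuracy misleads here. It assumes labels with the same class balance as the full SMS collection (4,827 ham as 0, 747 spam as 1) and a degenerate "model" that always predicts ham; `binary_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Labels with the same class balance as the full SMS collection.
y_true = [0] * 4827 + [1] * 747
# A degenerate "model" that always predicts ham.
always_ham = [0] * len(y_true)

scores = binary_metrics(y_true, always_ham)
print(scores)
```

Accuracy comes out at 4827 / 5574 (about 87%) even though recall on spam is zero, which is why precision, recall and the PR curve are tracked alongside it.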
Oversampling
https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#oversample_the_minority_class
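For the minority class, resample it until it contains as many examples as the majority class.
In my view this only fixes the quantity problem: the data quality does not improve and the lack of diversity is left unsolved, which in the end limits how well the model generalizes on the minority class.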
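The diversity concern can be seen in a small sketch. It uses made-up placeholder messages and the standard library's `random.Random.choices` to sample the minority class with replacement; the seed and class sizes are arbitrary assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

ham = [f"ham message {i}" for i in range(20)]   # majority class (toy data)
spam = [f"spam message {i}" for i in range(3)]  # minority class (toy data)

# Sample the minority class with replacement until it matches the majority size.
resampled_spam = rng.choices(spam, k=len(ham))

print(len(resampled_spam))       # same size as the majority class
print(len(set(resampled_spam)))  # distinct messages: never more than the original 3
```

The resampled class is as large as the majority class, but it still contains only the original distinct messages repeated many times.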
Using NumPy
You can balance the dataset manually by choosing the right number of random indices from the positive examples:
ids = np.arange(len(pos_features))
choices = np.random.choice(ids, len(neg_features))
res_pos_features = pos_features[choices]
res_pos_labels = pos_labels[choices]
res_pos_features.shape
(181966, 29)
resampled_features = np.concatenate([res_pos_features, neg_features], axis=0)
resampled_labels = np.concatenate([res_pos_labels, neg_labels], axis=0)
order = np.arange(len(resampled_labels))
np.random.shuffle(order)
resampled_features = resampled_features[order]
resampled_labels = resampled_labels[order]
resampled_features.shape
(363932, 29)
TSV
https://en.wikipedia.org/wiki/Tab-separated_values
A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure, e.g., database table or spreadsheet data, and a way of exchanging information between databases. Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab character. The TSV format is thus a type of the more general delimiter-separated values format.
https://*.com/questions/9652832/how-to-load-a-tsv-file-into-a-pandas-dataframe
Use pandas.read_table(filepath). The default separator is tab.
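A minimal sketch of loading label/message records this way, assuming pandas is installed. An in-memory string stands in for the file, the column names `label` and `message` are hypothetical, and `quoting=csv.QUOTE_NONE` is an added precaution so that quote characters inside messages are not treated as field delimiters.

```python
import csv
import io

import pandas as pd

# Stand-in for the dataset file: one "label<TAB>message" record per line, no header.
tsv_text = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tSunshine Quiz! Win a super Sony DVD recorder\n"
)

df = pd.read_table(
    io.StringIO(tsv_text),       # a file path works the same way
    header=None,                 # the file has no header row
    names=["label", "message"],  # hypothetical column names
    quoting=csv.QUOTE_NONE,      # treat quote characters in messages as plain text
)

print(df.shape)
print(df["label"].tolist())
```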