4 Answers

TA contributed 1942 experience points · received 3+ upvotes
Your new data may differ significantly from the first dataset you used to train and test your model. Preprocessing techniques and statistical analysis will help you characterize the data and compare the different datasets. Poor performance on new data can occur for a variety of reasons, including:
Your initial dataset is not statistically representative of the larger dataset (e.g. your dataset is an edge case)
Overfitting: you over-trained your model, so it captures the idiosyncrasies (noise) of the training data
Different preprocessing methods between training data and new data
An imbalanced training dataset. ML techniques work best with balanced datasets (equal occurrence of the different classes in the training set)
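A quick class-balance check on the training labels is often the fastest way to rule out the last point. A minimal sketch with pandas (the `sentiment` column name follows the code in another answer here; the toy data is made up):

```python
import pandas as pd

# Toy stand-in for a real labeled training set
df = pd.DataFrame({"sentiment": ["pos", "pos", "pos", "neg"]})

# Relative frequency of each class; a heavy skew suggests an imbalanced set
counts = df["sentiment"].value_counts(normalize=True)
print(counts.to_dict())  # e.g. {'pos': 0.75, 'neg': 0.25}
```

If the skew is real, options include resampling or passing `class_weight='balanced'` to scikit-learn classifiers.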

TA contributed 1799 experience points · received 8+ upvotes
I conducted a survey study of how different classifiers perform in sentiment analysis. For a particular Twitter dataset, I ran models such as logistic regression, naive Bayes, support vector machine (SVM), k-nearest neighbors (KNN), and decision tree. On the chosen dataset, logistic regression and naive Bayes performed well and accurately across all types of tests, followed by SVM, then decision tree. KNN had the lowest accuracy score. Overall, the logistic regression and naive Bayes models performed better at sentiment analysis and prediction.

Sentiment classifier   Accuracy score   RMSE
LR                     78.3541          1.053619
NB                     76.764706        1.064738
SVM                    73.5835          1.074752
DT                     69.2941          1.145234
KNN                    62.9476          1.376589
In these cases, feature extraction is critical.
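For example, switching from raw counts to TF-IDF weighting is a common first step in feature extraction. A minimal sketch with scikit-learn (the documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie", "great film"]  # made-up examples
vect = TfidfVectorizer(ngram_range=(1, 2))        # unigrams and bigrams
X = vect.fit_transform(docs)                      # sparse document-term matrix
print(X.shape)                                    # (3 docs, 8 learned n-grams)
```

TF-IDF down-weights terms that appear in every document, which often helps linear models such as logistic regression and SVM.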

TA contributed 2039 experience points · received 8+ upvotes
Import the essentials:

import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import time

df = pd.read_csv('FilePath', header=0)
X = df['content']
y = df['sentiment']

def lrSentimentAnalysis(n):
    # Using CountVectorizer to convert text into tokens/features
    vect = CountVectorizer(ngram_range=(1, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=n)

    # Using training data to transform text into counts of features for each message
    vect.fit(X_train)
    X_train_dtm = vect.transform(X_train)
    X_test_dtm = vect.transform(X_test)

    # Hyperparameter grid for logistic regression
    # dual = [True, False]
    max_iter = [100, 110, 120, 130, 140, 150]
    C = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5]
    solvers = ['newton-cg', 'lbfgs', 'liblinear']
    param_grid = dict(max_iter=max_iter, C=C, solver=solvers)

    LR1 = LogisticRegression(penalty='l2', multi_class='auto')
    grid = GridSearchCV(estimator=LR1, param_grid=param_grid, cv=10, n_jobs=-1)
    grid_result = grid.fit(X_train_dtm, y_train)

    # Summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    y_pred = grid_result.predict(X_test_dtm)
    print('Accuracy Score:', metrics.accuracy_score(y_test, y_pred) * 100, '%')
    # print('Confusion Matrix:', metrics.confusion_matrix(y_test, y_pred))
    # print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
    # print('MSE:', metrics.mean_squared_error(y_test, y_pred))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    return [n, metrics.accuracy_score(y_test, y_pred) * 100,
            grid_result.best_estimator_.get_params()['max_iter'],
            grid_result.best_estimator_.get_params()['C'],
            grid_result.best_estimator_.get_params()['solver']]

def drawConfusionMatrix(accList):
    # Using CountVectorizer to convert text into tokens/features
    vect = CountVectorizer(ngram_range=(1, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=accList[0])

    # Using training data to transform text into counts of features for each message
    vect.fit(X_train)
    X_train_dtm = vect.transform(X_train)
    X_test_dtm = vect.transform(X_test)

    # Refit logistic regression with the best hyperparameters found above
    LR = LogisticRegression(penalty='l2', max_iter=accList[2], C=accList[3], solver=accList[4])
    LR.fit(X_train_dtm, y_train)
    y_pred = LR.predict(X_test_dtm)

    # Creating a heatmap for the confusion matrix
    data = metrics.confusion_matrix(y_test, y_pred)
    df_cm = pd.DataFrame(data, columns=np.unique(y_test), index=np.unique(y_test))
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'
    plt.figure(figsize=(10, 7))
    sns.set(font_scale=1.4)  # for label size
    sns.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})  # font size
    fig0 = plt.gcf()
    fig0.show()
    fig0.savefig('FilePath', dpi=100)

def findModelWithBestAccuracy(accList):
    accuracyList = [item[1] for item in accList]
    N = accuracyList.index(max(accuracyList))
    print('Best Model:', accList[N])
    return accList[N]

accList = []
print('Logistic Regression')
print('grid search method for hyperparameter tuning (accuracy by cross validation)')
for i in range(2, 7):
    n = i / 10.0
    print("\nsplit ", i - 1, ": n=", n)
    accList.append(lrSentimentAnalysis(n))
drawConfusionMatrix(findModelWithBestAccuracy(accList))

TA contributed 1794 experience points · received 8+ upvotes
Preprocessing is a vital part of building a well-performing classifier. When there is such a large gap between training and test set performance, it is likely that some error has occurred in your (test set) preprocessing.
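One way to guarantee that new data gets exactly the training-time preprocessing is to wrap the vectorizer and the model in a scikit-learn Pipeline, so the fitted vocabulary is replayed at predict time. A minimal sketch (the toy texts and labels are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["good movie", "great film", "bad movie", "awful film"]
train_labels = ["pos", "pos", "neg", "neg"]

# fit() learns the vocabulary on training text only; predict() applies
# the identical transform to any new text before classifying it.
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(train_texts, train_labels)

pred = pipe.predict(["good film"])
print(pred)
```

This removes the most common source of train/test mismatch: transforming the test set with a vectorizer fitted on different data.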
You can also use a classifier without any programming: visit the Insight Classifier web service and try building one for free first.