
如何改進(jìn) ML 模型以提高準(zhǔn)確性

如何改進(jìn) ML 模型以提高準(zhǔn)確性

HUH函數(shù) 2023-06-20 16:18:27
我正在編寫一個(gè)處理情緒分析的 python 腳本,我對(duì)文本進(jìn)行了預(yù)處理并對(duì)分類特征進(jìn)行了矢量化并拆分了數(shù)據(jù)集,然后我使用了 LogisticRegression 模型,準(zhǔn)確率達(dá)到了 84 %當(dāng)我上傳一個(gè)新的數(shù)據(jù)集并嘗試部署創(chuàng)建的模型時(shí),我得到了51.84% 的準(zhǔn)確率代碼:    import pandas as pd    import numpy as np    import re    import string    from nltk.corpus import stopwords    from nltk.tokenize import word_tokenize    from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer,TfidfTransformer    from sklearn.model_selection import train_test_split    from nltk.stem import PorterStemmer    from nltk.stem import WordNetLemmatizer    # ML Libraries    from sklearn.metrics import accuracy_score    from sklearn.linear_model import LogisticRegression    from sklearn.model_selection import GridSearchCV        stop_words = set(stopwords.words('english'))      import joblib        def load_dataset(filename, cols):        dataset = pd.read_csv(filename, encoding='latin-1')        dataset.columns = cols        return dataset        dataset = load_dataset("F:\AIenv\sentiment_analysis\input_2_balanced.csv", ["id","label","date","text"])    dataset.head()        dataset['clean_text'] = dataset['text'].apply(processTweet)        # create doc2vec vector columns    from gensim.test.utils import common_texts    from gensim.models.doc2vec import Doc2Vec, TaggedDocument        documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(dataset["clean_text"].apply(lambda x: x.split(" ")))]        # train a Doc2Vec model with our text data    model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)        # transform each document into a vector data    doc2vec_df = dataset["clean_text"].apply(lambda x: model.infer_vector(x.split(" "))).apply(pd.Series)    doc2vec_df.columns = ["doc2vec_vector_" + str(x) for x in doc2vec_df.columns]    dataset = pd.concat([dataset, doc2vec_df], axis=1)    

4 Answers

手掌心

TA貢獻(xiàn)1942條經(jīng)驗(yàn) 獲得超3個(gè)贊

您的新數(shù)據(jù)可能與您用于訓(xùn)練和測(cè)試模型的第一個(gè)數(shù)據(jù)集有很大不同。預(yù)處理技術(shù)和統(tǒng)計(jì)分析將幫助您表征數(shù)據(jù)并比較不同的數(shù)據(jù)集。由于各種原因,可能會(huì)觀察到新數(shù)據(jù)的性能不佳,包括:

  1. 您的初始數(shù)據(jù)集在統(tǒng)計(jì)上不能代表更大的數(shù)據(jù)集(例如:您的數(shù)據(jù)集是一個(gè)極端案例)

  2. 過(guò)度擬合:你過(guò)度訓(xùn)練你的模型,它包含訓(xùn)練數(shù)據(jù)的特異性(噪聲)

  3. 不同的預(yù)處理方法

  4. 不平衡的訓(xùn)練數(shù)據(jù)集。ML 技術(shù)最適合平衡數(shù)據(jù)集(訓(xùn)練集中不同類別的平等出現(xiàn))
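A quick way to check points 1 and 4 is to compare the class proportions of the two datasets. A minimal sketch, assuming both datasets are pandas DataFrames with a `label` column (the data here is made up for illustration):

```python
import pandas as pd

# Hypothetical label columns from the training set and the new set
train = pd.DataFrame({"label": ["pos", "pos", "pos", "neg"]})
new = pd.DataFrame({"label": ["pos", "neg", "neg", "neg"]})

# Normalized value counts show each class's share of the dataset,
# making imbalance and distribution shift visible at a glance
train_dist = train["label"].value_counts(normalize=True)
new_dist = new["label"].value_counts(normalize=True)
print(train_dist)
print(new_dist)
```

If the proportions differ this much between training and deployment data, accuracy measured on the first dataset will not transfer.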


守著星空守著你

TA貢獻(xiàn)1799條經(jīng)驗(yàn) 獲得超8個(gè)贊

我對(duì)情緒分析中不同分類的表現(xiàn)進(jìn)行了調(diào)查研究。對(duì)于特定的推特?cái)?shù)據(jù)集,我曾經(jīng)執(zhí)行邏輯回歸、樸素貝葉斯、支持向量機(jī)、k 最近鄰 (KNN) 和決策樹等模型。對(duì)所選數(shù)據(jù)集的觀察表明,Logistic 回歸和樸素貝葉斯在所有類型的測(cè)試中都準(zhǔn)確地表現(xiàn)良好。接下來(lái)是SVM。然后進(jìn)行準(zhǔn)確的決策樹分類。從結(jié)果來(lái)看,KNN 的準(zhǔn)確度得分最低。邏輯回歸和樸素貝葉斯模型在情緒分析和預(yù)測(cè)方面分別表現(xiàn)更好。 情緒分類器(準(zhǔn)確度分?jǐn)?shù) RMSE) LR (78.3541 1.053619) NB (76.764706 1.064738) SVM (73.5835 1.074752) DT (69.2941 1.145234) KNN (62.9476 1.376589)

In cases like this, feature extraction is critical.
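A comparison like the one above can be sketched with scikit-learn's `cross_val_score`. This is a hedged illustration on synthetic data, not the answerer's Twitter dataset; GaussianNB stands in for the naive Bayes variant, which the answer does not specify:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a vectorized text dataset
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

# Mean 5-fold cross-validated accuracy for each classifier
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```

The ranking on real text data depends heavily on the features extracted, which is the answer's point.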


largeQ

TA貢獻(xiàn)2039條經(jīng)驗(yàn) 獲得超8個(gè)贊

導(dǎo)入必需品

import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import time

df = pd.read_csv('FilePath', header=0)
X = df['content']
y = df['sentiment']


def lrSentimentAnalysis(n):
    # Using CountVectorizer to convert text into tokens/features
    vect = CountVectorizer(ngram_range=(1, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=n)

    # Using training data to transform text into counts of features for each message
    vect.fit(X_train)
    X_train_dtm = vect.transform(X_train)
    X_test_dtm = vect.transform(X_test)

    # dual = [True, False]
    max_iter = [100, 110, 120, 130, 140, 150]
    C = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5]
    solvers = ['newton-cg', 'lbfgs', 'liblinear']
    param_grid = dict(max_iter=max_iter, C=C, solver=solvers)

    LR1 = LogisticRegression(penalty='l2', multi_class='auto')
    grid = GridSearchCV(estimator=LR1, param_grid=param_grid, cv=10, n_jobs=-1)
    grid_result = grid.fit(X_train_dtm, y_train)

    # Summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

    y_pred = grid_result.predict(X_test_dtm)
    print('Accuracy Score: ', metrics.accuracy_score(y_test, y_pred) * 100, '%')
    # print('Confusion Matrix: ', metrics.confusion_matrix(y_test, y_pred))
    # print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
    # print('MSE:', metrics.mean_squared_error(y_test, y_pred))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

    return [n, metrics.accuracy_score(y_test, y_pred) * 100,
            grid_result.best_estimator_.get_params()['max_iter'],
            grid_result.best_estimator_.get_params()['C'],
            grid_result.best_estimator_.get_params()['solver']]


def drawConfusionMatrix(accList):
    # Rebuild the same split and vectorizer for the best run
    vect = CountVectorizer(ngram_range=(1, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=accList[0])

    # Using training data to transform text into counts of features for each message
    vect.fit(X_train)
    X_train_dtm = vect.transform(X_train)
    X_test_dtm = vect.transform(X_test)

    # Refit Logistic Regression with the best hyperparameters
    LR = LogisticRegression(penalty='l2', max_iter=accList[2], C=accList[3], solver=accList[4])
    LR.fit(X_train_dtm, y_train)
    y_pred = LR.predict(X_test_dtm)

    # creating a heatmap for the confusion matrix
    data = metrics.confusion_matrix(y_test, y_pred)
    df_cm = pd.DataFrame(data, columns=np.unique(y_test), index=np.unique(y_test))
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'
    plt.figure(figsize=(10, 7))
    sns.set(font_scale=1.4)  # label size
    sns.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})  # font size
    fig0 = plt.gcf()
    fig0.show()
    fig0.savefig('FilePath', dpi=100)


def findModelWithBestAccuracy(accList):
    accuracyList = [item[1] for item in accList]
    N = accuracyList.index(max(accuracyList))
    print('Best Model:', accList[N])
    return accList[N]


accList = []
print('Logistic Regression')
print('grid search method for hyperparameter tuning (accuracy by cross validation)')
for i in range(2, 7):
    n = i / 10.0
    print("\nsplit ", i - 1, ": n=", n)
    accList.append(lrSentimentAnalysis(n))

drawConfusionMatrix(findModelWithBestAccuracy(accList))



幕布斯7119047

TA貢獻(xiàn)1794條經(jīng)驗(yàn) 獲得超8個(gè)贊

預(yù)處理是構(gòu)建性能良好的分類器的重要部分。當(dāng)您在訓(xùn)練和測(cè)試集性能之間存在如此大的差異時(shí),很可能在您的(測(cè)試集)預(yù)處理中發(fā)生了一些錯(cuò)誤。

無(wú)需任何編程也可使用分類器。

您可以訪問(wèn) Web 服務(wù)洞察分類器并先嘗試免費(fèi)構(gòu)建。

