
Fastest way in Python to count the rows of one data frame whose strings contain substrings from another

Python

料青山看我應如是 2023-09-12 17:29:43

I have two data frames, one with words and the other with texts. For each word in the first data frame, I want to count the rows of the second data frame that contain that word.

Words =

ID | Word
------------
1 | Introduction
2 | database
3 | country
4 | search

Texts =

ID | Text
------------
1 | Introduction to python
2 | sql is a database
3 | Introduction to python in our country
4 | search for a python teacher in our country

The final output I want is

ID | Word | Count
---------------------
1 | Introduction | 2
2 | database | 1
3 | country | 2
4 | search | 1

I have 200000 rows in the words df and 55000 rows in the texts df (each text is about 2000 words long). With the following code, the whole process takes about 76 hours:

'''
from collections import Counter
from tqdm import tqdm

def docCount(docdf, worddf):
    final_dict = {}
    for i in tqdm(worddf.itertuples()):
        # i[2] is the Word column; this scans every text once per word
        docdf["Count"] = docdf.Text.str.contains(i[2])
        temp_dict = {i[2]: docdf.Count.sum()}
        final_dict = dict(Counter(final_dict)+Counter(temp_dict))
    return final_dict
'''


3 answers

SMILET

Contributed 1796 experience points, received over 4 likes

You can try this example to speed things up:

import pandas as pd

df1 = pd.DataFrame({'Word':['Introduction', 'database', 'country', 'search']})

df2 = pd.DataFrame({'Text':['Introduction to python', 'sql is a database', 'Introduction to python in our country', 'search for a python teacher in our country']})

tmp = pd.DataFrame(df2['Text'].str.split().explode()).set_index('Text').assign(c=1)

tmp = tmp.groupby(tmp.index)['c'].sum()

print( df1.merge(tmp, left_on='Word', right_on=tmp.index) )

Prints:

           Word  c
0  Introduction  2
1      database  1
2       country  2
3        search  1
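Note that the snippet above counts every occurrence of a token, so a word repeated within one text is counted more than once, whereas the question asks for the number of rows containing the word. A minimal sketch of a per-row count follows, assuming whitespace tokenization and exact, case-sensitive token matches. The helper name `rows_containing` is hypothetical and not part of the original answer.

```python
from collections import Counter

import pandas as pd


def rows_containing(words: pd.Series, texts: pd.Series) -> pd.DataFrame:
    """Count, for each word, how many texts contain it as a whole token."""
    wanted = set(words)
    doc_freq = Counter()
    for text in texts:
        # set() de-duplicates tokens, so each text counts at most once per word
        doc_freq.update(set(text.split()) & wanted)
    # Counter returns 0 for words that never appear
    return pd.DataFrame({"Word": words, "Count": [doc_freq[w] for w in words]})


df1 = pd.DataFrame({"Word": ["Introduction", "database", "country", "search"]})
df2 = pd.DataFrame({"Text": [
    "Introduction to python",
    "sql is a database",
    "Introduction to python in our country",
    "search for a python teacher in our country",
]})

result = rows_containing(df1["Word"], df2["Text"])
print(result)
```

Each text is tokenized once and intersected with the word set, so the work grows with the total number of tokens rather than with the number of words multiplied by the number of texts.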

2023-09-12

當年話下

Contributed 1890 experience points, received over 9 likes

Use Series.str.split with Series.explode to get a Series of words:

s = df2['Text'].str.split().explode()

# older pandas versions

#s = df2['Text'].str.split(expand=True).stack()

Then keep only the matching values with Series.isin and boolean indexing, count them with Series.value_counts, and finally use DataFrame.join:

df1 = df1.join(s[s.isin(df1['Word'])].value_counts().rename('Count'), on='Word')

print (df1)

           Word  Count
0  Introduction      2
1      database      1
2       country      2
3        search      1
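With the join above, a word that appears in no text ends up with NaN in Count, and the column becomes float. A minimal sketch of one way to get integer zeros instead, assuming the same whitespace tokenization; the extra word "java" is illustrative data added here, not part of the original example.

```python
import pandas as pd

df1 = pd.DataFrame({"Word": ["Introduction", "database", "country", "search", "java"]})
df2 = pd.DataFrame({"Text": [
    "Introduction to python",
    "sql is a database",
    "Introduction to python in our country",
    "search for a python teacher in our country",
]})

# One token per row, then keep only tokens that are in the word list
s = df2["Text"].str.split().explode()
counts = s[s.isin(df1["Word"])].value_counts()

# reindex aligns the counts to df1's word order; absent words get 0 instead of NaN
df1["Count"] = counts.reindex(df1["Word"], fill_value=0).to_numpy()
print(df1)
```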

2023-09-12

慕勒3428872

Contributed 1848 experience points, received over 6 likes

Here is a simple solution:

# Word and Text are the two data frames from the question
world_count = pd.DataFrame(
    {'words': Word['Word'].tolist(),
     'count': [Text['Text'].str.contains(w).sum() for w in Word['Word']],
    }).rename_axis('ID')

Output:

world_count.head()

'''
           words  count
ID
0   Introduction      2
1       database      1
2        country      2
3         search      1
'''

Step-by-step solution:

# Convert column to list
words = Word['Word'].tolist()

# Get the count
count = [Text['Text'].str.contains(w).sum() for w in words]

world_count = pd.DataFrame(
    {'words': words,
     'count': count,
    }).rename_axis('ID')

Tip:

I suggest converting to lowercase so that you do not miss any counts because of upper/lower case differences:

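import re
import pandas as pd

# Normalize the words once, then reuse the same list for both columns
words = Word['Word'].str.lower().str.strip().tolist()

world_count = pd.DataFrame(
    {'words': words,
     # re.escape keeps characters such as "." or "(" from being read as regex syntax
     'count': [Text['Text'].str.contains(re.escape(w), flags=re.IGNORECASE, regex=True).sum() for w in words],
    }).rename_axis('ID')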
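Because `Series.str.contains` interprets its pattern as a regular expression by default, a word containing characters such as "." can match more rows than intended. It also matches substrings, unlike the split-based answers, which match whole tokens. The sketch below illustrates the difference on made-up data (the `texts` values and variable names are assumptions for illustration) and shows `regex=False` with `case=False` as one way to get a literal, case-insensitive substring match.

```python
import pandas as pd

# Illustrative data, not from the original question
texts = pd.Series(["Python 3.5 release", "build 345 passed", "PYTHON 3.5"])

# Default regex=True: "." matches any character, so "345" also matches
as_regex = int(texts.str.contains("3.5").sum())

# regex=False treats the pattern as a literal substring
as_literal = int(texts.str.contains("3.5", regex=False).sum())

# case=False ignores case while still matching literally
case_insensitive = int(texts.str.contains("python", case=False, regex=False).sum())

print(as_regex, as_literal, case_insensitive)
```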

2023-09-12
