
Correcting muddled dates in a Pandas DataFrame

Python

瀟湘沐 2022-12-27 09:55:49

I have a time-series DataFrame with a million rows in which some values in the Date column have their day and month muddled. How can I untangle them efficiently without breaking the values that are already correct?

# this creates a dataframe with muddled dates
import pandas as pd
import numpy as np
from pandas import Timestamp

start = Timestamp(2013,1,1)
dates = pd.date_range(start, periods=942)[::-1]
muddler = {}
for d in dates:
    if d.day < 13:
        muddler[d] = Timestamp(d.year, d.day, d.month)
    else:
        muddler[d] = Timestamp(d.year, d.month, d.day)
df = pd.DataFrame()
df['Date'] = dates
df['Date'] = df['Date'].map(muddler)
# now what? (assuming I don't know how the dates are muddled)


2 answers

小唯快跑啊


One option might be to compute a fit through the timestamps and modify those that deviate from the fit by more than a certain threshold. Example:

import pandas as pd
import numpy as np

# rebuild the sample data from the question
start = pd.Timestamp(2013,1,1)
dates = pd.date_range(start, periods=942)[::-1]
muddler = {}
for d in dates:
    if d.day < 13:
        muddler[d] = pd.Timestamp(d.year, d.day, d.month)
    else:
        muddler[d] = pd.Timestamp(d.year, d.month, d.day)
df = pd.DataFrame()
df['Date'] = dates
df['Date'] = df['Date'].map(muddler)

# convert date col to posix timestamp (seconds since 1970-01-01)
# note: np.float was removed in NumPy 1.24, so subtract the epoch instead
epoch = pd.Timestamp(1970, 1, 1)
df['ts'] = (df['Date'] - epoch).dt.total_seconds()

# calculate a linear fit for ts col
x = np.linspace(df['ts'].iloc[0], df['ts'].iloc[-1], df['ts'].size)
df['ts_linfit'] = np.polyval(np.polyfit(x, df['ts'], 1), x)

# set a thresh and derive a mask that masks differences between
# fit and timestamp greater than thresh:
thresh = 1.2e6 # seconds (about 14 days); you might want to tweak this...
m = np.absolute(df['ts']-df['ts_linfit']) > thresh

# create new date col as copy of original
df['Date_filtered'] = df['Date']

# modify values that were caught in the mask
# (the swap is only valid when the day can serve as a month, i.e. day <= 12)
df.loc[m, 'Date_filtered'] = df.loc[m, 'Date_filtered'].apply(
    lambda x: pd.Timestamp(x.year, x.day, x.month) if x.day <= 12 else x)

# also to posix timestamp
df['ts_filtered'] = (df['Date_filtered'] - epoch).dt.total_seconds()

# plot both series for a visual comparison (requires matplotlib)
ax = df['ts'].plot(label='original')
ax = df['ts_filtered'].plot(label='filtered')
ax.legend()
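Besides the plot, a numeric check can show whether the correction worked. The following is a minimal sketch, assuming the dates are supposed to form a regular daily series in descending order, as in the question's sample data. The helper name `count_breaks` is hypothetical and is introduced here only for illustration.

```python
import pandas as pd

def count_breaks(dates: pd.Series, step: pd.Timedelta = pd.Timedelta(days=-1)) -> int:
    """Count consecutive pairs whose difference is not the expected step."""
    diffs = dates.diff().iloc[1:]  # the first diff is NaT, so skip it
    return int((diffs != step).sum())

# Rebuild the question's sample: descending daily dates,
# with day and month swapped whenever day < 13
true_dates = pd.Series(pd.date_range("2013-01-01", periods=942)[::-1])
muddled = true_dates.map(
    lambda d: pd.Timestamp(d.year, d.day, d.month) if d.day < 13 else d
)

breaks_true = count_breaks(true_dates)   # a clean series has no breaks
breaks_muddled = count_breaks(muddled)   # a muddled series has many
print(breaks_true, breaks_muddled)
```

Calling `count_breaks(df['Date_filtered'])` on the filtered column should return 0 if every muddled date was caught; a non-zero count suggests the threshold needs tuning.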

Answered 2022-12-27

翻翻過去那場雪


While trying to create a minimal reproducible example, I actually solved my own problem, but I hope there is a more efficient way to do what I want...

import pandas as pd
from pandas import Timestamp

# I first define a function to examine the dates
def disordered_muddle(date_series, future_first=True):
    """Check whether a series of dates is disordered or just muddled"""
    # note: future_first is accepted but not used below
    disordered = []
    muddle = []
    dates = date_series
    different_dates = pd.Series(dates.unique())
    date = different_dates[0]
    for i, d in enumerate(different_dates[1:]):
        # we expect the date's dayofyear to decrease by one
        if d.dayofyear != date.dayofyear-1:
            # unless the year is changing
            if d.year != date.year-1:
                try:
                    # we check if the day and month are muddled
                    # if d.day > 12 this raises a ValueError
                    unmuddle = Timestamp(d.year, d.day, d.month)
                    if unmuddle.dayofyear == date.dayofyear-1:
                        muddle.append(d)
                        d = unmuddle
                    elif unmuddle.year == date.year-1:
                        muddle.append(d)
                        d = unmuddle
                    else:
                        disordered.append(d)
                except ValueError:
                    disordered.append(d)
        date = d
    if len(disordered)==0 and len(muddle)==0:
        return False
    else:
        return disordered, muddle

# unpacking works here because the sample data does contain muddled dates;
# on clean data the function returns False instead of a tuple
disorder, muddle = disordered_muddle(df['Date'])

# finally unmuddle the dates
muddle_set = set(muddle)  # set lookup avoids scanning the list for every row
date_correction = {}
for d in df['Date']:
    if d in muddle_set:
        date_correction[d] = Timestamp(d.year, d.day, d.month)
    else:
        date_correction[d] = Timestamp(d.year, d.month, d.day)
df['CorrectedDate'] = df['Date'].map(date_correction)

# returns False when no disordered or muddled dates remain
disordered_muddle(df['CorrectedDate'])
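On the wish for a more efficient method: if the series is known to be regularly spaced, the row-by-row loop can be replaced with vectorized operations. The following is a minimal sketch under strong assumptions that the question does not guarantee: constant daily spacing, no gaps or duplicates, and a first value that is itself correct (for example, one with day > 12). The function name `unmuddle_regular` and the parameter `step_days` are hypothetical.

```python
import numpy as np
import pandas as pd

def unmuddle_regular(dates: pd.Series, step_days: int = -1) -> pd.Series:
    """Swap day and month wherever doing so matches the expected regular sequence.

    Assumes constant spacing, no gaps, and a correct first value.
    """
    # Expected sequence, extrapolated from the first value
    offsets = pd.to_timedelta(np.arange(len(dates)) * step_days, unit="D")
    expected = pd.Series((dates.iloc[0] + offsets).to_numpy(), index=dates.index)
    # Candidate with day and month swapped; impossible dates become NaT
    swapped = pd.to_datetime(
        pd.DataFrame({"year": dates.dt.year,
                      "month": dates.dt.day,
                      "day": dates.dt.month}),
        errors="coerce",
    )
    # Use the swapped value only where the original disagrees and the swap agrees
    use_swap = (dates != expected) & (swapped == expected)
    return dates.where(~use_swap, swapped)

# Rebuild the question's sample data and correct it
true_dates = pd.Series(pd.date_range("2013-01-01", periods=942)[::-1])
muddled = true_dates.map(
    lambda d: pd.Timestamp(d.year, d.day, d.month) if d.day < 13 else d
)
fixed = unmuddle_regular(muddled)
print((fixed == true_dates).all())
```

Rows where neither the original nor the swapped value matches the expected date are left unchanged, so they can still be inspected separately as genuinely disordered.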

Answered 2022-12-27
