Semantically detecting text blocks with Python


桃花長相依 2021-10-19 15:09:30
I have this sample block of log text:

20190122 09:00,000 ###PERFORMANCE string1 string2 string3
20190122 09:10,500 number1 string1 string2 string3
20190122 09:24,670 number2 string1 string2 string3
20190122 10:05,000 number3 string1 string2 string3
20190122 10:33,960 number4 string1 string2 string3
20190122 11:00,321 number5 string1 string2 string3
20190122 11:40,256 ###PERFORMANCE string1 string2 string3
20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2
20190123 10:32,130 number1 string1 string2 string3 string4 date1 number2
20190123 08:00,000 ###PERFORMANCE string1 string2 string3
20190123 08:10,500 number1 string1 string2 string3
20190123 08:24,670 number2 string1 string2 string3
20190123 09:05,000 number3 string1 string2 string3
20190123 10:33,960 number4 string1 string2 string3
20190123 10:00,321 number5 string1 string2 string3
20190123 13:40,256 ###PERFORMANCE string1 string2 string3
20190124 10:00,000 ###PERFORMANCE string1 string2 string3
20190124 10:10,500 number1 string1 string2 string3
20190124 10:24,670 number2 string1 string2 string3
20190124 11:05,000 number3 string1 string2 string3
20190124 12:33,960 number4 string1 string2 string3
20190124 13:00,321 number5 string1 string2 string3
20190124 13:40,256 ###PERFORMANCE string1 string2 string3

What I want to do with Python is detect each ###PERFORMANCE text block, as in this example. As you can see, there are 3 blocks of interest, each delimited by lines containing the ###PERFORMANCE string. The first one starts at line 1 and ends at line 7. The content between line 7 and line 10 must not be treated as a block of interest. The number of lines in each block can also vary (so going by line numbers is not a good idea).

All I have done so far is read the text file line by line:

logFile = "testLog.txt"
with open(logFile) as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]
for line in content:
    print(line)

How could I accomplish this task? Is using NLTK a good idea? Is it even applicable to this task? Any general advice?

2 Answers

一只萌萌小番薯


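I think you can do what you need with a simple check. Let me explain, assuming I have understood you correctly. You can keep a flag (a True/False value) that tracks whether you are currently inside an interesting block. Every time you find "###PERFORMANCE", you flip this flag. You can then save the two kinds of lines in two lists, or in any structure you like.

Here is the code snippet: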


logFile = "logfile.txt"

with open(logFile) as f:
    content = f.readlines()

# strip surrounding whitespace, such as the trailing `\n`, from each line
content = [x.strip() for x in content]

# flag: True while we are inside an interesting block
are_we_in_the_interesting_block = False

# two lists to save the lines
interesting_block = []
non_interesting_block = []

for line in content:
    # str.find returns the index of the match, or -1 if the text is absent
    is_there_performance = line.find('###PERFORMANCE')

    # compare against -1: an index of 0 (marker at the very start) is still a match
    if is_there_performance != -1:
        # a marker line flips the flag and is not stored in either list
        are_we_in_the_interesting_block = not are_we_in_the_interesting_block
    else:
        if are_we_in_the_interesting_block:
            # here I append to a list, but you can do your processing
            interesting_block.append(line)
        else:
            # here processing of the non interesting parts
            non_interesting_block.append(line)

print('Interesting blocks')
print(interesting_block)

print('\n')
print('Non interesting blocks')
print(non_interesting_block)

The resulting output will be:


Interesting blocks

['20190122 09:10,500 number1 string1 string2 string3', '20190122 09:24,670 number2 string1 string2 string3', '20190122 10:05,000 number3 string1 string2 string3', '20190122 10:33,960 number4 string1 string2 string3', '20190122 11:00,321 number5 string1 string2 string3', '20190123 08:10,500 number1 string1 string2 string3', '20190123 08:24,670 number2 string1 string2 string3', '20190123 09:05,000 number3 string1 string2 string3', '20190123 10:33,960 number4 string1 string2 string3', '20190123 10:00,321 number5 string1 string2 string3', '20190124 10:10,500 number1 string1 string2 string3', '20190124 10:24,670 number2 string1 string2 string3', '20190124 11:05,000 number3 string1 string2 string3', '20190124 12:33,960 number4 string1 string2 string3', '20190124 13:00,321 number5 string1 string2 string3']



Non interesting blocks

['20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2', '20190123 10:32,130 number1 string1 string2 string3 string4 date1 number2']

Then, if needed, you can access interesting_block[n] to get the line at index n.
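The snippet above collects every interesting line into one flat list, so the boundaries between the three blocks are lost. A minimal sketch of one way to keep them apart follows, assuming each block is opened and closed by its own marker line as in the question's sample. The function name `split_blocks` and the shortened `sample` list are illustrative and do not come from the original answer.

```python
def split_blocks(lines, marker="###PERFORMANCE"):
    """Group the lines found between each opening/closing marker pair."""
    blocks = []      # one inner list per completed block
    current = None   # None means we are currently outside a block
    for line in lines:
        if marker in line:
            if current is None:
                current = []            # opening marker: start a new block
            else:
                blocks.append(current)  # closing marker: store the finished block
                current = None
        elif current is not None:
            current.append(line)        # ordinary line inside a block
        # ordinary lines outside a block are ignored
    return blocks


# Shortened, hypothetical excerpt in the same shape as the question's log
sample = [
    "20190122 09:00,000 ###PERFORMANCE string1 string2 string3",
    "20190122 09:10,500 number1 string1 string2 string3",
    "20190122 11:40,256 ###PERFORMANCE string1 string2 string3",
    "20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2",
    "20190123 08:00,000 ###PERFORMANCE string1 string2 string3",
    "20190123 08:10,500 number1 string1 string2 string3",
    "20190123 08:24,670 number2 string1 string2 string3",
    "20190123 13:40,256 ###PERFORMANCE string1 string2 string3",
]

blocks = split_blocks(sample)
for index, block in enumerate(blocks):
    print(index, block)
```

With this shape, blocks[i][n] gives the line at index n of block i. A block that is opened but never closed is silently dropped in this sketch; whether that is the right behavior depends on the log.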


2021-10-19

慕后森


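Since you are only matching on the PERFORMANCE delimiter, using NLTK seems like overkill. A simple approach is a plain membership test (is the expected string in the line?), toggling your capture mode whenever it matches. For example: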


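logfile = "logfile.txt"  # assumed path; the original snippet left `logfile` undefined
IDENTIFIER = 'PERFORMANCE'
in_block = False

with open(logfile) as f:
    # iterate over the file directly instead of loading every line with readlines()
    for line in f:
        if IDENTIFIER in line:
            # Toggle the boolean; the delimiter line itself is not printed
            in_block = not in_block
            continue
        if in_block:
            # the line already ends with a newline, so suppress print's own
            print(line, end='')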
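The same toggle can be wrapped in a generator so that blocks are produced one at a time and a block that is still open at the end of the file is reported rather than lost. This is only a sketch under assumptions: the name `iter_blocks`, the `closed` flag, and the abbreviated log text fed through `io.StringIO` are all hypothetical and stand in for a real file opened with `open(logfile)`.

```python
import io

IDENTIFIER = "PERFORMANCE"


def iter_blocks(stream, identifier=IDENTIFIER):
    """Yield (lines, closed) pairs; closed is False if the input ended mid-block."""
    block = None  # None while outside a block
    for raw in stream:
        line = raw.rstrip("\n")
        if identifier in line:
            if block is None:
                block = []          # delimiter opens a block
            else:
                yield block, True   # delimiter closes the current block
                block = None
        elif block is not None:
            block.append(line)
    if block is not None:
        yield block, False          # input ended before a closing delimiter


# Abbreviated, made-up log text; any iterable of lines (such as a file) works
log = io.StringIO(
    "09:00 ###PERFORMANCE a\n"
    "09:10 number1 a\n"
    "09:20 ###PERFORMANCE a\n"
    "09:30 ignored\n"
    "09:40 ###PERFORMANCE a\n"
    "09:50 number2 a\n"
)

results = list(iter_blocks(log))
for lines, closed in results:
    print(closed, lines)
```

Because the generator reads the stream lazily, it does not need the whole log in memory, which may matter for large files.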


2021-10-19