3 回答

TA貢獻(xiàn)1895條經(jīng)驗(yàn) 獲得超3個(gè)贊
正則表達(dá)式是解決此問題的一種方法,但您也可以將其視為遍歷文本中的每個(gè)標(biāo)記并應(yīng)用一些邏輯來形成組。
例如,我們可以先找到一組名稱和文本:
from itertools import groupby
def isName(word):
# Names end with ':'
return word.endswith(":")
text_split = [
" ".join(list(g)).rstrip(":")
for i, g in groupby(text.replace("]", "] ").split(), isName)
]
print(text_split)
#['CHRIS',
# 'Hello, how are you...',
# 'PETER',
# 'Great, you?',
# 'PAM',
# 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
# 'CHRIS',
# 'Are you ok?']
接下來,您可以將成對(duì)的連續(xù)元素收集text_split到元組中:
print([(text_split[i*2], text_split[i*2+1]) for i in range(len(text_split)//2)])
#[('CHRIS', 'Hello, how are you...'),
# ('PETER', 'Great, you?'),
# ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
# ('CHRIS', 'Are you ok?')]
我們幾乎達(dá)到了所需的輸出。我們只需要處理方括號(hào)中的文本。您可以為此編寫一個(gè)簡(jiǎn)單的函數(shù)。(誠然,正則表達(dá)式是這里的一個(gè)選項(xiàng),但我在這個(gè)答案中故意避免這樣做。)
這是我想出的快速方法:
def isClosingBracket(word):
return word.endswith("]")
def processWords(words):
if "[" not in words:
return [words, None]
else:
return [
" ".join(g).replace("]", ".")
for i, g in groupby(map(str.strip, words.split("[")), isClosingBracket)
]
print(
[(text_split[i*2], *processWords(text_split[i*2+1])) for i in range(len(text_split)//2)]
)
#[('CHRIS', 'Hello, how are you...', None),
# ('PETER', 'Great, you?', None),
# ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.'),
# ('CHRIS', 'Are you ok?', None)]
請(qǐng)注意,使用 將*的結(jié)果解包processWords到tuple嚴(yán)格來說是python 3 的功能。

TA貢獻(xiàn)1802條經(jīng)驗(yàn) 獲得超10個(gè)贊
你可以這樣做re.findall:
>>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
[('CHRIS', ' Hello, how are you...', ''),
('PETER', ' Great, you? ', ''),
('PAM',
' He is resting.',
'[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
('CHRIS', ' Are you ok?', '')]
您將必須弄清楚如何自己刪除方括號(hào),這在仍然嘗試匹配所有內(nèi)容的同時(shí)使用正則表達(dá)式無法完成。
正則表達(dá)式分解
\b # Word boundary
(\S+) # First capture group, string of characters not having a space
: # Colon
( # Second capture group
[^ # Match anything that is not...
: # a colon
\[\] # or square braces
]+? # Non-greedy match
)
\n? # Optional newline
( # Third capture group
\[ # Literal opening brace
[^:]+? # Similar to above - exclude colon from match
\]
\n? # Optional newlines
)? # Third capture group is optional
(?= # Lookahead for...
\b # a word boundary, followed by
\S+ # one or more non-space chars, and
: # a colon
| # Or,
$ # EOL
)
添加回答
舉報(bào)