2 Answers

Contributed 1873 experience points, received 9+ upvotes
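There are two approaches. Both are pretty clumsy and depend heavily on the exact shape of the original strings, so small variations will break them. You can, however, adapt the code to make it more flexible.
Both options rely on the lines meeting these characteristics. Each group of interest must:
- start with a letter or a forward slash, possibly capitalized;
- have the header of interest followed by a colon (":");
- have only the first word after the colon captured.
Method one, regex. This one can only grab two chunks of data. The second group is "everything else", because I couldn't get the search pattern to repeat properly :P
Code: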
import re

l = ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']
pattern = ''.join([
    r"(",          # Start capturing group 1
    r"\s*[A-Z/]",  # Optional whitespace, then the first capital letter or forward slash
    r".+?:",       # Any characters (non-greedy) up to and including the colon
    r"\s*",        # Zero or more whitespace characters
    r"\w+\s*",     # One or more word characters, i.e. [a-zA-Z0-9_], plus trailing whitespace
    r")",          # End capturing group 1
    r"(.*)",       # Group 2: everything else
])
for s in l:
    m = re.search(pattern, s)
    print("----------------")
    if m:  # re.search returns None when nothing matches
        print(m.group(1))
        print(m.group(2))
Output:
----------------
MC/MX/FF Number(s): None
DUNS Number: --
----------------
Power Units: 1
Drivers: 1
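If you do want the pattern to repeat, one possible way is to let re.findall scan the string for every "header: first word" pair instead of matching the whole line at once. This is only a sketch under the same assumptions as above (header starts with a capital letter or slash, only the first word after the colon is kept); the function name extract_pairs and the simplified pattern are mine, not part of the original code.

```python
import re

# Header: a capital letter or slash, then anything up to and including the colon.
# Value: the first run of non-whitespace characters after the colon.
PAIR_PATTERN = re.compile(r"([A-Z/][^:]*:)\s*(\S+)")


def extract_pairs(text):
    """Return every (header, first_word) pair found in text, left to right."""
    return PAIR_PATTERN.findall(text)


if __name__ == "__main__":
    lines = ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']
    for line in lines:
        print(extract_pairs(line))
```

It has the same weakness as method one: a multi-word value such as "NOT AUTHORIZED" gets cut after the first word, and the leftover word ends up glued to the next header.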
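Method two, parse the string word by word. This method has the same basic characteristics as the regex, but it can handle more than two chunks of interest. How it works:
1. Walk through each string word by word, appending each word to newstring.
2. When a word contains a colon, set a flag.
3. On the next iteration, add the first word after the colon to newstring. If you need to, you can change this to take 1-2, 1-3, or 1-n words. You could also keep adding words after colonflag is set until some condition is met, such as a word that starts with a capital letter, although that would break on values like "None". You could also keep going until you hit an all-caps word, but a header that isn't all caps would break that.
4. Append newstring to newlist, reset the flag, and continue parsing words.
Code: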
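l = ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']
for s in l:
    newlist = []
    newstring = ""
    colonflag = False
    for w in s.split():
        newstring += " " + w  # every chunk therefore starts with a space
        if colonflag:
            # w is the first word after the colon: close the current chunk
            newlist.append(newstring)
            newstring = ""
            colonflag = False
        if ":" in w:
            colonflag = True
    print(newlist)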
Output:
[' MC/MX/FF Number(s): None', ' DUNS Number: --']
[' Power Units: 1', ' Drivers: 1']
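The "1-n words" variation mentioned in the steps could look roughly like this. It is a sketch, not a drop-in replacement: the function name group_fields and the words_after_colon parameter are made up for illustration, it assumes every value has the same number of words (at least one), and unlike the code above it joins words without a leading space.

```python
def group_fields(text, words_after_colon=1):
    """Split text into 'header: value' chunks, keeping a fixed number of
    words after each colon. Assumes words_after_colon >= 1."""
    groups = []
    current = []
    remaining = None  # None means no colon has been seen for the current chunk
    for word in text.split():
        current.append(word)
        if remaining is not None:
            remaining -= 1
            if remaining == 0:
                # Enough value words collected: close the chunk.
                groups.append(" ".join(current))
                current = []
                remaining = None
        elif ":" in word:
            remaining = words_after_colon
    # Keep a trailing chunk whose value was shorter than requested.
    if remaining is not None and current:
        groups.append(" ".join(current))
    return groups


if __name__ == "__main__":
    print(group_fields('MC/MX/FF Number(s): None DUNS Number: -- '))
    print(group_fields('Operating Status: NOT AUTHORIZED Out of Service Date: None', 2))
```

A fixed word count is still fragile: with words_after_colon=2 a one-word value in the middle of a line would swallow the first word of the next header.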
Option three: create a list of all the expected headers, e.g. header_list = ["Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):", "DUNS Number:", "Power Units:", "Drivers:"]
and split/parse based on those headers.
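A rough sketch of what splitting on a known header list might look like, using re.split with a capturing group so the headers are kept in the result. The function name split_on_headers is hypothetical, and it assumes the headers appear in the text exactly as written in the list.

```python
import re

# Expected headers, copied from the list above.
header_list = [
    "Operating Status:",
    "Out of Service Date:",
    "MC/MX/FF Number(s):",
    "DUNS Number:",
    "Power Units:",
    "Drivers:",
]


def split_on_headers(text, headers=header_list):
    """Return a dict mapping each header found in text to the text that follows it."""
    # Try longer headers first so a longer header wins over a shorter overlap.
    ordered = sorted(headers, key=len, reverse=True)
    alternation = "|".join(re.escape(h) for h in ordered)
    # The capturing group makes re.split keep the headers in the output list.
    parts = re.split("(" + alternation + ")", text)
    # parts[0] is whatever precedes the first header; then header/value alternate.
    return {header: value.strip() for header, value in zip(parts[1::2], parts[2::2])}


if __name__ == "__main__":
    print(split_on_headers("Operating Status: NOT AUTHORIZED Out of Service Date: None"))
    print(split_on_headers("MC/MX/FF Number(s): None DUNS Number: -- "))
```

Because the value is everything up to the next known header, multi-word values such as "NOT AUTHORIZED" survive intact, at the cost of having to maintain the header list by hand.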
Option four: use natural language processing and machine learning to actually figure out where the logical sentences are ;)

Contributed 1830 experience points, received 3+ upvotes
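Take a look at pyparsing. It seems to be the most "natural" way to express combinations of words, detect the relationships between them (grammatically), and produce a structured result. There are plenty of tutorials and documentation online.
You can install pyparsing with `pip install pyparsing`.
Parsing: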
Operating Status: NOT AUTHORIZED Out of Service Date: None
requires something like this:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# test_pyparsing2.py
#
# Copyright 2019 John Coppens <john@jcoppens.com>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA.
#
#
import pyparsing as pp


def create_parser():
    opstatus = pp.Keyword("Operating Status:")
    # Combine wraps only the optional "NOT", so a missing "NOT" shows up as ''
    auth = pp.Combine(pp.Optional(pp.Keyword("NOT"))) + pp.Keyword("AUTHORIZED")
    status = pp.Keyword("Out of Service Date:")
    date = pp.Keyword("None")
    part1 = pp.Group(opstatus + auth)
    part2 = pp.Group(status + date)
    return part1 + part2


def main(args):
    parser = create_parser()
    msg = "Operating Status: NOT AUTHORIZED Out of Service Date: None"
    print(parser.parseString(msg))
    msg = "Operating Status: AUTHORIZED Out of Service Date: None"
    print(parser.parseString(msg))
    return 0


if __name__ == '__main__':
    import sys
    sys.exit(main(sys.argv))
Running the program prints:
[['Operating Status:', 'NOT', 'AUTHORIZED'], ['Out of Service Date:', 'None']]
[['Operating Status:', '', 'AUTHORIZED'], ['Out of Service Date:', 'None']]
With Combine and Group you can change how the output is organized.