
Remove HTML tags and parse a JSON array into a key/value object

Python

炎炎設(shè)計 2023-10-31 15:22:59

I am working with a JSON array payload that I want to extract into a separate object for downstream processing. The payload is dynamic and can have several levels of nesting inside the JSON array, but the first level always has an id field that serves as a unique identifier.

[{'field1': [], 'field2': {'field2_1': 'string', 'field2_2': 'string', 'field2_3': 'string'}, 'field3': '<html> strings <html>', 'id': 1},
 {'field1': [], 'field2': {'field2_1': 'string', 'field2_2': 'string', 'field2_3': 'string'}, 'field3': '<html> strings <html>', 'id': 2},
 {'field1': [], 'field2': {'field2_1': 'string', 'field2_2': 'string', 'field2_3': 'string'}, 'field3': '<html> strings <html>', 'id': 3},
 {'field1': [], 'field2': {'field2_1': 'string', 'field2_2': 'string', 'field2_3': 'string'}, 'field3': '<html> strings <html>', 'id': 4}]

The payload is not limited to this structure: it may contain more fields, or more nested fields holding different types of data. The id field, however, is always present on every object in the payload.

I want to create a dictionary (I am open to other suggestions for the data type) keyed by that id, with everything else in the object as a cleaned string, without any brackets, HTML tags, and so on. The output should look something like this (depending on the data type):

{1: string string string strings,
 2: string string string strings,
 3: string string string strings,
 4: string string string strings}

This is a very generic example. I am having trouble navigating the JSON array with all its nesting and content, and just want to extract the id and the rest of the content in a clean way. Any help is appreciated!


1 Answer

白板的微信


You can use beautifulsoup to strip the tags from the strings. For example:

from bs4 import BeautifulSoup

# Sample payload: every top-level object carries a unique 'id'.
lst = [{'field1': [],
        'field2': {'field2_1': 'string1',
                   'field2_2': 'string2',
                   'field2_3': 'string3'},
        'field3': '<html> strings4 <html>',
        'id': 1},
       {'field1': [],
        'field2': {'field2_1': 'string1',
                   'field2_2': 'string2',
                   'field2_3': 'string3'},
        'field3': '<html> strings4 <html>',
        'id': 2},
       {'field1': [],
        'field2': {'field2_1': 'string1',
                   'field2_2': 'string2',
                   'field2_3': 'string3'},
        'field3': '<html> strings4 <html>',
        'id': 3},
       {'field1': [],
        'field2': {'field2_1': 'string1',
                   'field2_2': 'string2',
                   'field2_3': 'string3'},
        'field3': '<html> strings4 <html>',
        'id': 4}]


def flatten(d):
    """Recursively yield every string leaf of a nested dict/list."""
    if isinstance(d, dict):
        for v in d.values():
            yield from flatten(v)
    elif isinstance(d, list):
        for v in d:
            yield from flatten(v)
    elif isinstance(d, str):
        # Non-string leaves (such as the integer id) are skipped.
        yield d


out = {}
for d in lst:
    # Join all string leaves, let BeautifulSoup parse away the tags,
    # then trim and re-join the remaining text nodes.
    soup = BeautifulSoup(' '.join(flatten(d)), 'html.parser')
    # string=True is the current name of the older text=True argument.
    out[d['id']] = ' '.join(map(str.strip, soup.find_all(string=True)))

print(out)

Prints:

{1: 'string1 string2 string3 strings4', 2: 'string1 string2 string3 strings4', 3: 'string1 string2 string3 strings4', 4: 'string1 string2 string3 strings4'}
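
The answer above depends on the third-party bs4 package and only keeps string leaves. As a rough sketch of the same flatten-then-strip idea using only the standard library, the version below assumes the payload arrives as a JSON string and that numeric leaves should also appear in the output. The names TextExtractor, strip_tags and payload, and the sample data itself, are illustrative assumptions and not part of the original answer.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(text):
    """Return text with HTML tags removed and runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ' '.join(''.join(parser.parts).split())


def flatten(value):
    """Yield every non-null scalar leaf of a nested dict/list as a string."""
    if isinstance(value, dict):
        for v in value.values():
            yield from flatten(v)
    elif isinstance(value, list):
        for v in value:
            yield from flatten(v)
    elif value is not None:
        yield str(value)


# Assumed input: a JSON string shaped roughly like the payload in the question.
payload = '''[
  {"field1": [], "field2": {"a": "string1", "n": 42},
   "field3": "<p>strings <b>4</b></p>", "id": 1},
  {"field1": ["x"], "field2": {}, "field3": "<html> y <html>", "id": 2}
]'''

out = {}
for obj in json.loads(payload):
    # Leave the id out of the text so it only appears as the key.
    rest = {k: v for k, v in obj.items() if k != 'id'}
    out[obj['id']] = strip_tags(' '.join(flatten(rest)))

print(out)
```

With this sample payload the script prints {1: 'string1 42 strings 4', 2: 'x y'}. Dropping the id before flattening matters here because numeric leaves are kept; in the bs4 version the id is skipped implicitly, since it is not a string.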

2023-10-31