4 Answers

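Contributor: 1155 experience points, over 0 upvotes
You can use a shorter, lazy regular expression together with the hjson library, which handles the unquoted keys: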
import re, hjson
html = '''
<html>
<head>
<script type="text/javascript">
$(document).ready(function(){
var images = [
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
;
var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
</script>
'''
p = re.compile(r'var images = (.*?);', re.DOTALL)
data = hjson.loads(p.findall(html)[0])
print(data)
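To see why the lazy `.*?` matters in the pattern above, here is a minimal sketch comparing a greedy and a lazy capture. It uses only the standard `re` module; the shortened `snippet` string and the variable names are made up for illustration and are not part of the original answer.

```python
import re

# Shortened stand-in for the script body; not the full page from the answer.
snippet = 'var images = [{src: "a.jpg"},]; var other_data = [{"name": "Tom"}];'

# Greedy: .* runs to the LAST semicolon, so other_data is swallowed too.
greedy = re.search(r'var images = (.*);', snippet, re.DOTALL).group(1)

# Lazy: .*? stops at the FIRST semicolon after the array.
lazy = re.search(r'var images = (.*?);', snippet, re.DOTALL).group(1)

print(greedy)
print(lazy)
```

Note that even the lazy version would stop too early if a string value inside the array contained a semicolon.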

Contributor: 1860 experience points, over 8 upvotes
Method 1
Perhaps,
\bvar\s+images\s*=\s*(\[[^\]]*\])
might work to some extent:
Test
import re
from bs4 import BeautifulSoup
# Example of a HTML source code containing `images` array
html = '''
<html>
<head>
<script type="text/javascript">
$(document).ready(function(){
var images = [
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
;
var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script') # successfully captures the <script> element
for script in scripts:
    data = re.findall(
        r'\bvar\s+images\s*=\s*(\[[^\]]*\])', script.string, re.DOTALL)
    print(data[0])
Output
[
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
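If you want to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you like, you can also watch in this link how it matches against some sample inputs.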
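As a rough offline alternative to the regex101 explanation, the same expression can be written with `re.VERBOSE` so that each part carries a comment. This is only a sketch: the name `IMAGES_ARRAY` and the two sample strings are made up for illustration.

```python
import re

# The Method 1 expression, spelled out with re.VERBOSE so each part can be commented.
IMAGES_ARRAY = re.compile(r'''
    \b var \s+ images   # the keyword var, whitespace, then the name images
    \s* = \s*           # the assignment, with optional whitespace around it
    (                   # group 1: the array literal
        \[              #   opening bracket
        [^\]]*          #   anything that is not a closing bracket
        \]              #   the first closing bracket
    )
''', re.VERBOSE)

# Flat array: the whole literal is captured.
flat = 'var images = [ {src: "a.jpg"}, ];'
flat_match = IMAGES_ARRAY.search(flat).group(1)
print(flat_match)

# Nested array: the capture ends at the first closing bracket, so it is cut short.
nested = 'var images = [ {tags: ["x", "y"]} ];'
nested_match = IMAGES_ARRAY.search(nested).group(1)
print(nested_match)
```

The second case shows the main limitation of `[^\]]*`: it works for the flat array in the question but truncates the capture as soon as an inner `]` appears.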
Method 2
Another option is:
import re
string = '''
<html>
<head>
<script type="text/javascript">
$(document).ready(function(){
var images = [
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
;
var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''
expression = r'src:\s*"([^"]*)"\s*,\s*title:\s*"([^"]*)"'
matches = re.findall(expression, string, re.DOTALL)
output = []
for match in matches:
    output.append({"src": match[0], "title": match[1]})
print(output)
Output
[{'src': 'http://example.com/bar/001.jpg', 'title': 'FooBar One'}, {'src': 'http://example.com/bar/002.jpg', 'title': 'FooBar Two'}]
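The Method 2 expression assumes that `src` always comes before `title`. If that order is not guaranteed, one possible variation (a sketch, not part of the original answer) is to pull out each flat `{...}` object first and then collect its `key: "value"` pairs. The sample string `js` below is made up, with the second object deliberately reordered.

```python
import re

# Made-up sample in which the second object lists title before src.
js = '''
var images = [
{ src: "http://example.com/bar/001.jpg", title: "FooBar One" },
{ title: "FooBar Two", src: "http://example.com/bar/002.jpg" },
]
;
'''

output = []
# Assumption: objects are flat, so each one is a brace pair with no nested braces.
for body in re.findall(r'\{([^{}]*)\}', js):
    # Collect every key: "value" pair inside this object.
    pairs = re.findall(r'(\w+)\s*:\s*"([^"]*)"', body)
    output.append(dict(pairs))

print(output)
```

This still relies on double-quoted string values without embedded quotes, so it is no more general than the original expression in that respect.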

Contributor: 1831 experience points, over 4 upvotes
Here is one way to get there without regular expressions or even BeautifulSoup, using only plain Python string manipulation in 4 simple steps :)
step_1 = html.split('var images = [')
step_2 = " ".join(step_1[1].split())
step_3 = step_2.split('] ; var other_data = ')
step_4 = step_3[0].replace('}, {','}xxx{').split('xxx')
print(step_4)
Output:
['{ src: "http://example.com/bar/001.jpg", title: "FooBar One" }',
'{ src: "http://example.com/bar/002.jpg", title: "FooBar Two" }, ']
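The result above is still a list of strings. If dictionaries are wanted, the same string-only approach can be taken one step further. The helper `fragment_to_dict` below is hypothetical (it is not in the original answer) and assumes that values never contain `', '` or `': '`.

```python
# Fragments copied from the output of step_4 above.
fragments = [
    '{ src: "http://example.com/bar/001.jpg", title: "FooBar One" }',
    '{ src: "http://example.com/bar/002.jpg", title: "FooBar Two" }, ',
]

def fragment_to_dict(fragment):
    # Hypothetical helper: drop the trailing comma/space, then the braces.
    inner = fragment.strip(' ,').strip('{}').strip()
    result = {}
    for pair in inner.split(', '):
        # Split at the first ': ' only; the '://' in the URL has no space after the colon.
        key, _, value = pair.partition(': ')
        result[key] = value.strip('"')
    return result

images = [fragment_to_dict(f) for f in fragments]
print(images)
```

Like the four steps themselves, this depends heavily on the exact layout of the page, so it is best suited to one-off scripts.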

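Contributor: 1856 experience points, over 5 upvotes
re.match matches from the beginning of the string, so your regular expression has to cover the whole string. Use: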
pattern = re.compile('.*var images = (.*?);.*', re.DOTALL)
The captured string is still not in valid Python list format. You have to do some manipulation before you can apply ast.literal_eval:
import ast

for script in scripts:
    data = pattern.match(str(script.string))
    if data:
        list_str = data.group(1)
        # Remove the trailing comma after the last object
        last_comma_index = list_str.rfind(',')
        list_str = list_str[:last_comma_index] + list_str[last_comma_index+1:]
        # Quote the bare keys: src -> "src", title -> "title"
        list_str = re.sub(r'\s([a-z]+):', r'"\1":', list_str)
        # Strip surrounding whitespace before evaluating
        final_list = ast.literal_eval(list_str.strip())
        print(final_list)
Output
[{'src': 'http://example.com/bar/001.jpg', 'title': 'FooBar One'}, {'src': 'http://example.com/bar/002.jpg', 'title': 'FooBar Two'}]
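The point about re.match anchoring at the start of the string can be checked with a small sketch. The `text` string below is made up; the comparison with `re.search` is offered as an alternative to wrapping the pattern in `.*`, not as something the original answer proposes.

```python
import re

# Made-up one-line stand-in for the script body.
text = 'function(){ var images = [1, 2, 3]; var other = [];'

pattern = r'var images = (.*?);'

# re.match anchors at position 0, so the bare pattern finds nothing here.
match_result = re.match(pattern, text)
print(match_result)  # None

# re.search scans forward, so no leading or trailing .* is needed.
search_result = re.search(pattern, text).group(1)
print(search_result)
```

For an array that spans several lines, as in the question, `re.DOTALL` would still be needed so that `.` also matches newlines.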