
Extracting a variable from a JavaScript function using BeautifulSoup, Python, and regex

Python

嚕嚕噠 2022-06-22 16:22:15

An array named images is defined inside a JavaScript function. I need to extract it from the page source as a string and convert it into a Python list object. BeautifulSoup is used for parsing.

var images = [
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
;

Question: why does my code below fail to capture the images array, and how can it be fixed? Thanks!

Desired output: a Python list object.

[ { src: "http://example.com/bar/001.jpg", title: "FooBar One" }, { src: "http://example.com/bar/002.jpg", title: "FooBar Two" }, ]

Actual code

import re
from bs4 import BeautifulSoup

# Example of a HTML source code containing `images` array
html = '''
<html>
<head>
<script type="text/javascript">
$(document).ready(function(){
var images = [
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
;
var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''

pattern = re.compile('var images = (.*?);')
soup = BeautifulSoup(html, 'lxml')
scripts = soup.find_all('script')  # successfully captures the <script> element
for script in scripts:
    data = pattern.match(str(script.string))  # NOT extracting the array!!
    if data:
        print('Found:', data.groups()[0])  # NOT being printed


4 Answers

白衣非少年

Contributed 1155 experience points · 0+ likes

You can use a shorter, lazy regular expression together with the hjson library, which handles the unquoted keys:

import re
import hjson  # third-party package

html = '''
<html>
<head>
<script type="text/javascript">
$(document).ready(function(){
var images = [
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
;
var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
</script>
'''

# Lazy match up to the first ";"; re.DOTALL lets "." span newlines
p = re.compile(r'var images = (.*?);', re.DOTALL)
# hjson parses the unquoted keys that json.loads would reject
data = hjson.loads(p.findall(html)[0])
print(data)
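To see why both the lazy quantifier and re.DOTALL matter here, the following is a minimal sketch using only the standard library. The string js is a made-up, shortened stand-in for the script text, not the page from the question.

```python
import re

# Hypothetical sample: a multi-line JavaScript array followed by a second statement.
js = '''var images = [
    { src: "a.jpg", title: "A" },
];
var other_data = [{"name": "Tom"}];'''

# Without re.DOTALL, "." stops at a newline, so the pattern never reaches the ";".
no_dotall = re.search(r'var images = (.*?);', js)

# With re.DOTALL, "." also matches newlines; the lazy ".*?" stops at the first ";".
lazy = re.search(r'var images = (.*?);', js, re.DOTALL)

# A greedy ".*" runs on to the last ";" and swallows the next statement as well.
greedy = re.search(r'var images = (.*);', js, re.DOTALL)

print(no_dotall)        # None
print(lazy.group(1))    # only the images array
print(greedy.group(1))  # the images array plus the other_data statement
```

One caveat: the lazy match stops at the first ";" it meets, so it would cut the array short if a ";" appeared inside one of the string values.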

Answered 2022-06-22

桃花長相依

Contributed 1860 experience points · 8+ likes

Method 1

Perhaps

\bvar\s+images\s*=\s*(\[[^\]]*\])

would work to some extent:

Test

import re
from bs4 import BeautifulSoup

# Example of a HTML source code containing `images` array
html = '''
<html>
<head>
<script type="text/javascript">
$(document).ready(function(){
var images = [
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
;
var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script')  # successfully captures the <script> element

for script in scripts:
    # script.string is None for an empty or external <script>, so fall back to ''
    data = re.findall(
        r'\bvar\s+images\s*=\s*(\[[^\]]*\])', script.string or '', re.DOTALL)
    if data:
        print(data[0])

Output

[
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]

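If you wish to simplify, modify, or explore the expression, regex101.com explains it in its top-right panel. You can also watch there how it matches against some sample inputs.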
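For readers without the regex101 panel at hand, here is the same expression written with re.VERBOSE so each part carries its own comment. This is only an annotated sketch; the name IMAGES_RE and the one-line sample are made up for illustration.

```python
import re

# The Method 1 pattern, spelled out piece by piece.
IMAGES_RE = re.compile(r'''
    \b var \s+ images   # the keyword var, whitespace, then the name images
    \s* = \s*           # an equals sign with optional whitespace around it
    (                   # capture group 1: the array literal
        \[              #   opening bracket
        [^\]]*          #   anything that is not a closing bracket, newlines included
        \]              #   closing bracket
    )
''', re.VERBOSE)

# Hypothetical one-line sample in the same shape as the question's script.
sample = 'var images = [ { src: "x.jpg", title: "X" }, ] ; var other = [1, 2];'
match = IMAGES_RE.search(sample)
print(match.group(1))  # [ { src: "x.jpg", title: "X" }, ]
```

Because the character class stops at the first closing bracket, this pattern would truncate the result if the array contained a nested array or a "]" inside a string.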

Method 2

Another option would be:

import re

string = '''
<html>
<head>
<script type="text/javascript">
$(document).ready(function(){
var images = [
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
;
var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''

# Capture each src/title pair; this relies on src coming directly before title
expression = r'src:\s*"([^"]*)"\s*,\s*title:\s*"([^"]*)"'
matches = re.findall(expression, string, re.DOTALL)

output = []
for match in matches:
    output.append({"src": match[0], "title": match[1]})

print(output)

Output

[{'src': 'http://example.com/bar/001.jpg', 'title': 'FooBar One'}, {'src': 'http://example.com/bar/002.jpg', 'title': 'FooBar Two'}]

Answered 2022-06-22

慕容708150

Contributed 1831 experience points · 4+ likes

Here is a way to get there without regex or even BeautifulSoup, using only plain Python string manipulation, in 4 simple steps :)

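# html is the page source string from the question
step_1 = html.split('var images = [')                      # keep what follows the opening bracket
step_2 = " ".join(step_1[1].split())                       # collapse all whitespace to single spaces
step_3 = step_2.split('] ; var other_data = ')             # cut at the end of the array
step_4 = step_3[0].replace('}, {', '}xxx{').split('xxx')   # split into one string per object
print(step_4)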

Output:

['{ src: "http://example.com/bar/001.jpg", title: "FooBar One" }',

'{ src: "http://example.com/bar/002.jpg", title: "FooBar Two" }, ']
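The fragments above are still strings, and the last one carries a trailing ", ". One possible follow-up, sketched under the assumption that every value is a double-quoted string with no escaped quotes inside it, is to pull the key/value pairs out of each fragment. The helper fragment_to_dict and the name PAIR_RE are hypothetical and not part of the original answer.

```python
import re

# The fragments as printed above.
fragments = [
    '{ src: "http://example.com/bar/001.jpg", title: "FooBar One" }',
    '{ src: "http://example.com/bar/002.jpg", title: "FooBar Two" }, ',
]

# Assumption: each pair looks like  key: "value"  and values hold no escaped quotes.
PAIR_RE = re.compile(r'(\w+)\s*:\s*"([^"]*)"')

def fragment_to_dict(fragment):
    """Collect every key: "value" pair found in one fragment into a dict."""
    return dict(PAIR_RE.findall(fragment))

images = [fragment_to_dict(f) for f in fragments]
print(images)
```

The trailing comma needs no special handling here, since the pattern only picks up the pairs and ignores everything between them.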

Answered 2022-06-22

RISEBY

Contributed 1856 experience points · 5+ likes

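re.match only matches at the beginning of the string, so your regex has to account for the whole string, including everything before var images. Use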

pattern = re.compile('.*var images = (.*?);.*', re.DOTALL)
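A quick way to see the difference is to run the asker's original pattern through both match() and search(). This is a minimal sketch; script_text is a made-up, shortened stand-in for the text of the script element.

```python
import re

# Hypothetical stand-in for the text of the <script> element.
script_text = '''
$(document).ready(function(){
    var images = [ { src: "a.jpg", title: "A" }, ] ;
'''

pattern = re.compile(r'var images = (.*?);', re.DOTALL)

# match() only tries position 0, where the text starts with a newline, so it fails.
print(pattern.match(script_text))  # None

# search() scans forward until the pattern fits, so no leading ".*" is needed.
found = pattern.search(script_text)
print(found.group(1).strip())  # [ { src: "a.jpg", title: "A" }, ]
```

So an alternative to padding the pattern with ".*" on both sides is to keep the original pattern, add re.DOTALL, and call search() instead of match().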

The captured string is still not a valid Python list literal. You have to manipulate it a little before you can apply ast.literal_eval:

import ast

for script in scripts:
    data = pattern.match(str(script.string))
    if data:
        list_str = data.groups()[0]
        # Remove last comma
        last_comma_index = list_str.rfind(',')
        list_str = list_str[:last_comma_index] + list_str[last_comma_index+1:]
        # Quote the bare keys: src -> "src", title -> "title"
        list_str = re.sub(r'\s([a-z]+):', r'"\1":', list_str)
        # Strip surrounding whitespace before evaluating
        list_str = list_str.strip()
        final_list = ast.literal_eval(list_str)
        print(final_list)

Output

[{'src': 'http://example.com/bar/001.jpg', 'title': 'FooBar One'}, {'src': 'http://example.com/bar/002.jpg', 'title': 'FooBar Two'}]
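The same idea can be wrapped in a small reusable function that ends in json.loads rather than ast.literal_eval. This is only a sketch: extract_js_array is a hypothetical helper, and the two substitutions are heuristics that assume bare keys appear right after "{" or "," and that string values never contain text shaped like ", word:".

```python
import json
import re

def extract_js_array(text, name):
    """Return the array assigned to `var <name>` as a Python list, or None if absent."""
    found = re.search(
        r'\bvar\s+' + re.escape(name) + r'\s*=\s*(\[.*?\])\s*;', text, re.DOTALL)
    if found is None:
        return None
    literal = found.group(1)
    # Quote bare keys: a word right after "{" or "," and right before ":".
    literal = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', literal)
    # Drop trailing commas that sit before a closing "]" or "}".
    literal = re.sub(r',(\s*[\]}])', r'\1', literal)
    return json.loads(literal)

# Shortened copy of the script text from the question.
sample = '''
var images = [
    { src: "http://example.com/bar/001.jpg", title: "FooBar One" },
    { src: "http://example.com/bar/002.jpg", title: "FooBar Two" },
] ;
var other_data = [{"name": "Tom", "type": "cat"}];
'''

print(extract_js_array(sample, 'images'))
print(extract_js_array(sample, 'other_data'))
```

For anything beyond simple literals like these, a real JavaScript or Hjson parser is likely the safer choice than regex substitutions.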

Answered 2022-06-22
