

How do I get the first 3 sentences of a web page with Python?

互換的青春 2023-10-25 10:48:03
For an assignment, one of the tasks is to find the first 3 sentences of a web page and display them. Getting the page text is easy, but I'm having trouble figuring out how to extract the first 3 sentences.

import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script'
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

tempout = output.split('.')
for i in range(tempout):  # TypeError here: range() needs an int, not a list
    if i >= 3:
        tempout.remove(i)
output = '.'.join(tempout)
print(output)

3 Answers

青春有我

Contributed 1784 posts · earned 8+ likes

Finding sentences in free text is hard. Typically you look for characters that can end a sentence, such as '.' and '!'. But a period ('.') can also appear mid-sentence, for example in the abbreviation of a person's name. I use a regular expression that looks for a period followed by either a single space or the end of the string; that works for the first three sentences here, but not for arbitrary sentences in general.


import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

paragraphs = soup.select('section.article_text p')
sentences = []
for paragraph in paragraphs:
    matches = re.findall(r'(.+?[.!])(?: |$)', paragraph.text)
    needed = 3 - len(sentences)
    found = len(matches)
    n = min(found, needed)
    for i in range(n):
        sentences.append(matches[i])
    if len(sentences) == 3:
        break

print(sentences)

Prints:


['Many people will land on this page after learning that their email address has appeared in a data breach I\'ve called "Collection #1".', "Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.", "Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of."]
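The sentence-splitting regex can be sanity-checked offline on a short made-up string (the sample text below is invented for illustration, not taken from the page):

```python
import re

# Same pattern as above: a lazy match ending in '.' or '!',
# followed by a single space or the end of the string.
pattern = r'(.+?[.!])(?: |$)'
text = "First sentence. Second one! Third here. Fourth trails."

matches = re.findall(pattern, text)
print(matches[:3])  # keep only the first three sentences
```

As noted above, this would still split on an abbreviation period followed by a space, so it is only reliable for simple prose.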



忽然笑

Contributed 1806 posts · earned 5+ likes

Actually, with Beautiful Soup you can filter by the class "article_text post" — take a look at the page source:

myData=soup.find('section',class_ = "article_text post")
print(myData.p.text)

and get the inner text of the p element.

Add this after soup = BeautifulSoup(html_page, 'html.parser').


侃侃無極

Contributed 2051 posts · earned 10+ likes

To grab the first three sentences, just add these lines to your code:


section = soup.find('section', class_="article_text post")  # the section tag with class "article_text post"
txt = section.p.text  # text of the first p tag inside that section
print(txt)

Output:


Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.
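If the first paragraph holds fewer than three sentences, the regex idea from the first answer can split txt further (a sketch; the txt string below is a shortened stand-in, not the real page text):

```python
import re

# Stand-in for the paragraph text; the real value comes from section.p.text.
txt = ("Many people will land on this page. "
       "Most of them won't have a tech background. "
       "Let's start with the raw numbers.")

# Split on '.' or '!' followed by whitespace or the end of the text.
sentences = re.findall(r'(.+?[.!])(?:\s|$)', txt)
print(sentences[:3])
```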

Hope this helps!

