Selenium/BeautifulSoup - Python - Looping through multiple pages

Python

富國滬深 2021-09-28 15:25:39

I have spent most of the day researching and testing the best way to loop through a set of products on a retailer's website. I can collect a set of products (and their attributes) from the first page, but I am stuck on how to move through the site's remaining pages and keep scraping. As the code below shows, I tried using a "while" loop and Selenium to click the site's "next page" button and then continue collecting products. The problem is that my code still never gets past page 1. Am I making a silly mistake here? I have read 4 or 5 similar examples on this site, but none were specific enough to give me a solution.

    import re

    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()
    driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

    # Start each run with empty result lists
    products = []
    hyperlinks = []
    reviewCounts = []
    starRatings = []

    pageCounter = 0

    # Parse the page before reading the total page count from it
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')

    while (pageCounter < maxPageCount):
        for product in prod_containers:
            # If the product has review count, then extract:
            if product.find('span', class_ = 'prod_ratingCount') is not None:
                # The product name
                name = product.find('div', class_ = 'prod_nameBlock')
                name = re.sub(r"\s+", " ", name.text)
                products.append(name)

                # The product hyperlink
                hyperlink = product.find('span', class_ = 'prod_ratingCount')
                hyperlink = hyperlink.a
                hyperlink = hyperlink.get('href')
                hyperlinks.append(hyperlink)

                # The product review count
                reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
                reviewCounts.append(reviewCount)

                # The product overall star ratings
                starRating = product.find('span', class_ = 'prod_ratingCount')
                starRating = starRating.a
                starRating = starRating.get('alt')
                starRatings.append(starRating)
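One thing that may be relevant: in the code as posted, the page is parsed once, before the "while" loop starts, so every pass would iterate over the same first-page containers. Below is a minimal sketch of a pagination loop that re-reads the page on every pass. It is only an illustration under stated assumptions, not a fix verified against the site: `scrape_all_pages`, `FakeBrowser`, and the three callables are hypothetical names, and the in-memory "pages" stand in for a real browser so the example runs without Selenium.

```python
from typing import Callable, List


def scrape_all_pages(
    get_page_source: Callable[[], str],
    parse_products: Callable[[str], List[str]],
    go_to_next_page: Callable[[], bool],
    max_pages: int,
) -> List[str]:
    """Collect items from up to max_pages pages, re-parsing after each page change."""
    collected: List[str] = []
    for page_index in range(max_pages):
        # Read and parse inside the loop, so each pass sees the page that is
        # currently loaded rather than a stale copy of page 1.
        collected.extend(parse_products(get_page_source()))
        is_last_allowed_page = page_index == max_pages - 1
        if is_last_allowed_page or not go_to_next_page():
            break
    return collected


class FakeBrowser:
    """Hypothetical in-memory stand-in for a Selenium driver (illustration only)."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages
        self.current = 0

    def page_source(self) -> str:
        return self.pages[self.current]

    def click_next(self) -> bool:
        if self.current + 1 >= len(self.pages):
            return False  # no further page to move to
        self.current += 1
        return True


browser = FakeBrowser(["shirt A,shirt B", "shirt C", "shirt D,shirt E"])
all_products = scrape_all_pages(
    get_page_source=browser.page_source,
    parse_products=lambda source: source.split(","),
    go_to_next_page=browser.click_next,
    max_pages=3,
)
print(all_products)
```

With the real site, `get_page_source` could be `lambda: driver.page_source`, and `parse_products` could wrap the `BeautifulSoup(..., 'html.parser')` and `find_all('li', class_ = 'products_grid')` calls from the post. `go_to_next_page` would click the "next page" button and wait for the new results to load. The selector for that button does not appear in the post, so it is left out here.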
