
How to make concurrent GET requests from a pool of URLs

Go

茅侃侃 2022-07-11 15:52:17

I've finished the recommended tour and watched some tutorials and gopher conference talks on YouTube. That's pretty much it. I have a project that requires me to send GET requests and store the results in files, but there are around 80 million URLs. I'm testing with only 1000 URLs. The problem: although I followed some guidelines, I don't think I managed to make it concurrent. I don't know what's wrong. Maybe I'm mistaken and it is concurrent, but it doesn't seem fast to me; the speed feels like sequential requests. Here is the code I wrote:

package main

import (
	"bufio"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"sync"
	"time"
)

var wg sync.WaitGroup // synchronization to wait for all the goroutines

func crawler(urlChannel <-chan string) {
	defer wg.Done()
	client := &http.Client{Timeout: 10 * time.Second} // single client is sufficient for multiple requests

	for urlItem := range urlChannel {
		req1, _ := http.NewRequest("GET", "http://"+urlItem, nil) // generating the request
		req1.Header.Add("User-agent", "Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/74.0") // changing user-agent
		resp1, respErr1 := client.Do(req1) // sending the prepared request and getting the response
		if respErr1 != nil {
			continue
		}

		defer resp1.Body.Close()

		if resp1.StatusCode/100 == 2 { // means server responded with 2xx code
			text1, readErr1 := ioutil.ReadAll(resp1.Body) // try to read the sourcecode of the website
			if readErr1 != nil {
				log.Fatal(readErr1)
			}

			f1, fileErr1 := os.Create("200/" + urlItem + ".txt") // creating the relative file
			if fileErr1 != nil {
				log.Fatal(fileErr1)
			}
			defer f1.Close()

			_, writeErr1 := f1.Write(text1) // writing the sourcecode into our file
			if writeErr1 != nil {
				log.Fatal(writeErr1)
			}
		}
	}
}

My question is: why doesn't this code run concurrently? How can I fix the problem described above? Am I doing something wrong when making concurrent GET requests?


2 Answers

湖上湖


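Here's some code to get you thinking. I put the URLs in the code so it is self-contained, but in practice you'd probably pipe them to standard input. There are a few things I do here that I think are improvements, or at least worth considering.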

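Before we get started, I'll point out that I put the complete URL in the input stream. For one thing, this lets me support both http and https. I don't really see the logic in hard-coding the scheme in the code rather than leaving it in the data.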

First, it can handle response bodies of arbitrary size (your version reads the body into memory, so a handful of concurrent large responses can fill up memory). I do this with io.Copy().

[Edit]

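text1, readErr1 := ioutil.ReadAll(resp1.Body) reads the entire HTTP body. If the body is large, it takes up a lot of memory. io.Copy(f1, resp1.Body) instead copies the data from the HTTP response body directly to the file, without holding the whole thing in memory. It may complete in one Read/Write or in many.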

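http.Response.Body is an io.ReadCloser because the HTTP protocol expects the body to be read progressively. The http.Response does not have the complete body until it has been read, which is why it is not just a []byte. Writing the data to the filesystem progressively as it "streams" in from the TCP socket means a finite amount of system resources can download an unlimited amount of data.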

But there's even more benefit. io.Copy will call ReadFrom() on the file. If you look at the Linux implementation (for example) at https://golang.org/src/os/readfrom_linux.go and dig a bit, you'll see it actually uses copy_file_range. That system call is cool because:

The copy_file_range() system call performs an in-kernel copy between two file descriptors without the additional cost of transferring data from the kernel to user space and then back into the kernel.

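*os.File knows how to ask the kernel to move data between file descriptors without your program even having to touch it. (Strictly speaking, that fast path applies when the source is itself a file. An HTTP response body is wrapped by net/http, so in this program io.Copy most likely falls back to an ordinary buffered copy. The bounded-memory benefit holds either way.)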

See https://golang.org/pkg/io/#Copy.
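To make the streaming point concrete, here is a minimal sketch of saving a response body with io.Copy. The helper names saveBody and demo are hypothetical, and a local httptest server stands in for a real site, so treat this as an illustration under those assumptions and not as the answer's exact method.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// saveBody (hypothetical helper) streams the body of rawURL into destPath
// and returns the number of bytes written.
func saveBody(client *http.Client, rawURL, destPath string) (int64, error) {
	res, err := client.Get(rawURL)
	if err != nil {
		return 0, fmt.Errorf("get %s: %w", rawURL, err)
	}
	defer res.Body.Close()

	f, err := os.Create(destPath)
	if err != nil {
		return 0, fmt.Errorf("create %s: %w", destPath, err)
	}
	// io.Copy moves the data in chunks, so memory use stays bounded
	// no matter how large the body is.
	n, copyErr := io.Copy(f, res.Body)
	closeErr := f.Close()
	if copyErr != nil {
		return n, fmt.Errorf("copy %s to %s: %w", rawURL, destPath, copyErr)
	}
	if closeErr != nil {
		return n, fmt.Errorf("close %s: %w", destPath, closeErr)
	}
	return n, nil
}

// demo serves a 1 MiB body from a local test server (an assumption made
// so the example runs offline) and saves it to a temporary directory.
func demo() (int64, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, strings.Repeat("x", 1<<20))
	}))
	defer srv.Close()

	dir, err := os.MkdirTemp("", "crawl-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	client := &http.Client{Timeout: 10 * time.Second}
	return saveBody(client, srv.URL, filepath.Join(dir, "body.txt"))
}

func main() {
	n, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("bytes written:", n)
}
```

Replacing io.Copy with a ReadAll followed by Write would produce the same file, but each in-flight request would then hold its whole body in memory.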

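Second, I make sure to use all of the URL components in the filename. URLs with different query strings go to different files. The fragment probably doesn't distinguish response bodies, so including it in the path may be ill-considered. There is no great heuristic for turning URLs into valid file paths. If this were a serious task, I'd probably store the data in files named after a shasum of the URL or something similar, and create an index of results stored in a metadata file.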
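The shasum idea can be sketched as follows. The function name hashedPath, the two-character prefix directory, the ".body" suffix, and the in-memory map used as an index are all assumptions chosen for illustration; a real crawler would persist the index to a metadata file.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// hashedPath (hypothetical helper) maps a URL to a file path under root,
// named after the SHA-256 digest of the full URL string.
func hashedPath(root, rawURL string) string {
	sum := sha256.Sum256([]byte(rawURL))
	name := hex.EncodeToString(sum[:])
	// A two-character prefix directory keeps any single directory
	// from accumulating millions of entries.
	return filepath.Join(root, name[:2], name+".body")
}

func main() {
	// index records which file holds which URL; here it only lives in memory.
	index := map[string]string{}
	for _, u := range []string{
		"https://example.com/?page=1",
		"https://example.com/?page=2",
	} {
		index[u] = hashedPath("data", u)
	}
	for u, p := range index {
		fmt.Printf("%s -> %s\n", u, p)
	}
}
```

Because the digest covers the whole URL string, two URLs that differ only in their query string land in different files, and the resulting names never contain characters that are invalid in a path.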

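Third, I handle all errors. req1, _ := http.NewRequest(... may look like a convenient shortcut, but what it really means is that, at best, you won't know the real cause of any error. I usually add some descriptive text to errors as they percolate up, so I can easily tell which error I'm returning.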
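As a small sketch of adding descriptive text while keeping the underlying cause, the standard fmt.Errorf with the %w verb wraps an error so that errors.As can still find it. The helper names newGet and isURLError are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
)

// newGet (hypothetical helper) builds a GET request and, on failure,
// wraps the error with context instead of discarding it.
func newGet(rawURL string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, fmt.Errorf("building GET request for %q: %w", rawURL, err)
	}
	return req, nil
}

// isURLError reports whether err wraps a *url.Error anywhere in its chain.
func isURLError(err error) bool {
	var uerr *url.Error
	return errors.As(err, &uerr)
}

func main() {
	// A URL with no scheme cannot be parsed, so NewRequest returns an error.
	if _, err := newGet("://no-scheme"); err != nil {
		fmt.Println(err)
		fmt.Println("caused by a URL parse error:", isURLError(err))
	}
}
```

The wrapped message states which operation failed and for which input, while the original error remains available to callers that want to inspect it.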

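Finally, I return the number of successfully processed URLs so I can see the final result. When scanning millions of URLs you'd probably also want a list of the ones that failed, but a count of successes is a good start at sending final data back for a summary.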
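One possible way to report failures as well as a success count is to send a small result value per URL back over a channel. This is a sketch under assumptions: the result type, the crawl function, and the injected fetch callback are hypothetical, and fetch stands in for the real HTTP work so the example runs offline.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// result (hypothetical type) pairs a URL with the outcome of fetching it.
type result struct {
	url string
	err error
}

// errFetch is a placeholder failure used by the demo fetch function.
var errFetch = errors.New("fetch failed")

// crawl runs fetch on every URL using a fixed number of workers and
// returns the success count plus the list of URLs that failed.
func crawl(urls []string, workers int, fetch func(string) error) (succeeded int, failed []string) {
	jobs := make(chan string)
	results := make(chan result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- result{url: u, err: fetch(u)}
			}
		}()
	}

	// Feed the workers, then close results once every worker has finished.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()
	go func() {
		wg.Wait()
		close(results)
	}()

	for r := range results {
		if r.err != nil {
			failed = append(failed, r.url)
		} else {
			succeeded++
		}
	}
	return succeeded, failed
}

func main() {
	ok, failed := crawl([]string{"a", "b", "bad"}, 2, func(u string) error {
		if u == "bad" {
			return errFetch
		}
		return nil
	})
	fmt.Println("succeeded:", ok, "failed:", failed)
}
```

For tens of millions of URLs the failed list would probably be written to a file as results arrive instead of being held in a slice.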

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
	"time"
)

const urls_text = `http://danf.us/
https://farrellit.net/?3=2&#1
`

// crawler fetches every URL received on urls and, when the channel is
// closed, reports on done how many responses it saved successfully.
func crawler(urls <-chan *url.URL, done chan<- int) {
	var processed int = 0
	defer func() { done <- processed }()
	client := http.Client{Timeout: 10 * time.Second}
	for u := range urls {
		if req, err := http.NewRequest("GET", u.String(), nil); err != nil {
			log.Printf("Couldn't create new request for %s: %s", u.String(), err.Error())
		} else {
			req.Header.Add("User-agent", "Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/74.0") // changing user-agent
			if res, err := client.Do(req); err != nil {
				log.Printf("Failed to get %s: %s", u.String(), err.Error())
			} else {
				// filepath.Base returns "/" for a root path and "." for an empty one.
				filename := filepath.Base(u.EscapedPath())
				if filename == "/" || filename == "." || filename == "" {
					filename = "response"
				} else {
					log.Printf("URL Filename is '%s'", filename)
				}
				// Every URL component becomes part of the destination path.
				destpath := filepath.Join(
					res.Status, u.Scheme, u.Hostname(), u.EscapedPath(),
					fmt.Sprintf("?%s", u.RawQuery), fmt.Sprintf("#%s", u.Fragment), filename,
				)
				if err := os.MkdirAll(filepath.Dir(destpath), 0755); err != nil {
					log.Printf("Couldn't create directory %s: %s", filepath.Dir(destpath), err.Error())
				} else if f, err := os.OpenFile(destpath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0644); err != nil {
					log.Printf("Couldn't open destination file %s: %s", destpath, err.Error())
				} else {
					// Stream the body to disk instead of buffering it in memory.
					if b, err := io.Copy(f, res.Body); err != nil {
						log.Printf("Could not copy %s body to %s: %s", u.String(), destpath, err.Error())
					} else {
						log.Printf("Copied %d bytes from body of %s to %s", b, u.String(), destpath)
						processed++
					}
					f.Close()
				}
				res.Body.Close()
			}
		}
	}
}

const workers = 3

func main() {
	urls := make(chan *url.URL)
	done := make(chan int)
	var submitted int = 0
	var inputted int = 0
	var successful int = 0

	// Start the workers before feeding the channel.
	for i := 0; i < workers; i++ {
		go crawler(urls, done)
	}

	sc := bufio.NewScanner(bytes.NewBufferString(urls_text))
	for sc.Scan() {
		inputted++
		if u, err := url.Parse(sc.Text()); err != nil {
			// %v, not %w: the %w verb is only meaningful in fmt.Errorf.
			log.Printf("Could not parse %s as url: %v", sc.Text(), err)
		} else {
			submitted++
			urls <- u
		}
	}
	if err := sc.Err(); err != nil {
		log.Printf("Reading input failed: %v", err)
	}
	close(urls)

	// Each worker sends exactly one count when it finishes.
	for i := 0; i < workers; i++ {
		successful += <-done
	}

	log.Printf("%d urls input, %d could not be parsed. %d/%d valid URLs successful (%.0f%%)",
		inputted, inputted-submitted,
		successful, submitted,
		float64(successful)/float64(submitted)*100.0,
	)
}

Answered 2022-07-11

慕少森


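When setting up a concurrent pipeline, a good guideline is to always set up and instantiate the listeners that will run concurrently first (in your case, the crawlers), and only then start feeding them data through the pipeline (in your case, urlChannel).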

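In your example, the only thing preventing a deadlock is that you instantiated a buffered channel with the same capacity as the number of lines in your test file (1000). What the code does is put the URLs into urlChannel. Since your file has 1000 lines, urlChannel can take all of them without blocking. If you put more URLs in the file, execution will block once urlChannel has filled up.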
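The ordering described above can be sketched in isolation: start the workers, then feed an unbuffered channel from the input, so the reader simply waits whenever all workers are busy and the input size no longer matters. The function names run and demo, the handle callback, and the peak counter are assumptions added for illustration; the sleep stands in for a network request.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"sync"
	"sync/atomic"
	"time"
)

// run (hypothetical helper) starts the workers first, then feeds them lines
// from r. It returns how many lines were handled and the peak number of
// lines being handled at the same moment.
func run(r io.Reader, workers int, handle func(string)) (handled, peak int64, err error) {
	lines := make(chan string) // unbuffered: the reader blocks until a worker is free
	var wg sync.WaitGroup
	var inFlight int64

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for line := range lines {
				n := atomic.AddInt64(&inFlight, 1)
				// Record the highest concurrency level observed so far.
				for {
					p := atomic.LoadInt64(&peak)
					if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
						break
					}
				}
				handle(line)
				atomic.AddInt64(&handled, 1)
				atomic.AddInt64(&inFlight, -1)
			}
		}()
	}

	// Only now start feeding: the workers are already listening.
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		lines <- sc.Text()
	}
	close(lines)
	wg.Wait()
	return handled, peak, sc.Err()
}

// demo pushes 100 placeholder lines through 10 workers.
func demo() (int64, int64, error) {
	input := strings.Repeat("example.com\n", 100)
	return run(strings.NewReader(input), 10, func(string) {
		time.Sleep(5 * time.Millisecond) // stands in for an HTTP request
	})
}

func main() {
	handled, peak, err := demo()
	fmt.Println("handled:", handled, "peak concurrency:", peak, "err:", err)
}
```

Because the channel is unbuffered, memory use does not grow with the number of input lines, which matters when the input is tens of millions of URLs.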

Here is a version of the code that should work:

package main

import (
	"bufio"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"sync"
	"time"
)

func crawler(wg *sync.WaitGroup, urlChannel <-chan string) {
	defer wg.Done()
	client := &http.Client{Timeout: 10 * time.Second} // single client is sufficient for multiple requests

	for urlItem := range urlChannel {
		req1, reqErr1 := http.NewRequest("GET", "http://"+urlItem, nil) // generating the request
		if reqErr1 != nil {
			log.Printf("skipping %s: %v", urlItem, reqErr1)
			continue
		}
		req1.Header.Add("User-agent", "Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/74.0") // changing user-agent
		resp1, respErr1 := client.Do(req1) // sending the prepared request and getting the response
		if respErr1 != nil {
			continue
		}

		if resp1.StatusCode/100 != 2 { // only 2xx responses are saved
			resp1.Body.Close() // close the body on every path so connections are not leaked
			continue
		}

		text1, readErr1 := ioutil.ReadAll(resp1.Body) // try to read the sourcecode of the website
		resp1.Body.Close()                            // close inside the loop; a defer would only run when crawler returns
		if readErr1 != nil {
			log.Fatal(readErr1)
		}

		f1, fileErr1 := os.Create("200/" + urlItem + ".txt") // creating the relative file
		if fileErr1 != nil {
			log.Fatal(fileErr1)
		}

		_, writeErr1 := f1.Write(text1) // writing the sourcecode into our file
		if writeErr1 != nil {
			log.Fatal(writeErr1)
		}
		f1.Close()
	}
}

func main() {
	var wg sync.WaitGroup

	file, err := os.Open("urls.txt") // the file containing the url's
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close() // don't forget to close the file

	urlChannel := make(chan string)

	_ = os.Mkdir("200", 0755) // if it's there, it will create an error, and we will simply ignore it

	// first, initialize crawlers
	wg.Add(10)
	for i := 0; i < 10; i++ {
		go crawler(&wg, urlChannel)
	}

	// after crawlers are initialized, start feeding them data through the channel
	scanner := bufio.NewScanner(file) // each line has another url
	for scanner.Scan() {
		urlChannel <- scanner.Text()
	}
	if err := scanner.Err(); err != nil {
		log.Printf("reading urls.txt: %v", err)
	}
	close(urlChannel)

	wg.Wait()
}

Answered 2022-07-11
