In my nutch-site.xml I added the property below to stop content from being truncated, but the fetch step then fails with the following error. I want Nutch to stop truncating and return the full content I need, and I assumed a value of -1 would do that. I am using version 2.2.1. Any ideas?

<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the http protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting.
</description>
</property>

Exception in thread "main" java.lang.RuntimeException: job failed: name=fetch, job_local1185573074_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:55)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:194)
    at org.apache.nutch.fetcher.FetcherJob.fetch(FetcherJob.java:219)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:301)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.FetcherJob.main(FetcherJob.java:307)

1 Answer

FFIVE
I solved this by removing the http.content.limit section from nutch-site.xml, then adding parser.skip.truncated and setting it to false.
<property>
<name>parser.skip.truncated</name>
<value>false</value>
<description>Boolean value for whether we should skip parsing for truncated documents. By default this
property is activated due to extremely high levels of CPU which parsing can sometimes take.
</description>
</property>
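
For context, here is a minimal sketch of how the resulting nutch-site.xml might look after this change. It assumes the usual Hadoop-style layout with a single `<configuration>` root element. The comment about other overrides is only a placeholder for whatever properties your file already contains, not something taken from the question.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Any other overrides you already have stay as they are (placeholder). -->

  <!-- The http.content.limit override from the question has been removed. -->

  <!-- Parse documents even when they were marked as truncated. -->
  <property>
    <name>parser.skip.truncated</name>
    <value>false</value>
  </property>
</configuration>
```

Properties that are not overridden in nutch-site.xml fall back to the values in nutch-default.xml, so removing the http.content.limit block means the default limit applies again.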