How do I use JavaScript to get the text content of a web page (JSP)?

Asked by 陌依尋, 2021-04-17 17:09

Replies

1 # NaCl

1. Simple scraping with jsoup

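import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// URL of the static page. jsoup needs an absolute URL, so a scheme is included (http:// is assumed here).
String url = "http://a.atimo.cn";
// Fetch and parse the page; get() throws an IOException on failure instead of returning null.
Document doc = Jsoup.connect(url).userAgent("Mozilla").timeout(4000).get();
// One <li> per entry
Elements es = doc.select("div.comments>ul>li");
System.out.println(es);
for (Element element : es) {
    // href is read from the <a> inside the heading (assumed structure); an <h3> carries no href itself.
    String link = element.select("div>h3>a").attr("href");
    String title = element.select("div>h3").text();
    String author = element.select("div.c-abstract>em").text();
    String content = element.select("dd>a>div.icos>i:eq(1)").text();
}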

jsoup parses the response into a Document; use CSS selectors to pick out the elements you need and read their values to get the page content.

2. Lazy loading: some sites load content late, for example table data or pages embedded inside other pages

// HtmlUnit 2.x package names
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

String url = "https://www.cnblogs.com/atimo/";
// Build a WebClient that emulates the Chrome browser
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setUseInsecureSSL(true);      // accept any SSL certificate
webClient.getOptions().setJavaScriptEnabled(true);   // run the page's JavaScript
webClient.getOptions().setCssEnabled(false);         // skip CSS processing
webClient.getOptions().setActiveXNative(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setTimeout(3000000);          // milliseconds
// Load the page, then hand the rendered markup to jsoup
HtmlPage rootPage = webClient.getPage(url);
String html = rootPage.asXml();
webClient.close();                                   // release the simulated browser's resources
Document document = Jsoup.parse(html);
Elements es = document.select("div.comments");       // alternative: .select("#content_left")
System.out.println(es);
for (Element element : es) {
    String link = element.select("div.f13>a").attr("href");
    String title = element.select("div>h3>a").text();
    String text = element.select("div.c-abstract>em").text();
}
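HtmlUnit runs the page's JavaScript first, and jsoup then parses the rendered markup into a Document; use CSS selectors to pick out the elements you need and read their values, as before.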

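3. Plain requests: just use

// createRequest is the answerer's own helper; its body is not shown in the post
HttpURLConnection connection = createRequest(url, "GET");
// Open the actual connection
connection.connect();

Send the GET request, fetch the JSON data, then parse it.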
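The `createRequest` helper is not shown in the answer, so the following is only a minimal sketch of what it might look like, using nothing beyond the JDK. The class name `PlainGetSketch`, the timeout values, the `Accept` header, and the `https://example.com/api` URL are all assumptions made for illustration. The sketch returns the response body as a string; turning that string into objects would still need a JSON library of your choice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class PlainGetSketch {

    // Hypothetical stand-in for the createRequest helper in the answer.
    // openConnection() only prepares the request; nothing is sent until
    // connect() or getInputStream() is called.
    static HttpURLConnection createRequest(String url, String method) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setRequestMethod(method);
            connection.setConnectTimeout(5000); // assumed value, in milliseconds
            connection.setReadTimeout(5000);    // assumed value, in milliseconds
            connection.setRequestProperty("Accept", "application/json");
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the stream to the end and decodes it as UTF-8.
    static String readBody(InputStream in) {
        try (InputStream input = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = input.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sends the GET request and returns the raw response body (for example JSON text).
    static String fetch(String url) {
        HttpURLConnection connection = createRequest(url, "GET");
        try {
            connection.connect();
            return readBody(connection.getInputStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) {
        // Offline demonstration: inspect the prepared request and decode a sample body.
        HttpURLConnection connection = createRequest("https://example.com/api", "GET");
        System.out.println(connection.getRequestMethod() + " " + connection.getURL());
        InputStream sample =
                new ByteArrayInputStream("{\"ok\":true}".getBytes(StandardCharsets.UTF_8));
        System.out.println(readBody(sample));
    }
}
```

Calling `PlainGetSketch.fetch(url)` against a real endpoint performs the actual request; `main` deliberately stays offline so it can be run anywhere.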

4. JS requests that carry request-header parameters (some of them are mobile-client requests)

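import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

CloseableHttpClient https = HttpClients.createDefault();
// URL exactly as posted (it appears truncated); substitute the full captured request URL
String url = "https://action=hene=124&devicetype=androidlag=zh_CN&nettyene=3&pass_ticwx_header=1";
HttpGet httpGet = new HttpGet(url);
httpGet.addHeader("Host", "mp.weixin.qq.com");
// wechartCookie holds values captured from the WeChat session and is defined elsewhere
httpGet.addHeader("x-wechat-uin", wechartCookie.getUin());
httpGet.addHeader("x-", "parameter"); // placeholder for the remaining x-* headers
HttpResponse response = https.execute(httpGet);
HttpEntity entitySort = response.getEntity();
String html = EntityUtils.toString(entitySort, "utf-8");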
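Adjust the request-header parameters to match what the real request needs, as seen when you intercept it with a packet-capture tool.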
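If pulling in Apache HttpClient is not an option, the same idea can be sketched with the JDK's own `java.net.http` API (Java 11 or later), which is a different client from the one used in the answer. The class name `HeaderRequestSketch`, the `https://example.com/api` URL, the timeout, and the header values below are placeholders, not values from the post. The `Host` header is left out because this client derives it from the URI and, by default, rejects attempts to set it by hand.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderRequestSketch {

    // Builds a GET request carrying the given headers. Nothing is sent here;
    // pass the result to HttpClient.send(...) to execute it.
    static HttpRequest buildGet(String url, Map<String, String> headers) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10)) // assumed timeout
                .GET();
        for (Map.Entry<String, String> header : headers.entrySet()) {
            builder.header(header.getKey(), header.getValue());
        }
        return builder.build();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        // Placeholder values: copy the real ones from the captured request.
        headers.put("User-Agent", "Mozilla/5.0 (Linux; Android 10)");
        headers.put("x-wechat-uin", "PLACEHOLDER_UIN");
        HttpRequest request = buildGet("https://example.com/api", headers);
        System.out.println(request.method() + " " + request.uri());
        request.headers().map()
                .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

Keeping the headers in a map makes it easy to swap in a different set whenever the captured request changes, without touching the request-building code.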
