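1. Simple crawling with jsoup
String url = "http://a.atimo.cn"; // URL of the static page; jsoup requires an http:// or https:// scheme
Document doc = Jsoup.connect(url).userAgent("Mozilla").timeout(4000).get();
if (doc != null) {
    // Each <li> under div.comments is one entry
    Elements es = doc.select("div.comments>ul>li");
    System.out.println(es);
    if (es != null && es.size() > 0) {
        for (Element element : es) {
            // attr() reads from the first matched element that has the attribute;
            // point the selector at the <a> if the link sits inside the <h3>
            String link = element.select("div>h3").attr("href");
            String title = element.select("div>h3").text();
            String author = element.select("div.c-abstract>em").text();
            String content = element.select("dd>a>div.icos>i:eq(1)").text();
        }
    }
}
jsoup parses the response into a Document. Use CSS selectors to pick out the elements you need and read their values to get the page content.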
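The selector step can be tried offline against an inline HTML string. The following is only a minimal sketch: the class name `SelectorSketch`, the `extract` method, and the sample markup are made up for illustration, and it assumes jsoup is on the classpath. It reads the link from the `<a>` inside the `<h3>` and resolves it against a base URI with `absUrl`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class SelectorSketch {

    // Returns one {absoluteLink, title} pair per list item that contains a link.
    static List<String[]> extract(String html, String baseUri) {
        // The base URI lets jsoup turn relative hrefs into absolute URLs.
        Document doc = Jsoup.parse(html, baseUri);
        List<String[]> rows = new ArrayList<>();
        for (Element li : doc.select("div.comments > ul > li")) {
            Element a = li.selectFirst("div > h3 > a");
            if (a == null) {
                continue; // this item has no link, skip it
            }
            rows.add(new String[] { a.absUrl("href"), a.text() });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical markup shaped like the selectors used above.
        String html = "<div class=\"comments\"><ul>"
                + "<li><div><h3><a href=\"/post/1\">First post</a></h3></div></li>"
                + "<li><div><h3>No link here</h3></div></li>"
                + "</ul></div>";
        for (String[] row : extract(html, "http://a.atimo.cn/")) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

Checking `selectFirst` for null keeps the loop from failing on items that do not match the expected structure, which is common when a page mixes several entry layouts.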
2. Lazy loading: some sites load part of their content late, for example table data or pages embedded inside another page
// Build a WebClient that emulates the Chrome browser
String url = "https://www.cnblogs.com/atimo/";
WebClient webClient = new WebClient(BrowserVersion.CHROME);
// Accept untrusted SSL certificates
webClient.getOptions().setUseInsecureSSL(true);
// Enable JavaScript so that scripted content gets rendered
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setActiveXNative(false);
// Do not abort on script errors or on non-2xx status codes
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setTimeout(3000000); // timeout in milliseconds
HtmlPage rootPage = webClient.getPage(url);
// Serialize the rendered page and hand it to jsoup
String html = rootPage.asXml();
Document document = Jsoup.parse(html);
Elements es = document.select("div.comments"); // alternative: .select("#content_left")
System.out.println(es);
if (es != null && es.size() > 0) {
    for (Element element : es) {
        String link = element.select("div.f13>a").attr("href");
        String title = element.select("div>h3>a").text();
        String text = element.select("div.c-abstract>em").text();
    }
}
The result is again a Document. Use CSS selectors to pick out the elements you need and read their values to get the page content.
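3. Plain requests: simply use
// createRequest is a helper that is not shown here
HttpURLConnection connection = createRequest(url, "GET");
// Open the actual connection
connection.connect();
Send the GET request, obtain the JSON data, and then parse it.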
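The `createRequest` helper is not shown in the post, so the following is only a guess at what such a helper might look like, using nothing but the JDK. The class name `PlainGetSketch`, the `readBody` method, the `Accept` header, and the 4-second timeouts are assumptions made for this sketch. Parsing the returned JSON would be handed to a JSON library of your choice and is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class PlainGetSketch {

    // Hypothetical stand-in for the createRequest helper; it only configures
    // the connection and does not touch the network yet.
    static HttpURLConnection createRequest(String url, String method) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setRequestMethod(method);
            connection.setRequestProperty("Accept", "application/json");
            connection.setConnectTimeout(4000); // milliseconds
            connection.setReadTimeout(4000);    // milliseconds
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Opens the connection and returns the response body as a UTF-8 string.
    static String readBody(HttpURLConnection connection) throws IOException {
        connection.connect();
        try (InputStream in = connection.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PlainGetSketch <url>");
            return;
        }
        // The raw JSON text; pass it to a JSON parser from here.
        System.out.println(readBody(createRequest(args[0], "GET")));
    }
}
```

Keeping configuration and I/O in separate methods means the request can be inspected or adjusted before any network traffic happens.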
4. JS requests that carry request-header parameters (some are mobile-client requests)
CloseableHttpClient https = HttpClients.createDefault();
String url = "https://action=hene=124&devicetype=androidlag=zh_CN&nettyene=3&pass_ticwx_header=1";
HttpGet httpGet = new HttpGet(url);
// Copy the headers that the captured request carries
httpGet.addHeader("Host", "mp.weixin.qq.com");
httpGet.addHeader("x-wechat-uin", wechartCookie.getUin());
httpGet.addHeader("x-", "parameter");
HttpResponse response = https.execute(httpGet);
HttpEntity entitySort = response.getEntity();
String html = EntityUtils.toString(entitySort, "utf-8");
Change the header parameters to match what the request actually requires, as seen in the request intercepted with a packet-capture tool.
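When the set of headers changes from one captured request to the next, it can help to keep them in a map and apply them in one place. The sketch below swaps in the JDK's built-in `java.net.http` client (Java 11+) in place of Apache HttpClient. The class name `HeaderRequestSketch`, the `buildGet` and `fetch` methods, the example URL, and the header values are all placeholders rather than values taken from a real capture. Note that this client treats `Host` as a restricted header and derives it from the URL, so it is not set here.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderRequestSketch {

    // Builds a GET request and applies every captured header to it.
    static HttpRequest buildGet(String url, Map<String, String> headers) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .GET();
        headers.forEach(builder::header);
        return builder.build();
    }

    // Sends the request and returns the response body as a string.
    static String fetch(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        // Placeholder headers; replace them with the ones seen in the capture.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("x-wechat-uin", "PLACEHOLDER_UIN");
        headers.put("User-Agent", "Mozilla/5.0 (Linux; Android 10)");

        HttpRequest request = buildGet("https://example.com/api", headers);
        // Printed for inspection only; call fetch(request) to actually send it.
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```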