Replies
  • 1 # NaCl

    1. Simple crawling with jsoup

    String url = "http://a.atimo.cn"; // URL of the static page (Jsoup.connect requires an http:// or https:// scheme)

    Document doc = Jsoup.connect(url).userAgent("Mozilla").timeout(4000).get();

    if(doc!=null){

    Elements es = doc.select("div.comments>ul>li");

    System.out.println(es);

    if(es!=null && es.size()>0){

    for (Element element : es) {

    String link = element.select("div>h3>a").attr("href"); // the href sits on the <a> inside the <h3>, not on the <h3> itself

    String title = element.select("div>h3").text();

    String author = element.select("div.c-abstract>em").text();

    String content = element.select("dd>a>div.icos>i:eq(1)").text();

    }

    }

    }

    jsoup parses the response into a Document. Use CSS selectors to pick values out of the page's tags, and you have the page content.
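    To make the selector step concrete, here is a minimal sketch that runs the same kind of selector against an inline HTML string instead of a live page. The class name, the sample markup, and the extractLinks helper are all made up for illustration, and it assumes jsoup is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorSketch {

    // Hypothetical markup shaped like the structure the selectors above expect.
    static final String SAMPLE_HTML =
        "<div class=\"comments\"><ul>"
        + "<li><div><h3><a href=\"/post/1\">First post</a></h3></div></li>"
        + "<li><div><h3><a href=\"/post/2\">Second post</a></h3></div></li>"
        + "</ul></div>";

    // Collects the href of the first <a> under div>h3 for every list item.
    static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        for (Element item : doc.select("div.comments>ul>li")) {
            Element anchor = item.selectFirst("div>h3>a");
            if (anchor != null) { // selectFirst returns null when nothing matches
                links.add(anchor.attr("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extractLinks(SAMPLE_HTML));
    }
}
```

    Testing selectors against a saved HTML snippet like this is usually quicker than hitting the site on every run.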

    2. Lazy loading: some sites load content late, for example table contents or pages embedded inside another page

    // Build a WebClient that emulates the Chrome browser

    String url = "https://www.cnblogs.com/atimo/";

    WebClient webClient = new WebClient(BrowserVersion.CHROME);

    // Enable JavaScript support

    webClient.getOptions().setUseInsecureSSL(true);

    webClient.getOptions().setJavaScriptEnabled(true);

    webClient.getOptions().setCssEnabled(false);

    webClient.getOptions().setActiveXNative(false);

    webClient.getOptions().setThrowExceptionOnScriptError(false);

    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

    webClient.getOptions().setTimeout(3000000);

    HtmlPage rootPage = webClient.getPage(url);

    String html = rootPage.asXml();

    Document document = Jsoup.parse(html);

    Elements es = document.select("div.comments");//.select("#content_left");

    System.out.println(es);

    if(es!=null && es.size()>0){

    for (Element element : es) {

    String link = element.select("div.f13>a").attr("href");

    String title = element.select("div>h3>a").text();

    String text = element.select("div.c-abstract>em").text();

    }

    }

    The result is again a Document. Use CSS selectors to pick values out of the page's tags, and you have the page content.

    3. Ordinary requests: you only need

    HttpURLConnection connection = createRequest(url, "GET");

    // Open the actual connection

    connection.connect();

    Send the GET request, fetch the JSON data, and parse it.
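    The createRequest helper is not shown in this reply, so the following is only a guess at what such a helper might look like, using the JDK's HttpURLConnection. The class name, the timeouts, the Accept header, and the throwaway local server in main are all assumptions made for the demo. Parsing the returned JSON is left to whatever JSON library you already use.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class JsonGetSketch {

    // Hypothetical stand-in for createRequest(url, method); nothing is sent yet.
    static HttpURLConnection createRequest(String url, String method) {
        try {
            HttpURLConnection connection =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setRequestMethod(method);
            connection.setRequestProperty("Accept", "application/json");
            connection.setConnectTimeout(4000); // milliseconds
            connection.setReadTimeout(4000);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Opens the connection and returns the response body as a UTF-8 string.
    static String readBody(HttpURLConnection connection) {
        try {
            connection.connect();
            try (InputStream in = connection.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Local throwaway server so the demo does not depend on a real site.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        byte[] json = "{\"title\":\"demo\"}".getBytes(StandardCharsets.UTF_8);
        server.createContext("/data", exchange -> {
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, json.length);
            exchange.getResponseBody().write(json);
            exchange.close();
        });
        server.start();
        try {
            String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/data";
            System.out.println(readBody(createRequest(url, "GET")));
        } finally {
            server.stop(0);
        }
    }
}
```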

    4. JS requests that carry request-header parameters (some of them are mobile-client requests)

    CloseableHttpClient https = HttpClients.createDefault();

    String url = "https://action=hene=124&devicetype=androidlag=zh_CN&nettyene=3&pass_ticwx_header=1";

    HttpGet httpGet = new HttpGet(url);

    httpGet.addHeader("Host", "mp.weixin.qq.com");

    httpGet.addHeader("x-wechat-uin", wechartCookie.getUin());

    httpGet.addHeader("x-", "parameter value");

    HttpResponse response = https.execute(httpGet);

    HttpEntity entitySort = response.getEntity();

    String html = EntityUtils.toString(entitySort, "utf-8");

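    Change the request-header parameters to match whatever the request needs, as seen when you intercept it with a packet-capture tool.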
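    Since the header set changes with every capture, it can help to keep the captured headers in a map and apply them in one place. The sketch below does that with the JDK's own java.net.http API (Java 11+) instead of the Apache HttpClient used above. The class name, the example.com URL, and the header values are placeholders, not real captured data.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderRequestSketch {

    // Builds a GET request and copies every captured header onto it.
    static HttpRequest buildRequest(String url, Map<String, String> capturedHeaders) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        capturedHeaders.forEach(builder::header);
        return builder.build();
    }

    public static void main(String[] args) {
        // Placeholder values; replace them with what the capture tool shows.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("x-wechat-uin", "PLACEHOLDER_UIN");
        headers.put("User-Agent", "Mozilla/5.0 (Linux; Android 10)");

        HttpRequest request = buildRequest("https://example.com/api", headers);
        request.headers().map()
               .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

    To send it, pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). Note that java.net.http derives Host from the URI and rejects an explicit Host header by default, so that one is left out of the map here.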
