On two machines of the same model (snap1 and snap3) we tested the disk read speed and found that the two machines differ dramatically:
#dd if=/dev/dm-93 of=/dev/null bs=4M count=1024
711 MB/s on snap1.
178 MB/s on snap3.
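For anyone repeating the measurement, here is a minimal sketch of a reproducible run (this wrapper is not part of the original investigation; it assumes root privileges and the same device path, and drops the page cache first so dd measures the disk rather than cached data):

#!/bin/bash
# Hypothetical benchmark wrapper: flush and drop the page cache, then read 4 GiB sequentially.
dev=/dev/dm-93
sync
echo 3 > /proc/sys/vm/drop_caches      # drop clean caches so the read really hits the disk
dd if="$dev" of=/dev/null bs=4M count=1024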
Next we compared the following fields for the dm-93 disk (a RAID device) on snap1 and snap3; the output was identical on both machines:
/sys/block/<device>/queue/max_sectors_kb
/sys/block/<device>/queue/nomerges
/sys/block/<device>/queue/rq_affinity
/sys/block/<device>/queue/scheduler

For an explanation of these fields see: https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt
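A small helper for dumping these fields is sketched below; it is a hypothetical convenience script (check_queue.sh is not from the original debugging session). Run it on each machine and diff the two outputs.

#!/bin/bash
# check_queue.sh - dump the request-queue tunables of one block device
dev=$1                                   # device name, e.g. dm-93
for f in max_sectors_kb nomerges rq_affinity scheduler read_ahead_kb; do
    printf '%s = ' "$f"
    cat "/sys/block/$dev/queue/$f"
done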
Then we used blktrace to monitor how the disk IO is processed:
#blktrace /dev/dm-93
Use blkparse to view the log collected by blktrace:
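In case it helps to reproduce the capture, a minimal invocation (assuming the standard blktrace/blkparse tools) records per-CPU trace files and decodes them afterwards:

blktrace -d /dev/dm-93 -o dm-93     # record events into dm-93.blktrace.<cpu>; stop with Ctrl-C
blkparse -i dm-93                   # decode the recorded per-CPU trace files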
253,108 1 1 7.263881407 21072 Q R 128 + 128 [dd]   <-- on snap3, dd requests one page (64 KB pages)
253,108 1 2 7.263883907 21072 G R 128 + 128 [dd]
253,108 1 3 7.263885017 21072 I R 128 + 128 [dd]
253,108 1 4 7.263886077 21072 D R 128 + 128 [dd]   <-- the IO is issued to the disk
253,108 0 1 7.264883548     3 C R 128 + 128 [0]    <-- about 1 ms later the IO completes
253,108 1 5 7.264907601 21072 Q R 256 + 128 [dd]   <-- only after the disk has completed that IO does dd start the next one
253,108 1 6 7.264908587 21072 G R 256 + 128 [dd]
253,108 1 7 7.264908937 21072 I R 256 + 128 [dd]
253,108 1 8 7.264909470 21072 D R 256 + 128 [dd]
253,108 0 2 7.265757903     3 C R 256 + 128 [0]

On snap1, however, the picture is completely different: dd queues the next IO before the previous one has completed:

253,108 17 1 5.020623706 23837 Q R 128 + 128 [dd]
253,108 17 2 5.020625075 23837 G R 128 + 128 [dd]
253,108 17 3 5.020625309 23837 P N [dd]
253,108 17 4 5.020626991 23837 Q R 256 + 128 [dd]
253,108 17 5 5.020627454 23837 M R 256 + 128 [dd]
253,108 17 6 5.020628526 23837 Q R 384 + 128 [dd]
253,108 17 7 5.020628704 23837 M R 384 + 128 [dd]
At this point we suspected that snap3 was not doing readahead when reading from the disk, but read_ahead_kb was the same on both machines, 512:
#cat /sys/block/<device>/queue/read_ahead_kb
512
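As an aside, read_ahead_kb is backed by the device's backing_dev_info (BDI), so the same limit can also be read through the bdi class (keeping the <device> placeholder from above); this per-device limit becomes relevant again in the fixes discussed at the end:

cat /sys/class/bdi/$(cat /sys/block/<device>/dev)/read_ahead_kb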
Out of conventional options, we pulled out the big gun: use kprobes to probe the arguments of the relevant functions:
#ra_trace.sh
#!/bin/bash
if [ "$#" != 1 ]; then
    echo "Usage: ra_trace.sh <device>"
    exit
fi
echo 'p:do_readahead __do_page_cache_readahead mapping=%di offset=%dx pages=%cx' >/sys/kernel/debug/tracing/kprobe_events
echo 'p:submit_ra ra_submit mapping=%si ra=%di rastart=+0(%di) rasize=+8(%di):u32 rapages=+16(%di):u32' >>/sys/kernel/debug/tracing/kprobe_events
echo 'p:sync_ra page_cache_sync_readahead mapping=%di ra=%si rastart=+0(%si) rasize=+8(%si):u32 rapages=+16(%si):u32' >>/sys/kernel/debug/tracing/kprobe_events
echo 'p:async_ra page_cache_async_readahead mapping=%di ra=%si rastart=+0(%si) rasize=+8(%si):u32 rapages=+16(%si):u32' >>/sys/kernel/debug/tracing/kprobe_events
echo 1 >/sys/kernel/debug/tracing/events/kprobes/enable
dd if=$1 of=/dev/null bs=4M count=1024
echo 0 >/sys/kernel/debug/tracing/events/kprobes/enable
cat /sys/kernel/debug/tracing/trace_pipe &
CATPID=$!
sleep 3
kill $CATPID
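One note on the tail of the script: trace_pipe is a consuming, blocking read of the ftrace ring buffer, which is why the cat is backgrounded and killed after a few seconds. An alternative (a sketch, assuming the same debugfs tracing layout) is to take a non-consuming snapshot from the trace file after disabling the events:

cat /sys/kernel/debug/tracing/trace > ra_trace.log    # snapshot of the buffered kprobe hits
echo > /sys/kernel/debug/tracing/trace                # clear the ring buffer for the next run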
On snap3 we found that rasize=0 whenever readahead is attempted, so indeed no data is read ahead during the reads:
<...>-35748 [009] 2507549.022375: submit_ra: (.ra_submit+0x0/0x38) mapping=c0000001bbd17728 ra=c000000191a261f0 rastart=df0b rasize=0 rapages=8
<...>-35748 [009] 2507549.022376: do_readahead: (.__do_page_cache_readahead+0x0/0x208) mapping=c0000001bbd17728 offset=df0b pages=0
<...>-35748 [009] 2507549.022694: sync_ra: (.page_cache_sync_readahead+0x0/0x50) mapping=c0000001bbd17728 ra=c000000191a261f0 rastart=df0b rasize=0 rapages=8
<...>-35748 [009] 2507549.022695: submit_ra: (.ra_submit+0x0/0x38) mapping=c0000001bbd17728 ra=c000000191a261f0 rastart=df0c rasize=0 rapages=8
Next we read the readahead code carefully and found that the number of readahead pages depends on the memory of the current NUMA node:
unsigned long max_sane_readahead(unsigned long nr)
{
        return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
                        + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
}
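To see what this formula evaluates to on a given machine, here is a small illustration (a hypothetical script, assuming a kernel that exposes per-node counters in /sys/devices/system/node/node*/vmstat):

#!/bin/bash
# Approximate the max_sane_readahead() cap for one NUMA node, in pages:
# (NR_INACTIVE_FILE + NR_FREE_PAGES) / 2
node=$1                                   # NUMA node id, e.g. 0
vmstat=/sys/devices/system/node/node$node/vmstat
inactive=$(awk '$1 == "nr_inactive_file" {print $2}' $vmstat)
free=$(awk '$1 == "nr_free_pages" {print $2}' $vmstat)
echo "readahead cap on node $node: $(( (inactive + free) / 2 )) pages"

On a memoryless node both counters are 0, so the cap is 0 pages, which matches the rasize=0 seen in the kprobe trace above.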
Comparing the per-node memory on snap1 and snap3, we found that on snap3 node 0 has zero total memory and zero free memory (root cause found :-)
snap1:
# /usr/bin/numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
node 0 size: 8192 MB
node 0 free: 529 MB
node distances:
node   0
  0:  10

snap3:
# /usr/bin/numactl --hardware
available: 2 nodes (0,2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 8192 MB
node 2 free: 888 MB
node distances:
node   0   2
  0:  10  40
  2:  40  10
It turns out there are two upstream kernel patches that fix this problem, so that readahead no longer depends on the memory of the node the current CPU sits on:
commit 6d2be915e589
mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages

+#define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
 /*
  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
  * sensible upper limit.
  */
 unsigned long max_sane_readahead(unsigned long nr)
 {
-	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
-		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+	return min(nr, MAX_READAHEAD);
 }

commit 600e19afc5f8
mm: use only per-device readahead limit

The first commit caps readahead at a fixed MAX_READAHEAD (2 MB worth of pages) instead of consulting the local node's memory; the second goes further and bounds readahead only by the per-device read_ahead_kb limit.
Note: the kernel code above is based on the mainline Linux 3.0 kernel.