MongoDB副本集的環境中,有一個比較特殊的現象叫做資料回滾(rollbacks),這裡我們簡單說明一下.
副本集的架構(Architecture)下圖是一個典型的一主多從的副本集架構,在該架構中,如果主節點異常,會在從節點中會選舉出一個優先順序最高的從節點接替主節點的位置.
副本集架構
選舉(Elections)1. 當副本集的從節點ec1-notificationdb-02發現自己無法與主節點ec1-notificationdb-01取得聯絡並且該從節點有成為主節點的條件時,從節點ec1-notificationdb-02會向所有能到達的其他從節點ec1-notificationdb-03/04/05傳送選舉通知,請求選舉自己為主節點.
2. 當從節點ec1-notificationdb-03/04/05收到該請求後,會判斷從節點ec1-notificationdb-02可能並不是一個合適的候選者,例如資料副本落後太多,並且ec1-notificationdb-03/04/05能夠和主節點取得聯絡,只是從節點ec1-notificationdb-02自身原因無法與主節點取得聯絡而已,這種情況下,其他從節點會拒絕該選舉請求.
3. 如果其他從節點ec1-notificationdb-03/04/05並沒有反對的理由,那麼這些從節點將會對ec1-notificationdb-02的選舉請求進行投票,如果ec1-notificationdb-03/04/05中超過半數同意ec1-notificationdb-02的選舉請求,那麼ec1-notificationdb-02選舉成功並且將自己轉化為主節點.如果沒有得到超過半數的同意,那麼在一段時間以後,從節點ec1-notificationdb-02會重新發起選舉請求.
4. 主節點ec1-notificationdb-01將會一直處於Promary角色,如果它無法和副本集中大部分從節點通訊.該主節點會自動降級成為從節點.
在一個穩健的網路環境中並且副本集大多數節點正常工作情況下,選舉過程幾乎很快就會完成.這是因為副本集節點每隔2S傳送一次心跳包,也意味著如果主節點宕機,從節點在接下來的一個心跳包中就可以監測到主節點異常,立刻開始選舉過程,理想情況下幾秒鐘的時間就可以選舉出新的主節點.但是在實際情況中,也有可能是因為網路抖動或者伺服器過載導致心跳包回覆過慢,綜合考慮各種情況下,選舉過程可能會持續更長時間.
回滾(Rollbacks)在選舉一節中我們提到選舉,如果一個主節點在寫入一條資料後並沒有來得及將該資料同步到從節點,此刻主節點突然宕機,這時從節點發起選舉請求,從眾多從節點中選舉出新的主節點,那麼新的主節點並沒有老的主節點上那一條資料.
主節點宕機
如上圖所示,在資料"_id:4,username:user004"還沒有來得及寫入從節點時,主節點因網路故障無法與所有從節點通訊,此時從節點將會發起選舉請求,選舉出新的主節點,而原有的主節點網路恢復後會被降級成從節點,最終新的主節點的架構圖如下.
選舉出新的主節點
這裡需要特別注意,原來的主節點以從節點的身份加入副本集後,用自己最新的資料"_id:4, username:user004"和新的主節點對比,發現新的主節點並沒有該條資料,於是再用自己較老的資料"_id:3, username:user003"和新的主節點資料對比,發現新的主節點資料和自己匹配,那麼該從節點將會回滾資料至"_id:3, username:user003",該版本以後的資料會被存放至資料目錄的rollback目錄裡.然後開始同步新的主節點的資料.
回滾現象模擬首先我們在主節點檢視data目錄,確保MongoDB並沒有rollback回滾目錄
$ ls -lhtotal 192Kdrwx------ 2 mongodb dba 294 Jan 28 09:31 admindrwx------ 2 mongodb dba 211 Jan 28 09:29 configdrwx------ 2 mongodb dba 71 Jan 29 14:14 diagnostic.datadrwx------ 2 mongodb dba 110 Jan 28 09:19 journaldrwx------ 2 mongodb dba 4.0K Jan 28 09:24 local-rw------- 1 mongodb dba 36K Jan 28 09:32 _mdb_catalog.wt-rw------- 1 mongodb dba 6 Jan 28 09:19 mongod.lock-rw------- 1 mongodb dba 36K Jan 29 14:14 sizeStorer.wt-rw------- 1 mongodb dba 114 Jan 28 09:19 storage.bson-rw------- 1 mongodb dba 46 Jan 28 09:19 WiredTiger-rw------- 1 mongodb dba 4.0K Jan 28 09:19 WiredTigerLAS.wt-rw------- 1 mongodb dba 21 Jan 28 09:19 WiredTiger.lock-rw------- 1 mongodb dba 1.3K Jan 29 14:14 WiredTiger.turtle-rw------- 1 mongodb dba 92K Jan 29 14:14 WiredTiger.wt
在主節點插入測試資料
dbadmin@mydb:ec1-notificationdb-01 repl:PRIMARY> db.users.insertMany([... {_id : 1, "username" : "user001"},... {_id : 2, "username" : "user002"},... {_id : 3, "username" : "user003"}... ]){ "acknowledged" : true, "insertedIds" : [ 1, 2, 3 ] }
在從節點檢視已插入的資料
dbadmin@mydb:ec1-notificationdb-02 repl:SECONDARY> db.users.find(){ "_id" : 1, "username" : "user001" }{ "_id" : 3, "username" : "user003" }{ "_id" : 2, "username" : "user002" }
dbadmin@mydb:ec1-notificationdb-03 repl:SECONDARY> db.users.find(){ "_id" : 1, "username" : "user001" }{ "_id" : 3, "username" : "user003" }{ "_id" : 2, "username" : "user002" }
同時停止兩個從節點的MongoDB服務,並在主節點再次插入一條資料,這裡需要注意的是,如果主節點在20S內沒有和它的從節點取得聯絡,那麼主節點會將自己降級為從節點,所以插入測試的資料一定要快.
dbadmin@mydb:ec1-notificationdb-01 repl:PRIMARY> db.users.insert({_id : 4, "username" : "user004"})WriteResult({ "nInserted" : 1 })dbadmin@mydb:ec1-notificationdb-01 repl:PRIMARY> db.users.find()Error: error: { "operationTime" : Timestamp(1611930689, 1), "ok" : 0, "errmsg" : "not master and slaveOk=false", "code" : 13435, "codeName" : "NotPrimaryNoSecondaryOk", "$clusterTime" : { "clusterTime" : Timestamp(1611930689, 1), "signature" : { "hash" : BinData(0,"kDtFyX7oSm2zvUHeEVGm7XznXfk="), "keyId" : NumberLong("6922739574590406659") } }}dbadmin@mydb:ec1-notificationdb-01 repl:SECONDARY> rs.secondaryOk()dbadmin@mydb:ec1-notificationdb-01 repl:SECONDARY> db.users.find(){ "_id" : 1, "username" : "user001" }{ "_id" : 2, "username" : "user002" }{ "_id" : 3, "username" : "user003" }{ "_id" : 4, "username" : "user004" }
從上可以看到,我們在主節點插入的時還是可以正常插入,查詢的時候主節點已經降級成為從節點了,並且確定資料已經插入成功了,此時我們在主節點停止MongoDB服務.
分別按順序啟動另外兩個節點,可以發現ec1-notificationdb-02已經贏得選舉成為主節點,而它的從節點為ec1-notificationdb-03,同時主節點並未同步過來我們剛才插入的一條測試資料.
dbadmin@mydb:ec1-notificationdb-02 repl:PRIMARY> db.users.find(){ "_id" : 1, "username" : "user001" }{ "_id" : 3, "username" : "user003" }{ "_id" : 2, "username" : "user002" }dbadmin@mydb:ec1-notificationdb-02 repl:PRIMARY> rs.printSecondaryReplicationInfo()source: ec1-notificationdb-01.mypna.com:27017 syncedTo: Thu Jan 01 1970 00:00:00 GMT+0000 (GMT) 1611931739 secs (447758.82 hrs) behind the primary source: ec1-notificationdb-03.mypna.com:27017 syncedTo: Fri Jan 29 2021 14:48:59 GMT+0000 (GMT) 0 secs (0 hrs) behind the primary
這時我們啟動停掉的主節點,同時觀察log日誌
2021-01-29T14:52:09.100+0000 I REPL [rsBackgroundSync] Changed sync source from empty to ec1-notificationdb-03.mypna.com:270172021-01-29T14:52:09.102+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last optime fetched: { ts: Timestamp(1611930689, 1), t: 1 }. source's GTE: { ts: Timestamp(1611931469, 2), t: 2 }2021-01-29T14:52:09.102+0000 I REPL [rsBackgroundSync] Replication commit point: { ts: Timestamp(0, 0), t: -1 }2021-01-29T14:52:09.102+0000 I REPL [rsBackgroundSync] Rollback using 'recoverToStableTimestamp' method.2021-01-29T14:52:09.102+0000 I REPL [rsBackgroundSync] Scheduling rollback (sync source: ec1-notificationdb-03.mypna.com:27017)2021-01-29T14:52:09.102+0000 I ROLLBACK [rsBackgroundSync] transition to ROLLBACK2021-01-29T14:52:09.102+0000 I REPL [rsBackgroundSync] State transition ops metrics: { lastStateTransition: "rollback", userOpsKilled: 0, userOpsRunning: 4 }2021-01-29T14:52:09.102+0000 I REPL [rsBackgroundSync] transition to ROLLBACK from SECONDARY2021-01-29T14:52:09.102+0000 I NETWORK [rsBackgroundSync] Skip closing connection for connection # 72021-01-29T14:52:09.102+0000 I NETWORK [rsBackgroundSync] Skip closing connection for connection # 62021-01-29T14:52:09.102+0000 I NETWORK [rsBackgroundSync] Skip closing connection for connection # 22021-01-29T14:52:09.102+0000 I ROLLBACK [rsBackgroundSync] Waiting for all background operations to complete before starting rollback2021-01-29T14:52:09.102+0000 I ROLLBACK [rsBackgroundSync] Finished waiting for background operations to complete before rollback2021-01-29T14:52:09.102+0000 I ROLLBACK [rsBackgroundSync] finding common point2021-01-29T14:52:09.226+0000 I ROLLBACK [rsBackgroundSync] Rollback common point is { ts: Timestamp(1611930684, 1), t: 1 }2021-01-29T14:52:09.226+0000 I SHARDING [rsBackgroundSync] Marking collection local.system.rollback.id as collection version: <unsharded>2021-01-29T14:52:09.227+0000 I REPL [rsBackgroundSync] Incremented the rollback ID to 22021-01-29T14:52:09.227+0000 I ROLLBACK [rsBackgroundSync] finding record store counts2021-01-29T14:52:09.227+0000 I ROLLBACK [rsBackgroundSync] Preparing to write deleted documents to a rollback file for collection mydb.users with uuid d0b9b5f8-41ac-47f9-bd9f-6bdc0e3b2140 to /usr/local/mongodb/data/rollback/mydb.users/removed.2021-01-29T14-52-09.0.bson2021-01-29T14:52:09.227+0000 I STORAGE [rsBackgroundSync] closeCatalog: closing all databases2021-01-29T14:52:09.227+0000 I STORAGE [rsBackgroundSync] closeCatalog: closing storage engine catalog2021-01-29T14:52:09.227+0000 I STORAGE [rsBackgroundSync] Deregistering all the collections2021-01-29T14:52:09.227+0000 I STORAGE [WTOplogJournalThread] Oplog journal thread loop shutting down2021-01-29T14:52:09.254+0000 I ROLLBACK [rsBackgroundSync] Rolling back to the stable timestamp. StableTimestamp: Timestamp(1611930684, 1) Initial Data Timestamp: Timestamp(1611930684, 1)2021-01-29T14:52:09.254+0000 I STORAGE [rsBackgroundSync] openCatalog: loading storage engine catalog2021-01-29T14:52:09.273+0000 I STORAGE [rsBackgroundSync] OplogTruncaterThread local.oplog.rs already started2021-01-29T14:52:09.273+0000 I STORAGE [rsBackgroundSync] The size storer reports that the oplog contains 10494 records totaling to 1158449 bytes2021-01-29T14:52:09.273+0000 I STORAGE [rsBackgroundSync] Scanning the oplog to determine where to place markers for truncation2021-01-29T14:52:09.274+0000 I STORAGE [rsBackgroundSync] WiredTiger record store oplog processing took 1ms2021-01-29T14:52:09.277+0000 I STORAGE [rsBackgroundSync] openCatalog: reconciling catalog and idents2021-01-29T14:52:09.278+0000 I STORAGE [rsBackgroundSync] openCatalog: reopening all databases2021-01-29T14:52:09.282+0000 I STORAGE [rsBackgroundSync] openCatalog: updating cached oplog pointer2021-01-29T14:52:09.282+0000 I STORAGE [rsBackgroundSync] openCatalog: finished reloading collection catalog2021-01-29T14:52:09.282+0000 I STORAGE [rsBackgroundSync] recoverToStableTimestamp successful. Stable Timestamp: Timestamp(1611930684, 1)2021-01-29T14:52:09.282+0000 I ROLLBACK [rsBackgroundSync] Rollback reverted 1 insert operations, 0 update operations and 0 delete operations.2021-01-29T14:52:09.282+0000 I ROLLBACK [rsBackgroundSync] Marking to truncate all oplog entries with timestamps greater than or equal to { ts: Timestamp(1611930689, 1), t: 1 }2021-01-29T14:52:09.282+0000 I SHARDING [rsBackgroundSync] Marking collection local.replset.oplogTruncateAfterPoint as collection version: <unsharded>2021-01-29T14:52:09.282+0000 I REPL [rsBackgroundSync] Removing unapplied entries starting at: { : Timestamp(1611930689, 1) }2021-01-29T14:52:09.282+0000 I REPL [rsBackgroundSync] Replication recovery oplog truncation finished in: 0ms2021-01-29T14:52:09.284+0000 I REPL [rsBackgroundSync] Recovering from stable timestamp: Timestamp(1611930684, 1) (top of oplog: { ts: Timestamp(1611930684, 1), t: 1 }, appliedThrough: { ts: Timestamp(0, 0), t: -1 }, TruncateAfter: Timestamp(0, 0))2021-01-29T14:52:09.284+0000 I REPL [rsBackgroundSync] Starting recovery oplog application at the stable timestamp: Timestamp(1611930684, 1)2021-01-29T14:52:09.284+0000 I REPL [rsBackgroundSync] No oplog entries to apply for recovery. Start point is at the top of the oplog.2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] Triggering the rollback op observer2021-01-29T14:52:09.285+0000 I REPL [rsBackgroundSync] Resetting key manager cache2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] Rollback complete2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] Rollback summary:2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] start time: 2021-01-29T14:52:09.102+00002021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] end time: 2021-01-29T14:52:09.285+00002021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] sync source: ec1-notificationdb-03.mypna.com:270172021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] rollback data file directory: /usr/local/mongodb/data/rollback/mydb.users2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] rollback id: 22021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] last optime on branch of history rolled back: { ts: Timestamp(1611930689, 1), t: 1 }2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] common point optime: { ts: Timestamp(1611930684, 1), t: 1 }2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] last wall clock time on the branch of history rolled back: 2021-01-29T14:31:29.898+00002021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] wall clock time of the first operation after the common point: 2021-01-29T14:31:29.898+00002021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] difference in wall clock times: 0 second(s)2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] truncate timestamp: Timestamp(1611930689, 1)2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] stable timestamp: Timestamp(1611930684, 1)2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] shard identity document rolled back: false2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] config server config version document rolled back: false2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] affected sessions: none2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] affected namespaces: 2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] mydb.users2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] counts of interesting commands rolled back: 2021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] update: 02021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] delete: 02021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] insert: 12021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] total number of entries rolled back (including no-ops): 12021-01-29T14:52:09.285+0000 I ROLLBACK [rsBackgroundSync] transition to SECONDARY2021-01-29T14:52:09.285+0000 I REPL [rsBackgroundSync] transition to SECONDARY from ROLLBACK2021-01-29T14:52:09.285+0000 I REPL [rsBackgroundSync] Resetting sync source to empty, which was ec1-notificationdb-03.mypna.com:270172021-01-29T14:52:09.285+0000 I REPL [rsBackgroundSync] Rollback successful.
可以很明顯看到原來的主節點發生了rollback的現象,而且rollback的資料被儲存到資料目錄的rollback目錄,我們到資料目錄檢視
$ ls -lh rollback/mydb.users/removed.2021-01-29T14-52-09.0.bson -rw------- 1 mongodb dba 40 Jan 29 14:52 rollback/mydb.users/removed.2021-01-29T14-52-09.0.bson
我們將rollback資料匯入到主節點的臨時表中檢視
$ mongorestore --host=ec1-notificationdb-02 --username=dbadmin --password=88888888 --authenticationDatabase admin --db=mydb --collection=tb1 rollback/mydb.users/removed.2021-01-29T14-52-09.0.bson 2021-01-29T15:07:43.263+0000 checking for collection data in rollback/mydb.users/removed.2021-01-29T14-52-09.0.bson2021-01-29T15:07:43.280+0000 restoring mydb.tb1 from rollback/mydb.users/removed.2021-01-29T14-52-09.0.bson2021-01-29T15:07:43.341+0000 no indexes to restore2021-01-29T15:07:43.342+0000 finished restoring mydb.tb1 (1 document, 0 failures)2021-01-29T15:07:43.342+0000 1 document(s) restored successfully. 0 document(s) failed to restore.
$ mongo --host=ec1-notificationdb-02 --username=dbadmin --password=88888888 --authenticationDatabase admin mydbdbadmin@mydb:ec1-notificationdb-02 repl:PRIMARY> db.tb1.find(){ "_id" : 4, "username" : "user004" }
此時我們可以根據需要決定是否將rollback的資料重新匯入的副本集中.