烹茶细论

MFS becomes unresponsive for a few minutes every hour

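For the past few days, the MFS server has become unresponsive between minutes 01 and 05 of every hour, and access to resources fails during that window. Here is what the log shows: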

Oct 10 13:01:49 VM_200_107 mfsmaster[31003]: connection with client(ip:172.16.200.81) has been closed by peer
Oct 10 13:01:49 VM_200_107 mfsmaster[31003]: connection with client(ip:172.16.115.33) has been closed by peer
## The connection between the client and the master was dropped
Oct 4 08:01:28 VM_200_107 mfsmaster[2843]: connection with CS(172.16.200.102) has been closed by peer
Oct 4 08:01:28 VM_200_107 mfsmaster[2843]: connection with CS(172.16.200.103) has been closed by peer
Oct 4 08:01:28 VM_200_107 mfsmaster[2843]: connection with CS(172.16.200.104) has been closed by peer
## The connection between the ChunkServer and the Master was dropped
Oct 4 08:01:28 VM_200_107 mfsmaster[2843]: chunkserver disconnected - ip: 172.16.200.102, port: 9422, usedspace: 463146741760 (431.34 GiB), totalspace: 2054567444480 (1913.47 GiB)
Oct 4 08:01:30 VM_200_107 mfsmaster[2843]: chunkserver disconnected - ip: 172.16.200.103, port: 9422, usedspace: 459528495104 (427.97 GiB), totalspace: 2054567444480 (1913.47 GiB)
Oct 4 08:01:31 VM_200_107 mfsmaster[2843]: chunkserver disconnected - ip: 172.16.200.104, port: 9422, usedspace: 461537153024 (429.84 GiB), totalspace: 2054567444480 (1913.47 GiB)
## The ChunkServers disconnected
Oct 10 13:01:52 VM_200_107 mfsmaster[31003]: connection with ML(127.0.0.1) has been closed by peer
## The connection between the Metalogger and the Master was dropped
Oct 10 13:01:52 VM_200_107 mfsmaster[31003]: chunkserver register begin (packet version: 5) - ip: 172.16.200.102, port: 9422
Oct 10 13:01:52 VM_200_107 mfsmetalogger[31700]: connection was reset by Master
Oct 10 13:01:53 VM_200_107 mfsmaster[31003]: chunkserver register begin (packet version: 5) - ip: 172.16.200.103, port: 9422
Oct 10 13:01:54 VM_200_107 mfsmaster[31003]: chunkserver register end (packet version: 5) - ip: 172.16.200.102, port: 9422, usedspace: 490665570304 (456.97 GiB), totalspace: 2054567444480
Oct 10 13:01:54 VM_200_107 mfsmaster[31003]: chunkserver register end (packet version: 5) - ip: 172.16.200.103, port: 9422, usedspace: 486941249536 (453.50 GiB), totalspace: 2054567444480
## The chunkservers reconnect
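
To check that the disconnects really cluster in the first minutes of each hour, the log can be tallied by minute-of-hour. The following is only a minimal Python sketch and was not part of the original troubleshooting. The function name `disconnects_by_minute` and the two message fragments it looks for are assumptions taken from the log lines shown here, so adjust them to whatever your own log contains.

```python
import re
from collections import Counter

# Matches syslog lines written by mfsmaster, for example:
# "Oct 10 13:01:49 VM_200_107 mfsmaster[31003]: connection with ... closed by peer"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}):(\d{2}):(\d{2})\s+\S+\s+mfsmaster\[\d+\]:\s+(.*)$"
)

# Message fragments treated as "a peer dropped off" (assumed from the log excerpt).
DISCONNECT_MARKERS = ("has been closed by peer", "chunkserver disconnected")


def disconnects_by_minute(lines):
    """Count mfsmaster disconnect messages per minute-of-hour (0-59)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an mfsmaster line
        minute = int(match.group(2))
        message = match.group(4)
        if any(marker in message for marker in DISCONNECT_MARKERS):
            counts[minute] += 1
    return counts


SAMPLE = [
    "Oct 10 13:01:49 VM_200_107 mfsmaster[31003]: connection with client(ip:172.16.200.81) has been closed by peer",
    "Oct 4 08:01:28 VM_200_107 mfsmaster[2843]: connection with CS(172.16.200.102) has been closed by peer",
    "Oct 10 13:01:52 VM_200_107 mfsmaster[31003]: connection with ML(127.0.0.1) has been closed by peer",
    "Oct 10 13:01:52 VM_200_107 mfsmaster[31003]: chunkserver register begin (packet version: 5) - ip: 172.16.200.102, port: 9422",
    "Oct 10 13:01:52 VM_200_107 mfsmetalogger[31700]: connection was reset by Master",
]

if __name__ == "__main__":
    for minute, count in sorted(disconnects_by_minute(SAMPLE).items()):
        print(f"minute {minute:02d}: {count} disconnect(s)")
```

If nearly all counts land on minutes 01 to 05, the trigger is something that runs on the hour rather than random network trouble.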

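At first I suspected a scheduled job (cron), but checking the scheduled tasks turned up nothing. Later I came across a post that explains: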

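Every hour on the hour, the master forks a child process that writes a snapshot of the in-memory data to disk. If the data set is small or the disk is fast, this does not affect the master's responsiveness.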

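Once the data set is large or the disk is busy (and the master is also serving a lot of requests), the snapshot-writing process keeps the disk busy, and the other master process blocks when it tries to write the changelog.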

The remedy is a faster disk (SSD) or more memory (so that the newly written snapshot does not have to be flushed to disk immediately).

Source: http://sourceforge.net/p/moosefs/mailman/message/34310363/
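
The fork-and-snapshot behaviour described in that post can be illustrated with a small Python sketch. This is not MooseFS code (the real master is written in C). The function `snapshot_with_fork`, the JSON format and the `file_a` / `file_b` entries are all hypothetical and only show the general pattern: the child writes the state as it was at fork time, while the parent keeps accepting changes and appending to its changelog. It needs a Unix-like system, since it relies on `os.fork`.

```python
import json
import os


def snapshot_with_fork(metadata, snapshot_path, changelog):
    """Dump `metadata` from a forked child while the parent keeps working.

    Returns True if the child exited cleanly after writing the snapshot.
    """
    pid = os.fork()
    if pid == 0:
        # Child: its copy of `metadata` is frozen at fork time (copy-on-write).
        status = 1
        try:
            with open(snapshot_path, "w") as fh:
                json.dump(metadata, fh)
                fh.flush()
                os.fsync(fh.fileno())  # this is the disk-heavy step
            status = 0
        finally:
            os._exit(status)  # never fall back into the parent's code path

    # Parent: keeps serving requests; a hypothetical change arrives meanwhile.
    metadata["file_b"] = {"chunks": 2}
    changelog.append("CREATE file_b")

    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
```

In this toy version the changelog is an in-memory list, so the parent never stalls. In the situation the post describes, the changelog write goes to the same busy disk as the snapshot, which is where the master gets blocked.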

I then increased the memory allocated to the MFS server in KVM, and the problem went away.