ceph的数据读写
本章介绍ceph的服务端OSD(书中简称OSD模块或者OSD)的实现。其对应的源代码在src/osd目录下。OSD模块是Ceph服务进程的核心实现,它实现了服务端的核心功能。本章先介绍OSD模块静态类图相关数据结构,再着重介绍服务端数据的写入和读取流程。
OSD进程启动的主函数位于: src/ceph_osd.cc
1. OSD模块静态类图
OSD模块的静态类图如下图6-1所示:
OSD模块的核心类及其之间的静态类图说明如下:
-
类OSD和OSDService是核心类,处理一个osd节点层面的工作。在早期的版本中,OSD和OSDService是一个类。由于OSD的类承载了太多的功能,后面的版本中引入OSDService类,分担一部分原OSD类的功能;
-
类PG处理PG相关的状态维护以及实现PG层面的基本功能。其核心功能是用boost库的statechart状态机来实现的PG状态转换;
-
类ReplicatedPG继承了类PG,在其基础上实现了PG内的数据读写以及数据恢复相关的操作;
-
类PGBackend的主要功能是把数据以事务的形式同步到一个PG其他从OSD节点上
- PGBackend的内部类PGTransaction就是同步的事务接口,其两个类型的实现分别对应RPGTransaction和ECTransaction两个子类;
- PGBackend两个子类ReplicatedBackend和ECBackend分别对应PG的两种类型的实现
- 类SnapMapper额外保存对象和对象的快照信息,在对象的属性里保存了相关的快照信息。这里保存的快照信息为冗余信息,用于数据校验。
下面我们再来看一下OSD运行起来之后有哪些工作线程:
# ps -ef | grep osd root 12025 1 27 2019 ? 255-07:47:08 /usr/bin/ceph-osd -f --cluster ceph --id 12 --setuser root --setgroup root root 12035 1 32 2019 ? 304-21:34:58 /usr/bin/ceph-osd -f --cluster ceph --id 9 --setuser root --setgroup root root 12085 1 32 2019 ? 304-15:44:09 /usr/bin/ceph-osd -f --cluster ceph --id 16 --setuser root --setgroup root root 12095 1 16 2019 ? 156-18:47:42 /usr/bin/ceph-osd -f --cluster ceph --id 10 --setuser root --setgroup root root 12115 1 34 2019 ? 319-13:58:10 /usr/bin/ceph-osd -f --cluster ceph --id 14 --setuser root --setgroup root root 12144 1 28 2019 ? 267-08:09:04 /usr/bin/ceph-osd -f --cluster ceph --id 15 --setuser root --setgroup root root 12155 1 30 2019 ? 278-05:29:00 /usr/bin/ceph-osd -f --cluster ceph --id 17 --setuser root --setgroup root root 12165 1 28 2019 ? 259-21:16:13 /usr/bin/ceph-osd -f --cluster ceph --id 13 --setuser root --setgroup root root 37770 1 41 Mar22 ? 103-17:30:20 /usr/bin/ceph-osd -f --cluster ceph --id 11 --setuser root --setgroup root root 3475542 3475501 0 10:58 pts/0 00:00:00 grep --color=auto osd # ps -T -p 12025 PID SPID TTY TIME CMD 12025 12025 ? 00:00:02 ceph-osd 12025 12174 ? 12:33:10 log 12025 12215 ? 00:04:04 service 12025 12225 ? 00:00:00 admin_socket 12025 12244 ? 00:06:17 ceph-osd 12025 12245 ? 00:07:36 ms_reaper 12025 12264 ? 00:00:01 ms_reaper 12025 12274 ? 00:00:00 ms_reaper 12025 12284 ? 00:00:00 ms_reaper 12025 12294 ? 00:00:09 ms_reaper 12025 12304 ? 00:00:00 ms_reaper 12025 12314 ? 00:42:13 safe_timer 12025 12315 ? 00:59:27 safe_timer 12025 12325 ? 00:00:00 safe_timer 12025 13345 ? 03:01:45 wb_throttle 12025 13354 ? 19:41:26 filestore_sync 12025 13365 ? 17:49:42 journal_write 12025 13374 ? 10:43:32 journal_wrt_fin 12025 13375 ? 05:48:26 fn_jrn_objstore 12025 13384 ? 3-09:00:53 tp_fstore_op 12025 13385 ? 3-08:57:35 tp_fstore_op 12025 13394 ? 3-08:55:45 tp_fstore_op 12025 13395 ? 3-08:54:53 tp_fstore_op 12025 13404 ? 3-08:57:47 tp_fstore_op 12025 13405 ? 06:36:03 fn_odsk_fstore 12025 13414 ? 11:07:32 fn_appl_fstore 12025 13415 ? 00:00:03 safe_timer 12025 14224 ? 00:13:53 ms_dispatch 12025 14225 ? 00:00:00 ms_local 12025 14234 ? 00:07:16 ms_accepter 12025 14235 ? 01:53:03 ms_dispatch 12025 14244 ? 00:00:00 ms_local 12025 14254 ? 00:00:01 ms_dispatch 12025 14255 ? 00:00:00 ms_local 12025 14264 ? 00:00:00 ms_dispatch 12025 14265 ? 00:00:00 ms_local 12025 14275 ? 00:00:00 ms_dispatch 12025 14284 ? 00:00:00 ms_local 12025 14294 ? 00:00:00 ms_dispatch 12025 14295 ? 00:00:00 ms_local 12025 14304 ? 00:04:38 safe_timer 12025 14305 ? 00:00:00 fn_anonymous 12025 14314 ? 04:35:33 tp_osd 12025 14315 ? 04:35:27 tp_osd 12025 14334 ? 23-09:11:24 tp_osd_tp 12025 14335 ? 21-04:24:33 tp_osd_tp 12025 14344 ? 23-13:28:19 tp_osd_tp 12025 14345 ? 20-14:27:07 tp_osd_tp 12025 14354 ? 20-09:15:53 tp_osd_tp 12025 14355 ? 23-08:43:44 tp_osd_tp 12025 14364 ? 21-04:21:41 tp_osd_tp 12025 14365 ? 23-13:24:38 tp_osd_tp 12025 14374 ? 20-15:07:18 tp_osd_tp 12025 14375 ? 20-09:12:04 tp_osd_tp 12025 14384 ? 3-21:58:35 tp_osd_recov 12025 14385 ? 00:22:46 tp_osd_disk 12025 14394 ? 00:07:39 tp_osd_cmd 12025 14395 ? 04:26:42 osd_srv_heartbt 12025 14404 ? 00:00:00 fn_anonymous 12025 14405 ? 00:00:00 fn_anonymous 12025 14414 ? 00:00:00 safe_timer 12025 14415 ? 00:00:00 safe_timer 12025 14424 ? 00:00:00 safe_timer 12025 14425 ? 00:00:00 safe_timer 12025 14434 ? 00:00:01 osd_srv_agent 12025 14445 ? 00:00:00 sginal_handler 12025 4936 ? 1-00:04:18 tp_fstore_op 12025 2457356 ? 00:00:14 ms_pipe_write 12025 2457359 ? 00:00:10 ms_pipe_read 12025 2570326 ? 00:00:00 ms_accepter 12025 2570327 ? 00:00:00 ms_accepter 12025 2570328 ? 00:00:00 ms_accepter 12025 2570366 ? 00:00:22 ms_pipe_write 12025 2570367 ? 00:00:22 ms_pipe_write 12025 2570371 ? 00:00:18 ms_pipe_read 12025 2570372 ? 00:00:18 ms_pipe_read 12025 2570374 ? 00:00:22 ms_pipe_write 12025 2570375 ? 00:00:22 ms_pipe_write 12025 2570390 ? 00:00:16 ms_pipe_write 12025 2570392 ? 00:00:17 ms_pipe_read 12025 2570394 ? 00:00:16 ms_pipe_write 12025 2570395 ? 00:00:16 ms_pipe_read 12025 2570396 ? 00:00:16 ms_pipe_write 12025 2570398 ? 00:00:16 ms_pipe_write 12025 2570400 ? 00:00:16 ms_pipe_write 12025 2570401 ? 00:00:16 ms_pipe_write 12025 2570402 ? 00:00:16 ms_pipe_write 12025 2570405 ? 00:00:16 ms_pipe_write 12025 2570412 ? 00:00:18 ms_pipe_read 12025 2570415 ? 00:00:17 ms_pipe_read 12025 2570416 ? 00:00:17 ms_pipe_read 12025 2570417 ? 00:00:16 ms_pipe_read 12025 2570419 ? 00:00:16 ms_pipe_read 12025 2570423 ? 00:00:16 ms_pipe_read 12025 2570424 ? 00:00:16 ms_pipe_read 12025 2570425 ? 00:00:16 ms_pipe_read 12025 2570454 ? 00:00:31 ms_pipe_read 12025 2570455 ? 00:00:30 ms_pipe_read 12025 2570456 ? 00:00:23 ms_pipe_read 12025 2570457 ? 00:00:16 ms_pipe_write 12025 2570458 ? 00:00:16 ms_pipe_write 12025 2570459 ? 00:00:16 ms_pipe_write 12025 2570460 ? 00:03:24 ms_pipe_write 12025 2570461 ? 00:05:24 ms_pipe_write 12025 2570462 ? 00:03:45 ms_pipe_write 12025 2570463 ? 00:03:34 ms_pipe_read 12025 2570464 ? 00:03:52 ms_pipe_read 12025 2570465 ? 00:06:45 ms_pipe_read 12025 2570466 ? 00:00:30 ms_pipe_read 12025 2570467 ? 00:00:31 ms_pipe_read 12025 2570468 ? 00:04:03 ms_pipe_read 12025 2570469 ? 00:00:16 ms_pipe_write 12025 2570470 ? 00:00:17 ms_pipe_write 12025 2570471 ? 00:02:05 ms_pipe_write 12025 2570472 ? 00:00:30 ms_pipe_read 12025 2570473 ? 00:00:31 ms_pipe_read 12025 2570474 ? 00:00:31 ms_pipe_read 12025 2570475 ? 00:00:30 ms_pipe_read 12025 2570476 ? 00:00:16 ms_pipe_write 12025 2570477 ? 00:00:17 ms_pipe_write 12025 2570479 ? 00:00:17 ms_pipe_write 12025 2570480 ? 00:00:16 ms_pipe_write 12025 2570481 ? 00:04:04 ms_pipe_read 12025 2570483 ? 00:02:12 ms_pipe_write 12025 2570484 ? 00:00:31 ms_pipe_read 12025 2570485 ? 00:00:30 ms_pipe_read 12025 2570486 ? 00:00:17 ms_pipe_write 12025 2570487 ? 00:00:17 ms_pipe_write 12025 2570491 ? 00:05:22 ms_pipe_write 12025 2570493 ? 00:00:10 ms_pipe_write 12025 2570494 ? 00:02:16 ms_pipe_write 12025 2570495 ? 00:03:15 ms_pipe_write 12025 2570496 ? 00:02:55 ms_pipe_write 12025 2570497 ? 00:05:12 ms_pipe_read 12025 2570499 ? 00:03:29 ms_pipe_read 12025 2570500 ? 00:04:14 ms_pipe_read 12025 2570501 ? 00:03:05 ms_pipe_read 12025 2570503 ? 00:00:14 ms_pipe_read 12025 2570504 ? 00:00:32 ms_pipe_read 12025 2570505 ? 00:00:30 ms_pipe_read 12025 2570507 ? 00:00:16 ms_pipe_write 12025 2570508 ? 00:00:16 ms_pipe_write 12025 2570514 ? 00:00:31 ms_pipe_read 12025 2570515 ? 00:00:30 ms_pipe_read 12025 2570516 ? 00:00:16 ms_pipe_write 12025 2570517 ? 00:00:16 ms_pipe_write 12025 2570520 ? 00:00:31 ms_pipe_read 12025 2570521 ? 00:00:30 ms_pipe_read 12025 2570522 ? 00:00:17 ms_pipe_write 12025 2570523 ? 00:00:16 ms_pipe_write 12025 2570524 ? 00:00:31 ms_pipe_read 12025 2570525 ? 00:00:30 ms_pipe_read 12025 2570526 ? 00:00:17 ms_pipe_write 12025 2570527 ? 00:00:16 ms_pipe_write 12025 2570535 ? 00:00:31 ms_pipe_read 12025 2570536 ? 00:00:30 ms_pipe_read 12025 2570537 ? 00:00:17 ms_pipe_write 12025 2570538 ? 00:00:16 ms_pipe_write 12025 2570539 ? 00:00:30 ms_pipe_read 12025 2570540 ? 00:00:31 ms_pipe_read 12025 2570541 ? 00:00:16 ms_pipe_write 12025 2570542 ? 00:00:16 ms_pipe_write 12025 2570543 ? 00:00:31 ms_pipe_read 12025 2570544 ? 00:00:31 ms_pipe_read 12025 2570545 ? 00:00:17 ms_pipe_write 12025 2570546 ? 00:00:16 ms_pipe_write 12025 2570547 ? 00:00:30 ms_pipe_read 12025 2570548 ? 00:00:31 ms_pipe_read 12025 2570549 ? 00:00:16 ms_pipe_write 12025 2570550 ? 00:00:17 ms_pipe_write 12025 2570551 ? 00:00:30 ms_pipe_read 12025 2570552 ? 00:00:31 ms_pipe_read 12025 2570553 ? 00:00:17 ms_pipe_write 12025 2570554 ? 00:00:16 ms_pipe_write 12025 2570569 ? 00:00:00 ms_pipe_write 12025 2570570 ? 00:00:00 ms_pipe_read 12025 2570574 ? 00:00:00 ms_pipe_read 12025 2570577 ? 00:00:00 ms_pipe_write 12025 2570585 ? 00:00:00 ms_pipe_write 12025 2570589 ? 00:00:00 ms_pipe_read 12025 2570591 ? 00:00:00 ms_pipe_read 12025 2570592 ? 00:00:00 ms_pipe_write 12025 2570603 ? 00:00:31 ms_pipe_read 12025 2570604 ? 00:00:30 ms_pipe_read 12025 2570605 ? 00:00:17 ms_pipe_write 12025 2570606 ? 00:00:17 ms_pipe_write 12025 2570609 ? 00:05:17 ms_pipe_read 12025 2570610 ? 00:03:40 ms_pipe_write 12025 2570616 ? 00:05:06 ms_pipe_read 12025 2570617 ? 00:04:35 ms_pipe_write 12025 2570738 ? 00:00:00 ms_pipe_read 12025 2570739 ? 00:00:00 ms_pipe_write 12025 2570872 ? 00:00:00 ms_pipe_read 12025 2570873 ? 00:00:00 ms_pipe_write 12025 3145848 ? 00:00:13 ms_pipe_read 12025 3145849 ? 00:00:05 ms_pipe_write 12025 3245901 ? 00:00:06 ms_pipe_write 12025 3252563 ? 00:00:10 ms_pipe_read 12025 3254659 ? 00:00:15 ms_pipe_read 12025 3254660 ? 00:00:10 ms_pipe_write 12025 3313166 ? 00:01:23 ms_pipe_read 12025 3313167 ? 00:00:56 ms_pipe_write 12025 3313612 ? 00:00:00 ms_pipe_read 12025 3313613 ? 00:00:00 ms_pipe_write 12025 3313728 ? 00:00:00 ms_pipe_read 12025 3313729 ? 00:00:00 ms_pipe_write 12025 3313797 ? 00:00:00 ms_pipe_read 12025 3313798 ? 00:00:00 ms_pipe_write 12025 3313799 ? 00:00:00 ms_pipe_read 12025 3313800 ? 00:00:00 ms_pipe_write 12025 3313911 ? 00:00:00 ms_pipe_read 12025 3313912 ? 00:00:00 ms_pipe_write 12025 3314037 ? 00:00:00 ms_pipe_read 12025 3314038 ? 00:00:00 ms_pipe_write 12025 3314075 ? 00:00:00 ms_pipe_read 12025 3314076 ? 00:00:00 ms_pipe_write 12025 3377265 ? 00:00:01 ms_pipe_read 12025 3377266 ? 00:00:01 ms_pipe_write 12025 3377624 ? 00:00:00 ms_pipe_read 12025 3377625 ? 00:00:00 ms_pipe_write 12025 3377933 ? 00:00:00 ms_pipe_read 12025 3377934 ? 00:00:00 ms_pipe_write 12025 3378058 ? 00:00:00 ms_pipe_read 12025 3378059 ? 00:00:00 ms_pipe_write 12025 3378498 ? 00:00:00 ms_pipe_read 12025 3378499 ? 00:00:00 ms_pipe_write 12025 3378629 ? 00:00:00 ms_pipe_read 12025 3378630 ? 00:00:00 ms_pipe_write 12025 3389562 ? 00:00:03 ms_pipe_read 12025 3389563 ? 00:00:03 ms_pipe_read 12025 3389564 ? 00:00:03 ms_pipe_write 12025 3389566 ? 00:00:03 ms_pipe_write 12025 3389568 ? 00:00:03 ms_pipe_read 12025 3389569 ? 00:00:03 ms_pipe_read 12025 3389570 ? 00:00:02 ms_pipe_write 12025 3389571 ? 00:00:02 ms_pipe_write 12025 3389575 ? 00:00:03 ms_pipe_read 12025 3389576 ? 00:00:03 ms_pipe_read 12025 3389578 ? 00:00:03 ms_pipe_write 12025 3389580 ? 00:00:03 ms_pipe_write 12025 3389584 ? 00:00:02 ms_pipe_write 12025 3389585 ? 00:00:02 ms_pipe_write 12025 3389586 ? 00:00:02 ms_pipe_write 12025 3389587 ? 00:00:02 ms_pipe_write 12025 3389588 ? 00:00:02 ms_pipe_write 12025 3389589 ? 00:00:02 ms_pipe_write 12025 3389590 ? 00:00:02 ms_pipe_write 12025 3389591 ? 00:00:02 ms_pipe_write 12025 3389592 ? 00:00:02 ms_pipe_write 12025 3389593 ? 00:00:02 ms_pipe_write 12025 3389594 ? 00:00:03 ms_pipe_read 12025 3389595 ? 00:00:03 ms_pipe_read 12025 3389596 ? 00:00:02 ms_pipe_write 12025 3389597 ? 00:00:02 ms_pipe_read 12025 3389598 ? 00:00:02 ms_pipe_read 12025 3389599 ? 00:00:02 ms_pipe_write 12025 3389600 ? 00:00:03 ms_pipe_read 12025 3389601 ? 00:00:02 ms_pipe_write 12025 3389602 ? 00:00:03 ms_pipe_read 12025 3389603 ? 00:00:03 ms_pipe_read 12025 3389604 ? 00:00:03 ms_pipe_read 12025 3389605 ? 00:00:02 ms_pipe_write 12025 3389606 ? 00:00:02 ms_pipe_read 12025 3389607 ? 00:00:02 ms_pipe_write 12025 3389608 ? 00:00:01 ms_pipe_read 12025 3389609 ? 00:00:03 ms_pipe_read 12025 3389610 ? 00:00:02 ms_pipe_write 12025 3389611 ? 00:00:02 ms_pipe_write 12025 3389612 ? 00:00:02 ms_pipe_write 12025 3389613 ? 00:00:02 ms_pipe_read 12025 3389614 ? 00:00:02 ms_pipe_read 12025 3389615 ? 00:00:02 ms_pipe_read 12025 3389616 ? 00:00:02 ms_pipe_read 12025 3389617 ? 00:00:02 ms_pipe_read 12025 3389618 ? 00:00:02 ms_pipe_write 12025 3389619 ? 00:00:02 ms_pipe_write 12025 3389620 ? 00:00:02 ms_pipe_write 12025 3389621 ? 00:00:02 ms_pipe_write 12025 3389622 ? 00:00:02 ms_pipe_write 12025 3389623 ? 00:00:02 ms_pipe_write 12025 3389624 ? 00:00:02 ms_pipe_read 12025 3389625 ? 00:00:02 ms_pipe_read 12025 3389626 ? 00:00:03 ms_pipe_read 12025 3389627 ? 00:00:02 ms_pipe_read 12025 3389628 ? 00:00:02 ms_pipe_read 12025 3389629 ? 00:00:03 ms_pipe_read 12025 3389630 ? 00:00:03 ms_pipe_read 12025 3389631 ? 00:00:03 ms_pipe_read 12025 3391295 ? 00:00:00 ms_pipe_write 12025 3393933 ? 00:00:00 ms_pipe_read 12025 3393934 ? 00:00:00 ms_pipe_write 12025 3404012 ? 00:00:00 ms_pipe_read 12025 3404018 ? 00:00:00 ms_pipe_read 12025 3404019 ? 00:00:00 ms_pipe_write 12025 3405549 ? 00:00:00 ms_pipe_read 12025 3405550 ? 00:00:00 ms_pipe_write 12025 3406486 ? 00:00:19 ms_pipe_read 12025 3406488 ? 00:00:17 ms_pipe_write 12025 3406506 ? 00:00:02 ms_pipe_write 12025 3406507 ? 00:00:02 ms_pipe_write 12025 3406508 ? 00:00:02 ms_pipe_read 12025 3406509 ? 00:00:02 ms_pipe_read 12025 3425139 ? 00:00:00 ms_pipe_read 12025 3425140 ? 00:00:00 ms_pipe_write 12025 3444471 ? 00:00:00 ms_pipe_read 12025 3444472 ? 00:00:00 ms_pipe_write 12025 3451139 ? 00:00:00 ms_pipe_read 12025 3451140 ? 00:00:00 ms_pipe_write 12025 3458441 ? 00:00:00 ms_pipe_read 12025 3458442 ? 00:00:00 ms_pipe_write 12025 3462283 ? 00:00:00 ms_pipe_read 12025 3462284 ? 00:00:00 ms_pipe_write 12025 3463676 ? 00:00:00 ms_pipe_read 12025 3463677 ? 00:00:00 ms_pipe_write 12025 3464147 ? 00:00:00 ms_pipe_read 12025 3464148 ? 00:00:00 ms_pipe_write 12025 3464195 ? 00:00:00 ms_pipe_read 12025 3464196 ? 00:00:00 ms_pipe_write 12025 3464805 ? 00:00:00 ms_pipe_read 12025 3464806 ? 00:00:00 ms_pipe_write 12025 3465173 ? 00:00:00 ms_pipe_read 12025 3465174 ? 00:00:00 ms_pipe_write 12025 3465338 ? 00:00:00 ms_pipe_read 12025 3465339 ? 00:00:00 ms_pipe_write 12025 3465699 ? 00:00:00 ms_pipe_read 12025 3465700 ? 00:00:00 ms_pipe_write 12025 3466303 ? 00:00:00 ms_pipe_read 12025 3466304 ? 00:00:00 ms_pipe_write 12025 3466658 ? 00:00:00 ms_pipe_read 12025 3466659 ? 00:00:00 ms_pipe_write 12025 3466900 ? 00:00:00 ms_pipe_read 12025 3466901 ? 00:00:00 ms_pipe_write 12025 3467135 ? 00:00:00 ms_pipe_read 12025 3467136 ? 00:00:00 ms_pipe_write 12025 3467349 ? 00:00:00 ms_pipe_read 12025 3467350 ? 00:00:00 ms_pipe_write 12025 3467741 ? 00:00:00 ms_pipe_read 12025 3467742 ? 00:00:00 ms_pipe_write 12025 3467910 ? 00:00:00 ms_pipe_read 12025 3467911 ? 00:00:00 ms_pipe_write 12025 3468411 ? 00:00:00 ms_pipe_read 12025 3468412 ? 00:00:00 ms_pipe_write 12025 3468989 ? 00:00:00 ms_pipe_read 12025 3468990 ? 00:00:00 ms_pipe_write 12025 3469082 ? 00:00:00 ms_pipe_read 12025 3469083 ? 00:00:00 ms_pipe_write 12025 3469340 ? 00:00:00 ms_pipe_read 12025 3469341 ? 00:00:00 ms_pipe_write 12025 3469535 ? 00:00:00 ms_pipe_read 12025 3469536 ? 00:00:00 ms_pipe_write 12025 3469609 ? 00:00:00 ms_pipe_read 12025 3469610 ? 00:00:00 ms_pipe_write 12025 3469636 ? 00:00:00 ms_pipe_read 12025 3469637 ? 00:00:00 ms_pipe_write 12025 3469664 ? 00:00:00 ms_pipe_read 12025 3469665 ? 00:00:00 ms_pipe_write 12025 3469893 ? 00:00:00 ms_pipe_read 12025 3469894 ? 00:00:00 ms_pipe_write 12025 3469895 ? 00:00:00 ms_pipe_read 12025 3469896 ? 00:00:00 ms_pipe_write 12025 3470564 ? 00:00:00 ms_pipe_read 12025 3470565 ? 00:00:00 ms_pipe_write 12025 3470848 ? 00:00:00 ms_pipe_read 12025 3470849 ? 00:00:00 ms_pipe_write 12025 3470865 ? 00:00:00 ms_pipe_read 12025 3470866 ? 00:00:00 ms_pipe_write 12025 3470884 ? 00:00:00 ms_pipe_read 12025 3470885 ? 00:00:00 ms_pipe_write 12025 3471331 ? 00:00:00 ms_pipe_read 12025 3471332 ? 00:00:00 ms_pipe_write 12025 3471456 ? 00:00:00 ms_pipe_read 12025 3471457 ? 00:00:00 ms_pipe_write 12025 3472409 ? 00:00:00 ms_pipe_read 12025 3472410 ? 00:00:00 ms_pipe_write 12025 3472415 ? 00:00:00 ms_pipe_read 12025 3472416 ? 00:00:00 ms_pipe_write 12025 3472743 ? 00:00:00 ms_pipe_read 12025 3472744 ? 00:00:00 ms_pipe_write 12025 3472908 ? 00:00:00 ms_pipe_read 12025 3472909 ? 00:00:00 ms_pipe_write 12025 3473128 ? 00:00:00 ms_pipe_read 12025 3473129 ? 00:00:00 ms_pipe_write 12025 3473252 ? 00:00:00 ms_pipe_read 12025 3473253 ? 00:00:00 ms_pipe_write 12025 3473800 ? 00:00:00 ms_pipe_read 12025 3473801 ? 00:00:00 ms_pipe_write 12025 3473945 ? 00:00:00 ms_pipe_read 12025 3473946 ? 00:00:00 ms_pipe_write 12025 3474118 ? 00:00:00 ms_pipe_read 12025 3474119 ? 00:00:00 ms_pipe_write 12025 3474267 ? 00:00:00 ms_pipe_read 12025 3474268 ? 00:00:00 ms_pipe_write 12025 3474418 ? 00:00:00 ms_pipe_read 12025 3474419 ? 00:00:00 ms_pipe_write 12025 3474429 ? 00:00:00 ms_pipe_read 12025 3474430 ? 00:00:00 ms_pipe_write 12025 3474448 ? 00:00:00 ms_pipe_read 12025 3474449 ? 00:00:00 ms_pipe_write 12025 3474722 ? 00:00:00 ms_pipe_read 12025 3474723 ? 00:00:00 ms_pipe_write 12025 3474797 ? 00:00:00 ms_pipe_read 12025 3474798 ? 00:00:00 ms_pipe_write 12025 3474934 ? 00:00:00 ms_pipe_read 12025 3474938 ? 00:00:00 ms_pipe_write 12025 3474975 ? 00:00:00 ms_pipe_read 12025 3474976 ? 00:00:00 ms_pipe_write 12025 3475220 ? 00:00:00 ms_pipe_read 12025 3475221 ? 00:00:00 ms_pipe_write 12025 3475261 ? 00:00:00 ms_pipe_read 12025 3475262 ? 00:00:00 ms_pipe_write 12025 3475306 ? 00:00:00 ms_pipe_read 12025 3475307 ? 00:00:00 ms_pipe_write 12025 3475310 ? 00:00:00 ms_pipe_read 12025 3475311 ? 00:00:00 ms_pipe_write 12025 3475434 ? 00:00:00 ms_pipe_read 12025 3475435 ? 00:00:00 ms_pipe_write 12025 3475470 ? 00:00:00 ms_pipe_read 12025 3475471 ? 00:00:00 ms_pipe_write
上面总共有418个线程,我们可以使用strace
跟踪其中一个线程的运行,例如,我们可以跟踪14234
这个ms_accepter线程接收客户端链接:
# strace -p 32726 -e trace=accept
2. 相关数据结构
下面将介绍OSD模块相关的一些核心的数据结构。从最高的逻辑层次为pool的概念,然后是PG的概念。其次是OSDMap记录了集群的所有的配置信息。数据结构OSDOp是一个操作上下文的封装。结构object_info_t保存了一个对象的元数据信息和访问信息。对象ObjectState是在object_info_t的基础上添加了一些内存的状态信息。SnapSetContext和ObjectContext分别保存了快照和对象的上下文相关的信息。Session保存了一个端到端的链接相关的上下文信息。
2.1 Pool
Pool是整个集群层面定义的一个逻辑的存储池。对一个Pool可以设置相应的数据冗余类型,目前有副本和纠删码两种实现。数据结构pg_pool_t用于保存Pool的相关信息。Pool的数据结构如下(src/osd/osd_types.h):
-
type: 定义了Pool的类型,目前有replication和ErasureCode两种类型;
-
size和min_size定义了Pool的冗余模式
- 如果是replication模式,size定义了副本数目,min_size为副本的最小数目。例如:如果size设置为3,副本数为3,min_size设置为1就只允许两副本损坏。
- 如果是Erasure Code(M+N),size是总的分片数M+N; min_size是实际数据的分片数M。
-
crush_ruleset: Pool对应的crush规则号
-
erasure_code_profile: EC的配置方式
-
object_hash: 通过对象名映射到PG的hash函数
-
pg_num: Pool里的PG的数量
通过上面的介绍可以了解到,Pool根据类型不同,定义了两种模式,分别保存了两种模式相关的参数。此外,在结构pg_pool_t里还定义了Pool级别的快照相关的数据结构、Cache Tier相关的数据结构,以及其他一些统计信息。在介绍快照(参见第9章)和Cache Tier(参见第13章)时再详细介绍相关的字段。
2.2 PG
PG可以认为是一组对象的集合,该集合里的对象有共同的特征: 副本都分布在相同的OSD列表中。PG的数据结构如下(src/osd/osd_types.h):
结构体pg_t只是一个PG的静态描述信息。类PG及其子类ReplicatedPG都是和PG相关的处理。
数据结构spt_t在pg_t的基础上,增加了一个shard_id字段,代表了该PG所在的OSD在对应的OSD列表中的序号。例如,现在有一个PG 2.f,其映射到的OSD为(2,5,8),那么在OSD5上该PG对应的shard值就为1。
在Erasure Code模式下,该字段保存了每个分片的序号。该序号在EC的数据encode和decode过程中很关键;对于副本模式,该字段没有意义,都设置为shard_id_t::NO_SHARD值。
PG的分裂
当一个pool里的PG数量不够时,系统允许通过增加PG的数量,就会产生PG的分裂,使得一个PG分裂为2的幂次方个PG。PG分裂后,新的PG和其父PG的OSD列表是一致的,其数据的移动也是本地数据的移动,开销比较小。
2.3 OSDMap
类OSDMap定义了Ceph整个集群的全局信息。它由Monitor实现管理,并以全量或者增量的方式向整个集群扩散。每一个epoch对应的OSDMap都需要持久化保存在meta下对应对象的omap属性中。
下面介绍OSDMap核心成员,内部类Incremental以增量的形式保存了OSDMap新增的信息,其内部成员和OSDMap类似,这里就不介绍了。代码位于src/osd/OSDMap.h:
通过OSDMap数据成员的了解,可以看到,OSDMap包含了四类信息:首先是集群的信息;其次是pool相关的信息;然后是临时PG相关的信息;最后就是所有OSD的状态信息。
2.4 OSDOp
类MOSDOp封装了一些基本操作先关的数据(src/messages/MOSDOp.h):
MOSDOp在其成员ops向量里封装了多个类型为OSDOp操作数据。MOSDOp封装的操作都是关于对象oid相关的操作,一个MOSDOp只封装针对同一个对象oid的操作。但是对于rados_clone_range这样的操作,需要有一个目标对象oid,还有一个源对象oid,那么源对象的oid就保存在结构OSDOp里。
数据结构OSDOp封装了一个OSD操作需要的数据和元数据(src/osd/osd_types.h):
2.5 object_info_t
结构object_info_t保存了一个对象的元数据信息和访问信息(src/osd/osd_types.h)。其作为对象的一个属性,持久化保存在对象xattr中,对应的key为OI_ATTR(“ _”),value就是object_info_t的encode后的数据。
2.6 ObjectState
对象ObjectState是在object_info_t的基础上添加了一个字段exists,用来标记对象是否存在(src/osd/osd_types.h):
为什么要加一个额外的bool变量来标记呢?因为object_info_t可能是从缓存的attrs[OI_ATTR]中获取的,并不能确定对象是否存在。
2.7 SnapSetContext
SnapSetContext保存了快照的相关信息,即SnapSet的上下文信息。关于SnapSet的内容,可以参考快照相关的介绍():
2.8 ObjectContext
ObjectContext可以说是对象在内存中的一个管理类,保存了一个对象的上下文信息:
下面两个字段比较难理解,进行一些补充说明:
-
blocked_by记录了当前对象被其他对象阻塞,blocking记录了本对象阻塞其他对象的情况。当一个对象的写操作依赖其他对象时,就会出现这些情况。这一般对应一个操作涉及多个对象,比如copy操作。吧对象obj1上的部分数据拷贝到对象obj2,如果源对象obj1处于missing状态,需要恢复,那么obj2对象就block了obj1对象。
-
内部类RWState通过定义了4种状态,实现了对对象的读写加锁。
2.9 Session
类Session是和Connection相关的一个类,用于保存Connection上下文相关的信息(src/osd/osd.h):
函数update_waiting_for_pg()用于检查是否有最新的osdmap:
1) 如果该PG有分裂的PG,就把分裂出的新的PG以及对应的OpRequest加入到Session的waiting_for_pg队列里;
2) 如果该PG不分裂,就不把PG和OpRequest加入到waiting_for_pg队列里。
2.10 ShardedOpWQ
OSD在进行数据读写等操作时,通常要求一个PG内的事务具有严格的顺序性,而不同PG间事务是可以并发执行的,ShardedOpWQ用于实现此一功能(src/osd/OSD.h):
这样我们一个shardData一个线程,确保了单个PG的顺序性(当然,一个shardData多个线程的话,我们可以通过ShardData::lock来保证顺序性)。
[参看]