Ceph Cache Pools

2021-03-10

I. Configure CRUSH device classes

1. Create the ssd class

By default, every OSD in our cluster has the device class hdd:

# ceph osd crush class ls
[
    "hdd"
]

Check the current OSD layout:

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME                  STATUS REWEIGHT PRI-AFF 
-8             0 root cache                                         
-7             0     host 192.168.3.9-cache                         
-1       0.37994 root default                                       
-2             0     host 192.168.3.9                               
-5       0.37994     host kolla-cloud                               
 0   hdd 0.10999         osd.0                  up  1.00000 1.00000 
 1   hdd 0.10999         osd.1                  up  1.00000 1.00000 
 2   hdd 0.10999         osd.2                  up  1.00000 1.00000 
 3   hdd 0.04999         osd.3                  up  1.00000 1.00000

Remove osd.3 from the hdd class:

# ceph osd crush rm-device-class osd.3
done removing class of osd(s): 3

Add osd.3 to the ssd class:

# ceph osd crush set-device-class ssd osd.3
set osd(s) 3 to class 'ssd'

Once that is done, check the OSD layout again:

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME                  STATUS REWEIGHT PRI-AFF 
-8             0 root cache                                         
-7             0     host 192.168.3.9-cache                         
-1       0.37994 root default                                       
-2             0     host 192.168.3.9                               
-5       0.37994     host kolla-cloud                               
 0   hdd 0.10999         osd.0                  up  1.00000 1.00000 
 1   hdd 0.10999         osd.1                  up  1.00000 1.00000 
 2   hdd 0.10999         osd.2                  up  1.00000 1.00000 
 3   ssd 0.04999         osd.3                  up  1.00000 1.00000

The class of osd.3 is now ssd.

Listing the CRUSH classes again shows a new class named ssd:

# ceph osd crush class ls
[
    "hdd",
    "ssd"
]
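
The two-step reclassification above can be wrapped in a small helper. This is only a sketch: reclass_cmds is a hypothetical function name, and it prints the commands instead of running them so they can be reviewed first. The removal step comes first because Ceph refuses to set a different class on an OSD that already has one.

```shell
# reclass_cmds is a hypothetical helper, not part of Ceph.
# It prints (does not run) the commands that move one OSD to a new device class.
reclass_cmds() {
    osd="$1"      # OSD name, e.g. osd.3
    class="$2"    # target device class, e.g. ssd
    # The existing class has to be removed before a different one can be set.
    printf 'ceph osd crush rm-device-class %s\n' "$osd"
    printf 'ceph osd crush set-device-class %s %s\n' "$class" "$osd"
}

# Print the commands for the OSD used in this article.
reclass_cmds osd.3 ssd
```

To actually apply the change, the output could be piped to a shell (reclass_cmds osd.3 ssd | sh) after checking it.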

2. Create a CRUSH rule based on the ssd class

Create a replicated rule named ssd_rule that selects only OSDs in the ssd class:

# ceph osd crush rule create-replicated ssd_rule default host ssd

List the cluster's rules:

# ceph osd crush rule ls 
replicated_rule
disks
ssd_rule
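
As a reading aid, the arguments of create-replicated are, in order: rule name, CRUSH root, failure domain type, and an optional device class. The sketch below only assembles and prints such a command; rule_cmd and the rule name hdd_rule are hypothetical and used purely for illustration.

```shell
# rule_cmd is a hypothetical helper that prints a create-replicated command.
# Arguments: <rule-name> <root> <failure-domain-type> [<device-class>]
rule_cmd() {
    name="$1"
    root="$2"
    domain="$3"
    class="${4:-}"    # optional; empty means "any device class"
    printf 'ceph osd crush rule create-replicated %s %s %s' "$name" "$root" "$domain"
    if [ -n "$class" ]; then
        printf ' %s' "$class"
    fi
    printf '\n'
}

# The rule created above: replicate across hosts under root "default", ssd OSDs only.
rule_cmd ssd_rule default host ssd
# A hypothetical hdd-only counterpart.
rule_cmd hdd_rule default host hdd
```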

The full CRUSH map can be exported, decompiled, and inspected as follows:

#  ceph osd getcrushmap -o crushmap 
26
# crushtool -d crushmap -o crushmap.txt
# cat crushmap.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host 192.168.3.9 {
        id -2           # do not change unnecessarily
        id -3 class hdd         # do not change unnecessarily
        id -13 class ssd                # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
}
host kolla-cloud {
        id -5           # do not change unnecessarily
        id -6 class hdd         # do not change unnecessarily
        id -14 class ssd                # do not change unnecessarily
        # weight 0.380
        alg straw2
        hash 0  # rjenkins1
        item osd.2 weight 0.110
        item osd.1 weight 0.110
        item osd.0 weight 0.110
        item osd.3 weight 0.050
}
root default {
        id -1           # do not change unnecessarily
        id -4 class hdd         # do not change unnecessarily
        id -15 class ssd                # do not change unnecessarily
        # weight 0.380
        alg straw2
        hash 0  # rjenkins1
        item 192.168.3.9 weight 0.000
        item kolla-cloud weight 0.380
}
host 192.168.3.9-cache {
        id -7           # do not change unnecessarily
        id -9 class hdd         # do not change unnecessarily
        id -11 class ssd                # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
}
root cache {
        id -8           # do not change unnecessarily
        id -10 class hdd                # do not change unnecessarily
        id -12 class ssd                # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
        item 192.168.3.9-cache weight 0.000
}

# rules
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule disks {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule ssd_rule {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map

In crushmap.txt, change the line "step take default" in rule disks to "step take default class hdd", so that the disks rule only selects hdd OSDs:

rule disks {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default class hdd
        step chooseleaf firstn 0 type host
        step emit
}

Recompile the CRUSH map and import it:

# crushtool -c crushmap.txt -o crushmap.new
# ceph osd setcrushmap -i crushmap.new

3. Create a pool that uses ssd_rule

Create a pool based on the ssd_rule rule:

# ceph osd pool create cache 64 64 ssd_rule
pool 'cache' created

Querying the cache pool shows that its crush_rule is ssd_rule (rule id 2 in the CRUSH map above):

# ceph osd pool get cache crush_rule
crush_rule: ssd_rule

Checking which rule each pool uses confirms that the cache pool uses crush_rule 2:

# ceph osd dump | grep -i size
pool 1 'images' replicated size 1 min_size 1 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 80 lfor 0/71 flags hashpspool stripe_width 0 application rbd
pool 2 'volumes' replicated size 1 min_size 1 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 89 lfor 0/73 flags hashpspool stripe_width 0 application rbd
pool 3 'backups' replicated size 1 min_size 1 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 84 lfor 0/75 flags hashpspool stripe_width 0 application rbd
pool 4 'vms' replicated size 1 min_size 1 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 86 lfor 0/77 flags hashpspool stripe_width 0 application rbd
pool 5 'cache' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 108 flags hashpspool stripe_width 0
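
With many pools, the rule assignment is easier to read as a two-column list. The sketch below assumes the "ceph osd dump" line format shown above; pool_rules is a hypothetical helper that extracts the pool name and its crush_rule id from each pool line.

```shell
# pool_rules is a hypothetical helper.
# It reads "ceph osd dump" output on stdin and prints "<pool-name> <crush_rule id>".
pool_rules() {
    awk -v q="'" '
        $1 == "pool" {
            name = $3
            gsub(q, "", name)    # strip the single quotes around the pool name
            for (i = 4; i < NF; i++)
                if ($i == "crush_rule")
                    print name, $(i + 1)
        }
    '
}

# Example (not executed here):
#   ceph osd dump | pool_rules
```

The numeric ids can be matched to rule names with "ceph osd crush rule dump".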

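II. Configure the cache pool

1. Create a cache pool and a backing pool

The cache pool was already created in section I.3 (pool name: cache); refer to that section.

Backing pool: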

# ceph osd pool create volumes2 64 64

2. Set up the cache tier

Attach the cache pool created above in front of the storage pool; volumes2 is our backing storage pool:

# ceph osd tier add volumes2 cache
pool 'cache' is now (or already was) a tier of 'volumes2'

Set the cache mode to writeback:

# ceph osd tier cache-mode cache writeback
set cache-mode for pool 'cache' to writeback

Direct all client requests for the backing pool to the cache pool:

# ceph osd tier set-overlay volumes2 cache
overlay for 'volumes2' is now (or already was) 'cache'

At this point, the details of the storage pool and the cache pool both show the cache tier configuration:

# ceph osd dump |egrep 'volumes2|cache'    
pool 5 'cache' replicated size 1 min_size 1 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 125 lfor 125/125 flags hashpspool,incomplete_clones tier_of 6 cache_mode writeback stripe_width 0
pool 6 'volumes2' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 125 lfor 125/125 flags hashpspool tiers 5 read_tier 5 write_tier 5 stripe_width 0
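
The three tiering commands always run in the same order (attach, set mode, set overlay), so they can be generated from the two pool names. This is a sketch only; tier_setup_cmds is a hypothetical helper that prints the commands for review and does not run them.

```shell
# tier_setup_cmds is a hypothetical helper.
# Arguments: <backing-pool> <cache-pool> [<cache-mode>]  (mode defaults to writeback)
tier_setup_cmds() {
    backing="$1"
    cache="$2"
    mode="${3:-writeback}"
    printf 'ceph osd tier add %s %s\n' "$backing" "$cache"          # attach cache to backing pool
    printf 'ceph osd tier cache-mode %s %s\n' "$cache" "$mode"      # choose the cache mode
    printf 'ceph osd tier set-overlay %s %s\n' "$backing" "$cache"  # route client I/O via the cache
}

# Print the commands for the pools used in this article.
tier_setup_cmds volumes2 cache
```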

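3. Cache tier parameters

For production deployments, the hit set must use the Bloom filter data structure (judging from the official documentation, this appears to be the only filter type currently supported):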

ceph osd pool set cache hit_set_type bloom

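Set the number of bytes or objects at which the cache tiering agent starts flushing objects from the cache pool to the backing pool and evicting them:

# Start flushing and evicting when the cache pool holds 1 TiB of data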
ceph osd pool set cache target_max_bytes 1099511627776

# Start flushing and evicting when the cache pool holds 10 million objects
ceph osd pool set cache target_max_objects 10000000
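
target_max_bytes takes a plain byte count, so it helps to compute the value rather than type it by hand. A minimal sketch using shell arithmetic; to_bytes is a hypothetical helper and only handles the two units shown.

```shell
# to_bytes is a hypothetical helper: converts a whole number of GiB or TiB to bytes.
to_bytes() {
    n="$1"
    unit="$2"
    case "$unit" in
        GiB) echo $(( n * 1024 * 1024 * 1024 )) ;;
        TiB) echo $(( n * 1024 * 1024 * 1024 * 1024 )) ;;
        *)   echo "unknown unit: $unit" >&2; return 1 ;;
    esac
}

# 1 TiB, the value passed to target_max_bytes in this article.
to_bytes 1 TiB
```

For a small SSD like the one in this test cluster, something such as to_bytes 50 GiB would be a more realistic input; the right figure depends on the usable capacity of the cache pool.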

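Set the minimum age, in seconds, that an object must reach before the cache tier flushes it to the storage tier or evicts it: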

ceph osd pool set cache cache_min_flush_age 600
ceph osd pool set cache cache_min_evict_age 600

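Set the proportion of dirty (modified) objects in the cache pool at which the cache tiering agent starts flushing objects from the cache tier to the storage tier:

# Start flushing when dirty objects reach 40%
ceph osd pool set cache cache_target_dirty_ratio 0.4
# Start flushing at a higher speed when dirty objects reach 60%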
ceph osd pool set cache cache_target_dirty_high_ratio 0.6

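When the cache pool's usage reaches a given percentage of its capacity, the cache tiering agent evicts objects to maintain free space (at this limit the cache pool is considered full). At that point unmodified (clean) objects are evicted: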

ceph osd pool set cache cache_target_full_ratio 0.8
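
The ratios are relative to target_max_bytes (or target_max_objects), so the approximate absolute thresholds can be worked out in advance. The sketch below uses integer percentages to stay within shell arithmetic; threshold_bytes is a hypothetical helper, and the figures are only a rough guide because the agent makes its decisions per placement group.

```shell
# threshold_bytes is a hypothetical helper: <target_max_bytes> <percent> -> bytes (rounded down).
threshold_bytes() {
    max_bytes="$1"
    pct="$2"
    echo $(( max_bytes * pct / 100 ))
}

max=1099511627776   # target_max_bytes used in this article (1 TiB)
echo "flushing starts near:         $(threshold_bytes "$max" 40) dirty bytes"
echo "high-speed flushing near:     $(threshold_bytes "$max" 60) dirty bytes"
echo "eviction of clean data near:  $(threshold_bytes "$max" 80) bytes used"
```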

4. Test the cache pool

With the cache pool configured, first lower the minimum flush and evict ages to 60s:

ceph osd pool set cache cache_min_evict_age 60
ceph osd pool set cache cache_min_flush_age 60

Lower the dirty ratio to 0.1% so that the cache tiering agent starts flushing objects from the cache tier to the storage tier almost immediately:

ceph osd pool set cache cache_target_dirty_ratio 0.001

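Then write an object to the storage pool:

rados -p volumes2 put test MySQL-community-client-5.7.31-1.el7.x86_64.rpm

List the storage pool: the object should not be visible there yet. List the cache pool: the object is stored there: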

rados -p volumes2 ls |grep test
rados -p cache ls |grep test

After 60s the object is flushed. It then becomes visible in the storage pool and is evicted from the cache pool.
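
Note that "grep test" matches any object whose name merely contains "test". For a stricter check, a small filter can match the object name exactly. This is a sketch; has_object is a hypothetical helper that reads a pool listing on stdin.

```shell
# has_object is a hypothetical helper.
# It succeeds if the listing on stdin contains a line exactly equal to the object name.
has_object() {
    # -F: fixed string, -x: whole line must match, -q: no output, only exit status
    grep -Fxq -- "$1"
}

# Example (not executed here):
#   rados -p cache ls    | has_object test && echo "in cache pool"
#   rados -p volumes2 ls | has_object test && echo "in backing pool"
```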

III. Remove the cache pool

Note that the removal procedure depends on the type of cache pool.

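1. Remove a read-only cache pool

A read-only cache holds no modified data, so it can be disabled and removed directly without losing any recent changes to objects in the cache.

Change the cache mode to none to disable the cache: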

ceph osd tier cache-mode cache none

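Remove the cache pool:

# Detach the cache tier (cephfs_data is the backing pool in this example; with the setup above it would be volumes2)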
ceph osd tier remove cephfs_data cache

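2. Remove a writeback cache pool

A writeback cache may hold modified data, so steps must be taken to ensure that no recent changes to objects in the cache are lost before it is disabled and removed.

Change the cache mode to forward so that new and modified objects are flushed to the backing pool (on recent Ceph releases the official documentation uses proxy mode for this step instead, so check the documentation for your release):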

ceph osd tier cache-mode cache forward

List the cache pool to make sure all objects have been flushed (this may take a while):

rados -p cache ls

If objects remain in the cache pool, they can be flushed and evicted manually:

rados -p cache cache-flush-evict-all

Remove the overlay so that clients no longer direct traffic to the cache:

ceph osd tier remove-overlay cephfs_data

Detach the cache pool from the storage pool:

ceph osd tier remove cephfs_data cache
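
The order of the writeback teardown matters, so it can help to generate the whole sequence from the two pool names. A sketch only: tier_teardown_cmds is a hypothetical helper that prints the commands used in this section and does not run them.

```shell
# tier_teardown_cmds is a hypothetical helper.
# Arguments: <backing-pool> <cache-pool>
tier_teardown_cmds() {
    backing="$1"
    cache="$2"
    printf 'ceph osd tier cache-mode %s forward\n' "$cache"      # stop caching new writes
    printf 'rados -p %s cache-flush-evict-all\n' "$cache"        # push remaining objects down
    printf 'ceph osd tier remove-overlay %s\n' "$backing"        # stop routing clients to the cache
    printf 'ceph osd tier remove %s %s\n' "$backing" "$cache"    # detach the tier
}

# Print the sequence for the pools created earlier in this article.
tier_teardown_cmds volumes2 cache
```

The output is meant to be run one line at a time: before removing the overlay, "rados -p cache ls" should return nothing.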

To tag a pool with the rbd application (here a pool named sata-pool):

ceph osd pool application enable sata-pool rbd

https://www.cnblogs.com/breezey/p/11080532.html

https://my.oschina.net/hanhanztj/blog/515410