Redis内存分配器
像大家原来使用纸币,平时用的纸币只有 1元 5元 10元 20元 50元 100元,为什么没有 1.5元 3元 ?
因为有了 3元 人们使用不方便,也不方便管理。
(1) Redis内存分配器
Redis在内存使用时也是类似纸币,为了方便管理,Redis在分配内存时按照设计的大小分配。
(1.1) 为什么要设置内存分配器
举个简单的例子,你找妈妈要零花钱,你想要1.5元,但是她没有1.5元,给了个你5元。
- 内存分配器的目标主要有2个:
1.1 减少内存碎片,包括内部碎片和外部碎片:
内部碎片:分配出去的但没有使用到的内存,比如需要 32 字节,分配了 40 字节,多余的 8 字节就是内部碎片。
外部碎片:大小不合适导致无法分配出去的内存,比如一直申请 16 字节的内存,但是内存分配器中保存着部分 8 字节的内存,一直分配不出去。 - 提高性能:
- 1 单线程性能
- 2 多线程性能
(1.2) 内存分配器配置
Linux
上默认内存分配器是jemalloc
Mac OS 系统Redis默认内存分配器是libc
,可以通过 info memory
结果里的 mem_allocator
指标 查看;
如果需要使用 jemalloc 内存分配器,在执行make
命令时通过 MALLOC=jemalloc
指定
jemalloc 和 tcmalloc 都是对 glibc 中的优化,目的也是为了减少内存碎片和提高性能。
(2) 内存分配的原理
类似于人民币有 1元、5元、10元、100元
jemalloc 将对象按大小分为3类,不同大小类别的分配算法不同:
small
: 从对应 bin 管理的 run 中返回一个 region
large
: 大小比 chunk 小,比 page 大,会单独返回一个 run
huge
: 大小为 chunk 倍数,会分配 chunk
每个 size_class
代表 jemalloc 分配的内存大小。
共有 NSIZES(232)个小类(如果用户申请的大小位于两个小类之间,会取较大的,比如申请14字节,位于8和16字节之间,按16字节分配。
在 2MiB chunk,4KiB page 的64位系统上,size classes 如下
+---------+---------+--------------------------------------+
|Category | Spacing | Size |
+---------+---------+--------------------------------------+
| | 8 | [8] |
| +---------+--------------------------------------+
| | 16 | [16, 32, 48, 64, 80, 96, 112, 128] |
| +---------+--------------------------------------+
| | 32 | [160, 192, 224, 256] |
| +---------+--------------------------------------+
| | 64 | [320, 384, 448, 512] |
| +---------+--------------------------------------+
|Small | 128 | [640, 768, 896, 1024] |
| +---------+--------------------------------------+
| | 256 | [1280, 1536, 1792, 2048] |
| +---------+--------------------------------------+
| | 512 | [2560, 3072, 3584, 4096] |
| +---------+--------------------------------------+
| | 1 KiB | [5 KiB, 6 KiB, 7 KiB, 8 KiB] |
| +---------+--------------------------------------+
| | 2 KiB | [10 KiB, 12 KiB, 14 KiB] |
+---------+---------+--------------------------------------+
| | 2 KiB | [16 KiB] |
| +---------+--------------------------------------+
| | 4 KiB | [20 KiB, 24 KiB, 28 KiB, 32 KiB] |
| +---------+--------------------------------------+
| | 8 KiB | [40 KiB, 48 KiB, 54 KiB, 64 KiB] |
| +---------+--------------------------------------+
| | 16 KiB | [80 KiB, 96 KiB, 112 KiB, 128 KiB] |
|Large +---------+--------------------------------------+
| | 32 KiB | [160 KiB, 192 KiB, 224 KiB, 256 KiB] |
| +---------+--------------------------------------+
| | 64 KiB | [320 KiB, 384 KiB, 448 KiB, 512 KiB] |
| +---------+--------------------------------------+
| | 128 KiB | [640 KiB, 768 KiB, 896 KiB, 1 MiB] |
| +---------+--------------------------------------+
| | 256 KiB | [1280 KiB, 1536 KiB, 1792 KiB] |
+---------+---------+--------------------------------------+
| | 256 KiB | [2 MiB] |
| +---------+--------------------------------------+
| | 512 KiB | [2560 KiB, 3 MiB, 3584 KiB, 4 MiB] |
| +---------+--------------------------------------+
| | 1 MiB | [5 MiB, 6 MiB, 7 MiB, 8 MiB] |
| +---------+--------------------------------------+
|Huge | 2 MiB | [10 MiB, 12 MiB, 14 MiB, 16 MiB] |
| +---------+--------------------------------------+
| | 4 MiB | [20 MiB, 24 MiB, 28 MiB, 32 MiB] |
| +---------+--------------------------------------+
| | 8 MiB | [40 MiB, 48 MiB, 56 MiB, 64 MiB] |
| +---------+--------------------------------------+
| | ... | ... |
+---------+---------+--------------------------------------+
(2.1) small分配流程
small 的分配流程如下:
- 查找对应 size classes 的 bin
- 从 bin 中获取 run:
2.1 bin->runcur
2.2 从 bin->runs 查找未满的 run- 从 arena 中获取 run:
3.1 从 arena->avail_runs 中查找空闲 run
3.2 当没有合适 run 时,从 chunk 中分配 run:3.2.1 arena->spare 3.2.2 arena->cached_tree 3.2.3 arena->retained_tree 3.2.4 调用 mmap() 新分配一块 chunk
- 从 run 中返回一个空闲 region
small 的释放流程如下:
- 将该 region 返回给对应的 run,即设置 bitmap 为空闲,增加 nfree
- 将 run 还给 bin:
2.1 如果 run->nfree == 1,则设置为 bin->runcur 或者插入到 bin->runs 中- 如果 run->nfree == bin_info->nregs,则将该 run 与 bin 分离,再将 run 还给 arena:
3.1 尝试与相同 chunk 中前后相邻的空闲 run 进行合并,然后插入到 arena->avail_runs 中
3.2 若合并完后,整个 chunk 为空,则尝试与连续地址空间的空闲 chunk 进行合并,然后插入到 arena->cached_tree 中
(2.2) large
分配 large 和分配 small 类似:
- 先从 arena->avail_runs 中查找,因为 large object 不由 bin 管理,所以与 small object 相比,少了从 bin->runs 中查找的一步
- 分配 chunk,步骤和 small 一样,然后从 chunk 中分配需要的 run 大小,此时 run 的大小为单个 object 的大小,而 small run 的大小会从 bin_info[] 中获取
因为 large 大小是 page 的倍数,且会按照 page size 地址对齐,有可能造成 cache line 颠簸, 所以会根据配置多分配一个 page,用于内存着色,防止 cache line 的颠簸
large 和 small 的 arena_chunk_map_misc_t 格式也不同,large 只在首个 page 设置 run 的大小。释放流程和 small 一样,只是缺少了 run 在 bin 中的处理,直接将 run 还给 arena。
(2.3) huge
huge object 大小比 chunk 大。分配策略和上面分配 chunk 一样:
- 从 arena 中分配 extent_node_t
- 从 arena 中分配 chunk:
2.1 从 arena->cached_tree 中分配 chunk
2.2 从 arena->retained_tree 中分配
2.3 调用 mmap() 新分配一块 chunk- 将 chunk 和 node 插入到 chunks_rtree 中
- 插入到 arena->huge 链表中
释放和分配过程相反:
- 从 chunks_rtree 中获取 chunk 对应的 node,从而获取对应的 arena
- 移出 arena->huge
- 释放 chunk,插入到 arena->cached_tree 中
- 释放 node
huge 使用了线程间共享的 chunks_rtree 来保存信息,这会导致锁的竞争,但是应用程序很少会分配如此大的内存,所以带来的影响很小。
(3) 源码解读
由于redis要求满足跨平台性,而每个平台又会有自己的内存管理函数。
zmalloc.c
和zmalloc.h
主要功能就是对原有库里的内存分配函数进行封装,形成独立的一套内存管理函数。
根据系统的不同,使用不同的内存管理函数,例如jemalloc
, tcmalloc
, cmalloc
。
(3.1) zmalloc.h
https://github.com/redis/redis/blob/6.0/src/zmalloc.h
// tcmalloc
#if defined(USE_TCMALLOC)
#define ZMALLOC_LIB ("tcmalloc-" __xstr(TC_VERSION_MAJOR) "." __xstr(TC_VERSION_MINOR))
#include <google/tcmalloc.h>
#if (TC_VERSION_MAJOR == 1 && TC_VERSION_MINOR >= 6) || (TC_VERSION_MAJOR > 1)
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) tc_malloc_size(p)
#else
#error "Newer version of tcmalloc required"
#endif
// jemalloc
#elif defined(USE_JEMALLOC)
#define ZMALLOC_LIB ("jemalloc-" __xstr(JEMALLOC_VERSION_MAJOR) "." __xstr(JEMALLOC_VERSION_MINOR) "." __xstr(JEMALLOC_VERSION_BUGFIX))
#include <jemalloc/jemalloc.h>
#if (JEMALLOC_VERSION_MAJOR == 2 && JEMALLOC_VERSION_MINOR >= 1) || (JEMALLOC_VERSION_MAJOR > 2)
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) je_malloc_usable_size(p)
#else
#error "Newer version of jemalloc required"
#endif
// 苹果系统 malloc
#elif defined(__APPLE__)
#include <malloc/malloc.h>
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) malloc_size(p)
#endif
// 其它情况 malloc
#ifndef ZMALLOC_LIB
#define ZMALLOC_LIB "libc"
#ifdef __GLIBC__
#include <malloc.h>
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) malloc_usable_size(p)
#endif
#endif
ZMALLOC_LIB
:最终使用的内存库,是jemalloc-5.1.0-0-g0
、tcmalloc-版本号、libc 这样HAVE_MALLOC_SIZE
: 使用的内存分配库是否有自带的获取内存块大小的函数,jemalloc
和tcmalloc
都有HAVE_DEFRAG
: 是否支持内存碎片处理
// 如果支持支持内存碎片整理,会编译对应的函数
#ifdef HAVE_DEFRAG
void zfree_no_tcache(void *ptr); // 内存释放
void *zmalloc_no_tcache(size_t size); // 内存分配
#endif
// 内存分配库是否有自带的获取内存块大小的函数
#ifndef HAVE_MALLOC_SIZE
size_t zmalloc_size(void *ptr); // 获取内存长度
size_t zmalloc_usable(void *ptr); //
#else
#define zmalloc_usable(p) zmalloc_size(p)
#endif
(4.2) zmalloc.c
https://github.com/redis/redis/blob/6.0/src/zmalloc.c
/* Explicitly override malloc/free etc when using tcmalloc. */
#if defined(USE_TCMALLOC) // 使用三方库 tcmalloc
#define malloc(size) tc_malloc(size)
#define calloc(count,size) tc_calloc(count,size)
#define realloc(ptr,size) tc_realloc(ptr,size)
#define free(ptr) tc_free(ptr)
#elif defined(USE_JEMALLOC) // 使用三方库 jemalloc
#define malloc(size) je_malloc(size)
#define calloc(count,size) je_calloc(count,size)
#define realloc(ptr,size) je_realloc(ptr,size)
#define free(ptr) je_free(ptr)
#define mallocx(size,flags) je_mallocx(size,flags)
#define dallocx(ptr,flags) je_dallocx(ptr,flags)
#endif
功能 | 函数 | tcmalloc | jemalloc |
---|---|---|---|
分配内存 | malloc(size) | tc_malloc(size) | je_malloc(size) |
分配内存空间 | calloc(count,size) | tc_calloc(count,size) | je_calloc(count,size) |
调整内存大小 | realloc(ptr,size) | tc_realloc(ptr,size) | je_realloc(ptr,size) |
释放内存 | free(ptr) | tc_free(ptr) | je_free(ptr) |
mallocx(size,flags) | - | je_mallocx(size,flags) | |
dallocx(ptr,flags) | - | je_dallocx(ptr,flags) |
记录内存大小
// 分配/增加使用的内存数
#define update_zmalloc_stat_alloc(__n) do { \
size_t _n = (__n); \
if (_n&(sizeof(long)-1)) _n += sizeof(long)-(_n&(sizeof(long)-1)); \
atomicIncr(used_memory,__n); \
} while(0)
// 释放/减少使用的内存数
#define update_zmalloc_stat_free(__n) do { \
size_t _n = (__n); \
if (_n&(sizeof(long)-1)) _n += sizeof(long)-(_n&(sizeof(long)-1)); \
atomicDecr(used_memory,__n); \
} while(0)
// 已使用内存变量
static size_t used_memory = 0;
pthread_mutex_t used_memory_mutex = PTHREAD_MUTEX_INITIALIZER;
redis用一个静态原子变量used_memory来记录已使用的内存大小。
(4) 主动碎片整理
########################### ACTIVE DEFRAGMENTATION #######################
#
# What is active defragmentation?
# -------------------------------
#
# Active (online) defragmentation allows a Redis server to compact the
# spaces left between small allocations and deallocations of data in memory,
# thus allowing to reclaim back memory.
#
# Fragmentation is a natural process that happens with every allocator (but
# less so with Jemalloc, fortunately) and certain workloads. Normally a server
# restart is needed in order to lower the fragmentation, or at least to flush
# away all the data and create it again. However thanks to this feature
# implemented by Oran Agra for Redis 4.0 this process can happen at runtime
# in a "hot" way, while the server is running.
#
# Basically when the fragmentation is over a certain level (see the
# configuration options below) Redis will start to create new copies of the
# values in contiguous memory regions by exploiting certain specific Jemalloc
# features (in order to understand if an allocation is causing fragmentation
# and to allocate it in a better place), and at the same time, will release the
# old copies of the data. This process, repeated incrementally for all the keys
# will cause the fragmentation to drop back to normal values.
#
# Important things to understand:
#
# 1. This feature is disabled by default, and only works if you compiled Redis
# to use the copy of Jemalloc we ship with the source code of Redis.
# This is the default with Linux builds.
#
# 2. You never need to enable this feature if you don't have fragmentation
# issues.
#
# 3. Once you experience fragmentation, you can enable this feature when
# needed with the command "CONFIG SET activedefrag yes".
#
# The configuration parameters are able to fine tune the behavior of the
# defragmentation process. If you are not sure about what they mean it is
# a good idea to leave the defaults untouched.
# Active defragmentation is disabled by default
# activedefrag no
# Minimum amount of fragmentation waste to start active defrag
# active-defrag-ignore-bytes 100mb
# Minimum percentage of fragmentation to start active defrag
# active-defrag-threshold-lower 10
# Maximum percentage of fragmentation at which we use maximum effort
# active-defrag-threshold-upper 100
# Minimal effort for defrag in CPU percentage, to be used when the lower
# threshold is reached
# active-defrag-cycle-min 1
# Maximal effort for defrag in CPU percentage, to be used when the upper
# threshold is reached
# active-defrag-cycle-max 25
# Maximum number of set/hash/zset/list fields that will be processed from
# the main dictionary scan
# active-defrag-max-scan-fields 1000
# Jemalloc background thread for purging will be enabled by default
jemalloc-bg-thread yes
# It is possible to pin different threads and processes of Redis to specific
# CPUs in your system, in order to maximize the performances of the server.
# This is useful both in order to pin different Redis threads in different
# CPUs, but also in order to make sure that multiple Redis instances running
# in the same host will be pinned to different CPUs.
#
# Normally you can do this using the "taskset" command, however it is also
# possible to this via Redis configuration directly, both in Linux and FreeBSD.
#
# You can pin the server/IO threads, bio threads, aof rewrite child process, and
# the bgsave child process. The syntax to specify the cpu list is the same as
# the taskset command:
#
# Set redis server/io threads to cpu affinity 0,2,4,6:
# server_cpulist 0-7:2
#
# Set bio threads to cpu affinity 1,3:
# bio_cpulist 1,3
#
# Set aof rewrite child process to cpu affinity 8,9,10,11:
# aof_rewrite_cpulist 8-11
#
# Set bgsave child process to cpu affinity 1,10,11
# bgsave_cpulist 1,10-11
# In some cases redis will emit warnings and even refuse to start if it detects
# that the system is in bad state, it is possible to suppress these warnings
# by setting the following config which takes a space delimited list of warnings
# to suppress
#
# ignore-warnings ARM64-COW-BUG
参考资料
[1] ☆ jemalloc 源码分析
[2] Jemalloc 内存分配与优化实践
[3] ptmalloc、tcmalloc与jemalloc对比分析
[4] redis源码阅读——内存分配完全解析
[5] 关于redis源码的内存分配,jemalloc,tcmalloc,libc
[6] redis.conf
[7] 浅谈Redis源码—内存分配zmalloc
[8] redis源码阅读——内存分配完全解析
[9] Redis内存模型
[10] 深入学习Redis(1):Redis内存模型