Desktop
This was fairly easy. I have an SSD with a 35 GiB partition set aside for the cache (/dev/sda6), and an HDD with a 270 GiB partition that I want cached (/dev/sdc3). Skipping the partitioning steps, I set both partitions up as LVM physical volumes:
# pvcreate /dev/sdc3
# pvcreate /dev/sda6
Then I create a volume group and add both physical volumes to it:
# vgcreate letovg /dev/sda6
# vgextend letovg /dev/sdc3
An optional step is to tag the physical volumes so I can remember which is which:
# pvchange --addtag @hdd /dev/sdc3
# pvchange --addtag @ssd /dev/sda6
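Tags can then stand in for device names in most LVM commands; for example, something like this should list just the PV on the HDD:
# pvs @hdd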
Next I create the logical volume; in the parlance of the lvmcache man page, this is the "origin LV". I specify that it takes up the entire physical volume on the HDD:
# lvcreate -l 100%PVS -n cargo letovg /dev/sdc3
One of the advantages of lvm-cache over bcache is that at this point I can already create a filesystem and copy data to and from it; I don't have to wait for the cache to be fully set up:
# mkfs.ext4 -m 0 -v /dev/letovg/cargo
# mount /dev/letovg/cargo /mnt/cargo
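To have it mounted at boot, an fstab entry along these lines should do (the options here are just the defaults; adjust as needed):
/dev/letovg/cargo  /mnt/cargo  ext4  defaults  0  2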
Now on the SSD I create the "cache metadata LV". This should be roughly a thousandth of the cache data LV's size, so for a ~35 GiB cache I set up a 35 MiB metadata volume:
# lvcreate -n cargo-cache-meta -L 35M letovg /dev/sda6
Then I create the "cache data LV" on the remaining space on the SSD PV:
# lvcreate -n cargo-cache -l 100%PVS letovg /dev/sda6
A later command complained that there wasn't enough free space in the volume group, so I shrank this LV by 9 extents to make room:
# lvresize -l -9 letovg/cargo-cache
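To see how many extents are actually free in the volume group, something like this does the trick:
# vgs -o +vg_free_count letovg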
Now I combine the "cache data" and "cache metadata" logical volumes together into the "cache pool LV". I opted for a writeback cache rather than the (default) writethrough mode:
# lvconvert --cachemode writeback --type cache-pool --poolmetadata letovg/cargo-cache-meta letovg/cargo-cache
And now I attach it to the "origin LV" to obtain the "cache LV":
# lvconvert --type cache --cachepool letovg/cargo-cache letovg/cargo
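As an aside, the whole thing is reversible. As far as I can tell, lvconvert offers two ways to take the cache off again: one detaches the cache pool but keeps it around, the other flushes it and deletes it:
# lvconvert --splitcache letovg/cargo
# lvconvert --uncache letovg/cargo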
And that's it! Here is the output of the commands that show the LVM PVs, VG, and LV:
# pvs
  PV         VG     Fmt  Attr PSize   PFree
  /dev/sda6  letovg lvm2 a--   35.00g    0
  /dev/sdc3  letovg lvm2 a--  270.00g    0
# vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  letovg   2   1   0 wz--n- 304.99g    0
# lvs
  LV    VG     Attr       LSize   Pool          Origin         Data%  Meta%  Move Log Cpy%Sync Convert
  cargo letovg Cwi-aoC--- 270.00g [cargo-cache] [cargo_corig]  61.40  18.53           0.00
This is after several weeks of use. The cache is 61.4% full. Here's a little more information (I trimmed some of the columns out):
# lvs -a -o +cache_total_blocks,cache_used_blocks,cache_dirty_blocks,cache_read_hits,cache_read_misses,cache_write_hits,cache_write_misses
  LV    CacheTotalBlocks CacheUsedBlocks CacheDirtyBlocks CacheReadHits CacheReadMisses CacheWriteHits CacheWriteMisses
  cargo           572224          351374                0        454928         2492192         449854          1934418
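Those counters are cumulative, so the hit rates can be computed directly from them; for example:
$ echo "scale=3; 454928/(454928+2492192); 449854/(449854+1934418)" | bc
.154
.188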
In this case, roughly 15% of the block reads and 18% of the block writes on the cache LV have been serviced by the SSD. Not bad. Here's a simple benchmark: I scan repeatedly through about 200 MiB of binary files for a string that isn't there, and before each pass I drop the caches Linux holds in RAM:
$ while true; do echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null; time grep -q blah ASCA_SZF_1B_M02_2011103000*; sleep 1; done
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.34s user 0.36s system 27% cpu 2.541 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.31s user 0.38s system 27% cpu 2.514 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.34s user 0.32s system 26% cpu 2.550 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.34s user 0.34s system 26% cpu 2.536 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.36s user 0.26s system 5% cpu 11.184 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.17s user 0.20s system 77% cpu 0.479 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.19s user 0.19s system 79% cpu 0.485 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.22s user 0.17s system 78% cpu 0.485 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.25s user 0.14s system 80% cpu 0.492 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.19s user 0.20s system 79% cpu 0.489 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.21s user 0.18s system 79% cpu 0.493 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.21s user 0.18s system 79% cpu 0.493 total
It takes about 2.5 seconds to read the data when it comes from the HDD. Once it's been accessed enough, lvm-cache (well, dm-cache) decides it's time to promote it to the SSD cache. Interestingly, that promotion causes a big one-off slowdown: that pass took over 11 seconds. All subsequent accesses are served from the SSD cache and take just under 0.5 seconds. For comparison, this is the speed when the data is already cached in RAM by Linux:
$ while true; do time grep -q blah ASCA_SZF_1B_M02_2011103000*; sleep 1; done
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.24s user 0.11s system 84% cpu 0.422 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.12s user 0.05s system 99% cpu 0.172 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.12s user 0.04s system 97% cpu 0.164 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.12s user 0.05s system 98% cpu 0.166 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.11s user 0.05s system 99% cpu 0.168 total
grep -q blah ASCA_SZF_1B_M02_2011103000* 0.13s user 0.04s system 98% cpu 0.166 total
In this case, it takes about 0.16 seconds to read the data.
Laptop
This one was more complicated: my laptop also has an SSD and an HDD, but everything is encrypted with LUKS, with LVM running on top of LUKS. The basic setup looks like this (modified output from lsblk):
NAME               SIZE TYPE  MOUNTPOINT
sdb               59.6G disk
└─sdb2              45G part
  └─ssdmain         45G crypt
    ├─ssdvg-swap     4G lvm   [SWAP]
    └─ssdvg-root    41G lvm   /
sda              298.1G disk
└─sda3           238.1G part
  └─hdd          238.1G crypt
    └─hddvg-home 238.1G lvm   /home
I have to use two volume groups because I don't bring up the HDD VG until after the root partition (on the SSD VG) is mounted.
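One common way to arrange that ordering is via /etc/crypttab; a sketch using the mapping name from the lsblk output above (the exact options depend on the distro and key setup):
# /etc/crypttab
hdd  /dev/sda3  none  luks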
For lvm-cache, the tricky part here is that the cache pool LV and the origin LV must be in the same volume group. So I have to split the SSD into two PVs (a second partition with its own LUKS container). The first PV is used, as before, for / and swap. The second PV holds the cache pool LV and is added to the second VG, alongside the HDD PV:
NAME                   SIZE TYPE  MOUNTPOINT
sdb                   59.6G disk
├─sdb2                  45G part
│ └─ssdmain             45G crypt
│   ├─ssdvg-swap         4G lvm   [SWAP]
│   └─ssdvg-root        41G lvm   /
└─sdb3                14.1G part
  └─ssdcache          14.1G crypt
    ├─hddvg-home-data 14.1G lvm
    └─hddvg-home-meta   16M lvm
From there, creating the cache pool LV and attaching it to the origin LV is fairly straightforward. I must say, it's nice using LVM because it supports online resizing. I had to boot from installation media to shrink the ext4 filesystem that / lives on (ext4 supports online growing but not shrinking), but the other operations (shrinking the PV, creating the new PV, adding it to the VG, creating the new LVs, and so on) I could do "live".
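For the record, the remaining commands mirror the desktop ones almost exactly. A sketch, assuming the second LUKS device is opened as /dev/mapper/ssdcache and using the LV names from the lsblk output above (as on the desktop, a few extents may need to be freed from the data LV before the cache-pool conversion succeeds):
# pvcreate /dev/mapper/ssdcache
# vgextend hddvg /dev/mapper/ssdcache
# lvcreate -n home-meta -L 16M hddvg /dev/mapper/ssdcache
# lvcreate -n home-data -l 100%PVS hddvg /dev/mapper/ssdcache
# lvconvert --cachemode writeback --type cache-pool --poolmetadata hddvg/home-meta hddvg/home-data
# lvconvert --type cache --cachepool hddvg/home-data hddvg/home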