Low-Level Data Structure

llds is a binary tree (red-black tree) implementation which attempts to maximize memory efficiency by bypassing the virtual memory layer (vmalloc) and by using optimized data structure memory semantics.

The general working thesis of llds is: for large memory applications, virtual memory layers can hurt application performance because memory latency increases when dealing with large data structures. Specifically, lookups through the kernel's page tables/directories, and the additional DRAM requests they cause, can be avoided to speed up application memory access.

Applicable use cases: applications on systems that utilize large in-memory data structures. In our testing, "large" meant structures larger than 4GB, which yielded significant gains for llds over equivalent userspace implementations.
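To give a sense of the scale behind this thesis, the sketch below estimates how many page-table entries the kernel needs just to map a structure of a given size. It assumes x86-64 defaults (4 KiB pages, 8-byte entries) and counts only last-level entries; the demo_* names are illustrative and are not part of llds.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 defaults: 4 KiB base pages, 8-byte page-table entries. */
#define DEMO_PAGE_SIZE 4096ULL
#define DEMO_PTE_SIZE  8ULL

/* Number of base pages needed to map `bytes` of memory (rounded up). */
uint64_t demo_pages_needed(uint64_t bytes)
{
    /* Guard against overflow in the round-up below. */
    assert(bytes <= UINT64_MAX - (DEMO_PAGE_SIZE - 1));
    return (bytes + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
}

/* Bytes of last-level page-table entries needed to map `bytes`. */
uint64_t demo_leaf_pte_bytes(uint64_t bytes)
{
    return demo_pages_needed(bytes) * DEMO_PTE_SIZE;
}
```

For a 4 GiB structure, demo_pages_needed(4ULL << 30) gives 1048576 pages and demo_leaf_pte_bytes(4ULL << 30) gives 8 MiB of last-level entries, before counting the upper directory levels.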

Installing/Configuring

$ cmake .
$ make
# make install
# mknod /dev/llds c 834 0

The build environment will need libproc, glibc, and linux headers. For Ubuntu/Debian based distros these are available in the libproc-dev, linux-libc-dev, and build-essential packages.
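After the mknod step above, an application may want to confirm that the device node exists before trying to use it. The following is a minimal sketch using only stat(2); demo_is_char_device is a hypothetical helper and is not part of llds or libforrest.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Returns 1 if `path` exists and is a character device, 0 otherwise. */
int demo_is_char_device(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return 0;               /* missing node or no permission */
    return S_ISCHR(st.st_mode) ? 1 : 0;
}
```

Calling demo_is_char_device("/dev/llds") should return 1 once the node has been created; it does not check that the llds module itself is loaded.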

How it Works

llds is a Linux kernel module (2.6, 3.0) which leverages facilities provided by the kernel mm for optimal DRAM memory access. llds uses the red-black tree data structure, which is highly optimized in the kernel and is used to manage processes, epoll file descriptors, file systems, and many other components of the kernel.

Memory management in llds is optimized for traversal latency, not space efficiency, though space savings are probable due to better alignment in most use cases. llds data structures should not consume any more memory than their equivalent user space implementations.

Traversal latency is optimized by exploiting underlying physical RAM mechanics: avoiding CPU cache pollution and NUMA cross-check chatter, and streamlining CPU data prefetching (L1D cache lines). Fragmented memory access is less efficient when interacting with modern DRAM controllers, and efficiency suffers further on NUMA systems as the number of processors/memory banks increases.
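The effect of alignment on cache-line traffic can be illustrated in user space. The sketch below is not the llds node layout; it assumes a 64-byte cache line and uses a hypothetical demo_node to show that an aligned node touches one line per visit while an unaligned one can straddle two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64  /* assumed line size, typical for x86-64 */

/* Hypothetical tree node, aligned (and therefore padded) to one cache line. */
struct demo_node {
    _Alignas(DEMO_CACHE_LINE) uint64_t key;
    struct demo_node *left;
    struct demo_node *right;
    struct demo_node *parent;
    void *value;
};

/* Number of cache lines touched by an object of `size` bytes at `addr`. */
unsigned demo_lines_touched(uintptr_t addr, size_t size)
{
    uintptr_t first, last;

    assert(size > 0);
    first = addr / DEMO_CACHE_LINE;
    last = (addr + size - 1) / DEMO_CACHE_LINE;
    return (unsigned)(last - first + 1);
}
```

With this layout, every demo_node starts on a line boundary, so a traversal step loads exactly one line; a 40-byte node placed at offset 40 would instead touch two.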

libforrest

Developers can interact directly with the llds chardev using ioctl(2). However, it is highly recommended that the libforrest API be used instead, to avoid incompatibilities should the ioctl interface change in the future.

libforrest provides the basic key-value store operations: get, set, and delete. In addition, it provides a 64-bit MurmurHash (rev. A) for llds key hashing.
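For reference, here is a standalone sketch of the public-domain MurmurHash64A algorithm, assuming that is what "rev. A" refers to. It is not copied from libforrest, and the name demo_murmur64a and its signature are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* MurmurHash64A over `len` bytes at `key`, reading blocks in native byte order. */
uint64_t demo_murmur64a(const void *key, size_t len, uint64_t seed)
{
    const uint64_t m = 0xc6a4a7935bd1e995ULL;
    const int r = 47;
    const unsigned char *p = key;
    const unsigned char *end;
    uint64_t h = seed ^ ((uint64_t)len * m);

    assert(key != NULL);
    end = p + (len & ~(size_t)7);

    /* Mix the input eight bytes at a time. */
    while (p != end) {
        uint64_t k;
        memcpy(&k, p, sizeof k);    /* safe for unaligned input */
        p += sizeof k;
        k *= m; k ^= k >> r; k *= m;
        h ^= k; h *= m;
    }

    /* Mix the remaining 0..7 tail bytes. */
    switch (len & 7) {
    case 7: h ^= (uint64_t)p[6] << 48; /* fall through */
    case 6: h ^= (uint64_t)p[5] << 40; /* fall through */
    case 5: h ^= (uint64_t)p[4] << 32; /* fall through */
    case 4: h ^= (uint64_t)p[3] << 24; /* fall through */
    case 3: h ^= (uint64_t)p[2] << 16; /* fall through */
    case 2: h ^= (uint64_t)p[1] << 8;  /* fall through */
    case 1: h ^= (uint64_t)p[0];
            h *= m;
    }

    /* Final avalanche. */
    h ^= h >> r; h *= m; h ^= h >> r;
    return h;
}
```

A string key could be reduced to a 64-bit llds key with something like demo_murmur64a(str, strlen(str), seed); the seed used by libforrest, if any, is not stated here.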

Examples are provided in the libforrest/examples directory.

Benchmarks

Benchmarks are inherently fluid. All samples and timings are available at http://github.com/johnj/llds-benchmarks, along with a run_tests.sh script which utilizes oprofile, a user-space implementation of red-black trees, and an equivalent llds implementation. The results below reflect one particular environment; all the tools and scripts are available so that users can test their own mileage.

Benchmark environment: Dell PowerEdge R610, 4x Intel Xeon L5640 (Westmere) w/HT (24 cores), 192GB DDR3 DRAM, Ubuntu 10.04.3 LTS. The keys are 64-bit integers and the values are incrementing strings (i.e., "0", "1", "2" ... "N"). There were no major page faults.

For conciseness, only tests with 2/16/24 threads and 500K/1.5M/2M keys are listed. dmidecode, samples, and full benchmarks are available at http://github.com/johnj/llds-benchmarks

Wall Timings (in seconds)

Threads  # of Items   userspace  llds   llds improvement
2        500000000    3564       1761   2.02x
16       1500000000   9291       4112   2.26x
24       2000000000   12645      5670   2.23x
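The "llds improvement" column for wall timings is the userspace time divided by the llds time. A minimal sketch of that arithmetic, with a hypothetical demo_speedup helper:

```c
#include <assert.h>

/* Speedup of llds over the userspace baseline, given wall times in seconds. */
double demo_speedup(double userspace_seconds, double llds_seconds)
{
    assert(llds_seconds > 0.0);
    return userspace_seconds / llds_seconds;
}
```

For the 2-thread row, demo_speedup(3564, 1761) is roughly 2.02, matching the table.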

Unhalted CPU cycles (10000 cycles @ 133 MHz)

Threads  # of Items   userspace   llds
2        500000000    874187763   77458531
16       1500000000   2792039325  107099682
24       2000000000   9680912335  529234102

L1 cache hits (200000 per sample)

Threads  # of Items   userspace  llds      llds improvement
2        500000000    3077671    5502292   1.78x
16       1500000000   15120921   27231553  1.80x
24       2000000000   23746988   39196177  1.65x

L2 cache hits (200000 per sample)

Threads  # of Items   userspace  llds    llds improvement
2        500000000    21866      60214   2.75x
16       1500000000   82101      511285  6.23x
24       2000000000   127072     800846  6.30x

L3/Last-Level cache hits (200000 per sample)

Threads  # of Items   userspace  llds    llds improvement
2        500000000    26069      32259   1.24x
16       1500000000   148827     254562  1.71x
24       2000000000   270191     341649  1.26x

L1 Data Prefetch misses (200000 per hardware sample)

Threads  # of Items   userspace  llds    llds improvement
2        500000000    52396      21113   2.48x
16       1500000000   350753     120891  2.90x
24       2000000000   544791     210268  2.59x

Status

llds is experimental. Though it has been tested in various environments (including integration into a search engine), it is not yet known to be in use on any production system. With additional eyes (preferably kernel hackers) looking at llds, the hope is that it will be stable by Q4 '12 (à la Wall, Perl6, and Christmas).

Known Limitations/Issues

  • libforrest limits the size of a value returned from kernel space. The default is 4096 bytes; it can be adjusted through the FORREST_MAX_VAL_LEN directive at compile time.
  • Only 64-bit architectures are supported
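The value-length limit above can be pictured as a compile-time constant with an overridable default. The sketch below assumes the limit is set with a preprocessor define (for example -DFORREST_MAX_VAL_LEN=8192) and that oversized values are truncated; both are assumptions, and demo_copy_value is a hypothetical helper rather than libforrest code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Default from this README; assumed to be overridable at compile time. */
#ifndef FORREST_MAX_VAL_LEN
#define FORREST_MAX_VAL_LEN 4096
#endif

/*
 * Copies at most FORREST_MAX_VAL_LEN bytes of a value into `dst`
 * (which must hold at least that many bytes) and returns the count copied.
 */
size_t demo_copy_value(char *dst, const char *src, size_t src_len)
{
    size_t n = src_len < FORREST_MAX_VAL_LEN ? src_len : FORREST_MAX_VAL_LEN;

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);
    return n;
}
```

Callers that store values larger than the limit would need to rebuild with a larger FORREST_MAX_VAL_LEN to read them back in full.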

Future Work

  • Support for additional data structures (hashes are questionable)
  • Add atomic operations (increment, decrement, CAS, etc.) in libforrest and llds
  • Research into virtual memory overhead and its implementation in the kernel, along with mitigation techniques

