The BFQ disk scheduler was invented by Paolo Valente. The current version of BFQ in DragonFlyBSD is implemented according to his technical report [1]. Some additional features have been added to this version; they are inspired by the Linux version [2], but are written entirely from scratch.
Like the CFQ (Completely Fair Queueing) disk scheduler under Linux, BFQ is a fair queueing scheduler that aims to improve interactivity and lower the latency of the system. Maximizing throughput, however, is not a major design goal of BFQ. It is therefore best to switch to BFQ on computers used as desktops, where interactivity generally matters more than throughput.
The core concept of BFQ is the "budget" of every thread: the maximum amount of service (measured by the size of the I/O requests) that a thread can receive while the scheduler is serving it exclusively. Once a thread uses up its budget, it is taken off the scheduler, assigned a new budget, and queued (again) into the fair queue. BFQ then selects another thread to serve exclusively.
BFQ is based on a fair queueing algorithm named WF^2Q+. This algorithm was first used on routers to fairly dispatch network packets from various connections. If we replace the terms "packets" and "connections" with "I/O requests" and "threads (or processes)", we arrive at the basic idea of how this algorithm is applied to the BFQ scheduler.
The WF^2Q+ algorithm decides which thread is selected and served by BFQ when the last thread uses up its budget. It is based on the notion of "virtual time", which is actually service offered and received (measured in bytes or sectors in the implementation). The algorithm maintains a global virtual time, which is the amount of service offered globally, and two attributes for every thread: the virtual eligible time and the virtual deadline. The former is the total service the thread has received, while the latter is the expected "time" of selection: the thread expects to be selected when the global virtual time reaches its deadline.
The WF^2Q+ algorithm always selects the thread with the minimum deadline among the threads whose eligible time is no later than the global virtual time. Intuitively, if all threads consume the same amount of budget, they are selected alternately and receive the same share of disk bandwidth; if one thread consumes more budget than the others, it is selected less often.
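As a rough sketch of this bookkeeping (notation following [3]; the real BFQ formulas add further refinements), suppose thread i is queued with budget B_i and weight w_i (weights are equal in the simplest case). Then:

    E_i  = max(V, D_i_prev)           (virtual eligible time)
    D_i  = E_i + B_i / w_i            (virtual deadline)
    next = argmin { D_i : E_i <= V }  (selection rule)

Here V is the global virtual time, which advances with the amount of service delivered; a thread charged a larger budget therefore receives a later deadline and waits longer for its next turn.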
The BFQ scheduler is written on top of the dsched framework. However, it needs more from dsched than the framework could provide: the scheduler has to be notified when the disk is idle or about to become idle, and only with this notification can it dispatch further I/O requests to the driver. Therefore, before implementing the scheduler itself, a request polling feature was added to the dsched framework.
Before request polling was implemented, the dsched framework did not have a dequeue() interface for scheduling policies running on top of it. Instead, it provided strategy() functions for a scheduler to call when it "guesses" that the disk may be able to receive more I/O requests.
The request polling feature transfers the guessing work to dsched by maintaining a variable called tag_queue_depth, which is the estimated depth of the disk's NCQ or TCQ. A variable called max_tag_queue_depth is initialized to the maximum depth of the disk's TCQ or NCQ, which can be acquired from the driver.
The request polling feature is not restricted to BFQ; it can be made use of by any policy on the dsched framework. To use this feature, a policy must do the following (a code sketch follows the list):

* Monitor current_tag_queue_depth, and push as many bios as it can until the depth reaches the maximum value. Monitoring can be achieved by tracking when bios are done and when bios are pushed to the scheduler through dsched's queue() interface. For the former, the policy may register a polling_func callback, which is called by dsched when a bio dispatched by dsched_strategy_request_polling() is done.

* Call dsched_strategy_request_polling() to dispatch the bios. This strategy() call will decrease current_tag_queue_depth. Note that unlike dsched_strategy_async(), a policy cannot register a per-request biodone() callback that gets called when the dispatched bio is done. Instead, if such a callback is needed, the policy should register a callback function (of type dsched_bio_done_t) by assigning it to polling_func in the policy structure. Note: this function must not block (e.g., on locks) and must be MPSAFE; it must not be changed after the prepare() interface is called.
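The following is a minimal sketch of such a dispatch loop. It is hedged: the foo_* helpers, the context structure, and the argument list of dsched_strategy_request_polling() are assumptions for illustration; only the framework behavior described above comes from the text.

    /*
     * Minimal sketch of a polling-aware dispatch loop.  All foo_* names,
     * the struct layout, and the argument list of
     * dsched_strategy_request_polling() are illustrative assumptions.
     */
    struct bio;
    struct disk;

    struct foo_disk_ctx {
            int          cur_depth;   /* bios currently dispatched */
            int          max_depth;   /* from max_tag_queue_depth */
            struct disk *dp;
    };

    /* hypothetical helpers: pick the next queued bio / find the ctx */
    static struct bio *foo_next_bio(struct foo_disk_ctx *ctx);
    static struct foo_disk_ctx *foo_ctx_of(struct bio *bio);
    /* assumed signature of the real dispatch call */
    void dsched_strategy_request_polling(struct disk *dp, struct bio *bio,
        void *priv);

    static void
    foo_dispatch(struct foo_disk_ctx *ctx)
    {
            struct bio *bio;

            /* keep the disk's tag queue full, but never beyond its depth */
            while (ctx->cur_depth < ctx->max_depth &&
                   (bio = foo_next_bio(ctx)) != NULL) {
                    ctx->cur_depth++;
                    dsched_strategy_request_polling(ctx->dp, bio, ctx);
            }
    }

    /* registered as polling_func: must be MPSAFE and must never block */
    static void
    foo_bio_done(struct bio *bio)
    {
            struct foo_disk_ctx *ctx = foo_ctx_of(bio);

            ctx->cur_depth--;
            foo_dispatch(ctx);      /* the disk can accept more bios now */
    }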
The WF^2Q+ fair queueing algorithm is implemented in sys/kern/dsched/bfq/wf2q.c.
To efficiently implement the operations that WF^2Q+ requires, a data structure named "augmented binary tree" is used. With its help, WF^2Q+ can select a proper thread as described above in O(log N) time, where N is the number of threads in the tree. The insert and delete operations scale as O(log N) as well. Detailed information about implementing WF^2Q+ with an augmented tree is in [3].
Before the implementation of BFQ, tree.h, which contains the definition of the red-black tree in DragonFly, did not support the augment function. Thus tree.h from FreeBSD was ported.
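To illustrate the augmentation (a sketch only; the field and function names are not the ones in wf2q.c): each node carries the smallest virtual deadline in its subtree, and FreeBSD's tree.h keeps this field up to date by invoking the RB_AUGMENT hook after tree rotations. A search guided by min_vd can then find, in O(log N), the minimum-deadline node among those whose eligible time has passed.

    /*
     * Sketch: a red-black tree node augmented with the minimum virtual
     * deadline of its subtree.  Names are illustrative, not the ones
     * used in wf2q.c.
     */
    #include <sys/types.h>

    struct wf2q_node;
    static void wf2q_augment(struct wf2q_node *);
    /* tree.h calls RB_AUGMENT(elm) after rotations; define it first */
    #define RB_AUGMENT(elm)     wf2q_augment(elm)
    #include <sys/tree.h>

    struct wf2q_node {
            RB_ENTRY(wf2q_node) entry;
            uint64_t ve;        /* virtual eligible time (sort key) */
            uint64_t vd;        /* virtual deadline */
            uint64_t min_vd;    /* smallest vd within this subtree */
    };

    static void
    wf2q_augment(struct wf2q_node *n)
    {
            uint64_t m = n->vd;

            if (RB_LEFT(n, entry) != NULL && RB_LEFT(n, entry)->min_vd < m)
                    m = RB_LEFT(n, entry)->min_vd;
            if (RB_RIGHT(n, entry) != NULL && RB_RIGHT(n, entry)->min_vd < m)
                    m = RB_RIGHT(n, entry)->min_vd;
            n->min_vd = m;
    }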
In the current version, a helper thread is used to execute the following operations:
bfq_dequeue()
The bfq_dequeue() function is the core of the BFQ scheduler. It is responsible for serving a thread within a preset time slice, dispatching the bios of that thread, and selecting another thread from the WF^2Q+ fair queue when the current thread runs out of budget. It should be called whenever the disk is idle or about to become idle.
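Its control flow can be pictured roughly as follows. This is a sketch only: every name ending in _sk is an illustrative stand-in, and time-slice expiry and locking are omitted.

    /* Rough sketch of bfq_dequeue()'s decision loop; all _sk names are
     * illustrative stand-ins for the real BFQ internals. */
    struct bfq_sk;
    struct tdio_sk;

    static int  disk_has_room_sk(struct bfq_sk *);
    static struct tdio_sk *wf2q_next_sk(struct bfq_sk *);
    static int  queue_empty_sk(struct tdio_sk *);
    static int  budget_left_sk(struct tdio_sk *);
    static int  too_thinking_sk(struct tdio_sk *);
    static void as_wait_start_sk(struct bfq_sk *, struct tdio_sk *);
    static void dispatch_one_bio_sk(struct bfq_sk *, struct tdio_sk *);
    static void thread_expire_sk(struct bfq_sk *, struct tdio_sk *);

    static struct tdio_sk *bfq_active;  /* thread currently being served */

    static void
    bfq_dequeue_sk(struct bfq_sk *bfq)
    {
            struct tdio_sk *tdio = bfq_active;

            while (disk_has_room_sk(bfq)) {
                    if (tdio == NULL && (tdio = wf2q_next_sk(bfq)) == NULL)
                            break;                  /* nothing to serve */
                    if (queue_empty_sk(tdio)) {
                            if (budget_left_sk(tdio) && !too_thinking_sk(tdio)) {
                                    as_wait_start_sk(bfq, tdio);
                                    break;          /* suspend: AS wait */
                            }
                            /* assign a new budget and requeue, or drop
                             * the thread if it has no pending work */
                            thread_expire_sk(bfq, tdio);
                            tdio = NULL;
                            continue;
                    }
                    dispatch_one_bio_sk(bfq, tdio); /* consumes budget */
                    if (!budget_left_sk(tdio)) {
                            thread_expire_sk(bfq, tdio);
                            tdio = NULL;
                    }
            }
            bfq_active = tdio;
    }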
To avoid blocking ithreads (interrupt threads), in the current version a helper thread dispatches all bios to the lower driver; that is to say, the bfq_dequeue() function is only called by the helper thread.
Originally, bfq_dequeue() could be called by:

* dsched_request_polling_biodone(), which is called by an interrupt thread when an I/O request is completed by the hard drive.
* bfq_queue(), after a user thread pushes its bios to the scheduler.
* bfq_timeout(), after the scheduler finishes suspending.
* bfq_destroy_tdio(), when the tdio being destroyed is being waited on by the scheduler.
Now these callers uniformly send an lwkt message to the helper thread, and all bfq_dequeue() calls are thus serialized.
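In outline, the serialization looks like this; helper_recv_sk() is a hypothetical stand-in, since the exact lwkt message API is deliberately not reproduced here.

    /* Sketch of the helper-thread message loop; the real code uses
     * lwkt messages. */
    enum bfq_msg_sk {
            BFQ_MSG_DEQUEUE_SK,       /* polling biodone / bfq_queue() */
            BFQ_MSG_AS_TIMEOUT_SK,    /* callout alarm */
            BFQ_MSG_DESTROY_TDIO_SK   /* thread teardown */
    };

    static enum bfq_msg_sk helper_recv_sk(void);  /* hypothetical: blocks */
    static void bfq_dequeue_sk(void);             /* stand-ins for the   */
    static void bfq_destroy_tdio_sk(void);        /* real BFQ functions  */

    static void
    bfq_helper_thread_sk(void)
    {
            for (;;) {
                    switch (helper_recv_sk()) {
                    case BFQ_MSG_DEQUEUE_SK:
                    case BFQ_MSG_AS_TIMEOUT_SK:
                            bfq_dequeue_sk();       /* only ever runs here */
                            break;
                    case BFQ_MSG_DESTROY_TDIO_SK:
                            bfq_destroy_tdio_sk();  /* must not touch the
                                                     * recycled thread_io */
                            break;
                    }
            }
    }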
bfq_timeout()
bfq_timeout() needs to acquire BFQ_LOCK, which may cause the calling thread (the callout facility) to block on it. To avoid this situation, in the current version a function that merely sends a message to the helper thread is called when the callout alarm strikes.
bfq_destroy_tdio()
Due to the high latency experienced in some test cases (blogbench), we found that blocking on destroying a thread is not healthy. Therefore, the helper thread now receives a message to destroy a tdio and calls bfq_destroy_tdio() instead. Note that in that function no operation should touch the destroyed thread_io structure, because it may already have been recycled.
As almost all major scheduler operations are serialized (actually, only bfq_queue() and the customized biodone function are exceptions), the performance is not as high as expected, and some benchmarks confirm this. The helper thread seems to be the most likely source of the high latency; this should be fixed in the next version by refactoring all the synchronizing operations and using as little locking as possible.
Ideally, equal budgets would be assigned to all threads, and each thread would use up its budget immediately. However, this abstraction is far from real-world conditions. First, a thread could do random I/O, which is very time consuming. Second, a thread could be CPU-intensive, seldom doing I/O and thus consuming its budget very slowly.
As the BFQ scheduler operates in the service domain and by itself cares nothing about time-domain latency issues, the actual performance (and interactivity) could be hurt by the two types of threads above. As a result, we have to add time-domain restrictions to all threads to ensure low latency.
First, we assign a preset time slice (200 ms) to every thread, and each thread is only served within that interval. If a thread does not use up its budget in time, the scheduler reduces its budget to the amount it has consumed in the current time slice. Note that a lower budget does not mean a lower share of bandwidth: because of the WF^2Q+ algorithm, the thread will simply be selected more frequently.
Second, if a thread that has enough budget pushes no further I/O requests even after the whole scheduler suspends to wait a while for it, its budget is reduced as well. And if the thread consumes its budget too slowly (for example, if at the current speed it would consume less than 2/3 of its budget), it is punished by being charged a full budget. As a result, the next time it is selected will be later than expected.
Third, if a thread uses up its budget within the time slice, its budget gets increased, in one of two ways depending on the situation.
As one would expect, through this process of budget adjustment, every thread is eventually assigned a budget that it can consume just within its time slice.
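The feedback rules above can be summarized in a sketch. It is hedged: the field names, the growth factor, and the exact placement of the tests are illustrative; only the 2/3 threshold itself comes from the text, and the real logic lives in sys/kern/dsched/bfq/.

    struct budget_sk {
            int budget;     /* budget for the next round (bytes) */
            int consumed;   /* service actually received this slice */
            int charged;    /* service charged to the WF^2Q+ tree */
    };

    static int
    min_sk(int a, int b)
    {
            return (a < b ? a : b);
    }

    static void
    budget_feedback_sk(struct budget_sk *b, int exhausted, int max_budget)
    {
            if (exhausted) {
                    /* used everything within the slice: grant more */
                    b->charged = b->budget;
                    b->budget = min_sk(b->budget * 2, max_budget);
            } else if (b->consumed * 3 < b->budget * 2) {
                    /*
                     * Consuming too slowly (less than 2/3 at the current
                     * speed): charge a full budget, so the next virtual
                     * deadline lands later than a fair charge would.
                     */
                    b->charged = b->budget;
                    b->budget = b->consumed;
            } else {
                    /* slice expired normally: shrink to actual usage */
                    b->charged = b->consumed;
                    b->budget = b->consumed;
            }
    }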
It is possible that a thread pushes one bio and then waits for it to complete before pushing another. Although such a thread may be doing sequential I/O, the scheduler could misinterpret this behavior and switch to another thread too early.
To avoid this issue, the AS (anticipatory scheduling) feature is introduced in BFQ: the scheduler suspends for a while when the currently served thread has enough budget but no bio in its queue. If the thread pushes one or more bios while the scheduler is waiting, its service is not interrupted after the scheduler resumes.
However, if a thread takes too long to "think", it cannot benefit from the AS feature. This is described in the next section.
Currently the AS feature is implemented with the help of the callout facility.
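A minimal sketch of that callout usage follows. callout_reset() and callout_stop() are the standard kernel interfaces; the wait length and the message helper are illustrative assumptions that echo the bfq_timeout() notes above.

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/callout.h>

    static struct callout bfq_as_callout_sk;    /* callout_init()ed elsewhere */
    static void send_msg_to_helper_sk(int msg); /* hypothetical, see above */
    #define BFQ_MSG_AS_TIMEOUT_SK   1

    static void
    bfq_as_timeout_sk(void *arg)
    {
            /*
             * Runs in the callout facility: do not take BFQ_LOCK here,
             * just wake the helper thread (see the bfq_timeout() notes
             * above).
             */
            send_msg_to_helper_sk(BFQ_MSG_AS_TIMEOUT_SK);
    }

    static void
    as_wait_start_sk(void)
    {
            /* suspend dispatching and arm the alarm; hz / 20 (~50 ms)
             * is an illustrative wait length */
            callout_reset(&bfq_as_callout_sk, hz / 20,
                bfq_as_timeout_sk, NULL);
    }

    static void
    as_wait_hit_sk(void)
    {
            /* the served thread pushed a bio in time: cancel the alarm */
            callout_stop(&bfq_as_callout_sk);
    }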
ttime_avg, seek_avg and peak_rate
ttime (think time) is the interval between the time when the last bio of a thread is completed and the time when the thread pushes its next bio.
We accumulate the think time and maintain an average value, by which the scheduler judges whether a thread takes too long to "think".
If a thread "thinks" for too long, the AS wait would likely be a waste of time, so we turn off the AS feature for such a thread.
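For instance, the average could be maintained as an exponentially weighted moving average; in this sketch the weights and the threshold are illustrative, not the values BFQ actually uses.

    #define AS_TTIME_LIMIT_US   100000  /* 100 ms, illustrative */

    struct ttime_sk {
            unsigned long ttime_avg_us; /* smoothed think time */
            int           as_disabled;  /* too "thinking": skip AS waits */
    };

    static void
    ttime_update_sk(struct ttime_sk *t, unsigned long sample_us)
    {
            /* keep 7/8 of the history, mix in 1/8 of the new sample */
            t->ttime_avg_us = (t->ttime_avg_us * 7 + sample_us) / 8;
            t->as_disabled = (t->ttime_avg_us > AS_TTIME_LIMIT_US);
    }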
seek_avg is calculated by accumulating the value current_seek_start - last_seek_end. A "seeky" thread tends to be given a smaller budget, and the scheduler does not sample the disk peak rate after serving it.
The peak speed of the hard drive is estimated from the amount of I/O done under certain sampling conditions.
The peak rate is used to adapt the max budget automatically:

max_budget = peak_rate * time_slice_length
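As an illustrative calculation (not a measured value): a disk with a peak rate of 100 MB/s transfers roughly 105 bytes per microsecond, so with the 200 ms time slice the formula yields max_budget ≈ 105 * 200000 ≈ 21 MB. This shows why the automatically adapted max budget tends to be much larger than the manual settings suggested below.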
dsched_debug
We have defined three dsched_debug levels:
* BFQ_DEBUG_CRITICAL: prints errors and warnings.
* BFQ_DEBUG_NORMAL: prints important, infrequently occurring scheduler decisions.
* BFQ_DEBUG_VERBOSE: prints all scheduler decisions.
We also make use of the KTR facility to print the seek_avg and ttime_avg before a thread is destroyed. To enable KTR, add the following lines to your kernel configuration file:
options KTR
options KTR_DSCHED_BFQ
sysctl subtree

BFQ creates a subtree under the node dsched for every device using it. The subtree has the following nodes:
* max_budget: [R/W] Global maximum budget; if the auto max budget feature is turned on, this is the automatically adjusted maximum budget.
* peak_rate: [R] Estimated disk speed, in units of 1/1024 byte per microsecond (fixed-point representation).
* peak_samples: [R] Number of valid samples used to calculate the peak rate. It stops changing after reaching 80.
* as_miss: [R] Number of times a thread pushed no bio after an AS wait.
* as_hit: [R] Number of times a thread pushed at least one bio after an AS wait.
* as_wait_avg_all: [R] Average AS waiting time (ms).
* as_wait_avg_miss: [R] Average AS waiting time (ms) when the AS wait is missed.
* as_wait_max: [R] Maximum AS waiting time (ms), measured in the helper thread.
* as_wait_max2: [R] Maximum AS waiting time (ms), measured in the callout callback.
* as_high_wait_count: [R] Number of times the scheduler did an AS wait longer than 50 ms, measured in the helper thread.
* as_high_wait_count2: [R] Number of times the scheduler did an AS wait longer than 50 ms, measured in the callout callback.
* avg_time_slice: [R] Average length of the time slice.
* max_time_slice: [R] Maximum length of the time slice.
* as_switch: [R/W] Switch controlling the global AS feature.
* auto_max_budget_switch: [R/W] Switch controlling the automatic max budget adjustment feature.
BFQ currently has two tunable parameters: the global AS switch and the max budget.
It has been reported that turning AS on may hurt interactivity and greatly increase maximum latency, probably due to the over-serialized implementation of BFQ. However, the blogbench results show that turning AS on also increases throughput greatly.
Suggestion: turn the AS feature on, since it has little effect on average latency.
A thread can be assigned a budget of at most the max budget. Generally, a higher max budget means higher throughput because of fewer operations on the WF^2Q+ augmented tree, while a lower one forces the scheduler to spend more time on those operations.
However, real-world experiments show that too high a budget hurts interactivity heavily. Too low a budget also causes higher latency, and if the budget is less than 64 KB (65536 bytes), which is smaller than some bios, the scheduler degenerates into a round-robin scheduler, which is not a good mode of operation for a disk scheduler.
Suggestions: do not use the auto max budget; it is usually too high. A budget around 1/10 of the automatic max budget may be proper. In general, 512 KB (the default), 256 KB, and 192 KB are good values. It is best to determine the max budget by a binary search guided by the results of some benchmarks.
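For example, assuming the subtree for disk ad0 appears as dsched.bfq.ad0 (the exact node path may differ on your system), the max budget could be set to 256 KB with the standard sysctl utility:

    sysctl dsched.bfq.ad0.max_budget=262144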
See http://leaf.dragonflybsd.org/~brillsp/bfq_bench/bfq_bench.html for benchmark results.
Known issues:

* Switching the dsched policy away from BFQ may deadlock the system. (This happens when the sysctl process and the helper thread are on the same CPU.)
* Sync and async bios are not yet treated differently: as the async ones take less time to complete, their budget and time-slice length should differ from those of the sync bios.
[1] Paolo Valente and Fabio Checconi, "High Throughput Disk Scheduling with Fair Bandwidth Distribution", IEEE Transactions on Computers, vol. 59, no. 9.
[2] http://retis.sssup.it/~fabio/linux/bfq/patches/
[3] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline First: A Flexible and Accurate Mechanism for Proportional Share Resource Allocation".