The BFQ disk scheduler was invented by Paolo Valente. The current version of BFQ in DragonFlyBSD is implemented according to his technical report [1]. Some additional features have been added to this version; they are inspired by the Linux version [2], but are written entirely from scratch.
Like the CFQ (Completely Fair Queueing) disk scheduler under Linux, BFQ is a fair queueing scheduler that aims to improve interactivity and lower the latency of the system. Maximizing throughput, however, is not a major design goal of BFQ. It is therefore best to switch to BFQ on computers used as desktops, where interactivity generally matters more than throughput.
The core concept of BFQ is the "budget" of every thread: the maximum amount of service (measured by the size of the I/O requests) that a thread can receive while the scheduler is serving it exclusively. Once a thread uses up its budget, it is taken off the scheduler, assigned a new budget, and queued (again) into the fair queue. BFQ then selects another thread to serve exclusively.
BFQ is based on a fair queueing algorithm named WF^2Q+. This algorithm was first used on routers to fairly dispatch network packets from various connections. If we replace the terms "packets" and "connections" with "I/O requests" and "threads (or processes)", we arrive at the basic idea of how this algorithm is applied to the BFQ scheduler.
The WF^2Q+ algorithm decides which thread is selected and served by BFQ when the last thread uses up its budget. It is based on the notion of "virtual time", which is actually service offered and received (measured in bytes or sectors in the implementation). The algorithm maintains a global virtual time, which is the amount of service offered globally, and two attributes for every thread: the virtual eligible time and the virtual deadline. The former is the total service the thread has received, while the latter is the expected "time" of selection: the thread expects to be selected when the global virtual time reaches its deadline.
The WF^2Q+ algorithm always selects the thread with the minimum deadline among the threads whose eligible time is no later than the global virtual time. Intuitively, if all threads consume the same amount of budget, they are selected alternately and receive the same share of disk bandwidth; if one thread consumes more budget than the others, it is selected less often.
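As a rough sketch of this bookkeeping (notation following [3]; the real BFQ formulas add further refinements), suppose thread i is queued with budget B_i and weight w_i (weights are equal in the simplest case). Then:

    E_i  = max(V, D_i_prev)           (virtual eligible time)
    D_i  = E_i + B_i / w_i            (virtual deadline)
    next = argmin { D_i : E_i <= V }  (selection rule)

Here V is the global virtual time, which advances with the amount of service delivered; a thread charged a larger budget therefore receives a later deadline and waits longer for its next turn.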
The BFQ scheduler is written on top of the dsched framework. However, it needs more from dsched than the framework could provide: the scheduler has to be notified when the disk is idle or about to become idle, and only with this notification can it dispatch further I/O requests to the driver. Therefore, before implementing the scheduler itself, a request polling feature was added to the dsched framework.
Before request polling was implemented, the dsched framework did not have a dequeue() interface for scheduling policies running on top of it. Instead, it provided strategy() functions for a scheduler to call when it "guesses" that the disk may be able to receive more I/O requests.
The request polling feature transfers the guessing work to dsched by maintaining a variable called tag_queue_depth, which is the estimated depth of the disk's NCQ or TCQ. A variable called max_tag_queue_depth is initialized to the maximum depth of the disk's TCQ or NCQ, which can be acquired from the driver.
The request polling feature is not restricted to BFQ; it can be made use of by any policy on the dsched framework. To use this feature, a policy must do the following (a code sketch follows the list):

* Monitor current_tag_queue_depth, and push as many bios as it can until the depth reaches the maximum value. Monitoring can be achieved by tracking when bios are done and when bios are pushed to the scheduler through dsched's queue() interface. For the former, the policy may register a polling_func callback, which is called by dsched when a bio dispatched by dsched_strategy_request_polling() is done.

* Call dsched_strategy_request_polling() to dispatch the bios. This strategy() call will decrease current_tag_queue_depth. Note that unlike dsched_strategy_async(), a policy cannot register a per-request biodone() callback that gets called when the dispatched bio is done. Instead, if such a callback is needed, the policy should register a callback function (of type dsched_bio_done_t) by assigning it to polling_func in the policy structure. Note: this function must not block (e.g., on locks) and must be MPSAFE; it must not be changed after the prepare() interface is called.
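The following is a minimal sketch of such a dispatch loop. It is hedged: the foo_* helpers, the context structure, and the argument list of dsched_strategy_request_polling() are assumptions for illustration; only the framework behavior described above comes from the text.

    /*
     * Minimal sketch of a polling-aware dispatch loop.  All foo_* names,
     * the struct layout, and the argument list of
     * dsched_strategy_request_polling() are illustrative assumptions.
     */
    struct bio;
    struct disk;

    struct foo_disk_ctx {
            int          cur_depth;   /* bios currently dispatched */
            int          max_depth;   /* from max_tag_queue_depth */
            struct disk *dp;
    };

    /* hypothetical helpers: pick the next queued bio / find the ctx */
    static struct bio *foo_next_bio(struct foo_disk_ctx *ctx);
    static struct foo_disk_ctx *foo_ctx_of(struct bio *bio);
    /* assumed signature of the real dispatch call */
    void dsched_strategy_request_polling(struct disk *dp, struct bio *bio,
        void *priv);

    static void
    foo_dispatch(struct foo_disk_ctx *ctx)
    {
            struct bio *bio;

            /* keep the disk's tag queue full, but never beyond its depth */
            while (ctx->cur_depth < ctx->max_depth &&
                   (bio = foo_next_bio(ctx)) != NULL) {
                    ctx->cur_depth++;
                    dsched_strategy_request_polling(ctx->dp, bio, ctx);
            }
    }

    /* registered as polling_func: must be MPSAFE and must never block */
    static void
    foo_bio_done(struct bio *bio)
    {
            struct foo_disk_ctx *ctx = foo_ctx_of(bio);

            ctx->cur_depth--;
            foo_dispatch(ctx);      /* the disk can accept more bios now */
    }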
The WF^2Q+ fair queueing algorithm is implemented in sys/kern/dsched/bfq/wf2q.c.
To efficiently implement the operations that WF^2Q+ requires, a data structure named "augmented binary tree" is used. With its help, WF^2Q+ can select a proper thread as described above in O(log N) time, where N is the number of threads in the tree. The insert and delete operations scale as O(log N) as well. Detailed information about implementing WF^2Q+ with an augmented tree is in [3].
Before the implementation of BFQ, tree.h, which contains the definition of the red-black tree in DragonFly, did not support the augment function. Thus tree.h from FreeBSD was ported.
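To illustrate the augmentation (a sketch only; the field and function names are not the ones in wf2q.c): each node carries the smallest virtual deadline in its subtree, and FreeBSD's tree.h keeps this field up to date by invoking the RB_AUGMENT hook after tree rotations. A search guided by min_vd can then find, in O(log N), the minimum-deadline node among those whose eligible time has passed.

    /*
     * Sketch: a red-black tree node augmented with the minimum virtual
     * deadline of its subtree.  Names are illustrative, not the ones
     * used in wf2q.c.
     */
    #include <sys/types.h>

    struct wf2q_node;
    static void wf2q_augment(struct wf2q_node *);
    /* tree.h calls RB_AUGMENT(elm) after rotations; define it first */
    #define RB_AUGMENT(elm)     wf2q_augment(elm)
    #include <sys/tree.h>

    struct wf2q_node {
            RB_ENTRY(wf2q_node) entry;
            uint64_t ve;        /* virtual eligible time (sort key) */
            uint64_t vd;        /* virtual deadline */
            uint64_t min_vd;    /* smallest vd within this subtree */
    };

    static void
    wf2q_augment(struct wf2q_node *n)
    {
            uint64_t m = n->vd;

            if (RB_LEFT(n, entry) != NULL && RB_LEFT(n, entry)->min_vd < m)
                    m = RB_LEFT(n, entry)->min_vd;
            if (RB_RIGHT(n, entry) != NULL && RB_RIGHT(n, entry)->min_vd < m)
                    m = RB_RIGHT(n, entry)->min_vd;
            n->min_vd = m;
    }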
In the current version, a helper thread is used to execute the following operations:
bfq_dequeue()
The bfq_dequeue() function is the core of the BFQ scheduler. It is responsible for serving a thread within a preset time slice, dispatching the bios of that thread, and selecting another thread from the WF^2Q+ fair queue when the current thread runs out of budget. It should be called whenever the disk is idle or about to become idle.
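Its control flow can be pictured roughly as follows. This is a sketch only: every name ending in _sk is an illustrative stand-in, and time-slice expiry and locking are omitted.

    /* Rough sketch of bfq_dequeue()'s decision loop; all _sk names are
     * illustrative stand-ins for the real BFQ internals. */
    struct bfq_sk;
    struct tdio_sk;

    static int  disk_has_room_sk(struct bfq_sk *);
    static struct tdio_sk *wf2q_next_sk(struct bfq_sk *);
    static int  queue_empty_sk(struct tdio_sk *);
    static int  budget_left_sk(struct tdio_sk *);
    static int  too_thinking_sk(struct tdio_sk *);
    static void as_wait_start_sk(struct bfq_sk *, struct tdio_sk *);
    static void dispatch_one_bio_sk(struct bfq_sk *, struct tdio_sk *);
    static void thread_expire_sk(struct bfq_sk *, struct tdio_sk *);

    static struct tdio_sk *bfq_active;  /* thread currently being served */

    static void
    bfq_dequeue_sk(struct bfq_sk *bfq)
    {
            struct tdio_sk *tdio = bfq_active;

            while (disk_has_room_sk(bfq)) {
                    if (tdio == NULL && (tdio = wf2q_next_sk(bfq)) == NULL)
                            break;                  /* nothing to serve */
                    if (queue_empty_sk(tdio)) {
                            if (budget_left_sk(tdio) && !too_thinking_sk(tdio)) {
                                    as_wait_start_sk(bfq, tdio);
                                    break;          /* suspend: AS wait */
                            }
                            /* assign a new budget and requeue, or drop
                             * the thread if it has no pending work */
                            thread_expire_sk(bfq, tdio);
                            tdio = NULL;
                            continue;
                    }
                    dispatch_one_bio_sk(bfq, tdio); /* consumes budget */
                    if (!budget_left_sk(tdio)) {
                            thread_expire_sk(bfq, tdio);
                            tdio = NULL;
                    }
            }
            bfq_active = tdio;
    }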
To avoid blocking ithreads (interrupt threads), in the current version a helper thread dispatches all bios to the lower driver; that is to say, the bfq_dequeue() function is only called by the helper thread.
Originally, bfq_dequeue() could be called by:

* dsched_request_polling_biodone(), which is called by an interrupt thread when an I/O request is completed by the hard drive.
* bfq_queue(), after a user thread pushes its bios to the scheduler.
* bfq_timeout(), after the scheduler finishes suspending.
* bfq_destroy_tdio(), when the tdio being destroyed is being waited on by the scheduler.
Now these callers uniformly send an lwkt message to the helper thread, and all bfq_dequeue() calls are thus serialized.
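In outline, the serialization looks like this; helper_recv_sk() is a hypothetical stand-in, since the exact lwkt message API is deliberately not reproduced here.

    /* Sketch of the helper-thread message loop; the real code uses
     * lwkt messages. */
    enum bfq_msg_sk {
            BFQ_MSG_DEQUEUE_SK,       /* polling biodone / bfq_queue() */
            BFQ_MSG_AS_TIMEOUT_SK,    /* callout alarm */
            BFQ_MSG_DESTROY_TDIO_SK   /* thread teardown */
    };

    static enum bfq_msg_sk helper_recv_sk(void);  /* hypothetical: blocks */
    static void bfq_dequeue_sk(void);             /* stand-ins for the   */
    static void bfq_destroy_tdio_sk(void);        /* real BFQ functions  */

    static void
    bfq_helper_thread_sk(void)
    {
            for (;;) {
                    switch (helper_recv_sk()) {
                    case BFQ_MSG_DEQUEUE_SK:
                    case BFQ_MSG_AS_TIMEOUT_SK:
                            bfq_dequeue_sk();       /* only ever runs here */
                            break;
                    case BFQ_MSG_DESTROY_TDIO_SK:
                            bfq_destroy_tdio_sk();  /* must not touch the
                                                     * recycled thread_io */
                            break;
                    }
            }
    }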
bfq_timeout()
bfq_timeout() needs to acquire BFQ_LOCK, which may cause the calling thread (the callout facility) to block on it. To avoid this situation, in the current version a function that merely sends a message to the helper thread is called when the callout alarm strikes.
bfq_destroy_tdio()
Due to the high latency experienced in some test cases (blogbench), we found that blocking on destroying a thread is not healthy. Therefore, the helper thread now receives a message to destroy a tdio and calls bfq_destroy_tdio() instead. Note that in that function no operation should touch the destroyed thread_io structure, because it may already have been recycled.
As almost all major scheduler operations are serialized (actually, only bfq_queue() and the customized biodone function are exceptions), the performance is not as high as expected, and some benchmarks confirm this. The helper thread seems to be the most likely source of the high latency; this should be fixed in the next version by refactoring all the synchronizing operations and using as little locking as possible.
Ideally, equal budgets would be assigned to all threads, and each thread would use up its budget immediately. However, this abstraction is far from real-world conditions. First, a thread could do random I/O, which is very time consuming. Second, a thread could be CPU-intensive, seldom doing I/O and thus consuming its budget very slowly.
As the BFQ scheduler operates in the service domain and by itself cares nothing about time-domain latency issues, the actual performance (and interactivity) could be hurt by the two types of threads above. As a result, we have to add time-domain restrictions to all threads to ensure low latency.
First, we assign a preset time slice (200 ms) to every thread, and each thread is only served within that interval. If a thread does not use up its budget in time, the scheduler reduces its budget to the amount it has consumed in the current time slice. Note that a lower budget does not mean a lower share of bandwidth: because of the WF^2Q+ algorithm, the thread will simply be selected more frequently.
Second, if a thread that has enough budget pushes no further I/O requests even after the whole scheduler suspends to wait a while for it, its budget is reduced as well. And if the thread consumes its budget too slowly (for example, if at the current speed it would consume less than 2/3 of its budget), it is punished by being charged a full budget. As a result, the next time it is selected will be later than expected.
Third, if a thread uses up its budget within the time slice, its budget gets increased, in one of two ways depending on the situation.
As one would expect, through this process of budget adjustment, every thread is eventually assigned a budget that it can consume just within its time slice.
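The feedback rules above can be summarized in a sketch. It is hedged: the field names, the growth factor, and the exact placement of the tests are illustrative; only the 2/3 threshold itself comes from the text, and the real logic lives in sys/kern/dsched/bfq/.

    struct budget_sk {
            int budget;     /* budget for the next round (bytes) */
            int consumed;   /* service actually received this slice */
            int charged;    /* service charged to the WF^2Q+ tree */
    };

    static int
    min_sk(int a, int b)
    {
            return (a < b ? a : b);
    }

    static void
    budget_feedback_sk(struct budget_sk *b, int exhausted, int max_budget)
    {
            if (exhausted) {
                    /* used everything within the slice: grant more */
                    b->charged = b->budget;
                    b->budget = min_sk(b->budget * 2, max_budget);
            } else if (b->consumed * 3 < b->budget * 2) {
                    /*
                     * Consuming too slowly (less than 2/3 at the current
                     * speed): charge a full budget, so the next virtual
                     * deadline lands later than a fair charge would.
                     */
                    b->charged = b->budget;
                    b->budget = b->consumed;
            } else {
                    /* slice expired normally: shrink to actual usage */
                    b->charged = b->consumed;
                    b->budget = b->consumed;
            }
    }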
It is possible that a thread pushes one bio and then waits for it to complete before pushing another. Although such a thread may be doing sequential I/O, the scheduler could misinterpret this behavior and switch to another thread too early.
To avoid this issue, the AS (anticipatory scheduling) feature is introduced in BFQ: the scheduler suspends for a while when the currently served thread has enough budget but no bio in its queue. If the thread pushes one or more bios while the scheduler is waiting, its service is not interrupted after the scheduler resumes.
However, if a thread takes too long to "think", it cannot benefit from the AS feature. This is described in the next section.
Currently the AS feature is implemented with the help of the callout facility.
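A minimal sketch of that callout usage follows. callout_reset() and callout_stop() are the standard kernel interfaces; the wait length and the message helper are illustrative assumptions that echo the bfq_timeout() notes above.

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/callout.h>

    static struct callout bfq_as_callout_sk;    /* callout_init()ed elsewhere */
    static void send_msg_to_helper_sk(int msg); /* hypothetical, see above */
    #define BFQ_MSG_AS_TIMEOUT_SK   1

    static void
    bfq_as_timeout_sk(void *arg)
    {
            /*
             * Runs in the callout facility: do not take BFQ_LOCK here,
             * just wake the helper thread (see the bfq_timeout() notes
             * above).
             */
            send_msg_to_helper_sk(BFQ_MSG_AS_TIMEOUT_SK);
    }

    static void
    as_wait_start_sk(void)
    {
            /* suspend dispatching and arm the alarm; hz / 20 (~50 ms)
             * is an illustrative wait length */
            callout_reset(&bfq_as_callout_sk, hz / 20,
                bfq_as_timeout_sk, NULL);
    }

    static void
    as_wait_hit_sk(void)
    {
            /* the served thread pushed a bio in time: cancel the alarm */
            callout_stop(&bfq_as_callout_sk);
    }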
ttime_avg, seek_avg and peak_rate
ttime (think time) is the interval between the time when the last bio of a thread is completed and the time when the thread pushes its next bio.
We accumulate the think time and maintain an average value, by which the scheduler judges whether a thread takes too long to "think".
If a thread "thinks" for too long, the AS wait would likely be a waste of time, so we turn off the AS feature for such a thread.
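For instance, the average could be maintained as an exponentially weighted moving average; in this sketch the weights and the threshold are illustrative, not the values BFQ actually uses.

    #define AS_TTIME_LIMIT_US   100000  /* 100 ms, illustrative */

    struct ttime_sk {
            unsigned long ttime_avg_us; /* smoothed think time */
            int           as_disabled;  /* too "thinking": skip AS waits */
    };

    static void
    ttime_update_sk(struct ttime_sk *t, unsigned long sample_us)
    {
            /* keep 7/8 of the history, mix in 1/8 of the new sample */
            t->ttime_avg_us = (t->ttime_avg_us * 7 + sample_us) / 8;
            t->as_disabled = (t->ttime_avg_us > AS_TTIME_LIMIT_US);
    }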
seek_avg is calculated by accumulating the value current_seek_start - last_seek_end. A "seeky" thread tends to be given a smaller budget, and the scheduler does not sample the disk peak rate after serving it.
The peak speed of the hard drive is estimated from the amount of I/O done under certain sampling conditions.
The peak rate is used to adapt the max budget automatically:

max_budget = peak_rate * time_slice_length
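As an illustrative calculation (not a measured value): a disk with a peak rate of 100 MB/s transfers roughly 105 bytes per microsecond, so with the 200 ms time slice the formula yields max_budget ≈ 105 * 200000 ≈ 21 MB. This shows why the automatically adapted max budget tends to be much larger than the manual settings suggested below.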
dsched_debug
We have defined three dsched_debug levels:
* BFQ_DEBUG_CRITICAL: prints errors and warnings.
* BFQ_DEBUG_NORMAL: prints important, infrequently occurring scheduler decisions.
* BFQ_DEBUG_VERBOSE: prints all scheduler decisions.
We also make use of the KTR facility to print the seek_avg and ttime_avg before a thread is destroyed. To enable KTR, add the following lines to your kernel configuration file:
options KTR
options KTR_DSCHED_BFQ
sysctl subtree

BFQ creates a subtree under the node dsched for every device using it. The subtree has the following nodes:
* max_budget: [R/W] Global maximum budget; if the auto max budget feature is turned on, this is the automatically adjusted maximum budget.
* peak_rate: [R] Estimated disk speed, in units of 1/1024 byte per microsecond (fixed-point representation).
* peak_samples: [R] Number of valid samples used to calculate the peak rate. It stops changing after reaching 80.
* as_miss: [R] Number of times a thread pushed no bio after an AS wait.
* as_hit: [R] Number of times a thread pushed at least one bio after an AS wait.
* as_wait_avg_all: [R] Average AS waiting time (ms).
* as_wait_avg_miss: [R] Average AS waiting time (ms) when the AS wait is missed.
* as_wait_max: [R] Maximum AS waiting time (ms), measured in the helper thread.
* as_wait_max2: [R] Maximum AS waiting time (ms), measured in the callout callback.
* as_high_wait_count: [R] Number of times the scheduler did an AS wait longer than 50 ms, measured in the helper thread.
* as_high_wait_count2: [R] Number of times the scheduler did an AS wait longer than 50 ms, measured in the callout callback.
* avg_time_slice: [R] Average length of the time slice.
* max_time_slice: [R] Maximum length of the time slice.
* as_switch: [R/W] Switch controlling the global AS feature.
* auto_max_budget_switch: [R/W] Switch controlling the automatic max budget adjustment feature.
BFQ currently has two tunable parameters: the global AS switch and the max budget.
It has been reported that turning AS on may hurt interactivity and greatly increase maximum latency, probably due to the over-serialized implementation of BFQ. However, the blogbench results show that turning AS on also increases throughput greatly.
Suggestion: turn the AS feature on, since it has little effect on average latency.
A thread can be assigned a budget of at most the max budget. Generally, a higher max budget means higher throughput because of fewer operations on the WF^2Q+ augmented tree, while a lower one forces the scheduler to spend more time on those operations.
However, real-world experiments show that too high a budget hurts interactivity heavily. Too low a budget also causes higher latency, and if the budget is less than 64 KB (65536 bytes), which is smaller than some bios, the scheduler degenerates into a round-robin scheduler, which is not a good mode of operation for a disk scheduler.
Suggestions: do not use the auto max budget; it is usually too high. A budget around 1/10 of the automatic max budget may be proper. In general, 512 KB (the default), 256 KB, and 192 KB are good values. It is best to determine the max budget by a binary search guided by the results of some benchmarks.
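For example, assuming the subtree for disk ad0 appears as dsched.bfq.ad0 (the exact node path may differ on your system), the max budget could be set to 256 KB with the standard sysctl utility:

    sysctl dsched.bfq.ad0.max_budget=262144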
See http://leaf.dragonflybsd.org/~brillsp/bfq_bench/bfq_bench.html for benchmark results.
Known issues:

* Switching the dsched policy away from BFQ may deadlock the system. (This happens when the sysctl process and the helper thread are on the same CPU.)
* Sync and async bios are not yet treated differently: as the async ones take less time to complete, their budget and time-slice length should differ from those of the sync bios.
[1] Paolo Valente and Fabio Checconi, "High Throughput Disk Scheduling with Fair Bandwidth Distribution", IEEE Transactions on Computers, vol. 59, no. 9.
[2] http://retis.sssup.it/~fabio/linux/bfq/patches/
[3] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline First: A Flexible and Accurate Mechanism for Proportional Share Resource Allocation".