【第6回】元東大教員から学ぶLinuxカーネル「プロセススケジューリング」

本記事の信頼性

リアルタイムシステムの研究歴12年．
東大教員の時に，英語でOS（Linuxカーネル）の授業．
2012年9月～2013年8月にアメリカのノースカロライナ大学チャペルヒル校（UNC）コンピュータサイエンス学部で客員研究員として勤務．C言語でリアルタイムLinuxの研究開発．
プログラミング歴15年以上，習得している言語: C/C++，Python，Solidity/Vyper，Java，Ruby，Go，Rust，D，HTML/CSS/JS/PHP，MATLAB，Verse（UEFN）, Assembler (x64，aarch64)．
東大教員の時に，C++言語で開発した「LLVMコンパイラの拡張」，C言語で開発した独自のリアルタイムOS「Mcube Kernel」をGitHubにオープンソースとして公開．
2020年1月～現在はアメリカのノースカロライナ州チャペルヒルにあるGuarantee Happiness LLCのCTOとしてECサイト開発やWeb/SNSマーケティングの業務．2022年6月～現在はアメリカのノースカロライナ州チャペルヒルにあるJapanese Tar Heel, Inc.のCEO兼CTO．
最近は自然言語処理AIとイーサリアムに関する有益な情報発信や，Unreal Editor for Fortnite（UEFN）でゲーム開発に従事．

（AI全般を含む）自然言語処理AIの論文の日本語訳や，AIチャットボット（ChatGPT，Auto-GPT，Gemini（旧Bard）など）の記事を50本以上執筆．アメリカのサンフランシスコ（広義のシリコンバレー）の会社でChatGPT/Geminiを訓練するプロンプトエンジニア・マネージャー・Quality Assurance（QA）の業務委託の経験あり．
（スマートコントラクトのプログラミングを含む）イーサリアムや仮想通貨全般の記事を200本以上執筆．イギリスのロンドンの会社で仮想通貨の英語の記事を日本語に翻訳する業務委託の経験あり．
UEFNで10本以上のゲームを開発し，フォートナイト上で公開（Fortnite，Fortnite.GG）．

こういった私から学べます．

前回を読んでいない方はこちらからどうぞ．

: 【第5回】元東大教員から学ぶLinuxカーネル「プロセス管理」

こういった私から学べます．前回を読んでいない方はこちらからどうぞ． Linuxカーネルの記事一覧はこちらからどうぞ．今回のテーマはプロセス管理です．プロセス管理により，プロセスのライフサイクルが ...

続きを見る

Linuxカーネルの記事一覧はこちらからどうぞ．

: 元東大教員から学ぶLinuxカーネル

こういった私から学べます． Linuxカーネルとは，C言語で開発されたオープンソースのOSです． Linuxカーネルは主に以下のコンピュータで広く利用されています．スーパーコンピュータサーバ An ...

続きを見る

LinuxカーネルはC言語で書かれています．

私にC言語の無料相談をしたいあなたは，公式LINE「ChishiroのC言語」の友だち追加をお願い致します．

私のキャパシティもあり，一定数に達したら終了しますので，今すぐ追加しましょう！

独学が難しいあなたは，元東大教員がおすすめするC言語を学べるオンラインプログラミングスクール5社で自分に合うスクールを見つけましょう．後悔はさせません！

: 元東大教員がおすすめするC言語を学べるオンラインプログラミングスクール5社

こういった悩みにお答えします．こういった私が解説していきます．私が大学の授業で初めてC言語を勉強した時は全然できませんでしたが，先生やTA，友人に相談しながら一生懸命C言語を勉強してできるようにな ...

続きを見る

今回のテーマはプロセススケジューリングです．

Linuxカーネルがどのようにタスクの実行順序を制御しているかがわかります．

プロセススケジューリングとは

プロセススケジューリングとは，次にどのプロセスを，いつ，どれだけの時間実行するかを決定する機構のことです．

プロセススケジューリングの目的は，プロセッサ（CPU）を最大限に活用することです．

例えば，以下のような要件があります．

待機中のプロセスのためにCPUサイクルを浪費したくない
優先度の高いプロセスに高い優先度を与えたい
低優先度プロセスを飢餓状態にしたくない

マルチタスクは，複数プロセスをインターリーブに実行することです．

シングルコアのCPUスケジューラでは，同時に1プロセスしか実行できないため，複数のプロセスが並行に実行します（並列ではない）．

マルチコアのCPUスケジューラでは，同時に複数プロセスを実行できるため，真の並列処理が可能になります．

マルチタスクOSの種類は以下になります．

ノンプリエンプティブマルチタスク（協調的マルチタスク）
- 古いOS（例：Windows 3.1）や少数の言語ランタイム（例：Go言語のランタイム）で採用されている．
- プロセスは，CPUを解放するまで実行を停止しない．
- OSが公平なスケジューリングを強制できない．
プリエンプティブマルチタスク（非協調的マルチタスク）
- ほぼ全ての最新OSで採用されている．
- OSはプロセスの優先順位によって決まるタイムスライスを経過した後，プロセスの実行をプリエンプション（横取り）することができる．

スケジューリングポリシー

スケジューリングポリシーでは，I/OバウンドなタスクとCPUバウンドなタスクの処理を考慮します．

I/Oバウンドなプロセス
- ディスク，ネットワーク，キーボード，マウス等のI/O待ちでほとんどの時間を消費する．
- 短時間しか実行されない．
- レスポンスタイムが重要である．
CPUバウンドなプロセス
- 科学計算等の処理でCPUを多く利用する．
- 長時間動作させるとキャッシュが熱くなる（多くヒットする）．

また，スケジューリングポリシーはタスクの優先度を考慮します．

プロセスの価値とプロセッサ時間の必要性に基づいてランク付けを行う．
優先度の高いプロセスが優先度の低いプロセスより先に実行される．

Linuxには2つの優先順位範囲があります．

nice値：範囲は-20から+19まで（デフォルトは0）
- niceの値が高いほど優先度が低い．
リアルタイム優先度：範囲は0～99
- 値が高いほど優先度が高い．
- リアルタイム・プロセスは常に標準的な（nice）プロセスより先に実行されます．

Linuxでnice値とリアルタイム優先度のプロセスを知りたいあなたは，以下のコマンドを実行しましょう！

PIDがプロセスID，NIがnice値，RTPRIOがリアルタイム優先度，CMDがプロセス名になります．

$ ps ax -eo pid,ni,rtprio,cmd
    PID  NI RTPRIO CMD
      1   0      - /sbin/init splash
      2   0      - [kthreadd]
      3 -20      - [rcu_gp]
      4 -20      - [rcu_par_gp]
      6 -20      - [kworker/0:0H-kblockd]
      9 -20      - [mm_percpu_wq]
     10   0      - [rcu_tasks_kthre]
     11   0      - [rcu_tasks_rude_]
     12   0      - [rcu_tasks_trace]
     13   0      - [ksoftirqd/0]
     14   -      1 [rcu_preempt]
     15   -      1 [rcub/1]
     16   -      1 [rcuc/0]
     17   -     99 [migration/0]
     18   -      1 [irq_work/0]
     19   -     50 [idle_inject/0]
     20   0      - [cpuhp/0]
     21   0      - [cpuhp/1]
     22   -     50 [idle_inject/1]
     23   -      1 [irq_work/1]
     24   -     99 [migration/1]
     25   -      1 [rcuc/1]
     26   0      - [ksoftirqd/1]
     27   0      - [kworker/1:0-events]
     28 -20      - [kworker/1:0H-events_highpri]
...

$ ps ax -eo pid,ni,rtprio,cmd

PID NI RTPRIO CMD

1 0 - /sbin/init splash

2 0 - [kthreadd]

3 -20 - [rcu_gp]

4 -20 - [rcu_par_gp]

6 -20 - [kworker/0:0H-kblockd]

9 -20 - [mm_percpu_wq]

10 0 - [rcu_tasks_kthre]

11 0 - [rcu_tasks_rude_]

12 0 - [rcu_tasks_trace]

13 0 - [ksoftirqd/0]

14 - 1 [rcu_preempt]

15 - 1 [rcub/1]

16 - 1 [rcuc/0]

17 - 99 [migration/0]

18 - 1 [irq_work/0]

19 - 50 [idle_inject/0]

20 0 - [cpuhp/0]

21 0 - [cpuhp/1]

22 - 50 [idle_inject/1]

23 - 1 [irq_work/1]

24 - 99 [migration/1]

25 - 1 [rcuc/1]

26 0 - [ksoftirqd/1]

27 0 - [kworker/1:0-events]

28 -20 - [kworker/1:0H-events_highpri]

...

上図にユーザ空間とカーネル空間から見たLinuxのプロセス優先度を示します．

ユーザ空間の優先度はリアルタイムは99～0（100レベル），nice値のノンリアルタイム優先度は-20～19（40レベル）になります（合計140レベル）．

ユーザ空間のリアルタイムの優先度は数値が大きいほど高くなりますが（99が最高優先度），ノンリアルタイムの優先度は数値が小さいほど高くなる（-20が最高優先度）になることに注意して下さい．

※topコマンドを実行する場合，リアルタイムの優先度は-100～-1，nice値の優先度は0～39になります．

これに対して，task_struct構造体のprioメンバの変数が管理するカーネル空間の優先度の範囲0～139（140レベル）で，0が最高優先度になります．

また，task_struct構造体のrt_prorityメンバ変数が管理するリアルタイムの優先度の範囲は0～99で，99が最高優先度になります（いわゆるユーザ空間のリアルタイムの優先度）．

スケジューリングポリシーはタイムスライスを考慮します．

タイムスライスとは，プロセスがプリエンプションされるまでに実行すべき時間のことである．

デフォルトのタイムスライスを絶対的に定義するのは難しい．

長すぎる場合はインタラクティブ性能が低下し，短すぎる場合はコンテキストスイッチのオーバーヘッドが大きくなります．

LinuxのCompletely Fair Scheduler（CFS）

LinuxのCompletely Fair Scheduler（CFS）を紹介します．

CFSはLinuxカーネルのバージョン2.6.23から採用されたプロセススケジューラで，現在まで利用されています．

※バージョン2.6.0～の2.6.22（2003～2007年）はO（1）スケジューラが採用されていました（解説動画は以下）．

Linux CFSは絶対的なタイムスライスを利用しません．

プロセスが受け取るタイムスライスは，システムの負荷（CPUの割合）の関数です．

さらに，そのタイムスライスはプロセスの優先度によって重み付けされます．

あるプロセスPが実行可能状態になる場合，PがCよりも小さな割合でCPUを消費していれば，Pは現在実行中のプロセスCを横取りします．

以下の2つのタスクのスケジュール例を紹介します．

テキストエディタ：I/Oバウンド，レイテンシ重視（インタラクティブ型）
ビデオエンコーダ：CPUバウンド，バックグラウンドジョブ

スケジューリング目標は以下になります．

テキストエディタ：実行準備ができたら，ビデオエンコーダを先行して実行し，インタラクティブ性能を向上させたい．
ビデオエンコーダ：CPUキャッシュのヒット率を上げるため，できるだけ長く実行したい．

UNIXのスケジューリングポリシーでは，テキストエディタに高い優先度を与えます．

この理由として，多くのCPUを必要とするからではなく，必要なときに常にCPUを利用できるようにしたいからです．

これに対して，Linux CFSのスケジューリングポリシーでは，各プログラムが実際に使用したCPU時間を記録することで，テキストエディタに特定の割合のCPU時間を保証します．

例えば，「テキストエディタ：ビデオエンコーダ＝50%：50%」の割合になります．

テキストエディタはユーザの入力待ちのためほとんどスリープし，ビデオエンコーダはプリエンプションされるまで動作し続けます．

テキストエディタが起動したとき，Linux CFSはテキストエディタの方がビデオエンコーダよりCPU利用率が低いことを確認し，テキストエディタがビデオエンコーダをプリエンプションします．

下図にLinux CFSにおけるテキストエディタとビデオエンコーダのスケジュール例を示します．

Linux CFSでは，I/Oバウンドなタスク（テキストエディタのキーボード入力）のインタラクティブ性能と，バックグラウンドで動作するCPUバウンドなタスク（ビデオエンコーダ）の性能を両立できていることがわかります．

Linux CFSの設計について紹介します．

Linux CFSは，Rotating Staircase Deadline Scheduler（RSDL）の進化系です．

各時点で，同じ優先順位の各プロセスは，全く同じ量のCPU時間を受け取っています．

CPU上でn個のタスクを並列に実行できる場合，それぞれにCPUの処理能力の1/nを与えます．

CFSは，あるプロセスをある時間実行した後，最も実行時間の短い実行可能なプロセスと交換します．

デフォルトのタイムスライスはなく，CFSは実行可能なプロセス数に応じてプロセスの実行時間を計算します．

CFSの動的なタイムスライスは，プロセスの優先度によっていい感じに重み付けされます．

具体的には，「タイムスライス = タスクの重み / 実行可能なタスクの総重量」になります．

実際のタイムスライスを計算するために，CFSはターゲットレイテンシを設定します．

ターゲットレイテンシとは，実行可能なすべてのプロセスが少なくとも一度はスケジューリングされるべき期間のことで，最小粒度は1msのフロア（デフォルト）になります．

同じ優先度（下図の上部）と異なる優先度（下図の下部）のプロセスのスケジュール例は以下になります．

同じ優先度の場合は全てのプロセスが同じ実行時間になり，異なる優先度の場合は優先度の高いプロセスAがプロセスBより多くの時間を実行していることがわかります．

example of linux cfs with same and different priorities

Linuxのスケジューリングクラスの設計と実装

Linuxのスケジューリングクラスの設計と実装を紹介します．

Linuxのスケジューラはモジュール化されており，スケジューリングアルゴリズムのプラグイン可能なインタフェースを提供します．

異なるスケジューリングアルゴリズムが共存し，それぞれのタイプのプロセスをスケジューリングすることができます．

スケジューリングクラスはスケジューリングアルゴリズムの一つです．

各スケジューリングクラスにはスケジューリングポリシー（SCHED_DEADLINE，SCHED_FIFO，SCHED_RR，SCHED_OTHER等）があります．

※SCHED_OTHERはユーザ空間の名前，SCHED_NORMALはカーネル空間の名前で同じ実装を指します．

スケジューラの基本コードは，以下の2つのケースでスケジューラを起動します．

タイマ割り込みを処理する時（割り込みハンドラからscheduler_tick関数を呼び出し）
カーネルがschedule関数を呼び出した時

scheduler_tick関数とschedule関数の実装はlinux/kernel/sched/core.cにあります．

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;
	struct rq_flags rf;
	unsigned long thermal_pressure;
	u64 resched_latency;

	arch_scale_freq_tick();
	sched_clock_tick();

	rq_lock(rq, &rf);

	update_rq_clock(rq);
	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
	curr->sched_class->task_tick(rq, curr, 0);
	if (sched_feat(LATENCY_WARN))
		resched_latency = cpu_resched_latency(rq);
	calc_global_load_tick(rq);

	rq_unlock(rq, &rf);

	if (sched_feat(LATENCY_WARN) && resched_latency)
		resched_latency_warn(cpu, resched_latency);

	perf_event_task_tick();

#ifdef CONFIG_SMP
	rq->idle_balance = idle_cpu(cpu);
	trigger_load_balance(rq);
#endif
}


asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;

	sched_submit_work(tsk);
	do {
		preempt_disable();
		__schedule(SM_NONE);
		sched_preempt_enable_no_resched();
	} while (need_resched());
	sched_update_worker(tsk);
}
EXPORT_SYMBOL(schedule);

* This function gets called by the timer code, with HZ frequency.

* We call it with interrupts disabled.

void scheduler_tick(void)

{

int cpu = smp_processor_id();

struct rq *rq = cpu_rq(cpu);

struct task_struct *curr = rq->curr;

struct rq_flags rf;

unsigned long thermal_pressure;

u64 resched_latency;

arch_scale_freq_tick();

sched_clock_tick();

rq_lock(rq, &rf);

update_rq_clock(rq);

thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));

update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);

curr->sched_class->task_tick(rq, curr, 0);

if (sched_feat(LATENCY_WARN))

resched_latency = cpu_resched_latency(rq);

calc_global_load_tick(rq);

rq_unlock(rq, &rf);

if (sched_feat(LATENCY_WARN) && resched_latency)

resched_latency_warn(cpu, resched_latency);

perf_event_task_tick();

#ifdef CONFIG_SMP

rq->idle_balance = idle_cpu(cpu);

trigger_load_balance(rq);

#endif

}

asmlinkage __visible void __sched schedule(void)

{

struct task_struct *tsk = current;

sched_submit_work(tsk);

do {

preempt_disable();

__schedule(SM_NONE);

sched_preempt_enable_no_resched();

} while (need_resched());

sched_update_worker(tsk);

}

EXPORT_SYMBOL(schedule);

また，次に実行するタスクを選択するpick_next_task関数も同じくlinux/kernel/sched/core.cにあります．

static struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	struct task_struct *next, *max = NULL;
	const struct sched_class *class;
	const struct cpumask *smt_mask;
	bool fi_before = false;
	int i, j, cpu, occ = 0;
	bool need_sync;

...

	/*
	 * Optimize for common case where this CPU has no cookies
	 * and there are no cookied tasks running on siblings.
	 */
	if (!need_sync) {
		for_each_class(class) {
			next = class->pick_task(rq);
			if (next)
				break;
		}

		if (!next->core_cookie) {
			rq->core_pick = NULL;
			/*
			 * For robustness, update the min_vruntime_fi for
			 * unconstrained picks as well.
			 */
			WARN_ON_ONCE(fi_before);
			task_vruntime_update(rq, next, false);
			goto done;
		}
	}

	for_each_cpu(i, smt_mask) {
		struct rq *rq_i = cpu_rq(i);

		rq_i->core_pick = NULL;

		if (i != cpu)
			update_rq_clock(rq_i);
	}

	/*
	 * Try and select tasks for each sibling in descending sched_class
	 * order.
	 */
	for_each_class(class) {
again:
		for_each_cpu_wrap(i, smt_mask, cpu) {
			struct rq *rq_i = cpu_rq(i);
			struct task_struct *p;

			if (rq_i->core_pick)
				continue;

			/*
			 * If this sibling doesn't yet have a suitable task to
			 * run; ask for the most eligible task, given the
			 * highest priority task already selected for this
			 * core.
			 */
			p = pick_task(rq_i, class, max, fi_before);
			if (!p)
				continue;

			if (!is_task_rq_idle(p))
				occ++;

			rq_i->core_pick = p;
			if (rq_i->idle == p && rq_i->nr_running) {
				rq->core->core_forceidle = true;
				if (!fi_before)
					rq->core->core_forceidle_seq++;
			}

			/*
			 * If this new candidate is of higher priority than the
			 * previous; and they're incompatible; we need to wipe
			 * the slate and start over. pick_task makes sure that
			 * p's priority is more than max if it doesn't match
			 * max's cookie.
			 *
			 * NOTE: this is a linear max-filter and is thus bounded
			 * in execution time.
			 */
			if (!max || !cookie_match(max, p)) {
				struct task_struct *old_max = max;

				rq->core->core_cookie = p->core_cookie;
				max = p;

				if (old_max) {
					rq->core->core_forceidle = false;
					for_each_cpu(j, smt_mask) {
						if (j == i)
							continue;

						cpu_rq(j)->core_pick = NULL;
					}
					occ = 1;
					goto again;
				}
			}
		}
	}

	rq->core->core_pick_seq = rq->core->core_task_seq;
	next = rq->core_pick;
	rq->core_sched_seq = rq->core->core_pick_seq;

...

done:
	set_next_task(rq, next);
	return next;
}

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

static struct task_struct *

pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

{

struct task_struct *next, *max = NULL;

const struct sched_class *class;

const struct cpumask *smt_mask;

bool fi_before = false;

int i, j, cpu, occ = 0;

bool need_sync;

...

* Optimize for common case where this CPU has no cookies

* and there are no cookied tasks running on siblings.

if (!need_sync) {

for_each_class(class) {

next = class->pick_task(rq);

if (next)

break;

}

if (!next->core_cookie) {

rq->core_pick = NULL;

* For robustness, update the min_vruntime_fi for

* unconstrained picks as well.

WARN_ON_ONCE(fi_before);

task_vruntime_update(rq, next, false);

goto done;

}

for_each_cpu(i, smt_mask) {

struct rq *rq_i = cpu_rq(i);

rq_i->core_pick = NULL;

if (i != cpu)

update_rq_clock(rq_i);

}

* Try and select tasks for each sibling in descending sched_class

* order.

for_each_class(class) {

again:

for_each_cpu_wrap(i, smt_mask, cpu) {

struct rq *rq_i = cpu_rq(i);

struct task_struct *p;

if (rq_i->core_pick)

continue;

* If this sibling doesn't yet have a suitable task to

* run; ask for the most eligible task, given the

* highest priority task already selected for this

* core.

p = pick_task(rq_i, class, max, fi_before);

if (!p)

continue;

if (!is_task_rq_idle(p))

occ++;

rq_i->core_pick = p;

if (rq_i->idle == p && rq_i->nr_running) {

rq->core->core_forceidle = true;

if (!fi_before)

rq->core->core_forceidle_seq++;

}

* If this new candidate is of higher priority than the

* previous; and they're incompatible; we need to wipe

* the slate and start over. pick_task makes sure that

* p's priority is more than max if it doesn't match

* max's cookie.

* NOTE: this is a linear max-filter and is thus bounded

* in execution time.

if (!max || !cookie_match(max, p)) {

struct task_struct *old_max = max;

rq->core->core_cookie = p->core_cookie;

max = p;

if (old_max) {

rq->core->core_forceidle = false;

for_each_cpu(j, smt_mask) {

if (j == i)

continue;

cpu_rq(j)->core_pick = NULL;

}

occ = 1;

goto again;

}

rq->core->core_pick_seq = rq->core->core_task_seq;

next = rq->core_pick;

rq->core_sched_seq = rq->core->core_pick_seq;

...

done:

set_next_task(rq, next);

return next;

}

Linux CFSの実装はSCHED_OTHERポリシーになります．

カーネルコード内ではSCHED_NORMALと記載されていて，実装はlinux/kernel/sched/fair.cにあります．

また，リアルタイムスケジューリングポリシーは以下の3つになります．

SCHED_FIFO : 先入れ先出しスケジューリング
SCHED_RR : ラウンドロビンスケジューリング
SCHED_DEADLINE : 散発的タスク（Sporadic Task）モデルによるデッドラインスケジューリング

※散発的タスクとは最短到着間隔（いわゆる最短周期）が保証されているタスクのことです．

linux/kernel/sched/sched.hにあるsched_class構造体は以下になります．

sched_class構造体は，すべてのスケジューリングクラスの抽象ベースクラスになります．

linux/kernel/sched/fair.cにCFSの実装があります．

struct sched_class {

#ifdef CONFIG_UCLAMP_TASK
	int uclamp_enabled;
#endif

	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
	void (*yield_task)   (struct rq *rq);
	bool (*yield_to_task)(struct rq *rq, struct task_struct *p);

	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

	struct task_struct *(*pick_next_task)(struct rq *rq);

	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
	void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);

#ifdef CONFIG_SMP
	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);

	struct task_struct * (*pick_task)(struct rq *rq);

	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

	void (*task_woken)(struct rq *this_rq, struct task_struct *task);

	void (*set_cpus_allowed)(struct task_struct *p,
				 const struct cpumask *newmask,
				 u32 flags);

	void (*rq_online)(struct rq *rq);
	void (*rq_offline)(struct rq *rq);

	struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);
#endif

	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
	void (*task_fork)(struct task_struct *p);
	void (*task_dead)(struct task_struct *p);

	/*
	 * The switched_from() call is allowed to drop rq->lock, therefore we
	 * cannot assume the switched_from/switched_to pair is serialized by
	 * rq->lock. They are however serialized by p->pi_lock.
	 */
	void (*switched_from)(struct rq *this_rq, struct task_struct *task);
	void (*switched_to)  (struct rq *this_rq, struct task_struct *task);
	void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
			      int oldprio);

	unsigned int (*get_rr_interval)(struct rq *rq,
					struct task_struct *task);

	void (*update_curr)(struct rq *rq);

#define TASK_SET_GROUP		0
#define TASK_MOVE_GROUP		1

#ifdef CONFIG_FAIR_GROUP_SCHED
	void (*task_change_group)(struct task_struct *p, int type);
#endif
};

/*
 * Helper to define a sched_class instance; each one is placed in a separate
 * section which is ordered by the linker script:
 *
 *   include/asm-generic/vmlinux.lds.h
 *
 * Also enforce alignment on the instance, not the type, to guarantee layout.
 */
#define DEFINE_SCHED_CLASS(name) \
const struct sched_class name##_sched_class \
	__aligned(__alignof__(struct sched_class)) \
	__section("__" #name "_sched_class")

struct sched_class {

#ifdef CONFIG_UCLAMP_TASK

int uclamp_enabled;

#endif

void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);

void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);

void (*yield_task) (struct rq *rq);

bool (*yield_to_task)(struct rq *rq, struct task_struct *p);

void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

struct task_struct *(*pick_next_task)(struct rq *rq);

void (*put_prev_task)(struct rq *rq, struct task_struct *p);

void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);

#ifdef CONFIG_SMP

int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);

int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);

struct task_struct * (*pick_task)(struct rq *rq);

void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

void (*task_woken)(struct rq *this_rq, struct task_struct *task);

void (*set_cpus_allowed)(struct task_struct *p,

const struct cpumask *newmask,

u32 flags);

void (*rq_online)(struct rq *rq);

void (*rq_offline)(struct rq *rq);

struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);

#endif

void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);

void (*task_fork)(struct task_struct *p);

void (*task_dead)(struct task_struct *p);

* The switched_from() call is allowed to drop rq->lock, therefore we

* cannot assume the switched_from/switched_to pair is serialized by

* rq->lock. They are however serialized by p->pi_lock.

void (*switched_from)(struct rq *this_rq, struct task_struct *task);

void (*switched_to) (struct rq *this_rq, struct task_struct *task);

void (*prio_changed) (struct rq *this_rq, struct task_struct *task,

int oldprio);

unsigned int (*get_rr_interval)(struct rq *rq,

struct task_struct *task);

void (*update_curr)(struct rq *rq);

#define TASK_SET_GROUP 0

#define TASK_MOVE_GROUP 1

#ifdef CONFIG_FAIR_GROUP_SCHED

void (*task_change_group)(struct task_struct *p, int type);

#endif

};

* Helper to define a sched_class instance; each one is placed in a separate

* section which is ordered by the linker script:

* include/asm-generic/vmlinux.lds.h

* Also enforce alignment on the instance, not the type, to guarantee layout.

#define DEFINE_SCHED_CLASS(name) \

const struct sched_class name##_sched_class \

__aligned(__alignof__(struct sched_class)) \

__section("__" #name "_sched_class")

各スケジューリングクラスは独自の機能を実装しています．

CFSの実装はlinux/kernel/sched/fair.cにあります．

/*
 * All the scheduling class methods:
 */
DEFINE_SCHED_CLASS(fair) = {

	.enqueue_task		= enqueue_task_fair,
	.dequeue_task		= dequeue_task_fair,
	.yield_task		= yield_task_fair,
	.yield_to_task		= yield_to_task_fair,

	.check_preempt_curr	= check_preempt_wakeup,

	.pick_next_task		= __pick_next_task_fair,
	.put_prev_task		= put_prev_task_fair,
	.set_next_task          = set_next_task_fair,

#ifdef CONFIG_SMP
	.balance		= balance_fair,
	.pick_task		= pick_task_fair,
	.select_task_rq		= select_task_rq_fair,
	.migrate_task_rq	= migrate_task_rq_fair,

	.rq_online		= rq_online_fair,
	.rq_offline		= rq_offline_fair,

	.task_dead		= task_dead_fair,
	.set_cpus_allowed	= set_cpus_allowed_common,
#endif

	.task_tick		= task_tick_fair,
	.task_fork		= task_fork_fair,

	.prio_changed		= prio_changed_fair,
	.switched_from		= switched_from_fair,
	.switched_to		= switched_to_fair,

	.get_rr_interval	= get_rr_interval_fair,

	.update_curr		= update_curr_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED
	.task_change_group	= task_change_group_fair,
#endif

#ifdef CONFIG_UCLAMP_TASK
	.uclamp_enabled		= 1,
#endif
};

/*
 * scheduler tick hitting a task of our scheduling class.
 *
 * NOTE: This function can be called remotely by the tick offload that
 * goes along full dynticks. Therefore no local assumption can be made
 * and everything must be accessed through the @rq and @curr passed in
 * parameters.
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &curr->se;

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		entity_tick(cfs_rq, se, queued);
	}

	if (static_branch_unlikely(&sched_numa_balancing))
		task_tick_numa(rq, curr);

	update_misfit_status(curr, rq);
	update_overutilized_status(task_rq(curr));

	task_tick_core(rq, curr);
}

* All the scheduling class methods:

DEFINE_SCHED_CLASS(fair) = {

.enqueue_task = enqueue_task_fair,

.dequeue_task = dequeue_task_fair,

.yield_task = yield_task_fair,

.yield_to_task = yield_to_task_fair,

.check_preempt_curr = check_preempt_wakeup,

.pick_next_task = __pick_next_task_fair,

.put_prev_task = put_prev_task_fair,

.set_next_task = set_next_task_fair,

#ifdef CONFIG_SMP

.balance = balance_fair,

.pick_task = pick_task_fair,

.select_task_rq = select_task_rq_fair,

.migrate_task_rq = migrate_task_rq_fair,

.rq_online = rq_online_fair,

.rq_offline = rq_offline_fair,

.task_dead = task_dead_fair,

.set_cpus_allowed = set_cpus_allowed_common,

#endif

.task_tick = task_tick_fair,

.task_fork = task_fork_fair,

.prio_changed = prio_changed_fair,

.switched_from = switched_from_fair,

.switched_to = switched_to_fair,

.get_rr_interval = get_rr_interval_fair,

.update_curr = update_curr_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED

.task_change_group = task_change_group_fair,

#endif

#ifdef CONFIG_UCLAMP_TASK

.uclamp_enabled = 1,

#endif

};

* scheduler tick hitting a task of our scheduling class.

* NOTE: This function can be called remotely by the tick offload that

* goes along full dynticks. Therefore no local assumption can be made

* and everything must be accessed through the @rq and @curr passed in

* parameters.

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

{

struct cfs_rq *cfs_rq;

struct sched_entity *se = &curr->se;

for_each_sched_entity(se) {

cfs_rq = cfs_rq_of(se);

entity_tick(cfs_rq, se, queued);

}

if (static_branch_unlikely(&sched_numa_balancing))

task_tick_numa(rq, curr);

update_misfit_status(curr, rq);

update_overutilized_status(task_rq(curr));

task_tick_core(rq, curr);

}

Linux CFSの動作

Linux CFSの動作を紹介します．

Linux CFSでは，各々のプロセスは実行した時間を「仮想ランタイム（Virtual Runtime）」として保持します．

include/linux/sched.hにあるtask_struct構造体にsched_entityメンバ変数，sched_entity構造体に仮想ランタイムであるvruntimeメンバ変数（単位はns）（23行目）があることがわかります．

struct task_struct {
...

	const struct sched_class	*sched_class;
	struct sched_entity		se;
	struct sched_rt_entity		rt;
	struct sched_dl_entity		dl;

...

};


struct sched_entity {
	/* For load-balancing: */
	struct load_weight		load;
	struct rb_node			run_node;
	struct list_head		group_node;
	unsigned int			on_rq;

	u64				exec_start;
	u64				sum_exec_runtime;
	u64				vruntime;
	u64				prev_sum_exec_runtime;

	u64				nr_migrations;

	struct sched_statistics		statistics;

#ifdef CONFIG_FAIR_GROUP_SCHED
	int				depth;
	struct sched_entity		*parent;
	/* rq on which this entity is (to be) queued: */
	struct cfs_rq			*cfs_rq;
	/* rq "owned" by this entity/group: */
	struct cfs_rq			*my_q;
	/* cached value of my_q->h_nr_running */
	unsigned long			runnable_weight;
#endif

#ifdef CONFIG_SMP
	/*
	 * Per entity load average tracking.
	 *
	 * Put into separate cache line so it does not
	 * collide with read-mostly values above.
	 */
	struct sched_avg		avg;
#endif
};

struct task_struct {

...

const struct sched_class *sched_class;

struct sched_entity se;

struct sched_rt_entity rt;

struct sched_dl_entity dl;

...

};

struct sched_entity {

/* For load-balancing: */

struct load_weight load;

struct rb_node run_node;

struct list_head group_node;

unsigned int on_rq;

u64 exec_start;

u64 sum_exec_runtime;

u64 vruntime;

u64 prev_sum_exec_runtime;

u64 nr_migrations;

struct sched_statistics statistics;

#ifdef CONFIG_FAIR_GROUP_SCHED

int depth;

struct sched_entity *parent;

/* rq on which this entity is (to be) queued: */

struct cfs_rq *cfs_rq;

/* rq "owned" by this entity/group: */

struct cfs_rq *my_q;

/* cached value of my_q->h_nr_running */

unsigned long runnable_weight;

#endif

#ifdef CONFIG_SMP

* Per entity load average tracking.

* Put into separate cache line so it does not

* collide with read-mostly values above.

struct sched_avg avg;

#endif

};

CFSでは，linux/kernel/sched/fair.cにあるupdate_curr関数でタイマ割り込みごとにタスクの実行時間を計算します．

/*
 * Update the current task's runtime statistics.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;

	if (unlikely(!curr))
		return;

	delta_exec = now - curr->exec_start;
	if (unlikely((s64)delta_exec <= 0))
		return;

	curr->exec_start = now;

	schedstat_set(curr->statistics.exec_max,
		      max(delta_exec, curr->statistics.exec_max));

	curr->sum_exec_runtime += delta_exec;
	schedstat_add(cfs_rq->exec_clock, delta_exec);

	curr->vruntime += calc_delta_fair(delta_exec, curr);
	update_min_vruntime(cfs_rq);

	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
		cgroup_account_cputime(curtask, delta_exec);
		account_group_exec_runtime(curtask, delta_exec);
	}

	account_cfs_rq_runtime(cfs_rq, delta_exec);
}

* Update the current task's runtime statistics.

static void update_curr(struct cfs_rq *cfs_rq)

{

struct sched_entity *curr = cfs_rq->curr;

u64 now = rq_clock_task(rq_of(cfs_rq));

u64 delta_exec;

if (unlikely(!curr))

return;

delta_exec = now - curr->exec_start;

if (unlikely((s64)delta_exec <= 0))

return;

curr->exec_start = now;

schedstat_set(curr->statistics.exec_max,

max(delta_exec, curr->statistics.exec_max));

curr->sum_exec_runtime += delta_exec;

schedstat_add(cfs_rq->exec_clock, delta_exec);

curr->vruntime += calc_delta_fair(delta_exec, curr);

update_min_vruntime(cfs_rq);

if (entity_is_task(curr)) {

struct task_struct *curtask = task_of(curr);

trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);

cgroup_account_cputime(curtask, delta_exec);

account_group_exec_runtime(curtask, delta_exec);

}

account_cfs_rq_runtime(cfs_rq, delta_exec);

}

CFSではスケジューラのデータ構造として赤黒木を利用しています．

赤黒木を知りたいあなたはこちらからどうぞ．

: 【C言語】キューとは【FIFOキュー，優先度キュー，二項ヒープ，赤黒木】

こういった悩みにお答えします．こういった私から学べます．【C言語】キューとはキューとは，データを先入れ先出し（First In First Out）で保持するデータ構造です．キューにデータに入 ...

続きを見る

CFSはvruntimeでインデックスされたタスクの赤黒木（rbtree）を保持します（すなわちランキュー（runqueue））．

常にvruntimeが最も小さいタスク（一番左のノード）を選びます．

CFSの動作は以下の動画がわかりやすいです．

また，CFS VisualizerでCFSの動作を確認してみましょう．

CFSが次に実行する実行可能なプロセスを選択する必要がある場合，vruntimeが最も小さいプロセスが選択されます．

つまり，赤黒木の一番左のノードになります．

この操作はlinux/kernel/sched/fair.cにある__pick_first_entity関数で行われます．

struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

	if (!left)
		return NULL;

	return __node_2_se(left);
}

struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)

{

struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

if (!left)

return NULL;

return __node_2_se(left);

}

タスクが起動またはマイグレーションすると，ランキューに追加されます．

この操作は，linux/kernel/sched/fair.cにあるenqueue_entity関数で行われます．

/*
 * MIGRATION
 *
 *	dequeue
 *	  update_curr()
 *	    update_min_vruntime()
 *	  vruntime -= min_vruntime
 *
 *	enqueue
 *	  update_curr()
 *	    update_min_vruntime()
 *	  vruntime += min_vruntime
 *
 * this way the vruntime transition between RQs is done when both
 * min_vruntime are up-to-date.
 *
 * WAKEUP (remote)
 *
 *	->migrate_task_rq_fair() (p->state == TASK_WAKING)
 *	  vruntime -= min_vruntime
 *
 *	enqueue
 *	  update_curr()
 *	    update_min_vruntime()
 *	  vruntime += min_vruntime
 *
 * this way we don't have the most up-to-date min_vruntime on the originating
 * CPU and an up-to-date min_vruntime on the destination CPU.
 */

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
	bool curr = cfs_rq->curr == se;

	/*
	 * If we're the current task, we must renormalise before calling
	 * update_curr().
	 */
	if (renorm && curr)
		se->vruntime += cfs_rq->min_vruntime;

	update_curr(cfs_rq);

	/*
	 * Otherwise, renormalise after, such that we're placed at the current
	 * moment in time, instead of some random moment in the past. Being
	 * placed in the past could significantly boost this task to the
	 * fairness detriment of existing tasks.
	 */
	if (renorm && !curr)
		se->vruntime += cfs_rq->min_vruntime;

	/*
	 * When enqueuing a sched_entity, we must:
	 *   - Update loads to have both entity and cfs_rq synced with now.
	 *   - Add its load to cfs_rq->runnable_avg
	 *   - For group_entity, update its weight to reflect the new share of
	 *     its group cfs_rq
	 *   - Add its new weight to cfs_rq->load.weight
	 */
	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
	se_update_runnable(se);
	update_cfs_group(se);
	account_entity_enqueue(cfs_rq, se);

	if (flags & ENQUEUE_WAKEUP)
		place_entity(cfs_rq, se, 0);

	check_schedstat_required();
	update_stats_enqueue(cfs_rq, se, flags);
	check_spread(cfs_rq, se);
	if (!curr)
		__enqueue_entity(cfs_rq, se);
	se->on_rq = 1;

	/*
	 * When bandwidth control is enabled, cfs might have been removed
	 * because of a parent been throttled but cfs->nr_running > 1. Try to
	 * add it unconditionally.
	 */
	if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
		list_add_leaf_cfs_rq(cfs_rq);

	if (cfs_rq->nr_running == 1)
		check_enqueue_throttle(cfs_rq);
}

* MIGRATION

* dequeue

* update_curr()

* update_min_vruntime()

* vruntime -= min_vruntime

* enqueue

* update_curr()

* update_min_vruntime()

* vruntime += min_vruntime

* this way the vruntime transition between RQs is done when both

* min_vruntime are up-to-date.

* WAKEUP (remote)

* ->migrate_task_rq_fair() (p->state == TASK_WAKING)

* vruntime -= min_vruntime

* enqueue

* update_curr()

* update_min_vruntime()

* vruntime += min_vruntime

* this way we don't have the most up-to-date min_vruntime on the originating

* CPU and an up-to-date min_vruntime on the destination CPU.

static void

enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

{

bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);

bool curr = cfs_rq->curr == se;

* If we're the current task, we must renormalise before calling

* update_curr().

if (renorm && curr)

se->vruntime += cfs_rq->min_vruntime;

update_curr(cfs_rq);

* Otherwise, renormalise after, such that we're placed at the current

* moment in time, instead of some random moment in the past. Being

* placed in the past could significantly boost this task to the

* fairness detriment of existing tasks.

if (renorm && !curr)

se->vruntime += cfs_rq->min_vruntime;

* When enqueuing a sched_entity, we must:

* - Update loads to have both entity and cfs_rq synced with now.

* - Add its load to cfs_rq->runnable_avg

* - For group_entity, update its weight to reflect the new share of

* its group cfs_rq

* - Add its new weight to cfs_rq->load.weight

update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);

se_update_runnable(se);

update_cfs_group(se);

account_entity_enqueue(cfs_rq, se);

if (flags & ENQUEUE_WAKEUP)

place_entity(cfs_rq, se, 0);

check_schedstat_required();

update_stats_enqueue(cfs_rq, se, flags);

check_spread(cfs_rq, se);

if (!curr)

__enqueue_entity(cfs_rq, se);

se->on_rq = 1;

* When bandwidth control is enabled, cfs might have been removed

* because of a parent been throttled but cfs->nr_running > 1. Try to

* add it unconditionally.

if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())

list_add_leaf_cfs_rq(cfs_rq);

if (cfs_rq->nr_running == 1)

check_enqueue_throttle(cfs_rq);

}

タスクがスリープまたはマイグレーションすると，ランキューから削除されます．

この操作はlinux/kernel/sched/fair.cにあるdequeue_entity関数で行われます．

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);

	/*
	 * When dequeuing a sched_entity, we must:
	 *   - Update loads to have both entity and cfs_rq synced with now.
	 *   - Subtract its load from the cfs_rq->runnable_avg.
	 *   - Subtract its previous weight from cfs_rq->load.weight.
	 *   - For group entity, update its weight to reflect the new share
	 *     of its group cfs_rq.
	 */
	update_load_avg(cfs_rq, se, UPDATE_TG);
	se_update_runnable(se);

	update_stats_dequeue(cfs_rq, se, flags);

	clear_buddies(cfs_rq, se);

	if (se != cfs_rq->curr)
		__dequeue_entity(cfs_rq, se);
	se->on_rq = 0;
	account_entity_dequeue(cfs_rq, se);

	/*
	 * Normalize after update_curr(); which will also have moved
	 * min_vruntime if @se is the one holding it back. But before doing
	 * update_min_vruntime() again, which will discount @se's position and
	 * can move min_vruntime forward still more.
	 */
	if (!(flags & DEQUEUE_SLEEP))
		se->vruntime -= cfs_rq->min_vruntime;

	/* return excess runtime on last dequeue */
	return_cfs_rq_runtime(cfs_rq);

	update_cfs_group(se);

	/*
	 * Now advance min_vruntime if @se was the entity holding it back,
	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
	 * put back on, and if we advance min_vruntime, we'll be placed back
	 * further than we started -- ie. we'll be penalized.
	 */
	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
		update_min_vruntime(cfs_rq);
}

static void

dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

{

* Update run-time statistics of the 'current'.

update_curr(cfs_rq);

* When dequeuing a sched_entity, we must:

* - Update loads to have both entity and cfs_rq synced with now.

* - Subtract its load from the cfs_rq->runnable_avg.

* - Subtract its previous weight from cfs_rq->load.weight.

* - For group entity, update its weight to reflect the new share

* of its group cfs_rq.

update_load_avg(cfs_rq, se, UPDATE_TG);

se_update_runnable(se);

update_stats_dequeue(cfs_rq, se, flags);

clear_buddies(cfs_rq, se);

if (se != cfs_rq->curr)

__dequeue_entity(cfs_rq, se);

se->on_rq = 0;

account_entity_dequeue(cfs_rq, se);

* Normalize after update_curr(); which will also have moved

* min_vruntime if @se is the one holding it back. But before doing

* update_min_vruntime() again, which will discount @se's position and

* can move min_vruntime forward still more.

if (!(flags & DEQUEUE_SLEEP))

se->vruntime -= cfs_rq->min_vruntime;

/* return excess runtime on last dequeue */

return_cfs_rq_runtime(cfs_rq);

update_cfs_group(se);

* Now advance min_vruntime if @se was the entity holding it back,

* except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be

* put back on, and if we advance min_vruntime, we'll be placed back

* further than we started -- ie. we'll be penalized.

if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)

update_min_vruntime(cfs_rq);

}

タスクがスリープする理由は，指定された時間，I/O待ち，ミューテックスでのブロック等になります．

スリープまでの手順は以下になります．

タスクにスリープマークをつける．
タスクをwaitqueueに入れる．
実行可能状態のプロセスの赤黒木からタスクをデキューする．
タスクはschedule関数を呼び出して，実行する新しいプロセスを選択する．

プロセスの起床はスリープの逆の手順になります．

スリープ状態に関連する以下の2つの状態があります．

TASK_INTERRUPTIBLE：シグナルによりスリープ状態のタスクを起床する．
TASK_UNINTERRUPTIBLE：起床までシグナルの送信を遅延する．

waitqueueはイベント発生を待っているプロセスのリストを保持します．

waiequeueはpthreadの条件変数と似たような概念です．

waitqueueの定義はlinux/inlcude/linux/wait.hにあるwait_queue_entry構造体とwait_queue_head構造体になります．

/*
 * A single wait-queue entry structure:
 */
struct wait_queue_entry {
	unsigned int		flags;
	void			*private;
	wait_queue_func_t	func;
	struct list_head	entry;
};

struct wait_queue_head {
	spinlock_t		lock;
	struct list_head	head;
};

* A single wait-queue entry structure:

struct wait_queue_entry {

unsigned int flags;

void *private;

wait_queue_func_t func;

struct list_head entry;

};

struct wait_queue_head {

spinlock_t lock;

struct list_head head;

};

waitqueueにプロセスを挿入するために，linux/include/linux/wait.hにある以下のマクロを利用します．

wait_event_interruptibleマクロ：TASK_INTERRUPTIBLE状態でwaitqueueにプロセスを挿入する．
wait_eventマクロ，wait_event_timeoutマクロ，wait_event_lock_irqマクロ等：特定の条件を満たすまでTASK_UNINTERRUPTIBLE状態でwaitqueueにプロセスを挿入する．

/**
 * wait_event_interruptible - sleep until a condition gets true
 * @wq_head: the waitqueue to wait on
 * @condition: a C expression for the event to wait for
 *
 * The process is put to sleep (TASK_INTERRUPTIBLE) until the
 * @condition evaluates to true or a signal is received.
 * The @condition is checked each time the waitqueue @wq_head is woken up.
 *
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.
 *
 * The function will return -ERESTARTSYS if it was interrupted by a
 * signal and 0 if @condition evaluated to true.
 */
#define wait_event_interruptible(wq_head, condition)				\
({										\
	int __ret = 0;								\
	might_sleep();								\
	if (!(condition))							\
		__ret = __wait_event_interruptible(wq_head, condition);		\
	__ret;									\
})

/**
 * wait_event - sleep until a condition gets true
 * @wq_head: the waitqueue to wait on
 * @condition: a C expression for the event to wait for
 *
 * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
 * @condition evaluates to true. The @condition is checked each time
 * the waitqueue @wq_head is woken up.
 *
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.
 */
#define wait_event(wq_head, condition)						\
do {										\
	might_sleep();								\
	if (condition)								\
		break;								\
	__wait_event(wq_head, condition);					\
} while (0)

/**
 * wait_event_timeout - sleep until a condition gets true or a timeout elapses
 * @wq_head: the waitqueue to wait on
 * @condition: a C expression for the event to wait for
 * @timeout: timeout, in jiffies
 *
 * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
 * @condition evaluates to true. The @condition is checked each time
 * the waitqueue @wq_head is woken up.
 *
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.
 *
 * Returns:
 * 0 if the @condition evaluated to %false after the @timeout elapsed,
 * 1 if the @condition evaluated to %true after the @timeout elapsed,
 * or the remaining jiffies (at least 1) if the @condition evaluated
 * to %true before the @timeout elapsed.
 */
#define wait_event_timeout(wq_head, condition, timeout)				\
({										\
	long __ret = timeout;							\
	might_sleep();								\
	if (!___wait_cond_timeout(condition))					\
		__ret = __wait_event_timeout(wq_head, condition, timeout);	\
	__ret;									\
})

/**
 * wait_event_lock_irq - sleep until a condition gets true. The
 *			 condition is checked under the lock. This
 *			 is expected to be called with the lock
 *			 taken.
 * @wq_head: the waitqueue to wait on
 * @condition: a C expression for the event to wait for
 * @lock: a locked spinlock_t, which will be released before schedule()
 *	  and reacquired afterwards.
 *
 * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
 * @condition evaluates to true. The @condition is checked each time
 * the waitqueue @wq_head is woken up.
 *
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.
 *
 * This is supposed to be called while holding the lock. The lock is
 * dropped before going to sleep and is reacquired afterwards.
 */
#define wait_event_lock_irq(wq_head, condition, lock)				\
do {										\
	if (condition)								\
		break;								\
	__wait_event_lock_irq(wq_head, condition, lock, );			\
} while (0)

/**

* wait_event_interruptible - sleep until a condition gets true

* @wq_head: the waitqueue to wait on

* @condition: a C expression for the event to wait for

* The process is put to sleep (TASK_INTERRUPTIBLE) until the

* @condition evaluates to true or a signal is received.

* The @condition is checked each time the waitqueue @wq_head is woken up.

* wake_up() has to be called after changing any variable that could

* change the result of the wait condition.

* The function will return -ERESTARTSYS if it was interrupted by a

* signal and 0 if @condition evaluated to true.

#define wait_event_interruptible(wq_head, condition) \

({ \

int __ret = 0; \

might_sleep(); \

if (!(condition)) \

__ret = __wait_event_interruptible(wq_head, condition); \

__ret; \

})

/**

* wait_event - sleep until a condition gets true

* @wq_head: the waitqueue to wait on

* @condition: a C expression for the event to wait for

* The process is put to sleep (TASK_UNINTERRUPTIBLE) until the

* @condition evaluates to true. The @condition is checked each time

* the waitqueue @wq_head is woken up.

* wake_up() has to be called after changing any variable that could

* change the result of the wait condition.

#define wait_event(wq_head, condition) \

do { \

might_sleep(); \

if (condition) \

break; \

__wait_event(wq_head, condition); \

} while (0)

/**

* wait_event_timeout - sleep until a condition gets true or a timeout elapses

* @wq_head: the waitqueue to wait on

* @condition: a C expression for the event to wait for

* @timeout: timeout, in jiffies

* The process is put to sleep (TASK_UNINTERRUPTIBLE) until the

* @condition evaluates to true. The @condition is checked each time

* the waitqueue @wq_head is woken up.

* wake_up() has to be called after changing any variable that could

* change the result of the wait condition.

* Returns:

* 0 if the @condition evaluated to %false after the @timeout elapsed,

* 1 if the @condition evaluated to %true after the @timeout elapsed,

* or the remaining jiffies (at least 1) if the @condition evaluated

* to %true before the @timeout elapsed.

#define wait_event_timeout(wq_head, condition, timeout) \

({ \

long __ret = timeout; \

might_sleep(); \

if (!___wait_cond_timeout(condition)) \

__ret = __wait_event_timeout(wq_head, condition, timeout); \

__ret; \

})

/**

* wait_event_lock_irq - sleep until a condition gets true. The

* condition is checked under the lock. This

* is expected to be called with the lock

* taken.

* @wq_head: the waitqueue to wait on

* @condition: a C expression for the event to wait for

* @lock: a locked spinlock_t, which will be released before schedule()

* and reacquired afterwards.

* The process is put to sleep (TASK_UNINTERRUPTIBLE) until the

* @condition evaluates to true. The @condition is checked each time

* the waitqueue @wq_head is woken up.

* wake_up() has to be called after changing any variable that could

* change the result of the wait condition.

* This is supposed to be called while holding the lock. The lock is

* dropped before going to sleep and is reacquired afterwards.

#define wait_event_lock_irq(wq_head, condition, lock) \

do { \

if (condition) \

break; \

__wait_event_lock_irq(wq_head, condition, lock, ); \

} while (0)

プロセスの起床は，wake_up関数が担当します．

デフォルトでは，waitqueueにある全てのプロセスを起床します．

#define wake_up(x)			__wake_up(x, TASK_NORMAL, 1, NULL)

1	#define wake_up(x) __wake_up(x, TASK_NORMAL, 1, NULL)

WQ_FLAG_EXCLUSIVEフラグを設定する排他タスクは，linux/kernel/sched/wait.cにあるprepare_to_wait_exclusive関数で追加します．

/* Returns true if we are the first waiter in the queue, false otherwise. */
bool
prepare_to_wait_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
{
	unsigned long flags;
	bool was_empty = false;

	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&wq_head->lock, flags);
	if (list_empty(&wq_entry->entry)) {
		was_empty = list_empty(&wq_head->head);
		__add_wait_queue_entry_tail(wq_head, wq_entry);
	}
	set_current_state(state);
	spin_unlock_irqrestore(&wq_head->lock, flags);
	return was_empty;
}
EXPORT_SYMBOL(prepare_to_wait_exclusive);

/* Returns true if we are the first waiter in the queue, false otherwise. */

bool

prepare_to_wait_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)

{

unsigned long flags;

bool was_empty = false;

wq_entry->flags |= WQ_FLAG_EXCLUSIVE;

spin_lock_irqsave(&wq_head->lock, flags);

if (list_empty(&wq_entry->entry)) {

was_empty = list_empty(&wq_head->head);

__add_wait_queue_entry_tail(wq_head, wq_entry);

}

set_current_state(state);

spin_unlock_irqrestore(&wq_head->lock, flags);

return was_empty;

}

EXPORT_SYMBOL(prepare_to_wait_exclusive);

waitqueueからプロセスを起床するためには，linux/kernel/sched/core.cにある以下の関数を呼び出ます．

default_wake_function関数がtry_to_wake_up関数を呼び出す．
try_to_wake_up関数がttwu_queue関数を呼び出す．
ttwu_queue関数がttwu_do_activate関数を呼び出す．
ttwu_do_activate関数がttwu_do_wakeup関数を呼び出す．
ttwu_do_wakeup関数でタスクの状態をTASK_RUNNINGに設定する．

ttwu_do_wakeup関数のコードは以下になります．

8行目のWRITE_ONCEマクロでタスクの状態をTASK_RUNNINGに設定していることがわかります．

/*
 * Mark the task runnable and perform wakeup-preemption.
 */
static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
			   struct rq_flags *rf)
{
	check_preempt_curr(rq, p, wake_flags);
	WRITE_ONCE(p->__state, TASK_RUNNING);
	trace_sched_wakeup(p);

#ifdef CONFIG_SMP
	if (p->sched_class->task_woken) {
		/*
		 * Our task @p is fully woken up and running; so it's safe to
		 * drop the rq->lock, hereafter rq is only used for statistics.
		 */
		rq_unpin_lock(rq, rf);
		p->sched_class->task_woken(rq, p);
		rq_repin_lock(rq, rf);
	}

	if (rq->idle_stamp) {
		u64 delta = rq_clock(rq) - rq->idle_stamp;
		u64 max = 2*rq->max_idle_balance_cost;

		update_avg(&rq->avg_idle, delta);

		if (rq->avg_idle > max)
			rq->avg_idle = max;

		rq->wake_stamp = jiffies;
		rq->wake_avg_idle = rq->avg_idle / 2;

		rq->idle_stamp = 0;
	}
#endif
}

* Mark the task runnable and perform wakeup-preemption.

static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,

struct rq_flags *rf)

{

check_preempt_curr(rq, p, wake_flags);

WRITE_ONCE(p->__state, TASK_RUNNING);

trace_sched_wakeup(p);

#ifdef CONFIG_SMP

if (p->sched_class->task_woken) {

* Our task @p is fully woken up and running; so it's safe to

* drop the rq->lock, hereafter rq is only used for statistics.

rq_unpin_lock(rq, rf);

p->sched_class->task_woken(rq, p);

rq_repin_lock(rq, rf);

}

if (rq->idle_stamp) {

u64 delta = rq_clock(rq) - rq->idle_stamp;

u64 max = 2*rq->max_idle_balance_cost;

update_avg(&rq->avg_idle, delta);

if (rq->avg_idle > max)

rq->avg_idle = max;

rq->wake_stamp = jiffies;

rq->wake_avg_idle = rq->avg_idle / 2;

rq->idle_stamp = 0;

}

#endif

}

マルチコアプロセッサにおけるCFSでは，共有データ構造への高価なアクセスを回避するために，CPU単位でrunqueueを持ちます．

ランキューはバランスを保つ必要があります．

例えば，デュアルコアで，優先度の高いプロセスが長いランキューにある場合と，優先度の低いプロセスが短いランキューにある場合を考えます．

この場合，優先度の高いプロセスは優先度の低いプロセスより少ないCPU時間になってしまいます．

この逆転現象を回避するために，ロードバランサーが優先順位とCPU使用率に基づいて定期的に実行します．

プリエンプションとコンテキストスイッチ

コンテキストスイッチとは，現在CPU上で動作しているプロセスを別のプロセスに切り替える動作のことです．

linux/kernel/sched/core.cにある__schedule関数で呼び出されるcontext_switch関数によって実行されます．

※schedule関数は__schedule関数を呼び出すので，schedule関数->__schedule関数->context_switch関数という呼び出し順序になります．

context_switch関数では以下の関数を呼び出します

membarrier_switch_mm関数：アドレス空間を切り替えます．
switch_to関数：CPUの状態（レジスタ）を切り替えます．

context_switch関数は以下になります．

33行目でmembarrier_switch_mm関数，56行目でswitch_to関数を呼び出していることがわかります．

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	prepare_task_switch(rq, prev, next);

	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);

	/*
	 * kernel -> kernel   lazy + transfer active
	 *   user -> kernel   lazy + mmgrab() active
	 *
	 * kernel ->   user   switch + mmdrop() active
	 *   user ->   user   switch
	 */
	if (!next->mm) {                                // to kernel
		enter_lazy_tlb(prev->active_mm, next);

		next->active_mm = prev->active_mm;
		if (prev->mm)                           // from user
			mmgrab(prev->active_mm);
		else
			prev->active_mm = NULL;
	} else {                                        // to user
		membarrier_switch_mm(rq, prev->active_mm, next->mm);
		/*
		 * sys_membarrier() requires an smp_mb() between setting
		 * rq->curr / membarrier_switch_mm() and returning to userspace.
		 *
		 * The below provides this either through switch_mm(), or in
		 * case 'prev->active_mm == next->mm' through
		 * finish_task_switch()'s mmdrop().
		 */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);

		if (!prev->mm) {                        // from kernel
			/* will mmdrop() in finish_task_switch(). */
			rq->prev_mm = prev->active_mm;
			prev->active_mm = NULL;
		}
	}

	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

	prepare_lock_switch(rq, next, rf);

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);
	barrier();

	return finish_task_switch(prev);
}

* context_switch - switch to the new MM and the new thread's register state.

static __always_inline struct rq *

context_switch(struct rq *rq, struct task_struct *prev,

struct task_struct *next, struct rq_flags *rf)

{

prepare_task_switch(rq, prev, next);

* For paravirt, this is coupled with an exit in switch_to to

* combine the page table reload and the switch backend into

* one hypercall.

arch_start_context_switch(prev);

* kernel -> kernel lazy + transfer active

* user -> kernel lazy + mmgrab() active

* kernel -> user switch + mmdrop() active

* user -> user switch

if (!next->mm) { // to kernel

enter_lazy_tlb(prev->active_mm, next);

next->active_mm = prev->active_mm;

if (prev->mm) // from user

mmgrab(prev->active_mm);

else

prev->active_mm = NULL;

} else { // to user

membarrier_switch_mm(rq, prev->active_mm, next->mm);

* sys_membarrier() requires an smp_mb() between setting

* rq->curr / membarrier_switch_mm() and returning to userspace.

* The below provides this either through switch_mm(), or in

* case 'prev->active_mm == next->mm' through

* finish_task_switch()'s mmdrop().

switch_mm_irqs_off(prev->active_mm, next->mm, next);

if (!prev->mm) { // from kernel

/* will mmdrop() in finish_task_switch(). */

rq->prev_mm = prev->active_mm;

prev->active_mm = NULL;

}

rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

prepare_lock_switch(rq, next, rf);

/* Here we just switch the register state and the stack. */

switch_to(prev, next, prev);

barrier();

return finish_task_switch(prev);

}

タスクは，schedule関数を呼び出すことで，自発的にCPUを放棄することができます．

また，以下の場合，現在のタスクはプリエンプションされる必要があります．

十分な時間実行された (すなわち，そのvruntimeが最小でなくなった) 場合
より高い優先度を持つタスクが起床した場合

need_reschedは，再スケジュールする必要があるかどうかを表すフラグです．

nee_reschedフラグはアーキテクチャ毎に固有のフラグTIF_NEED_RESCHEDとして定義されています．

x86-64とARM64の定義は以下になります．

x86-64ではTIF_NEED_RESCHEDは3，ARM64ではTIF_NEED_RESCHEDは1とアーキテクチャ毎に数値が異なります．

#define TIF_NEED_RESCHED	3	/* rescheduling necessary */

1	#define TIF_NEED_RESCHED 3 /* rescheduling necessary */

#define TIF_NEED_RESCHED	1	/* rescheduling necessary */

1	#define TIF_NEED_RESCHED 1 /* rescheduling necessary */

need_reschedフラグはlinux/include/linux/sched.hにある以下の関数で操作します．

set_tsk_need_resched関数：need_reschedフラグを設定する．
clear_tsk_need_resched関数：need_reschedフラグをクリアする．
test_tsk_need_resched関数：need_reschedフラグか設定されているどうか（再スケジュールが必要かどうか）判定する．

static inline void set_tsk_need_resched(struct task_struct *tsk)
{
	set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

static inline void clear_tsk_need_resched(struct task_struct *tsk)
{
	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

static inline int test_tsk_need_resched(struct task_struct *tsk)
{
	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
}

static inline void set_tsk_need_resched(struct task_struct *tsk)

{

set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);

}

static inline void clear_tsk_need_resched(struct task_struct *tsk)

{

clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);

}

static inline int test_tsk_need_resched(struct task_struct *tsk)

{

return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));

}

need_reschedフラグが設定されている場合，scheduler_tick関数とtry_to_wake_up関数を呼び出します．

need_reschedフラグは以下のタイミングでチェックされます．

（システムコールや割込みから）ユーザ空間に戻った時
割り込みから復帰した時

need_reschedフラグがセットされている場合，schedule関数が呼び出されます．

UNIX系OSの多くカーネルはノンプリエンプティブですが，Linuxカーネルはプリエンプティブです．

プリエンプティブスケジューラはカーネルのプリエンプションで対処します．

タスクをロックせずに安全な状態で実行すれば，カーネル内でプリエンプションできます．

thread_info構造体のpreempt_countが現在のロック深度を示します．

「need_resched && !preempt_count」の場合，プリエンプションしても安全です．

このチェックは，割り込みからカーネルに戻る時やロック解放時に行われます．

カーネルプリエンプションは以下の場合に発生します．

割り込みから復帰した時
カーネルコードが再びプリエンプティブになった時
カーネル内のタスクがブロックした時（例：mutex）

linux/include/linux/preempt.hの以下のマクロでプリエンプションの無効と有効を設定します．

preempt_disableマクロ：プリエンプションを無効にする．
preempt_enableマクロ：プリエンプションを有効にする．

#define preempt_disable() \
do { \
	preempt_count_inc(); \
	barrier(); \
} while (0)

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)

#define preempt_disable() \

do { \

preempt_count_inc(); \

barrier(); \

} while (0)

#define preempt_enable() \

do { \

barrier(); \

if (unlikely(preempt_count_dec_and_test())) \

__preempt_schedule(); \

} while (0)

リアルタイムスケジューリングクラス

Linuxカーネルには，リアルタイムスケジューリングクラスがあります．

schedule関数->__schedule関数->pick_next_task関数でfor_each_classマクロを呼び出してスケジューリングクラスを反復します．

リアルタイムスケジューリングクラスは，ノンリアルタイムスケジューリングクラスより高い優先度で実行します．

つまり，リアルタイムスケジューリングクラスはノンリアルタイムスケジューリングクラスより前の位置に配置されるという意味になります．

リアルタイムスケジューリングクラスを含む全てのクラスの優先度は高い順に以下になります．

stop_sched_class
- CONFIG_SMPが有効の場合，最高優先度のクラス
- ロードバランサーやCPUのホットプラグを実行するためにCPUを停止するスケジューリングクラス
- 操作関数はlinux/kernel/sched/stop_task.cで定義
dl_sched_class
- CONFIG_SMPが無効の場合（ユニプロセッサの場合），最高優先度のクラス
- 動的優先度のリアルタイムスケジューリングクラス
- SCHED_DEADLINEポリシーで利用
- 操作関数はlinux/kernel/sched/deadline.cで定義
rt_sched_class
- 固定優先度のリアルタイムスケジューリングクラス
- SCHED_FIFO，SCHED_RRポリシーで利用
- 操作関数はlinux/kernel/sched/rt.cで定義
fair_sched_class
- Linux CFSのSCHED_OTHER（SCHED_NORMAL）ポリシーで利用するクラス
- ノンリアルタイムで低優先度のバックグランドジョブ（バッチプロセス）のスケジューリングSCHED_BATCHポリシーでも利用
- 操作関数はlinux/kernel/sched/fair.cで定義
idle_sched_class
- ノンリアルタイムで非常に優先度の低いジョブのSCHED_IDLEポリシーで利用するクラス
- 操作関数はlinux/kernel/sched/idle.cで定義

これらのスケジューリングクラスは，linux/include/asm-generic/vmlinux.lds.hでアーキテクチャ固有のリンカスクリプト内で定義されています．

バージョン5.8までは操作関数と同じコードでスケジューリングが定義されていましたが，pick_next_task関数の最適化のため，バージョン5.9でリンカスクリプトの実装に変更しました（参考情報は以下）．

SCHED_FIFOは，バージョン2.2.x（2000年頃）に実装されたリアルタイムスケジューリングポリシーです．

SCHED_FIFOでは，タスクをブロックまたは解放するまで実行されます．

優先度の高いリアルタイムタスクのみがプリエンプション可能で，同じ優先度のタスクはラウンドロビン方式で実行します（下図）．

SCHED_RRは，SCHED_FIFOと同時期に実装されたリアルタイムスケジューリングポリシーです．

SCHED_RRは，タイムスライスが固定であること以外は，SCHED_FIFOと同じです（下図）．

SCHED_DEADLINEは，バージョン3.14でメインライン化されたリアルタイムポリシーで，予測可能なリアルタイムスケジューリングが可能になります．

SCHED_DEADLINEは，各タスクの起動期間と最悪実行時間（WCET：Worst Case Execution Time）に基づくEarliest Deadline First（EDF）スケジューリングとConstant Bandwidth Server（CBS）を統合したリアルタイムスケジューリングです．

まずはEDFとCBSを知りたいあなたはこちらからどうぞ．

: 【第8回】元東大教員から学ぶリアルタイムシステム「Earliest Deadline Firstスケジューリング」

こういった私から学べます．前回を読んでいない方はこちらからどうぞ．リアルタイムシステムの記事一覧はこちらからどうぞ． Earliest Deadline First（EDF）スケジューリング Ea ...

続きを見る

: 【第26回】元東大教員から学ぶリアルタイムシステム「Constant Bandwidth Server」

こういった私から学べます．前回を読んでいない方はこちらからどうぞ．リアルタイムシステムの記事一覧はこちらからどうぞ． Constant Bandwidth Server（CBS） Constant ...

続きを見る

SCHED_DEADLINEの詳細を知りたいあなたは，「Deadline Task Scheduling」の記事を読みましょう！

SCHED_DEADLINEがEDF + CBSで実装している理由を説明します．

EDFスケジューリングは，一時的なオーバーロードが発生した際，ドミノ倒しが発生してしまう問題があります．

下図は，通常のEDFスケジューリング（上部）と，タスク1が一時的なオーバーロード（Overrun）が発生してタスク2がドミノ倒しでデッドラインミスしているスケジューリング（下部）です．

この問題を解決するために，SCHED_DEADLINEでは，EDF + CBSでTemporal Isolationを採用しています．

下図の上部は先程のドミノ倒しの例ですが，下部はTemporal Isolationを採用することで，タスク2はドミノ倒しでデッドラインミスをしないことがわかります．

ただし，タスク1はOverrunによりデッドラインミスが発生していることに注意して下さい．

temporal isolation of edf cbs scheduling

SCHED_FIFOを利用したRMスケジューリング，SCHED_DEADLINEを利用したEDFスケジューリングの実装を知りたいあなたはこちらからどうぞ．

: 【C言語】LinuxにおけるリアルタイムスケジューリングRMとEDFの実装

こういった悩みにお答えします．こういった私から学べます． C言語でLinuxにおけるリアルタイムスケジューリングRMとEDFの実装 C言語でLinuxにおけるリアルタイムスケジューリングRMとEDF ...

続きを見る

参考：Nestスケジューラ

2022年9月12日にLPC 2022で発表されたNestは，マルチコア（メニーコア）プロセッサでタスクの実行間隔を短くするウォームコアを考慮したスケジューラです．

Nestは，できる限りウォームコアを利用することで，CPUキャッシュのヒット率を向上させることができます．

Intel 6130（16コア/32スレッド）の4ソケット（合計64コア/128スレッド）のシステムにおいてNestはCFSと比較して最大200%性能向上しています．

将来的にNestはCFSを置き換えるかもしれませんね．

Nestの講演動画はこちらです（スライドはこちら）．

まとめ

今回はプロセススケジューリングを紹介しました．

具体的には，Linux CFSとリアルタイムスケジューリングを解説しました．

プロセススケジューリングやリソース管理の詳細を知りたいあなたは，以下の記事を読みましょう！

LinuxカーネルはC言語で書かれています．

私にC言語の無料相談をしたいあなたは，公式LINE「ChishiroのC言語」の友だち追加をお願い致します．

私のキャパシティもあり，一定数に達したら終了しますので，今すぐ追加しましょう！

: 元東大教員がおすすめするC言語を学べるオンラインプログラミングスクール5社

こういった悩みにお答えします．こういった私が解説していきます．私が大学の授業で初めてC言語を勉強した時は全然できませんでしたが，先生やTA，友人に相談しながら一生懸命C言語を勉強してできるようにな ...

続きを見る

次回はこちらからどうぞ．

: 【第7回】元東大教員から学ぶLinuxカーネル「割り込み」

こういった私から学べます．前回を読んでいない方はこちらからどうぞ． Linuxカーネルの記事一覧はこちらからどうぞ．今回のテーマは割り込みです．割り込みを理解すると，ハードウェアとOSがどのよう ...

続きを見る

【第6回】元東大教員から学ぶLinuxカーネル「プロセススケジューリング」

【第5回】元東大教員から学ぶLinuxカーネル「プロセス管理」

元東大教員から学ぶLinuxカーネル

元東大教員がおすすめするC言語を学べるオンラインプログラミングスクール5社

プロセススケジューリングとは

スケジューリングポリシー

LinuxのCompletely Fair Scheduler（CFS）

Linuxのスケジューリングクラスの設計と実装

Linux CFSの動作

【C言語】キューとは【FIFOキュー，優先度キュー，二項ヒープ，赤黒木】

プリエンプションとコンテキストスイッチ

リアルタイムスケジューリングクラス

【第8回】元東大教員から学ぶリアルタイムシステム「Earliest Deadline Firstスケジューリング」

【第26回】元東大教員から学ぶリアルタイムシステム「Constant Bandwidth Server」

【C言語】LinuxにおけるリアルタイムスケジューリングRMとEDFの実装

参考：Nestスケジューラ

まとめ

元東大教員がおすすめするC言語を学べるオンラインプログラミングスクール5社

【第7回】元東大教員から学ぶLinuxカーネル「割り込み」