【第13回】元東大教員から学ぶLinuxカーネル「ページキャッシュとページフォールト」

本記事の信頼性

リアルタイムシステムの研究歴12年．
東大教員の時に，英語でOS（Linuxカーネル）の授業．
2012年9月～2013年8月にアメリカのノースカロライナ大学チャペルヒル校（UNC）コンピュータサイエンス学部で客員研究員として勤務．C言語でリアルタイムLinuxの研究開発．
プログラミング歴15年以上，習得している言語: C/C++，Python，Solidity/Vyper，Java，Ruby，Go，Rust，D，HTML/CSS/JS/PHP，MATLAB，Verse（UEFN）, Assembler (x64，aarch64)．
東大教員の時に，C++言語で開発した「LLVMコンパイラの拡張」，C言語で開発した独自のリアルタイムOS「Mcube Kernel」をGitHubにオープンソースとして公開．
2020年1月～現在はアメリカのノースカロライナ州チャペルヒルにあるGuarantee Happiness LLCのCTOとしてECサイト開発やWeb/SNSマーケティングの業務．2022年6月～現在はアメリカのノースカロライナ州チャペルヒルにあるJapanese Tar Heel, Inc.のCEO兼CTO．
最近は自然言語処理AIとイーサリアムに関する有益な情報発信や，Unreal Editor for Fortnite（UEFN）でゲーム開発に従事．

（AI全般を含む）自然言語処理AIの論文の日本語訳や，AIチャットボット（ChatGPT，Auto-GPT，Gemini（旧Bard）など）の記事を50本以上執筆．アメリカのサンフランシスコ（広義のシリコンバレー）の会社でChatGPT/Geminiを訓練するプロンプトエンジニア・マネージャー・Quality Assurance（QA）の業務委託の経験あり．
（スマートコントラクトのプログラミングを含む）イーサリアムや仮想通貨全般の記事を200本以上執筆．イギリスのロンドンの会社で仮想通貨の英語の記事を日本語に翻訳する業務委託の経験あり．
UEFNで10本以上のゲームを開発し，フォートナイト上で公開（Fortnite，Fortnite.GG）．

こういった私から学べます．

前回を読んでいない方はこちらからどうぞ．

: 【第12回】元東大教員から学ぶLinuxカーネル「仮想ファイルシステム」

こういった私から学べます．前回を読んでいない方はこちらからどうぞ． Linuxカーネルの記事一覧はこちらからどうぞ．今回のテーマは仮想ファイルシステム（VFS：Virtual File Syste ...

続きを見る

Linuxカーネルの記事一覧はこちらからどうぞ．

: 元東大教員から学ぶLinuxカーネル

こういった私から学べます． Linuxカーネルとは，C言語で開発されたオープンソースのOSです． Linuxカーネルは主に以下のコンピュータで広く利用されています．スーパーコンピュータサーバ An ...

続きを見る

LinuxカーネルはC言語で書かれています．

私にC言語の無料相談をしたいあなたは，公式LINE「ChishiroのC言語」の友だち追加をお願い致します．

私のキャパシティもあり，一定数に達したら終了しますので，今すぐ追加しましょう！

独学が難しいあなたは，元東大教員がおすすめするC言語を学べるオンラインプログラミングスクール5社で自分に合うスクールを見つけましょう．後悔はさせません！

: 元東大教員がおすすめするC言語を学べるオンラインプログラミングスクール5社

こういった悩みにお答えします．こういった私が解説していきます．私が大学の授業で初めてC言語を勉強した時は全然できませんでしたが，先生やTA，友人に相談しながら一生懸命C言語を勉強してできるようにな ...

続きを見る

今回のテーマはページキャッシュとページフォールトです．

本記事を読むとページ，キャッシュ，メモリの関係がわかります．

キャッシュの背景：コンピュータの遅延

キャッシュの背景としてコンピュータの遅延が挙げられます．

下記にLatency numbers every programmer should knowに記載されているコンピュータの遅延を示します．

L1キャッシュの遅延0.5nsですが，カリフォルニアとオランダ間のパケットのRound-Trip Timeは150msになります．

L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns             
Compress 1K bytes with Zippy ............. 3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns  =  20 µs
SSD random read ........................ 150,000 ns  = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns  = 250 µs
Round trip within same datacenter ...... 500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD* ..... 1,000,000 ns  =   1 ms
Disk seek ........................... 10,000,000 ns  =  10 ms
Read 1 MB sequentially from disk .... 20,000,000 ns  =  20 ms
Send packet CA->Netherlands->CA .... 150,000,000 ns  = 150 ms

L1 cache reference ......................... 0.5 ns

Branch mispredict ............................ 5 ns

L2 cache reference ........................... 7 ns

Mutex lock/unlock ........................... 25 ns

Main memory reference ...................... 100 ns

Compress 1K bytes with Zippy ............. 3,000 ns = 3 µs

Send 2K bytes over 1 Gbps network ....... 20,000 ns = 20 µs

SSD random read ........................ 150,000 ns = 150 µs

Read 1 MB sequentially from memory ..... 250,000 ns = 250 µs

Round trip within same datacenter ...... 500,000 ns = 0.5 ms

Read 1 MB sequentially from SSD* ..... 1,000,000 ns = 1 ms

Disk seek ........................... 10,000,000 ns = 10 ms

Read 1 MB sequentially from disk .... 20,000,000 ns = 20 ms

Send packet CA->Netherlands->CA .... 150,000,000 ns = 150 ms

人間で表現した場合（10億倍の遅延）は以下になります．

L1 cache reference 0.5 s # 1回の心臓の鼓動
Branch mispredict 5 s # あくび
L2 cache reference 7 s # 長いあくび
Mutex lock/unlock 25 s # コーヒーを淹れる
Main memory reference 100 s # 歯を磨く
Compress 1K bytes with Zippy 50 min # テレビ番組1話分（広告あり）
Send 2K bytes over 1 Gbps network 5.5 hr # 昼食から終業まで
SSD random read 1.7 days # 普通の週末（土日）
Read 1 MB sequentially from memory 2.9 days # 長い週末（約3連休）
Round trip within same datacenter 5.8 days # ゴールデンウィーク
Read 1 MB sequentially from SSD 11.6 days # 納品まで2週間近く待ち
Disk seek 16.5 weeks # 大学での1学期（春学期または秋学期）
Read 1 MB sequentially from disk 7.8 months # 大学での2学期（春学期と秋学期）
The above 2 together 1 year # 1年
Send packet CA->Netherlands->CA 4.8 years # 学士号取得にかかる平均時間

L1 cache reference 0.5 s # 1回の心臓の鼓動

Branch mispredict 5 s # あくび

L2 cache reference 7 s # 長いあくび

Mutex lock/unlock 25 s # コーヒーを淹れる

Main memory reference 100 s # 歯を磨く

Compress 1K bytes with Zippy 50 min # テレビ番組1話分（広告あり）

Send 2K bytes over 1 Gbps network 5.5 hr # 昼食から終業まで

SSD random read 1.7 days # 普通の週末（土日）

Read 1 MB sequentially from memory 2.9 days # 長い週末（約3連休）

Round trip within same datacenter 5.8 days # ゴールデンウィーク

Read 1 MB sequentially from SSD 11.6 days # 納品まで2週間近く待ち

Disk seek 16.5 weeks # 大学での1学期（春学期または秋学期）

Read 1 MB sequentially from disk 7.8 months # 大学での2学期（春学期と秋学期）

The above 2 together 1 year # 1年

Send packet CA->Netherlands->CA 4.8 years # 学士号取得にかかる平均時間

なぜキャッシュが必要なのかというと，ディスクアクセスはメモリアクセスに比べ数桁遅いからです．

また，一度アクセスしたデータは，近い将来に再びアクセスされる可能性が高いという「時間的局所性」の性質を利用することで，コンピュータの性能を向上させることができます．

ページキャッシュ（バッファキャッシュ）

ページキャッシュ（バッファキャッシュ）は，ディスクの内容（ブロック）を保持するメモリ上の物理ページのことです．

ディスクはバッキングストアと呼ばれ，通常のファイル，メモリマップドファイル，ブロックデバイスのファイルに対して機能します．

バッキングストアとは，メモリのバックアップとしてディスク（HDD/SSD）に保存するファイルシステムの領域のことです．

これに対して，バッキングストアと似た概念であるスワップ領域とは，特殊なパーティションの1つで，メモリの容量の延長としてプログラムやデータを記録することができる領域のことです．

バッキングストアとスワップ領域の違いは「Swap Space vs Backing Store」で議論されています．

ページキャッシュは動的なサイズなので，カーネルやプロセスで使用されていない空きメモリを消費するためサイズを拡大したり，メモリの圧迫を緩和するために縮小したりします．

バッファードI/O操作（O_DIRECTなし）では，まずファイルのページキャッシュがチェックされます．

キャッシュヒット（cache hit）：データがページキャッシュにある場合，ユーザメモリから/へコピーされます．

キャッシュミス（cache miss）：そうでなければ，仮想ファイルシステムは具体的なファイルシステム（例：ext4）にディスクからデータを読み込むように要求します．つまり，読み取り/書き込み操作でページキャッシュにデータを入れます．

キャッシュのデータ更新ポリシーは以下になります（上図）．

no write：書き込み操作をキャッシュしない．

write through：書き込み操作を即座にディスクに通す．キャッシュのコヒーレント性を保ち，キャッシュされたデータを無効化する必要がないためシンプルです．

write back：書き込み操作はページキャッシュを更新しますが，ディスクはすぐには更新されない「Linuxのページキャッシュポリシー」です．書き込まれたページは，radix treeのタグを使ってdirtyとマークされます．定期的にdirtyページをディスクに書き込むことがWrite Backです．ページキャッシュは時間的な局所性を吸収してディスクアクセスを削減します．マルチコア/メニーコアプロセッサで有用です．

キャッシュの追い出し（eviction）ポリシー

キャッシュの追い出し（eviction）ポリシーを紹介します．

キャッシュからデータを削除するタイミングは，より多くの空きメモリが必要（メモリ圧迫）な場合です．

どのデータをキャッシュから削除すべきかは，理想は将来アクセスされないキャッシュページを退避させることです．

キャッシュの追い出しポリシーは，どのキャッシュを追い出す（メモリに書き出す）のかを決めることです．

LRU（Least Recently Used）ポリシーは，各ページがいつアクセスされたかを記録し，最も古いタイムスタンプを持つページを追い出すポリシーです．

LRUポリシーの失敗例は，多くのファイルは一度アクセスされると二度とアクセスされない場合です．

LRUはこれらのファイルをLRUリストの先頭に置くので，最適ではないです．

LinuxのLRUは，以下の2つのリンクリスト（Two-List or LRU/2 Strategy）で管理します．

active list：アクティブリスト内のページがホット（hot）とみなされ，追い出し不可になります．

inactive list：非アクティブリスト内のページはコールド（cold）とみなされ，追い出し可能です．

新規にアクセスされたページはinactive listに追加されます．

inactive listに登録されたページが再びアクセスされた場合，active listに昇格されます．

ページがinactive listに移動されると，ページテーブルのアクセス権が削除されます．

active listがinactive listよりはるかに大きくなった場合，active listの先頭のアイテムはinactive listに戻されます．

ページがinactive listに追加されると，そのアクセスを追跡するために，ページテーブルのアクセス許可が無効になります．

linux/include/linux/fs.hのaddress_space構造体でページキャッシュに存在するエンティティ（実体）を管理します．

「アドレス空間 = ファイル = ファイルのページキャッシュにアクセス = 1つ以上のvm_area_struct構造体」になります（上図）．

address_space構造体の主なメンバ変数は以下になります．

host : 対応するファイルのinodeへのポインタ

i_mmap : このアドレス空間に関するすべての共有マッピングおよびプライベートマッピング

nrpages : アドレス空間に含まれるページの総数

a_ops : アドレス空間での操作

/**
 * struct address_space - Contents of a cacheable, mappable object.
 * @host: Owner, either the inode or the block_device.
 * @i_pages: Cached pages.
 * @invalidate_lock: Guards coherency between page cache contents and
 *   file offset->disk block mappings in the filesystem during invalidates.
 *   It is also used to block modification of page cache contents through
 *   memory mappings.
 * @gfp_mask: Memory allocation flags to use for allocating pages.
 * @i_mmap_writable: Number of VM_SHARED mappings.
 * @nr_thps: Number of THPs in the pagecache (non-shmem only).
 * @i_mmap: Tree of private and shared mappings.
 * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
 * @nrpages: Number of page entries, protected by the i_pages lock.
 * @writeback_index: Writeback starts here.
 * @a_ops: Methods.
 * @flags: Error bits and flags (AS_*).
 * @wb_err: The most recent error which has occurred.
 * @private_lock: For use by the owner of the address_space.
 * @private_list: For use by the owner of the address_space.
 * @private_data: For use by the owner of the address_space.
 */
struct address_space {
	struct inode		*host;
	struct xarray		i_pages;
	struct rw_semaphore	invalidate_lock;
	gfp_t			gfp_mask;
	atomic_t		i_mmap_writable;
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
	/* number of thp, only for non-shmem files */
	atomic_t		nr_thps;
#endif
	struct rb_root_cached	i_mmap;
	struct rw_semaphore	i_mmap_rwsem;
	unsigned long		nrpages;
	pgoff_t			writeback_index;
	const struct address_space_operations *a_ops;
	unsigned long		flags;
	errseq_t		wb_err;
	spinlock_t		private_lock;
	struct list_head	private_list;
	void			*private_data;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;

/**

* struct address_space - Contents of a cacheable, mappable object.

* @host: Owner, either the inode or the block_device.

* @i_pages: Cached pages.

* @invalidate_lock: Guards coherency between page cache contents and

* file offset->disk block mappings in the filesystem during invalidates.

* It is also used to block modification of page cache contents through

* memory mappings.

* @gfp_mask: Memory allocation flags to use for allocating pages.

* @i_mmap_writable: Number of VM_SHARED mappings.

* @nr_thps: Number of THPs in the pagecache (non-shmem only).

* @i_mmap: Tree of private and shared mappings.

* @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.

* @nrpages: Number of page entries, protected by the i_pages lock.

* @writeback_index: Writeback starts here.

* @a_ops: Methods.

* @flags: Error bits and flags (AS_*).

* @wb_err: The most recent error which has occurred.

* @private_lock: For use by the owner of the address_space.

* @private_list: For use by the owner of the address_space.

* @private_data: For use by the owner of the address_space.

struct address_space {

struct inode *host;

struct xarray i_pages;

struct rw_semaphore invalidate_lock;

gfp_t gfp_mask;

atomic_t i_mmap_writable;

#ifdef CONFIG_READ_ONLY_THP_FOR_FS

/* number of thp, only for non-shmem files */

atomic_t nr_thps;

#endif

struct rb_root_cached i_mmap;

struct rw_semaphore i_mmap_rwsem;

unsigned long nrpages;

pgoff_t writeback_index;

const struct address_space_operations *a_ops;

unsigned long flags;

errseq_t wb_err;

spinlock_t private_lock;

struct list_head private_list;

void *private_data;

} __attribute__((aligned(sizeof(long)))) __randomize_layout;

linux/include/linux/fs.hのaddress_space_operations構造体でアドレス空間を操作します．

アドレス空間を操作する主な関数ポインタは以下になります．

writepage関数ポインタ：dirtyページをバッキングストアに書き込むために仮想メモリサブシステムから呼び出されます．

readpage関数ポインタ：バッキングストアからページを読み込むために仮想メモリサブシステムから呼び出されます．

writepages関数ポインタ：address_spaceオブジェクトに関連するページを書き出すために仮想メモリサブシステムによって呼び出されます．

set_page_dirty関数ポインタ：仮想メモリサブシステムがページをdirtyに設定するために呼び出されます．

readpages関数ポインタ：address_spaceオブジェクトに関連するページを読み込むために仮想メモリサブシステムによって呼び出されます．これは本質的にreadpageのベクターバージョンです．

readahead関数ポインタ：address_spaceオブジェクトに関連するページを読み込むために仮想メモリサブシステムによって呼び出されます．ページはページキャッシュに連続し，ロックされています．

write_begin関数ポインタ：汎用バッファの書き込みライトコードによって呼び出され，ファイルシステムに対して，ファイル内の与えられたオフセットにlenバイトを書き込む準備をするように要求します．

write_end関数ポインタ：write_begin関数ポインタの呼び出しが成功し，データのコピーが完了した後に呼び出されます．

struct address_space_operations {
	int (*writepage)(struct page *page, struct writeback_control *wbc);
	int (*readpage)(struct file *, struct page *);

	/* Write back some dirty pages from this mapping. */
	int (*writepages)(struct address_space *, struct writeback_control *);

	/* Set a page dirty.  Return true if this dirtied it */
	int (*set_page_dirty)(struct page *page);

	/*
	 * Reads in the requested pages. Unlike ->readpage(), this is
	 * PURELY used for read-ahead!.
	 */
	int (*readpages)(struct file *filp, struct address_space *mapping,
			struct list_head *pages, unsigned nr_pages);
	void (*readahead)(struct readahead_control *);

	int (*write_begin)(struct file *, struct address_space *mapping,
				loff_t pos, unsigned len, unsigned flags,
				struct page **pagep, void **fsdata);
	int (*write_end)(struct file *, struct address_space *mapping,
				loff_t pos, unsigned len, unsigned copied,
				struct page *page, void *fsdata);

	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
	sector_t (*bmap)(struct address_space *, sector_t);
	void (*invalidatepage) (struct page *, unsigned int, unsigned int);
	int (*releasepage) (struct page *, gfp_t);
	void (*freepage)(struct page *);
	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
	/*
	 * migrate the contents of a page to the specified target. If
	 * migrate_mode is MIGRATE_ASYNC, it must not block.
	 */
	int (*migratepage) (struct address_space *,
			struct page *, struct page *, enum migrate_mode);
	bool (*isolate_page)(struct page *, isolate_mode_t);
	void (*putback_page)(struct page *);
	int (*launder_page) (struct page *);
	int (*is_partially_uptodate) (struct page *, unsigned long,
					unsigned long);
	void (*is_dirty_writeback) (struct page *, bool *, bool *);
	int (*error_remove_page)(struct address_space *, struct page *);

	/* swapfile support */
	int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
				sector_t *span);
	void (*swap_deactivate)(struct file *file);
};

struct address_space_operations {

int (*writepage)(struct page *page, struct writeback_control *wbc);

int (*readpage)(struct file *, struct page *);

/* Write back some dirty pages from this mapping. */

int (*writepages)(struct address_space *, struct writeback_control *);

/* Set a page dirty. Return true if this dirtied it */

int (*set_page_dirty)(struct page *page);

* Reads in the requested pages. Unlike ->readpage(), this is

* PURELY used for read-ahead!.

int (*readpages)(struct file *filp, struct address_space *mapping,

struct list_head *pages, unsigned nr_pages);

void (*readahead)(struct readahead_control *);

int (*write_begin)(struct file *, struct address_space *mapping,

loff_t pos, unsigned len, unsigned flags,

struct page **pagep, void **fsdata);

int (*write_end)(struct file *, struct address_space *mapping,

loff_t pos, unsigned len, unsigned copied,

struct page *page, void *fsdata);

/* Unfortunately this kludge is needed for FIBMAP. Don't use it */

sector_t (*bmap)(struct address_space *, sector_t);

void (*invalidatepage) (struct page *, unsigned int, unsigned int);

int (*releasepage) (struct page *, gfp_t);

void (*freepage)(struct page *);

ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);

* migrate the contents of a page to the specified target. If

* migrate_mode is MIGRATE_ASYNC, it must not block.

int (*migratepage) (struct address_space *,

struct page *, struct page *, enum migrate_mode);

bool (*isolate_page)(struct page *, isolate_mode_t);

void (*putback_page)(struct page *);

int (*launder_page) (struct page *);

int (*is_partially_uptodate) (struct page *, unsigned long,

unsigned long);

void (*is_dirty_writeback) (struct page *, bool *, bool *);

int (*error_remove_page)(struct address_space *, struct page *);

/* swapfile support */

int (*swap_activate)(struct swap_info_struct *sis, struct file *file,

sector_t *span);

void (*swap_deactivate)(struct file *file);

};

参考：Multi-Gen LRU（MGLRU）

Multi-Gen LRU（MGLRU）は，グーグルが開発したLRUです．

MGLRUは，上述したLRU（baseline）より性能が数%～数十%程度も向上しています．

Linuxカーネルのバージョン6.1にMGLRUがメインラインにマージされるとのことです（2022年10月現在の最新版はバージョン6.0）．

2022年9月12日に開催されたLPC 2022でMGLRUが発表されました．

以下の動画の2:10:38で観られます．

MGLRUの解説動画です．

メモリ管理との連携とページフォールト

メモリ管理との連携では，以下の構造体を利用します．

file/file_operations構造体：ファイルの内容にアクセスする方法

address_space/address_space_operations構造体：ファイルのページキャッシュにアクセスする方法

vm_area_struct/vm_operations_struct構造体：仮想メモリ領域のページフォールトの処理方法

file/file_operations構造体を知りたいあなたはこちらからどうぞ．

: 【第12回】元東大教員から学ぶLinuxカーネル「仮想ファイルシステム」

こういった私から学べます．前回を読んでいない方はこちらからどうぞ． Linuxカーネルの記事一覧はこちらからどうぞ．今回のテーマは仮想ファイルシステム（VFS：Virtual File Syste ...

続きを見る

vm_area_struct/vm_operations_struct構造体を知りたいあなたはこちらからどうぞ．

: 【第11回】元東大教員から学ぶLinuxカーネル「プロセスアドレス空間」

こういった私から学べます．前回を読んでいない方はこちらからどうぞ． Linuxカーネルの記事一覧はこちらからどうぞ．今回のテーマはプロセスアドレス空間です．プロセスアドレス空間を学ぶとページテー ...

続きを見る

ページフォールトとは，プロセスが適切な準備なしにメモリページにアクセスしたときにメモリ管理ユニット（MMU）が発生させる例外です．

ページにアクセスするには，プロセスの仮想アドレス空間にマッピングを追加する必要があります．

さらに，実際のページの内容をディスクなどのバッキングストアから読み込む必要がある場合もあります．

MMUはページフォルトを検出しますが，OSのカーネルは，必要なページを物理メモリでアクセスできるようにするか，不正なメモリアクセスを拒否することで例外を処理します．

Linuxカーネルにおけるページフォールトのエントリーポイントは，linux/mm/memory.cにあるhandle_pte_fault関数です．

/*
 * These routines also need to handle stuff like marking pages dirty
 * and/or accessed for architectures that don't do it in hardware (most
 * RISC architectures).  The early dirtying is also good on the i386.
 *
 * There is also a hook called "update_mmu_cache()" that architectures
 * with external mmu caches can use to update those (ie the Sparc or
 * PowerPC hashed page tables that act as extended TLBs).
 *
 * We enter with non-exclusive mmap_lock (to exclude vma changes, but allow
 * concurrent faults).
 *
 * The mmap_lock may have been released depending on flags and our return value.
 * See filemap_fault() and __lock_page_or_retry().
 */
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
	pte_t entry;

	if (unlikely(pmd_none(*vmf->pmd))) {
		/*
		 * Leave __pte_alloc() until later: because vm_ops->fault may
		 * want to allocate huge page, and if we expose page table
		 * for an instant, it will be difficult to retract from
		 * concurrent faults and from rmap lookups.
		 */
		vmf->pte = NULL;
	} else {
		/*
		 * If a huge pmd materialized under us just retry later.  Use
		 * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
		 * of pmd_trans_huge() to ensure the pmd didn't become
		 * pmd_trans_huge under us and then back to pmd_none, as a
		 * result of MADV_DONTNEED running immediately after a huge pmd
		 * fault in a different thread of this mm, in turn leading to a
		 * misleading pmd_trans_huge() retval. All we have to ensure is
		 * that it is a regular pmd that we can walk with
		 * pte_offset_map() and we can do that through an atomic read
		 * in C, which is what pmd_trans_unstable() provides.
		 */
		if (pmd_devmap_trans_unstable(vmf->pmd))
			return 0;
		/*
		 * A regular pmd is established and it can't morph into a huge
		 * pmd from under us anymore at this point because we hold the
		 * mmap_lock read mode and khugepaged takes it in write mode.
		 * So now it's safe to run pte_offset_map().
		 */
		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
		vmf->orig_pte = *vmf->pte;

		/*
		 * some architectures can have larger ptes than wordsize,
		 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
		 * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
		 * accesses.  The code below just needs a consistent view
		 * for the ifs and we later double check anyway with the
		 * ptl lock held. So here a barrier will do.
		 */
		barrier();
		if (pte_none(vmf->orig_pte)) {
			pte_unmap(vmf->pte);
			vmf->pte = NULL;
		}
	}

	if (!vmf->pte) {
		if (vma_is_anonymous(vmf->vma))
			return do_anonymous_page(vmf);
		else
			return do_fault(vmf);
	}

	if (!pte_present(vmf->orig_pte))
		return do_swap_page(vmf);

	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
		return do_numa_page(vmf);

	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	spin_lock(vmf->ptl);
	entry = vmf->orig_pte;
	if (unlikely(!pte_same(*vmf->pte, entry))) {
		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
		goto unlock;
	}
	if (vmf->flags & FAULT_FLAG_WRITE) {
		if (!pte_write(entry))
			return do_wp_page(vmf);
		entry = pte_mkdirty(entry);
	}
	entry = pte_mkyoung(entry);
	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
				vmf->flags & FAULT_FLAG_WRITE)) {
		update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
	} else {
		/* Skip spurious TLB flush for retried page fault */
		if (vmf->flags & FAULT_FLAG_TRIED)
			goto unlock;
		/*
		 * This is needed only for protection faults but the arch code
		 * is not yet telling us if this is a protection fault or not.
		 * This still avoids useless tlb flushes for .text page faults
		 * with threads.
		 */
		if (vmf->flags & FAULT_FLAG_WRITE)
			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
	}
unlock:
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	return 0;
}

100

101

102

103

104

105

106

107

108

109

110

111

112

* These routines also need to handle stuff like marking pages dirty

* and/or accessed for architectures that don't do it in hardware (most

* RISC architectures). The early dirtying is also good on the i386.

* There is also a hook called "update_mmu_cache()" that architectures

* with external mmu caches can use to update those (ie the Sparc or

* PowerPC hashed page tables that act as extended TLBs).

* We enter with non-exclusive mmap_lock (to exclude vma changes, but allow

* concurrent faults).

* The mmap_lock may have been released depending on flags and our return value.

* See filemap_fault() and __lock_page_or_retry().

static vm_fault_t handle_pte_fault(struct vm_fault *vmf)

{

pte_t entry;

if (unlikely(pmd_none(*vmf->pmd))) {

* Leave __pte_alloc() until later: because vm_ops->fault may

* want to allocate huge page, and if we expose page table

* for an instant, it will be difficult to retract from

* concurrent faults and from rmap lookups.

vmf->pte = NULL;

} else {

* If a huge pmd materialized under us just retry later. Use

* pmd_trans_unstable() via pmd_devmap_trans_unstable() instead

* of pmd_trans_huge() to ensure the pmd didn't become

* pmd_trans_huge under us and then back to pmd_none, as a

* result of MADV_DONTNEED running immediately after a huge pmd

* fault in a different thread of this mm, in turn leading to a

* misleading pmd_trans_huge() retval. All we have to ensure is

* that it is a regular pmd that we can walk with

* pte_offset_map() and we can do that through an atomic read

* in C, which is what pmd_trans_unstable() provides.

if (pmd_devmap_trans_unstable(vmf->pmd))

return 0;

* A regular pmd is established and it can't morph into a huge

* pmd from under us anymore at this point because we hold the

* mmap_lock read mode and khugepaged takes it in write mode.

* So now it's safe to run pte_offset_map().

vmf->pte = pte_offset_map(vmf->pmd, vmf->address);

vmf->orig_pte = *vmf->pte;

* some architectures can have larger ptes than wordsize,

* e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and

* CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic

* accesses. The code below just needs a consistent view

* for the ifs and we later double check anyway with the

* ptl lock held. So here a barrier will do.

barrier();

if (pte_none(vmf->orig_pte)) {

pte_unmap(vmf->pte);

vmf->pte = NULL;

}

if (!vmf->pte) {

if (vma_is_anonymous(vmf->vma))

return do_anonymous_page(vmf);

else

return do_fault(vmf);

}

if (!pte_present(vmf->orig_pte))

return do_swap_page(vmf);

if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))

return do_numa_page(vmf);

vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);

spin_lock(vmf->ptl);

entry = vmf->orig_pte;

if (unlikely(!pte_same(*vmf->pte, entry))) {

update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);

goto unlock;

}

if (vmf->flags & FAULT_FLAG_WRITE) {

if (!pte_write(entry))

return do_wp_page(vmf);

entry = pte_mkdirty(entry);

}

entry = pte_mkyoung(entry);

if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,

vmf->flags & FAULT_FLAG_WRITE)) {

update_mmu_cache(vmf->vma, vmf->address, vmf->pte);

} else {

/* Skip spurious TLB flush for retried page fault */

if (vmf->flags & FAULT_FLAG_TRIED)

goto unlock;

* This is needed only for protection faults but the arch code

* is not yet telling us if this is a protection fault or not.

* This still avoids useless tlb flushes for .text page faults

* with threads.

if (vmf->flags & FAULT_FLAG_WRITE)

flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);

}

unlock:

pte_unmap_unlock(vmf->pte, vmf->ptl);

return 0;

}

どの仮想メモリ領域のフォールトアドレスに該当するかを特定し，仮想メモリ領域に登録されたフォールトハンドラがあるかどうかを確認します．

デフォルトのフォールトハンドラは以下になります．

do_anonymous_page関数：ページもファイルもないの場合

filemap_fault関数：ファイルにバックアップされたページの場合

do_wp_page関数：書き込み保護ページ（コピーオンライト，CoW：Copy-on-Write）の場合

do_swap_page関数：スワップによってバックアップされたページの場合

filemap_fault関数を呼び出す場合，ページテーブルエントリ（PTE：Page Table Entry）が存在しない場合（パーミッションは「---」）に呼び出されます．

しかし，仮想メモリ領域はアクセス可能（例：パーミッションは「rwx」）とマークされ，関連するファイル（vm_file）を保持しています．

ページフォルトハンドラが違いを認識します．

ファイルのページキャッシュを検索し，キャッシュにヒットした場合，キャッシュのページをマップします．

そうでなければ，「mapping->a_ops->readpage(file, page)」を実行します．

do_wp_page関数を呼び出す場合，ページテーブルエントリが書き込み不可とマークされています（例：r--）．

しかし，仮想メモリ領域は書き込み可能としてマークされている（例：rw-）．

ページフォルトハンドラはその違いに認識し，CoWを意味するため，物理ページの複製を作成します．

そして，ページテーブルエントリを更新し，TLBエントリをフラッシュします．

Flusher Daemon

Flusher Daemonは，複数のスレッドがページキャッシュからディスクにdirtyページを同期することです．

書き込み動作が遅延していて，データがdirtyとマークされている状態，つまりメモリのデータが記憶媒体と同期していない状態の場合，dirtyページのwrite backは以下の条件で発生します．

空きメモリが少なくなり，ページキャッシュを縮小する必要がある場合

dirtyデータが特定のスレッショルドより古くなる場合

ユーザプロセスがsync関数またはfsync関数を呼び出す場合

空きメモリが所定の閾値を下回ると，カーネルはwakeup_flusher_threads関数を呼び出します．

bdi_writeback_all関数がライトバックを実行している1つまたは複数のFlusher Threadを起床します．

num_pages_to_writeが書き込まれるかつメモリ量がスレッショルドを下回るまでスレッドはデータをディスクに書き込みます．

Flusher Daemonが起動する総メモリ量のパーセンテージは/proc/sys/vm/dirty_background_ratioで確認できます．

私の環境ではdirty_background_ratioは10%です．

$ cat /proc/sys/vm/dirty_background_ratio 
10

1 2	$ cat /proc/sys/vm/dirty_background_ratio 10

ブート時にタイマを初期化し，wb_writeback関数を呼び出すFlusher Threadを起動します．

与えられた値より古いすべてのデータをwrite backする期間は，/proc/sys/vm/dirty_expire_centisecsで確認できます（単位は1/100秒（10ms））．

私の環境では，dirty_expire_centisecsは3000（つまり30秒）です．

$ cat /proc/sys/vm/dirty_expire_centisecs 
3000

1 2	$ cat /proc/sys/vm/dirty_expire_centisecs 3000

将来の指定された時間に期限切れとなるように再初期化されたタイマ（現在 + 期間）は，/proc/sys/vm/dirty_writeback_centisecsで確認できます（単位は1/100秒（10ms））．

私の環境では，dirty_writeback_centisecsは500（つまり5秒）です．

$ cat /proc/sys/vm/dirty_writeback_centisecs 
500

1 2	$ cat /proc/sys/vm/dirty_writeback_centisecs 500

ライトバックと一般的なページキャッシュの制御に関連する他の複数のパラメータは，/proc/sys/vmにあります（詳細はこちら）．

まとめ

今回はページキャッシュとページフォールトを紹介しました．

ページ，キャッシュ，メモリの関係がわかりました．

ページキャッシュとページフォールトを深掘りしたいあなたは，以下を読みましょう！

Latency numbers every programmer should know

Better active/inactive list balancing

Flushing out pdflush

User-space page fault handling

Documentation for /proc/sys/vm/

以下の動画もおすすめです！

LinuxカーネルはC言語で書かれています．

私にC言語の無料相談をしたいあなたは，公式LINE「ChishiroのC言語」の友だち追加をお願い致します．

私のキャパシティもあり，一定数に達したら終了しますので，今すぐ追加しましょう！

: 元東大教員がおすすめするC言語を学べるオンラインプログラミングスクール5社

こういった悩みにお答えします．こういった私が解説していきます．私が大学の授業で初めてC言語を勉強した時は全然できませんでしたが，先生やTA，友人に相談しながら一生懸命C言語を勉強してできるようにな ...

続きを見る

次回はこちらからどうぞ．

: 【第14回】元東大教員から学ぶLinuxカーネル「ファイルシステムとクラッシュ整合性」

こういった私から学べます．前回を読んでいない方はこちらからどうぞ． Linuxカーネルの記事一覧はこちらからどうぞ．今回のテーマはファイルシステムとクラッシュ整合性です．特に，Linuxカーネル ...

続きを見る