[Part 15] Learn the Linux Kernel from a Former University of Tokyo Faculty Member: The Block Layer

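Why you can trust this article

12 years of research on real-time systems.
Taught an operating systems (Linux kernel) course in English while on the faculty of the University of Tokyo.
Visiting researcher at the Department of Computer Science, University of North Carolina at Chapel Hill (UNC), from September 2012 to August 2013, doing research and development on real-time Linux in C.
More than 15 years of programming experience. Languages: C/C++, Python, Solidity/Vyper, Java, Ruby, Go, Rust, D, HTML/CSS/JS/PHP, MATLAB, Verse (UEFN), Assembler (x64, aarch64).
While at the University of Tokyo, released an LLVM compiler extension written in C++ and "Mcube Kernel", an original real-time OS written in C, as open source on GitHub.
CTO of Guarantee Happiness LLC in Chapel Hill, North Carolina, since January 2020, working on e-commerce site development and Web/SNS marketing. CEO and CTO of Japanese Tar Heel, Inc. in Chapel Hill, North Carolina, since June 2022.
Recently publishing material on natural language processing AI and Ethereum, and developing games with Unreal Editor for Fortnite (UEFN).

Author of more than 50 articles on AI chatbots (ChatGPT, Auto-GPT, Gemini (formerly Bard), and others) and Japanese translations of natural language processing (and general AI) papers. Contract experience as a prompt engineer, manager, and Quality Assurance (QA) specialist training ChatGPT/Gemini for a company in San Francisco (Silicon Valley in the broad sense).
Author of more than 200 articles on Ethereum and cryptocurrencies in general, including smart contract programming. Contract experience translating English cryptocurrency articles into Japanese for a company in London.
Developer of more than 10 games built with UEFN and published on Fortnite (Fortnite, Fortnite.GG).

That is the background I bring to this series.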

If you have not read the previous installment, start here:

[Part 14] Learn the Linux Kernel from a Former University of Tokyo Faculty Member: File Systems and Crash Consistency

The full list of Linux kernel articles is here:

Learn the Linux Kernel from a Former University of Tokyo Faculty Member

The Linux kernel is written in C.

If you would like a free consultation with me about C, please add my official LINE account, "ChishiroのC言語", as a friend.

My capacity is limited and I will close registration once a certain number of people have joined, so add it now.

If self-study is difficult for you, the following article can help you find a school that suits you:

Five Online Programming Schools for Learning C, Recommended by a Former University of Tokyo Faculty Member

This installment covers the block layer.

You will see how the block layer arbitrates reads and writes to devices such as HDDs and SSDs.

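The block layer

The Linux kernel distinguishes the following kinds of devices.

Character device: a device that is accessed sequentially as a stream of bytes, such as a serial port, mouse, or keyboard. Because access is a stream, typing the text "hello" on a keyboard delivers 'h', 'e', 'l', 'l', 'o' in that order. Character devices are comparatively simple.

Block device: a device that supports random access, meaning it can seek to a specific position, such as an HDD/SSD, CD/DVD, or floppy disk. Block devices are more complex than character devices, and their performance matters more.

Because block devices are both complex and performance-critical, the Linux kernel manages them through a dedicated subsystem, the block layer (see the figure above).

The block layer consists of the following components, each explained in turn below.

BIO layer

Request layer

I/O scheduler

A block device is organized in terms of sectors and blocks.

Sector: the smallest addressable unit of a block device. It is a physical property of the device and is also called a hard sector or device block. The sector size is typically 512 bytes (2 KB for CD-ROM).

Block: the unit of access used by the file system, also called a filesystem block or I/O block. The block size must be a power-of-two multiple of the sector size (a device constraint) and no larger than the page size (a kernel constraint). Common block sizes are 512 bytes, 1 KB, and 4 KB.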
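The relationship between blocks and sectors comes down to simple arithmetic. The following is a minimal user-space sketch, not kernel code: the `demo_` names are hypothetical, and it only assumes that sector addresses use a 512-byte unit and that a block is a whole number of sectors.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SECTOR_SIZE 512u /* the 512-byte unit used for sector addresses */

/* Number of 512-byte sectors in one filesystem block. */
static uint32_t demo_sectors_per_block(uint32_t block_size)
{
	/* A block must be a whole number of sectors. */
	assert(block_size >= DEMO_SECTOR_SIZE &&
	       block_size % DEMO_SECTOR_SIZE == 0);
	return block_size / DEMO_SECTOR_SIZE;
}

/* First sector covered by filesystem block `block_nr`. */
static uint64_t demo_block_to_sector(uint64_t block_nr, uint32_t block_size)
{
	return block_nr * demo_sectors_per_block(block_size);
}

/* Byte offset of filesystem block `block_nr` from the start of the device. */
static uint64_t demo_block_to_byte_offset(uint64_t block_nr, uint32_t block_size)
{
	return block_nr * (uint64_t)block_size;
}
```

For example, with 4 KB blocks, block 10 starts at sector 80, which is byte offset 40960 on the device.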


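Buffers and buffer heads

A buffer is the in-memory representation of a block.

A buffer head is the metadata that describes a buffer.

Buffer heads are represented by the buffer_head structure in linux/include/linux/buffer_head.h.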

/*
 * Historically, a buffer_head was used to map a single block
 * within a page, and of course as the unit of I/O through the
 * filesystem and block layers.  Nowadays the basic I/O unit
 * is the bio, and buffer_heads are used for extracting block
 * mappings (via a get_block_t call), for tracking state within
 * a page (via a page_mapping) and for wrapping bio submission
 * for backward compatibility reasons (e.g. submit_bh).
 */
struct buffer_head {
	unsigned long b_state;		/* buffer state bitmap (see above) */
	struct buffer_head *b_this_page;/* circular list of page's buffers */
	struct page *b_page;		/* the page this bh is mapped to */

	sector_t b_blocknr;		/* start block number */
	size_t b_size;			/* size of mapping */
	char *b_data;			/* pointer to data within the page */

	struct block_device *b_bdev;
	bh_end_io_t *b_end_io;		/* I/O completion */
 	void *b_private;		/* reserved for b_end_io */
	struct list_head b_assoc_buffers; /* associated with another mapping */
	struct address_space *b_assoc_map;	/* mapping this buffer is
						   associated with */
	atomic_t b_count;		/* users using this buffer_head */
	spinlock_t b_uptodate_lock;	/* Used by the first bh in a page, to
					 * serialise IO completion of other
					 * buffers in the page */
};


The b_state member of the buffer_head structure holds the buffer's state as a bitmap. Each bit position is named by the following enumeration.

enum bh_state_bits {
	BH_Uptodate,	/* Contains valid data */
	BH_Dirty,	/* Is dirty */
	BH_Lock,	/* Is locked */
	BH_Req,		/* Has been submitted for I/O */

	BH_Mapped,	/* Has a disk mapping */
	BH_New,		/* Disk mapping was newly created by get_block */
	BH_Async_Read,	/* Is under end_buffer_async_read I/O */
	BH_Async_Write,	/* Is under end_buffer_async_write I/O */
	BH_Delay,	/* Buffer is not yet allocated on disk */
	BH_Boundary,	/* Block is followed by a discontiguity */
	BH_Write_EIO,	/* I/O error on write */
	BH_Unwritten,	/* Buffer is allocated on disk but not written */
	BH_Quiet,	/* Buffer Error Prinks to be quiet */
	BH_Meta,	/* Buffer contains metadata */
	BH_Prio,	/* Buffer should be submitted with REQ_PRIO */
	BH_Defer_Completion, /* Defer AIO completion to workqueue */

	BH_PrivateStart,/* not a state bit, but the first bit available
			 * for private allocation by other entities
			 */
};
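Since b_state is a bitmap and each enumerator names a bit position, a state is checked or changed by shifting 1 by the enumerator's value. The following is a minimal user-space sketch of that idea. All `demo_` and `DEMO_` names are hypothetical stand-ins, and the plain bit operations here are not atomic. The kernel itself uses atomic bit operations and generated helpers such as buffer_dirty() and set_buffer_dirty().

```c
#include <assert.h>
#include <stdbool.h>

/* Shortened, hypothetical copy of the first entries of enum bh_state_bits. */
enum demo_bh_state_bits {
	DEMO_BH_Uptodate,	/* contains valid data */
	DEMO_BH_Dirty,		/* is dirty */
	DEMO_BH_Lock,		/* is locked */
	DEMO_BH_Req,		/* has been submitted for I/O */
};

/* Hypothetical stand-in for struct buffer_head: only the state word. */
struct demo_bh {
	unsigned long b_state;
};

static void demo_set_bit(struct demo_bh *bh, enum demo_bh_state_bits bit)
{
	bh->b_state |= 1UL << bit;
}

static void demo_clear_bit(struct demo_bh *bh, enum demo_bh_state_bits bit)
{
	bh->b_state &= ~(1UL << bit);
}

static bool demo_test_bit(const struct demo_bh *bh, enum demo_bh_state_bits bit)
{
	return (bh->b_state >> bit) & 1UL;
}

/* Model of a completed write-back: a dirty buffer becomes clean. */
static void demo_writeback_done(struct demo_bh *bh)
{
	assert(demo_test_bit(bh, DEMO_BH_Dirty));
	demo_clear_bit(bh, DEMO_BH_Dirty);
}
```

Setting DEMO_BH_Uptodate and DEMO_BH_Dirty on a zeroed state word yields the value 3. After demo_writeback_done() only the up-to-date bit remains set.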


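BIO layer: the bio structure

The bio structure is the basic container for a block I/O operation that is in flight (see the figure above).

Each buffer is divided into segments, and the segments do not need to be contiguous in memory.

Here, a segment is a chunk of the buffer that is contiguous in memory.

The bio structure is defined in linux/include/linux/blk_types.h.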

/*
 * main unit of I/O for the block layer and lower layers (ie drivers and
 * stacking drivers)
 */
struct bio {
	struct bio		*bi_next;	/* request queue link */
	struct block_device	*bi_bdev;
	unsigned int		bi_opf;		/* bottom bits req flags,
						 * top bits REQ_OP. Use
						 * accessors.
						 */
	unsigned short		bi_flags;	/* BIO_* below */
	unsigned short		bi_ioprio;
	unsigned short		bi_write_hint;
	blk_status_t		bi_status;
	atomic_t		__bi_remaining;

	struct bvec_iter	bi_iter;

	bio_end_io_t		*bi_end_io;

	void			*bi_private;
#ifdef CONFIG_BLK_CGROUP
	/*
	 * Represents the association of the css and request_queue for the bio.
	 * If a bio goes direct to device, it will not have a blkg as it will
	 * not have a request_queue associated with it.  The reference is put
	 * on release of the bio.
	 */
	struct blkcg_gq		*bi_blkg;
	struct bio_issue	bi_issue;
#ifdef CONFIG_BLK_CGROUP_IOCOST
	u64			bi_iocost_cost;
#endif
#endif

#ifdef CONFIG_BLK_INLINE_ENCRYPTION
	struct bio_crypt_ctx	*bi_crypt_context;
#endif

	union {
#if defined(CONFIG_BLK_DEV_INTEGRITY)
		struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif
	};

	unsigned short		bi_vcnt;	/* how many bio_vec's */

	/*
	 * Everything starting with bi_max_vecs will be preserved by bio_reset()
	 */

	unsigned short		bi_max_vecs;	/* max bvl_vecs we can hold */

	atomic_t		__bi_cnt;	/* pin count */

	struct bio_vec		*bi_io_vec;	/* the actual vec list */

	struct bio_set		*bi_pool;

	/*
	 * We can inline a number of vecs at the end of the bio, to avoid
	 * double allocations for a small number of bio_vecs. This member
	 * MUST obviously be kept at the very end of the bio.
	 */
	struct bio_vec		bi_inline_vecs[];
};


The bio_vec and bvec_iter structures are defined in linux/include/linux/bvec.h.

/**
 * struct bio_vec - a contiguous range of physical memory addresses
 * @bv_page:   First page associated with the address range.
 * @bv_len:    Number of bytes in the address range.
 * @bv_offset: Start of the address range relative to the start of @bv_page.
 *
 * The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len:
 *
 *   nth_page(@bv_page, n) == @bv_page + n
 *
 * This holds because page_is_mergeable() checks the above property.
 */
struct bio_vec {
	struct page	*bv_page;
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

struct bvec_iter {
	sector_t		bi_sector;	/* device address in 512 byte sectors */

	unsigned int		bi_size;	/* residual I/O count */

	unsigned int		bi_idx;		/* current index into bvl_vec */

	unsigned int            bi_bvec_done;	/* number of bytes completed in current bvec */
};


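The figure above shows how the bio, bio_vec, and bvec_iter structures relate. In short, the bi_io_vec member of a bio points to an array of bi_vcnt bio_vec entries, each describing one segment as a (page, offset, length) triple, and the bi_iter member records how far the I/O has progressed through that array.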
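To make the role of the iterator concrete, here is a minimal user-space sketch that advances an iterator across several segments. The `demo_` structures are simplified, hypothetical stand-ins that keep only the fields discussed here (a `void *` replaces `struct page *`). The logic is loosely modeled on what the kernel's iterator-advance helpers do and is not a copy of them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_bio_vec {
	void *bv_page;			/* stands in for struct page * */
	unsigned int bv_len;		/* bytes in this segment */
	unsigned int bv_offset;		/* offset within the page */
};

struct demo_bvec_iter {
	uint64_t bi_sector;		/* device address in 512-byte sectors */
	unsigned int bi_size;		/* bytes still to transfer */
	unsigned int bi_idx;		/* current index into the vector array */
	unsigned int bi_bvec_done;	/* bytes completed in the current vector */
};

struct demo_bio {
	struct demo_bvec_iter bi_iter;
	unsigned short bi_vcnt;		/* number of entries in bi_io_vec */
	struct demo_bio_vec *bi_io_vec;
};

/* Advance the iterator by `bytes`, crossing segment boundaries as needed. */
static void demo_bio_advance(struct demo_bio *bio, unsigned int bytes)
{
	struct demo_bvec_iter *it = &bio->bi_iter;

	assert(bytes <= it->bi_size);
	it->bi_sector += bytes / 512;
	it->bi_size -= bytes;

	while (bytes > 0) {
		const struct demo_bio_vec *bv;
		unsigned int left, step;

		assert(it->bi_idx < bio->bi_vcnt);
		bv = &bio->bi_io_vec[it->bi_idx];
		left = bv->bv_len - it->bi_bvec_done;
		step = bytes < left ? bytes : left;

		it->bi_bvec_done += step;
		bytes -= step;
		if (it->bi_bvec_done == bv->bv_len) {
			/* This segment is finished: move to the next one. */
			it->bi_idx++;
			it->bi_bvec_done = 0;
		}
	}
}
```

With three segments of 4096, 2048, and 1024 bytes, advancing by 5120 bytes leaves the iterator in the second segment (index 1) with 1024 bytes of it already done.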

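Request layer: the request_queue structure

A block device keeps a request queue that holds its pending I/O requests.

Requests are added to the queue by higher-level code such as the file system.

The block device driver takes requests off the queue and submits them to the device.

The request queue is represented by the request_queue structure defined in linux/include/linux/blkdev.h.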

struct request_queue {
	struct request		*last_merge;
	struct elevator_queue	*elevator;

	struct percpu_ref	q_usage_counter;

	struct blk_queue_stats	*stats;
	struct rq_qos		*rq_qos;

	const struct blk_mq_ops	*mq_ops;

	/* sw queues */
	struct blk_mq_ctx __percpu	*queue_ctx;

	unsigned int		queue_depth;

	/* hw dispatch queues */
	struct blk_mq_hw_ctx	**queue_hw_ctx;
	unsigned int		nr_hw_queues;

	/*
	 * The queue owner gets to use this for whatever they like.
	 * ll_rw_blk doesn't touch it.
	 */
	void			*queuedata;

	/*
	 * various queue flags, see QUEUE_* below
	 */
	unsigned long		queue_flags;
	/*
	 * Number of contexts that have called blk_set_pm_only(). If this
	 * counter is above zero then only RQF_PM requests are processed.
	 */
	atomic_t		pm_only;

	/*
	 * ida allocated id for this queue.  Used to index queues from
	 * ioctx.
	 */
	int			id;

	spinlock_t		queue_lock;

	struct gendisk		*disk;

	/*
	 * queue kobject
	 */
	struct kobject kobj;

	/*
	 * mq queue kobject
	 */
	struct kobject *mq_kobj;

#ifdef  CONFIG_BLK_DEV_INTEGRITY
	struct blk_integrity integrity;
#endif	/* CONFIG_BLK_DEV_INTEGRITY */

#ifdef CONFIG_PM
	struct device		*dev;
	enum rpm_status		rpm_status;
#endif

	/*
	 * queue settings
	 */
	unsigned long		nr_requests;	/* Max # of requests */

	unsigned int		dma_pad_mask;
	unsigned int		dma_alignment;

#ifdef CONFIG_BLK_INLINE_ENCRYPTION
	/* Inline crypto capabilities */
	struct blk_keyslot_manager *ksm;
#endif

	unsigned int		rq_timeout;
	int			poll_nsec;

	struct blk_stat_callback	*poll_cb;
	struct blk_rq_stat	poll_stat[BLK_MQ_POLL_STATS_BKTS];

	struct timer_list	timeout;
	struct work_struct	timeout_work;

	atomic_t		nr_active_requests_shared_sbitmap;

	struct sbitmap_queue	sched_bitmap_tags;
	struct sbitmap_queue	sched_breserved_tags;

	struct list_head	icq_list;
#ifdef CONFIG_BLK_CGROUP
	DECLARE_BITMAP		(blkcg_pols, BLKCG_MAX_POLS);
	struct blkcg_gq		*root_blkg;
	struct list_head	blkg_list;
#endif

	struct queue_limits	limits;

	unsigned int		required_elevator_features;

#ifdef CONFIG_BLK_DEV_ZONED
	/*
	 * Zoned block device information for request dispatch control.
	 * nr_zones is the total number of zones of the device. This is always
	 * 0 for regular block devices. conv_zones_bitmap is a bitmap of nr_zones
	 * bits which indicates if a zone is conventional (bit set) or
	 * sequential (bit clear). seq_zones_wlock is a bitmap of nr_zones
	 * bits which indicates if a zone is write locked, that is, if a write
	 * request targeting the zone was dispatched. All three fields are
	 * initialized by the low level device driver (e.g. scsi/sd.c).
	 * Stacking drivers (device mappers) may or may not initialize
	 * these fields.
	 *
	 * Reads of this information must be protected with blk_queue_enter() /
	 * blk_queue_exit(). Modifying this information is only allowed while
	 * no requests are being processed. See also blk_mq_freeze_queue() and
	 * blk_mq_unfreeze_queue().
	 */
	unsigned int		nr_zones;
	unsigned long		*conv_zones_bitmap;
	unsigned long		*seq_zones_wlock;
	unsigned int		max_open_zones;
	unsigned int		max_active_zones;
#endif /* CONFIG_BLK_DEV_ZONED */

	int			node;
	struct mutex		debugfs_mutex;
#ifdef CONFIG_BLK_DEV_IO_TRACE
	struct blk_trace __rcu	*blk_trace;
#endif
	/*
	 * for flush operations
	 */
	struct blk_flush_queue	*fq;

	struct list_head	requeue_list;
	spinlock_t		requeue_lock;
	struct delayed_work	requeue_work;

	struct mutex		sysfs_lock;
	struct mutex		sysfs_dir_lock;

	/*
	 * for reusing dead hctx instance in case of updating
	 * nr_hw_queues
	 */
	struct list_head	unused_hctx_list;
	spinlock_t		unused_hctx_lock;

	int			mq_freeze_depth;

#ifdef CONFIG_BLK_DEV_THROTTLING
	/* Throttle data */
	struct throtl_data *td;
#endif
	struct rcu_head		rcu_head;
	wait_queue_head_t	mq_freeze_wq;
	/*
	 * Protect concurrent access to q_usage_counter by
	 * percpu_ref_kill() and percpu_ref_reinit().
	 */
	struct mutex		mq_freeze_lock;

	struct blk_mq_tag_set	*tag_set;
	struct list_head	tag_set_list;
	struct bio_set		bio_split;

	struct dentry		*debugfs_dir;

#ifdef CONFIG_BLK_DEBUG_FS
	struct dentry		*sched_debugfs_dir;
	struct dentry		*rqos_debugfs_dir;
#endif

	bool			mq_sysfs_init_done;

	size_t			cmd_size;

#define BLK_MAX_WRITE_HINTS	5
	u64			write_hints[BLK_MAX_WRITE_HINTS];
};


A single request is represented by the request structure.

Because a request can operate on several consecutive disk blocks, it is made up of one or more bio objects.

/*
 * Try to put the fields that are referenced together in the same cacheline.
 *
 * If you modify this structure, make sure to update blk_rq_init() and
 * especially blk_mq_rq_ctx_init() to take care of the added fields.
 */
struct request {
	struct request_queue *q;
	struct blk_mq_ctx *mq_ctx;
	struct blk_mq_hw_ctx *mq_hctx;

	unsigned int cmd_flags;		/* op and common flags */
	req_flags_t rq_flags;

	int tag;
	int internal_tag;

	/* the following two fields are internal, NEVER access directly */
	unsigned int __data_len;	/* total data len */
	sector_t __sector;		/* sector cursor */

	struct bio *bio;
	struct bio *biotail;

	struct list_head queuelist;

	/*
	 * The hash is used inside the scheduler, and killed once the
	 * request reaches the dispatch list. The ipi_list is only used
	 * to queue the request for softirq completion, which is long
	 * after the request has been unhashed (and even removed from
	 * the dispatch list).
	 */
	union {
		struct hlist_node hash;	/* merge hash */
		struct llist_node ipi_list;
	};

	/*
	 * The rb_node is only used inside the io scheduler, requests
	 * are pruned when moved to the dispatch queue. So let the
	 * completion_data share space with the rb_node.
	 */
	union {
		struct rb_node rb_node;	/* sort/lookup */
		struct bio_vec special_vec;
		void *completion_data;
		int error_count; /* for legacy drivers, don't use */
	};

	/*
	 * Three pointers are available for the IO schedulers, if they need
	 * more they have to dynamically allocate it.  Flush requests are
	 * never put on the IO scheduler. So let the flush fields share
	 * space with the elevator data.
	 */
	union {
		struct {
			struct io_cq		*icq;
			void			*priv[2];
		} elv;

		struct {
			unsigned int		seq;
			struct list_head	list;
			rq_end_io_fn		*saved_end_io;
		} flush;
	};

	struct gendisk *rq_disk;
	struct block_device *part;
#ifdef CONFIG_BLK_RQ_ALLOC_TIME
	/* Time that the first bio started allocating this request. */
	u64 alloc_time_ns;
#endif
	/* Time that this request was allocated for this IO. */
	u64 start_time_ns;
	/* Time that I/O was submitted to the device. */
	u64 io_start_time_ns;

#ifdef CONFIG_BLK_WBT
	unsigned short wbt_flags;
#endif
	/*
	 * rq sectors used for blk stats. It has the same value
	 * with blk_rq_sectors(rq), except that it never be zeroed
	 * by completion.
	 */
	unsigned short stats_sectors;

	/*
	 * Number of scatter-gather DMA addr+len pairs after
	 * physical address coalescing is performed.
	 */
	unsigned short nr_phys_segments;

#if defined(CONFIG_BLK_DEV_INTEGRITY)
	unsigned short nr_integrity_segments;
#endif

#ifdef CONFIG_BLK_INLINE_ENCRYPTION
	struct bio_crypt_ctx *crypt_ctx;
	struct blk_ksm_keyslot *crypt_keyslot;
#endif

	unsigned short write_hint;
	unsigned short ioprio;

	enum mq_rq_state state;
	refcount_t ref;

	unsigned int timeout;
	unsigned long deadline;

	union {
		struct __call_single_data csd;
		u64 fifo_time;
	};

	/*
	 * completion callback.
	 */
	rq_end_io_fn *end_io;
	void *end_io_data;
};
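The bio and biotail members above are the head and tail of a singly linked list of bios chained through bi_next. The following is a minimal user-space sketch of how a request can grow by appending a bio that continues exactly where the request ends (a back merge). The `demo_` types and functions are hypothetical simplifications and do not reproduce the kernel's merge code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_rq_bio {
	struct demo_rq_bio *bi_next;	/* next bio in the same request */
	uint64_t sector;		/* start sector (512-byte units) */
	unsigned int size;		/* length in bytes */
};

struct demo_request {
	uint64_t sector;		/* start sector of the whole request */
	unsigned int data_len;		/* total length in bytes */
	struct demo_rq_bio *bio;	/* first bio */
	struct demo_rq_bio *biotail;	/* last bio */
};

/* Start a request from a single bio. */
static void demo_rq_init(struct demo_request *rq, struct demo_rq_bio *bio)
{
	assert(bio->size % 512 == 0);
	bio->bi_next = NULL;
	rq->sector = bio->sector;
	rq->data_len = bio->size;
	rq->bio = bio;
	rq->biotail = bio;
}

/* Append `bio` if it starts at the sector right after the request ends. */
static bool demo_rq_back_merge(struct demo_request *rq, struct demo_rq_bio *bio)
{
	uint64_t end = rq->sector + rq->data_len / 512;

	if (bio->sector != end)
		return false;

	bio->bi_next = NULL;
	rq->biotail->bi_next = bio;
	rq->biotail = bio;
	rq->data_len += bio->size;
	return true;
}
```

A 4 KB bio at sector 8 followed by a 4 KB bio at sector 16 merges into one 8 KB request, while a bio at sector 40 is rejected because it is not contiguous.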


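I/O scheduler

Sending each request straight to the disk the moment it arrives is far from optimal.

Doing so increases random access, so the kernel tries to reduce disk seeks as much as it can.

To that end, the kernel merges and reorders the I/O requests in the request queue (merging and sorting).

The rules for merging and sorting are defined by the I/O scheduler.

Linux implements several I/O scheduler models.

Just as the process scheduler virtualizes the CPU, the I/O scheduler virtualizes the disk.

Single-queue I/O schedulers

This section introduces the single-queue I/O schedulers.

Single-queue I/O schedulers are no longer supported (the legacy single-queue block layer was removed in Linux 5.0), but they are useful for understanding the history of I/O scheduling.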

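Linus Elevator

The Linus Elevator was the default I/O scheduler up to version 2.4 of the Linux kernel.

It defines where in the queue the next request should be added, using the following operations.

Front merge and back merge

Sorted insertion

The goals of the Linus Elevator are to minimize disk seeks and maximize global throughput.

Its algorithm works as follows.

If the request queue already holds a request for an adjacent on-disk sector, the existing request and the new request are merged into a single request.

If a request in the queue is sufficiently old, the new request is inserted at the tail of the queue so that the older requests are not starved.

If there is a suitable position in the queue in sector order, the new request is inserted there. This keeps the queue sorted by physical location on the disk.

Finally, if no such position exists, the request is inserted at the tail of the queue.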
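The four steps above can be sketched in user space as follows. This is only an illustration under stated assumptions: the queue is a small fixed-size array, time is an abstract tick counter, and the starvation threshold `DEMO_MAX_AGE` as well as every `demo_` name are hypothetical. It is not the historical kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_QUEUE_MAX 16
#define DEMO_MAX_AGE   100	/* hypothetical starvation threshold, in ticks */

struct demo_req {
	uint64_t start;		/* first sector */
	uint32_t nr_sectors;	/* length in sectors */
	uint64_t queued_at;	/* tick at which the request was queued */
};

struct demo_queue {
	struct demo_req reqs[DEMO_QUEUE_MAX];
	size_t count;
};

enum demo_result { DEMO_MERGED, DEMO_SORTED_INSERT, DEMO_TAIL_INSERT };

static void demo_insert_at(struct demo_queue *q, size_t pos, struct demo_req rq)
{
	assert(q->count < DEMO_QUEUE_MAX && pos <= q->count);
	memmove(&q->reqs[pos + 1], &q->reqs[pos],
		(q->count - pos) * sizeof(q->reqs[0]));
	q->reqs[pos] = rq;
	q->count++;
}

static enum demo_result demo_elevator_add(struct demo_queue *q,
					  struct demo_req rq, uint64_t now)
{
	bool starving = false;
	size_t i;

	rq.queued_at = now;

	/* Step 1: merge with a request for adjacent sectors. */
	for (i = 0; i < q->count; i++) {
		struct demo_req *cur = &q->reqs[i];

		if (cur->start + cur->nr_sectors == rq.start) {	/* back merge */
			cur->nr_sectors += rq.nr_sectors;
			return DEMO_MERGED;
		}
		if (rq.start + rq.nr_sectors == cur->start) {	/* front merge */
			cur->start = rq.start;
			cur->nr_sectors += rq.nr_sectors;
			return DEMO_MERGED;
		}
		if (now - cur->queued_at > DEMO_MAX_AGE)
			starving = true;
	}

	/* Step 2: an old request is waiting, so do not jump ahead of it. */
	if (starving) {
		demo_insert_at(q, q->count, rq);
		return DEMO_TAIL_INSERT;
	}

	/* Step 3: insert between two neighbours in sector order. */
	for (i = 0; i + 1 < q->count; i++) {
		if (q->reqs[i].start < rq.start &&
		    rq.start < q->reqs[i + 1].start) {
			demo_insert_at(q, i + 1, rq);
			return DEMO_SORTED_INSERT;
		}
	}

	/* Step 4: no suitable position, so append at the tail. */
	demo_insert_at(q, q->count, rq);
	return DEMO_TAIL_INSERT;
}
```

Note that step 2 is exactly what can leave the queue partially unsorted: once an old request exists, every new request that cannot be merged goes to the tail regardless of its sector.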

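The weakness of the Linus Elevator is that its very goals, minimizing disk seeks and maximizing global throughput, can cause starvation.

In particular, writes can starve reads, so the following points need to be considered.

I/O operations are buffered in the buffer/page cache.

Write operations are buffered in the page cache (asynchronous processing).

Read operations that miss the page cache should be serviced immediately (synchronous processing).

Read latency matters a great deal to the system, so read starvation must be kept to a minimum.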

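Deadline I/O Scheduler

The Deadline I/O Scheduler tries to ensure fairness while still maximizing global throughput.

Each request is assigned an expiration time called its deadline. When a request reaches its deadline without having been served, the scheduler serves it ahead of the sector-sorted order.

For example, a read may get a deadline of "current time + 0.5 s" and a write "current time + 5 s".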
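The deadline idea can be sketched in user space as follows. This is a simplified illustration, not the kernel's implementation: it keeps all requests in one array, uses the 0.5 s and 5 s values from the example above, and, when nothing has expired, simply picks the lowest sector, whereas a real scheduler continues from the current disk position. All `demo_` and `DEMO_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_READ_EXPIRE_MS  500	/* reads: now + 0.5 s */
#define DEMO_WRITE_EXPIRE_MS 5000	/* writes: now + 5 s */

enum demo_dir { DEMO_READ, DEMO_WRITE };

struct demo_dl_req {
	uint64_t sector;
	enum demo_dir dir;
	uint64_t deadline_ms;
};

/* Deadline assigned to a request that arrives at `now_ms`. */
static uint64_t demo_deadline(enum demo_dir dir, uint64_t now_ms)
{
	return now_ms + (dir == DEMO_READ ? DEMO_READ_EXPIRE_MS
					  : DEMO_WRITE_EXPIRE_MS);
}

/*
 * Return the index of the request to dispatch next.
 * If the request with the earliest deadline has expired, serve it first;
 * otherwise fall back to sector order (here: the lowest sector).
 */
static size_t demo_pick_next(const struct demo_dl_req *reqs, size_t n,
			     uint64_t now_ms)
{
	size_t earliest = 0, lowest = 0, i;

	assert(n > 0);
	for (i = 1; i < n; i++) {
		if (reqs[i].deadline_ms < reqs[earliest].deadline_ms)
			earliest = i;
		if (reqs[i].sector < reqs[lowest].sector)
			lowest = i;
	}
	return reqs[earliest].deadline_ms <= now_ms ? earliest : lowest;
}
```

Because reads receive a much shorter deadline than writes, an old read expires first and is served even if a write to a lower sector is waiting, which bounds read latency.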


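Anticipatory I/O Scheduler

The Anticipatory I/O Scheduler improves on the throughput of the Deadline I/O Scheduler.

Its distinguishing feature is the anticipation heuristic.

After serving a read, instead of seeking back immediately, it waits a few milliseconds in the expectation that the application will issue another I/O request nearby.

Completely Fair Queuing (CFQ) I/O Scheduler

The Completely Fair Queuing (CFQ) I/O Scheduler keeps a separate request queue for each process.

It places the synchronous requests submitted by processes into per-process queues and then allocates each queue a time slice for access to the disk.

The length of the time slice, and the number of requests a queue is allowed to process, depend on the I/O priority of the process.

Noop I/O Scheduler

The Noop I/O Scheduler does essentially nothing other than merge sequential requests.

It is used for truly random-access devices such as flash memory cards.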

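Multi-queue I/O schedulers

This section introduces the multi-queue I/O schedulers.

[Default] Multiqueue Deadline I/O Scheduler

The Multiqueue Deadline I/O Scheduler (mq-deadline) is the multi-queue version of the Deadline I/O Scheduler.

It is the default on many Linux systems, including the one used in this article, although the default can differ by device type and distribution.

Budget Fair Queuing (BFQ) I/O Scheduler

The Budget Fair Queuing (BFQ) I/O Scheduler assigns an I/O budget to each process and combines this with many heuristics, which greatly improves I/O responsiveness, especially on slow devices.

BFQ benefits users of HDDs, and it also helps on slow SSDs.

Examples include devices such as smartphones and tablets.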


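Kyber I/O Scheduler

The Kyber I/O Scheduler splits I/O requests into two main queues: one for synchronous requests and one for asynchronous requests, that is, one for reads and one for writes.

A process that issues a read usually cannot make progress until the read completes and the data is available, so such requests are considered synchronous.

A write, on the other hand, can complete later, so the process that starts it usually does not care when the write actually happens.

For that reason reads are generally prioritized over writes, but never to the point where writes cannot proceed at all.

The key idea of Kyber is to strictly limit the number of operations (both reads and writes) sent to the dispatch queues, the queues that feed operations directly to the device, so that those queues stay relatively short.

When the dispatch queue is short, the time a request spends waiting in it (the per-request latency) is also relatively short.

As a result, higher-priority requests can be completed quickly.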
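The idea of capping how many operations of each kind may sit in the dispatch queue can be sketched with simple counters. This is a hypothetical user-space model: the `demo_` names and the depth values chosen by the caller are assumptions, and the way real Kyber tunes its limits from measured latencies is left out.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_domain { DEMO_SYNC_READ, DEMO_ASYNC_WRITE, DEMO_NR_DOMAINS };

struct demo_kyber {
	unsigned int depth[DEMO_NR_DOMAINS];	 /* max operations in flight */
	unsigned int in_flight[DEMO_NR_DOMAINS]; /* currently dispatched */
};

/*
 * Try to move one operation of domain `d` to the dispatch queue.
 * Returns false when the domain has used up its limit, in which case
 * the operation stays in the scheduler's own queue.
 */
static bool demo_try_dispatch(struct demo_kyber *k, enum demo_domain d)
{
	assert(d < DEMO_NR_DOMAINS);
	if (k->in_flight[d] >= k->depth[d])
		return false;
	k->in_flight[d]++;
	return true;
}

/* The device finished one operation of domain `d`: free its slot. */
static void demo_complete(struct demo_kyber *k, enum demo_domain d)
{
	assert(d < DEMO_NR_DOMAINS && k->in_flight[d] > 0);
	k->in_flight[d]--;
}
```

Giving reads a larger depth than writes (for example 4 versus 1) keeps the dispatch queue short while still letting writes make progress one at a time.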

None I/O Scheduler

The None I/O Scheduler is the multi-queue counterpart of the Noop I/O Scheduler.

Like Noop, it is used for truly random-access devices such as flash memory cards.

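How to configure the I/O scheduler

This section shows how to configure the I/O scheduler.

The I/O scheduler can be selected per device.

For the primary disk (sda), the I/O schedulers currently available and the one currently in use are listed as follows.

mq-deadline: Multiqueue Deadline I/O Scheduler

none: None I/O Scheduler

Note: the square brackets in [mq-deadline] mean that the Multiqueue Deadline I/O Scheduler is the one currently in use.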

$ cat /sys/block/sda/queue/scheduler
[mq-deadline] none


To switch to none (the None I/O Scheduler), enter the following command.

$ sudo sh -c '(echo none > /sys/block/sda/queue/scheduler)'


Checking the scheduler in use again shows that it is now [none].

$ cat /sys/block/sda/queue/scheduler                       
[none] mq-deadline
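A program can find the active scheduler by extracting the name inside the square brackets. The following is a minimal sketch of that parsing step. The function name `demo_active_scheduler` is hypothetical, and it only assumes the line format shown in the output above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) scheduler name from a line such as
 * "[mq-deadline] none" into `out`.
 * Returns false if there is no well-formed bracketed name or if the
 * name does not fit into `out` (including the terminating NUL).
 */
static bool demo_active_scheduler(const char *line, char *out, size_t out_size)
{
	const char *open, *close;
	size_t len;

	assert(line != NULL && out != NULL);

	open = strchr(line, '[');
	if (open == NULL)
		return false;
	close = strchr(open + 1, ']');
	if (close == NULL)
		return false;

	len = (size_t)(close - open - 1);
	if (len == 0 || len >= out_size)
		return false;

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return true;
}
```

To use it on a live system, read one line from /sys/block/sda/queue/scheduler (for example with fopen() and fgets()) and pass that line to the function.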


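Older kernels also let you select the I/O scheduler at boot time with the kernel parameter "elevator=<value>" (without a leading hyphen). On recent multi-queue-only kernels this parameter no longer has any effect, so use the sysfs interface shown above instead.

On my Linux kernel, the available values are mq-deadline and none.

Note that the Budget Fair Queuing (BFQ) I/O Scheduler and the Kyber I/O Scheduler are disabled on my kernel.

They become available if you enable them in the kernel configuration (CONFIG_IOSCHED_BFQ and CONFIG_MQ_IOSCHED_KYBER) and rebuild.

If you want to know how to build the Linux kernel, see the following article:

[C] How to Build the Linux Kernel and Implement a New System Call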

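Summary

This installment introduced the block layer.

We saw that the block layer consists of the BIO layer, the request layer, and the I/O scheduler.

If you want to dig deeper into the block layer, read the following.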

A block layer introduction part 1: the bio layer

Block layer introduction part 2: the request layer

Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems

The multiqueue block layer

Two new block I/O schedulers for 4.12

The future of DAX

I/O Schedulers

Block

Multi-Queue Block IO Queueing Mechanism (blk-mq)

Linux：昨今のI/Oスケジューラ事情 2020

Ubuntu Linux 20.04 LTSについての，IOスケジューラーに関する情報を見てみる

10分で分かるLinuxブロックレイヤ from Takashi Hoshino


The next installment is here:

[Part 16] Learn the Linux Kernel from a Former University of Tokyo Faculty Member: Socket Communication (TCP/UDP/IP)
