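ARM Linux System Calls
I. Preface
Everyone knows that system calls are relatively expensive, but why? Let's find out.
II. Arm's explanation
First, which kinds of calls are related to a system call?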
1. Function calls
When calling a function or sub-routine, we need a way to get back to the caller when finished. Adding an L to the B or BR instructions turns them into a branch with link. This means that a return address is written into LR (X30) as part of the branch.
The names LR and X30 are interchangeable. An assembler, such as GNU GAS or armclang, will accept both.
There is a specialist function return instruction, RET. This performs an indirect branch to the address in the link register. Together, this means that we get:
The figure shows the function foo() written in GAS syntax assembler. The keyword .global exports the symbol and .type indicates that the exported symbol is a function.
Why do we need a special function return instruction? Functionally, BR LR would do the same job as RET. Using RET tells the processor that this is a function return. Most modern processors, and all Cortex-A processors, support branch prediction. Knowing that this is a function return allows processors to more accurately predict the branch.
Branch predictors guess the direction the program flow will take across branches. The guess is used to decide what to load into a pipeline with instructions waiting to be processed. If the branch predictor guesses correctly, the pipeline has the correct instructions and the processor does not have to wait for instructions to be loaded from memory.
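From this we learn that a function call is implemented with the BL instruction and the LR register: BL branches to the function and saves the return address in LR, and RET branches back to it.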
2. Procedure Call Standard
The Arm architecture places few restrictions on how general purpose registers are used. To recap, integer registers and floating-point registers are general purpose registers. However, if you want your code to interact with code that is written by someone else, or with code that is produced by a compiler, then you need to agree rules for register usage. For the Arm architecture, these rules are called the Procedure Call Standard, or PCS.
The PCS specifies:
• Which registers are used to pass arguments into the function.
• Which registers are used to return a value to the function doing the calling, known as the caller.
• Which registers the function being called, which is known as the callee, can corrupt.
• Which registers the callee cannot corrupt.
Consider a function foo(), being called from main():
The PCS says that the first argument is passed in X0, the second argument in X1, and so on up to X7. Any further arguments are passed on the stack. Our function, foo(), takes two arguments: b and c. Therefore, b will be in W0 and c will be in W1.
Why W and not X? Because the arguments are a 32-bit type, and therefore we only need a W register.
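The figure with foo() and main() did not survive in this copy. A minimal C sketch of such a call could look like the following. Only the names foo, b and c come from the text; the function body, the argument values and the helper call_foo() are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical body: the text only says that foo() takes two 32-bit
 * arguments, b and c. Under the AArch64 PCS, b arrives in W0 and c in W1,
 * and the 32-bit result goes back to the caller in W0. */
uint32_t foo(uint32_t b, uint32_t c)
{
    return b + c;
}

/* Hypothetical caller standing in for main(). When the call is not inlined,
 * the compiler puts 2 in W0 and 3 in W1 and emits BL foo, which writes the
 * return address into LR (X30). */
uint32_t call_foo(void)
{
    return foo(2, 3);
}
```

Compiling this with an AArch64 compiler and reading the generated assembly is a quick way to check the register assignment described above.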
In C++, X0 is used to pass the implicit this pointer that points to the called function.
Next, the PCS defines which registers can be corrupted, and which registers cannot be corrupted. If a register can be corrupted, then the called function can overwrite without needing to restore, as this table of PCS register rules shows:
For example, the function foo() can use registers X0 to X15 without needing to preserve their values. However, if foo() wants to use X19 to X28 it must save them to stack first, and then restore from the stack before returning.
Some registers have special significance in the PCS:
• XR - This is an indirect result register. If foo() returned a struct, then the memory for struct would be allocated by the caller, main() in the earlier example. XR is a pointer to the memory allocated by the caller for returning the struct.
• IP0 and IP1 - These registers are intra-procedure-call corruptible registers. These registers can be corrupted between the time that the function is called and the time that it arrives at the first instruction in the function. These registers are used by linkers to insert veneers between the caller and callee. Veneers are small pieces of code. The most common example is for branch range extension. The branch instruction in A64 has a limited range. If the target is beyond that range, then the linker needs to generate a veneer to extend the range of the branch.
• FP - Frame pointer.
• LR - X30 is the link register (LR) for function calls.
We previously introduced the ALU flags, which are used for conditional branches and conditional selects. The PCS says that the ALU flags do not need to be preserved across a function call.
There is a similar set of rules for the floating-point registers:
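From this we learn that, for separately compiled C and assembly routines to call one another, calls between subroutines must follow agreed rules. Those rules are the PCS.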
3. System calls
Sometimes it is necessary for software to request a function from a more privileged entity. This might happen when, for example, an application requests that the OS opens a file.
In A64, there are special instructions for making such system calls. These instructions cause an exception, which allows controlled entry into a more privileged Exception level.
• SVC Supervisor call causes an exception targeting EL1. Used by an application to call the OS.
• HVC Hypervisor call causes an exception targeting EL2. Used by an OS to call the hypervisor, not available at EL0.
• SMC Secure monitor call causes an exception targeting EL3. Used by an OS or hypervisor to call the EL3 firmware, not available at EL0.
If an exception is executed from an Exception level higher than the target exception level, then the exception is taken to the current Exception level. This means that an SVC at EL2 would cause exception entry to EL2. Similarly, an HVC at EL3 causes exception entry to EL3. This is consistent with the rule that an exception can never cause the processor to lose privilege.
From this we learn that a system call is a request made to a more privileged Exception level.
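From C, the simplest way to see such a request is the generic syscall() wrapper in the C library. The sketch below is an illustration only: the helper name raw_getpid() is hypothetical, and the remark about X8 and SVC describes the usual arm64 Linux convention, which the text above does not spell out.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: ask the kernel for the process ID without going
 * through the getpid() wrapper. On arm64 Linux the C library is expected
 * to place the system call number in X8 and execute SVC #0, which takes
 * an exception from EL0 to EL1. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The value returned by raw_getpid() should match what getpid() reports for the same process.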
Excerpted from: https://developer.arm.com/documentation/102374/latest/
III. glibc analysis
arm64 has many system calls; see: https://arm64.syscall.sh/
glibc-open.c
/* Copyright (C) 2017-2018 Free Software Foundation, Inc.
This file is part of the GNU C Library.
Contributed by Chris Metcalf <cmetcalf@tilera.com>, 2011.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library. If not, see
<http://www.gnu.org/licenses/>. */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdarg.h>
#include <sysdep-cancel.h>
#include <not-cancel.h>
#ifndef __OFF_T_MATCHES_OFF64_T
/* Open FILE with access OFLAG. If O_CREAT or O_TMPFILE is in OFLAG,
a third argument is the file protection. */
int
__libc_open (const char *file, int oflag, ...)
{
  int mode = 0;
  if (__OPEN_NEEDS_MODE (oflag))
    {
      va_list arg;
      va_start (arg, oflag);
      mode = va_arg (arg, int);
      va_end (arg);
    }
  return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag, mode);
}
libc_hidden_def (__libc_open)
weak_alias (__libc_open, __open)
libc_hidden_weak (__open)
weak_alias (__libc_open, open)
# if !IS_IN (rtld)
int
__open_nocancel (const char *file, int oflag, ...)
{
  int mode = 0;
  if (__OPEN_NEEDS_MODE (oflag))
    {
      va_list arg;
      va_start (arg, oflag);
      mode = va_arg (arg, int);
      va_end (arg);
    }
  return INLINE_SYSCALL_CALL (openat, AT_FDCWD, file, oflag, mode);
}
# else
strong_alias (__libc_open, __open_nocancel)
# endif
libc_hidden_def (__open_nocancel)
#endif
SYSCALL_CANCEL
/* step1 */
#define SYSCALL_CANCEL(...) \
  ({ \
    long int sc_ret; \
    if (SINGLE_THREAD_P) \
      sc_ret = INLINE_SYSCALL_CALL (__VA_ARGS__); \
    else \
      { \
        int sc_cancel_oldtype = LIBC_CANCEL_ASYNC (); \
        sc_ret = INLINE_SYSCALL_CALL (__VA_ARGS__); \
        LIBC_CANCEL_RESET (sc_cancel_oldtype); \
      } \
    sc_ret; \
  })
/* step2: both branches of SYSCALL_CANCEL call INLINE_SYSCALL_CALL */
/* Issue a syscall defined by syscall number plus any other argument
required. Any error will be handled using arch defined macros and errno
will be set accordingly.
It is similar to INLINE_SYSCALL macro, but without the need to pass the
expected argument number as second parameter. */
/* step3 */
#define INLINE_SYSCALL_CALL(...) \
__INLINE_SYSCALL_DISP (__INLINE_SYSCALL, __VA_ARGS__)
/* step4: token-paste the name into __INLINE_SYSCALL0 through __INLINE_SYSCALL7 */
#define __INLINE_SYSCALL_DISP(b,...) \
__SYSCALL_CONCAT (b,__INLINE_SYSCALL_NARGS(__VA_ARGS__))(__VA_ARGS__)
/* step5 */
#define __SYSCALL_CONCAT(a,b) __SYSCALL_CONCAT_X (a, b)
/* step6 */
#define __SYSCALL_CONCAT_X(a,b) a##b
/* step7 */
#define __INTERNAL_SYSCALL0(name, err) \
INTERNAL_SYSCALL (name, err, 0)
#define __INTERNAL_SYSCALL1(name, err, a1) \
INTERNAL_SYSCALL (name, err, 1, a1)
#define __INTERNAL_SYSCALL2(name, err, a1, a2) \
INTERNAL_SYSCALL (name, err, 2, a1, a2)
#define __INTERNAL_SYSCALL3(name, err, a1, a2, a3) \
INTERNAL_SYSCALL (name, err, 3, a1, a2, a3)
#define __INTERNAL_SYSCALL4(name, err, a1, a2, a3, a4) \
INTERNAL_SYSCALL (name, err, 4, a1, a2, a3, a4)
#define __INTERNAL_SYSCALL5(name, err, a1, a2, a3, a4, a5) \
INTERNAL_SYSCALL (name, err, 5, a1, a2, a3, a4, a5)
#define __INTERNAL_SYSCALL6(name, err, a1, a2, a3, a4, a5, a6) \
INTERNAL_SYSCALL (name, err, 6, a1, a2, a3, a4, a5, a6)
#define __INTERNAL_SYSCALL7(name, err, a1, a2, a3, a4, a5, a6, a7) \
INTERNAL_SYSCALL (name, err, 7, a1, a2, a3, a4, a5, a6, a7)
/* step8 */
#define __INLINE_SYSCALL0(name) \
INLINE_SYSCALL (name, 0)
#define __INLINE_SYSCALL1(name, a1) \
INLINE_SYSCALL (name, 1, a1)
#define __INLINE_SYSCALL2(name, a1, a2) \
INLINE_SYSCALL (name, 2, a1, a2)
#define __INLINE_SYSCALL3(name, a1, a2, a3) \
INLINE_SYSCALL (name, 3, a1, a2, a3)
#define __INLINE_SYSCALL4(name, a1, a2, a3, a4) \
INLINE_SYSCALL (name, 4, a1, a2, a3, a4)
#define __INLINE_SYSCALL5(name, a1, a2, a3, a4, a5) \
INLINE_SYSCALL (name, 5, a1, a2, a3, a4, a5)
#define __INLINE_SYSCALL6(name, a1, a2, a3, a4, a5, a6) \
INLINE_SYSCALL (name, 6, a1, a2, a3, a4, a5, a6)
#define __INLINE_SYSCALL7(name, a1, a2, a3, a4, a5, a6, a7) \
INLINE_SYSCALL (name, 7, a1, a2, a3, a4, a5, a6, a7)
/* step9 */
#undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, err, nr, args...) \
INTERNAL_SYSCALL_RAW(SYS_ify(name), err, nr, args)
/* step10: both the Thumb and ARM instruction sets are handled here, in line with the Procedure Call Standard */
#if defined(__thumb__)
/* We can not expose the use of r7 to the compiler. GCC (as
of 4.5) uses r7 as the hard frame pointer for Thumb - although
for Thumb-2 it isn't obviously a better choice than r11.
And GCC does not support asms that conflict with the frame
pointer.
This would be easier if syscall numbers never exceeded 255,
but they do. For the moment the LOAD_ARGS_7 is sacrificed.
We can't use push/pop inside the asm because that breaks
unwinding (i.e. thread cancellation) for this frame. We can't
locally save and restore r7, because we do not know if this
function uses r7 or if it is our caller's r7; if it is our caller's,
then unwinding will fail higher up the stack. So we move the
syscall out of line and provide its own unwind information. */
# undef INTERNAL_SYSCALL_RAW
# define INTERNAL_SYSCALL_RAW(name, err, nr, args...) \
({ \
register int _a1 asm ("a1"); \
int _nametmp = name; \
LOAD_ARGS_##nr (args) \
register int _name asm ("ip") = _nametmp; \
asm volatile ("bl __libc_do_syscall" \
: "=r" (_a1) \
: "r" (_name) ASM_ARGS_##nr \
: "memory", "lr"); \
_a1; })
#else /* ARM */
# undef INTERNAL_SYSCALL_RAW
# define INTERNAL_SYSCALL_RAW(name, err, nr, args...) \
  ({ \
    register int _a1 asm ("r0"), _nr asm ("r7"); \
    LOAD_ARGS_##nr (args) \
    _nr = name; \
    /* finally, the swi instruction is executed */ \
    asm volatile ("swi 0x0 @ syscall " #name \
                  : "=r" (_a1) \
                  : "r" (_nr) ASM_ARGS_##nr \
                  : "memory"); \
    _a1; })
#endif
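In the end the C code is compiled to assembly, and what actually runs is the swi instruction.
swi is the software interrupt instruction that performs the system call. Newer versions of the Arm architecture use svc in place of swi; the two are aliases for the same instruction.
Under the OABI, the system call number is given by the immediate operand of swi (svc). Under the EABI, the system call number is passed in r7, and the system call arguments are passed in registers.
System calls have to be distinguished from ordinary function calls. For an ordinary function call, the first four arguments are passed in r0 to r3 and any further arguments are passed on the stack.
A system call is different: the swi (svc) instruction switches the processor mode from user to svc, and the sp of svc mode is not the same register as the sp of user mode, so the stack cannot be used directly to pass arguments and every argument has to travel in a register. The kernel header include/linux/syscalls.h defines the functions and macros related to system calls. Among them, SYSCALL_DEFINE_MAXARGS is the maximum number of arguments a system call may take. Its value is 6, so a system call on arm takes at most 6 arguments, passed in r0 to r5.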
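To make the register-only argument passing concrete, here is a minimal sketch of an open-like wrapper built on the generic syscall() function. The name demo_open() is hypothetical, and the mode check is a simplification: glibc's __OPEN_NEEDS_MODE also covers O_TMPFILE, which is left out here.

```c
#define _GNU_SOURCE
#include <fcntl.h>         /* AT_FDCWD, O_CREAT */
#include <stdarg.h>
#include <sys/syscall.h>   /* SYS_openat */
#include <unistd.h>        /* syscall(), close() */

/* Hypothetical stand-in for __libc_open, without cancellation handling. */
int demo_open(const char *file, int oflag, ...)
{
    int mode = 0;

    /* Simplified check (assumption): only O_CREAT is considered. */
    if (oflag & O_CREAT) {
        va_list arg;
        va_start(arg, oflag);
        mode = va_arg(arg, int);
        va_end(arg);
    }

    /* Four arguments, all handed to the kernel in registers; none of
     * them is passed on the user stack. */
    return (int)syscall(SYS_openat, AT_FDCWD, file, oflag, mode);
}
```

For example, demo_open(".", O_RDONLY) should return a valid descriptor for the current directory, which the caller then closes with close().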
I found a diagram online that explains the call relationships (from: https://www.cnblogs.com/yangxinrui/p/15983178.html).
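The macro chain in step3 to step6 can also be reproduced in isolation. The sketch below uses hypothetical DEMO_ names throughout and assumes an argument-counting macro of the usual "shifted list" form, since the listing above does not show the definition of __INLINE_SYSCALL_NARGS. It only reports which variant was selected; it does not perform a system call.

```c
#include <stdio.h>

/* Reports which per-count variant was picked (hypothetical helper). */
int demo_report(const char *name, int nargs)
{
    printf("%s -> variant %d\n", name, nargs);
    return nargs;
}

/* Count the arguments that follow the name: the argument list pushes the
 * descending numbers to the right, and the ninth slot is the count. */
#define DEMO_NARGS_X(a, b, c, d, e, f, g, h, n, ...) n
#define DEMO_NARGS(...) DEMO_NARGS_X(__VA_ARGS__, 7, 6, 5, 4, 3, 2, 1, 0, )

/* Two-level concatenation so that DEMO_NARGS(...) is expanded before ##. */
#define DEMO_CONCAT_X(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT_X(a, b)
#define DEMO_DISP(b, ...) DEMO_CONCAT(b, DEMO_NARGS(__VA_ARGS__))(__VA_ARGS__)
#define DEMO_CALL(...) DEMO_DISP(DEMO_SYSCALL, __VA_ARGS__)

/* One macro per argument count (only 0 to 4 are defined in this sketch). */
#define DEMO_SYSCALL0(name) demo_report(#name, 0)
#define DEMO_SYSCALL1(name, a1) demo_report(#name, 1)
#define DEMO_SYSCALL2(name, a1, a2) demo_report(#name, 2)
#define DEMO_SYSCALL3(name, a1, a2, a3) demo_report(#name, 3)
#define DEMO_SYSCALL4(name, a1, a2, a3, a4) demo_report(#name, 4)
```

With these definitions, DEMO_CALL(openat, 1, 2, 3, 4) expands to DEMO_SYSCALL4(openat, 1, 2, 3, 4), in the same way that INLINE_SYSCALL_CALL picks __INLINE_SYSCALL4 for the openat call in __libc_open.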
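Disassembly
On Linux, system calls are implemented with a software interrupt. The simple open example below shows briefly how open in user space ends up calling sys_open in the kernel.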
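#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h> /* close() */
int main(int argc, const char *argv[])
{
    int fd;
    /* O_RDWR on a directory fails with EISDIR, but the call still traps into the kernel. */
    fd = open(".", O_RDWR);
    close(fd);
    return 0;
}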
Compile open.c statically, then disassemble it.
arm-linux-gnueabihf-gcc open.c --static -o open
arm-linux-gnueabihf-objdump -DS open > open.dis
An excerpt of open.dis is shown below:
0001049c <main>:
1049c: b580 push {r7, lr}
1049e: b084 sub sp, #16
104a0: af00 add r7, sp, #0
104a2: 6078 str r0, [r7, #4]
104a4: 6039 str r1, [r7, #0]
104a6: 2102 movs r1, #2
104a8: f242 300c movw r0, #8972 ; 0x230c
104ac: f2c0 0005 movt r0, #5
104b0: f010 fae6 bl 20a80 <__libc_open>
104b4: 60f8 str r0, [r7, #12]
104b6: 68f8 ldr r0, [r7, #12]
104b8: f010 fc22 bl 20d00 <__libc_close>
104bc: 2300 movs r3, #0
104be: 4618 mov r0, r3
104c0: 3710 adds r7, #16
104c2: 46bd mov sp, r7
104c4: bd80 pop {r7, pc}
...
00020a80 <__libc_open>:
20a80: f8df c04a ldr.w ip, [pc, #74] ; 20ace <__libc_open+0x4e>
20a84: 44fc add ip, pc
20a86: f8dc c000 ldr.w ip, [ip]
20a8a: f09c 0f00 teq ip, #0
20a8e: b480 push {r7}
20a90: d108 bne.n 20aa4 <__libc_open+0x24>
20a92: 2705 movs r7, #5
20a94: df00 svc 0
20a96: bc80 pop {r7}
20a98: f510 5f80 cmn.w r0, #4096 ; 0x1000
20a9c: bf38 it cc
20a9e: 4770 bxcc lr
20aa0: f002 baf6 b.w 23090 <__syscall_error>
20aa4: b50f push {r0, r1, r2, r3, lr}
20aa6: f001 faa9 bl 21ffc <__libc_enable_asynccancel>
20aaa: 4684 mov ip, r0
20aac: bc0f pop {r0, r1, r2, r3}
// The system call number of sys_open is 5; it is placed in register r7
20aae: 2705 movs r7, #5
// In arch/arm/include/asm/unistd.h:
// #define __NR_open (__NR_SYSCALL_BASE+5)
// where __NR_SYSCALL_BASE is 0 for EABI and Thumb builds
20ab0: df00 svc 0 // raise the software interrupt
20ab2: 4607 mov r7, r0
20ab4: 4660 mov r0, ip
20ab6: f001 fae5 bl 22084 <__libc_disable_asynccancel>
20aba: 4638 mov r0, r7
20abc: f85d eb04 ldr.w lr, [sp], #4
20ac0: bc80 pop {r7}
20ac2: f510 5f80 cmn.w r0, #4096 ; 0x1000
20ac6: bf38 it cc
20ac8: 4770 bxcc lr
20aca: f002 bae1 b.w 23090 <__syscall_error>
20ace: 00058288 andeq r8, r5, r8, lsl #5
20ad2: 0000bf00 andeq fp, r0, r0, lsl #30
open ---> __libc_open() ---> svc 0 ---> vector_swi --->system_call ---> sys_func
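The instructions after each svc 0 in the listing (cmn.w r0, #4096 followed by it cc / bxcc lr, otherwise a branch to __syscall_error) implement the error check on the kernel's return value. A rough C model is sketched below. The function name demo_syscall_ret() is hypothetical, and the behaviour attributed to __syscall_error (store the negated value in errno and return -1) is an assumption, since its code is not part of the excerpt.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of the return path that follows "svc 0". */
long demo_syscall_ret(int32_t r0)
{
    /* cmn.w r0, #4096 sets the carry flag when r0 + 4096 wraps around as
     * an unsigned 32-bit addition, that is, when r0 is in [-4096, -1]. */
    if ((uint32_t)r0 >= 0xFFFFF000u) {
        /* Assumption: __syscall_error stores -r0 in errno and returns -1. */
        errno = -r0;
        return -1;
    }
    /* Carry clear: "it cc; bxcc lr" returns r0 to the caller unchanged. */
    return r0;
}
```

A successful descriptor such as 3 is returned as is, while a kernel result of -EISDIR becomes -1 with errno set to EISDIR.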
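Next, let's look at how the kernel handles it.
IV. Kernel analysis
EABI and OABI
EABI and OABI are two different Application Binary Interface (ABI) standards used in embedded systems and embedded software development.
EABI (Embedded Application Binary Interface): EABI is a standard ABI that defines the interface between an application written for an embedded system and the operating system, compiler, and hardware. It targets embedded devices and embedded operating systems, and gives a common way to specify function calling rules, register usage conventions, exception handling, data alignment, and so on. With EABI, developers can build and port applications across embedded platforms without dealing with the details of the underlying hardware.
OABI (old Application Binary Interface): OABI is an older ABI, usually associated with early ARM processors and early Linux kernel versions. Compared with EABI it uses different function calling conventions, register allocation rules, and exception handling. OABI is obsolete and no longer widely supported, and many embedded devices and operating systems have moved to EABI as the default ABI.
In short, both define the interface between applications and the operating system, compiler, and hardware; EABI is the newer, general standard, and OABI is the older one that it has gradually replaced.
An ABI is mainly responsible for the following:
Function calling convention: EABI defines the rules for function calls, including how arguments are passed, which registers are used and saved, and how return values are handled, so that separately built functions can call each other and return correctly.
Register usage convention: EABI fixes which registers hold function arguments, local variables, return values, and so on, which also helps execution efficiency.
Exception handling: EABI defines how exceptions and interrupts are handled, including how exception information is passed and how exceptions are prioritized and answered. This matters for the stability and reliability of an embedded system.
Data alignment: EABI specifies how data is aligned in memory, for both efficient and correct access. Some embedded processors in particular require data to be stored at specific byte alignments.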
entry-common.S
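File path: kernel/arch/arm/kernel/entry-common.S. It mainly does the following.
Low-level entry assembly such as entry-common.S in the Linux kernel is usually described as handling these tasks:
Setting the stack pointer: the entry code makes sure the kernel stack pointer is in place so that the kernel can use its stack correctly.
Initializing hardware: files of this kind may contain code that initializes hardware, such as setting up the interrupt controller or enabling the memory management unit (MMU). Such work happens at boot, so that the hardware is in a known state.
Setting up the system call table: the system call table is the key data structure for communication between user space and kernel space. entry-common.S defines it, filling in the address of each system call handler.
Handling exceptions and interrupts: the code handles various exception and interrupt cases, which may include timer interrupts, page faults, and system calls. Depending on the type, the handler saves context, calls the matching handler function, and so on.
Entering kernel mode: the code in entry-common.S runs with the processor in kernel mode, the mode that may execute privileged instructions and access privileged resources, so the kernel can perform sensitive operations such as modifying page tables or accessing hardware registers.
Switching and initializing the stack: the code may switch to the kernel stack and set up stack frames, so that the kernel can manage function calls and context switches correctly.
In short, entry-common.S is responsible for key low-level pieces of the kernel, handles exceptions and interrupts, and makes sure the kernel runs in the right context. Details vary between architectures and kernel versions, but these are the tasks usually attributed to it.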
#define NATIVE(nr, func) syscall nr, func

/*
 * This is the syscall table declaration for native ABI syscalls.
 * With EABI a couple syscalls are obsolete and defined as sys_ni_syscall.
 */
	syscall_table_start sys_call_table
#define COMPAT(nr, native, compat) syscall nr, native
#ifdef CONFIG_AEABI
#include <calls-eabi.S>
#else
#include <calls-oabi.S>
#endif
#undef COMPAT
	syscall_table_end sys_call_table
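The #include preprocessor directive copies the entire content of the named file into the current position during preprocessing, and calls-eabi.S holds the list of system call table entries.
calls-eabi.S is generated by arch/arm/tools/Makefile: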
kernel/arch/arm64$ grep -nr "calls-eabi" ../
../arm/kernel/entry-common.S:358:#include <calls-eabi.S>
../arm/tools/Makefile:17:gen-y += $(gen)/calls-eabi.S
../arm/tools/Makefile:77:systbl_abi_calls-eabi := common,eabi
../arm/tools/Makefile:78:$(gen)/calls-eabi.S: $(syscall) $(systbl) FORCE
After the #include directive is expanded, the result is:
syscall_table_start sys_call_table
NATIVE(0, sys_restart_syscall)
NATIVE(1, sys_exit)
NATIVE(2, sys_fork)
NATIVE(3, sys_read)
NATIVE(4, sys_write)
// ......
NATIVE(395, sys_pkey_alloc)
NATIVE(396, sys_pkey_free)
NATIVE(397, sys_statx)
NATIVE(398, sys_rseq)
NATIVE(399, sys_io_pgetevents)
syscall_table_end sys_call_table
syscall_table_start
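/* Define sys_call_table and reset __sys_nr to 0 */
.macro syscall_table_start, sym
	.equ __sys_nr, 0 // define __sys_nr with the value zero
	.type \sym, #object // mark the symbol's type as object
	ENTRY(\sym) // define a global symbol
.endm
The syscall_table_start macro takes one parameter, sym. It defines __sys_nr with the value zero, and .type marks the given symbol as an object. Note that ENTRY here is not the ENTRY keyword of a linker script; it is also a macro: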
#define ENTRY(name) \
.globl name; \
name:
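The argument passed in the call above is sys_call_table, so syscall_table_start sys_call_table means:
Create a symbol named sys_call_table and export it globally with .globl.
Define an internal symbol __sys_nr initialized to 0; it is used later to compute and check system call numbers.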
NATIVE
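Right after syscall_table_start sys_call_table comes the list of system calls written with NATIVE. NATIVE takes two parameters, the system call number and the corresponding system call function, and it is also defined as a macro: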
#define NATIVE(nr, func) syscall nr, func
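.macro syscall, nr, func
	// __sys_nr starts at 0 and always holds the number that follows
	// the last system call placed so far.
	// If the number being added is smaller than __sys_nr, it conflicts
	// with an entry that was already placed, so report an error.
	.ifgt __sys_nr - \nr
	.error "Duplicated/unorded system call entry"
	.endif
	// .rept repeats the statements up to the next .endr the given number of times
	.rept \nr - __sys_nr
	// Place sys_ni_syscall at the current address.
	// sys_ni_syscall is an empty function that simply returns -ENOSYS,
	// so any gap between defined system call numbers is filled with it.
	.long sys_ni_syscall
	.endr
	.long \func // place the system call function at the current address
	.equ __sys_nr, \nr + 1 // update __sys_nr to this system call number + 1
.endm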
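In effect, the NATIVE entries build an array named sys_call_table by appending function pointers one after another, with the system call number as the array index, plus some error checking.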
syscall_table_end
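Next comes the closing part of the table: syscall_table_end sys_call_table, where the argument is sys_call_table. It is implemented with a macro as well:
.macro syscall_table_end, sym
	// __NR_syscalls is the statically defined upper bound on system call
	// numbers for this system; the numbers placed so far must not exceed it
	.ifgt __sys_nr - __NR_syscalls
	.error "System call table too big"
	.endif
	.rept __NR_syscalls - __sys_nr
	// Unused numbers between the last defined system call and the
	// upper bound are filled with the empty function sys_ni_syscall
	.long sys_ni_syscall
	.endr
	.size \sym, . - \sym // set the size of sym, that is, of sys_call_table
.endm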
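The three assembler macros can be modeled in plain C to make the bookkeeping easier to follow. This is a sketch under assumptions, not kernel code: all demo_ names are hypothetical, the table size of 8 is arbitrary, and the bound check is done when an entry is added rather than at the end, so that the model never writes outside its array.

```c
#include <errno.h>

#define DEMO_NR_SYSCALLS 8          /* hypothetical stand-in for __NR_syscalls */

typedef long (*demo_syscall_fn)(void);

long demo_ni_syscall(void) { return -ENOSYS; }  /* filler for unused slots */
long demo_sys_a(void) { return 100; }           /* hypothetical handlers */
long demo_sys_b(void) { return 200; }

struct demo_table {
    demo_syscall_fn slot[DEMO_NR_SYSCALLS];
    int next_nr;                    /* plays the role of __sys_nr */
};

/* Model of syscall_table_start: reset the running number. */
void demo_table_start(struct demo_table *t)
{
    t->next_nr = 0;
}

/* Model of the syscall macro: reject duplicate or out-of-order numbers,
 * pad any gap with demo_ni_syscall, place func, advance the number.
 * Returns 0 on success and -1 on error. */
int demo_table_add(struct demo_table *t, int nr, demo_syscall_fn func)
{
    if (t->next_nr > nr || nr >= DEMO_NR_SYSCALLS)
        return -1;
    while (t->next_nr < nr)
        t->slot[t->next_nr++] = demo_ni_syscall;
    t->slot[nr] = func;
    t->next_nr = nr + 1;
    return 0;
}

/* Model of syscall_table_end: pad the rest of the table. */
void demo_table_end(struct demo_table *t)
{
    while (t->next_nr < DEMO_NR_SYSCALLS)
        t->slot[t->next_nr++] = demo_ni_syscall;
}
```

Adding entries 0 and 3 leaves slots 1 and 2 pointing at demo_ni_syscall, and a later attempt to add entry 2 is rejected, which corresponds to the "Duplicated/unorded system call entry" error in the assembler version.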
Arm32
/* excerpt from kernel/arch/arm64/include/asm/unistd32.h */
#define __NR_restart_syscall 0
__SYSCALL(__NR_restart_syscall, sys_restart_syscall)
#define __NR_exit 1
__SYSCALL(__NR_exit, sys_exit)
#define __NR_fork 2
__SYSCALL(__NR_fork, sys_fork)
#define __NR_read 3
__SYSCALL(__NR_read, sys_read)
#define __NR_write 4
__SYSCALL(__NR_write, sys_write)
#define __NR_open 5
__SYSCALL(__NR_open, compat_sys_open)
#define __NR_close 6
__SYSCALL(__NR_close, sys_close)
/* 7 was sys_waitpid */
__SYSCALL(7, sys_ni_syscall)
#define __NR_creat 8
__SYSCALL(__NR_creat, sys_creat)
#define __NR_link 9
__SYSCALL(__NR_link, sys_link)
The Arm32 entries match the official ARM system call table one to one.
Arm64
/* excerpt from kernel/arch/arm64/include/asm/unistd.h */
/*
* Compat syscall numbers used by the AArch64 kernel.
*/
#define __NR_compat_restart_syscall 0
#define __NR_compat_exit 1
#define __NR_compat_read 3
#define __NR_compat_write 4
#define __NR_compat_gettimeofday 78
#define __NR_compat_sigreturn 119
#define __NR_compat_rt_sigreturn 173
#define __NR_compat_clock_gettime 263
#define __NR_compat_clock_getres 264
#define __NR_compat_clock_gettime64 403
#define __NR_compat_clock_getres_time64 406
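The Arm64 excerpt does not seem to line up with the table. As its own comment says, these __NR_compat_* values are the 32-bit compat numbers that the AArch64 kernel refers to by name, not the native arm64 system call numbers.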
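System call handling in the kernel
Handling a system call is not as simple as it may seem: going from user space into the kernel requires a processor mode switch. The svc instruction is in effect a software interrupt instruction, and it is the only way for user space to enter the kernel on its own initiative (interrupts and other exceptions enter it passively). The corresponding mode change is from user mode to svc mode. Roughly, a system call made with svc proceeds as follows:
Executing svc raises a software interrupt. Execution jumps to the svc entry of the exception vector table, at address 0xffff0008 (configurable as 0x00000008), and the processor mode is set to svc.
The user-mode return state is saved so that the user process can resume when the system call returns.
The system call number passed in (r7) selects the system call to run in the kernel; for example, read maps to sys_read.
When the system call finishes, control returns to the user process, which continues executing.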
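The third step above, selecting the handler from the system call number, amounts to a bounds-checked table lookup. A minimal sketch in C follows; the names, the two-argument handler signature and the four-entry table are all assumptions chosen for illustration, and the real dispatch is done in assembly.

```c
#include <errno.h>

#define DISPATCH_NR 4               /* hypothetical number of table entries */

typedef long (*dispatch_fn)(long a0, long a1);

/* Filler handler, in the spirit of sys_ni_syscall. */
static long dispatch_ni(long a0, long a1)
{
    (void)a0;
    (void)a1;
    return -ENOSYS;
}

/* Hypothetical handler that stands in for a real system call. */
static long dispatch_add(long a0, long a1)
{
    return a0 + a1;
}

static const dispatch_fn dispatch_table[DISPATCH_NR] = {
    dispatch_ni, dispatch_add, dispatch_ni, dispatch_ni,
};

/* nr plays the role of r7; a0 and a1 play the roles of r0 and r1. */
long demo_dispatch(unsigned long nr, long a0, long a1)
{
    if (nr >= DISPATCH_NR)
        return -ENOSYS;             /* number outside the table */
    return dispatch_table[nr](a0, a1);
}
```

Here demo_dispatch(1, 2, 3) reaches dispatch_add, while an unused or out-of-range number yields -ENOSYS.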
Exception vector table
kernel/arch/arm64/kernel/entry.S
/*
 * Exception vectors.
 */
	// .pushsection is an assembler directive that places the code that follows
	// in the section named ".entry.text", with section flags "ax"
	.pushsection ".entry.text", "ax"

	.align 11
SYM_CODE_START(vectors)
	kernel_ventry 1, sync_invalid // Synchronous EL1t
	kernel_ventry 1, irq_invalid // IRQ EL1t
	kernel_ventry 1, fiq_invalid // FIQ EL1t
	kernel_ventry 1, error_invalid // Error EL1t

	kernel_ventry 1, sync // Synchronous EL1h
	kernel_ventry 1, irq // IRQ EL1h
	kernel_ventry 1, fiq_invalid // FIQ EL1h
	kernel_ventry 1, error // Error EL1h

	kernel_ventry 0, sync // Synchronous 64-bit EL0
	kernel_ventry 0, irq // IRQ 64-bit EL0
	kernel_ventry 0, fiq_invalid // FIQ 64-bit EL0
	kernel_ventry 0, error // Error 64-bit EL0

#ifdef CONFIG_COMPAT
	kernel_ventry 0, sync_compat, 32 // Synchronous 32-bit EL0
	kernel_ventry 0, irq_compat, 32 // IRQ 32-bit EL0
	kernel_ventry 0, fiq_invalid_compat, 32 // FIQ 32-bit EL0
	kernel_ventry 0, error_compat, 32 // Error 32-bit EL0
#else
	kernel_ventry 0, sync_invalid, 32 // Synchronous 32-bit EL0
	kernel_ventry 0, irq_invalid, 32 // IRQ 32-bit EL0
	kernel_ventry 0, fiq_invalid, 32 // FIQ 32-bit EL0
	kernel_ventry 0, error_invalid, 32 // Error 32-bit EL0
#endif
SYM_CODE_END(vectors)
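The whole vector table is placed in its own .vectors section and contains the reset, undefined, abort, irq and other exception vectors. The svc vector is the third entry. It is a jump: the .L prefix marks the symbol that follows as a local symbol, and the instruction loads the value stored at __vectors_start+0x1000 into pc. (This description matches the 32-bit table in arch/arm/kernel/entry-armv.S rather than the arm64 table listed above.) Where, then, is __vectors_start located?
The linker script kernel/arch/arm64/kernel/vmlinux.lds shows the corresponding link layout: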
.text : {
_stext = .;
. = ALIGN(8); __irqentry_text_start = .; *(.irqentry.text) __irqentry_text_end = .;
. = ALIGN(8); __softirqentry_text_start = .; *(.softirqentry.text) __softirqentry_text_end = .;
. = ALIGN(8); __entry_text_start = .; *(.entry.text) __entry_text_end = .;
. = ALIGN(8); *(.text.hot .text.hot.*) *(.text .text.fixup) *(.text.unlikely .text.unlikely.*) *(.text.unknown .text.unknown.*) . = ALIGN(8); __noinstr_text_start = .; *(.noinstr.text) __noinstr_text_end = .; *(.text..refcount) *(.ref.text)
. = ALIGN(8); __sched_text_start = .; *(.sched.text) __sched_text_end = .;
. = ALIGN(8); __cpuidle_text_start = .; *(.cpuidle.text) __cpuidle_text_end = .;
. = ALIGN(8); __lock_text_start = .; *(.spinlock.text) __lock_text_end = .;
. = ALIGN(8); __kprobes_text_start = .; *(.kprobes.text) __kprobes_text_end = .;
. = ALIGN(0x00001000); __hyp_idmap_text_start = .; *(.hyp.idmap.text) __hyp_idmap_text_end = .; __hyp_text_start = .; *(.hyp.text) __hyp_text_end = .;
. = ALIGN(0x00001000); __idmap_text_start = .; *(.idmap.text) __idmap_text_end = .;
. = ALIGN((1 << 12)); __entry_tramp_text_start = .; *(.entry.tramp.text) . = ALIGN((1 << 12)); __entry_tramp_text_end = .;
*(.fixup)
*(.gnu.warning)
. = ALIGN(16);
*(.got)
}
Sorry, I cannot tell from this where the vectors end up being linked. I am stuck here, so this is pending.
V. Summary
TODO
References:
https://dandelioncloud.cn/article/details/1570975611994992642