B10502224 古智斌 OS HW1

Part 1


When an x86 PC boots, it will execute a program called BIOS (Basic Input/Output System). BIOS will prepare the hardware and transfer control to the OS. It will transfer the control to code loaded from the boot sector, the first 512-byte sector of the boot disk. The boot sector contains the boot loader: instructions that load the kernel into memory. The BIOS loads the boot sector at memory address 0x7c00 and then jumps (sets the processor’s %ip, the instruction pointer) to that address. The loader’s job is to put the processor in a more modern operating mode, to load the xv6 kernel from disk into memory, and then to transfer control to the kernel. The BIOS doesn’t guarantee anything about the contents of %ds, %es, %ss, so first order of business after disabling interrupts is to set %ax to zero and then copy that zero into %ds, %es, and %ss.
The boot loader’s first action in 32-bit mode is to initialize the data segment registers with SEG_KDATA. Logical address now map directly to physical addresses. The only step left before executing C code is to set up stack in an unused region of memory. The memory from 0xa0000 to 0x100000 is typically littered with device memory regions, and the xv6 kernel expects to be placed at 0x100000. The boot loader itself is at 0x7c00 through 0x7e00 (512 bytes). Essentially any other section of memory would be a fine location for the stack. The boot loader chooses 0x7c00 (known in this file as $start) as the top of the stack; the stack will grow down from there, toward 0x0000, away from the boot loader.

Part 2

Chapter 0. Operating Systems Interfaces

The job of an operating system is to share a computer among multiple programs and to provide a more useful set of services than the hardware alone supports. The operating system manages and abstracts the low-level hardware and shares the hardware among multiple programs. Operating systems also provide controlled ways for programs to interact, so that they can share data or work together.
The kernel is a special program that provides services to running programs. Each running program, called a process, has memory containing instructions, data, and a stack. The instructions implement the program’s computation. The data are the variables on which the computation acts. The stack organizes the program’s procedure calls. When a process needs to invoke a kernel service, it invokes a procedure call called a system call. The system call enters the kernel; the kernel performs the service and returns. Thus a process alternates between executing in the user space and kernel space. The shell is an ordinary program that reads commands from the user and executes them; it’s the primary user interface to traditional Unix-like systems.
An xv6 process consists of user-space memory (instructions, data, and stack) and per-process state private to the kernel. Xv6 time-shares between processes, when a process is not executing, xv6 saves its CPU registers, restoring them when it next runs the process. The kernel associates a process identifier (pid) with each process. A process is able to create a new process using the fork system call. Fork creates a new process called the child process with exactly the same memory contents as the calling process (parent process). Fork returns in both the parent and the child. In the parent, fork returns the child’s pid; while in the child it returns zero. The exit system call causes the calling process to stop executing and to release resources such as memory and open files. The wait system call returns the pid of an exited child of the current process; if none of the callers children has exited, wait waits for one to do so. The exec system call replaces the calling process’s memory with a new memory image loaded from a file stored in the file system. The file must have a particular format, which specifies which part of the file holds instructions, which part is data, at which instruction to start. The main structure of the shell is as follows: the main loop reads a line of input from the user with getcmd, then it calls fork, which creates a copy of the shell process; the parent calls wait, while the child runs the command. Fork allocates the memory required for the child’s copy of the parent’s memory, and exec allocates enough memory to hold the executable file, if the process needs more memory at run-time, it can call sbrk(n) to grow its data memory by n bytes; sbrk will return the location of the new memory.
A file descriptor is a small integer representing a kernel-managed object that a process may read from or write to. A process reads from file descriptor 0 (standard input), writes output to file descriptor 1 (standard output), and writes error messages to file descriptor 2 (standard error). The call read(fd, buf, n) reads at most n bytes from the file descriptor fd, copies them into buf, and returns the number of bytes read. The call write read(fd, buf, n) writes n bytes from buf to the file descriptor fd and returns the number of bytes written. The close system call releases a file descriptor, making it free for reuse by a future open, pipe, or dup system call. File descriptors and fork interact to make I/O redirection easy to implement. Fork copies the parent’s file descriptor table along with its memory; the system call exec will replace the calling process’s memory but preserves its file table. It’s a good idea to separate fork and exec in two different calls, the reason is because the shell can fork a child, use open, close, dup in the child to change the standard input and output file descriptors, and then exec. The dup system call duplicates an existing file descriptor, returning a new one that refers to the same underling I/O object. Both file descriptors share an offset, just as the file descriptors duplicated by fork do. Two file descriptors share an offset if they were derived from the same original file descriptor by a sequence of fork and dup calls. Otherwise file descriptors do not share offsets, even if they resulted from open calls for the same file.
A pipe is a small kernel buffer exposed to processes as a pair of file descriptors, one for reading and one for writing. Writing data to one end of the pipe makes that data available for reading from the other end of the pipe. Pipes provide a way for processes to communicate. If no data is available, a read on a pipe waits for either data to be written or all file descriptors referring to the write end to be closed. The advantages of pipes over temporary files are: first, they clean themselves up; second, they can pass arbitrarily long streams of data; third, pipes allow parallel execution of pipeline stages; and fourth, if you’re implementing inter-process communication, pipes’ blocking reads and writes are more efficient than the non-blocking semantics of files.
The xv6 file system provides data files, which are uninterpreted byte arrays, and directories, which contain named references to data files and other directories. The directories form a tree, starting at a special directory called the root. You can use mkdir to create a new directory, open with the 0_CREATE flag to create a new data file, and mknod to create a new device file. A file’s name is distinct from the file itself; the same underlying file, called an inode, can have multiple names, called links. The link system call creates another file system name referring to the same inode as an existing file.

Chapter 1. Operating Systems Organization

A key requirement for an operating system is to support several activities at once. An operating system must fulfil three requirements: multiplexing, isolation, and interaction. Strong isolation requires a hard boundary between applications and the operating system. Processors provide hardware support for strong isolation. The xv6 processor has two modes in which the processor can execute instructions: kernel mode and user mode. In kernel mode the processor is allowed to execute privileged instructions. An application can execute only user-mode instructions and is said to be running in user space, while the software in kernel mode can also execute privileged instructions and is said to be running in kernel space. The software running in kernel space (or in kernel mode) is called the kernel. An application that wants to read or write a file on disk must transition to the kernel to do so, because the application itself cannot execute I/O instructions. Processors provide a special instruction that switches the processor from user mode to kernel mode and enters the kernel at an entry point specified by the kernel (the x86 processor provides the int instruction for this purpose).
Many Unix operating systems uses monolithic kernel, in this organization, the entire operating system resides in the kernel so that implementations of all system calls run in kernel mode, so the entire operating system will run with full hardware privilege. A downside of the monolithic organization is that the interfaces between different parts of the operating system are often complex, making it easy for the OS developer to make a mistake. In a monolithic kernel, a mistake is fatal, because an error in kernel mode will often result in the kernel to fail, causing the computer to stop working, all applications will fail too and the computer must reboot to start again. One way to reduce the risk of mistakes is by minimizing the amount of operating system code that runs in kernel mode, and execute the bulk of the OS in user mode. This kernel organization is called microkernel. OS services running as processes are called servers. In a microkernel, the kernel interface consists of a few low-level functions. This organization allows the kernel to be relatively simple, as most of the operating system resides in user-level servers. Xv6 is implemented as a monolithic kernel, following most Unix OS. In xv6, the kernel interface corresponds to the operating system interface, and the kernel implements the complete operating system.
The unit of isolation in xv6 (as in other Unix OS) is a process. The process abstraction prevents one process from wrecking or spying on another process’ memory, CPU, file descriptors. The mechanisms used by the kernel to implement processes include the user/kernel mode flag, address spaces, and time-slicing of threads. A process provides a program with what appears to be a private memory system, or address space, which other processes cannot read or write. A process also provides the program with what appears to be its own CPU to execute the program’s instructions. Xv6 uses page tables (which are implemented by hardware) to give each process its own address space. The x86 page table translates (or “maps”) a virtual address (the address that an x86 instruction manipulates) to a physical address (an address that the processor chip sends to main memory). An address space includes the process’s user memory starting at virtual address zero. In order to leave plenty of room for user memory, xv6’s address spaces map the kernel at high addresses, starting at 0x80100000. The xv6 kernel maintains many pieces of state for each process, which it gathers into a struct proc. A process’s most important pieces of kernel state are its page table, its kernel stack, and its run state. We’ll use the notation p->xxx to refer to elements of the proc structure. Each process has two stacks: a user stack and a kernel stack (p->kstack). When the process is executing user instructions, only its user stack is in use, and its kernel stack is empty. When the process enters the kernel (for a system call or interrupt), the kernel code executes on the process’s kernel stack; while a process is in the kernel, its user stack still contains saved data, but isn’t actively used. When a process makes a system call, the processor switches to the kernel stack, raises the hardware privilege level, and starts executing the kernel instructions that implement the system call. p->state indicates whether the process is allocated, ready to run, running, waiting for I/O, or exiting. p->pgdir holds the process’s page table, which serves as the record of the addresses of the physical pages allocated to store the process’s memory.
The first step in providing strong isolation is setting up the kernel to run in its own address space. When a PC powers on, it initializes itself and then loads a boot loader from disk into memory and executes it. Xv6’s boot loader loads the xv6 kernel from disk and executes it starting at entry. The boot loader loads the xv6 kernel into memory at physical address 0x100000. The reason it doesn’t load the kernel at 0x80100000, where the kernel expects to find its instructions and data, is that there may not be any physical memory at such a high address on a small machine. The reason it places the kernel at 0x100000 rather than 0x0 is because the address range 0xa0000:0x100000 contains I/O devices. To allow the rest of the kernel to run, entry sets up a page table that maps virtual addresses starting at 0x80000000 (called KERNBASE) to physical addresses starting at 0x0. Now entry needs to transfer to the kernel’s C code, and run it in high memory. First it makes the stack pointer, %esp, point to memory to be used as a stack. All symbols have high addresses, including stack, so the stack will still be valid even when the low mappings are removed. Finally entry jumps to main, which is also a high address. The indirect jump is needed because the assembler would otherwise generate a PC-relative direct jump, which would execute the low-memory version of main. Main cannot return, since there’s no return PC on the stack. Now the kernel is running in high addresses in the function main.
After main initializes several devices and subsystems, it creates the first process by calling userinit. Userinit’s first action is to call allocproc. The job of allocproc is to allocate a slot (a struct proc) in the process table and to initialize the parts of the process’s state required for its kernel thread to execute. Allocproc is called for each new process, while userinit is called only for the very first process. Allocproc scans the proc table for a slot with state UNUSED. When it finds an unused slot, allocproc sets the state to EMBRYO to mark it as used and gives the process a unique pid. Next, it tries to allocate a kernel stack for the process’s kernel thread. If the memory allocation fails, allocproc changes the state back to UNUSED and returns zero to signal failure. The way that control transfers from user software to the kernel is via an interrupt mechanism, which is used by system calls, interrupts, and exceptions. Whenever control transfers into the kernel while a process is running, the hardware and xv6 trap entry code save user registers on the process’s kernel stack. The stack pointer %esp is set to the process’s largest valid virtual address, p->sz. The instruction pointer is set to the entry point for the initcode, address 0. Once the process is initialized, userinit marks it available for scheduling by setting p->state to RUNNABLE.
After the first process is prepared, we can start running it. After main calls userinit, mpmain calls scheduler to start running processes. scheduler looks for a process with p->state set to RUNNABLE, and there’s only one: initproc. scheduler now sets p->state to RUNNING and calls swtch to perform context switch to the target process’s kernel thread. swtch first saves the current registers. The current context is not a process but rather a special per-cpu scheduler context, so scheduler tells swtch to save the current hardware registers in per-cpu storage (cpu->scheduler) rather than in any process’s kernel thread context. swtch then loads the saved registers of the target kernel thread (p->context into the x86 hardware registers, including the stack pointer and instruction pointer. The final ret instruction pops the target process’s %eip from the stack, finishing the context switch. Now the processor is running on the kernel stack of process p.
The first action of initcode.S is to invoke the exec system call. Exec replaces the memory and registers of the current process with a new program, but it leaves the file descriptors, process id, and parent process unchanged. Initcode.S begins by pushing three values on the stack --$argv, $init, and $0—and then sets %eax to SYS_exec and executes int T_SYSCALL: it is asking the kernel to run the exec system call. Exec at a high level replaces initcode with the /init binary, loaded out of the file system. Now init-code is done, and the process will run /init instead. Init creates a new console device file if needed and then opens it as file descriptors 0, 1, and 2. Then it loops, starting a console shell, handles orphaned zombies until the shell exits, and repeats. The system is up.

Part 3

Chapter 6. File Systems


The purpose of a file system is to organize and store data. File systems typically support sharing of data among users and applications, as well as persistence so that data is still available after a reboot. The xv6 file system implementation is organized in seven layers, shown in Figure 6-1. The disk layer reads and writes blocks on an IDE hard drive. The buffer cache layer caches disk blocks and synchronizes access to them, making sure that only one kernel process at a time can modify the data stored in any particular block. The logging layer allows higher layers to wrap updates to several blocks in a transaction, and ensures that the blocks are updated atomically in the face of crashes (i.e., all of them are updated or none). The inode layer provides individual files, each represented as an inode with a unique i-number and some blocks holding the file’s data. The directory layer implements each directory as a special kind of inode whose content is a sequence of directory entries, each of which contains a file’s name and i-number. The pathname layer provides hierarchical path names like /usr/rtm/xv6/fs.c, and resolves them with recursive lookup. The file descriptor layer abstracts many Unix resources (e.g., pipes, devices, files, etc.) using the file system interface, simplifying the lives of application programmers.

The file system must have a plan for where it stores inodes and content blocks on the disk. To do so, xv6 divides the disk into several sections, as shown in Figure 6-2. The file system does not use block 0 (it holds the boot sector). Block 1 is called the superblock; it contains metadata about the file system (the file system size in blocks, the number of data blocks, the number of inodes, and the number of blocks in the log). Blocks starting at 2 hold the log. After the log are the inodes, with multiple inodes per block. After those come bitmap blocks tracking which data blocks are in use. The remaining blocks are data blocks; each is either marked free in the bitmap block, or holds content for a file or directory. The superblock is filled in by a separate program, called mfks, which builds an initial file system.

The on-disk inode structure, struct dinode, contains a size and an array of block numbers. The inode data is found in the blocks listed in the dinode’s addrs array.

File System Layers

Each layer’s principal functions are described briefly below:

Buffer cache layer

Buffer cache: it has two jobs, first it synchronize access to disk blocks to ensure that only one copy of a block is in memory and that only one kernel thread at a time uses that copy; second is to cache popular blocks so that they don’t need to be re-read from the slow disk. The buffer cache is a linked list of buf structures holding cached copies of disk block contents. Caching disk blocks in memory reduces the number of disk reads and also provides a synchronization point for disk blocks used by multiple processes. To get a buffer for a particular disk block, call bread. After changing buffer data, call bwrite to write it to disk. When done with the buffer, call brelse. Do not use the buffer after calling brelse. Only one process at a time can use a buffer, so do not keep them longer than necessary.

struct buf { 
int flags; 
uint dev; 
uint blockno; 
struct sleeplock lock;  
uint refcnt; 
struct buf *prev; // LRU cache list 
struct buf *next; 
struct buf *qnext; // disk queue 
uchar data[BSIZE]; 
}; 
#define B_VALID 0x2 // buffer has been read from disk and contains a copy of the block.
#define B_DIRTY 0x4 // buffer needs to be written to disk because content has been modified.


// sleep-lock protects reads and writes of the block’s buffered content, because only one process at a time can use a buffer.
// Long−term locks for processes 
struct sleeplock { 
uint locked; // Is the lock held? 
struct spinlock lk; // spinlock protecting this sleep lock 

// For debugging: 
char *name; // Name of lock. 
int pid; // Process holding lock 
 };

//Creating a new buffer.
struct {
struct spinlock lock;
struct buf buf[NBUF];
// Linked list of all buffers, through prev/next.
// head.next is most recently used.
struct buf head;
} bcache;
Logging Layer

Logging layer solves the problem of crashes during file system operations with a simple form of logging. An xv6 system call does not directly write the on-disk file system data structures. Instead, it places a description of all the disk writes it wishes to make in a log on the disk. Once the system call has logged all of its writes, it writes a special commit record to the disk indicating that the log contains a complete operation. At that point the system call copies the writes to the on-disk file system data structures. After those writes have completed, the system call erases the log on disk. The log makes operations atomic with respect to crashes: after recovery, either all of the operation’s writes appear on the disk, or none of them appear. Log is able to commit several transactions together through group commit. Group commit reduces the number of disk operations and hands the disk system more concurrent writes at the same time.

// Contents of the header block, used for both the on−disk header block 
// and to keep track in memory of logged block# before commit. 
struct logheader { 
int n; 
int block[LOGSIZE]; 
}; 
struct log { 
struct spinlock lock; 
 int start; 
int size; 
int outstanding; // how many FS sys calls are executing. 
int committing; // in commit(), please wait. 
int dev; 
struct logheader lh; 
 };

// log_write() acts as a proxy for bwrite. It records the block’s sector number in memory, reserving it a slot in the log on disk, and marks the buffer B_DIRTY to prevent the block cache from evicting it. It notices when a block is written multiple times during a single  transaction, and allocates that block the same slot in the log. This optimization is called absorption.
//log_write() replaces bwrite(); a typical use is:
// bp = bread(...)
// modify bp−>data[]
// log_write(bp)
// brelse(bp)
void
log_write(struct buf *b)
{
int i;
if (log.lh.n >= LOGSIZE || log.lh.n >= log.size − 1)
panic("too big a transaction");
if (log.outstanding < 1)
panic("log_write outside of trans");
acquire(&log.lock);
for (i = 0; i < log.lh.n; i++) {
if (log.lh.block[i] == b−>blockno) // log absorbtion
break;
}
log.lh.block[i] = b−>blockno;
if (i == log.lh.n)
log.lh.n++;
b−>flags |= B_DIRTY; // prevent eviction
release(&log.lock);
}

//end_op decrements the count of outstanding system calls. If the count is zero, it commits the transaction by calling commit().
end_op(void)
{
int do_commit = 0;

acquire(&log.lock);
log.outstanding −= 1;
if(log.committing)
panic("log.committing");
if(log.outstanding == 0){
do_commit = 1;
log.committing = 1;
} else {
// begin_op() may be waiting for log space,
// and decrementing log.outstanding has decreased
// the amount of reserved space.
wakeup(&log);
}
release(&log.lock);
if(do_commit){
// call commit w/o holding locks, since not allowed
// to sleep with locks.
commit();
acquire(&log.lock);
log.committing = 0;
wakeup(&log);
release(&log.lock);
}
}

Inode Layer

Inode might refer to on-disk data structure containing a file’s size and list of data block numbers. It might also refer to an in-memory inode, which contains a copy of the on-disk inode as well as extra information needed within the kernel. The on-disk inode is defined by a struct dinode and are packed into contiguous area of disk called inode blocks, it has a i-number n which is how they are identified in the implementation. An inode describes a single unnamed file. The inode disk structure holds metadata: the file’s type, its size, the number of links referring to it, and the list of blocks holding the file’s content.

// On−disk inode structure
struct dinode {
short type; // File type
short major; // Major device number (T_DEV only)
short minor; // Minor device number (T_DEV only)
short nlink; // Number of links to inode in file system
uint size; // Size of file (bytes)
uint addrs[NDIRECT+1]; // Data block addresses
};
// in−memory copy of an inode
struct inode {
uint dev; // Device number
uint inum; // Inode number
int ref; // Reference count
struct sleeplock lock; // protects everything below here
int valid; // inode has been read from disk?
short type; // copy of disk inode
short major;
short minor;
short nlink;
uint size;
uint addrs[NDIRECT+1];
};

struct {
struct spinlock lock;
struct inode inode[NINODE];
} icache;
void
iinit(int dev)
{
int i = 0;
initlock(&icache.lock, "icache");
for(i = 0; i < NINODE; i++) {
initsleeplock(&icache.inode[i].lock, "inode");
}
readsb(dev, &sb);
cprintf("sb: size %d nblocks %d ninodes %d nlog %d logstart %d\
inodestart %d bmap start %d\n", sb.size, sb.nblocks,
sb.ninodes, sb.nlog, sb.logstart, sb.inodestart,
sb.bmapstart);
}
static struct inode* iget(uint dev, uint inum);


// Allocate an inode on device dev.
// Mark it as allocated by giving it type type.
// Returns an unlocked but allocated and referenced inode.
struct inode*
ialloc(uint dev, short type)
{
int inum;
struct buf *bp;
struct dinode *dip;
for(inum = 1; inum < sb.ninodes; inum++){
bp = bread(dev, IBLOCK(inum, sb));
dip = (struct dinode*)bp−>data + inum%IPB;
if(dip−>type == 0){ // a free inode
memset(dip, 0, sizeof(*dip));
dip−>type = type;
log_write(bp); // mark it allocated on the disk
brelse(bp);
return iget(dev, inum);
}
brelse(bp);
}
panic("ialloc: no inodes");
}


//iget() provides non-exclusive access to an inode, so that there can be many pointers to the same inode, it holds long-term references to inodes and prevents races while avoiding deadlock. To ensure an inode holds a copy of the on-disk inode, code must call ilock. This will lock the inode so that no other process can ilock it and reads the inode from disk, if it has not already been read. Iunlock releases the lock on the inode.
// Find the inode with number inum on device dev
// and return the in−memory copy. Does not lock
// the inode and does not read it from disk.
static struct inode*
iget(uint dev, uint inum)
{
struct inode *ip, *empty;
acquire(&icache.lock);
// Is the inode already cached?
empty = 0;
for(ip = &icache.inode[0]; ip < &icache.inode[NINODE]; ip++){
if(ip−>ref > 0 && ip−>dev == dev && ip−>inum == inum){
ip−>ref++;
release(&icache.lock);
return ip;
}
if(empty == 0 && ip−>ref == 0) // Remember empty slot.
empty = ip;
}



// Lock the given inode.
// Reads the inode from disk if necessary.
void
ilock(struct inode *ip)
{
struct buf *bp;
struct dinode *dip;
if(ip == 0 || ip−>ref < 1)
panic("ilock");
acquiresleep(&ip−>lock);
if(ip−>valid == 0){
bp = bread(ip−>dev, IBLOCK(ip−>inum, sb));
dip = (struct dinode*)bp−>data + ip−>inum%IPB;
ip−>type = dip−>type;
ip−>major = dip−>major;
ip−>minor = dip−>minor;
ip−>nlink = dip−>nlink;
ip−>size = dip−>size;
memmove(ip−>addrs, dip−>addrs, sizeof(ip−>addrs));
brelse(bp);
ip−>valid = 1;
if(ip−>type == 0)
panic("ilock: no type");
}
}
// Unlock the given inode.
void
iunlock(struct inode *ip)
{
if(ip == 0 || !holdingsleep(&ip−>lock) || ip−>ref < 1)
panic("iunlock");
releasesleep(&ip−>lock);
}

//iput releases a C pointer to an inode by decrementing the reference count. If this is the last reference, the inode’s slot in the inode cache is now free and can be re-used for a different inode. Iput calls intrunc to truncate the file to zero bytes, freeing the data blocks; sets the inode type to 0 (unallocated); and writes the inode to disk. Iput doesn’t truncate a file immediately when the link count for the file drops to zero, because some process might still hold a reference to the inode in memory: a process might still be reading and writing to the file, because it successfully opened it. But, if a crash happens before the last process closes the file descriptor for the file, then the file will be marked allocated on disk but no directory entry points to it.
// Drop a reference to an in−memory inode.
// If that was the last reference, the inode cache entry can
// be recycled.
// If that was the last reference and the inode has no links
// to it, free the inode (and its content) on disk.
// All calls to iput() must be inside a transaction in
// case it has to free the inode.
void
iput(struct inode *ip)
{
acquiresleep(&ip−>lock);
if(ip−>valid && ip−>nlink == 0){
acquire(&icache.lock);
int r = ip−>ref;
release(&icache.lock);
if(r == 1){
// inode has no links and no other references: truncate and free.
itrunc(ip);
ip−>type = 0;
iupdate(ip);
ip−>valid = 0;
}
}
releasesleep(&ip−>lock);
acquire(&icache.lock);
ip−>ref−−;
release(&icache.lock);
} 
// Truncate inode (discard contents).
// Only called when the inode has no links
// to it (no directory entries referring to it)
// and has no in−memory reference to it (is
// not an open file or current directory).
static void
itrunc(struct inode *ip)
{
int i, j;
struct buf *bp;
uint *a;
for(i = 0; i < NDIRECT; i++){
if(ip−>addrs[i]){
bfree(ip−>dev, ip−>addrs[i]);
ip−>addrs[i] = 0;
}
}
if(ip−>addrs[NDIRECT]){
bp = bread(ip−>dev, ip−>addrs[NDIRECT]);
a = (uint*)bp−>data;
for(j = 0; j < NINDIRECT; j++){
if(a[j])
bfree(ip−>dev, a[j]);
}
brelse(bp);
bfree(ip−>dev, ip−>addrs[NDIRECT]);
ip−>addrs[NDIRECT] = 0;
}
ip−>size = 0;
iupdate(ip);
}

Directory Layer

A directory is implemented internally much like a file. Its inode has type T_DIR and its data is a sequence of directory entries. Each entry is a struct dirent which contains a name and an inode number. Dirlookup searches a directory for an entry with the given name. If dirlookup finds an entry with the right name, it updates*poff, releases the block, and returns an unlocked inode obtained via iget. Dirlookup is the reason that iget returns unlocked inodes. Dirlink writes a new directory entry with the given name and inode number into the directory dp, if the name already exists, dirlink returns an error.

// Look for a directory entry in a directory.
// If found, set *poff to byte offset of entry.
struct inode*
dirlookup(struct inode *dp, char *name, uint *poff)
{
uint off, inum;
struct dirent de;
if(dp−>type != T_DIR)
panic("dirlookup not DIR");
for(off = 0; off < dp−>size; off += sizeof(de)){
if(readi(dp, (char*)&de, off, sizeof(de)) != sizeof(de))
panic("dirlookup read");
if(de.inum == 0)
continue;
if(namecmp(name, de.name) == 0){
// entry matches path element
if(poff)
*poff = off;
inum = de.inum;
return iget(dp−>dev, inum);
}
}
return 0;
}
// Write a new directory entry (name, inum) into the directory dp.
int
dirlink(struct inode *dp, char *name, uint inum)
{
int off;
struct dirent de;
struct inode *ip;
// Check that name is not present.
if((ip = dirlookup(dp, name, 0)) != 0){
iput(ip);
return −1;
}

// Look for an empty dirent.
for(off = 0; off < dp−>size; off += sizeof(de)){
if(readi(dp, (char*)&de, off, sizeof(de)) != sizeof(de))
panic("dirlink read");
if(de.inum == 0)
break;
}

strncpy(de.name, name, DIRSIZ);
de.inum = inum;
if(writei(dp, (char*)&de, off, sizeof(de)) != sizeof(de))
panic("dirlink");
return 0;
}

Path Names Layer

Namei evaluates path and returns the corresponding inode. The function nameiparent is a variant: it stops before the last element, returning the inode of the parent directory and copying the final element into name. Namex starts by deciding where the path evaluation begins. Then it uses skipelem to consider each element of the path in turn. Each iteration of the loop must look up name in the current inode ip.

// Look up and return the inode for a path name.
// If parent != 0, return the inode for the parent and copy the final
// path element into name, which must have room for DIRSIZ bytes.
// Must be called inside a transaction since it calls iput().
static struct inode*
namex(char *path, int nameiparent, char *name)
{
struct inode *ip, *next;
if(*path == ’/’)
ip = iget(ROOTDEV, ROOTINO);
else
ip = idup(myproc()−>cwd);
while((path = skipelem(path, name)) != 0){
ilock(ip);
if(ip−>type != T_DIR){
iunlockput(ip);
return 0;
}
if(nameiparent && *path == ’\0’){
// Stop one level early.
iunlock(ip);
return ip;
}
if((next = dirlookup(ip, name, 0)) == 0){
iunlockput(ip);
return 0;
}
iunlockput(ip);
ip = next;
}
if(nameiparent){
iput(ip);
return 0;
}
return ip;
}
struct inode*
namei(char *path)
{
char name[DIRSIZ];
return namex(path, 0, name);
}
struct inode*
nameiparent(char *path, char *name)
{
return namex(path, 1, name);
}

File Descriptor Layer

Most resources in Unix are represented as files. The file descriptor layer is the layer that achieves this uniformity. Each open file is represented by a struct file. All the open files in the system are kept in a global file table, the ftable. The file table has a function to allocate a file, create a duplicate reference and release a reference, and read and write data. File alloc scans the file table for an unreferenced file (f->ref==0) and returns a new reference; filedup increments the reference count; and fileclose decrements it. When a file’s reference count reaches zero, fileclose release the underlying pipe or inode, according to the type. The functions filestat, fileread, and filewrite implement the stat, read, and write operations on files. Filestat is only allowed on inodes and calls stati. Fileread and filewrite check that the operation is allowed by the open mode and then pass the call through to either the pipe or inode implementation.

// Allocate a file structure.
struct file*
filealloc(void)
{
struct file *f;
acquire(&ftable.lock);
for(f = ftable.file; f < ftable.file + NFILE; f++){
if(f−>ref == 0){
f−>ref = 1;
release(&ftable.lock);
return f;
}
}
release(&ftable.lock);
return 0;
}
// Increment ref count for file f.
struct file*
filedup(struct file *f)
{
acquire(&ftable.lock);
if(f−>ref < 1)
panic("filedup");
f−>ref++;
release(&ftable.lock);
return f;
}
// Close file f. (Decrement ref count, close when reaches 0.)
void
fileclose(struct file *f)
{
struct file ff;
acquire(&ftable.lock);
if(f−>ref < 1)
panic("fileclose");
if(−−f−>ref > 0){
release(&ftable.lock);
return;
}
ff = *f;
f−>ref = 0;
f−>type = FD_NONE;
release(&ftable.lock);
if(ff.type == FD_PIPE)
pipeclose(ff.pipe, ff.writable);
else if(ff.type == FD_INODE){
begin_op();
iput(ff.ip);
end_op();
}
}
// Get metadata about file f.
int
filestat(struct file *f, struct stat *st)
{
if(f−>type == FD_INODE){
ilock(f−>ip);
stati(f−>ip, st);
iunlock(f−>ip);
return 0;
}
return −1;
}
// Read from file f.
int
fileread(struct file *f, char *addr, int n)
{
int r;

if(f−>readable == 0)
return −1;
if(f−>type == FD_PIPE)
return piperead(f−>pipe, addr, n);
if(f−>type == FD_INODE){
ilock(f−>ip);
if((r = readi(f−>ip, addr, f−>off, n)) > 0)
f−>off += r;
iunlock(f−>ip);
return r;
}
panic("fileread");
}
// Write to file f.
int
filewrite(struct file *f, char *addr, int n)
{
int r;

if(f−>writable == 0)
return −1;
if(f−>type == FD_PIPE)
return pipewrite(f−>pipe, addr, n);
if(f−>type == FD_INODE){
// write a few blocks at a time to avoid exceeding
// the maximum log transaction size, including
// i−node, indirect block, allocation blocks,
// and 2 blocks of slop for non−aligned writes.
// this really belongs lower down, since writei()
// might be writing a device like the console.
int max = ((LOGSIZE−1−1−2) / 2) * 512;
int i = 0;
while(i < n){
int n1 = n − i;
if(n1 > max)
n1 = max;
begin_op();
ilock(f−>ip);
if ((r = writei(f−>ip, addr + i, f−>off, n1)) > 0)
f−>off += r;
iunlock(f−>ip);
end_op();
if(r < 0)
break;
if(r != n1)
panic("short filewrite");
i += r;
}
return i == n ? n : −1;
}
panic("filewrite");
}