The inode structure and inode table


Each time we create a file in a directory, the system simply allocates a free I-number from the file system and uses an empty slot in the correspondent directory to record this I-number and the name of the file we created. When we issue a command to delete a file, the system simply replaces the file's I-number by 0, and the slot is still occupied by its file name, although we are no longer able to access that file anymore. Therefore, if we frequently create and delete files within a directory, the (file) size of the directory becomes bigger and bigger and the I-numbers are no longer consecutive, this will slow down the file searching operation (and hence, the system performance, too) considerably. This is one of the main reasons why the system performance is gradually deteriorating. We can prevent this from happening by grouping the files which are changing constantly in a fixed file system, say /tmp, and periodically repairing the file system. The following diagram depicts the directory contents of some ext2 directory.

/* include/linux/dirent.h */
struct dirent {
	long		d_ino;
	__kernel_off_t	d_off;
	unsigned short	d_reclen;
	char		d_name[256]; /* We must not include limits.h! */
};


\begin{picture}(400, 75)
\put(45, 5){\mbox{Directory Contents}}
\put(8, 60){\m...
...par
\put(239, 35){\line(0,1){30}}
\put(239, 65){\vector(1,0){25}}
\end{picture}

An ordinary file is just a sequence of data bytes stored in some physical device without any name attached to it. The administrative information of this file, such as owner, permissions, size, times, etc., is stored in the inode structure of the file. All of the file system's inodes are collected together to form an inode table. Each file system occupies a logical disk. Starting from the $2^{nd}$ block of a logical disk, the kernel stores the inode table of the file system in a consecutive disk blocks. Each inode, an entry in the inode table, is a data structure which the system uses to store the following information about a file:

  1. Type of file (ordinary, directory or special file).
  2. Access permissions for the file owner, the owner's group members and others (i.e. the general public).
  3. Number of links.
  4. File owner's user and group IDs
  5. File size in bytes.
  6. The disk addresses of the data blocks where the contents of the file are actually stored.
  7. Time of last access (read or executed), time of last modification (i.e. written) and time which the inode itself was last changed.

The inode data structures can be found in the <linux/minix_fs.h>, <linux/ext2_fs.h>, <linux/ext3_fs.h> files and the pair of angle brackets <> represents the standard file path. In this case, it represents the prefix string: /usr/include/ in front of whatever inside the brackets.

$ ls -lia ../../exam
total 20
  994 drwxr-xr-x  3 hsu           224 Jun 20 22:32 .
    4 drwxrwxr-x 16 hsu           528 Jun 23 09:43 ..
 1311 -rw-r--r--  1 hsu          3155 Jun 20 22:30 ex.tex
 1284 -rw-r--r--  1 hsu          5389 Oct 23 22:38 exam1.tex
 1216 -rw-r--r--  1 hsu          2225 Jan  7  1994 final.tex
  541 -rw-r--r--  1 hsu          3846 Jan  1  1994 real3.tex
  497 drwxr-xr-x  2 hsu           336 Jan  5 21:44 tmp
# od does not work on directory anymore, probably because of
# the EXT2_NODUMP_FL flag in the inode structure is on.
# But, still we may make a file system on a plain file and 
# then dump it.  But, we can not dump specific subdirectory.
$ od -cd ../../exam #dump the contents of exam directory.
0000000   00994 00046  00000  00000  00000  00000  00000  00000
	342 003 .  \0 \0  \0 \0  \0 \0  \0 \0  \0 \0  \0 \0  \0
0000020   00004 11822  00000  00000  00000  00000  00000  00000
	004  \0 .   . \0  \0 \0  \0 \0  \0 \0  \0 \0  \0 \0  \0
0000040   00541 25970  27745  11827  25972  00120  00000  00000
	035 002 r   e  a   l  3   .  t   e  x  \0 \0  \0 \0  \0
0000060   00497 28020  00112  00000  00000  00000  00000  00000
	361 001 t   m  p  \0 \0  \0 \0  \0 \0  \0 \0  \0 \0  \0
0000100   01284 30821  28001  11825  25972  00120  00000  00000
	004 005 e   x  a   m  1   .  t   e  x  \0 \0  \0 \0  \0
0000120   01216 26982  24942  11884  25972  00120  00000  00000
	300 004 f   i  n   a  l   .  t   e  x  \0 \0  \0 \0  \0
0000140   01311 30821  29742  30821  00000  00000  00000  00000
	037 005 e   x  .   t  e   x \0  \0 \0  \0 \0  \0 \0  \0
0000160   00000 30821  27694  26479  00000  00000  00000  00000
	 \0  \0 e   x  .   l  o   g \0  \0 \0  \0 \0  \0 \0  \0
0000200   00000 30821  24878  30837  00000  00000  00000  00000
	 \0  \0 e   x  .   a  u   x \0  \0 \0  \0 \0  \0 \0  \0
0000220   00000 30821  25646  26998  00000  00000  00000  00000
	 \0  \0 e   x  .   d  v   i \0  \0 \0  \0 \0  \0 \0  \0
0000240   00000 30821  29742  30821  00126  00000  00000  00000
	 \0  \0 e   x  .   t  e   x  ~  \0 \0  \0 \0  \0 \0  \0
0000260   00000 25891  11896  25972  09080  00000  00000  00000
	 \0  \0 #   e  x   .  t   e  x   # \0  \0 \0  \0 \0  \0
0000300   00000 30821  28001  11825  30308  00105  00000  00000
	 \0  \0 e   x  a   m  1   .  d   v  i  \0 \0  \0 \0  \0
0000320  
$ df # report the free disk space and free inode number.
/         (/dev/dsk/0s1    ):    44696 blocks     7471 i-nodes
/usr      (/dev/dsk/0s3    ):    58698 blocks    25826 i-nodes
/usr2     (/dev/dsk/0s4    ):    26220 blocks    14520 i-nodes
/usr3     (/dev/dsk/0s5    ):    67684 blocks    24544 i-nodes
/usr4     (/dev/dsk/0s6    ):    45052 blocks    15778 i-nodes
/usr5     (/dev/dsk/0s7    ):    39296 blocks    16570 i-nodes
$ du tmp # disk usage of tmp subdirectory
36      tmp/chap2
1032    tmp/minix
1076    tmp

/* Inode structure as it appears on an inode table. */
struct dinode
{ ushort  di_mode;     /* mode and type of file   */
  short   di_nlink;    /* number of links to file */
  ushort  di_uid;      /* owner's user id         */
  ushort  di_gid;      /* owner's group id        */
  off_t   di_size;     /* number of bytes in file */
  char    di_addr[39]; /* disk block addresses    */
  char    di_gen;      /* file generation number  */
  time_t  di_atime;    /* time last accessed      */
  time_t  di_mtime;    /* time last modified      */
  time_t  di_ctime;    /* time created            */
};
/*
 * The 40 address bytes:
 *   39 used as 13 addresses of 3 bytes each.
 *   40th byte is used as a file generation number.
 */

Usually, a logical disk block on a small (in terms of storage capacity) file system consists of 1024 bytes (2 sectors) storage space. The disk block size for large file system is adjusted automatically during file system creation. If the size of a file is less than 10K (=10240) bytes, the first 10 addresses of the 13 disk block addresses are used to record the disk block numbers where the file contents are actually stored. If the size of a file is greater than 10K bytes, one single indirect disk block (which can hold 256 disk block numbers since each number occupies 4 bytes) is reserved for recording the more disk block numbers where the file contents (with addresses greater than or equal to 10240) are stored. The $11^{th}$ disk block address is used to store the block number of this single indirect block. Therefore, this address is also called the single indirect address. With the single indirect address, the file size can grow up to 266K (=272,384) bytes. If the size of a file is greater than 266K bytes, the system then reserves another (double indirect) disk block to store 256 disk block numbers of another 256 single indirect disk blocks. And each of these 256 single indirect disk blocks can hold 256 disk block numbers. With the double indirect disk block address (stored in the $12^{th}$ disk block address), a file can grow to the size of 64MB. The $13^{th}$ disk block address is used to store the triple indirect disk block number and, in theory, the maximum size of a file is about 16GB. However, the di_size field in the inode structure for recording the size of a file is 4 bytes long (32 bits). And this field is interpreted as a long integer. To represent a positive integer, only 31 bits can be used. Hence, the maximum size of a file is about 2GB. By the way, the inode number is represented by the unsigned short data type, however, usually, a file system holds about 33000 inodes.

The ext2 and ext3 file systems are the native file systems of Linux system. They are better supported. The dumpe2fs command can be used to extract file system information from these two types of file systems. For safety reason, the super blocks are duplicated several times in the whole filesystem. In case of the first super block being crashed, we still can use other backup super blocks to repair the file system. For performance reason, a certain amount of disk blocks are grouped together. Each group has its own block and inode bitmaps to keep track of allocated blocks and inodes. Each group has its own inode table, the available inodes for the whole system are evenly distributed between the block groups. For each group, the same amount of disk blocks are reserved for its inode table. All these information items may be obtained via the output of dumpe2fs command.

$ sudo dumpe2fs /dev/sda3 | grep -i superblock
dumpe2fs 1.38 (30-Jun-2005)
  Primary superblock at 0, Group descriptors at 1-1
  Backup superblock at 32768, Group descriptors at 32769-32769
  Backup superblock at 98304, Group descriptors at 98305-98305
     .
     .
  Backup superblock at 1605632, Group descriptors at 1605633-1605633
$ sudo dumpe2fs -h /dev/sda1
dumpe2fs 1.38 (30-Jun-2005)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          f15082da-6d10-45ec-9d8e-f6d6c09527e3
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      filetype sparse_super
Default mount options:    (none)
Filesystem state:         not clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              52208
Block count:              104391
Reserved block count:     5219
Free blocks:              87770
Free inodes:              52178
First block:              1
Block size:               1024
Fragment size:            1024
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         4016
Inode blocks per group:   502
Last mount time:          Sun Sep 10 13:11:28 2006
Last write time:          Sun Sep 10 13:11:28 2006
Mount count:              4
Maximum mount count:      30
Last checked:             Sat Sep  9 13:40:20 2006
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
# find files whose sizes are larger than 64MB
$ find / -size +64M 2>/dev/null >/tmp/64M 
$ more /tmp/64M
/sys/devices/pci0000:00/0000:00:0e.0/0000:04:00.0/resource1
/backup/TeXSrc/TeXSrc-2006-02-27.tgz
/proc/kcore
$ file tmp/minix/minix-fs
tmp/minix/minix-fs: Minix filesystem, 30 char names

Finally, there is one more inode structure defined in the Linux source tree (include/linux/fs.h). This is the In-Core inode, i.e. the inode structure loaded in the memory. When loading this In-Core inode, the relative disk inode information is filled in its relative fields. The In-Core inode has its own fields, such as struct inode_operations -> *i_op; Depending on whether the inode is describing a regular file, directory, symbolic link, or device special file, the (file system dependent) inode manipulation functions are registered here while the inode structure is initilized. When the inode structure is to be modified, its correspondent function is invoked. You see the object oriented programming methodology is at working even in the kernel development stage.

Figure 2.1: Hard Disk Layout
\begin{figure}\begin{picture}(450, 525)
\put(10, 15){\line(0,1){505}}
\put(10,...
...(338, 190){\mbox{\};}}
\put(430, 15){\line(0,1){505}}
\end{picture}
\end{figure}
Felix Hsu 2008-09-21