
Appendix K  Swap Management

K.1  Scanning for Free Entries

K.1.1  Function: get_swap_page

Source: mm/swapfile.c

The call graph for this function is shown in Figure 11.2. This is the high-level API function for searching the swap areas for a free swap slot and returning the resulting swp_entry_t.

 99 swp_entry_t get_swap_page(void)
100 {
101     struct swap_info_struct * p;
102     unsigned long offset;
103     swp_entry_t entry;
104     int type, wrapped = 0;
106     entry.val = 0;  /* Out of memory */
107     swap_list_lock();
108     type = swap_list.next;
109     if (type < 0)
110         goto out;
111     if (nr_swap_pages <= 0)
112         goto out;
114     while (1) {
115         p = &swap_info[type];
116         if ((p->flags & SWP_WRITEOK) == SWP_WRITEOK) {
117             swap_device_lock(p);
118             offset = scan_swap_map(p);
119             swap_device_unlock(p);
120             if (offset) {
121                 entry = SWP_ENTRY(type,offset);
122                 type = swap_info[type].next;
123                 if (type < 0 ||
124                     p->prio != swap_info[type].prio) {
125                     swap_list.next = swap_list.head;
126                 } else {
127                     swap_list.next = type;
128                 }
129                 goto out;
130             }
131         }
132         type = p->next;
133         if (!wrapped) {
134             if (type < 0 || p->prio != swap_info[type].prio) {
135                 type = swap_list.head;
136                 wrapped = 1;
137             }
138         } else
139             if (type < 0)
140                 goto out;     /* out of swap space */
141     }
142 out:
143     swap_list_unlock();
144     return entry;
145 }
107Lock the list of swap areas
108Get the next swap area that is to be used for allocating from. This list will be ordered depending on the priority of the swap areas
109-110If there are no swap areas, return an entry with val set to 0
111-112If the accounting says there are no available swap slots, return an entry with val set to 0
114-141Cycle through all swap areas
115Get the current swap info struct from the swap_info array
116If this swap area is available for writing to and is active...
117Lock the swap area
118Call scan_swap_map()(See Section K.1.2) which searches the requested swap map for a free slot
119Unlock the swap device
120-130If a slot was free...
121Encode an identifier for the entry with SWP_ENTRY()
122Record the next swap area to use
123-126If the next area is the end of the list or the priority of the next swap area does not match the current one, move back to the head
126-128Otherwise move to the next area
129Goto out
132Move to the next swap area
133-138Check for wraparound. Set wrapped to 1 if we get to the end of the list of swap areas
139-140If there were no available swap areas, goto out
142The exit to this function
143Unlock the swap area list
144Return the entry if one was found and an entry with val set to 0 otherwise
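
The rotation between swap areas of equal priority can be tried out in user space. The following is a minimal sketch, not the kernel code: struct model_area, struct model_list and model_pick_area() are hypothetical names, and a simple free_slots counter stands in for a successful call to scan_swap_map().

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for swap_info_struct and swap_list. */
struct model_area {
    int prio;        /* priority; higher is preferred */
    int next;        /* index of the next area in priority order, -1 at the end */
    int free_slots;  /* stands in for a successful scan_swap_map() */
};

struct model_list {
    int head;        /* index of the highest-priority area */
    int next;        /* area to try first on the next allocation */
};

/* Return the index of the area a slot was taken from, or -1 if none has space. */
static int model_pick_area(struct model_area *areas, struct model_list *list)
{
    int type;
    int wrapped = 0;

    assert(areas != 0 && list != 0);
    type = list->next;
    if (type < 0)
        return -1;

    for (;;) {
        struct model_area *p = &areas[type];

        if (p->free_slots > 0) {
            int chosen = type;

            p->free_slots--;
            /* Rotate among areas of equal priority, else restart at the head. */
            type = p->next;
            if (type < 0 || p->prio != areas[type].prio)
                list->next = list->head;
            else
                list->next = type;
            return chosen;
        }

        /* This area is full: move on, wrapping to the head at most once. */
        type = p->next;
        if (!wrapped) {
            if (type < 0 || p->prio != areas[type].prio) {
                type = list->head;
                wrapped = 1;
            }
        } else if (type < 0) {
            return -1;  /* every area was tried after wrapping */
        }
    }
}
```

With two areas at the same priority and a third at a lower one, successive calls alternate between the first two and only fall through to the third once both are exhausted.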

K.1.2  Function: scan_swap_map

Source: mm/swapfile.c

This function tries to allocate SWAPFILE_CLUSTER number of pages sequentially in swap. When it has allocated that many, it searches for another block of free slots of size SWAPFILE_CLUSTER. If it fails to find one, it resorts to allocating the first free slot. This clustering attempts to make sure that slots are allocated and freed in SWAPFILE_CLUSTER sized chunks.

 36 static inline int scan_swap_map(struct swap_info_struct *si)
 37 {
 38     unsigned long offset;
 47     if (si->cluster_nr) {
 48         while (si->cluster_next <= si->highest_bit) {
 49             offset = si->cluster_next++;
 50             if (si->swap_map[offset])
 51                 continue;
 52             si->cluster_nr--;
 53             goto got_page;
 54         }
 55     }

Allocate SWAPFILE_CLUSTER pages sequentially. cluster_nr is initialised to SWAPFILE_CLUSTER and decrements with each allocation

47If cluster_nr is still positive, allocate the next available sequential slot
48While the current offset to use (cluster_next) is less than or equal to the highest known free slot (highest_bit) then ...
49Record the offset and update cluster_next to the next free slot
50-51If the slot is not actually free, move to the next one
52Slot has been found, decrement the cluster_nr field
53Goto got_page, the path that performs the housekeeping for an allocated slot
 56     si->cluster_nr = SWAPFILE_CLUSTER;
 58     /* try to find an empty (even not aligned) cluster. */
 59     offset = si->lowest_bit;
 60  check_next_cluster:
 61     if (offset+SWAPFILE_CLUSTER-1 <= si->highest_bit)
 62     {
 63         int nr;
 64         for (nr = offset; nr < offset+SWAPFILE_CLUSTER; nr++)
 65             if (si->swap_map[nr])
 66             {
 67                 offset = nr+1;
 68                 goto check_next_cluster;
 69             }
 70         /* We found a completly empty cluster, so start
 71          * using it.
 72          */
 73         goto got_page;
 74     }

At this stage, SWAPFILE_CLUSTER pages have been allocated sequentially so find the next free block of SWAPFILE_CLUSTER pages.

56Re-initialise the count of sequential pages to allocate to SWAPFILE_CLUSTER
59Start searching at the lowest known free slot
61If the offset plus the cluster size is less than the known last free slot, then examine all the pages to see if this is a large free block
64Scan from offset to offset + SWAPFILE_CLUSTER
65-69If this slot is used, then start searching again for a free slot beginning after this known allocated one
73A large cluster was found so use it
 75     /* No luck, so now go finegrined as usual. -Andrea */
 76     for (offset = si->lowest_bit; offset <= si->highest_bit ;
                                offset++) {
 77         if (si->swap_map[offset])
 78             continue;
 79         si->lowest_bit = offset+1;

This extract is the start of an unusual for loop, whose body continues in the next extract. It scans for a free slot starting from lowest_bit

77-78If the slot is in use, move to the next one
79Update the lowest_bit known probable free slot to the succeeding one
 80     got_page:
 81         if (offset == si->lowest_bit)
 82             si->lowest_bit++;
 83         if (offset == si->highest_bit)
 84             si->highest_bit--;
 85         if (si->lowest_bit > si->highest_bit) {
 86             si->lowest_bit = si->max;
 87             si->highest_bit = 0;
 88         }
 89         si->swap_map[offset] = 1;
 90         nr_swap_pages--;
 91         si->cluster_next = offset+1;
 92         return offset;
 93     }
 94     si->lowest_bit = si->max;
 95     si->highest_bit = 0;
 96     return 0;
 97 }

A slot has been found, do some housekeeping and return it

81-82If this offset is the known lowest free slot(lowest_bit), increment it
83-84If this offset is the highest known likely free slot, decrement it
85-88If the low and high mark meet, the swap area is not worth searching any more because these marks represent the lowest and highest known free slots. Set the low slot to be the highest possible slot and the high mark to 0 to cut down on search time later. This will be fixed up the next time a slot is freed
89Set the reference count for the slot
90Update the accounting for the number of available swap pages (nr_swap_pages)
91Set cluster_next to the adjacent slot so the next search will start here
92Return the free slot
94-96No free slot available, mark the area unsearchable and return 0
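
The fine-grained scan and the lowest_bit/highest_bit housekeeping can be modelled outside the kernel. The sketch below is an illustration under stated assumptions: struct model_swap_area, MODEL_SLOTS and model_scan() are hypothetical names, the clustering logic is left out, and slot 0 is treated as never allocatable so that a return value of 0 can mean failure.

```c
#include <assert.h>

#define MODEL_SLOTS 8  /* hypothetical capacity of the model's map */

struct model_swap_area {
    unsigned short swap_map[MODEL_SLOTS]; /* per-slot reference counts */
    unsigned long lowest_bit;             /* lowest slot that may be free */
    unsigned long highest_bit;            /* highest slot that may be free */
    unsigned long max;                    /* one past the last usable slot */
};

/* Return the allocated slot, or 0 if the area has no free slot. */
static unsigned long model_scan(struct model_swap_area *si)
{
    unsigned long offset;

    assert(si->highest_bit < MODEL_SLOTS);

    for (offset = si->lowest_bit; offset <= si->highest_bit; offset++) {
        if (si->swap_map[offset])
            continue;                 /* slot in use, keep looking */

        /* Housekeeping for the slot that was found. */
        si->lowest_bit = offset + 1;
        if (offset == si->highest_bit)
            si->highest_bit--;
        if (si->lowest_bit > si->highest_bit) {
            /* The marks crossed: flag the area as not worth searching. */
            si->lowest_bit = si->max;
            si->highest_bit = 0;
        }
        si->swap_map[offset] = 1;
        return offset;
    }

    /* Nothing free between the marks. */
    si->lowest_bit = si->max;
    si->highest_bit = 0;
    return 0;
}
```

Once the marks have crossed, later calls skip the loop entirely because lowest_bit is greater than highest_bit, which is the cheap "unsearchable" state the commentary describes.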

K.2  Swap Cache

K.2.1  Adding Pages to the Swap Cache

K.2.1.1  Function: add_to_swap_cache

Source: mm/swap_state.c

The call graph for this function is shown in Figure 11.3. This function wraps around the normal page cache handler. It first uses swap_duplicate() to check that the swap entry is valid and to take a reference to the slot. If that succeeds, it calls add_to_page_cache_unique() to insert the page, which fails if a page for this entry is already in the swap cache.

 70 int add_to_swap_cache(struct page *page, swp_entry_t entry)
 71 {
 72     if (page->mapping)
 73         BUG();
 74     if (!swap_duplicate(entry)) {
 75         INC_CACHE_INFO(noent_race);
 76         return -ENOENT;
 77     }
 78     if (add_to_page_cache_unique(page, &swapper_space, entry.val,
 79             page_hash(&swapper_space, entry.val)) != 0) {
 80         swap_free(entry);
 81         INC_CACHE_INFO(exist_race);
 82         return -EEXIST;
 83     }
 84     if (!PageLocked(page))
 85         BUG();
 86     if (!PageSwapCache(page))
 87         BUG();
 88     INC_CACHE_INFO(add_total);
 89     return 0;
 90 }
72-73A check is made with PageSwapCache() before this function is called to make sure the page is not already in the swap cache. This check here ensures the page has no other existing mapping in case the caller was careless and did not make the check
74-77Use swap_duplicate() (See Section K.2.1.2) to try and increment the count for this entry. If the entry is no longer valid, for example because the slot was freed in the meantime, increment the statistic recording the number of races involving adding pages to the swap cache and return -ENOENT
78Try and add the page to the page cache with add_to_page_cache_unique() (See Section J.1.1.2). This function is similar to add_to_page_cache() (See Section J.1.1.1) except it searches the page cache for a duplicate entry with __find_page_nolock(). The managing address space is swapper_space. The “offset within the file” in this case is the offset within swap_map, hence entry.val and finally the page is hashed based on address_space and offset within swap_map
80-83If it already existed in the page cache, we raced so drop the reference just taken with swap_free(), increment the statistic recording the number of races to insert an existing page into the swap cache and return -EEXIST
84-85If the page is not locked for IO, it is a bug
86-87If it is not now in the swap cache, something went seriously wrong
88Increment the statistic recording the total number of pages in the swap cache
89Return success

K.2.1.2  Function: swap_duplicate

Source: mm/swapfile.c

This function verifies a swap entry is valid and if so, increments its swap map count.

1161 int swap_duplicate(swp_entry_t entry)
1162 {
1163     struct swap_info_struct * p;
1164     unsigned long offset, type;
1165     int result = 0;
1167     type = SWP_TYPE(entry);
1168     if (type >= nr_swapfiles)
1169         goto bad_file;
1170     p = type + swap_info;
1171     offset = SWP_OFFSET(entry);
1173     swap_device_lock(p);
1174     if (offset < p->max && p->swap_map[offset]) {
1175         if (p->swap_map[offset] < SWAP_MAP_MAX - 1) {
1176             p->swap_map[offset]++;
1177             result = 1;
1178         } else if (p->swap_map[offset] <= SWAP_MAP_MAX) {
1179             if (swap_overflow++ < 5)
1180                 printk(KERN_WARNING "swap_dup: swap entry
1181             p->swap_map[offset] = SWAP_MAP_MAX;
1182             result = 1;
1183         }
1184     }
1185     swap_device_unlock(p);
1186 out:
1187     return result;
1189 bad_file:
1190     printk(KERN_ERR "swap_dup: %s%08lx\n", Bad_file, entry.val);
1191     goto out;
1192 }
1161The parameter is the swap entry to increase the swap_map count for
1167-1169Get the offset within the swap_info for the swap_info_struct containing this entry. If it is greater than the number of swap areas, goto bad_file
1170-1171Get the relevant swap_info_struct and get the offset within its swap_map
1173Lock the swap device
1174Make a quick sanity check to ensure the offset is within the swap_map and that the slot indicated has a positive count. A 0 count would mean the slot is free and this is a bogus swp_entry_t
1175-1177If the count is below SWAP_MAP_MAX - 1, simply increment it and return 1 for success
1178-1183Else the count would overflow so set it to SWAP_MAP_MAX and reserve the slot permanently. In reality this condition is virtually impossible
1185-1187Unlock the swap device and return
1190-1191If a bad device was used, print out the error message and return failure
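
The counting rule can be isolated as a saturating increment. This is a sketch only: model_swap_dup() is a hypothetical helper, and the value chosen for MODEL_SWAP_MAP_MAX is an assumption made for illustration.

```c
#include <assert.h>

#define MODEL_SWAP_MAP_MAX 0x7fff  /* assumed ceiling for a slot's count */

/* Take one more reference on a slot. Returns 1 on success, 0 if the slot is unusable. */
static int model_swap_dup(unsigned short *count)
{
    assert(count != 0);

    if (*count == 0)
        return 0;                      /* free slot: the entry is bogus */

    if (*count < MODEL_SWAP_MAP_MAX - 1) {
        (*count)++;                    /* normal case */
        return 1;
    }

    if (*count <= MODEL_SWAP_MAP_MAX) {
        *count = MODEL_SWAP_MAP_MAX;   /* saturate: slot is pinned from now on */
        return 1;
    }

    return 0;                          /* above the ceiling: not a usable count */
}
```

A count that has reached the ceiling is never incremented further, so it can no longer be trusted as an exact number of users. That is why the slot is treated as permanently reserved.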

K.2.2  Deleting Pages from the Swap Cache

K.2.2.1  Function: swap_free

Source: mm/swapfile.c

Decrements the corresponding swap_map entry for the swp_entry_t

214 void swap_free(swp_entry_t entry)
215 {
216     struct swap_info_struct * p;
218     p = swap_info_get(entry);
219     if (p) {
220         swap_entry_free(p, SWP_OFFSET(entry));
221         swap_info_put(p);
222     }
223 }
218swap_info_get() (See Section K.2.3.1) fetches the correct swap_info_struct and performs a number of debugging checks to ensure it is a valid area and a valid swap_map entry. If all is sane, it will lock the swap device
219-222If it is valid, the corresponding swap_map entry is decremented with swap_entry_free() (See Section K.2.2.2) and swap_info_put() (See Section K.2.3.2) called to free the device

K.2.2.2  Function: swap_entry_free

Source: mm/swapfile.c

192 static int swap_entry_free(struct swap_info_struct *p, 
                 unsigned long offset)
193 {
194     int count = p->swap_map[offset];
196     if (count < SWAP_MAP_MAX) {
197         count--;
198         p->swap_map[offset] = count;
199         if (!count) {
200             if (offset < p->lowest_bit)
201                 p->lowest_bit = offset;
202             if (offset > p->highest_bit)
203                 p->highest_bit = offset;
204             nr_swap_pages++;
205         }
206     }
207     return count;
208 }
194Get the current count
196If the count indicates the slot is not permanently reserved then..
197-198Decrement the count and store it in the swap_map
199If the count reaches 0, the slot is free so update some information
200-201If this freed slot is below lowest_bit, update lowest_bit which indicates the lowest known free slot
202-203Similarly, update the highest_bit if this newly freed slot is above it
204Increment the count indicating the number of free swap slots
207Return the current count
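
The effect of a free on the lowest_bit and highest_bit marks is easiest to see with an area that was previously marked unsearchable. The following user-space sketch assumes hypothetical names (struct model_free_area, model_entry_free()) and an assumed value for MODEL_SWAP_MAP_MAX. It mirrors the logic above rather than reproducing the kernel code.

```c
#include <assert.h>

#define MODEL_SWAP_MAP_MAX 0x7fff  /* assumed ceiling; such slots stay reserved */
#define MODEL_SLOTS 8              /* hypothetical capacity of the model's map */

struct model_free_area {
    unsigned short swap_map[MODEL_SLOTS]; /* per-slot reference counts */
    unsigned long lowest_bit;             /* lowest slot that may be free */
    unsigned long highest_bit;            /* highest slot that may be free */
    long free_slots;                      /* stands in for nr_swap_pages */
};

/* Drop one reference on a slot and return the resulting count. */
static int model_entry_free(struct model_free_area *p, unsigned long offset)
{
    int count;

    /* The caller is expected to have validated the slot already. */
    assert(offset < MODEL_SLOTS && p->swap_map[offset] > 0);

    count = p->swap_map[offset];
    if (count < MODEL_SWAP_MAP_MAX) {
        count--;
        p->swap_map[offset] = (unsigned short)count;
        if (!count) {
            /* Slot is free again: widen the search window if needed. */
            if (offset < p->lowest_bit)
                p->lowest_bit = offset;
            if (offset > p->highest_bit)
                p->highest_bit = offset;
            p->free_slots++;
        }
    }
    return count;
}
```

Starting from lowest_bit set to the maximum and highest_bit set to 0, the first slot to become free pulls both marks onto itself, which is the "fix up" that makes the area searchable again.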

K.2.3  Acquiring/Releasing Swap Cache Pages

K.2.3.1  Function: swap_info_get

Source: mm/swapfile.c

This function finds the swap_info_struct for the given entry, performs some basic checking and then locks the device.

147 static struct swap_info_struct * swap_info_get(swp_entry_t entry)
148 {
149     struct swap_info_struct * p;
150     unsigned long offset, type;
152     if (!entry.val)
153         goto out;
154     type = SWP_TYPE(entry);
155     if (type >= nr_swapfiles)
156         goto bad_nofile;
157     p = & swap_info[type];
158     if (!(p->flags & SWP_USED))
159         goto bad_device;
160     offset = SWP_OFFSET(entry);
161     if (offset >= p->max)
162         goto bad_offset;
163     if (!p->swap_map[offset])
164         goto bad_free;
165     swap_list_lock();
166     if (p->prio > swap_info[swap_list.next].prio)
167         swap_list.next = type;
168     swap_device_lock(p);
169     return p;
171 bad_free:
172     printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, 
173     goto out;
174 bad_offset:
175     printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, 
176     goto out;
177 bad_device:
178     printk(KERN_ERR "swap_free: %s%08lx\n", Unused_file, 
179     goto out;
180 bad_nofile:
181     printk(KERN_ERR "swap_free: %s%08lx\n", Bad_file, 
182 out:
183     return NULL;
184 } 
152-153If the supplied entry is NULL, return
154Get the offset within the swap_info array
155-156Ensure it is a valid area
157Get the address of the area
158-159If the area is not active yet, print a bad device error and return
160Get the offset within the swap_map
161-162Make sure the offset is not after the end of the map
163-164Make sure the slot is currently in use
165Lock the swap area list
166-167If this area is of higher priority than the area that would be next, ensure the current area is used
168-169Lock the swap device and return the swap area descriptor
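
A swp_entry_t packs the index of the swap area and the offset within its swap_map into a single value. The real SWP_TYPE(), SWP_OFFSET() and SWP_ENTRY() macros are architecture-specific, so the layout below is purely hypothetical and only illustrates the idea. All names beginning with model_ or MODEL_ are invented for this sketch.

```c
#include <assert.h>

/* Hypothetical layout: the low 8 bits hold the area index, the rest the offset. */
#define MODEL_TYPE_BITS 8
#define MODEL_TYPE_MASK ((1UL << MODEL_TYPE_BITS) - 1)

typedef struct {
    unsigned long val;
} model_swp_entry_t;

/* Pack an area index and a slot offset into one entry. */
static model_swp_entry_t model_entry(unsigned long type, unsigned long offset)
{
    model_swp_entry_t e;

    assert(type <= MODEL_TYPE_MASK);
    e.val = (offset << MODEL_TYPE_BITS) | type;
    return e;
}

/* Recover the index into the array of swap areas. */
static unsigned long model_type(model_swp_entry_t e)
{
    return e.val & MODEL_TYPE_MASK;
}

/* Recover the offset within that area's swap_map. */
static unsigned long model_offset(model_swp_entry_t e)
{
    return e.val >> MODEL_TYPE_BITS;
}
```

In this layout, area 0 with offset 0 encodes to a val of 0. That does not clash with the "no entry" check at the top of the function as long as slot 0 of an area is never handed out, which is what the model assumes.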

K.2.3.2  Function: swap_info_put

Source: mm/swapfile.c

This function simply unlocks the area and list

186 static void swap_info_put(struct swap_info_struct * p)
187 {
188     swap_device_unlock(p);
189     swap_list_unlock();
190 }
188Unlock the device
189Unlock the swap area list

K.2.4  Searching the Swap Cache

K.2.4.1  Function: lookup_swap_cache

Source: mm/swap_state.c

Top level function for finding a page in the swap cache

161 struct page * lookup_swap_cache(swp_entry_t entry)
162 {
163     struct page *found;
165     found = find_get_page(&swapper_space, entry.val);
166     /*
167      * Unsafe to assert PageSwapCache and mapping on page found:
168      * if SMP nothing prevents swapoff from deleting this page from
169      * the swap cache at this moment.  find_lock_page would prevent
170      * that, but no need to change: we _have_ got the right page.
171      */
172     INC_CACHE_INFO(find_total);
173     if (found)
174         INC_CACHE_INFO(find_success);
175     return found;
176 }
165find_get_page()(See Section J.1.4.1) is the principal function for returning the struct page. It uses the normal page hashing and cache functions for quickly finding it
172Increase the statistic recording the number of times a page was searched for in the cache
173-174If one was found, increment the successful find count
175Return the struct page or NULL if it did not exist

K.3  Swap Area IO

K.3.1  Reading Backing Storage

K.3.1.1  Function: read_swap_cache_async

Source: mm/swap_state.c

This function will return the requested page from the swap cache if it is there. If it does not exist, a page will be allocated, placed in the swap cache and the data scheduled to be read from disk with rw_swap_page().

184 struct page * read_swap_cache_async(swp_entry_t entry)
185 {
186     struct page *found_page, *new_page = NULL;
187     int err;
189     do {
196         found_page = find_get_page(&swapper_space, entry.val);
197         if (found_page)
198             break;
200         /*
201          * Get a new page to read into from swap.
202          */
203         if (!new_page) {
204             new_page = alloc_page(GFP_HIGHUSER);
205             if (!new_page)
206                 break;          /* Out of memory */
207         }
209         /*
210          * Associate the page with swap entry in the swap cache.
211          * May fail (-ENOENT) if swap entry has been freed since
212          * our caller observed it.  May fail (-EEXIST) if there
213          * is already a page associated with this entry in the
214          * swap cache: added by a racing read_swap_cache_async,
215          * or by try_to_swap_out (or shmem_writepage) re-using
216          * the just freed swap entry for an existing page.
217          */
218         err = add_to_swap_cache(new_page, entry);
219         if (!err) {
220             /*
221              * Initiate read into locked page and return.
222              */
223             rw_swap_page(READ, new_page);
224             return new_page;
225         }
226     } while (err != -ENOENT);
228     if (new_page)
229         page_cache_release(new_page);
230     return found_page;
231 }
189Loop in case add_to_swap_cache() fails to add a page to the swap cache
196First search the swap cache with find_get_page()(See Section J.1.4.1) to see if the page is already available. Ordinarily, lookup_swap_cache() (See Section K.2.4.1) would be called but it updates statistics (such as the number of cache searches) so find_get_page() (See Section J.1.4.1) is called directly
203-207If the page is not in the swap cache and we have not allocated one yet, allocate one with alloc_page()
218Add the newly allocated page to the swap cache with add_to_swap_cache() (See Section K.2.1.1)
223Schedule the data to be read with rw_swap_page()(See Section K.3.3.1). The page will be returned locked and will be unlocked when IO completes
224Return the new page
226Loop until add_to_swap_cache() succeeds or another process successfully inserts the page into the swap cache
228-229This is either the error path or another process added the page to the swap cache for us. If a new page was allocated, free it with page_cache_release() (See Section J.1.3.2)
230Return either the page found in the swap cache or NULL if there was an error

K.3.2  Writing Backing Storage

K.3.2.1  Function: swap_writepage

Source: mm/swap_state.c

This is the function registered in swap_aops for writing out pages. Its function is pretty simple. First it calls remove_exclusive_swap_page() to try and free the page. If the page was freed, then the page will be unlocked here before returning as there is no IO pending on the page. Otherwise rw_swap_page() is called to sync the page with backing storage.

 24 static int swap_writepage(struct page *page)
 25 {
 26     if (remove_exclusive_swap_page(page)) {
 27         UnlockPage(page);
 28         return 0;
 29     }
 30     rw_swap_page(WRITE, page);
 31     return 0;
 32 }
26-29remove_exclusive_swap_page()(See Section K.3.2.2) will reclaim the page from the swap cache if possible. If the page is reclaimed, unlock it before returning
30Otherwise the page is still in the swap cache so synchronise it with backing storage by calling rw_swap_page() (See Section K.3.3.1)

K.3.2.2  Function: remove_exclusive_swap_page

Source: mm/swapfile.c

This function tries to work out if there are other processes sharing this page or not. If possible, the page will be removed from the swap cache and freed. Once removed from the swap cache, swap_free() is called to decrement the slot's count, indicating that the swap cache is no longer using the slot. The count will instead reflect the number of PTEs that contain a swp_entry_t for this slot.

287 int remove_exclusive_swap_page(struct page *page)
288 {
289     int retval;
290     struct swap_info_struct * p;
291     swp_entry_t entry;
293     if (!PageLocked(page))
294         BUG();
295     if (!PageSwapCache(page))
296         return 0;
297     if (page_count(page) - !!page->buffers != 2) /* 2: us + cache */
298         return 0;
300     entry.val = page->index;
301     p = swap_info_get(entry);
302     if (!p)
303         return 0;
305     /* Is the only swap cache user the cache itself? */
306     retval = 0;
307     if (p->swap_map[SWP_OFFSET(entry)] == 1) {
308         /* Recheck the page count with the pagecache lock held.. */
309         spin_lock(&pagecache_lock);
310         if (page_count(page) - !!page->buffers == 2) {
311             __delete_from_swap_cache(page);
312             SetPageDirty(page);
313             retval = 1;
314         }
315         spin_unlock(&pagecache_lock);
316     }
317     swap_info_put(p);
319     if (retval) {
320         block_flushpage(page, 0);
321         swap_free(entry);
322         page_cache_release(page);
323     }
325     return retval;
326 }
293-294This operation should only be made with the page locked
295-296If the page is not in the swap cache, then there is nothing to do
297-298If there are other users of the page, then it cannot be reclaimed so return
300The swp_entry_t for the page is stored in page->index as explained in Section 2.4
301Get the swap_info_struct with swap_info_get() (See Section K.2.3.1)
307If the only user of the swap slot is the swap cache itself (i.e. no process is mapping it), then delete this page from the swap cache to free the slot. Later the swap slot usage count will be decremented as the swap cache is no longer using it
310If the current user is the only user of this page, then it is safe to remove from the swap cache. If another process is sharing it, it must remain here
311Delete from the swap cache
313Set retval to 1 so that the caller knows the page was freed and so that swap_free() (See Section K.2.2.1) will be called to decrement the usage count in the swap_map
317Drop the reference to the swap slot that was taken with swap_info_get() (See Section K.2.3.1)
320The slot is being freed, so call block_flushpage() so that all IO will complete and any buffers associated with the page will be freed
321Free the swap slot with swap_free()
322Drop the reference to the page
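
The test at lines 297 and 310 relies on a small C idiom: applying ! twice turns any non-NULL pointer into 1 and NULL into 0. The subtraction therefore discounts one reference when buffers are attached to the page, which suggests that attached buffers account for exactly one reference. A minimal sketch of the idiom, using a hypothetical struct model_page rather than the kernel's struct page:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down page descriptor for illustration only. */
struct model_page {
    int count;      /* reference count */
    void *buffers;  /* non-NULL when buffers are attached */
};

/*
 * Return 1 if the only references are the caller's and the swap cache's,
 * ignoring the one reference assumed to be held on behalf of the buffers.
 */
static int model_only_cache_and_caller(const struct model_page *page)
{
    assert(page != NULL);
    /* !!ptr evaluates to 1 for a non-NULL pointer and 0 for NULL. */
    return page->count - !!page->buffers == 2;
}
```

A page with a count of 3 and buffers attached therefore passes the test, while a page with a count of 3 and no buffers has an additional user and does not.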

K.3.2.3  Function: free_swap_and_cache

Source: mm/swapfile.c

This function frees an entry from the swap cache and tries to reclaim the page. Note that this function only applies to the swap cache.

332 void free_swap_and_cache(swp_entry_t entry)
333 {
334     struct swap_info_struct * p;
335     struct page *page = NULL;
337     p = swap_info_get(entry);
338     if (p) {
339         if (swap_entry_free(p, SWP_OFFSET(entry)) == 1)
340             page = find_trylock_page(&swapper_space, entry.val);
341         swap_info_put(p);
342     }
343     if (page) {
344         page_cache_get(page);
345         /* Only cache user (+us), or swap space full? Free it! */
346         if (page_count(page) - !!page->buffers == 2 || vm_swap_full()) {
347             delete_from_swap_cache(page);
348             SetPageDirty(page);
349         }
350         UnlockPage(page);
351         page_cache_release(page);
352     }
353 }
337Get the swap_info struct for the requested entry
338-342Presuming the swap area information struct exists, call swap_entry_free() to drop a reference to the swap entry. If the returned count is 1, meaning the swap cache is the only remaining user of the slot, the page for the entry is located in the swap cache using find_trylock_page(). Note that the page is returned locked
341Drop the reference taken to the swap info struct at line 337
343-352If the page was located then we try to reclaim it
344Take a reference to the page so it will not be freed prematurely
346-349The page is deleted from the swap cache if there are no processes mapping the page or if the swap area is more than 50% full (Checked by vm_swap_full())
350Unlock the page again
351Drop the local reference to the page taken at line 344
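
The "more than 50% full" condition can be expressed as a comparison between twice the number of free swap pages and the total. The sketch below assumes that this is how the check is formed. model_swap_full() and its parameters are hypothetical names, not the kernel's vm_swap_full().

```c
#include <assert.h>

/*
 * Return 1 if more than half of the swap space is in use, i.e. if fewer
 * than half of the total pages are still free.
 */
static int model_swap_full(long nr_free_pages, long total_pages)
{
    assert(nr_free_pages >= 0 && total_pages >= nr_free_pages);
    return nr_free_pages * 2 < total_pages;
}
```

With 100 pages in total, 40 free pages count as full while exactly 50 free pages do not, so the boundary case of precisely half full leaves the page in the swap cache.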

K.3.3  Block IO

K.3.3.1  Function: rw_swap_page

Source: mm/page_io.c

This is the main function used for reading data from backing storage into a page or writing data from a page to backing storage. Which operation it performs depends on the first parameter rw. It is basically a wrapper function around the core function rw_swap_page_base(). This simply enforces that the operations are only performed on pages in the swap cache.

 85 void rw_swap_page(int rw, struct page *page)
 86 {
 87     swp_entry_t entry;
 89     entry.val = page->index;
 91     if (!PageLocked(page))
 92         PAGE_BUG(page);
 93     if (!PageSwapCache(page))
 94         PAGE_BUG(page);
 95     if (!rw_swap_page_base(rw, entry, page))
 96         UnlockPage(page);
 97 }
85rw indicates whether a read or write is taking place
89Get the swp_entry_t from the index field
91-92If the page is not locked for IO, it is a bug
93-94If the page is not in the swap cache, it is a bug
95Call the core function rw_swap_page_base(). If it returns failure, the page is unlocked with UnlockPage() so it can be freed

K.3.3.2  Function: rw_swap_page_base

Source: mm/page_io.c

This is the core function for reading or writing data to the backing storage. Whether it is writing to a partition or a file, the block layer brw_page() function is used to perform the actual IO. This function sets up the necessary buffer information for the block layer to do its job. brw_page() performs asynchronous IO so it is likely it will return with the page locked, which will be unlocked when the IO completes.

 36 static int rw_swap_page_base(int rw, swp_entry_t entry, 
                                 struct page *page)
 37 {
 38     unsigned long offset;
 39     int zones[PAGE_SIZE/512];
 40     int zones_used;
 41     kdev_t dev = 0;
 42     int block_size;
 43     struct inode *swapf = 0;
 45     if (rw == READ) {
 46         ClearPageUptodate(page);
 47         kstat.pswpin++;
 48     } else
 49         kstat.pswpout++;
36The parameters are:
rw indicates whether the operation is a read or a write
entry is the swap entry for locating the data in backing storage
page is the page that is being read or written to
39zones is a parameter required by the block layer for brw_page(). It is expected to contain an array of block numbers that are to be written to. This is primarily of importance when the backing storage is a file rather than a partition
45-47If the page is to be read from disk, clear the Uptodate flag as the page is obviously not up to date if we are reading information from the disk. Increment the pages swapped in (pswpin) statistic
49Else just update the pages swapped out (pswpout) statistic
 51     get_swaphandle_info(entry, &offset, &dev, &swapf);
 52     if (dev) {
 53         zones[0] = offset;
 54         zones_used = 1;
 55         block_size = PAGE_SIZE;
 56     } else if (swapf) {
 57         int i, j;
 58         unsigned int block = 
 59              offset << (PAGE_SHIFT - swapf->i_sb->s_blocksize_bits);
 61         block_size = swapf->i_sb->s_blocksize;
 62         for (i=0, j=0; j< PAGE_SIZE ; i++, j += block_size)
 63             if (!(zones[i] = bmap(swapf,block++))) {
 64                 printk("rw_swap_page: bad swap file\n");
 65                 return 0;
 66             }
 67         zones_used = i;
 68         dev = swapf->i_dev;
 69     } else {
 70         return 0;
 71     }
 73     /* block_size == PAGE_SIZE/zones_used */
 74     brw_page(rw, page, dev, zones, block_size);
 75     return 1;
 76 }
51get_swaphandle_info()(See Section K.3.3.3) returns either the kdev_t or struct inode that represents the swap area, whichever is appropriate
52-55If the storage area is a partition, then there is only one block to be written which is the size of a page. Hence, zones only has one entry which is the offset within the partition to be written and the block_size is PAGE_SIZE
56Else it is a swap file so each of the blocks in the file that make up the page has to be mapped with bmap() before calling brw_page()
58-59Calculate what the starting block is
61The size of individual block is stored in the superblock information for the filesystem the file resides on
62-66Call bmap() for every block that makes up the full page. Each block is stored in the zones array for passing to brw_page(). If any block fails to be mapped, 0 is returned
67Record how many blocks make up the page in zones_used
68Record which device is being written to
74Call brw_page() from the block layer to schedule the IO to occur. This function returns immediately as the IO is asynchronous. When the IO is completed, a callback function (end_buffer_io_async()) is called which unlocks the page. Any process waiting on the page will be woken up at that point
75Return success
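
The block calculation for swap files is a matter of shifts. The sketch below works it through in user space under the assumption of 4KiB pages (a shift of 12). The MODEL_ constants and both helper functions are hypothetical names introduced for illustration.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12                       /* assumes 4KiB pages */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* First filesystem block backing the swap page at the given page offset. */
static unsigned long model_first_block(unsigned long offset,
                                       unsigned int blocksize_bits)
{
    assert(blocksize_bits <= MODEL_PAGE_SHIFT);
    return offset << (MODEL_PAGE_SHIFT - blocksize_bits);
}

/* Number of filesystem blocks needed to hold one page. */
static unsigned int model_blocks_per_page(unsigned int blocksize_bits)
{
    assert(blocksize_bits <= MODEL_PAGE_SHIFT);
    return (unsigned int)(MODEL_PAGE_SIZE >> blocksize_bits);
}
```

For a filesystem with 1KiB blocks (blocksize_bits of 10), the page at offset 3 starts at block 12 and spans 4 blocks, each of which would need its own bmap() lookup. With 512 byte blocks a page spans 8 blocks, which matches the PAGE_SIZE/512 entries reserved in the zones array under the same page size assumption.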

K.3.3.3  Function: get_swaphandle_info

Source: mm/swapfile.c

This function is responsible for returning either the kdev_t or struct inode that is managing the swap area that entry belongs to.

1197 void get_swaphandle_info(swp_entry_t entry, unsigned long *offset, 
1198                         kdev_t *dev, struct inode **swapf)
1199 {
1200     unsigned long type;
1201     struct swap_info_struct *p;
1203     type = SWP_TYPE(entry);
1204     if (type >= nr_swapfiles) {
1205         printk(KERN_ERR "rw_swap_page: %s%08lx\n", Bad_file, 
1206         return;
1207     }
1209     p = &swap_info[type];
1210     *offset = SWP_OFFSET(entry);
1211     if (*offset >= p->max && *offset != 0) {
1212         printk(KERN_ERR "rw_swap_page: %s%08lx\n", Bad_offset, 
1213         return;
1214     }
1215     if (p->swap_map && !p->swap_map[*offset]) {
1216         printk(KERN_ERR "rw_swap_page: %s%08lx\n", Unused_offset, 
1217         return;
1218     }
1219     if (!(p->flags & SWP_USED)) {
1220         printk(KERN_ERR "rw_swap_page: %s%08lx\n", Unused_file, 
1221         return;
1222     }
1224     if (p->swap_device) {
1225         *dev = p->swap_device;
1226     } else if (p->swap_file) {
1227         *swapf = p->swap_file->d_inode;
1228     } else {
1229         printk(KERN_ERR "rw_swap_page: no swap file or device\n");
1230     }
1231     return;
1232 }
1203Extract which area within swap_info this entry belongs to
1204-1206If the index is for an area that does not exist, then print out an information message and return. Bad_file is a static array declared near the top of mm/swapfile.c that says “Bad swap file entry”
1209Get the swap_info_struct from swap_info
1210Extract the offset within the swap area for this entry
1211-1214Make sure the offset is not after the end of the file. Print out the message in Bad_offset if it is
1215-1218If the offset is currently not being used, it means that entry is a stale entry so print out the error message in Unused_offset
1219-1222If the swap area is currently not active, print out the error message in Unused_file
1224-1225If the swap area is a device, return the kdev_t in swap_info_struct→swap_device
1226-1227If it is a swap file, return the struct inode which is available via swap_info_struct→swap_file→d_inode
1229Else there is no swap file or device for this entry so print out the error message and return
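
The order of the sanity checks can be modelled in user space. What follows is a minimal sketch rather than the kernel's code: the bit layout of the entry (type in bits 1 to 6, offset from bit 8 upwards) is assumed from the i386 definitions, and struct area, make_entry(), check_entry() and the ENTRY_* codes are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed i386-style layout: bit 0 unused, bits 1-6 type, bits 8+ offset. */
typedef struct { unsigned long val; } swp_entry_t;
#define SWP_TYPE(x)   (((x).val >> 1) & 0x3f)
#define SWP_OFFSET(x) ((x).val >> 8)

/* Hypothetical helper that packs a type and offset into an entry. */
swp_entry_t make_entry(unsigned long type, unsigned long offset)
{
    swp_entry_t e;

    assert(type < 64);              /* only 6 bits are available */
    e.val = (type << 1) | (offset << 8);
    return e;
}

/* Hypothetical stand-in for the swap_info_struct fields used here. */
struct area {
    int used;                       /* models the SWP_USED flag */
    unsigned long max;              /* number of slots in the area */
    const unsigned short *swap_map; /* per-slot reference counts */
};

enum entry_status {
    ENTRY_OK, ENTRY_BAD_FILE, ENTRY_BAD_OFFSET,
    ENTRY_UNUSED_OFFSET, ENTRY_UNUSED_FILE
};

/* Applies the checks in the same order as get_swaphandle_info(). */
enum entry_status check_entry(const struct area *areas,
                              unsigned long nr_areas,
                              swp_entry_t entry, unsigned long *offset)
{
    unsigned long type = SWP_TYPE(entry);
    const struct area *p;

    if (type >= nr_areas)
        return ENTRY_BAD_FILE;
    p = &areas[type];
    *offset = SWP_OFFSET(entry);
    if (*offset >= p->max && *offset != 0)
        return ENTRY_BAD_OFFSET;
    if (p->swap_map && !p->swap_map[*offset])
        return ENTRY_UNUSED_OFFSET;
    if (!p->used)
        return ENTRY_UNUSED_FILE;
    return ENTRY_OK;
}
```

Note that the offset is written through the pointer before it is validated, just as in the kernel function, so a caller should only trust it when the result is ENTRY_OK.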

K.4  Activating a Swap Area

K.4.1  Function: sys_swapon

Source: mm/swapfile.c

This rather large function is responsible for activating swap space. Broadly speaking, the tasks it performs are annotated stage by stage below.

855 asmlinkage long sys_swapon(const char * specialfile, 
                               int swap_flags)
856 {
857       struct swap_info_struct * p;
858       struct nameidata nd;
859       struct inode * swap_inode;
860       unsigned int type;
861       int i, j, prev;
862       int error;
863       static int least_priority = 0;
864       union swap_header *swap_header = 0;
865       int swap_header_version;
866       int nr_good_pages = 0;
867       unsigned long maxpages = 1;
868       int swapfilesize;
869       struct block_device *bdev = NULL;
870       unsigned short *swap_map;
872       if (!capable(CAP_SYS_ADMIN))
873         return -EPERM;
874       lock_kernel();
875       swap_list_lock();
876       p = swap_info;
855The two parameters are the path to the swap area and the flags for activation
872-873The activating process must have the CAP_SYS_ADMIN capability or be the superuser to activate a swap area
874Acquire the Big Kernel Lock
875Lock the list of swap areas
876Get the first swap area in the swap_info array
877       for (type = 0 ; type < nr_swapfiles ; type++,p++)
878         if (!(p->flags & SWP_USED))
879           break;
880       error = -EPERM;
881       if (type >= MAX_SWAPFILES) {
882         swap_list_unlock();
883         goto out;
884       }
885       if (type >= nr_swapfiles)
886         nr_swapfiles = type+1;
887       p->flags = SWP_USED;
888       p->swap_file = NULL;
889       p->swap_vfsmnt = NULL;
890       p->swap_device = 0;
891       p->swap_map = NULL;
892       p->lowest_bit = 0;
893       p->highest_bit = 0;
894       p->cluster_nr = 0;
895       p->sdev_lock = SPIN_LOCK_UNLOCKED;
896       p->next = -1;
897       if (swap_flags & SWAP_FLAG_PREFER) {
898         p->prio =
899           (swap_flags & SWAP_FLAG_PRIO_MASK)>>SWAP_FLAG_PRIO_SHIFT;
900       } else {
901         p->prio = --least_priority;
902       }
903       swap_list_unlock();

Find a free swap_info_struct and initialise it with default values

877-879Cycle through the swap_info until a struct is found that is not in use
880By default the error returned is Permission Denied which indicates the caller did not have the proper permissions or too many swap areas are already in use
881-884If no struct was free, MAX_SWAPFILES areas have already been activated so unlock the swap list and return
885-886If the selected swap area is after the last known active area (nr_swapfiles), then update nr_swapfiles
887Set the flag indicating the area is in use
888-896Initialise fields to default values
897-902If the caller has specified a priority, use it. Otherwise, decrement least_priority and use the new value. This way, areas without an explicit priority are prioritised in order of activation
903Release the swap list lock
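
The priority selection on lines 897-902 can be sketched in isolation. This is an illustration only: the flag values are assumed to match those in include/linux/swap.h, and pick_priority() is a hypothetical helper that does not exist in the kernel.

```c
#include <assert.h>

/* Flag layout, assumed from <linux/swap.h>. */
#define SWAP_FLAG_PREFER     0x8000  /* caller supplied a priority */
#define SWAP_FLAG_PRIO_MASK  0x7fff
#define SWAP_FLAG_PRIO_SHIFT 0

/* Hypothetical helper mirroring lines 897-902 of sys_swapon(). */
int pick_priority(int swap_flags, int *least_priority)
{
    assert(least_priority != 0);
    if (swap_flags & SWAP_FLAG_PREFER)
        return (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
    /* No explicit priority: each new area ranks below the previous one. */
    return --(*least_priority);
}
```

With least_priority starting at 0, two areas activated without a priority would receive -1 and -2, while an area activated with SWAP_FLAG_PREFER and priority 5 would receive 5 and leave least_priority untouched.
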
904       error = user_path_walk(specialfile, &nd);
905       if (error)
906         goto bad_swap_2;
908       p->swap_file = nd.dentry;
909       p->swap_vfsmnt = nd.mnt;
910       swap_inode = nd.dentry->d_inode;
911       error = -EINVAL;

Traverse the VFS and get some information about the special file

904user_path_walk() traverses the directory structure to obtain a nameidata structure describing the specialfile
905-906If it failed, return failure
908Fill in the swap_file field with the returned dentry
909Similarly, fill in the swap_vfsmnt
910Record the inode of the special file
911Now the default error is -EINVAL indicating that the special file was found but it was not a block device or a regular file
913       if (S_ISBLK(swap_inode->i_mode)) {
914         kdev_t dev = swap_inode->i_rdev;
915         struct block_device_operations *bdops;
916         devfs_handle_t de;
918         p->swap_device = dev;
919         set_blocksize(dev, PAGE_SIZE);
921         bd_acquire(swap_inode);
922         bdev = swap_inode->i_bdev;
923         de = devfs_get_handle_from_inode(swap_inode);
924         bdops = devfs_get_ops(de);
925         if (bdops) bdev->bd_op = bdops;
927         error = blkdev_get(bdev, FMODE_READ|FMODE_WRITE, 0,
                               BDEV_SWAP);
928         devfs_put_ops(de);/* Decrement module use count 
                               * now we're safe*/
929         if (error)
930           goto bad_swap_2;
931         set_blocksize(dev, PAGE_SIZE);
932         error = -ENODEV;
933         if (!dev || (blk_size[MAJOR(dev)] &&
934          !blk_size[MAJOR(dev)][MINOR(dev)]))
935           goto bad_swap;
936         swapfilesize = 0;
937         if (blk_size[MAJOR(dev)])
938           swapfilesize = blk_size[MAJOR(dev)][MINOR(dev)]
939             >> (PAGE_SHIFT - 10);
940       } else if (S_ISREG(swap_inode->i_mode))
941         swapfilesize = swap_inode->i_size >> PAGE_SHIFT;
942       else
943         goto bad_swap;

If a partition, configure the block device before calculating the size of the area, else obtain it from the inode for the file.

913Check if the special file is a block device
914-939This code segment handles the case where the swap area is a partition
914Record the kdev_t device identifier of the block device
918Store the device identifier in swap_device, which will be needed for block IO operations
919Set the block size on the device to be PAGE_SIZE as it will be page sized chunks swap is interested in
921The bd_acquire() function increments the usage count for this block device
922Get a pointer to the block_device structure which is a descriptor for the device file which is needed to open it
923Get a devfs handle if it is enabled. devfs is beyond the scope of this book
924-925Increment the usage count of this device entry
927Open the block device in read/write mode and set the BDEV_SWAP flag which is an enumerated type but is ignored when do_open() is called
928Decrement the use count of the devfs entry
929-930If an error occurred on open, return failure
931Set the block size again
932After this point, the default error is to indicate no device could be found
933-935Ensure the returned device is ok
937-939Calculate the size of the swap file as the number of page sized chunks that exist in the block device as indicated by blk_size. The size of the swap area is calculated to make sure the information in the swap area is sane
941If the swap area is a regular file, obtain the size directly from the inode and calculate how many page sized chunks exist
943If the file is not a block device or regular file, return error
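
The two size calculations differ only in the unit of the input: blk_size reports 1KiB blocks while i_size is in bytes. A minimal sketch of the arithmetic follows, with the page shift passed as a parameter; both function names are hypothetical.

```c
#include <assert.h>

/* Block device: size is given in 1KiB blocks, as in blk_size[][]. */
unsigned long pages_from_kib(unsigned long size_kib, unsigned int page_shift)
{
    assert(page_shift >= 10);   /* a page holds at least one 1KiB block */
    return size_kib >> (page_shift - 10);
}

/* Regular file: size is given in bytes, as in inode->i_size. */
unsigned long pages_from_bytes(unsigned long long size_bytes,
                               unsigned int page_shift)
{
    assert(page_shift < 64);
    return (unsigned long)(size_bytes >> page_shift);
}
```

Assuming 4KiB pages (a page shift of 12), an 8MiB partition and an 8MiB file both yield 2048 page-sized slots. Any partial page at the end is discarded by the shift.
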
945       error = -EBUSY;
946       for (i = 0 ; i < nr_swapfiles ; i++) {
947         struct swap_info_struct *q = &swap_info[i];
948         if (i == type || !q->swap_file)
949           continue;
950         if (swap_inode->i_mapping ==
                q->swap_file->d_inode->i_mapping)
951           goto bad_swap;
952       }
954       swap_header = (void *) __get_free_page(GFP_USER);
955       if (!swap_header) {
956         printk("Unable to start swapping: out of memory :-)\n");
957         error = -ENOMEM;
958         goto bad_swap;
959       }
961       lock_page(virt_to_page(swap_header));
962       rw_swap_page_nolock(READ, SWP_ENTRY(type,0), 
            (char *) swap_header);
964       if (!memcmp("SWAP-SPACE",swap_header->magic.magic,10))
965         swap_header_version = 1;
966       else if (!memcmp("SWAPSPACE2",swap_header->magic.magic,10))
967         swap_header_version = 2;
968       else {
969         printk("Unable to find swap-space signature\n");
970         error = -EINVAL;
971         goto bad_swap;
972       }
945The next check makes sure the area is not already active. If it is, the error -EBUSY will be returned
946-952Read through the whole swap_info array and ensure the area to be activated is not already active
954-959Allocate a page for reading the swap area information from disk
961The function lock_page() locks a page and makes sure it is synced with disk if it is file backed. In this case, it'll just mark the page as locked which is required for the rw_swap_page_nolock() function
962Read the first page slot in the swap area into swap_header
964-972Determine the version from the magic string in the swap area header and set swap_header_version accordingly. If the signature cannot be identified, return -EINVAL
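
The signature test can be tried outside the kernel on a raw copy of the first page of a swap area. The sketch below assumes 4KiB pages and that the magic string occupies the last 10 bytes of the page, which is what the memset() on line 976 implies; detect_header_version() is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096  /* assumption: 4KiB pages */

/* Returns 1 or 2 for a recognised signature and 0 otherwise. */
int detect_header_version(const unsigned char *page)
{
    const unsigned char *magic;

    assert(page != NULL);
    magic = page + PAGE_SIZE - 10;  /* signature fills the last 10 bytes */
    if (!memcmp("SWAP-SPACE", magic, 10))
        return 1;
    if (!memcmp("SWAPSPACE2", magic, 10))
        return 2;
    return 0;
}
```

Returning 0 here corresponds to the path where the kernel prints “Unable to find swap-space signature” and fails with -EINVAL.
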
974       switch (swap_header_version) {
975       case 1:
976         memset(((char *) swap_header)+PAGE_SIZE-10,0,10);
977         j = 0;
978         p->lowest_bit = 0;
979         p->highest_bit = 0;
980         for (i = 1 ; i < 8*PAGE_SIZE ; i++) {
981           if (test_bit(i,(char *) swap_header)) {
982             if (!p->lowest_bit)
983                   p->lowest_bit = i;
984             p->highest_bit = i;
985             maxpages = i+1;
986             j++;
987           }
988         }
989         nr_good_pages = j;
990         p->swap_map = vmalloc(maxpages * sizeof(short));
991         if (!p->swap_map) {
992           error = -ENOMEM;        
993           goto bad_swap;
994         }
995         for (i = 1 ; i < maxpages ; i++) {
996           if (test_bit(i,(char *) swap_header))
997             p->swap_map[i] = 0;
998           else
999             p->swap_map[i] = SWAP_MAP_BAD;
1000         }
1001         break;

Read in the information needed to populate the swap_map when the swap area is version 1.

976Zero out the magic string identifying the version of the swap area
978-979Initialise fields in swap_info_struct to 0
980-988A bitmap with 8*PAGE_SIZE entries is stored in the swap area. The full page, minus the 10 bytes used for the magic string, describes the swap map, which limits swap areas to just under 128MiB in size (assuming 4KiB pages). If a bit is set to 1, the corresponding slot on disk is available. This pass counts how many slots are available so that a swap_map may be allocated
981Test if the bit for this slot is set
982-983If the lowest_bit field is not yet set, set it to this slot. In most cases, lowest_bit will be initialised to 1
984As long as new slots are found, keep updating the highest_bit
985Record maxpages as one more than the highest usable slot found so far
986j is the count of good pages in the area
990Allocate memory for the swap_map with vmalloc()
991-994If memory could not be allocated, return -ENOMEM
995-1000For each slot, check if the slot is “good”. If yes, initialise the slot count to 0, else set it to SWAP_MAP_BAD so it will not be used
1001Exit the switch statement
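
The first pass over the version 1 bitmap can be sketched as follows. This is an illustration rather than kernel code: bit_is_set() is a hypothetical byte-wise stand-in for test_bit() that matches it on little-endian machines, and struct v1_summary simply groups the values the kernel stores in separate variables.

```c
#include <assert.h>
#include <limits.h>

struct v1_summary {
    unsigned long lowest_bit;   /* first usable slot, 0 if none */
    unsigned long highest_bit;  /* last usable slot, 0 if none */
    unsigned long maxpages;     /* one past the last usable slot */
    unsigned long nr_good;      /* number of usable slots */
};

/* Hypothetical portable stand-in for the kernel's test_bit(). */
int bit_is_set(unsigned long nr, const unsigned char *bitmap)
{
    return (bitmap[nr / CHAR_BIT] >> (nr % CHAR_BIT)) & 1;
}

/* Mirrors the loop on lines 980-988 of sys_swapon(). */
struct v1_summary scan_v1_bitmap(const unsigned char *bitmap,
                                 unsigned long nbits)
{
    struct v1_summary s = { 0, 0, 1, 0 };
    unsigned long i;

    assert(bitmap != 0);
    for (i = 1; i < nbits; i++) {   /* slot 0 holds the header */
        if (!bit_is_set(i, bitmap))
            continue;
        if (!s.lowest_bit)
            s.lowest_bit = i;
        s.highest_bit = i;
        s.maxpages = i + 1;
        s.nr_good++;
    }
    return s;
}
```

Because maxpages tracks the highest usable slot rather than the count of usable slots, the swap_map allocated from it also covers any bad slots in between, which the second pass marks with SWAP_MAP_BAD.
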
1003       case 2:
1006         if (swap_header->info.version != 1) {
1007           printk(KERN_WARNING
1008            "Unable to handle swap header version %d\n",
1009            swap_header->info.version);
1010           error = -EINVAL;
1011           goto bad_swap;
1012         }
1014         p->lowest_bit  = 1;
1015         maxpages = SWP_OFFSET(SWP_ENTRY(0,~0UL)) - 1;
1016         if (maxpages > swap_header->info.last_page)
1017           maxpages = swap_header->info.last_page;
1018         p->highest_bit = maxpages - 1;
1020         error = -EINVAL;
1021         if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES)
1022           goto bad_swap;
1025         if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) {
1026           error = -ENOMEM;
1027           goto bad_swap;
1028         }
1030         error = 0;
1031         memset(p->swap_map, 0, maxpages * sizeof(short));
1032         for (i=0; i<swap_header->info.nr_badpages; i++) {
1033           int page = swap_header->info.badpages[i];
1034           if (page <= 0 || 
             page >= swap_header->info.last_page)
1035             error = -EINVAL;
1036           else
1037             p->swap_map[page] = SWAP_MAP_BAD;
1038         }
1039         nr_good_pages = swap_header->info.last_page -
1040             swap_header->info.nr_badpages -
1041             1 /* header page */;
1042         if (error) 
1043           goto bad_swap;
1044       }

Read the header information when the file format is version 2

1006-1012Make absolutely sure we can handle this swap file format and return -EINVAL if we cannot. Remember that with this version, the swap_header struct is placed nicely on disk
1014Initialise lowest_bit to the known lowest available slot
1015-1017Initially set maxpages to the maximum possible size of a swap_map, then reduce it to the size indicated by the information on disk if that is smaller. This ensures the swap_map array is not accidentally overrun
1018Initialise highest_bit
1020-1022Make sure the number of bad pages that exist does not exceed MAX_SWAP_BADPAGES
1025-1028Allocate memory for the swap_map with vmalloc()
1031Initialise the full swap_map to 0 indicating all slots are available
1032-1038Using the information loaded from disk, set each slot that is unusable to SWAP_MAP_BAD
1039-1041Calculate the number of available good pages
1042-1043Return if an error occurred
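
The handling of the bad page list in a version 2 header can be sketched as below. The sketch assumes that last_page did not need clamping, so the map holds exactly last_page entries, and that SWAP_MAP_BAD has the value 0x8000; build_v2_map() is a hypothetical helper.

```c
#include <assert.h>
#include <string.h>

#define SWAP_MAP_BAD 0x8000   /* assumed value marking an unusable slot */

/*
 * Hypothetical helper mirroring lines 1031-1043. Returns the number of
 * good pages, or -1 if any bad page index is out of range.
 */
long build_v2_map(unsigned short *swap_map, unsigned long last_page,
                  const unsigned int *badpages, unsigned int nr_badpages)
{
    unsigned int i;
    int error = 0;

    assert(swap_map != NULL && last_page > 0);
    memset(swap_map, 0, last_page * sizeof(unsigned short));
    for (i = 0; i < nr_badpages; i++) {
        unsigned int page = badpages[i];

        if (page == 0 || page >= last_page)
            error = 1;      /* slot 0 is the header, the rest overflow */
        else
            swap_map[page] = SWAP_MAP_BAD;
    }
    if (error)
        return -1;
    return (long)last_page - (long)nr_badpages - 1;  /* minus header page */
}
```

As in the kernel, a bad index does not stop the loop early. The remaining entries are still processed and the error is only acted on afterwards.
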
1046       if (swapfilesize && maxpages > swapfilesize) {
1047         printk(KERN_WARNING
1048          "Swap area shorter than signature indicates\n");
1049         error = -EINVAL;
1050         goto bad_swap;
1051       }
1052       if (!nr_good_pages) {
1053         printk(KERN_WARNING "Empty swap-file\n");
1054         error = -EINVAL;
1055         goto bad_swap;
1056       }
1057       p->swap_map[0] = SWAP_MAP_BAD;
1058       swap_list_lock();
1059       swap_device_lock(p);
1060       p->max = maxpages;
1061       p->flags = SWP_WRITEOK;
1062       p->pages = nr_good_pages;
1063       nr_swap_pages += nr_good_pages;
1064       total_swap_pages += nr_good_pages;
1065       printk(KERN_INFO "Adding Swap: %dk swap-space (priority %d)\n",
1066        nr_good_pages<<(PAGE_SHIFT-10), p->prio);
1046-1051Ensure the information loaded from disk matches the actual dimensions of the swap area. If they do not match, print a warning and return an error
1052-1056If no good pages were available, return an error
1057Make sure the first page in the map containing the swap header information is not used. If it was, the header information would be overwritten the first time this area was used
1058-1059Lock the swap list and the swap device
1060-1062Fill in the remaining fields in the swap_info_struct
1063-1064Update global statistics for the number of available swap pages (nr_swap_pages) and the total number of swap pages (total_swap_pages)
1065-1066Print an informational message about the swap activation
1068       /* insert swap space into swap_list: */
1069       prev = -1;
1070       for (i = swap_list.head; i >= 0; i = swap_info[i].next) {
1071         if (p->prio >= swap_info[i].prio) {
1072           break;
1073         }
1074         prev = i;
1075       }
1076       p->next = i;
1077       if (prev < 0) {
1078         swap_list.head = swap_list.next = p - swap_info;
1079       } else {
1080         swap_info[prev].next = p - swap_info;
1081       }
1082       swap_device_unlock(p);
1083       swap_list_unlock();
1084       error = 0;
1085       goto out;
1070-1080Insert the new swap area into the correct slot in the swap list based on priority
1082Unlock the swap device
1083Unlock the swap list
1084-1085Return success
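
The swap list is a singly linked list threaded through an array by index rather than by pointer. The following sketch shows the priority-ordered insertion under simplifying assumptions: struct area and struct area_list are hypothetical reductions of swap_info_struct and swap_list, MAX_AREAS stands in for MAX_SWAPFILES, and the update of the next-to-allocate index is omitted.

```c
#include <assert.h>

#define MAX_AREAS 8   /* hypothetical; stands in for MAX_SWAPFILES */

struct area {
    int prio;   /* higher values are used first */
    int next;   /* index of the next area, -1 at the end */
};

struct area_list {
    int head;                     /* index of first area, -1 if empty */
    struct area info[MAX_AREAS];  /* models swap_info[] */
};

/* Link info[idx] into the list, highest priority first. */
void list_insert(struct area_list *l, int idx)
{
    int i, prev = -1;

    assert(idx >= 0 && idx < MAX_AREAS);
    for (i = l->head; i >= 0; i = l->info[i].next) {
        if (l->info[idx].prio >= l->info[i].prio)
            break;              /* insert in front of area i */
        prev = i;
    }
    l->info[idx].next = i;
    if (prev < 0)
        l->head = idx;
    else
        l->info[prev].next = idx;
}
```

Because the comparison is >=, an area inserted with the same priority as an existing one is placed in front of it.
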
1086 bad_swap:
1087       if (bdev)
1088         blkdev_put(bdev, BDEV_SWAP);
1089 bad_swap_2:
1090       swap_list_lock();
1091       swap_map = p->swap_map;
1092       nd.mnt = p->swap_vfsmnt;
1093       nd.dentry = p->swap_file;
1094       p->swap_device = 0;
1095       p->swap_file = NULL;
1096       p->swap_vfsmnt = NULL;
1097       p->swap_map = NULL;
1098       p->flags = 0;
1099       if (!(swap_flags & SWAP_FLAG_PREFER))
1100         ++least_priority;
1101       swap_list_unlock();
1102       if (swap_map)
1103         vfree(swap_map);
1104       path_release(&nd);
1105 out:
1106       if (swap_header)
1107         free_page((long) swap_header);
1108       unlock_kernel();
1109       return error;
1110 }
1087-1088Drop the reference to the block device
1090-1104This is the error path. The swap list is locked while the slot in swap_info is reset to being unused, then the list is unlocked and the memory allocated for swap_map is freed if it was assigned
1104Drop the reference to the special file
1106-1107Release the page containing the swap header information as it is no longer needed
1108Drop the Big Kernel Lock
1109Return the error or success value

K.4.2  Function: swap_setup

Source: mm/swap.c

This function is called during the initialisation of kswapd to set the size of page_cluster. This variable determines how many pages are read ahead from files and from backing storage when paging in data.

100 void __init swap_setup(void)
101 {
102     unsigned long megs = num_physpages >> (20 - PAGE_SHIFT);
104     /* Use a smaller cluster for small-memory machines */
105     if (megs < 16)
106         page_cluster = 2;
107     else
108         page_cluster = 3;
109     /*
110      * Right now other parts of the system means that we
111      * _really_ don't want to cluster much more
112      */
113 }
102Calculate how much memory the system has in megabytes
105In low memory systems, set page_cluster to 2 which means that, at most, 4 pages will be paged in from disk during readahead
108Else readahead 8 pages
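
The relationship between memory size, page_cluster and the readahead window can be written out as a pair of pure functions. This is a sketch only; choose_page_cluster() and readahead_pages() are hypothetical names and the page shift is passed in rather than taken from the architecture.

```c
#include <assert.h>

/* Hypothetical pure version of swap_setup(): returns page_cluster. */
int choose_page_cluster(unsigned long num_physpages, unsigned int page_shift)
{
    unsigned long megs;

    assert(page_shift <= 20);   /* pages are no larger than 1MiB */
    megs = num_physpages >> (20 - page_shift);
    return megs < 16 ? 2 : 3;
}

/* Readahead window, in pages, for a given page_cluster. */
unsigned long readahead_pages(int page_cluster)
{
    assert(page_cluster >= 0);
    return 1UL << page_cluster;
}
```

Assuming 4KiB pages, a machine with 2048 physical pages (8MiB) would get a page_cluster of 2, while one with 4096 pages (16MiB) would get 3.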

K.5  Deactivating a Swap Area

K.5.1  Function: sys_swapoff

Source: mm/swapfile.c

This function is principally concerned with updating the swap_info_struct and the swap lists. The main task of paging in all pages in the area is the responsibility of try_to_unuse(). Broadly, the function proceeds in the stages annotated below.

720 asmlinkage long sys_swapoff(const char * specialfile)
721 {
722     struct swap_info_struct * p = NULL;
723     unsigned short *swap_map;
724     struct nameidata nd;
725     int i, type, prev;
726     int err;
728     if (!capable(CAP_SYS_ADMIN))
729         return -EPERM;
731     err = user_path_walk(specialfile, &nd);
732     if (err)
733         goto out;
728-729Only the superuser or a process with CAP_SYS_ADMIN capabilities may deactivate an area
731-733Acquire information about the special file representing the swap area with user_path_walk(). Goto out if an error occurred
735     lock_kernel();
736     prev = -1;
737     swap_list_lock();
738     for (type = swap_list.head; type >= 0; 
         type = swap_info[type].next) {
739         p = swap_info + type;
740         if ((p->flags & SWP_WRITEOK) == SWP_WRITEOK) {
741             if (p->swap_file == nd.dentry)
742               break;
743         }
744         prev = type;
745     }
746     err = -EINVAL;
747     if (type < 0) {
748         swap_list_unlock();
749         goto out_dput;
750     }
752     if (prev < 0) {
753         swap_list.head = p->next;
754     } else {
755         swap_info[prev].next = p->next;
756     }
757     if (type == swap_list.next) {
758         /* just pick something that's safe... */
759         swap_list.next = swap_list.head;
760     }
761     nr_swap_pages -= p->pages;
762     total_swap_pages -= p->pages;
763     p->flags = SWP_USED;

Acquire the BKL, find the swap_info_struct for the area to be deactivated and remove it from the swap list.

735Acquire the BKL
737Lock the swap list
738-745Traverse the swap list and find the swap_info_struct for the requested area. Use the dentry to identify the area
747-750If the struct could not be found, return
752-760Remove the area from the swap list, updating the list head if this area was the head and making sure the next area to allocate from is still valid
761Update the total number of free swap slots
762Update the total number of existing swap slots
763Mark the area as still in use but no longer writable
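
Removing an area from the index-linked swap list can be sketched as follows. The kernel identifies the area by comparing dentries; this illustration identifies it by index instead. struct area, struct area_list and list_remove() are hypothetical names, with the next field of the list modelling the area the allocator will try next.

```c
#include <assert.h>

#define MAX_AREAS 8   /* hypothetical; stands in for MAX_SWAPFILES */

struct area {
    int next;   /* index of the next area, -1 at the end */
};

struct area_list {
    int head;   /* first area in priority order, -1 if empty */
    int next;   /* area the allocator will try next */
    struct area info[MAX_AREAS];
};

/* Unlink info[idx]. Returns 0 on success, -1 if idx is not on the list. */
int list_remove(struct area_list *l, int idx)
{
    int type, prev = -1;

    assert(idx >= 0 && idx < MAX_AREAS);
    for (type = l->head; type >= 0; type = l->info[type].next) {
        if (type == idx)
            break;
        prev = type;
    }
    if (type < 0)
        return -1;
    if (prev < 0)
        l->head = l->info[idx].next;
    else
        l->info[prev].next = l->info[idx].next;
    if (l->next == idx)
        l->next = l->head;      /* just pick something that is safe */
    return 0;
}
```

The unlinked entry keeps its own next index, which is harmless because it is overwritten if the area is ever reinserted.
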
764     swap_list_unlock();
765     unlock_kernel();
766     err = try_to_unuse(type);
764Unlock the swap list
765Release the BKL
766Page in all pages from this swap area
767     lock_kernel();
768     if (err) {
769         /* re-insert swap space back into swap_list */
770         swap_list_lock();
771         for (prev = -1, i = swap_list.head; 
                 i >= 0; 
                 prev = i, i = swap_info[i].next)
772             if (p->prio >= swap_info[i].prio)
773                 break;
774         p->next = i;
775         if (prev < 0)
776             swap_list.head = swap_list.next = p - swap_info;
777         else
778             swap_info[prev].next = p - swap_info;
779         nr_swap_pages += p->pages;
780         total_swap_pages += p->pages;
781         p->flags = SWP_WRITEOK;
782         swap_list_unlock();
783         goto out_dput;
784     }

Acquire the BKL. If we failed to page in all pages, then reinsert the area into the swap list

767Acquire the BKL
770Lock the swap list
771-778Reinsert the area into the swap list. The position it is inserted at depends on the swap area priority
779-780Update the global statistics
781Mark the area as safe to write to again
782-783Unlock the swap list and return
785     if (p->swap_device)
786         blkdev_put(p->swap_file->d_inode->i_bdev, BDEV_SWAP);
787     path_release(&nd);
789     swap_list_lock();
790     swap_device_lock(p);
791     nd.mnt = p->swap_vfsmnt;
792     nd.dentry = p->swap_file;
793     p->swap_vfsmnt = NULL;
794     p->swap_file = NULL;
795     p->swap_device = 0;
796     p->max = 0;
797     swap_map = p->swap_map;
798     p->swap_map = NULL;
799     p->flags = 0;
800     swap_device_unlock(p);
801     swap_list_unlock();
802     vfree(swap_map);
803     err = 0;
805 out_dput:
806     unlock_kernel();
807     path_release(&nd);
808 out:
809     return err;
810 }

Otherwise, the swap area was successfully deactivated, so close the block device and mark the swap_info_struct free

785-786Close the block device
787Release the path information
789-790Acquire the swap list and swap device lock
791-799Reset the fields in swap_info_struct to default values
800-801Release the swap list and swap device
802Free the memory used for the swap_map
806Release the BKL
807Release the path information in the event we reached here via the error path
809Return success or failure

K.5.2  Function: try_to_unuse

Source: mm/swapfile.c

This function is heavily commented in the source code, although parts of the commentary are speculative or slightly inaccurate. Most of the comments are omitted here for brevity.

513 static int try_to_unuse(unsigned int type)
514 {
515     struct swap_info_struct * si = &swap_info[type];
516     struct mm_struct *start_mm;
517     unsigned short *swap_map;
518     unsigned short swcount;
519     struct page *page;
520     swp_entry_t entry;
521     int i = 0;
522     int retval = 0;
523     int reset_overflow = 0;
540     start_mm = &init_mm;
541     atomic_inc(&init_mm.mm_users);
540-541The starting mm_struct to page in pages for is init_mm. The count is incremented even though this particular struct will not disappear to prevent having to write special cases in the remainder of the function
556     while ((i = find_next_to_unuse(si, i))) {
557         /* 
558          * Get a page for the entry, using the existing swap
559          * cache page if there is one.  Otherwise, get a clean
560          * page and read the swap into it. 
561          */
562         swap_map = &si->swap_map[i];
563         entry = SWP_ENTRY(type, i);
564         page = read_swap_cache_async(entry);
565         if (!page) {
572             if (!*swap_map)
573                 continue;
574             retval = -ENOMEM;
575             break;
576         }
578         /*
579          * Don't hold on to start_mm if it looks like exiting.
580          */
581         if (atomic_read(&start_mm->mm_users) == 1) {
582             mmput(start_mm);
583             start_mm = &init_mm;
584             atomic_inc(&init_mm.mm_users);
585         }
556This is the beginning of the major loop in this function. Starting from the beginning of the swap_map, it searches for the next entry to be freed with find_next_to_unuse() until all swap map entries have been paged in
562-564Get the swp_entry_t and call read_swap_cache_async() (See Section K.3.1.1) to find the page in the swap cache or have a new page allocated for reading in from the disk
565-576If we failed to get the page, it means the slot has already been freed independently by another process or thread (process could be exiting elsewhere) or we are out of memory. If independently freed, we continue to the next map, else we return -ENOMEM
581Check to make sure this mm is not exiting. If it is, decrement its count and go back to init_mm
587         /*
588          * Wait for and lock page.  When do_swap_page races with
589          * try_to_unuse, do_swap_page can handle the fault much
590          * faster than try_to_unuse can locate the entry.  This
591          * apparently redundant "wait_on_page" lets try_to_unuse
592          * defer to do_swap_page in such a case - in some tests,
593          * do_swap_page and try_to_unuse repeatedly compete.
594          */
595         wait_on_page(page);
596         lock_page(page);
598         /*
599          * Remove all references to entry, without blocking.
600          * Whenever we reach init_mm, there's no address space
601          * to search, but use it as a reminder to search shmem.
602          */
603         shmem = 0;
604         swcount = *swap_map;
605         if (swcount > 1) {
606             flush_page_to_ram(page);
607             if (start_mm == &init_mm)
608                 shmem = shmem_unuse(entry, page);
609             else
610                 unuse_process(start_mm, entry, page);
611         }
595Wait on the page to complete IO. Once it returns, we know for a fact the page exists in memory with the same information as that on disk
596Lock the page
604Get the swap map reference count
605If the count is greater than 1 then...
606As the page is about to be inserted into process page tables, it must be flushed from the D-Cache or the process may not “see” changes made to the page by the kernel
607-608If we are using the init_mm, call shmem_unuse() (See Section L.6.2) which will free the page from any shared memory regions that are in use
610Else update the PTE in the current mm which references this page
612         if (*swap_map > 1) {
613             int set_start_mm = (*swap_map >= swcount);
614             struct list_head *p = &start_mm->mmlist;
615             struct mm_struct *new_start_mm = start_mm;
616             struct mm_struct *mm;
618             spin_lock(&mmlist_lock);
619             while (*swap_map > 1 &&
620                 (p = p->next) != &start_mm->mmlist) {
621                 mm = list_entry(p, struct mm_struct, mmlist);
622                 swcount = *swap_map;
623                 if (mm == &init_mm) {
624                     set_start_mm = 1;
625                     spin_unlock(&mmlist_lock);
626                     shmem = shmem_unuse(entry, page);
627                     spin_lock(&mmlist_lock);
628                 } else
629                     unuse_process(mm, entry, page);
630                 if (set_start_mm && *swap_map < swcount) {
631                     new_start_mm = mm;
632                     set_start_mm = 0;
633                 }
634             }
635             atomic_inc(&new_start_mm->mm_users);
636             spin_unlock(&mmlist_lock);
637             mmput(start_mm);
638             start_mm = new_start_mm;
639         }
612-637If an entry still exists, begin traversing through all mm_structs finding references to this page and update the respective PTE
618Lock the mm list
619-632Keep searching until all mm_structs have been found. Do not traverse the full list more than once
621Get the mm_struct for this list entry
623-629Call shmem_unuse() (See Section L.6.2) if the mm is init_mm as that indicates that it is a page from the virtual filesystem. Else call unuse_process() (See Section K.5.3) to traverse the current process's page tables searching for the swap entry. If found, the entry will be freed and the page reinstantiated in the PTE
630-633Record if we need to start searching mm_structs starting from init_mm again
654         if (*swap_map == SWAP_MAP_MAX) {
655             swap_list_lock();
656             swap_device_lock(si);
657             nr_swap_pages++;
658             *swap_map = 1;
659             swap_device_unlock(si);
660             swap_list_unlock();
661             reset_overflow = 1;
662         }
654If the swap map entry is permanently mapped, we have to hope that all processes have had their PTEs updated to point to the page and that the swap map entry is in fact free. In practice, it is highly unlikely a slot would be permanently reserved in the first place
655-661Lock the list and swap device, increment the count of free swap pages, set the swap map entry to 1, unlock them again and record that a reset overflow occurred
683         if ((*swap_map > 1) && PageDirty(page) &&
                PageSwapCache(page)) {
684             rw_swap_page(WRITE, page);
685             lock_page(page);
686         }
687         if (PageSwapCache(page)) {
688             if (shmem)
689                 swap_duplicate(entry);
690             else
691                 delete_from_swap_cache(page);
692         }
683-686In the very rare event a reference still exists to the page, write the page back to disk so at least if another process really has a reference to it, it'll copy the page back in from disk correctly
687-689If the page is in the swap cache and belongs to the shared memory filesystem, a new reference is taken to it with swap_duplicate() so we can try and remove it again later with shmem_unuse()
691Else, for normal pages, just delete them from the swap cache
699         SetPageDirty(page);
700         UnlockPage(page);
701         page_cache_release(page);
699Mark the page dirty so that the swap out code will preserve the page and if it needs to remove it again, it'll write it correctly to a new swap area
700Unlock the page
701Release our reference to it in the page cache
708         if (current->need_resched)
709             schedule();
710     }
712     mmput(start_mm);
713     if (reset_overflow) {
714         printk(KERN_WARNING "swapoff: cleared swap entry overflow\n");
715         swap_overflow = 0;
716     }
717     return retval;
718 }
708-709Call schedule() if necessary so the deactivation of swap does not hog the entire CPU
712Drop our reference to the mm
713-716If a permanently mapped page had to be removed, then print out a warning so that in the very unlikely event an error occurs later, there will be a hint to what might have happened
717Return success or failure

K.5.3  Function: unuse_process

Source: mm/swapfile.c

This function begins the page table walk required to remove the requested page and entry from the process page tables managed by mm. This is only required when a swap area is being deactivated so, while expensive, it is a very rare operation. This set of functions should be instantly recognisable as a standard page-table walk.

454 static void unuse_process(struct mm_struct * mm,
455                         swp_entry_t entry, struct page* page)
456 {
457     struct vm_area_struct* vma;
459     /*
460      * Go through process' page directory.
461      */
462     spin_lock(&mm->page_table_lock);
463     for (vma = mm->mmap; vma; vma = vma->vm_next) {
464         pgd_t * pgd = pgd_offset(mm, vma->vm_start);
465         unuse_vma(vma, pgd, entry, page);
466     }
467     spin_unlock(&mm->page_table_lock);
468     return;
469 }
462Lock the process page tables
463Move through every VMA managed by this mm. Remember that one page frame could be mapped in multiple locations
464Get the PGD managing the beginning of this VMA
465Call unuse_vma()(See Section K.5.4) to search the VMA for the page
467-468The full mm has been searched so unlock the process page tables and return

K.5.4  Function: unuse_vma

Source: mm/swapfile.c

This function searches the requested VMA for page table entries mapping the page and using the given swap entry. It calls unuse_pgd() for every PGD this VMA maps.

440 static void unuse_vma(struct vm_area_struct * vma, pgd_t *pgdir,
441                         swp_entry_t entry, struct page* page)
442 {
443     unsigned long start = vma->vm_start, end = vma->vm_end;
445     if (start >= end)
446         BUG();
447     do {
448         unuse_pgd(vma, pgdir, start, end - start, entry, page);
449         start = (start + PGDIR_SIZE) & PGDIR_MASK;
450         pgdir++;
451     } while (start && (start < end));
452 }
443Get the virtual addresses for the start and end of the VMA
445-446Check that the start is not after the end. There would need to be serious braindamage in the kernel for this to occur
447-451Walk through the VMA in PGDIR_SIZE-sized strides until the end of the VMA is reached. This effectively walks through every PGD that maps portions of this VMA
448Call unuse_pgd() (See Section K.5.5) to walk through just this PGD to unmap page
449Move the virtual address start to the beginning of the next PGD
450Move pgdir to the next PGD in the VMA
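
The stride arithmetic on lines 447-451 can be tried out in userspace. The following is a minimal sketch, not kernel code: it assumes the 32-bit x86 layout without PAE (PGDIR_SHIFT of 22), uses uint32_t to stand in for a 32-bit unsigned long, and the helper name count_pgd_strides() is hypothetical. It also shows why the loop tests start for zero: a VMA ending in the last PGD would otherwise wrap around the address space.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: 32-bit x86 without PAE, where one PGD entry maps 4 MiB. */
#define PGDIR_SHIFT 22
#define PGDIR_SIZE  ((uint32_t)1 << PGDIR_SHIFT)
#define PGDIR_MASK  (~(PGDIR_SIZE - 1))

/*
 * Hypothetical helper: count how many PGD entries the loop in unuse_vma()
 * would visit for the virtual address range [start, end).
 */
static unsigned int count_pgd_strides(uint32_t start, uint32_t end)
{
    unsigned int visited = 0;

    assert(start < end);        /* stands in for the BUG() check */
    do {
        visited++;              /* unuse_pgd() would be called here */
        /* Round up to the first address covered by the next PGD entry. */
        start = (start + PGDIR_SIZE) & PGDIR_MASK;
    } while (start && (start < end));   /* start == 0 means we wrapped */

    return visited;
}
```

A VMA that fits inside one PGD entry is visited once, a VMA straddling a 4 MiB boundary is visited twice, and a VMA in the topmost PGD entry terminates through the start == 0 test rather than looping forever.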

K.5.5  Function: unuse_pgd

Source: mm/swapfile.c

This function searches the requested PGD for page table entries mapping the page and using the given swap entry. It calls unuse_pmd() for every PMD this PGD maps.

409 static inline void unuse_pgd(struct vm_area_struct * vma, pgd_t *dir,
410         unsigned long address, unsigned long size,
411         swp_entry_t entry, struct page* page)
412 {
413     pmd_t * pmd;
414     unsigned long offset, end;
416     if (pgd_none(*dir))
417         return;
418     if (pgd_bad(*dir)) {
419         pgd_ERROR(*dir);
420         pgd_clear(dir);
421         return;
422     }
423     pmd = pmd_offset(dir, address);
424     offset = address & PGDIR_MASK;
425     address &= ~PGDIR_MASK;
426     end = address + size;
427     if (end > PGDIR_SIZE)
428         end = PGDIR_SIZE;
429     if (address >= end)
430         BUG();
431     do {
432         unuse_pmd(vma, pmd, address, end - address, offset, entry,
433                   page);
434         address = (address + PMD_SIZE) & PMD_MASK;
435         pmd++;
436     } while (address && (address < end));
437 }
416-417If there is no PGD here, return
418-422If the PGD is bad, report the error with pgd_ERROR(), clear the PGD and return. There are very few architectures where this condition can occur
423Get the address of the first PMD in this PGD
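424Set offset to the PGD-aligned part of address, which is the virtual address where the region mapped by this PGD begins. It is passed down so that the lower levels can reconstruct the full virtual address
425Mask address down to its offset within this PGD. Remember that the first time this function is called, it might be searching a partial PGD, so this offset is not necessarily 0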
426Calculate the end address of the search
427-428If the end is beyond this PGD, set the end just to the end of this PGD
429-430If the starting address is after the end address, something is very seriously wrong
431-436Step through the PGD in PMD_SIZE-sized strides and call unuse_pmd() (See Section K.5.6) for every PMD in this PGD
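
The split performed on lines 424-428 can be illustrated in isolation. This is a sketch under assumptions rather than the kernel's code: it again assumes PGDIR_SHIFT of 22 with a 32-bit address, and struct pgd_span and split_pgd_range() are hypothetical names introduced only for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: 32-bit x86 without PAE. */
#define PGDIR_SHIFT 22
#define PGDIR_SIZE  ((uint32_t)1 << PGDIR_SHIFT)
#define PGDIR_MASK  (~(PGDIR_SIZE - 1))

/* Hypothetical record of what unuse_pgd() computes before its loop. */
struct pgd_span {
    uint32_t offset;    /* PGD-aligned base virtual address */
    uint32_t address;   /* start of the search, relative to that base */
    uint32_t end;       /* end of the search, relative to that base */
};

static struct pgd_span split_pgd_range(uint32_t address, uint32_t size)
{
    struct pgd_span s;

    s.offset  = address & PGDIR_MASK;   /* line 424 */
    s.address = address & ~PGDIR_MASK;  /* line 425 */
    s.end     = s.address + size;       /* line 426 */
    if (s.end > PGDIR_SIZE)             /* lines 427-428: clamp to this PGD */
        s.end = PGDIR_SIZE;
    assert(s.address < s.end);          /* stands in for BUG() on 429-430 */
    return s;
}
```

For a search starting one page below a 4 MiB boundary and spanning two pages, the end is clamped to PGDIR_SIZE, and offset plus address always gives back the original virtual address.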

K.5.6  Function: unuse_pmd

Source: mm/swapfile.c

This function searches the requested PMD for page table entries mapping the page and using the given swap entry. It calls unuse_pte() for every PTE this PMD maps.

381 static inline void unuse_pmd(struct vm_area_struct * vma, pmd_t *dir,
382      unsigned long address, unsigned long size, unsigned long offset,
383      swp_entry_t entry, struct page* page)
384 {
385     pte_t * pte;
386     unsigned long end;
388     if (pmd_none(*dir))
389         return;
390     if (pmd_bad(*dir)) {
391         pmd_ERROR(*dir);
392         pmd_clear(dir);
393         return;
394     }
395     pte = pte_offset(dir, address);
396     offset += address & PMD_MASK;
397     address &= ~PMD_MASK;
398     end = address + size;
399     if (end > PMD_SIZE)
400         end = PMD_SIZE;
401     do {
402         unuse_pte(vma, offset+address-vma->vm_start, pte, entry, page);
403         address += PAGE_SIZE;
404         pte++;
405     } while (address && (address < end));
406 }
388-389Return if no PMD exists
390-394If the PMD is bad, report the error with pmd_ERROR(), clear the PMD and return. There are very few architectures where this condition can occur
395Calculate the starting PTE for this address
396Add the PMD-aligned part of address to offset so that offset becomes the virtual address where the region mapped by this PMD begins
397Mask address down to its offset within this PMD
398-400Calculate the end address. If it is beyond the end of this PMD, set it to the end of this PMD
401-405Step through this PMD in PAGE_SIZE-sized chunks and call unuse_pte() (See Section K.5.7) for each PTE
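
To see how offset+address-vma->vm_start on line 402 ends up as the offset of the page within the VMA, the arithmetic can be replayed in userspace. The sketch below is illustrative only: it assumes the x86 PAE values (PAGE_SHIFT of 12, PMD_SHIFT of 21), uses uint32_t for addresses, and pmd_page_offsets() is a hypothetical helper that records the value unuse_pte() would receive for each page.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: x86 with PAE, where one PMD entry maps 2 MiB. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  ((uint32_t)1 << PAGE_SHIFT)
#define PMD_SHIFT  21
#define PMD_SIZE   ((uint32_t)1 << PMD_SHIFT)
#define PMD_MASK   (~(PMD_SIZE - 1))

/*
 * Hypothetical helper mirroring the arithmetic in unuse_pmd(). address and
 * offset have the meaning they have on entry to unuse_pmd(). The VMA-relative
 * offset for each visited page is stored in out[] (up to max entries) and the
 * number of pages visited is returned.
 */
static unsigned int pmd_page_offsets(uint32_t vm_start, uint32_t address,
                                     uint32_t size, uint32_t offset,
                                     uint32_t *out, unsigned int max)
{
    unsigned int n = 0;
    uint32_t end;

    assert(size > 0);
    offset += address & PMD_MASK;       /* line 396 */
    address &= ~PMD_MASK;               /* line 397 */
    end = address + size;               /* lines 398-400 */
    if (end > PMD_SIZE)
        end = PMD_SIZE;
    do {
        if (n < max)
            out[n] = offset + address - vm_start;   /* argument on line 402 */
        n++;
        address += PAGE_SIZE;
    } while (address && (address < end));

    return n;
}
```

For a four-page VMA starting at 0x401FE000, which straddles a 2 MiB boundary, the first call covers the pages at VMA offsets 0x0000 and 0x1000, and the second call, made for the next PMD, covers 0x2000 and 0x3000.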

K.5.7  Function: unuse_pte

Source: mm/swapfile.c

This function checks whether the PTE at dir holds the swap entry being searched for. If it does, a reference is taken to the page, the PTE is updated to map the page and the swap entry is freed.

365 static inline void unuse_pte(struct vm_area_struct * vma, unsigned long address,
366         pte_t *dir, swp_entry_t entry, struct page* page)
367 {
368     pte_t pte = *dir;
370     if (likely(pte_to_swp_entry(pte).val != entry.val))
371         return;
372     if (unlikely(pte_none(pte) || pte_present(pte)))
373         return;
374     get_page(page);
375     set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
376     swap_free(entry);
377     ++vma->vm_mm->rss;
378 }
370-371If the entry does not match the PTE, return
372-373If there is no PTE or it is already present (meaning there is no way this entry is mapped here), then return
374Otherwise we have found the entry we are looking for so take a reference to the page as a new PTE is about to map it
375Update the PTE to map page
376Free the swap entry
377Increment the RSS count for this process
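
The order of operations in unuse_pte() can be modelled with a toy structure. This is only a sketch: the PTE encoding (a single present bit, with any other non-zero value treated as a swap entry), the reference counters and every toy_ name are assumptions made for illustration and do not match the real architecture-specific macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PRESENT 0x1u        /* hypothetical "present" bit */

/* Hypothetical stand-in for the state that unuse_pte() touches. */
struct toy_state {
    uint32_t pte;               /* the page table entry being examined */
    unsigned int page_refs;     /* reference count of the page */
    unsigned int swap_refs;     /* reference count of the swap entry */
    unsigned int rss;           /* resident set size of the process */
};

/* Returns true if the PTE held swap_entry and was rewritten to map the page. */
static bool toy_unuse_pte(struct toy_state *s, uint32_t swap_entry,
                          uint32_t page_frame)
{
    uint32_t pte = s->pte;

    if (pte != swap_entry)                  /* lines 370-371 */
        return false;
    if (pte == 0 || (pte & TOY_PRESENT))    /* lines 372-373 */
        return false;

    s->page_refs++;                         /* line 374: get_page() */
    s->pte = page_frame | TOY_PRESENT;      /* line 375: set_pte() */
    assert(s->swap_refs > 0);
    s->swap_refs--;                         /* line 376: swap_free() */
    s->rss++;                               /* line 377 */
    return true;
}
```

A PTE holding a different swap entry is left alone, while a matching one ends up present, with the page gaining a reference, the swap entry losing one and the RSS growing by one.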
