Appendix J  Page Frame Reclamation

J.1  Page Cache Operations

This section covers how pages are added to and removed from the page cache and the LRU lists, which are heavily intertwined.

J.1.1  Adding Pages to the Page Cache

J.1.1.1  Function: add_to_page_cache

Source: mm/filemap.c

Acquire the lock protecting the page cache before calling __add_to_page_cache(), which adds the page to the page hash table and the inode queue so that pages belonging to files can be found quickly.

667 void add_to_page_cache(struct page * page, 
                     struct address_space * mapping,
                      unsigned long offset)
668 {
669       spin_lock(&pagecache_lock);
670       __add_to_page_cache(page, mapping, 
                        offset, page_hash(mapping, offset));
671       spin_unlock(&pagecache_lock);
672       lru_cache_add(page);
673 }
669 Acquire the lock protecting the page hash and inode queues
670 Call the function which performs the “real” work. page_hash() hashes into the page hash table based on the mapping and the offset within the file. If the bucket already holds a page, there was a collision and the colliding pages are chained with the page->next_hash and page->pprev_hash fields
671 Release the lock protecting the hash and inode queue
672 Add the page to the LRU lists with lru_cache_add() (See Section J.2.1.1)

J.1.1.2  Function: add_to_page_cache_unique

Source: mm/filemap.c

In many respects, this function is very similar to add_to_page_cache(). The principal difference is that this function checks the page cache with the pagecache_lock spinlock held before adding the page to the cache. It is for callers that may race with another process to insert a page into the cache, such as add_to_swap_cache() (See Section K.2.1.1).

675 int add_to_page_cache_unique(struct page * page,
676         struct address_space *mapping, unsigned long offset,
677         struct page **hash)
678 {
679     int err;
680     struct page *alias;
682     spin_lock(&pagecache_lock);
683     alias = __find_page_nolock(mapping, offset, *hash);
685     err = 1;
686     if (!alias) {
687         __add_to_page_cache(page,mapping,offset,hash);
688         err = 0;
689     }
691     spin_unlock(&pagecache_lock);
692     if (!err)
693         lru_cache_add(page);
694     return err;
695 }
682 Acquire the pagecache_lock for examining the cache
683 Check if the page already exists in the cache with __find_page_nolock() (See Section J.1.4.3)
686-689 If the page does not exist in the cache, add it with __add_to_page_cache() (See Section J.1.1.3)
691 Release the pagecache_lock
692-693 If the page did not already exist in the page cache, add it to the LRU lists with lru_cache_add() (See Section J.2.1.1)
694 Return 0 if this call entered the page into the page cache and 1 if it already existed
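
The check-then-insert pattern can be sketched in user space as follows. This is only an illustration under simplifying assumptions: struct toy_page, toy_hash(), toy_find() and toy_add_unique() are hypothetical names, the hash function is a toy, and the pagecache_lock is omitted because the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
#define TOY_HASH_SIZE 16

struct toy_page {
    const void *mapping;        /* owner, plays the role of page->mapping */
    unsigned long index;        /* offset within the owner */
    struct toy_page *next_hash; /* collision chain */
};

static struct toy_page *toy_hash_table[TOY_HASH_SIZE];

/* Toy hash function; the real page_hash() is more elaborate. */
static struct toy_page **toy_hash(const void *mapping, unsigned long index)
{
    return &toy_hash_table[((uintptr_t)mapping + index) % TOY_HASH_SIZE];
}

/* Walk one collision chain looking for (mapping, index). */
static struct toy_page *toy_find(const void *mapping, unsigned long index,
                                 struct toy_page *page)
{
    for (; page; page = page->next_hash)
        if (page->mapping == mapping && page->index == index)
            return page;
    return NULL;
}

/*
 * Return 0 if this call inserted the page and 1 if an alias was already
 * cached. In the kernel both steps run with pagecache_lock held so that
 * no other inserter can slip in between the lookup and the insertion.
 */
static int toy_add_unique(struct toy_page *page, const void *mapping,
                          unsigned long index)
{
    struct toy_page **bucket = toy_hash(mapping, index);

    if (toy_find(mapping, index, *bucket))
        return 1;
    page->mapping = mapping;
    page->index = index;
    page->next_hash = *bucket;
    *bucket = page;
    return 0;
}
```

A second call with the same mapping and index leaves the cache untouched and returns 1, which is how a caller learns that it lost the race.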

J.1.1.3  Function: __add_to_page_cache

Source: mm/filemap.c

Clear the page's state flags, lock the page, take a reference to it and add it to the inode and hash queues.

653 static inline void __add_to_page_cache(struct page * page,
654       struct address_space *mapping, unsigned long offset,
655       struct page **hash)
656 {
657       unsigned long flags;
659       flags = page->flags & ~(1 << PG_uptodate | 
                            1 << PG_error | 1 << PG_dirty | 
                            1 << PG_referenced | 1 << PG_arch_1 | 
                            1 << PG_checked);
660       page->flags = flags | (1 << PG_locked);
661       page_cache_get(page);
662       page->index = offset;
663       add_page_to_inode_queue(mapping, page);
664       add_page_to_hash_queue(page, hash);
665 }
659 Clear the flags describing the page's previous state (PG_uptodate, PG_error, PG_dirty, PG_referenced, PG_arch_1 and PG_checked)
660 Lock the page
661 Take a reference to the page in case it gets freed prematurely
662 Update the index so it is known what file offset this page represents
663 Add the page to the inode queue with add_page_to_inode_queue() (See Section J.1.1.4). This links the page via page->list to the clean_pages list in the address_space and points page->mapping to the same address_space
664 Add it to the page hash with add_page_to_hash_queue() (See Section J.1.1.5). The hash page was returned by page_hash() in the parent function. The page hash allows page cache pages to be found without having to linearly search the inode queue
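
The flag handling on lines 659-660 is a mask-then-set idiom. A minimal sketch follows, assuming illustrative bit numbers (the TOY_PG_* values and toy_reset_flags() are hypothetical and do not match the kernel's definitions):

```c
#include <assert.h>

/* Bit numbers are illustrative only; the kernel defines the real values. */
enum {
    TOY_PG_locked,
    TOY_PG_error,
    TOY_PG_referenced,
    TOY_PG_uptodate,
    TOY_PG_dirty,
    TOY_PG_lru
};

static unsigned long toy_reset_flags(unsigned long flags)
{
    /* Drop the state left behind by the previous use of the page ... */
    flags &= ~(1UL << TOY_PG_uptodate | 1UL << TOY_PG_error |
               1UL << TOY_PG_dirty | 1UL << TOY_PG_referenced);
    /* ... and hand the page back locked. Bits outside the mask survive. */
    return flags | (1UL << TOY_PG_locked);
}
```

Only the bits named in the mask are cleared, so in this sketch a flag such as TOY_PG_lru is preserved across the call.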

J.1.1.4  Function: add_page_to_inode_queue

Source: mm/filemap.c

 85 static inline void add_page_to_inode_queue(
                       struct address_space *mapping, struct page * page)
 86 {
 87     struct list_head *head = &mapping->clean_pages;
 89     mapping->nrpages++;
 90     list_add(&page->list, head);
 91     page->mapping = mapping;
 92 }
87 When this function is called, the page is clean, so mapping->clean_pages is the list of interest
89 Increment the number of pages that belong to this mapping
90 Add the page to the clean list
91 Set the page->mapping field

J.1.1.5  Function: add_page_to_hash_queue

Source: mm/filemap.c

This adds page to the top of the hash bucket headed by p. Bear in mind that p is an element of the array page_hash_table.

 71 static void add_page_to_hash_queue(struct page * page, 
                                       struct page **p)
 72 {
 73     struct page *next = *p;
 75     *p = page;
 76     page->next_hash = next;
 77     page->pprev_hash = p;
 78     if (next)
 79         next->pprev_hash = &page->next_hash;
 80     if (page->buffers)
 81         PAGE_BUG(page);
 82     atomic_inc(&page_cache_size);
 83 }
73 Record the current head of the hash bucket in next
75 Update the head of the hash bucket to be page
76 Point page->next_hash to the old head of the hash bucket
77 Point page->pprev_hash to the array element in page_hash_table
78-79 If the bucket was not empty, point the old head's pprev_hash field at page->next_hash, completing the insertion of the page into the linked list
80-81 Check that the page entered has no associated buffers
82 Increment page_cache_size which is the size of the page cache
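
The pointer manipulation is easier to follow in isolation. The sketch below mirrors lines 73-79 on a stripped-down structure; struct toy_page and toy_hash_add() are hypothetical names, and the buffer check and page_cache_size accounting are left out.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page {
    struct toy_page *next_hash;   /* next page in the bucket */
    struct toy_page **pprev_hash; /* address of the pointer that points at us */
};

/* Push page onto the front of the bucket whose head pointer is *p. */
static void toy_hash_add(struct toy_page *page, struct toy_page **p)
{
    struct toy_page *next = *p;

    *p = page;                 /* the bucket now starts with page */
    page->next_hash = next;    /* the old head follows it */
    page->pprev_hash = p;      /* page is reached through the bucket slot */
    if (next)                  /* the old head is now reached through page */
        next->pprev_hash = &page->next_hash;
}
```

Because pprev_hash stores the address of a pointer rather than a pointer to the previous page, the bucket slot and a predecessor's next_hash field are treated identically.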

J.1.2  Deleting Pages from the Page Cache

J.1.2.1  Function: remove_inode_page

Source: mm/filemap.c

130 void remove_inode_page(struct page *page)
131 {
132     if (!PageLocked(page))
133         PAGE_BUG(page);
135     spin_lock(&pagecache_lock);
136     __remove_inode_page(page);
137     spin_unlock(&pagecache_lock);
138 }
132-133 If the page is not locked, it is a bug
135 Acquire the lock protecting the page cache
136 __remove_inode_page() (See Section J.1.2.2) is the top-level function for when the pagecache lock is held
137 Release the pagecache lock

J.1.2.2  Function: __remove_inode_page

Source: mm/filemap.c

This is the top-level function for removing a page from the page cache for callers with the pagecache_lock spinlock held. Callers that do not have this lock acquired should call remove_inode_page().

124 void __remove_inode_page(struct page *page)
125 {
126         remove_page_from_inode_queue(page);
127         remove_page_from_hash_queue(page);
128 }
126 remove_page_from_inode_queue() (See Section J.1.2.3) removes the page from its address_space at page->mapping
127 remove_page_from_hash_queue() (See Section J.1.2.4) removes the page from the hash table in page_hash_table

J.1.2.3  Function: remove_page_from_inode_queue

Source: mm/filemap.c

 94 static inline void remove_page_from_inode_queue(struct page * page)
 95 {
 96     struct address_space * mapping = page->mapping;
 98     if (mapping->a_ops->removepage)
 99         mapping->a_ops->removepage(page);
100     list_del(&page->list);
101     page->mapping = NULL;
102     wmb();
103     mapping->nrpages--;
104 }
96 Get the associated address_space for this page
98-99 Call the filesystem-specific removepage() function if one is available
100 Delete the page from whatever list it belongs to in the mapping, which is the clean_pages list in most cases or the dirty_pages list in rarer cases
101 Set page->mapping to NULL as the page is no longer backed by any address_space
103 Decrement the number of pages in the mapping

J.1.2.4  Function: remove_page_from_hash_queue

Source: mm/filemap.c

107 static inline void remove_page_from_hash_queue(struct page * page)
108 {
109     struct page *next = page->next_hash;
110     struct page **pprev = page->pprev_hash;
112     if (next)
113         next->pprev_hash = pprev;
114     *pprev = next;
115     page->pprev_hash = NULL;
116     atomic_dec(&page_cache_size);
117 }
109 Get the next page after the page being removed
110 Get pprev, the address of the pointer that currently points to the page being removed. When the function completes, that pointer will point to next
112 If this is not the end of the list, update next->pprev_hash to point to pprev
114 Similarly, point *pprev forward to next. page is now unlinked
116 Decrement the size of the page cache
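
The benefit of storing a pointer-to-pointer is that removal needs no special case for the head of the bucket. A minimal sketch of lines 109-115, using hypothetical names (struct toy_page, toy_hash_del()) and omitting the page_cache_size accounting:

```c
#include <assert.h>
#include <stddef.h>

struct toy_page {
    struct toy_page *next_hash;   /* next page in the bucket */
    struct toy_page **pprev_hash; /* address of the pointer that points at us */
};

/* Unlink page from whatever bucket it is chained into. */
static void toy_hash_del(struct toy_page *page)
{
    struct toy_page *next = page->next_hash;
    struct toy_page **pprev = page->pprev_hash;

    if (next)
        next->pprev_hash = pprev; /* successor is now reached via pprev */
    *pprev = next;                /* bucket slot or predecessor skips page */
    page->pprev_hash = NULL;      /* page is no longer hashed */
}
```

The same three assignments work whether *pprev is the slot in the hash table or the next_hash field of another page.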

J.1.3  Acquiring/Releasing Page Cache Pages

J.1.3.1  Function: page_cache_get

Source: include/linux/pagemap.h

 31 #define page_cache_get(x)       get_page(x)
31 Simply call get_page(), which uses atomic_inc() to increment the page reference count

J.1.3.2  Function: page_cache_release

Source: include/linux/pagemap.h

 32 #define page_cache_release(x)   __free_page(x)
32 Call __free_page() which decrements the page count. If the count reaches 0, the page will be freed
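
The get/release pairing amounts to simple reference counting. The following sketch shows the idea with hypothetical names (struct toy_page, toy_get(), toy_release()); it uses a plain int where the kernel uses atomic operations, and it only records that the page would be freed.

```c
#include <assert.h>

struct toy_page {
    int count;  /* reference count; the kernel uses an atomic counter */
    int freed;  /* set once the last reference has been dropped */
};

/* Take a reference so the page cannot be freed underneath the caller. */
static void toy_get(struct toy_page *page)
{
    page->count++;
}

/* Drop a reference; the holder of the last one frees the page. */
static void toy_release(struct toy_page *page)
{
    assert(page->count > 0);
    if (--page->count == 0)
        page->freed = 1; /* the kernel would return the page to the allocator */
}
```

Every toy_get() must be balanced by exactly one toy_release(), which is why the functions in this appendix always release a page on each path that took a reference.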

J.1.4  Searching the Page Cache

J.1.4.1  Function: find_get_page

Source: include/linux/pagemap.h

Top-level macro for finding a page in the page cache. It simply looks up the page hash.

 75 #define find_get_page(mapping, index) \
 76     __find_get_page(mapping, index, page_hash(mapping, index))
76 page_hash() locates an entry in the page_hash_table based on the address_space and offset

J.1.4.2  Function: __find_get_page

Source: mm/filemap.c

This function is responsible for finding a struct page given an entry in page_hash_table as a starting point.

931 struct page * __find_get_page(struct address_space *mapping,
932                 unsigned long offset, struct page **hash)
933 {
934     struct page *page;
936     /*
937      * We scan the hash list read-only. Addition to and removal from
938      * the hash-list needs a held write-lock.
939      */
940     spin_lock(&pagecache_lock);
941     page = __find_page_nolock(mapping, offset, *hash);
942     if (page)
943         page_cache_get(page);
944     spin_unlock(&pagecache_lock);
945     return page;
946 }
940 Acquire the pagecache_lock spinlock. The hash list is only read while it is held here
941 Call the page cache traversal function which presumes a lock is held
942-943 If the page was found, obtain a reference to it with page_cache_get() (See Section J.1.3.1) so it is not freed prematurely
944 Release the page cache lock
945 Return the page or NULL if not found

J.1.4.3  Function: __find_page_nolock

Source: mm/filemap.c

This function traverses the hash collision list looking for the page specified by the address_space and offset.

443 static inline struct page * __find_page_nolock(
                    struct address_space *mapping, 
                    unsigned long offset, 
                    struct page *page)
444 {
445     goto inside;
447     for (;;) {
448         page = page->next_hash;
449 inside:
450         if (!page)
451             goto not_found;
452         if (page->mapping != mapping)
453             continue;
454         if (page->index == offset)
455             break;
456     }
458 not_found:
459     return page;
460 }
445 Begin by examining the first page in the list
450-451 If the page is NULL, the right one could not be found so return NULL
452 If the address_space does not match, move to the next page on the collision list
454 If the offset matches, break out of the loop to return the page, else move on
448 Move to the next page on the hash list
459 Return the found page or NULL if not

J.1.4.4  Function: find_lock_page

Source: include/linux/pagemap.h

This is the top level function for searching the page cache for a page and having it returned in a locked state.

 84 #define find_lock_page(mapping, index) \
 85     __find_lock_page(mapping, index, page_hash(mapping, index))
85 Call the core function __find_lock_page() after looking up what hash bucket this page is using with page_hash()

J.1.4.5  Function: __find_lock_page

Source: mm/filemap.c

This function acquires the pagecache_lock spinlock before calling the core function __find_lock_page_helper() to locate the page and lock it.

1005 struct page * __find_lock_page (struct address_space *mapping,
1006                     unsigned long offset, struct page **hash)
1007 {
1008    struct page *page;
1010    spin_lock(&pagecache_lock);
1011    page = __find_lock_page_helper(mapping, offset, *hash);
1012    spin_unlock(&pagecache_lock);
1013    return page;
1014 }
1010 Acquire the pagecache_lock spinlock
1011 Call __find_lock_page_helper() which will search the page cache and lock the page if it is found
1012 Release the pagecache_lock spinlock
1013 If the page was found, return it in a locked state, otherwise return NULL

J.1.4.6  Function: __find_lock_page_helper

Source: mm/filemap.c

This function uses __find_page_nolock() to locate a page within the page cache. If it is found, the page will be locked for returning to the caller.

972 static struct page * __find_lock_page_helper(
                               struct address_space *mapping,
973                            unsigned long offset, struct page *hash)
974 {
975     struct page *page;
977     /*
978      * We scan the hash list read-only. Addition to and removal from
979      * the hash-list needs a held write-lock.
980      */
981 repeat:
982     page = __find_page_nolock(mapping, offset, hash);
983     if (page) {
984         page_cache_get(page);
985         if (TryLockPage(page)) {
986             spin_unlock(&pagecache_lock);
987             lock_page(page);
988             spin_lock(&pagecache_lock);
990             /* Has the page been re-allocated while we slept?  */
991             if (page->mapping != mapping || page->index != offset) {
992                 UnlockPage(page);
993                 page_cache_release(page);
994                 goto repeat;
995             }
996         }
997     }
998     return page;
999 }
982 Use __find_page_nolock() (See Section J.1.4.3) to locate the page in the page cache
983-984 If the page was found, take a reference to it
985 Try to lock the page with TryLockPage(). This macro is just a wrapper around test_and_set_bit() which attempts to set the PG_locked bit in page->flags
986-988 If the lock attempt failed, release the pagecache_lock spinlock and call lock_page() (See Section B.2.1.1) to lock the page. It is likely this function will sleep until the page lock is acquired. When the page is locked, acquire the pagecache_lock spinlock again
991 If the mapping and index no longer match, the page was reclaimed and reused while we were asleep. The page is unlocked and the reference dropped before searching the page cache again
998 Return the page in a locked state, or NULL if it was not in the page cache

J.2  LRU List Operations

J.2.1  Adding Pages to the LRU Lists

J.2.1.1  Function: lru_cache_add

Source: mm/swap.c

Adds a page to the LRU inactive_list.

 58 void lru_cache_add(struct page * page)
 59 {
 60       if (!PageLRU(page)) {
 61             spin_lock(&pagemap_lru_lock);
 62             if (!TestSetPageLRU(page))
 63                   add_page_to_inactive_list(page);
 64             spin_unlock(&pagemap_lru_lock);
 65       }
 66 }
60 If the page is not already part of the LRU lists, add it
61 Acquire the LRU lock
62-63 Test and set the LRU bit. If it was clear, call add_page_to_inactive_list()
64 Release the LRU lock

J.2.1.2  Function: add_page_to_active_list

Source: include/linux/swap.h

Adds the page to the active_list

178 #define add_page_to_active_list(page)         \
179 do {                                          \
180       DEBUG_LRU_PAGE(page);                   \
181       SetPageActive(page);                    \
182       list_add(&(page)->lru, &active_list);   \
183       nr_active_pages++;                      \
184 } while (0)
180 The DEBUG_LRU_PAGE() macro will call BUG() if the page is already on the LRU list or is marked as being active
181 Update the flags of the page to show it is active
182 Add the page to the active_list
183 Update the count of the number of pages in the active_list

J.2.1.3  Function: add_page_to_inactive_list

Source: include/linux/swap.h

Adds the page to the inactive_list

186 #define add_page_to_inactive_list(page)       \
187 do {                                          \
188       DEBUG_LRU_PAGE(page);                   \
189       list_add(&(page)->lru, &inactive_list); \
190       nr_inactive_pages++;                    \
191 } while (0)
188 The DEBUG_LRU_PAGE() macro will call BUG() if the page is already on the LRU list or is marked as being active
189 Add the page to the inactive_list
190 Update the count of the number of inactive pages on the list

J.2.2  Deleting Pages from the LRU Lists

J.2.2.1  Function: lru_cache_del

Source: mm/swap.c

Acquire the lock protecting the LRU lists before calling __lru_cache_del().

 90 void lru_cache_del(struct page * page)
 91 {
 92       spin_lock(&pagemap_lru_lock);
 93       __lru_cache_del(page);
 94       spin_unlock(&pagemap_lru_lock);
 95 }
92 Acquire the LRU lock
93 __lru_cache_del() does the “real” work of removing the page from the LRU lists
94 Release the LRU lock

J.2.2.2  Function: __lru_cache_del

Source: mm/swap.c

Select which function is needed to remove the page from the LRU list.

 75 void __lru_cache_del(struct page * page)
 76 {
 77       if (TestClearPageLRU(page)) {
 78             if (PageActive(page)) {
 79                   del_page_from_active_list(page);
 80             } else {
 81                   del_page_from_inactive_list(page);
 82             }
 83       }
 84 }
77 Test and clear the flag indicating the page is in the LRU
78-82 If the page is on the LRU, select the appropriate removal function
78-79 If the page is active, call del_page_from_active_list(); otherwise delete it from the inactive list with del_page_from_inactive_list()
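
The test-and-clear step is what makes the removal safe to call on a page that is not on any list. A minimal sketch, assuming hypothetical names (struct toy_page, toy_test_and_clear(), toy_lru_del()) and modelling each list by its counter only; the kernel's bit operation is atomic, whereas this one is not.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
    bool lru;     /* stands in for the PG_lru bit */
    bool active;  /* stands in for the PG_active bit */
};

/* Each list is modelled by its page counter only. */
static int toy_nr_active, toy_nr_inactive;

/* Clear *flag and report what it held before (not atomic in this sketch). */
static bool toy_test_and_clear(bool *flag)
{
    bool old = *flag;

    *flag = false;
    return old;
}

static void toy_lru_del(struct toy_page *page)
{
    if (!toy_test_and_clear(&page->lru))
        return;                 /* not on an LRU list: nothing to do */
    if (page->active) {
        page->active = false;   /* as del_page_from_active_list() does */
        toy_nr_active--;
    } else {
        toy_nr_inactive--;
    }
}
```

Calling toy_lru_del() twice on the same page decrements a counter only once, because the second call finds the flag already clear.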

J.2.2.3  Function: del_page_from_active_list

Source: include/linux/swap.h

Remove the page from the active_list

193 #define del_page_from_active_list(page)   \
194 do {                                      \
195       list_del(&(page)->lru);             \
196       ClearPageActive(page);              \
197       nr_active_pages--;                  \
198 } while (0)
195 Delete the page from the list
196 Clear the flag indicating it is part of active_list. The flag indicating it is part of the LRU list has already been cleared by __lru_cache_del()
197 Update the count of the number of pages in the active_list

J.2.2.4  Function: del_page_from_inactive_list

Source: include/linux/swap.h

200 #define del_page_from_inactive_list(page) \
201 do {                                      \
202       list_del(&(page)->lru);             \
203       nr_inactive_pages--;                \
204 } while (0)
202 Remove the page from the LRU list
203 Update the count of the number of pages in the inactive_list

J.2.3  Activating Pages

J.2.3.1  Function: mark_page_accessed

Source: mm/filemap.c

This marks that a page has been referenced. If the page is already on the active_list or the referenced flag is clear, the referenced flag will simply be set. If it is on the inactive_list and the referenced flag is already set, activate_page() will be called to move the page to the top of the active_list.

1332 void mark_page_accessed(struct page *page)
1333 {
1334       if (!PageActive(page) && PageReferenced(page)) {
1335             activate_page(page);
1336             ClearPageReferenced(page);
1337       } else
1338             SetPageReferenced(page);
1339 }
1334-1337 If the page is on the inactive_list (!PageActive()) and has been referenced recently (PageReferenced()), activate_page() is called to move it to the active_list
1338 Otherwise, mark the page as having been referenced
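
The effect is that a page on the inactive_list needs two touches to be promoted. The state machine can be sketched as follows; struct toy_page and toy_mark_accessed() are hypothetical names, and the list movement done by activate_page() is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
    bool active;      /* true when the page is on the active list */
    bool referenced;  /* stands in for the PG_referenced bit */
};

static void toy_mark_accessed(struct toy_page *page)
{
    if (!page->active && page->referenced) {
        /* Second touch while inactive: promote and start over. */
        page->active = true;       /* stands in for activate_page() */
        page->referenced = false;
    } else {
        /* First touch, or the page is already active. */
        page->referenced = true;
    }
}
```

Starting from an inactive, unreferenced page, the first call only sets the referenced flag, the second call activates the page and clears the flag, and a third call sets the flag again.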

J.2.3.2  Function: activate_page

Source: mm/swap.c

Acquire the LRU lock before calling activate_page_nolock() which moves the page from the inactive_list to the active_list.

 47 void activate_page(struct page * page)
 48 {
 49       spin_lock(&pagemap_lru_lock);
 50       activate_page_nolock(page);
 51       spin_unlock(&pagemap_lru_lock);
 52 }
49 Acquire the LRU lock
50 Call the main work function
51 Release the LRU lock

J.2.3.3  Function: activate_page_nolock

Source: mm/swap.c

Move the page from the inactive_list to the active_list

 39 static inline void activate_page_nolock(struct page * page)
 40 {
 41       if (PageLRU(page) && !PageActive(page)) {
 42             del_page_from_inactive_list(page);
 43             add_page_to_active_list(page);
 44       }
 45 }
41 Make sure the page is on the LRU and not already on the active_list
42-43 Delete the page from the inactive_list and add to the active_list

J.3  Refilling inactive_list

This section covers how pages are moved from the active_list to the inactive_list.

J.3.1  Function: refill_inactive

Source: mm/vmscan.c

Move nr_pages pages from the active_list to the inactive_list. The parameter nr_pages is calculated by shrink_caches() and is chosen to try to keep the active list at two thirds the size of the page cache.

533 static void refill_inactive(int nr_pages)
534 {
535       struct list_head * entry;
537       spin_lock(&pagemap_lru_lock);
538       entry = active_list.prev;
539       while (nr_pages && entry != &active_list) {
540             struct page * page;
542             page = list_entry(entry, struct page, lru);
543             entry = entry->prev;
544             if (PageTestandClearReferenced(page)) {
545                   list_del(&page->lru);
546                   list_add(&page->lru, &active_list);
547                   continue;
548             }
550             nr_pages--;
552             del_page_from_active_list(page);
553             add_page_to_inactive_list(page);
554             SetPageReferenced(page);
555       }
556       spin_unlock(&pagemap_lru_lock);
557 }
537 Acquire the lock protecting the LRU list
538 Take the last entry in the active_list
539-555 Keep moving pages until nr_pages pages have been moved or the end of the active_list is reached
542 Get the struct page for this entry
544-548 Test and clear the referenced flag. If it has been referenced, then it is moved back to the top of the active_list
550-553 Move one page from the active_list to the inactive_list
554 Mark it referenced so that if it is referenced again soon, it will be promoted back to the active_list without requiring a second reference
556 Release the lock protecting the LRU list
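
The scan is a second-chance algorithm and can be modelled in user space. The sketch below is an illustration only: struct toy_list, struct toy_page and the toy_* functions are hypothetical stand-ins for the kernel's list_head helpers and page flags, the list counters are omitted, and no locking is done.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_list { struct toy_list *prev, *next; };

struct toy_page {
    struct toy_list lru;  /* must stay first: a list node is cast to its page */
    bool referenced;
    bool active;
};

static void toy_list_init(struct toy_list *head)
{
    head->prev = head->next = head;
}

static void toy_list_del(struct toy_list *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}

/* Insert entry right after head, that is, at the top of the list. */
static void toy_list_add(struct toy_list *entry, struct toy_list *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Demote up to nr_pages pages from the tail of active; return how many moved. */
static int toy_refill_inactive(struct toy_list *active,
                               struct toy_list *inactive, int nr_pages)
{
    struct toy_list *entry = active->prev;
    int moved = 0;

    while (nr_pages && entry != active) {
        struct toy_page *page = (struct toy_page *)entry;

        entry = entry->prev;  /* step first: page is about to be moved */
        if (page->referenced) {
            /* Second chance: clear the bit and rotate to the top. */
            page->referenced = false;
            toy_list_del(&page->lru);
            toy_list_add(&page->lru, active);
            continue;
        }
        nr_pages--;
        moved++;
        toy_list_del(&page->lru);
        page->active = false;
        toy_list_add(&page->lru, inactive);
        page->referenced = true;  /* mirrors line 554 */
    }
    return moved;
}
```

With an active list of a, b, c from top to tail and only c referenced, asking for one page rotates c to the top and demotes b, leaving a at the tail of the active list.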

J.4  Reclaiming Pages from the LRU Lists

This section covers how a page is reclaimed once it has been selected for pageout.

J.4.1  Function: shrink_cache

Source: mm/vmscan.c

338 static int shrink_cache(int nr_pages, zone_t * classzone, 
                            unsigned int gfp_mask, int priority)
339 {
340     struct list_head * entry;
341     int max_scan = nr_inactive_pages / priority;
342     int max_mapped = min((nr_pages << (10 - priority)), 
                             max_scan / 10);
344     spin_lock(&pagemap_lru_lock);
345     while (--max_scan >= 0 && 
               (entry = inactive_list.prev) != &inactive_list) {
338 The parameters are as follows:
  nr_pages   The number of pages to swap out
  classzone  The zone we are interested in swapping pages out for. Pages not belonging to this zone are skipped
  gfp_mask   The gfp mask determining what actions may be taken, such as whether filesystem operations may be performed
  priority   The priority of the function, starts at DEF_PRIORITY (6) and decreases to the highest priority of 1
341 The maximum number of pages to scan is the number of pages in the inactive_list divided by the priority. At lowest priority, 1/6th of the list may be scanned. At highest priority, the full list may be scanned
342 The maximum number of process-mapped pages allowed is the smaller of one tenth of the max_scan value and nr_pages * 2^(10 - priority). If this number of pages is found, whole processes will be swapped out
344 Lock the LRU list
345 Keep scanning until max_scan pages have been scanned or the inactive_list is empty
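
The arithmetic on lines 341-342 can be worked through with a small helper. This is a sketch with hypothetical names (struct toy_scan_limits, toy_min(), toy_limits()); the sample figures used afterwards are arbitrary and not taken from a real system.

```c
#include <assert.h>

struct toy_scan_limits {
    int max_scan;    /* how many inactive pages may be examined */
    int max_mapped;  /* how many mapped pages are tolerated before swap_out() */
};

static int toy_min(int a, int b)
{
    return a < b ? a : b;
}

/* priority runs from 6 (least urgent) down to 1 (most urgent). */
static struct toy_scan_limits toy_limits(int nr_inactive_pages, int nr_pages,
                                         int priority)
{
    struct toy_scan_limits limits;

    limits.max_scan = nr_inactive_pages / priority;
    limits.max_mapped = toy_min(nr_pages << (10 - priority),
                                limits.max_scan / 10);
    return limits;
}
```

For 6000 inactive pages and a request for 32 pages, priority 6 gives max_scan = 1000 and max_mapped = min(32 * 16, 100) = 100, while priority 1 gives max_scan = 6000 and max_mapped = min(32 * 512, 600) = 600. In both cases the max_scan / 10 term is the limiting one.
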
346         struct page * page;
348         if (unlikely(current->need_resched)) {
349             spin_unlock(&pagemap_lru_lock);
350             __set_current_state(TASK_RUNNING);
351             schedule();
352             spin_lock(&pagemap_lru_lock);
353             continue;
354         }
348-354 Reschedule if the quantum has been used up
349 Release the LRU lock as we are about to sleep
350 Show we are still running
351 Call schedule() so another process can be context switched in
352 Re-acquire the LRU lock
353 Go back to the top of the loop and take an entry from the inactive_list again. As we slept, another process could have changed what entries are on the list, which is why another entry has to be taken with the spinlock held
356         page = list_entry(entry, struct page, lru);
358         BUG_ON(!PageLRU(page));
359         BUG_ON(PageActive(page));
361         list_del(entry);
362         list_add(entry, &inactive_list);
364         /*
365          * Zero page counts can happen because we unlink the pages
366          * _after_ decrementing the usage count..
367          */
368         if (unlikely(!page_count(page)))
369             continue;
371         if (!memclass(page_zone(page), classzone))
372             continue;
374         /* Racy check to avoid trylocking when not worthwhile */
375         if (!page->buffers && (page_count(page) != 1 || !page->mapping))
376             goto page_mapped;
356 Get the struct page for this entry in the LRU
358-359 It is a bug if the page is not on the LRU or is currently marked as active
361-362 Move the page to the top of the inactive_list so that if the page is not freed, we can just continue knowing that it will simply be examined later
368-369 If the page count has already reached 0, skip over it. In __free_pages(), the page count is dropped with put_page_testzero() before __free_pages_ok() is called to free it. This leaves a window where a page with a zero count is left on the LRU before it is freed. There is a special case to trap this at the beginning of __free_pages_ok()
371-372 Skip over this page if it belongs to a zone we are not currently interested in
375-376 If the page is mapped by a process, then goto page_mapped where max_mapped is decremented and the next page is examined. If max_mapped reaches 0, process pages will be swapped out
382         if (unlikely(TryLockPage(page))) {
383             if (PageLaunder(page) && (gfp_mask & __GFP_FS)) {
384                 page_cache_get(page);
385                 spin_unlock(&pagemap_lru_lock);
386                 wait_on_page(page);
387                 page_cache_release(page);
388                 spin_lock(&pagemap_lru_lock);
389             }
390             continue;
391         }

Page is locked and the launder bit is set. In this case, it is the second time this page has been found dirty. The first time it was scheduled for IO and placed back on the list. This time we wait until the IO is complete and then try to free the page.

382-383 If we could not lock the page, the PG_launder bit is set and the GFP flags allow the caller to perform FS operations, then...
384 Take a reference to the page so it does not disappear while we sleep
385 Release the LRU lock
386 Wait until the IO is complete
387 Release the reference to the page. If it reaches 0, the page will be freed
388 Re-acquire the LRU lock
390 Move to the next page
393         if (PageDirty(page) && 
                is_page_cache_freeable(page) && 
                page->mapping) {
394             /*
395              * It is not critical here to write it only if
396              * the page is unmapped beause any direct writer
397              * like O_DIRECT would set the PG_dirty bitflag
398              * on the phisical page after having successfully
399              * pinned it and after the I/O to the page is finished,
400              * so the direct writes to the page cannot get lost.
401              */
402             int (*writepage)(struct page *);
404             writepage = page->mapping->a_ops->writepage;
405             if ((gfp_mask & __GFP_FS) && writepage) {
406                 ClearPageDirty(page);
407                 SetPageLaunder(page);
408                 page_cache_get(page);
409                 spin_unlock(&pagemap_lru_lock);
411                 writepage(page);
412                 page_cache_release(page);
414                 spin_lock(&pagemap_lru_lock);
415                 continue;
416             }
417         }

This handles the case where a page is dirty, is not mapped by any process, has no buffers and is backed by a file or device mapping. The page is cleaned and will be reclaimed by the previous block of code when the IO is complete.

393 PageDirty() checks the PG_dirty bit, is_page_cache_freeable() will return true if it is not mapped by any process and has no buffers
404 Get a pointer to the necessary writepage() function for this mapping or device
405-416 This block of code can only be executed if a writepage() function is available and the GFP flags allow file operations
406-407 Clear the dirty bit and mark that the page is being laundered
408 Take a reference to the page so it will not be freed unexpectedly
409 Unlock the LRU list
411 Call the filesystem-specific writepage() function which is taken from the address_space_operations belonging to page->mapping
412 Release the reference to the page
414-415 Re-acquire the LRU list lock and move to the next page
424         if (page->buffers) {
425             spin_unlock(&pagemap_lru_lock);
427             /* avoid to free a locked page */
428             page_cache_get(page);
430             if (try_to_release_page(page, gfp_mask)) {
431                 if (!page->mapping) {
438                     spin_lock(&pagemap_lru_lock);
439                     UnlockPage(page);
440                     __lru_cache_del(page);
442                     /* effectively free the page here */
443                     page_cache_release(page);
445                     if (--nr_pages)
446                         continue;
447                     break;
448                 } else {
454                     page_cache_release(page);
456                     spin_lock(&pagemap_lru_lock);
457                 }
458             } else {
459                 /* failed to drop the buffers so stop here */
460                 UnlockPage(page);
461                 page_cache_release(page);
463                 spin_lock(&pagemap_lru_lock);
464                 continue;
465             }
466         }

Page has buffers associated with it that must be freed.

425 Release the LRU lock as we may sleep
428 Take a reference to the page
430 Call try_to_release_page() which will attempt to release the buffers associated with the page. Returns 1 if it succeeds
431-447 This is a case where an anonymous page that was in the swap cache has now had its buffers cleared and removed. As it was in the swap cache, it was placed on the LRU by add_to_swap_cache(), so remove it now from the LRU and drop the reference to the page. In swap_writepage(), it calls remove_exclusive_swap_page() which will delete the page from the swap cache when there are no more processes mapping the page. This block will free the page after the buffers have been written out if it was backed by a swap file
438-443 Take the LRU list lock, unlock the page, delete it from the LRU lists and drop the reference to it, which frees the page
445-446 Update nr_pages to show a page has been freed and move to the next page
447 If nr_pages drops to 0, then exit the loop as the work is completed
449-456 If the page does have an associated mapping then simply drop the reference to the page and re-acquire the LRU lock. More work will be performed later to remove the page from the page cache at line 499
459-464 If the buffers could not be freed, then unlock the page, drop the reference to it, re-acquire the LRU lock and move to the next page
468         spin_lock(&pagecache_lock);
470         /*
471          * this is the non-racy check for busy page.
472          */
473         if (!page->mapping || !is_page_cache_freeable(page)) {
474             spin_unlock(&pagecache_lock);
475             UnlockPage(page);
476 page_mapped:
477             if (--max_mapped >= 0)
478                 continue;
484             spin_unlock(&pagemap_lru_lock);
485             swap_out(priority, gfp_mask, classzone);
486             return nr_pages;
487         }
468From this point on, pages in the swap cache are likely to be examined. The swap cache is protected by the pagecache_lock, which must now be held
473-487An anonymous page with no buffers is mapped by a process
474-475Release the page cache lock and the page
477-478Decrement max_mapped. If it has not reached 0, move to the next page
484-485Too many mapped pages have been found in the page cache. The LRU lock is released and swap_out() is called to begin swapping out whole processes
493         if (PageDirty(page)) {
494             spin_unlock(&pagecache_lock);
495             UnlockPage(page);
496             continue;
497         }
493-497The page has no references but could have been dirtied by the last process to free it if the dirty bit was set in the PTE. It is left in the page cache and will get laundered later. Once it has been cleaned, it can be safely deleted
499         /* point of no return */
500         if (likely(!PageSwapCache(page))) {
501             __remove_inode_page(page);
502             spin_unlock(&pagecache_lock);
503         } else {
504             swp_entry_t swap;
505             swap.val = page->index;
506             __delete_from_swap_cache(page);
507             spin_unlock(&pagecache_lock);
508             swap_free(swap);
509         }
511         __lru_cache_del(page);
512         UnlockPage(page);
514         /* effectively free the page here */
515         page_cache_release(page);
517         if (--nr_pages)
518             continue;
519         break;
520     }
500-503If the page does not belong to the swap cache, it is part of the inode queue so it is removed
504-508Remove it from the swap cache as there are no more references to it
511Delete it from the LRU list
512Unlock the page
515Free the page
517-518Decrement nr_pages and move to the next page if it is not 0
519If it reaches 0, the work of the function is complete
521     spin_unlock(&pagemap_lru_lock);
523     return nr_pages;
524 }
521-524Function exit. Release the LRU lock and return the number of pages left to free

J.5  Shrinking all caches

J.5.1  Function: shrink_caches

Source: mm/vmscan.c

The call graph for this function is shown in Figure 10.4.

560 static int shrink_caches(zone_t * classzone, int priority, 
                 unsigned int gfp_mask, int nr_pages)
561 {
562     int chunk_size = nr_pages;
563     unsigned long ratio;
565     nr_pages -= kmem_cache_reap(gfp_mask);
566     if (nr_pages <= 0)
567         return 0;
569     nr_pages = chunk_size;
570     /* try to keep the active list 2/3 of the size of the cache */
571     ratio = (unsigned long) nr_pages * 
            nr_active_pages / ((nr_inactive_pages + 1) * 2);
572     refill_inactive(ratio);
574     nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
575     if (nr_pages <= 0)
576         return 0;
578     shrink_dcache_memory(priority, gfp_mask);
579     shrink_icache_memory(priority, gfp_mask);
580 #ifdef CONFIG_QUOTA
581     shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
582 #endif
584     return nr_pages;
585 }
560The parameters are as follows:
classzone is the zone that pages should be freed from
priority determines how much work will be done to free pages
gfp_mask determines what sort of actions may be taken
nr_pages is the number of pages remaining to be freed
565-567Ask the slab allocator to free up some pages with kmem_cache_reap() (See Section H.1.5.1). If enough are freed, the function returns. Otherwise, nr_pages will be freed from other caches
571-572Move pages from the active_list to the inactive_list by calling refill_inactive() (See Section J.3.1). The number of pages moved depends on how many pages need to be freed and to have active_list about two thirds the size of the page cache
574-575Shrink the page cache, if enough pages are freed, return
578-582Shrink the dcache, icache and dqcache. These are small objects in themselves but the cascading effect frees up a lot of disk buffers
584Return the number of pages remaining to be freed
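
The expression on line 571 can be tried out in isolation. The following is a minimal user-space sketch, not kernel code: refill_ratio() and its parameters are hypothetical stand-ins for the kernel globals nr_active_pages and nr_inactive_pages.

```c
#include <assert.h>

/*
 * Number of pages refill_inactive() would be asked to move, following
 * the expression on line 571. All names here are illustrative only.
 */
static unsigned long refill_ratio(int nr_pages,
                                  unsigned long nr_active,
                                  unsigned long nr_inactive)
{
    assert(nr_pages >= 0);
    /* The +1 avoids a division by zero when the inactive list is empty */
    return (unsigned long)nr_pages * nr_active / ((nr_inactive + 1) * 2);
}
```

When the active list is roughly twice the size of the inactive list, the result is close to nr_pages. A proportionally larger active list causes more pages to be moved, which pulls the lists back towards the two thirds target.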

J.5.2  Function: try_to_free_pages

Source: mm/vmscan.c

This function cycles through all pgdats and tries to balance the preferred allocation zone (usually ZONE_NORMAL) for each of them. This function is only called from one place, buffer.c:free_more_memory(), when the buffer manager fails to create new buffers or grow existing ones. It calls try_to_free_pages() with GFP_NOIO as the gfp_mask.

This results in the first zone in pg_data_t->node_zonelists having pages freed so that buffers can grow. This array is the preferred order of zones to allocate from and usually will begin with ZONE_NORMAL which is required by the buffer manager. On NUMA architectures, some nodes may have ZONE_DMA as the preferred zone if the memory bank is dedicated to IO devices and UML also uses only this zone. As the buffer manager is restricted in the zones it uses, there is no point balancing other zones.

607 int try_to_free_pages(unsigned int gfp_mask)
608 {
609     pg_data_t *pgdat;
610     zonelist_t *zonelist;
611     unsigned long pf_free_pages;
612     int error = 0;
614     pf_free_pages = current->flags & PF_FREE_PAGES;
615     current->flags &= ~PF_FREE_PAGES;
617     for_each_pgdat(pgdat) {
618         zonelist = pgdat->node_zonelists + 
                 (gfp_mask & GFP_ZONEMASK);
619         error |= try_to_free_pages_zone(
                    zonelist->zones[0], gfp_mask);
620     }
622     current->flags |= pf_free_pages;
623     return error;
624 }
614-615This clears the PF_FREE_PAGES flag if it is set so that pages freed by the process will be returned to the global pool rather than reserved for the process itself
617-620Cycle through all nodes and call try_to_free_pages_zone() for the preferred zone in each node
618This function is only called with GFP_NOIO as a parameter. When ANDed with GFP_ZONEMASK, it will always result in 0
622-623Restore the process flags and return the result
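
Two idioms in this function can be illustrated in user space: saving, clearing and restoring a single flag bit around some work, and selecting a zonelist by masking the GFP flags. This is only a sketch. The TOY_ constants are assumed to resemble the 2.4 header values and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values, assumed to resemble the 2.4 headers */
#define TOY_GFP_DMA        0x01u
#define TOY_GFP_ZONEMASK   0x0fu
#define TOY_GFP_WAIT       0x10u
#define TOY_GFP_HIGH       0x20u
#define TOY_GFP_NOIO       (TOY_GFP_HIGH | TOY_GFP_WAIT)
#define TOY_PF_FREE_PAGES  0x2000UL

/* Index into node_zonelists, as computed on line 618 */
static unsigned int toy_zonelist_index(unsigned int gfp_mask)
{
    return gfp_mask & TOY_GFP_ZONEMASK;
}

/*
 * Mirrors lines 614-622: remember whether the flag was set, clear it
 * while the work runs, then put it back. *seen records the flags as
 * they were while the work would have been running.
 */
static unsigned long run_with_flag_cleared(unsigned long *flags,
                                           unsigned long *seen)
{
    unsigned long saved;

    assert(flags != NULL && seen != NULL);
    saved = *flags & TOY_PF_FREE_PAGES;
    *flags &= ~TOY_PF_FREE_PAGES;
    *seen = *flags;          /* the reclaim work would happen here */
    *flags |= saved;
    return *flags;
}
```

With these assumed values, TOY_GFP_NOIO has no bits inside the zone mask, so the index is 0 and the first zonelist of each node is used.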

J.5.3  Function: try_to_free_pages_zone

Source: mm/vmscan.c

Try to free SWAP_CLUSTER_MAX pages from the requested zone. As well as being used by kswapd, this function is the entry for the buddy allocator's direct-reclaim path.

587 int try_to_free_pages_zone(zone_t *classzone, 
                               unsigned int gfp_mask)
588 {
589     int priority = DEF_PRIORITY;
590     int nr_pages = SWAP_CLUSTER_MAX;
592     gfp_mask = pf_gfp_mask(gfp_mask);
593     do {
594         nr_pages = shrink_caches(classzone, priority, 
                         gfp_mask, nr_pages);
595         if (nr_pages <= 0)
596             return 1;
597     } while (--priority);
599     /*
600      * Hmm.. Cache shrink failed - time to kill something?
601      * Mhwahahhaha! This is the part I really like. Giggle.
602      */
603     out_of_memory();
604     return 0;
605 }
589Start with the lowest priority. Statically defined to be 6
590Try and free SWAP_CLUSTER_MAX pages. Statically defined to be 32
592pf_gfp_mask() checks the PF_NOIO flag in the current process flags. If no IO can be performed, it ensures there are no incompatible flags in the GFP mask
593-597Starting with the lowest priority and increasing with each pass, call shrink_caches() until nr_pages has been freed
595-596If enough pages were freed, return indicating that the work is complete
603If enough pages could not be freed even at highest priority (where at worst the full inactive_list is scanned) then check to see if we are out of memory. If we are, then a process will be selected to be killed
604Return indicating that we failed to free enough pages
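
The shape of the priority loop can be sketched in user space. Everything below is a toy model: toy_shrink() is a hypothetical stand-in for shrink_caches() that assumes the amount of work done per pass grows as the priority value falls, and the out_of_memory() call is left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_DEF_PRIORITY      6
#define TOY_SWAP_CLUSTER_MAX  32

/* Hypothetical pool of pages that could be reclaimed */
struct toy_pool {
    int reclaimable;
};

/* Stand-in for shrink_caches(): returns the pages still to be freed */
static int toy_shrink(int priority, int nr_pages, struct toy_pool *pool)
{
    int scan = pool->reclaimable / priority;  /* more effort per pass */
    int freed = scan < nr_pages ? scan : nr_pages;

    pool->reclaimable -= freed;
    return nr_pages - freed;
}

/* Mirrors lines 589-597: returns 1 on success, 0 if every pass failed */
static int toy_reclaim(struct toy_pool *pool, int *passes)
{
    int priority = TOY_DEF_PRIORITY;
    int nr_pages = TOY_SWAP_CLUSTER_MAX;

    assert(pool != NULL && passes != NULL);
    *passes = 0;
    do {
        ++*passes;
        nr_pages = toy_shrink(priority, nr_pages, pool);
        if (nr_pages <= 0)
            return 1;
    } while (--priority);
    return 0;   /* the kernel would call out_of_memory() before this */
}
```

With a large pool the first pass is enough. With a nearly empty pool all six passes run and the function reports failure.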

J.6  Swapping Out Process Pages

This section covers the path where too many process mapped pages have been found in the LRU lists. This path will start scanning whole processes and reclaiming the mapped pages.

J.6.1  Function: swap_out

Source: mm/vmscan.c

The call graph for this function is shown in Figure 10.5. This function linearly searches through every process's page tables trying to swap out SWAP_CLUSTER_MAX number of pages. The process it starts with is the swap_mm and the starting address is mm->swap_address.

296 static int swap_out(unsigned int priority, unsigned int gfp_mask, 
            zone_t * classzone)
297 {
298     int counter, nr_pages = SWAP_CLUSTER_MAX;
299     struct mm_struct *mm;
301     counter = mmlist_nr;
302     do {
303         if (unlikely(current->need_resched)) {
304             __set_current_state(TASK_RUNNING);
305             schedule();
306         }
308         spin_lock(&mmlist_lock);
309         mm = swap_mm;
310         while (mm->swap_address == TASK_SIZE || mm == &init_mm) {
311             mm->swap_address = 0;
312             mm = list_entry(mm->mmlist.next, 
                        struct mm_struct, mmlist);
313             if (mm == swap_mm)
314                 goto empty;
315             swap_mm = mm;
316         }
318         /* Make sure the mm doesn't disappear 
             when we drop the lock.. */
319         atomic_inc(&mm->mm_users);
320         spin_unlock(&mmlist_lock);
322         nr_pages = swap_out_mm(mm, nr_pages, &counter, classzone);
324         mmput(mm);
326         if (!nr_pages)
327             return 1;
328     } while (--counter >= 0);
330     return 0;
332 empty:
333     spin_unlock(&mmlist_lock);
334     return 0;
335 }
301Set the counter so the process list is only scanned once
303-306Reschedule if the quantum has been used up to prevent CPU hogging
308Acquire the lock protecting the mm list
309Start with the swap_mm. It is interesting this is never checked to make sure it is valid. It is possible, albeit unlikely that the process with the mm has exited since the last scan and the slab holding the mm_struct has been reclaimed during a cache shrink making the pointer totally invalid. The lack of bug reports might be because the slab rarely gets reclaimed and would be difficult to trigger in reality
310-316Move to the next process if the swap_address has reached the TASK_SIZE or if the mm is the init_mm
311Start at the beginning of the process space
312Get the mm for this process
313-314If it is the same, there are no running processes that can be examined
315Record the swap_mm for the next pass
319Increase the reference count so that the mm does not get freed while we are scanning
320Release the mm lock
322Begin scanning the mm with swap_out_mm()(See Section J.6.2)
324Drop the reference to the mm
326-327If the required number of pages has been freed, return success
328If we failed on this pass, decrement the counter and try the next mm. The loop ends once every mm on the list has been given a chance
330Return failure
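
The selection loop on lines 309-316 can be modelled with a plain circular list. This is a user-space sketch only. struct toy_mm, its is_init field and toy_pick_mm() are hypothetical, and no locking or reference counting is shown.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TASK_SIZE 0xC0000000UL

/* Hypothetical cut-down mm_struct on a circular singly linked list */
struct toy_mm {
    unsigned long swap_address;
    int is_init;               /* plays the role of mm == &init_mm */
    struct toy_mm *next;
};

/*
 * Mirrors lines 309-316. Skips mms that were fully scanned or that are
 * the init mm, resetting swap_address on the way. Returns NULL for the
 * "empty" case where the list holds nothing worth scanning.
 */
static struct toy_mm *toy_pick_mm(struct toy_mm **swap_mm)
{
    struct toy_mm *mm;

    assert(swap_mm != NULL && *swap_mm != NULL);
    mm = *swap_mm;
    while (mm->swap_address == TOY_TASK_SIZE || mm->is_init) {
        mm->swap_address = 0;  /* next scan starts from the beginning */
        mm = mm->next;
        if (mm == *swap_mm)
            return NULL;
        *swap_mm = mm;         /* remembered for the next pass */
    }
    return mm;
}
```

Because swap_address is reset as each mm is skipped, a second trip around the list finds scannable mms again. The NULL case in this model only arises when the list contains a single entry.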

J.6.2  Function: swap_out_mm

Source: mm/vmscan.c

Walk through each VMA and call swap_out_vma() for each one.

256 static inline int swap_out_mm(struct mm_struct * mm, int count, 
                  int * mmcounter, zone_t * classzone)
257 {
258     unsigned long address;
259     struct vm_area_struct* vma;
265     spin_lock(&mm->page_table_lock);
266     address = mm->swap_address;
267     if (address == TASK_SIZE || swap_mm != mm) {
268         /* We raced: don't count this mm but try again */
269         ++*mmcounter;
270         goto out_unlock;
271     }
272     vma = find_vma(mm, address);
273     if (vma) {
274         if (address < vma->vm_start)
275             address = vma->vm_start;
277         for (;;) {
278             count = swap_out_vma(mm, vma, address, 
                         count, classzone);
279             vma = vma->vm_next;
280             if (!vma)
281                 break;
282             if (!count)
283                 goto out_unlock;
284             address = vma->vm_start;
285         }
286     }
287     /* Indicate that we reached the end of address space */
288     mm->swap_address = TASK_SIZE;
290 out_unlock:
291     spin_unlock(&mm->page_table_lock);
292     return count;
293 }
265Acquire the page table lock for this mm
266Start with the address contained in swap_address
267-271If the address is TASK_SIZE or this mm is no longer the swap_mm, it means that another thread raced and scanned this process already. Increase mmcounter so that swap_out() does not count this mm and moves on to another process
272Find the VMA for this address
273Presuming a VMA was found then ....
274-275Start at the beginning of the VMA
277-285Scan through this and each subsequent VMA calling swap_out_vma() (See Section J.6.3) for each one. If the requisite number of pages (count) is freed, then finish scanning and return
288Once the last VMA has been scanned, set swap_address to TASK_SIZE so that this process will be skipped over by swap_out_mm() next time
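
How the starting address for each VMA is chosen can be shown with a toy list. The sketch below is not kernel code: struct toy_vma, toy_find_vma() and toy_scan_starts() are hypothetical, and toy_find_vma() assumes the same contract as find_vma(), namely returning the first VMA that ends above the address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down vm_area_struct on an address-ordered list */
struct toy_vma {
    unsigned long vm_start, vm_end;
    struct toy_vma *vm_next;
};

/* First VMA whose end lies above addr, or NULL */
static struct toy_vma *toy_find_vma(struct toy_vma *head, unsigned long addr)
{
    for (; head; head = head->vm_next)
        if (addr < head->vm_end)
            return head;
    return NULL;
}

/*
 * Mirrors lines 272-285 without the early exit on count. Records the
 * address at which each VMA scan would begin and returns how many VMAs
 * would be visited.
 */
static int toy_scan_starts(struct toy_vma *head, unsigned long swap_address,
                           unsigned long *starts, int max)
{
    struct toy_vma *vma = toy_find_vma(head, swap_address);
    unsigned long address = swap_address;
    int n = 0;

    assert(starts != NULL);
    if (!vma)
        return 0;
    if (address < vma->vm_start)
        address = vma->vm_start;   /* swap_address fell into a hole */
    for (;;) {
        if (n < max)
            starts[n] = address;   /* swap_out_vma() would run here */
        n++;
        vma = vma->vm_next;
        if (!vma)
            break;
        address = vma->vm_start;
    }
    return n;
}
```

Only the first VMA can be entered part way through. Every later VMA is scanned from its vm_start.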

J.6.3  Function: swap_out_vma

Source: mm/vmscan.c

Walk through this VMA and for each PGD in it, call swap_out_pgd().

227 static inline int swap_out_vma(struct mm_struct * mm, 
                   struct vm_area_struct * vma, 
                   unsigned long address, int count, 
                   zone_t * classzone)
228 {
229     pgd_t *pgdir;
230     unsigned long end;
232     /* Don't swap out areas which are reserved */
233     if (vma->vm_flags & VM_RESERVED)
234         return count;
236     pgdir = pgd_offset(mm, address);
238     end = vma->vm_end;
239     BUG_ON(address >= end);
240     do {
241         count = swap_out_pgd(mm, vma, pgdir, 
                     address, end, count, classzone);
242         if (!count)
243             break;
244         address = (address + PGDIR_SIZE) & PGDIR_MASK;
245         pgdir++;
246     } while (address && (address < end));
247     return count;
248 }
233-234Skip over this VMA if the VM_RESERVED flag is set. This is used by some device drivers such as the SCSI generic driver
236Get the starting PGD for the address
238-239Mark where the end is and call BUG() if the starting address is somehow past the end
240Cycle through PGDs until the end address is reached
241Call swap_out_pgd()(See Section J.6.4) keeping count of how many more pages need to be freed
242-243If enough pages have been freed, break and return
244-245Move to the next PGD and move the address to the next PGD aligned address
247Return the remaining number of pages to be freed
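
The address arithmetic on line 244 and the reason for the address && test on line 246 can be checked with a small user-space model. It assumes a 32-bit address space where each PGD entry maps 4MiB, as on the x86 without PAE. The TOY_ names and count_pgd_steps() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed two-level x86 layout: each PGD entry maps 4MiB */
#define TOY_PGDIR_SHIFT 22
#define TOY_PGDIR_SIZE  ((uint32_t)1 << TOY_PGDIR_SHIFT)
#define TOY_PGDIR_MASK  (~(TOY_PGDIR_SIZE - 1))

/* Number of PGD entries the loop on lines 240-246 would visit */
static int count_pgd_steps(uint32_t address, uint32_t end)
{
    int steps = 0;

    assert(address < end);   /* plays the role of BUG_ON() on line 239 */
    do {
        steps++;             /* swap_out_pgd() would be called here */
        /* Round up to the start of the next PGD entry */
        address = (uint32_t)(address + TOY_PGDIR_SIZE) & TOY_PGDIR_MASK;
    } while (address && address < end);
    return steps;
}
```

In the last PGD entry of the address space the addition wraps around to 0. Without the address && test, 0 would compare as less than end and the loop would start again from the bottom of memory.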

J.6.4  Function: swap_out_pgd

Source: mm/vmscan.c

Step through all PMDs in the supplied PGD and call swap_out_pmd() for each.

197 static inline int swap_out_pgd(struct mm_struct * mm, 
                   struct vm_area_struct * vma, pgd_t *dir, 
                   unsigned long address, unsigned long end, 
                   int count, zone_t * classzone)
198 {
199     pmd_t * pmd;
200     unsigned long pgd_end;
202     if (pgd_none(*dir))
203         return count;
204     if (pgd_bad(*dir)) {
205         pgd_ERROR(*dir);
206         pgd_clear(dir);
207         return count;
208     }
210     pmd = pmd_offset(dir, address);
212     pgd_end = (address + PGDIR_SIZE) & PGDIR_MASK;  
213     if (pgd_end && (end > pgd_end))
214         end = pgd_end;
216     do {
217         count = swap_out_pmd(mm, vma, pmd, 
                                 address, end, count, classzone);
218         if (!count)
219             break;
220         address = (address + PMD_SIZE) & PMD_MASK;
221         pmd++;
222     } while (address && (address < end));
223     return count;
224 }
202-203If there is no PGD, return
204-208If the PGD is bad, flag it as such and return
210Get the starting PMD
212-214Calculate the end to be the end of this PGD or the end of the VMA being scanned, whichever is closer
216-222For each PMD in this PGD, call swap_out_pmd() (See Section J.6.5). If enough pages get freed, break and return
223Return the number of pages remaining to be freed

J.6.5  Function: swap_out_pmd

Source: mm/vmscan.c

For each PTE in this PMD, call try_to_swap_out(). On completion, mm->swap_address is updated to show where we finished to prevent the same page being examined soon after this scan.

158 static inline int swap_out_pmd(struct mm_struct * mm, 
                   struct vm_area_struct * vma, pmd_t *dir, 
                   unsigned long address, unsigned long end, 
                   int count, zone_t * classzone)
159 {
160     pte_t * pte;
161     unsigned long pmd_end;
163     if (pmd_none(*dir))
164         return count;
165     if (pmd_bad(*dir)) {
166         pmd_ERROR(*dir);
167         pmd_clear(dir);
168         return count;
169     }
171     pte = pte_offset(dir, address);
173     pmd_end = (address + PMD_SIZE) & PMD_MASK;
174     if (end > pmd_end)
175         end = pmd_end;
177     do {
178         if (pte_present(*pte)) {
179             struct page *page = pte_page(*pte);
181             if (VALID_PAGE(page) && !PageReserved(page)) {
182                 count -= try_to_swap_out(mm, vma, 
                                 address, pte, 
                                 page, classzone);
183                 if (!count) {
184                     address += PAGE_SIZE;
185                     break;
186                 }
187             }
188         }
189         address += PAGE_SIZE;
190         pte++;
191     } while (address && (address < end));
192     mm->swap_address = address;
193     return count;
194 }
163-164Return if there is no PMD
165-169If the PMD is bad, flag it as such and return
171Get the starting PTE
173-175Calculate the end to be the end of the PMD or the end of the VMA, whichever is closer
177-191Cycle through each PTE
178Make sure the PTE is marked present
179Get the struct page for this PTE
181If it is a valid page and it is not reserved then ...
182Call try_to_swap_out()
183-186If enough pages have been swapped out, move the address to the next page and break to return
189-190Move to the next page and PTE
192Update the swap_address to show where we last finished off
193Return the number of pages remaining to be freed
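
The way swap_address lets a later scan resume where this one stopped can be modelled with an array standing in for the PTEs. This is a user-space sketch under simplifying assumptions: a non-zero array entry means the PTE is present, and every present page is treated as if try_to_swap_out() returned 1. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL

/* Hypothetical holder for the resume point */
struct toy_scan {
    unsigned long swap_address;
};

/*
 * Mirrors lines 177-193. present[i] describes the page at
 * base + i * TOY_PAGE_SIZE. Returns the pages still to be freed.
 */
static int toy_scan_ptes(const int *present, unsigned long base,
                         unsigned long address, unsigned long end,
                         int count, struct toy_scan *mm)
{
    const int *pte;

    assert(address >= base && address < end);
    pte = present + (address - base) / TOY_PAGE_SIZE;
    do {
        if (*pte) {
            count -= 1;                    /* pretend the page was freed */
            if (!count) {
                address += TOY_PAGE_SIZE;  /* resume after this page */
                break;
            }
        }
        address += TOY_PAGE_SIZE;
        pte++;
    } while (address && address < end);
    mm->swap_address = address;
    return count;
}
```

The extra address increment before the break is what stops the next scan from starting on the page that was just handled.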

J.6.6  Function: try_to_swap_out

Source: mm/vmscan.c

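This function tries to swap out a page from a process. It is quite a large function so it will be dealt with in parts. Broadly speaking, they are the preamble that checks whether the page should be swapped out, the removal of the PTE from the page tables, the case where the page is already in the swap cache, the cases where the page can simply be dropped or must be preserved, and the case where the page is added to the swap cache.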

 47 static inline int try_to_swap_out(struct mm_struct * mm, 
                    struct vm_area_struct* vma, 
                    unsigned long address, 
                    pte_t * page_table, 
                    struct page *page, 
                    zone_t * classzone)
 48 {
 49     pte_t pte;
 50     swp_entry_t entry;
 52     /* Don't look at this pte if it's been accessed recently. */
 53     if ((vma->vm_flags & VM_LOCKED) ||
        ptep_test_and_clear_young(page_table)) {
 54         mark_page_accessed(page);
 55         return 0;
 56     }
 58     /* Don't bother unmapping pages that are active */
 59     if (PageActive(page))
 60         return 0;
 62     /* Don't bother replenishing zones not under pressure.. */
 63     if (!memclass(page_zone(page), classzone))
 64         return 0;
 66     if (TryLockPage(page))
 67         return 0;
53-56If the VMA is locked in memory (VM_LOCKED) or the PTE shows the page has been accessed recently, then call mark_page_accessed() (See Section J.2.3.1) to make the struct page reflect the age. ptep_test_and_clear_young() clears the referenced bit in the PTE as it tests it. Return 0 to show it was not swapped out
59-60If the page is on the active_list, do not swap it out
63-64If the page belongs to a zone we are not interested in, do not swap it out
66-67If the page is already locked for IO, skip it
 74     flush_cache_page(vma, address);
 75     pte = ptep_get_and_clear(page_table);
 76     flush_tlb_page(vma, address);
 78     if (pte_dirty(pte))
 79         set_page_dirty(page);
74Call the architecture hook to flush this page from all CPUs
75Get the PTE from the page tables and clear it
76Call the architecture hook to flush the TLB
78-79If the PTE was marked dirty, mark the struct page dirty so it will be laundered correctly
 86     if (PageSwapCache(page)) {
 87         entry.val = page->index;
 88         swap_duplicate(entry);
 89 set_swap_pte:
 90         set_pte(page_table, swp_entry_to_pte(entry));
 91 drop_pte:
 92         mm->rss--;
 93         UnlockPage(page);
 94         {
 95             int freeable = 
                 page_count(page) - !!page->buffers <= 2;
 96             page_cache_release(page);
 97             return freeable;
 98         }
 99     }

Handle the case where the page is already in the swap cache

86Enter this block only if the page is already in the swap cache. Note that it can also be entered by calling goto to the set_swap_pte and drop_pte labels
87-88Fill in the index value for the swap entry. swap_duplicate() verifies the swap identifier is valid and increases the counter in the swap_map if it is
90Fill the PTE with information needed to get the page from swap
92Update RSS to show there is one less page being mapped by the process
93Unlock the page
95The page is freeable if its reference count, after discounting one reference when buffers are attached, is 2 or less. If the count is higher, it is either being mapped by other processes or is a file-backed page and the “user” is the page cache
96Decrement the reference count and free the page if it reaches 0. Note that if this is a file-backed page, it will not reach 0 even if there are no processes mapping it. The page will be later reclaimed from the page cache by shrink_cache() (See Section J.4.1)
97Return whether the page was considered freeable
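
The test on line 95 is compact, so a user-space restatement may help. toy_freeable() is a hypothetical helper that takes the reference count and the buffers pointer as plain arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Same expression as line 95, applied to plain arguments */
static int toy_freeable(int page_count, const void *buffers)
{
    assert(page_count >= 0);
    /* !!buffers is 1 when buffers are attached and 0 otherwise */
    return page_count - !!buffers <= 2;
}
```

A page with a count of 3 is therefore still treated as freeable when one of those references is accounted for by attached buffers.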
115     if (page->mapping)
116         goto drop_pte;
117     if (!PageDirty(page))
118         goto drop_pte;
124     if (page->buffers)
125         goto preserve;
115-116If the page has an associated mapping, simply drop it from the page tables. When no processes are mapping it, it will be reclaimed from the page cache by shrink_cache()
117-118If the page is clean, it is safe to simply drop it
124-125If it has associated buffers due to a truncate followed by a page fault, then re-attach the page and PTE to the page tables as it cannot be handled yet
127     /*
128      * This is a dirty, swappable page.  First of all,
129      * get a suitable swap entry for it, and make sure
130      * we have the swap cache set up to associate the
131      * page with that swap entry.
132      */
133     for (;;) {
134         entry = get_swap_page();
135         if (!entry.val)
136             break;
137         /* Add it to the swap cache and mark it dirty
138          * (adding to the page cache will clear the dirty
139          * and uptodate bits, so we need to do it again)
140          */
141         if (add_to_swap_cache(page, entry) == 0) {
142             SetPageUptodate(page);
143             set_page_dirty(page);
144             goto set_swap_pte;
145         }
146         /* Raced with "speculative" read_swap_cache_async */
147         swap_free(entry);
148     }
150     /* No swap space left */
151 preserve:
152     set_pte(page_table, pte);
153     UnlockPage(page);
154     return 0;
155 }
134Allocate a swap entry for this page
135-136If one could not be allocated, break out where the PTE and page will be re-attached to the process page tables
141Add the page to the swap cache
142Mark the page as up to date in memory
143Mark the page dirty so that it will be written out to swap soon
144Goto set_swap_pte which will update the PTE with information needed to get the page from swap later
147If the add to swap cache failed, it means that the page was placed in the swap cache already by a readahead so drop the work done here
152Reattach the PTE to the page tables
153Unlock the page
154Return that no page was freed

J.7  Page Swap Daemon

This section details the main loops used by the kswapd daemon which is woken up when memory is low. The main functions covered are the ones that determine if kswapd can sleep and how it determines which nodes need balancing.

J.7.1  Initialising kswapd

J.7.1.1  Function: kswapd_init

Source: mm/vmscan.c

Start the kswapd kernel thread

767 static int __init kswapd_init(void)
768 {
769     printk("Starting kswapd\n");
770     swap_setup();
771     kernel_thread(kswapd, NULL, CLONE_FS 
                                  | CLONE_FILES 
                                  | CLONE_SIGNAL);
772     return 0;
773 }
770swap_setup()(See Section K.4.2) sets up how many pages will be prefetched when reading from backing storage based on the amount of physical memory
771Start the kswapd kernel thread

J.7.2  kswapd Daemon

J.7.2.1  Function: kswapd

Source: mm/vmscan.c

The main function of the kswapd kernel thread.

720 int kswapd(void *unused)
721 {
722     struct task_struct *tsk = current;
723     DECLARE_WAITQUEUE(wait, tsk);
725     daemonize();
726     strcpy(tsk->comm, "kswapd");
727     sigfillset(&tsk->blocked);
741     tsk->flags |= PF_MEMALLOC;
746     for (;;) {
747         __set_current_state(TASK_INTERRUPTIBLE);
748         add_wait_queue(&kswapd_wait, &wait);
750         mb();
751         if (kswapd_can_sleep())
752             schedule();
754         __set_current_state(TASK_RUNNING);
755         remove_wait_queue(&kswapd_wait, &wait);
762         kswapd_balance();
763         run_task_queue(&tq_disk);
764     }
765 }
725Call daemonize() which will make this a kernel thread, remove the mm context, close all files and re-parent the process
726Set the name of the process
727Ignore all signals
741By setting this flag, the physical page allocator will always try to satisfy requests for pages. As this process will always be trying to free pages, it is worth satisfying requests
746-764Endlessly loop
747-748This adds kswapd to the wait queue in preparation to sleep
750The memory barrier function (mb()) ensures that all reads and writes that occurred before this line will be visible to all CPUs
751kswapd_can_sleep()(See Section J.7.2.2) cycles through all nodes and zones checking the need_balance field. If any of them are set to 1, kswapd cannot sleep
752By calling schedule(), kswapd will now sleep until woken again by the physical page allocator in __alloc_pages() (See Section F.1.3)
754-755Once woken up, kswapd is removed from the wait queue as it is now running
762kswapd_balance()(See Section J.7.2.4) cycles through all zones and calls try_to_free_pages_zone()(See Section J.5.3) for each zone that requires balance
763Run the IO task queue to start writing data out to disk

J.7.2.2  Function: kswapd_can_sleep

Source: mm/vmscan.c

Simple function to cycle through all pgdats to call kswapd_can_sleep_pgdat() on each.

695 static int kswapd_can_sleep(void)
696 {
697     pg_data_t * pgdat;
699     for_each_pgdat(pgdat) {
700         if (!kswapd_can_sleep_pgdat(pgdat))
701             return 0;
702     }
704     return 1;
705 }
699-702for_each_pgdat() does exactly as the name implies. It cycles through all available pgdats and in this case calls kswapd_can_sleep_pgdat() (See Section J.7.2.3) for each. On the x86, there will only be one pgdat

J.7.2.3  Function: kswapd_can_sleep_pgdat

Source: mm/vmscan.c

Cycles through all zones to make sure none of them need balance. The zone->need_balance flag is set by __alloc_pages() when the number of free pages in the zone reaches the pages_low watermark.

680 static int kswapd_can_sleep_pgdat(pg_data_t * pgdat)
681 {
682     zone_t * zone;
683     int i;
685     for (i = pgdat->nr_zones-1; i >= 0; i--) {
686         zone = pgdat->node_zones + i;
687         if (!zone->need_balance)
688             continue;
689         return 0;
690     }
692     return 1;
693 }
685-689Simple for loop to cycle through all zones
686The node_zones field is an array of all available zones so adding i gives the index
687-688If the zone does not need balance, continue
689Return 0 if any zone needs balance, indicating kswapd cannot sleep
692Return indicating kswapd can sleep if the for loop completes
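
Taken together, kswapd_can_sleep() and kswapd_can_sleep_pgdat() amount to a nested search for any zone with need_balance set. The following user-space sketch shows that shape. The toy_ structures are hypothetical cut-down versions of pg_data_t and zone_t, and a plain linked list stands in for for_each_pgdat().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ZONES 3

struct toy_zone {
    int need_balance;
};

struct toy_node {
    struct toy_zone node_zones[TOY_MAX_ZONES];
    int nr_zones;
    struct toy_node *node_next;
};

/* Per-node check: 0 as soon as one zone needs balance */
static int toy_can_sleep_node(const struct toy_node *node)
{
    int i;

    assert(node->nr_zones <= TOY_MAX_ZONES);
    for (i = node->nr_zones - 1; i >= 0; i--)
        if (node->node_zones[i].need_balance)
            return 0;
    return 1;
}

/* System-wide check: every node must agree that sleeping is fine */
static int toy_can_sleep(const struct toy_node *list)
{
    const struct toy_node *node;

    for (node = list; node; node = node->node_next)
        if (!toy_can_sleep_node(node))
            return 0;
    return 1;
}
```

A single unbalanced zone on any node is enough to keep the daemon awake.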

J.7.2.4  Function: kswapd_balance

Source: mm/vmscan.c

Continuously cycle through each pgdat until none require balancing

667 static void kswapd_balance(void)
668 {
669     int need_more_balance;
670     pg_data_t * pgdat;
672     do {
673         need_more_balance = 0;
675         for_each_pgdat(pgdat)
676             need_more_balance |= kswapd_balance_pgdat(pgdat);
677     } while (need_more_balance);
678 }
672-677Cycle through all pgdats until none of them report that they need balancing
675For each pgdat, call kswapd_balance_pgdat() to check if the node requires balancing. If any node required balancing, need_more_balance will be set to 1

J.7.2.5  Function: kswapd_balance_pgdat

Source: mm/vmscan.c

This function will check if a node requires balance by examining each of the zones in it. If any zone requires balancing, try_to_free_pages_zone() will be called.

641 static int kswapd_balance_pgdat(pg_data_t * pgdat)
642 {
643     int need_more_balance = 0, i;
644     zone_t * zone;
646     for (i = pgdat->nr_zones-1; i >= 0; i--) {
647         zone = pgdat->node_zones + i;
648         if (unlikely(current->need_resched))
649             schedule();
650         if (!zone->need_balance)
651             continue;
652         if (!try_to_free_pages_zone(zone, GFP_KSWAPD)) {
653             zone->need_balance = 0;
654             __set_current_state(TASK_INTERRUPTIBLE);
655             schedule_timeout(HZ);
656             continue;
657         }
658         if (check_classzone_need_balance(zone))
659             need_more_balance = 1;
660         else
661             zone->need_balance = 0;
662     }
664     return need_more_balance;
665 }
646-662Cycle through each zone and call try_to_free_pages_zone() (See Section J.5.3) if it needs re-balancing
647node_zones is an array and i is an index within it
648-649Call schedule() if the quantum is expired to prevent kswapd hogging the CPU
650-651If the zone does not require balance, move to the next one
652-657If the function returns 0, it means the out_of_memory() function was called because a sufficient number of pages could not be freed. kswapd sleeps for 1 second to give the system a chance to reclaim the killed process's pages and perform IO. The zone is marked as balanced so kswapd will ignore this zone until the allocator function __alloc_pages() complains again
658-661If it was successful, check_classzone_need_balance() is called to see if the zone requires further balancing or not
664Return 1 if any zone requires further balancing
