From 82f9d486e59f588c7d100865c36510644abda356 Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Tue, 26 Jul 2011 16:08:26 -0700 Subject: memcg: add memory.vmscan_stat The commit log of 0ae5e89c60c9 ("memcg: count the soft_limit reclaim in...") says it adds scanning stats to memory.stat file. But it doesn't because we considered we needed to make a concensus for such new APIs. This patch is a trial to add memory.scan_stat. This shows - the number of scanned pages(total, anon, file) - the number of rotated pages(total, anon, file) - the number of freed pages(total, anon, file) - the number of elaplsed time (including sleep/pause time) for both of direct/soft reclaim. The biggest difference with oringinal Ying's one is that this file can be reset by some write, as # echo 0 ...../memory.scan_stat Example of output is here. This is a result after make -j 6 kernel under 300M limit. [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat scanned_pages_by_limit 9471864 scanned_anon_pages_by_limit 6640629 scanned_file_pages_by_limit 2831235 rotated_pages_by_limit 4243974 rotated_anon_pages_by_limit 3971968 rotated_file_pages_by_limit 272006 freed_pages_by_limit 2318492 freed_anon_pages_by_limit 962052 freed_file_pages_by_limit 1356440 elapsed_ns_by_limit 351386416101 scanned_pages_by_system 0 scanned_anon_pages_by_system 0 scanned_file_pages_by_system 0 rotated_pages_by_system 0 rotated_anon_pages_by_system 0 rotated_file_pages_by_system 0 freed_pages_by_system 0 freed_anon_pages_by_system 0 freed_file_pages_by_system 0 elapsed_ns_by_system 0 scanned_pages_by_limit_under_hierarchy 9471864 scanned_anon_pages_by_limit_under_hierarchy 6640629 scanned_file_pages_by_limit_under_hierarchy 2831235 rotated_pages_by_limit_under_hierarchy 4243974 rotated_anon_pages_by_limit_under_hierarchy 3971968 rotated_file_pages_by_limit_under_hierarchy 272006 freed_pages_by_limit_under_hierarchy 2318492 freed_anon_pages_by_limit_under_hierarchy 962052 freed_file_pages_by_limit_under_hierarchy 1356440 elapsed_ns_by_limit_under_hierarchy 351386416101 scanned_pages_by_system_under_hierarchy 0 scanned_anon_pages_by_system_under_hierarchy 0 scanned_file_pages_by_system_under_hierarchy 0 rotated_pages_by_system_under_hierarchy 0 rotated_anon_pages_by_system_under_hierarchy 0 rotated_file_pages_by_system_under_hierarchy 0 freed_pages_by_system_under_hierarchy 0 freed_anon_pages_by_system_under_hierarchy 0 freed_file_pages_by_system_under_hierarchy 0 elapsed_ns_by_system_under_hierarchy 0 total_xxxx is for hierarchy management. This will be useful for further memcg developments and need to be developped before we do some complicated rework on LRU/softlimit management. This patch adds a new struct memcg_scanrecord into scan_control struct. sc->nr_scanned at el is not designed for exporting information. For example, nr_scanned is reset frequentrly and incremented +2 at scanning mapped pages. To avoid complexity, I added a new param in scan_control which is for exporting scanning score. Signed-off-by: KAMEZAWA Hiroyuki Cc: Daisuke Nishimura Cc: Michal Hocko Cc: Ying Han Cc: Andrew Bresticker Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 85 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 84 insertions(+), 1 deletion(-) (limited to 'Documentation/cgroups/memory.txt') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 06eb6d9..6f3c598 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -380,7 +380,7 @@ will be charged as a new owner of it. 5.2 stat file -memory.stat file includes following statistics +5.2.1 memory.stat file includes following statistics # per-memory cgroup local status cache - # of bytes of page cache memory. @@ -438,6 +438,89 @@ Note: file_mapped is accounted only when the memory cgroup is owner of page cache.) +5.2.2 memory.vmscan_stat + +memory.vmscan_stat includes statistics information for memory scanning and +freeing, reclaiming. The statistics shows memory scanning information since +memory cgroup creation and can be reset to 0 by writing 0 as + + #echo 0 > ../memory.vmscan_stat + +This file contains following statistics. + +[param]_[file_or_anon]_pages_by_[reason]_[under_heararchy] +[param]_elapsed_ns_by_[reason]_[under_hierarchy] + +For example, + + scanned_file_pages_by_limit indicates the number of scanned + file pages at vmscan. + +Now, 3 parameters are supported + + scanned - the number of pages scanned by vmscan + rotated - the number of pages activated at vmscan + freed - the number of pages freed by vmscan + +If "rotated" is high against scanned/freed, the memcg seems busy. + +Now, 2 reason are supported + + limit - the memory cgroup's limit + system - global memory pressure + softlimit + (global memory pressure not under softlimit is not handled now) + +When under_hierarchy is added in the tail, the number indicates the +total memcg scan of its children and itself. + +elapsed_ns is a elapsed time in nanosecond. This may include sleep time +and not indicates CPU usage. So, please take this as just showing +latency. + +Here is an example. + +# cat /cgroup/memory/A/memory.vmscan_stat +scanned_pages_by_limit 9471864 +scanned_anon_pages_by_limit 6640629 +scanned_file_pages_by_limit 2831235 +rotated_pages_by_limit 4243974 +rotated_anon_pages_by_limit 3971968 +rotated_file_pages_by_limit 272006 +freed_pages_by_limit 2318492 +freed_anon_pages_by_limit 962052 +freed_file_pages_by_limit 1356440 +elapsed_ns_by_limit 351386416101 +scanned_pages_by_system 0 +scanned_anon_pages_by_system 0 +scanned_file_pages_by_system 0 +rotated_pages_by_system 0 +rotated_anon_pages_by_system 0 +rotated_file_pages_by_system 0 +freed_pages_by_system 0 +freed_anon_pages_by_system 0 +freed_file_pages_by_system 0 +elapsed_ns_by_system 0 +scanned_pages_by_limit_under_hierarchy 9471864 +scanned_anon_pages_by_limit_under_hierarchy 6640629 +scanned_file_pages_by_limit_under_hierarchy 2831235 +rotated_pages_by_limit_under_hierarchy 4243974 +rotated_anon_pages_by_limit_under_hierarchy 3971968 +rotated_file_pages_by_limit_under_hierarchy 272006 +freed_pages_by_limit_under_hierarchy 2318492 +freed_anon_pages_by_limit_under_hierarchy 962052 +freed_file_pages_by_limit_under_hierarchy 1356440 +elapsed_ns_by_limit_under_hierarchy 351386416101 +scanned_pages_by_system_under_hierarchy 0 +scanned_anon_pages_by_system_under_hierarchy 0 +scanned_file_pages_by_system_under_hierarchy 0 +rotated_pages_by_system_under_hierarchy 0 +rotated_anon_pages_by_system_under_hierarchy 0 +rotated_file_pages_by_system_under_hierarchy 0 +freed_pages_by_system_under_hierarchy 0 +freed_anon_pages_by_system_under_hierarchy 0 +freed_file_pages_by_system_under_hierarchy 0 +elapsed_ns_by_system_under_hierarchy 0 + 5.3 swappiness Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. -- cgit v1.1 From 185efc0f9a1f2d6ad6d4782c5d9e529f3290567f Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 14 Sep 2011 16:21:58 -0700 Subject: memcg: Revert "memcg: add memory.vmscan_stat" Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add memory.vmscan_stat"). The implementation of per-memcg reclaim statistics violates how memcg hierarchies usually behave: hierarchically. The reclaim statistics are accounted to child memcgs and the parent hitting the limit, but not to hierarchy levels in between. Usually, hierarchical statistics are perfectly recursive, with each level representing the sum of itself and all its children. Since this exports statistics to userspace, this may lead to confusion and problems with changing things after the release, so revert it now, we can try again later. Signed-off-by: Johannes Weiner Acked-by: KAMEZAWA Hiroyuki Cc: Daisuke Nishimura Cc: Michal Hocko Cc: Ying Han Cc: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 85 +--------------------------------------- 1 file changed, 1 insertion(+), 84 deletions(-) (limited to 'Documentation/cgroups/memory.txt') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 6f3c598..06eb6d9 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -380,7 +380,7 @@ will be charged as a new owner of it. 5.2 stat file -5.2.1 memory.stat file includes following statistics +memory.stat file includes following statistics # per-memory cgroup local status cache - # of bytes of page cache memory. @@ -438,89 +438,6 @@ Note: file_mapped is accounted only when the memory cgroup is owner of page cache.) -5.2.2 memory.vmscan_stat - -memory.vmscan_stat includes statistics information for memory scanning and -freeing, reclaiming. The statistics shows memory scanning information since -memory cgroup creation and can be reset to 0 by writing 0 as - - #echo 0 > ../memory.vmscan_stat - -This file contains following statistics. - -[param]_[file_or_anon]_pages_by_[reason]_[under_heararchy] -[param]_elapsed_ns_by_[reason]_[under_hierarchy] - -For example, - - scanned_file_pages_by_limit indicates the number of scanned - file pages at vmscan. - -Now, 3 parameters are supported - - scanned - the number of pages scanned by vmscan - rotated - the number of pages activated at vmscan - freed - the number of pages freed by vmscan - -If "rotated" is high against scanned/freed, the memcg seems busy. - -Now, 2 reason are supported - - limit - the memory cgroup's limit - system - global memory pressure + softlimit - (global memory pressure not under softlimit is not handled now) - -When under_hierarchy is added in the tail, the number indicates the -total memcg scan of its children and itself. - -elapsed_ns is a elapsed time in nanosecond. This may include sleep time -and not indicates CPU usage. So, please take this as just showing -latency. - -Here is an example. - -# cat /cgroup/memory/A/memory.vmscan_stat -scanned_pages_by_limit 9471864 -scanned_anon_pages_by_limit 6640629 -scanned_file_pages_by_limit 2831235 -rotated_pages_by_limit 4243974 -rotated_anon_pages_by_limit 3971968 -rotated_file_pages_by_limit 272006 -freed_pages_by_limit 2318492 -freed_anon_pages_by_limit 962052 -freed_file_pages_by_limit 1356440 -elapsed_ns_by_limit 351386416101 -scanned_pages_by_system 0 -scanned_anon_pages_by_system 0 -scanned_file_pages_by_system 0 -rotated_pages_by_system 0 -rotated_anon_pages_by_system 0 -rotated_file_pages_by_system 0 -freed_pages_by_system 0 -freed_anon_pages_by_system 0 -freed_file_pages_by_system 0 -elapsed_ns_by_system 0 -scanned_pages_by_limit_under_hierarchy 9471864 -scanned_anon_pages_by_limit_under_hierarchy 6640629 -scanned_file_pages_by_limit_under_hierarchy 2831235 -rotated_pages_by_limit_under_hierarchy 4243974 -rotated_anon_pages_by_limit_under_hierarchy 3971968 -rotated_file_pages_by_limit_under_hierarchy 272006 -freed_pages_by_limit_under_hierarchy 2318492 -freed_anon_pages_by_limit_under_hierarchy 962052 -freed_file_pages_by_limit_under_hierarchy 1356440 -elapsed_ns_by_limit_under_hierarchy 351386416101 -scanned_pages_by_system_under_hierarchy 0 -scanned_anon_pages_by_system_under_hierarchy 0 -scanned_file_pages_by_system_under_hierarchy 0 -rotated_pages_by_system_under_hierarchy 0 -rotated_anon_pages_by_system_under_hierarchy 0 -rotated_file_pages_by_system_under_hierarchy 0 -freed_pages_by_system_under_hierarchy 0 -freed_anon_pages_by_system_under_hierarchy 0 -freed_file_pages_by_system_under_hierarchy 0 -elapsed_ns_by_system_under_hierarchy 0 - 5.3 swappiness Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. -- cgit v1.1 From 9b272977e3b99a8699361d214b51f98c8a9e0e7b Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 2 Nov 2011 13:38:23 -0700 Subject: memcg: skip scanning active lists based on individual size Reclaim decides to skip scanning an active list when the corresponding inactive list is above a certain size in comparison to leave the assumed working set alone while there are still enough reclaim candidates around. The memcg implementation of comparing those lists instead reports whether the whole memcg is low on the requested type of inactive pages, considering all nodes and zones. This can lead to an oversized active list not being scanned because of the state of the other lists in the memcg, as well as an active list being scanned while its corresponding inactive list has enough pages. Not only is this wrong, it's also a scalability hazard, because the global memory state over all nodes and zones has to be gathered for each memcg and zone scanned. Make these calculations purely based on the size of the two LRU lists that are actually affected by the outcome of the decision. Signed-off-by: Johannes Weiner Reviewed-by: Rik van Riel Cc: KOSAKI Motohiro Acked-by: KAMEZAWA Hiroyuki Cc: Daisuke Nishimura Cc: Balbir Singh Reviewed-by: Minchan Kim Reviewed-by: Ying Han Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 1 - 1 file changed, 1 deletion(-) (limited to 'Documentation/cgroups/memory.txt') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 06eb6d9..cc0ebc5 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -418,7 +418,6 @@ total_unevictable - sum of all children's "unevictable" # The following additional stats are dependent on CONFIG_DEBUG_VM. -inactive_ratio - VM internal parameter. (see mm/page_alloc.c) recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) recent_rotated_file - VM internal parameter. (see mm/vmscan.c) recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) -- cgit v1.1 From a5db678c3b407681a285d2ea7d9460bc956c2e0a Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 16 Nov 2012 14:14:49 -0800 Subject: memcg: oom: fix totalpages calculation for memory.swappiness==0 commit 9a5a8f19b43430752067ecaee62fc59e11e88fa6 upstream. oom_badness() takes a totalpages argument which says how many pages are available and it uses it as a base for the score calculation. The value is calculated by mem_cgroup_get_limit which considers both limit and total_swap_pages (resp. memsw portion of it). This is usually correct but since fe35004fbf9e ("mm: avoid swapping out with swappiness==0") we do not swap when swappiness is 0 which means that we cannot really use up all the totalpages pages. This in turn confuses oom score calculation if the memcg limit is much smaller than the available swap because the used memory (capped by the limit) is negligible comparing to totalpages so the resulting score is too small if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj). A wrong process might be selected as result. The problem can be worked around by checking mem_cgroup_swappiness==0 and not considering swap at all in such a case. Signed-off-by: Michal Hocko Acked-by: David Rientjes Acked-by: Johannes Weiner Acked-by: KOSAKI Motohiro Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings --- Documentation/cgroups/memory.txt | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'Documentation/cgroups/memory.txt') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index cc0ebc5..fd129f6 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -440,6 +440,10 @@ Note: 5.3 swappiness Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. +Please note that unlike the global swappiness, memcg knob set to 0 +really prevents from any swapping even if there is a swap storage +available. This might lead to memcg OOM killer if there are no file +pages to reclaim. Following cgroups' swappiness can't be changed. - root cgroup (uses /proc/sys/vm/swappiness). -- cgit v1.1