From 44ee63587dce85593c22497140db16f4e5027860 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Wed, 17 Feb 2010 10:50:50 +0900 Subject: percpu: Add __percpu sparse annotations to hw_breakpoint Add __percpu sparse annotations to hw_breakpoint. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. In kernel/hw_breakpoint.c, per_cpu(nr_task_bp_pinned, cpu)'s will trigger spurious noderef related warnings from sparse. Changing it to &per_cpu(nr_task_bp_pinned[0], cpu) will work around the problem but deemed to ugly by the maintainer. Leave it alone until better solution can be found. Signed-off-by: Tejun Heo Cc: Stephen Rothwell Cc: K.Prasad LKML-Reference: <4B7B4B7A.9050902@kernel.org> Signed-off-by: Frederic Weisbecker --- kernel/hw_breakpoint.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/hw_breakpoint.c b/kernel/hw_breakpoint.c index 967e661..6542eac 100644 --- a/kernel/hw_breakpoint.c +++ b/kernel/hw_breakpoint.c @@ -413,17 +413,17 @@ EXPORT_SYMBOL_GPL(unregister_hw_breakpoint); * * @return a set of per_cpu pointers to perf events */ -struct perf_event ** +struct perf_event * __percpu * register_wide_hw_breakpoint(struct perf_event_attr *attr, perf_overflow_handler_t triggered) { - struct perf_event **cpu_events, **pevent, *bp; + struct perf_event * __percpu *cpu_events, **pevent, *bp; long err; int cpu; cpu_events = alloc_percpu(typeof(*cpu_events)); if (!cpu_events) - return ERR_PTR(-ENOMEM); + return (void __percpu __force *)ERR_PTR(-ENOMEM); get_online_cpus(); for_each_online_cpu(cpu) { @@ -451,7 +451,7 @@ fail: put_online_cpus(); free_percpu(cpu_events); - return ERR_PTR(err); + return (void __percpu __force *)ERR_PTR(err); } EXPORT_SYMBOL_GPL(register_wide_hw_breakpoint); @@ -459,7 +459,7 @@ EXPORT_SYMBOL_GPL(register_wide_hw_breakpoint); * unregister_wide_hw_breakpoint - unregister a wide breakpoint in the kernel * @cpu_events: the per cpu set of events to unregister */ -void unregister_wide_hw_breakpoint(struct perf_event **cpu_events) +void unregister_wide_hw_breakpoint(struct perf_event * __percpu *cpu_events) { int cpu; struct perf_event **pevent; -- cgit v1.1 From 622ea685f1fafdf84d612440535c84341f0860b8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sat, 27 Feb 2010 14:53:07 -0800 Subject: rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU Make the holdoff only happen when the full number of attempts have been made. Signed-off-by: Paul E. McKenney Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267311188-16603-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar --- kernel/rcutree_plugin.h | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h index 464ad2c..79b53bd 100644 --- a/kernel/rcutree_plugin.h +++ b/kernel/rcutree_plugin.h @@ -1010,6 +1010,10 @@ int rcu_needs_cpu(int cpu) int c = 0; int thatcpu; + /* Check for being in the holdoff period. */ + if (per_cpu(rcu_dyntick_holdoff, cpu) == jiffies) + return rcu_needs_cpu_quick_check(cpu); + /* Don't bother unless we are the last non-dyntick-idle CPU. */ for_each_cpu_not(thatcpu, nohz_cpu_mask) if (thatcpu != cpu) { @@ -1041,10 +1045,8 @@ int rcu_needs_cpu(int cpu) } /* If RCU callbacks are still pending, RCU still needs this CPU. */ - if (c) { + if (c) raise_softirq(RCU_SOFTIRQ); - per_cpu(rcu_dyntick_holdoff, cpu) = jiffies; - } return c; } -- cgit v1.1 From ae1f30384baef4056438d81b305a6a5199b0d16c Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Sun, 28 Feb 2010 19:42:38 +0100 Subject: tracing: Include irqflags headers from trace clock trace_clock.c includes spinlock.h, which ends up including asm/system.h, which in turn includes linux/irqflags.h in x86. So the definition of raw_local_irq_save is luckily covered there, but this is not the case in parisc: tip/kernel/trace/trace_clock.c:86: error: implicit declaration of function 'raw_local_irq_save' tip/kernel/trace/trace_clock.c:112: error: implicit declaration of function 'raw_local_irq_restore' We need to include linux/irqflags.h directly from trace_clock.c to avoid such build error. Signed-off-by: Frederic Weisbecker Cc: Steven Rostedt Cc: Robert Richter Cc: Peter Zijlstra Signed-off-by: Ingo Molnar --- kernel/trace/trace_clock.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/trace/trace_clock.c b/kernel/trace/trace_clock.c index 84a3a7b..6fbfb8f 100644 --- a/kernel/trace/trace_clock.c +++ b/kernel/trace/trace_clock.c @@ -13,6 +13,7 @@ * Tracer plugins will chose a default from these clocks. */ #include +#include #include #include #include -- cgit v1.1 From 1e259e0a9982078896f3404240096cbea01daca4 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Sun, 28 Feb 2010 20:51:15 +0100 Subject: hw-breakpoints: Remove stub unthrottle callback We support event unthrottling in breakpoint events. It means that if we have more than sysctl_perf_event_sample_rate/HZ, perf will throttle, ignoring subsequent events until the next tick. So if ptrace exceeds this max rate, it will omit events, which breaks the ptrace determinism that is supposed to report every triggered breakpoints. This is likely to happen if we set sysctl_perf_event_sample_rate to 1. This patch removes support for unthrottling in breakpoint events to break throttling and restore ptrace determinism. Signed-off-by: Frederic Weisbecker Cc: 2.6.33.x Cc: Peter Zijlstra Cc: K.Prasad Cc: Paul Mackerras --- kernel/hw_breakpoint.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/hw_breakpoint.c b/kernel/hw_breakpoint.c index 967e661..4d99512 100644 --- a/kernel/hw_breakpoint.c +++ b/kernel/hw_breakpoint.c @@ -489,5 +489,4 @@ struct pmu perf_ops_bp = { .enable = arch_install_hw_breakpoint, .disable = arch_uninstall_hw_breakpoint, .read = hw_breakpoint_pmu_read, - .unthrottle = hw_breakpoint_pmu_unthrottle }; -- cgit v1.1 From 90a6501f94aedd7fb40f5556334843194fb598be Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 28 Feb 2010 08:32:18 -0800 Subject: sched, rcu: Fix rcu_dereference() for RCU-lockdep Make rcu_dereference() of runqueue data structures be rcu_dereference_sched(). Located-by: Ingo Molnar Signed-off-by: Paul E. McKenney Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20100228163218.GD6846@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar --- kernel/sched_fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 3e1fd96..5a5ea2c 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -3476,7 +3476,7 @@ static void run_rebalance_domains(struct softirq_action *h) static inline int on_null_domain(int cpu) { - return !rcu_dereference(cpu_rq(cpu)->sd); + return !rcu_dereference_sched(cpu_rq(cpu)->sd); } /* -- cgit v1.1 From 320ebf09cbb6d01954c9a060266aa8e0d27f4638 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 2 Mar 2010 12:35:37 +0100 Subject: perf, x86: Restrict the ANY flag The ANY flag can show SMT data of another task (like 'top'), so we want to disable it when system-wide profiling is disabled. Signed-off-by: Peter Zijlstra LKML-Reference: Signed-off-by: Ingo Molnar --- kernel/perf_event.c | 15 --------------- 1 file changed, 15 deletions(-) (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index a661e79..482d5e1 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -56,21 +56,6 @@ static atomic_t nr_task_events __read_mostly; */ int sysctl_perf_event_paranoid __read_mostly = 1; -static inline bool perf_paranoid_tracepoint_raw(void) -{ - return sysctl_perf_event_paranoid > -1; -} - -static inline bool perf_paranoid_cpu(void) -{ - return sysctl_perf_event_paranoid > 0; -} - -static inline bool perf_paranoid_kernel(void) -{ - return sysctl_perf_event_paranoid > 1; -} - int sysctl_perf_event_mlock __read_mostly = 512; /* 'free' kb per user */ /* -- cgit v1.1 From ac91d85456372a90af5b85eb6620fd2efb1e431b Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Tue, 2 Mar 2010 17:54:50 +0800 Subject: tracing: Fix warning in s_next of trace file ops This warning in s_next() can be triggered by lseek(): [] ? s_next+0x77/0x80 [] warn_slowpath_common+0x81/0xa0 [] ? s_next+0x77/0x80 [] warn_slowpath_null+0x1a/0x20 [] s_next+0x77/0x80 [] traverse+0x117/0x200 [] seq_lseek+0xa3/0x120 [] ? seq_lseek+0x0/0x120 [] vfs_llseek+0x41/0x50 [] sys_llseek+0x66/0xa0 [] sysenter_do_call+0x12/0x26 The iterator "leftover" variable is zeroed in the opening of the trace file. But lseek can call s_start() which will call s_next() without reseting the "leftover" variable back to zero, which might trigger the WARN_ON_ONCE(iter->leftover) that is in s_next(). Cc: stable@kernel.org Signed-off-by: Lai Jiangshan LKML-Reference: <4B8CE06A.9090207@cn.fujitsu.com> Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 032c57c..5edf410 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1703,6 +1703,7 @@ static void *s_start(struct seq_file *m, loff_t *pos) ftrace_enable_cpu(); + iter->leftover = 0; for (p = iter; p && l < *pos; p = s_next(m, p, &l)) ; -- cgit v1.1 From db1466b3e1bd1727375cdbfcbea4bcce2f860f61 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 3 Mar 2010 07:46:56 -0800 Subject: rcu: Use wrapper function instead of exporting tasklist_lock Lockdep-RCU commit d11c563d exported tasklist_lock, which is not a good thing. This patch instead exports a function that uses lockdep to check whether tasklist_lock is held. Suggested-by: Christoph Hellwig Signed-off-by: Paul E. McKenney Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: Christoph Hellwig LKML-Reference: <1267631219-8713-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar --- kernel/exit.c | 2 +- kernel/fork.c | 9 ++++++++- kernel/pid.c | 4 +++- 3 files changed, 12 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 45ed043..fed3a4d 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -87,7 +87,7 @@ static void __exit_signal(struct task_struct *tsk) sighand = rcu_dereference_check(tsk->sighand, rcu_read_lock_held() || - lockdep_is_held(&tasklist_lock)); + lockdep_tasklist_lock_is_held()); spin_lock(&sighand->siglock); posix_cpu_timers_exit(tsk); diff --git a/kernel/fork.c b/kernel/fork.c index 17bbf09..8691c54 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -86,7 +86,14 @@ int max_threads; /* tunable limit on nr_threads */ DEFINE_PER_CPU(unsigned long, process_counts) = 0; __cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */ -EXPORT_SYMBOL_GPL(tasklist_lock); + +#ifdef CONFIG_PROVE_RCU +int lockdep_tasklist_lock_is_held(void) +{ + return lockdep_is_held(&tasklist_lock); +} +EXPORT_SYMBOL_GPL(lockdep_tasklist_lock_is_held); +#endif /* #ifdef CONFIG_PROVE_RCU */ int nr_processes(void) { diff --git a/kernel/pid.c b/kernel/pid.c index b08e697..b606440 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -367,7 +367,9 @@ struct task_struct *pid_task(struct pid *pid, enum pid_type type) struct task_struct *result = NULL; if (pid) { struct hlist_node *first; - first = rcu_dereference_check(pid->tasks[type].first, rcu_read_lock_held() || lockdep_is_held(&tasklist_lock)); + first = rcu_dereference_check(pid->tasks[type].first, + rcu_read_lock_held() || + lockdep_tasklist_lock_is_held()); if (first) result = hlist_entry(first, struct task_struct, pids[(type)].node); } -- cgit v1.1 From cc5b83a9f884fe8722a275069a5a6fde39988455 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 3 Mar 2010 07:46:59 -0800 Subject: rcu: Add control variables to lockdep_rcu_dereference() diagnostics Add the values of rcu_scheduler_active() and debug_locks() to the lockdep_rcu_dereference() output to help diagnose RCU lockdep splats that occur shortly after the scheduler starts. Signed-off-by: Paul E. McKenney Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267631219-8713-4-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar --- kernel/lockdep.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/lockdep.c b/kernel/lockdep.c index 0c30d04..681bc2e 100644 --- a/kernel/lockdep.c +++ b/kernel/lockdep.c @@ -3822,6 +3822,7 @@ void lockdep_rcu_dereference(const char *file, const int line) printk("%s:%d invoked rcu_dereference_check() without protection!\n", file, line); printk("\nother info that might help us debug this:\n\n"); + printk("\nrcu_scheduler_active = %d, debug_locks = %d\n", rcu_scheduler_active, debug_locks); lockdep_print_held_locks(curr); printk("\nstack backtrace:\n"); dump_stack(); -- cgit v1.1 From 8d53dd546f36073e0d29b0cfc24c665db301e3e7 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 3 Mar 2010 17:50:18 -0800 Subject: rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare() Change the pair of rcu_dereference() calls in ftrace_perf_buf_prepare() to rcu_dereference_sched(). Signed-off-by: Paul E. McKenney Acked-by: Frederic Weisbecker Cc: Steven Rostedt Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: Frederic Weisbecker LKML-Reference: <1267667418-32233-3-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar --- kernel/trace/trace_event_profile.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_event_profile.c b/kernel/trace/trace_event_profile.c index f0d6930..c1cc3ab 100644 --- a/kernel/trace/trace_event_profile.c +++ b/kernel/trace/trace_event_profile.c @@ -138,9 +138,9 @@ __kprobes void *ftrace_perf_buf_prepare(int size, unsigned short type, cpu = smp_processor_id(); if (in_nmi()) - trace_buf = rcu_dereference(perf_trace_buf_nmi); + trace_buf = rcu_dereference_sched(perf_trace_buf_nmi); else - trace_buf = rcu_dereference(perf_trace_buf); + trace_buf = rcu_dereference_sched(perf_trace_buf); if (!trace_buf) goto err; -- cgit v1.1 From 801c29fd1fdeb84f60241beb445ff5db154450ae Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Fri, 5 Mar 2010 20:02:19 -0500 Subject: function-graph: Fix unused reference to ftrace_set_func() The declaration of ftrace_set_func() is at the start of the ftrace.c file and wrapped with a #ifdef CONFIG_FUNCTION_GRAPH condition. If function graph tracing is enabled but CONFIG_DYNAMIC_FTRACE is not, a warning about that function being declared static and unused is given. This really should have been placed within the CONFIG_FUNCTION_GRAPH condition that uses ftrace_set_func(). Moving the declaration down fixes the warning and makes the code cleaner. Reported-by: Peter Zijlstra Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index d996353..d0407c9 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -85,10 +85,6 @@ ftrace_func_t ftrace_trace_function __read_mostly = ftrace_stub; ftrace_func_t __ftrace_trace_function __read_mostly = ftrace_stub; ftrace_func_t ftrace_pid_function __read_mostly = ftrace_stub; -#ifdef CONFIG_FUNCTION_GRAPH_TRACER -static int ftrace_set_func(unsigned long *array, int *idx, char *buffer); -#endif - static void ftrace_list_func(unsigned long ip, unsigned long parent_ip) { struct ftrace_ops *op = ftrace_list; @@ -2300,6 +2296,8 @@ __setup("ftrace_filter=", set_ftrace_filter); #ifdef CONFIG_FUNCTION_GRAPH_TRACER static char ftrace_graph_buf[FTRACE_FILTER_SIZE] __initdata; +static int ftrace_set_func(unsigned long *array, int *idx, char *buffer); + static int __init set_graph_function(char *str) { strlcpy(ftrace_graph_buf, str, FTRACE_FILTER_SIZE); -- cgit v1.1 From a094fe04c751698a18c3a0d376a3bdb117f1e0d8 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Fri, 5 Mar 2010 20:08:58 -0500 Subject: function-graph: Use comment notation for func names of dangling '}' When a '}' does not have a matching function start, the name is printed within parenthesis. But this makes it confusing between ending '}' and function starts. This patch makes the function name appear in C comment notation. Old view: 3) 1.281 us | } (might_fault) 3) 3.620 us | } (filldir) 3) 5.251 us | } (call_filldir) 3) | call_filldir() { 3) | filldir() { New view: 3) 1.281 us | } /* might_fault */ 3) 3.620 us | } /* filldir */ 3) 5.251 us | } /* call_filldir */ 3) | call_filldir() { 3) | filldir() { Requested-by: Ingo Molnar Signed-off-by: Steven Rostedt --- kernel/trace/trace_functions_graph.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index e998a82..7b1f246 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -920,7 +920,7 @@ print_graph_return(struct ftrace_graph_ret *trace, struct trace_seq *s, if (!ret) return TRACE_TYPE_PARTIAL_LINE; } else { - ret = trace_seq_printf(s, "} (%ps)\n", (void *)trace->func); + ret = trace_seq_printf(s, "} /* %ps */\n", (void *)trace->func); if (!ret) return TRACE_TYPE_PARTIAL_LINE; } -- cgit v1.1 From 1acaa1b2d9b5904c9cce06122990a2d71046ce16 Mon Sep 17 00:00:00 2001 From: Arnaldo Carvalho de Melo Date: Fri, 5 Mar 2010 18:23:50 -0300 Subject: tracing: Update the comm field in the right variable in update_max_tr The latency output showed: # | task: -3 (uid:0 nice:0 policy:1 rt_prio:99) The comm is missing in the "task:" and it looks like a minus 3 is the output. The correct display should be: # | task: migration/0-3 (uid:0 nice:0 policy:1 rt_prio:99) The problem is that the comm is being stored in the wrong data structure. The max_tr.data[cpu] is what stores the comm, not the tr->data[cpu]. Before this patch the max_tr.data[cpu]->comm was zeroed and the /debug/trace ended up showing just the '-' sign followed by the pid. Also remove a needless initialization of max_data. Signed-off-by: Arnaldo Carvalho de Melo LKML-Reference: <1267824230-23861-1-git-send-email-acme@infradead.org> Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 032c57c..6efd5cb 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -592,7 +592,7 @@ static void __update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu) { struct trace_array_cpu *data = tr->data[cpu]; - struct trace_array_cpu *max_data = tr->data[cpu]; + struct trace_array_cpu *max_data; max_tr.cpu = cpu; max_tr.time_start = data->preempt_timestamp; @@ -602,7 +602,7 @@ __update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu) max_data->critical_start = data->critical_start; max_data->critical_end = data->critical_end; - memcpy(data->comm, tsk->comm, TASK_COMM_LEN); + memcpy(max_data->comm, tsk->comm, TASK_COMM_LEN); max_data->pid = tsk->pid; max_data->uid = task_uid(tsk); max_data->nice = tsk->static_prio - 20 - MAX_RT_PRIO; -- cgit v1.1 From 0e95017355dcf43031da6d0e360a748717e56df1 Mon Sep 17 00:00:00 2001 From: Tim Bird Date: Thu, 25 Feb 2010 15:36:43 -0800 Subject: function-graph: Add tracing_thresh support to function_graph tracer Add support for tracing_thresh to the function_graph tracer. This version of this feature isolates the checks into new entry and return functions, to avoid adding more conditional code into the main function_graph paths. When the tracing_thresh is set and the function graph tracer is enabled, only the functions that took longer than the time in microseconds that was set in tracing_thresh are recorded. To do this efficiently, only the function exits are recorded: [tracing]# echo 100 > tracing_thresh [tracing]# echo function_graph > current_tracer [tracing]# cat trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) ! 119.214 us | } /* smp_apic_timer_interrupt */ 1) <========== | 0) ! 101.527 us | } /* __rcu_process_callbacks */ 0) ! 126.461 us | } /* rcu_process_callbacks */ 0) ! 145.111 us | } /* __do_softirq */ 0) ! 149.667 us | } /* do_softirq */ 0) ! 168.817 us | } /* irq_exit */ 0) ! 248.254 us | } /* smp_apic_timer_interrupt */ Also, add support for specifying tracing_thresh on the kernel command line. When used like so: "tracing_thresh=200 ftrace=function_graph" this can be used to analyse system startup. It is important to disable tracing soon after boot, in order to avoid losing the trace data. Acked-by: Frederic Weisbecker Signed-off-by: Tim Bird LKML-Reference: <4B87098B.4040308@am.sony.com> Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 20 ++++++++++++++++++-- kernel/trace/trace.h | 3 ++- kernel/trace/trace_functions_graph.c | 25 +++++++++++++++++++++++-- 3 files changed, 43 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 6efd5cb..ababedb 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -374,6 +374,21 @@ static int __init set_buf_size(char *str) } __setup("trace_buf_size=", set_buf_size); +static int __init set_tracing_thresh(char *str) +{ + unsigned long threshhold; + int ret; + + if (!str) + return 0; + ret = strict_strtoul(str, 0, &threshhold); + if (ret < 0) + return 0; + tracing_thresh = threshhold * 1000; + return 1; +} +__setup("tracing_thresh=", set_tracing_thresh); + unsigned long nsecs_to_usecs(unsigned long nsecs) { return nsecs / 1000; @@ -579,9 +594,10 @@ static ssize_t trace_seq_to_buffer(struct trace_seq *s, void *buf, size_t cnt) static arch_spinlock_t ftrace_max_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; +unsigned long __read_mostly tracing_thresh; + #ifdef CONFIG_TRACER_MAX_TRACE unsigned long __read_mostly tracing_max_latency; -unsigned long __read_mostly tracing_thresh; /* * Copy the new maximum trace into the separate maximum-trace @@ -4248,10 +4264,10 @@ static __init int tracer_init_debugfs(void) #ifdef CONFIG_TRACER_MAX_TRACE trace_create_file("tracing_max_latency", 0644, d_tracer, &tracing_max_latency, &tracing_max_lat_fops); +#endif trace_create_file("tracing_thresh", 0644, d_tracer, &tracing_thresh, &tracing_max_lat_fops); -#endif trace_create_file("README", 0444, d_tracer, NULL, &tracing_readme_fops); diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index fd05bca..1bc8cd1 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -396,9 +396,10 @@ extern int process_new_ksym_entry(char *ksymname, int op, unsigned long addr); extern unsigned long nsecs_to_usecs(unsigned long nsecs); +extern unsigned long tracing_thresh; + #ifdef CONFIG_TRACER_MAX_TRACE extern unsigned long tracing_max_latency; -extern unsigned long tracing_thresh; void update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu); void update_max_tr_single(struct trace_array *tr, diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index 7b1f246..e9df04b 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -237,6 +237,14 @@ int trace_graph_entry(struct ftrace_graph_ent *trace) return ret; } +int trace_graph_thresh_entry(struct ftrace_graph_ent *trace) +{ + if (tracing_thresh) + return 1; + else + return trace_graph_entry(trace); +} + static void __trace_graph_return(struct trace_array *tr, struct ftrace_graph_ret *trace, unsigned long flags, @@ -290,13 +298,26 @@ void set_graph_array(struct trace_array *tr) smp_mb(); } +void trace_graph_thresh_return(struct ftrace_graph_ret *trace) +{ + if (tracing_thresh && + (trace->rettime - trace->calltime < tracing_thresh)) + return; + else + trace_graph_return(trace); +} + static int graph_trace_init(struct trace_array *tr) { int ret; set_graph_array(tr); - ret = register_ftrace_graph(&trace_graph_return, - &trace_graph_entry); + if (tracing_thresh) + ret = register_ftrace_graph(&trace_graph_thresh_return, + &trace_graph_thresh_entry); + else + ret = register_ftrace_graph(&trace_graph_return, + &trace_graph_entry); if (ret) return ret; tracing_start_cmdline_record(); -- cgit v1.1 From dc1d628a67a8f042e711ea5accc0beedc3ef0092 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 3 Mar 2010 15:55:04 +0100 Subject: perf: Provide generic perf_sample_data initialization This makes it easier to extend perf_sample_data and fixes a bug on arm and sparc, which failed to set ->raw to NULL, which can cause crashes when combined with PERF_SAMPLE_RAW. It also optimizes PowerPC and tracepoint, because the struct initialization is forced to zero out the whole structure. Signed-off-by: Peter Zijlstra Acked-by: Jean Pihet Reviewed-by: Frederic Weisbecker Acked-by: David S. Miller Cc: Jamie Iles Cc: Paul Mackerras Cc: Stephane Eranian Cc: stable@kernel.org LKML-Reference: <20100304140100.315416040@chello.nl> Signed-off-by: Ingo Molnar --- kernel/perf_event.c | 21 ++++++++------------- 1 file changed, 8 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index e687450..4393b9e 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -4108,8 +4108,7 @@ void __perf_sw_event(u32 event_id, u64 nr, int nmi, if (rctx < 0) return; - data.addr = addr; - data.raw = NULL; + perf_sample_data_init(&data, addr); do_perf_sw_event(PERF_TYPE_SOFTWARE, event_id, nr, nmi, &data, regs); @@ -4154,11 +4153,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer) struct perf_event *event; u64 period; - event = container_of(hrtimer, struct perf_event, hw.hrtimer); + event = container_of(hrtimer, struct perf_event, hw.hrtimer); event->pmu->read(event); - data.addr = 0; - data.raw = NULL; + perf_sample_data_init(&data, 0); data.period = event->hw.last_period; regs = get_irq_regs(); /* @@ -4322,17 +4320,15 @@ static const struct pmu perf_ops_task_clock = { void perf_tp_event(int event_id, u64 addr, u64 count, void *record, int entry_size) { + struct pt_regs *regs = get_irq_regs(); + struct perf_sample_data data; struct perf_raw_record raw = { .size = entry_size, .data = record, }; - struct perf_sample_data data = { - .addr = addr, - .raw = &raw, - }; - - struct pt_regs *regs = get_irq_regs(); + perf_sample_data_init(&data, addr); + data.raw = &raw; if (!regs) regs = task_pt_regs(current); @@ -4448,8 +4444,7 @@ void perf_bp_event(struct perf_event *bp, void *data) struct perf_sample_data sample; struct pt_regs *regs = data; - sample.raw = NULL; - sample.addr = bp->attr.bp_addr; + perf_sample_data_init(&sample, bp->attr.bp_addr); if (!perf_exclude_event(bp, regs)) perf_swevent_add(bp, 1, 1, &sample, regs); -- cgit v1.1 From 3f6da3905398826d85731247e7fbcf53400c18bd Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 5 Mar 2010 13:01:18 +0100 Subject: perf: Rework and fix the arch CPU-hotplug hooks Remove the hw_perf_event_*() hotplug hooks in favour of per PMU hotplug notifiers. This has the advantage of reducing the static weak interface as well as exposing all hotplug actions to the PMU. Use this to fix x86 hotplug usage where we did things in ONLINE which should have been done in UP_PREPARE or STARTING. Signed-off-by: Peter Zijlstra Cc: Paul Mundt Cc: paulus@samba.org Cc: eranian@google.com Cc: robert.richter@amd.com Cc: fweisbec@gmail.com Cc: Arnaldo Carvalho de Melo LKML-Reference: <20100305154128.736225361@chello.nl> Signed-off-by: Ingo Molnar --- kernel/perf_event.c | 15 --------------- 1 file changed, 15 deletions(-) (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index 4393b9e..73329de 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -81,10 +81,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event) void __weak hw_perf_disable(void) { barrier(); } void __weak hw_perf_enable(void) { barrier(); } -void __weak hw_perf_event_setup(int cpu) { barrier(); } -void __weak hw_perf_event_setup_online(int cpu) { barrier(); } -void __weak hw_perf_event_setup_offline(int cpu) { barrier(); } - int __weak hw_perf_group_sched_in(struct perf_event *group_leader, struct perf_cpu_context *cpuctx, @@ -5382,8 +5378,6 @@ static void __cpuinit perf_event_init_cpu(int cpu) spin_lock(&perf_resource_lock); cpuctx->max_pertask = perf_max_events - perf_reserved_percpu; spin_unlock(&perf_resource_lock); - - hw_perf_event_setup(cpu); } #ifdef CONFIG_HOTPLUG_CPU @@ -5423,20 +5417,11 @@ perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu) perf_event_init_cpu(cpu); break; - case CPU_ONLINE: - case CPU_ONLINE_FROZEN: - hw_perf_event_setup_online(cpu); - break; - case CPU_DOWN_PREPARE: case CPU_DOWN_PREPARE_FROZEN: perf_event_exit_cpu(cpu); break; - case CPU_DEAD: - hw_perf_event_setup_offline(cpu); - break; - default: break; } -- cgit v1.1 From 32975a4f114be52286f9a5bf6c230dbb8c0e1903 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Sat, 6 Mar 2010 19:49:19 +0100 Subject: perf: Optimize perf_disable Currently we always call hw_perf_disable(), even if its already disabled, this seems superflous, esp. since it cannot be made NMI safe (see further patches). Signed-off-by: Peter Zijlstra Cc: paulus@samba.org Cc: eranian@google.com Cc: robert.richter@amd.com Cc: fweisbec@gmail.com Cc: Arnaldo Carvalho de Melo LKML-Reference: Signed-off-by: Ingo Molnar --- kernel/perf_event.c | 16 +++------------- 1 file changed, 3 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index 73329de..d810846 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -93,25 +93,15 @@ void __weak perf_event_print_debug(void) { } static DEFINE_PER_CPU(int, perf_disable_count); -void __perf_disable(void) -{ - __get_cpu_var(perf_disable_count)++; -} - -bool __perf_enable(void) -{ - return !--__get_cpu_var(perf_disable_count); -} - void perf_disable(void) { - __perf_disable(); - hw_perf_disable(); + if (!__get_cpu_var(perf_disable_count)++) + hw_perf_disable(); } void perf_enable(void) { - if (__perf_enable()) + if (!--__get_cpu_var(perf_disable_count)) hw_perf_enable(); } -- cgit v1.1 From d4944a06666054707d23e11888e480af239e5abf Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 8 Mar 2010 13:51:20 +0100 Subject: perf: Provide better condition for event rotation Try to avoid useless rotation and PMU disables. [ Could be improved by keeping a nr_runnable count to better account for the < PERF_STAT_INACTIVE counters ] Signed-off-by: Peter Zijlstra Cc: Arnaldo Carvalho de Melo Cc: paulus@samba.org Cc: eranian@google.com Cc: robert.richter@amd.com Cc: fweisbec@gmail.com LKML-Reference: Signed-off-by: Ingo Molnar --- kernel/perf_event.c | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index d810846..52c69a3 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -1524,12 +1524,15 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx) */ if (interrupts == MAX_INTERRUPTS) { perf_log_throttle(event, 1); + perf_disable(); event->pmu->unthrottle(event); + perf_enable(); } if (!event->attr.freq || !event->attr.sample_freq) continue; + perf_disable(); event->pmu->read(event); now = atomic64_read(&event->count); delta = now - hwc->freq_count_stamp; @@ -1537,6 +1540,7 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx) if (delta > 0) perf_adjust_period(event, TICK_NSEC, delta); + perf_enable(); } raw_spin_unlock(&ctx->lock); } @@ -1546,9 +1550,6 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx) */ static void rotate_ctx(struct perf_event_context *ctx) { - if (!ctx->nr_events) - return; - raw_spin_lock(&ctx->lock); /* Rotate the first entry last of non-pinned groups */ @@ -1561,19 +1562,28 @@ void perf_event_task_tick(struct task_struct *curr) { struct perf_cpu_context *cpuctx; struct perf_event_context *ctx; + int rotate = 0; if (!atomic_read(&nr_events)) return; cpuctx = &__get_cpu_var(perf_cpu_context); - ctx = curr->perf_event_ctxp; + if (cpuctx->ctx.nr_events && + cpuctx->ctx.nr_events != cpuctx->ctx.nr_active) + rotate = 1; - perf_disable(); + ctx = curr->perf_event_ctxp; + if (ctx && ctx->nr_events && ctx->nr_events != ctx->nr_active) + rotate = 1; perf_ctx_adjust_freq(&cpuctx->ctx); if (ctx) perf_ctx_adjust_freq(ctx); + if (!rotate) + return; + + perf_disable(); cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE); if (ctx) task_ctx_sched_out(ctx, EVENT_FLEXIBLE); @@ -1585,7 +1595,6 @@ void perf_event_task_tick(struct task_struct *curr) cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE); if (ctx) task_ctx_sched_in(curr, EVENT_FLEXIBLE); - perf_enable(); } -- cgit v1.1 From db2c4c7791cd04512093d05afc693c3511a65fd7 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 2 Feb 2010 23:34:40 +0100 Subject: lockdep: Move lock events under lockdep recursion protection There are rcu locked read side areas in the path where we submit a trace event. And these rcu_read_(un)lock() trigger lock events, which create recursive events. One pair in do_perf_sw_event: __lock_acquire | |--96.11%-- lock_acquire | | | |--27.21%-- do_perf_sw_event | | perf_tp_event | | | | | |--49.62%-- ftrace_profile_lock_release | | | lock_release | | | | | | | |--33.85%-- _raw_spin_unlock Another pair in perf_output_begin/end: __lock_acquire |--23.40%-- perf_output_begin | | __perf_event_overflow | | perf_swevent_overflow | | perf_swevent_add | | perf_swevent_ctx_event | | do_perf_sw_event | | perf_tp_event | | | | | |--55.37%-- ftrace_profile_lock_acquire | | | lock_acquire | | | | | | | |--37.31%-- _raw_spin_lock The problem is not that much the trace recursion itself, as we have a recursion protection already (though it's always wasteful to recurse). But the trace events are outside the lockdep recursion protection, then each lockdep event triggers a lock trace, which will trigger two other lockdep events. Here the recursive lock trace event won't be taken because of the trace recursion, so the recursion stops there but lockdep will still analyse these new events: To sum up, for each lockdep events we have: lock_*() | trace lock_acquire | ----- rcu_read_lock() | | | lock_acquire() | | | trace_lock_acquire() (stopped) | | | lockdep analyze | ----- rcu_read_unlock() | lock_release | trace_lock_release() (stopped) | lockdep analyze And you can repeat the above two times as we have two rcu read side sections when we submit an event. This is fixed in this patch by moving the lock trace event under the lockdep recursion protection. Signed-off-by: Frederic Weisbecker Cc: Peter Zijlstra Cc: Arnaldo Carvalho de Melo Cc: Steven Rostedt Cc: Paul Mackerras Cc: Hitoshi Mitake Cc: Li Zefan Cc: Lai Jiangshan Cc: Masami Hiramatsu Cc: Jens Axboe --- kernel/lockdep.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/lockdep.c b/kernel/lockdep.c index 0c30d04..65b5f5b 100644 --- a/kernel/lockdep.c +++ b/kernel/lockdep.c @@ -3211,8 +3211,6 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass, { unsigned long flags; - trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock, ip); - if (unlikely(current->lockdep_recursion)) return; @@ -3220,6 +3218,7 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass, check_flags(flags); current->lockdep_recursion = 1; + trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock, ip); __lock_acquire(lock, subclass, trylock, read, check, irqs_disabled_flags(flags), nest_lock, ip, 0); current->lockdep_recursion = 0; @@ -3232,14 +3231,13 @@ void lock_release(struct lockdep_map *lock, int nested, { unsigned long flags; - trace_lock_release(lock, nested, ip); - if (unlikely(current->lockdep_recursion)) return; raw_local_irq_save(flags); check_flags(flags); current->lockdep_recursion = 1; + trace_lock_release(lock, nested, ip); __lock_release(lock, nested, ip); current->lockdep_recursion = 0; raw_local_irq_restore(flags); @@ -3413,8 +3411,6 @@ void lock_contended(struct lockdep_map *lock, unsigned long ip) { unsigned long flags; - trace_lock_contended(lock, ip); - if (unlikely(!lock_stat)) return; @@ -3424,6 +3420,7 @@ void lock_contended(struct lockdep_map *lock, unsigned long ip) raw_local_irq_save(flags); check_flags(flags); current->lockdep_recursion = 1; + trace_lock_contended(lock, ip); __lock_contended(lock, ip); current->lockdep_recursion = 0; raw_local_irq_restore(flags); -- cgit v1.1 From 5331d7b84613b8325362dde53dc2bff2fb87d351 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 4 Mar 2010 21:15:56 +0100 Subject: perf: Introduce new perf_fetch_caller_regs() for hot regs snapshot Events that trigger overflows by interrupting a context can use get_irq_regs() or task_pt_regs() to retrieve the state when the event triggered. But this is not the case for some other class of events like trace events as tracepoints are executed in the same context than the code that triggered the event. It means we need a different api to capture the regs there, namely we need a hot snapshot to get the most important informations for perf: the instruction pointer to get the event origin, the frame pointer for the callchain, the code segment for user_mode() tests (we always use __KERNEL_CS as trace events always occur from the kernel) and the eflags for further purposes. v2: rename perf_save_regs to perf_fetch_caller_regs as per Masami's suggestion. Signed-off-by: Frederic Weisbecker Cc: Ingo Molnar Cc: Thomas Gleixner Cc: H. Peter Anvin Cc: Peter Zijlstra Cc: Paul Mackerras Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Jason Baron Cc: Archs --- kernel/perf_event.c | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index 52c69a3..359d7f6 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -2786,6 +2786,11 @@ __weak struct perf_callchain_entry *perf_callchain(struct pt_regs *regs) return NULL; } +__weak +void perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned long ip, int skip) +{ +} + /* * Output */ -- cgit v1.1 From c530665c31c0140b74ca7689e7f836177796e5bd Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Wed, 3 Mar 2010 07:16:16 +0100 Subject: perf: Take a hot regs snapshot for trace events We are taking a wrong regs snapshot when a trace event triggers. Either we use get_irq_regs(), which gives us the interrupted registers if we are in an interrupt, or we use task_pt_regs() which gives us the state before we entered the kernel, assuming we are lucky enough to be no kernel thread, in which case task_pt_regs() returns the initial set of regs when the kernel thread was started. What we want is different. We need a hot snapshot of the regs, so that we can get the instruction pointer to record in the sample, the frame pointer for the callchain, and some other things. Let's use the new perf_fetch_caller_regs() for that. Comparison with perf record -e lock: -R -a -f -g Before: perf [kernel] [k] __do_softirq | --- __do_softirq | |--55.16%-- __open | --44.84%-- __write_nocancel After: perf [kernel] [k] perf_tp_event | --- perf_tp_event | |--41.07%-- lock_acquire | | | |--39.36%-- _raw_spin_lock | | | | | |--7.81%-- hrtimer_interrupt | | | smp_apic_timer_interrupt | | | apic_timer_interrupt The old case was producing unreliable callchains. Now having right frame and instruction pointers, we have the trace we want. Also syscalls and kprobe events already have the right regs, let's use them instead of wasting a retrieval. v2: Follow the rename perf_save_regs() -> perf_fetch_caller_regs() Signed-off-by: Frederic Weisbecker Cc: Ingo Molnar Cc: Thomas Gleixner Cc: H. Peter Anvin Cc: Peter Zijlstra Cc: Paul Mackerras Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Jason Baron Cc: Archs --- kernel/perf_event.c | 8 ++------ kernel/trace/trace_event_profile.c | 3 ++- kernel/trace/trace_kprobe.c | 5 +++-- kernel/trace/trace_syscalls.c | 4 ++-- 4 files changed, 9 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index 359d7f6..45b4b6e 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -4318,9 +4318,8 @@ static const struct pmu perf_ops_task_clock = { #ifdef CONFIG_EVENT_TRACING void perf_tp_event(int event_id, u64 addr, u64 count, void *record, - int entry_size) + int entry_size, struct pt_regs *regs) { - struct pt_regs *regs = get_irq_regs(); struct perf_sample_data data; struct perf_raw_record raw = { .size = entry_size, @@ -4330,12 +4329,9 @@ void perf_tp_event(int event_id, u64 addr, u64 count, void *record, perf_sample_data_init(&data, addr); data.raw = &raw; - if (!regs) - regs = task_pt_regs(current); - /* Trace events already protected against recursion */ do_perf_sw_event(PERF_TYPE_TRACEPOINT, event_id, count, 1, - &data, regs); + &data, regs); } EXPORT_SYMBOL_GPL(perf_tp_event); diff --git a/kernel/trace/trace_event_profile.c b/kernel/trace/trace_event_profile.c index f0d6930..e66d21e 100644 --- a/kernel/trace/trace_event_profile.c +++ b/kernel/trace/trace_event_profile.c @@ -2,13 +2,14 @@ * trace event based perf counter profiling * * Copyright (C) 2009 Red Hat Inc, Peter Zijlstra - * + * Copyright (C) 2009-2010 Frederic Weisbecker */ #include #include #include "trace.h" +DEFINE_PER_CPU(struct pt_regs, perf_trace_regs); static char *perf_trace_buf; static char *perf_trace_buf_nmi; diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 505c922..f7a20a8 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -1240,7 +1240,7 @@ static __kprobes void kprobe_profile_func(struct kprobe *kp, for (i = 0; i < tp->nr_args; i++) entry->args[i] = call_fetch(&tp->args[i].fetch, regs); - ftrace_perf_buf_submit(entry, size, rctx, entry->ip, 1, irq_flags); + ftrace_perf_buf_submit(entry, size, rctx, entry->ip, 1, irq_flags, regs); } /* Kretprobe profile handler */ @@ -1271,7 +1271,8 @@ static __kprobes void kretprobe_profile_func(struct kretprobe_instance *ri, for (i = 0; i < tp->nr_args; i++) entry->args[i] = call_fetch(&tp->args[i].fetch, regs); - ftrace_perf_buf_submit(entry, size, rctx, entry->ret_ip, 1, irq_flags); + ftrace_perf_buf_submit(entry, size, rctx, entry->ret_ip, 1, + irq_flags, regs); } static int probe_profile_enable(struct ftrace_event_call *call) diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index cba47d7..7e6e84f 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -467,7 +467,7 @@ static void prof_syscall_enter(struct pt_regs *regs, long id) rec->nr = syscall_nr; syscall_get_arguments(current, regs, 0, sys_data->nb_args, (unsigned long *)&rec->args); - ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags); + ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags, regs); } int prof_sysenter_enable(struct ftrace_event_call *call) @@ -542,7 +542,7 @@ static void prof_syscall_exit(struct pt_regs *regs, long ret) rec->nr = syscall_nr; rec->ret = syscall_get_return_value(current, regs); - ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags); + ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags, regs); } int prof_sysexit_enable(struct ftrace_event_call *call) -- cgit v1.1 From 97d5a22005f38057b4bc0d95f81cd26510268794 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 5 Mar 2010 05:35:37 +0100 Subject: perf: Drop the obsolete profile naming for trace events Drop the obsolete "profile" naming used by perf for trace events. Perf can now do more than simple events counting, so generalize the API naming. Signed-off-by: Frederic Weisbecker Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Paul Mackerras Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Jason Baron --- kernel/perf_event.c | 4 +- kernel/trace/Makefile | 2 +- kernel/trace/trace_event_perf.c | 165 +++++++++++++++++++++++++++++++++++++ kernel/trace/trace_event_profile.c | 165 ------------------------------------- kernel/trace/trace_events.c | 2 +- kernel/trace/trace_kprobe.c | 28 +++---- kernel/trace/trace_syscalls.c | 72 ++++++++-------- 7 files changed, 219 insertions(+), 219 deletions(-) create mode 100644 kernel/trace/trace_event_perf.c delete mode 100644 kernel/trace/trace_event_profile.c (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index 45b4b6e..c502b18 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -4347,7 +4347,7 @@ static int perf_tp_event_match(struct perf_event *event, static void tp_perf_event_destroy(struct perf_event *event) { - ftrace_profile_disable(event->attr.config); + perf_trace_disable(event->attr.config); } static const struct pmu *tp_perf_event_init(struct perf_event *event) @@ -4361,7 +4361,7 @@ static const struct pmu *tp_perf_event_init(struct perf_event *event) !capable(CAP_SYS_ADMIN)) return ERR_PTR(-EPERM); - if (ftrace_profile_enable(event->attr.config)) + if (perf_trace_enable(event->attr.config)) return NULL; event->destroy = tp_perf_event_destroy; diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index d00c6fe..78edc64 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -52,7 +52,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_events.o obj-$(CONFIG_EVENT_TRACING) += trace_export.o obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o ifeq ($(CONFIG_PERF_EVENTS),y) -obj-$(CONFIG_EVENT_TRACING) += trace_event_profile.o +obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o endif obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c new file mode 100644 index 0000000..f315b12 --- /dev/null +++ b/kernel/trace/trace_event_perf.c @@ -0,0 +1,165 @@ +/* + * trace event based perf event profiling/tracing + * + * Copyright (C) 2009 Red Hat Inc, Peter Zijlstra + * Copyright (C) 2009-2010 Frederic Weisbecker + */ + +#include +#include +#include "trace.h" + +DEFINE_PER_CPU(struct pt_regs, perf_trace_regs); + +static char *perf_trace_buf; +static char *perf_trace_buf_nmi; + +typedef typeof(char [PERF_MAX_TRACE_SIZE]) perf_trace_t ; + +/* Count the events in use (per event id, not per instance) */ +static int total_ref_count; + +static int perf_trace_event_enable(struct ftrace_event_call *event) +{ + char *buf; + int ret = -ENOMEM; + + if (event->perf_refcount++ > 0) + return 0; + + if (!total_ref_count) { + buf = (char *)alloc_percpu(perf_trace_t); + if (!buf) + goto fail_buf; + + rcu_assign_pointer(perf_trace_buf, buf); + + buf = (char *)alloc_percpu(perf_trace_t); + if (!buf) + goto fail_buf_nmi; + + rcu_assign_pointer(perf_trace_buf_nmi, buf); + } + + ret = event->perf_event_enable(event); + if (!ret) { + total_ref_count++; + return 0; + } + +fail_buf_nmi: + if (!total_ref_count) { + free_percpu(perf_trace_buf_nmi); + free_percpu(perf_trace_buf); + perf_trace_buf_nmi = NULL; + perf_trace_buf = NULL; + } +fail_buf: + event->perf_refcount--; + + return ret; +} + +int perf_trace_enable(int event_id) +{ + struct ftrace_event_call *event; + int ret = -EINVAL; + + mutex_lock(&event_mutex); + list_for_each_entry(event, &ftrace_events, list) { + if (event->id == event_id && event->perf_event_enable && + try_module_get(event->mod)) { + ret = perf_trace_event_enable(event); + break; + } + } + mutex_unlock(&event_mutex); + + return ret; +} + +static void perf_trace_event_disable(struct ftrace_event_call *event) +{ + char *buf, *nmi_buf; + + if (--event->perf_refcount > 0) + return; + + event->perf_event_disable(event); + + if (!--total_ref_count) { + buf = perf_trace_buf; + rcu_assign_pointer(perf_trace_buf, NULL); + + nmi_buf = perf_trace_buf_nmi; + rcu_assign_pointer(perf_trace_buf_nmi, NULL); + + /* + * Ensure every events in profiling have finished before + * releasing the buffers + */ + synchronize_sched(); + + free_percpu(buf); + free_percpu(nmi_buf); + } +} + +void perf_trace_disable(int event_id) +{ + struct ftrace_event_call *event; + + mutex_lock(&event_mutex); + list_for_each_entry(event, &ftrace_events, list) { + if (event->id == event_id) { + perf_trace_event_disable(event); + module_put(event->mod); + break; + } + } + mutex_unlock(&event_mutex); +} + +__kprobes void *perf_trace_buf_prepare(int size, unsigned short type, + int *rctxp, unsigned long *irq_flags) +{ + struct trace_entry *entry; + char *trace_buf, *raw_data; + int pc, cpu; + + pc = preempt_count(); + + /* Protect the per cpu buffer, begin the rcu read side */ + local_irq_save(*irq_flags); + + *rctxp = perf_swevent_get_recursion_context(); + if (*rctxp < 0) + goto err_recursion; + + cpu = smp_processor_id(); + + if (in_nmi()) + trace_buf = rcu_dereference(perf_trace_buf_nmi); + else + trace_buf = rcu_dereference(perf_trace_buf); + + if (!trace_buf) + goto err; + + raw_data = per_cpu_ptr(trace_buf, cpu); + + /* zero the dead bytes from align to not leak stack to user */ + *(u64 *)(&raw_data[size - sizeof(u64)]) = 0ULL; + + entry = (struct trace_entry *)raw_data; + tracing_generic_entry_update(entry, *irq_flags, pc); + entry->type = type; + + return raw_data; +err: + perf_swevent_put_recursion_context(*rctxp); +err_recursion: + local_irq_restore(*irq_flags); + return NULL; +} +EXPORT_SYMBOL_GPL(perf_trace_buf_prepare); diff --git a/kernel/trace/trace_event_profile.c b/kernel/trace/trace_event_profile.c deleted file mode 100644 index e66d21e..0000000 --- a/kernel/trace/trace_event_profile.c +++ /dev/null @@ -1,165 +0,0 @@ -/* - * trace event based perf counter profiling - * - * Copyright (C) 2009 Red Hat Inc, Peter Zijlstra - * Copyright (C) 2009-2010 Frederic Weisbecker - */ - -#include -#include -#include "trace.h" - -DEFINE_PER_CPU(struct pt_regs, perf_trace_regs); - -static char *perf_trace_buf; -static char *perf_trace_buf_nmi; - -typedef typeof(char [FTRACE_MAX_PROFILE_SIZE]) perf_trace_t ; - -/* Count the events in use (per event id, not per instance) */ -static int total_profile_count; - -static int ftrace_profile_enable_event(struct ftrace_event_call *event) -{ - char *buf; - int ret = -ENOMEM; - - if (event->profile_count++ > 0) - return 0; - - if (!total_profile_count) { - buf = (char *)alloc_percpu(perf_trace_t); - if (!buf) - goto fail_buf; - - rcu_assign_pointer(perf_trace_buf, buf); - - buf = (char *)alloc_percpu(perf_trace_t); - if (!buf) - goto fail_buf_nmi; - - rcu_assign_pointer(perf_trace_buf_nmi, buf); - } - - ret = event->profile_enable(event); - if (!ret) { - total_profile_count++; - return 0; - } - -fail_buf_nmi: - if (!total_profile_count) { - free_percpu(perf_trace_buf_nmi); - free_percpu(perf_trace_buf); - perf_trace_buf_nmi = NULL; - perf_trace_buf = NULL; - } -fail_buf: - event->profile_count--; - - return ret; -} - -int ftrace_profile_enable(int event_id) -{ - struct ftrace_event_call *event; - int ret = -EINVAL; - - mutex_lock(&event_mutex); - list_for_each_entry(event, &ftrace_events, list) { - if (event->id == event_id && event->profile_enable && - try_module_get(event->mod)) { - ret = ftrace_profile_enable_event(event); - break; - } - } - mutex_unlock(&event_mutex); - - return ret; -} - -static void ftrace_profile_disable_event(struct ftrace_event_call *event) -{ - char *buf, *nmi_buf; - - if (--event->profile_count > 0) - return; - - event->profile_disable(event); - - if (!--total_profile_count) { - buf = perf_trace_buf; - rcu_assign_pointer(perf_trace_buf, NULL); - - nmi_buf = perf_trace_buf_nmi; - rcu_assign_pointer(perf_trace_buf_nmi, NULL); - - /* - * Ensure every events in profiling have finished before - * releasing the buffers - */ - synchronize_sched(); - - free_percpu(buf); - free_percpu(nmi_buf); - } -} - -void ftrace_profile_disable(int event_id) -{ - struct ftrace_event_call *event; - - mutex_lock(&event_mutex); - list_for_each_entry(event, &ftrace_events, list) { - if (event->id == event_id) { - ftrace_profile_disable_event(event); - module_put(event->mod); - break; - } - } - mutex_unlock(&event_mutex); -} - -__kprobes void *ftrace_perf_buf_prepare(int size, unsigned short type, - int *rctxp, unsigned long *irq_flags) -{ - struct trace_entry *entry; - char *trace_buf, *raw_data; - int pc, cpu; - - pc = preempt_count(); - - /* Protect the per cpu buffer, begin the rcu read side */ - local_irq_save(*irq_flags); - - *rctxp = perf_swevent_get_recursion_context(); - if (*rctxp < 0) - goto err_recursion; - - cpu = smp_processor_id(); - - if (in_nmi()) - trace_buf = rcu_dereference(perf_trace_buf_nmi); - else - trace_buf = rcu_dereference(perf_trace_buf); - - if (!trace_buf) - goto err; - - raw_data = per_cpu_ptr(trace_buf, cpu); - - /* zero the dead bytes from align to not leak stack to user */ - *(u64 *)(&raw_data[size - sizeof(u64)]) = 0ULL; - - entry = (struct trace_entry *)raw_data; - tracing_generic_entry_update(entry, *irq_flags, pc); - entry->type = type; - - return raw_data; -err: - perf_swevent_put_recursion_context(*rctxp); -err_recursion: - local_irq_restore(*irq_flags); - return NULL; -} -EXPORT_SYMBOL_GPL(ftrace_perf_buf_prepare); diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index 3f972ad9..beab8bf 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -938,7 +938,7 @@ event_create_dir(struct ftrace_event_call *call, struct dentry *d_events, trace_create_file("enable", 0644, call->dir, call, enable); - if (call->id && call->profile_enable) + if (call->id && call->perf_event_enable) trace_create_file("id", 0444, call->dir, call, id); diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index f7a20a8..1251e36 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -1214,7 +1214,7 @@ static int set_print_fmt(struct trace_probe *tp) #ifdef CONFIG_PERF_EVENTS /* Kprobe profile handler */ -static __kprobes void kprobe_profile_func(struct kprobe *kp, +static __kprobes void kprobe_perf_func(struct kprobe *kp, struct pt_regs *regs) { struct trace_probe *tp = container_of(kp, struct trace_probe, rp.kp); @@ -1227,11 +1227,11 @@ static __kprobes void kprobe_profile_func(struct kprobe *kp, __size = SIZEOF_KPROBE_TRACE_ENTRY(tp->nr_args); size = ALIGN(__size + sizeof(u32), sizeof(u64)); size -= sizeof(u32); - if (WARN_ONCE(size > FTRACE_MAX_PROFILE_SIZE, + if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE, "profile buffer not large enough")) return; - entry = ftrace_perf_buf_prepare(size, call->id, &rctx, &irq_flags); + entry = perf_trace_buf_prepare(size, call->id, &rctx, &irq_flags); if (!entry) return; @@ -1240,11 +1240,11 @@ static __kprobes void kprobe_profile_func(struct kprobe *kp, for (i = 0; i < tp->nr_args; i++) entry->args[i] = call_fetch(&tp->args[i].fetch, regs); - ftrace_perf_buf_submit(entry, size, rctx, entry->ip, 1, irq_flags, regs); + perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, irq_flags, regs); } /* Kretprobe profile handler */ -static __kprobes void kretprobe_profile_func(struct kretprobe_instance *ri, +static __kprobes void kretprobe_perf_func(struct kretprobe_instance *ri, struct pt_regs *regs) { struct trace_probe *tp = container_of(ri->rp, struct trace_probe, rp); @@ -1257,11 +1257,11 @@ static __kprobes void kretprobe_profile_func(struct kretprobe_instance *ri, __size = SIZEOF_KRETPROBE_TRACE_ENTRY(tp->nr_args); size = ALIGN(__size + sizeof(u32), sizeof(u64)); size -= sizeof(u32); - if (WARN_ONCE(size > FTRACE_MAX_PROFILE_SIZE, + if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE, "profile buffer not large enough")) return; - entry = ftrace_perf_buf_prepare(size, call->id, &rctx, &irq_flags); + entry = perf_trace_buf_prepare(size, call->id, &rctx, &irq_flags); if (!entry) return; @@ -1271,11 +1271,11 @@ static __kprobes void kretprobe_profile_func(struct kretprobe_instance *ri, for (i = 0; i < tp->nr_args; i++) entry->args[i] = call_fetch(&tp->args[i].fetch, regs); - ftrace_perf_buf_submit(entry, size, rctx, entry->ret_ip, 1, + perf_trace_buf_submit(entry, size, rctx, entry->ret_ip, 1, irq_flags, regs); } -static int probe_profile_enable(struct ftrace_event_call *call) +static int probe_perf_enable(struct ftrace_event_call *call) { struct trace_probe *tp = (struct trace_probe *)call->data; @@ -1287,7 +1287,7 @@ static int probe_profile_enable(struct ftrace_event_call *call) return enable_kprobe(&tp->rp.kp); } -static void probe_profile_disable(struct ftrace_event_call *call) +static void probe_perf_disable(struct ftrace_event_call *call) { struct trace_probe *tp = (struct trace_probe *)call->data; @@ -1312,7 +1312,7 @@ int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs) kprobe_trace_func(kp, regs); #ifdef CONFIG_PERF_EVENTS if (tp->flags & TP_FLAG_PROFILE) - kprobe_profile_func(kp, regs); + kprobe_perf_func(kp, regs); #endif return 0; /* We don't tweek kernel, so just return 0 */ } @@ -1326,7 +1326,7 @@ int kretprobe_dispatcher(struct kretprobe_instance *ri, struct pt_regs *regs) kretprobe_trace_func(ri, regs); #ifdef CONFIG_PERF_EVENTS if (tp->flags & TP_FLAG_PROFILE) - kretprobe_profile_func(ri, regs); + kretprobe_perf_func(ri, regs); #endif return 0; /* We don't tweek kernel, so just return 0 */ } @@ -1359,8 +1359,8 @@ static int register_probe_event(struct trace_probe *tp) call->unregfunc = probe_event_disable; #ifdef CONFIG_PERF_EVENTS - call->profile_enable = probe_profile_enable; - call->profile_disable = probe_profile_disable; + call->perf_event_enable = probe_perf_enable; + call->perf_event_disable = probe_perf_disable; #endif call->data = tp; ret = trace_add_event_call(call); diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index 7e6e84f..33c2a5b 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -428,12 +428,12 @@ core_initcall(init_ftrace_syscalls); #ifdef CONFIG_PERF_EVENTS -static DECLARE_BITMAP(enabled_prof_enter_syscalls, NR_syscalls); -static DECLARE_BITMAP(enabled_prof_exit_syscalls, NR_syscalls); -static int sys_prof_refcount_enter; -static int sys_prof_refcount_exit; +static DECLARE_BITMAP(enabled_perf_enter_syscalls, NR_syscalls); +static DECLARE_BITMAP(enabled_perf_exit_syscalls, NR_syscalls); +static int sys_perf_refcount_enter; +static int sys_perf_refcount_exit; -static void prof_syscall_enter(struct pt_regs *regs, long id) +static void perf_syscall_enter(struct pt_regs *regs, long id) { struct syscall_metadata *sys_data; struct syscall_trace_enter *rec; @@ -443,7 +443,7 @@ static void prof_syscall_enter(struct pt_regs *regs, long id) int size; syscall_nr = syscall_get_nr(current, regs); - if (!test_bit(syscall_nr, enabled_prof_enter_syscalls)) + if (!test_bit(syscall_nr, enabled_perf_enter_syscalls)) return; sys_data = syscall_nr_to_meta(syscall_nr); @@ -455,11 +455,11 @@ static void prof_syscall_enter(struct pt_regs *regs, long id) size = ALIGN(size + sizeof(u32), sizeof(u64)); size -= sizeof(u32); - if (WARN_ONCE(size > FTRACE_MAX_PROFILE_SIZE, - "profile buffer not large enough")) + if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE, + "perf buffer not large enough")) return; - rec = (struct syscall_trace_enter *)ftrace_perf_buf_prepare(size, + rec = (struct syscall_trace_enter *)perf_trace_buf_prepare(size, sys_data->enter_event->id, &rctx, &flags); if (!rec) return; @@ -467,10 +467,10 @@ static void prof_syscall_enter(struct pt_regs *regs, long id) rec->nr = syscall_nr; syscall_get_arguments(current, regs, 0, sys_data->nb_args, (unsigned long *)&rec->args); - ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags, regs); + perf_trace_buf_submit(rec, size, rctx, 0, 1, flags, regs); } -int prof_sysenter_enable(struct ftrace_event_call *call) +int perf_sysenter_enable(struct ftrace_event_call *call) { int ret = 0; int num; @@ -478,34 +478,34 @@ int prof_sysenter_enable(struct ftrace_event_call *call) num = ((struct syscall_metadata *)call->data)->syscall_nr; mutex_lock(&syscall_trace_lock); - if (!sys_prof_refcount_enter) - ret = register_trace_sys_enter(prof_syscall_enter); + if (!sys_perf_refcount_enter) + ret = register_trace_sys_enter(perf_syscall_enter); if (ret) { pr_info("event trace: Could not activate" "syscall entry trace point"); } else { - set_bit(num, enabled_prof_enter_syscalls); - sys_prof_refcount_enter++; + set_bit(num, enabled_perf_enter_syscalls); + sys_perf_refcount_enter++; } mutex_unlock(&syscall_trace_lock); return ret; } -void prof_sysenter_disable(struct ftrace_event_call *call) +void perf_sysenter_disable(struct ftrace_event_call *call) { int num; num = ((struct syscall_metadata *)call->data)->syscall_nr; mutex_lock(&syscall_trace_lock); - sys_prof_refcount_enter--; - clear_bit(num, enabled_prof_enter_syscalls); - if (!sys_prof_refcount_enter) - unregister_trace_sys_enter(prof_syscall_enter); + sys_perf_refcount_enter--; + clear_bit(num, enabled_perf_enter_syscalls); + if (!sys_perf_refcount_enter) + unregister_trace_sys_enter(perf_syscall_enter); mutex_unlock(&syscall_trace_lock); } -static void prof_syscall_exit(struct pt_regs *regs, long ret) +static void perf_syscall_exit(struct pt_regs *regs, long ret) { struct syscall_metadata *sys_data; struct syscall_trace_exit *rec; @@ -515,7 +515,7 @@ static void prof_syscall_exit(struct pt_regs *regs, long ret) int size; syscall_nr = syscall_get_nr(current, regs); - if (!test_bit(syscall_nr, enabled_prof_exit_syscalls)) + if (!test_bit(syscall_nr, enabled_perf_exit_syscalls)) return; sys_data = syscall_nr_to_meta(syscall_nr); @@ -530,11 +530,11 @@ static void prof_syscall_exit(struct pt_regs *regs, long ret) * Impossible, but be paranoid with the future * How to put this check outside runtime? */ - if (WARN_ONCE(size > FTRACE_MAX_PROFILE_SIZE, - "exit event has grown above profile buffer size")) + if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE, + "exit event has grown above perf buffer size")) return; - rec = (struct syscall_trace_exit *)ftrace_perf_buf_prepare(size, + rec = (struct syscall_trace_exit *)perf_trace_buf_prepare(size, sys_data->exit_event->id, &rctx, &flags); if (!rec) return; @@ -542,10 +542,10 @@ static void prof_syscall_exit(struct pt_regs *regs, long ret) rec->nr = syscall_nr; rec->ret = syscall_get_return_value(current, regs); - ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags, regs); + perf_trace_buf_submit(rec, size, rctx, 0, 1, flags, regs); } -int prof_sysexit_enable(struct ftrace_event_call *call) +int perf_sysexit_enable(struct ftrace_event_call *call) { int ret = 0; int num; @@ -553,30 +553,30 @@ int prof_sysexit_enable(struct ftrace_event_call *call) num = ((struct syscall_metadata *)call->data)->syscall_nr; mutex_lock(&syscall_trace_lock); - if (!sys_prof_refcount_exit) - ret = register_trace_sys_exit(prof_syscall_exit); + if (!sys_perf_refcount_exit) + ret = register_trace_sys_exit(perf_syscall_exit); if (ret) { pr_info("event trace: Could not activate" "syscall exit trace point"); } else { - set_bit(num, enabled_prof_exit_syscalls); - sys_prof_refcount_exit++; + set_bit(num, enabled_perf_exit_syscalls); + sys_perf_refcount_exit++; } mutex_unlock(&syscall_trace_lock); return ret; } -void prof_sysexit_disable(struct ftrace_event_call *call) +void perf_sysexit_disable(struct ftrace_event_call *call) { int num; num = ((struct syscall_metadata *)call->data)->syscall_nr; mutex_lock(&syscall_trace_lock); - sys_prof_refcount_exit--; - clear_bit(num, enabled_prof_exit_syscalls); - if (!sys_prof_refcount_exit) - unregister_trace_sys_exit(prof_syscall_exit); + sys_perf_refcount_exit--; + clear_bit(num, enabled_perf_exit_syscalls); + if (!sys_perf_refcount_exit) + unregister_trace_sys_exit(perf_syscall_exit); mutex_unlock(&syscall_trace_lock); } -- cgit v1.1 From 0b1adaa031a55e44f5dd942f234bf09d28e8a0d6 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 9 Mar 2010 19:45:54 +0100 Subject: genirq: Prevent oneshot irq thread race Lars-Peter pointed out that the oneshot threaded interrupt handler code has the following race: CPU0 CPU1 hande_level_irq(irq X) mask_ack_irq(irq X) handle_IRQ_event(irq X) wake_up(thread_handler) thread handler(irq X) runs finalize_oneshot(irq X) does not unmask due to !(desc->status & IRQ_MASKED) return from irq does not unmask due to (desc->status & IRQ_ONESHOT) This leaves the interrupt line masked forever. The reason for this is the inconsistent handling of the IRQ_MASKED flag. Instead of setting it in the mask function the oneshot support sets the flag after waking up the irq thread. The solution for this is to set/clear the IRQ_MASKED status whenever we mask/unmask an interrupt line. That's the easy part, but that cleanup opens another race: CPU0 CPU1 hande_level_irq(irq) mask_ack_irq(irq) handle_IRQ_event(irq) wake_up(thread_handler) thread handler(irq) runs finalize_oneshot_irq(irq) unmask(irq) irq triggers again handle_level_irq(irq) mask_ack_irq(irq) return from irq due to IRQ_INPROGRESS return from irq does not unmask due to (desc->status & IRQ_ONESHOT) This requires that we synchronize finalize_oneshot_irq() with the primary handler. If IRQ_INPROGESS is set we wait until the primary handler on the other CPU has returned before unmasking the interrupt line again. We probably have never seen that problem because it does not happen on UP and on SMP the irqbalancer protects us by pinning the primary handler and the thread to the same CPU. Reported-by: Lars-Peter Clausen Signed-off-by: Thomas Gleixner Cc: stable@kernel.org --- kernel/irq/chip.c | 31 ++++++++++++++++++++++--------- kernel/irq/manage.c | 18 ++++++++++++++++++ 2 files changed, 40 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index d70394f..71eba24 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -359,6 +359,23 @@ static inline void mask_ack_irq(struct irq_desc *desc, int irq) if (desc->chip->ack) desc->chip->ack(irq); } + desc->status |= IRQ_MASKED; +} + +static inline void mask_irq(struct irq_desc *desc, int irq) +{ + if (desc->chip->mask) { + desc->chip->mask(irq); + desc->status |= IRQ_MASKED; + } +} + +static inline void unmask_irq(struct irq_desc *desc, int irq) +{ + if (desc->chip->unmask) { + desc->chip->unmask(irq); + desc->status &= ~IRQ_MASKED; + } } /* @@ -484,10 +501,8 @@ handle_level_irq(unsigned int irq, struct irq_desc *desc) raw_spin_lock(&desc->lock); desc->status &= ~IRQ_INPROGRESS; - if (unlikely(desc->status & IRQ_ONESHOT)) - desc->status |= IRQ_MASKED; - else if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask) - desc->chip->unmask(irq); + if (!(desc->status & (IRQ_DISABLED | IRQ_ONESHOT))) + unmask_irq(desc, irq); out_unlock: raw_spin_unlock(&desc->lock); } @@ -524,8 +539,7 @@ handle_fasteoi_irq(unsigned int irq, struct irq_desc *desc) action = desc->action; if (unlikely(!action || (desc->status & IRQ_DISABLED))) { desc->status |= IRQ_PENDING; - if (desc->chip->mask) - desc->chip->mask(irq); + mask_irq(desc, irq); goto out; } @@ -593,7 +607,7 @@ handle_edge_irq(unsigned int irq, struct irq_desc *desc) irqreturn_t action_ret; if (unlikely(!action)) { - desc->chip->mask(irq); + mask_irq(desc, irq); goto out_unlock; } @@ -605,8 +619,7 @@ handle_edge_irq(unsigned int irq, struct irq_desc *desc) if (unlikely((desc->status & (IRQ_PENDING | IRQ_MASKED | IRQ_DISABLED)) == (IRQ_PENDING | IRQ_MASKED))) { - desc->chip->unmask(irq); - desc->status &= ~IRQ_MASKED; + unmask_irq(desc, irq); } desc->status &= ~IRQ_PENDING; diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index eb6078c..69a3d7b 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -483,8 +483,26 @@ static int irq_wait_for_interrupt(struct irqaction *action) */ static void irq_finalize_oneshot(unsigned int irq, struct irq_desc *desc) { +again: chip_bus_lock(irq, desc); raw_spin_lock_irq(&desc->lock); + + /* + * Implausible though it may be we need to protect us against + * the following scenario: + * + * The thread is faster done than the hard interrupt handler + * on the other CPU. If we unmask the irq line then the + * interrupt can come in again and masks the line, leaves due + * to IRQ_INPROGRESS and the irq line is masked forever. + */ + if (unlikely(desc->status & IRQ_INPROGRESS)) { + raw_spin_unlock_irq(&desc->lock); + chip_bus_sync_unlock(irq, desc); + cpu_relax(); + goto again; + } + if (!(desc->status & IRQ_DISABLED) && (desc->status & IRQ_MASKED)) { desc->status &= ~IRQ_MASKED; desc->chip->unmask(irq); -- cgit v1.1 From 220b140b52ab6cc133f674a7ffec8fa792054f25 Mon Sep 17 00:00:00 2001 From: Paul Mackerras Date: Wed, 10 Mar 2010 20:45:52 +1100 Subject: perf_event: Fix oops triggered by cpu offline/online Anton Blanchard found that he could reliably make the kernel hit a BUG_ON in the slab allocator by taking a cpu offline and then online while a system-wide perf record session was running. The reason is that when the cpu comes up, we completely reinitialize the ctx field of the struct perf_cpu_context for the cpu. If there is a system-wide perf record session running, then there will be a struct perf_event that has a reference to the context, so its refcount will be 2. (The perf_event has been removed from the context's group_entry and event_entry lists by perf_event_exit_cpu(), but that doesn't remove the perf_event's reference to the context and doesn't decrement the context's refcount.) When the cpu comes up, perf_event_init_cpu() gets called, and it calls __perf_event_init_context() on the cpu's context. That resets the refcount to 1. Then when the perf record session finishes and the perf_event is closed, the refcount gets decremented to 0 and the context gets kfreed after an RCU grace period. Since the context wasn't kmalloced -- it's part of a per-cpu variable -- bad things happen. In fact we don't need to completely reinitialize the context when the cpu comes up. It's sufficient to initialize the context once at boot, but we need to do it for all possible cpus. This moves the context initialization to happen at boot time. With this, we don't trash the refcount and the context never gets kfreed, and we don't hit the BUG_ON. Reported-by: Anton Blanchard Signed-off-by: Paul Mackerras Tested-by: Anton Blanchard Acked-by: Peter Zijlstra Cc: Signed-off-by: Ingo Molnar --- kernel/perf_event.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index c502b18..fb3031c 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -5368,12 +5368,22 @@ int perf_event_init_task(struct task_struct *child) return ret; } +static void __init perf_event_init_all_cpus(void) +{ + int cpu; + struct perf_cpu_context *cpuctx; + + for_each_possible_cpu(cpu) { + cpuctx = &per_cpu(perf_cpu_context, cpu); + __perf_event_init_context(&cpuctx->ctx, NULL); + } +} + static void __cpuinit perf_event_init_cpu(int cpu) { struct perf_cpu_context *cpuctx; cpuctx = &per_cpu(perf_cpu_context, cpu); - __perf_event_init_context(&cpuctx->ctx, NULL); spin_lock(&perf_resource_lock); cpuctx->max_pertask = perf_max_events - perf_reserved_percpu; @@ -5439,6 +5449,7 @@ static struct notifier_block __cpuinitdata perf_cpu_nb = { void __init perf_event_init(void) { + perf_event_init_all_cpus(); perf_cpu_notify(&perf_cpu_nb, (unsigned long)CPU_UP_PREPARE, (void *)(long)smp_processor_id()); perf_cpu_notify(&perf_cpu_nb, (unsigned long)CPU_ONLINE, -- cgit v1.1 From 3f379b03fbfddd20536389a85c6456f8233d1f8d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 5 Mar 2010 15:03:25 -0800 Subject: ftrace: Replace read_barrier_depends() with rcu_dereference_raw() Replace the calls to read_barrier_depends() in ftrace_list_func() with rcu_dereference_raw() to improve readability. The reason that we use rcu_dereference_raw() here is that removed entries are never freed, instead they are simply leaked. This is one of a very few cases where use of rcu_dereference_raw() is the long-term right answer. And I don't yet know of any others. ;-) Signed-off-by: Paul E. McKenney Acked-by: Steven Rostedt Cc: Frederic Weisbecker Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267830207-9474-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar --- kernel/trace/ftrace.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 8378357..8c5adc0 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -27,6 +27,7 @@ #include #include #include +#include #include @@ -88,18 +89,22 @@ ftrace_func_t ftrace_pid_function __read_mostly = ftrace_stub; static int ftrace_set_func(unsigned long *array, int *idx, char *buffer); #endif +/* + * Traverse the ftrace_list, invoking all entries. The reason that we + * can use rcu_dereference_raw() is that elements removed from this list + * are simply leaked, so there is no need to interact with a grace-period + * mechanism. The rcu_dereference_raw() calls are needed to handle + * concurrent insertions into the ftrace_list. + * + * Silly Alpha and silly pointer-speculation compiler optimizations! + */ static void ftrace_list_func(unsigned long ip, unsigned long parent_ip) { - struct ftrace_ops *op = ftrace_list; - - /* in case someone actually ports this to alpha! */ - read_barrier_depends(); + struct ftrace_ops *op = rcu_dereference_raw(ftrace_list); /*see above*/ while (op != &ftrace_list_end) { - /* silly alpha */ - read_barrier_depends(); op->func(ip, parent_ip); - op = op->next; + op = rcu_dereference_raw(op->next); /*see above*/ }; } @@ -154,8 +159,7 @@ static int __register_ftrace_function(struct ftrace_ops *ops) * the ops->next pointer is valid before another CPU sees * the ops pointer included into the ftrace_list. */ - smp_wmb(); - ftrace_list = ops; + rcu_assign_pointer(ftrace_list, ops); if (ftrace_enabled) { ftrace_func_t func; -- cgit v1.1 From 007b09243b099811124f69d492adeebe9e439f96 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 5 Mar 2010 15:03:26 -0800 Subject: rcu: Increase RCU CPU stall timeouts if PROVE_RCU CONFIG_PROVE_RCU imposes additional overhead on the kernel, so increase the RCU CPU stall timeouts in an attempt to allow for this effect. Signed-off-by: Paul E. McKenney Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267830207-9474-2-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar --- kernel/rcutree.h | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutree.h b/kernel/rcutree.h index 1439eb5..4a525a3 100644 --- a/kernel/rcutree.h +++ b/kernel/rcutree.h @@ -246,12 +246,21 @@ struct rcu_data { #define RCU_JIFFIES_TILL_FORCE_QS 3 /* for rsp->jiffies_force_qs */ #ifdef CONFIG_RCU_CPU_STALL_DETECTOR -#define RCU_SECONDS_TILL_STALL_CHECK (10 * HZ) /* for rsp->jiffies_stall */ -#define RCU_SECONDS_TILL_STALL_RECHECK (30 * HZ) /* for rsp->jiffies_stall */ -#define RCU_STALL_RAT_DELAY 2 /* Allow other CPUs time */ - /* to take at least one */ - /* scheduling clock irq */ - /* before ratting on them. */ + +#ifdef CONFIG_PROVE_RCU +#define RCU_STALL_DELAY_DELTA (5 * HZ) +#else +#define RCU_STALL_DELAY_DELTA 0 +#endif + +#define RCU_SECONDS_TILL_STALL_CHECK (10 * HZ + RCU_STALL_DELAY_DELTA) + /* for rsp->jiffies_stall */ +#define RCU_SECONDS_TILL_STALL_RECHECK (30 * HZ + RCU_STALL_DELAY_DELTA) + /* for rsp->jiffies_stall */ +#define RCU_STALL_RAT_DELAY 2 /* Allow other CPUs time */ + /* to take at least one */ + /* scheduling clock irq */ + /* before ratting on them. */ #endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */ -- cgit v1.1 From ab3b3aa5dd01b3aaa6b15caee113b21b1b6520c4 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Sat, 6 Mar 2010 14:17:52 +0300 Subject: sched: Cleanup: remove unused variable in try_to_wake_up() We haven't used the "orig_rq" variable since 055a00865d "Fix/add missing update_rq_clock() calls" Signed-off-by: Dan Carpenter Cc: Peter Zijlstra Cc: Andreas Herrmann Cc: Gautham R Shenoy Cc: efault@gmx.de LKML-Reference: <20100306111752.GL4958@bicker> Signed-off-by: Ingo Molnar --- kernel/sched.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 6a212c9..2c1db81 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -2359,7 +2359,7 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, { int cpu, orig_cpu, this_cpu, success = 0; unsigned long flags; - struct rq *rq, *orig_rq; + struct rq *rq; if (!sched_feat(SYNC_WAKEUPS)) wake_flags &= ~WF_SYNC; @@ -2367,7 +2367,7 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, this_cpu = get_cpu(); smp_wmb(); - rq = orig_rq = task_rq_lock(p, &flags); + rq = task_rq_lock(p, &flags); update_rq_clock(rq); if (!(p->state & state)) goto out; -- cgit v1.1 From 83ff56f46a8532488ee364bb93a9cb2a59490d33 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Tue, 9 Mar 2010 10:22:19 -0500 Subject: kprobes: Calculate the index correctly when freeing the out-of-line execution slot From : Ananth N Mavinakayanahalli When freeing the instruction slot, the arithmetic to calculate the index of the slot in the page needs to account for the total size of the instruction on the various architectures. Calculate the index correctly when freeing the out-of-line execution slot. Reported-by: Sachin Sant Reported-by: Heiko Carstens Signed-off-by: Ananth N Mavinakayanahalli Signed-off-by: Masami Hiramatsu LKML-Reference: <4B9667AB.9050507@redhat.com> Signed-off-by: Ingo Molnar --- kernel/kprobes.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/kprobes.c b/kernel/kprobes.c index fa034d2..0ed46f3 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -259,7 +259,8 @@ static void __kprobes __free_insn_slot(struct kprobe_insn_cache *c, struct kprobe_insn_page *kip; list_for_each_entry(kip, &c->pages, list) { - long idx = ((long)slot - (long)kip->insns) / c->insn_size; + long idx = ((long)slot - (long)kip->insns) / + (c->insn_size * sizeof(kprobe_opcode_t)); if (idx >= 0 && idx < slots_per_page(c)) { WARN_ON(kip->slot_used[idx] != SLOT_USED); if (dirty) { -- cgit v1.1 From 639fe4b12f92b54c9c3b38c82cdafaa38cfd3e63 Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Thu, 11 Mar 2010 15:30:35 +0800 Subject: perf: export perf_trace_regs and perf_arch_fetch_caller_regs Export perf_trace_regs and perf_arch_fetch_caller_regs since module will use these. Signed-off-by: Xiao Guangrong [ use EXPORT_PER_CPU_SYMBOL_GPL() ] Signed-off-by: Peter Zijlstra LKML-Reference: <4B989C1B.2090407@cn.fujitsu.com> Signed-off-by: Ingo Molnar --- kernel/trace/trace_event_perf.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index f315b12..0709e4f 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -10,6 +10,7 @@ #include "trace.h" DEFINE_PER_CPU(struct pt_regs, perf_trace_regs); +EXPORT_PER_CPU_SYMBOL_GPL(perf_trace_regs); static char *perf_trace_buf; static char *perf_trace_buf_nmi; -- cgit v1.1 From 3d07467b7aa91623b31d7b5888a123a2c8c8e9cc Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 10 Mar 2010 17:07:24 +0100 Subject: sched: Fix pick_next_highest_task_rt() for cgroups Since pick_next_highest_task_rt() already iterates all the cgroups and is really only interested in tasks, skip over the !task entries. Reported-by: Dhaval Giani Signed-off-by: Peter Zijlstra Tested-by: Dhaval Giani LKML-Reference: Signed-off-by: Ingo Molnar --- kernel/sched_rt.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c index bf3e38f..c4fb42a 100644 --- a/kernel/sched_rt.c +++ b/kernel/sched_rt.c @@ -1146,7 +1146,12 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq, int cpu) if (next && next->prio < idx) continue; list_for_each_entry(rt_se, array->queue + idx, run_list) { - struct task_struct *p = rt_task_of(rt_se); + struct task_struct *p; + + if (!rt_entity_is_task(rt_se)) + continue; + + p = rt_task_of(rt_se); if (pick_rt_task(rq, p, cpu)) { next = p; break; -- cgit v1.1 From 80a05b9ffa7dc13f6693902dd8999a2b61a3a0d7 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 12 Mar 2010 17:34:14 +0100 Subject: clockevents: Sanitize min_delta_ns adjustment and prevent overflows The current logic which handles clock events programming failures can increase min_delta_ns unlimited and even can cause overflows. Sanitize it by: - prevent zero increase when min_delta_ns == 1 - limiting min_delta_ns to a jiffie - bail out if the jiffie limit is hit - add retries stats for /proc/timer_list so we can gather data Reported-by: Uwe Kleine-Koenig Signed-off-by: Thomas Gleixner --- kernel/time/tick-oneshot.c | 52 +++++++++++++++++++++++++++++++++++----------- kernel/time/timer_list.c | 3 ++- 2 files changed, 42 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/time/tick-oneshot.c b/kernel/time/tick-oneshot.c index 0a8a213..aada0e5 100644 --- a/kernel/time/tick-oneshot.c +++ b/kernel/time/tick-oneshot.c @@ -22,6 +22,29 @@ #include "tick-internal.h" +/* Limit min_delta to a jiffie */ +#define MIN_DELTA_LIMIT (NSEC_PER_SEC / HZ) + +static int tick_increase_min_delta(struct clock_event_device *dev) +{ + /* Nothing to do if we already reached the limit */ + if (dev->min_delta_ns >= MIN_DELTA_LIMIT) + return -ETIME; + + if (dev->min_delta_ns < 5000) + dev->min_delta_ns = 5000; + else + dev->min_delta_ns += dev->min_delta_ns >> 1; + + if (dev->min_delta_ns > MIN_DELTA_LIMIT) + dev->min_delta_ns = MIN_DELTA_LIMIT; + + printk(KERN_WARNING "CE: %s increased min_delta_ns to %llu nsec\n", + dev->name ? dev->name : "?", + (unsigned long long) dev->min_delta_ns); + return 0; +} + /** * tick_program_event internal worker function */ @@ -37,23 +60,28 @@ int tick_dev_program_event(struct clock_event_device *dev, ktime_t expires, if (!ret || !force) return ret; + dev->retries++; /* - * We tried 2 times to program the device with the given - * min_delta_ns. If that's not working then we double it + * We tried 3 times to program the device with the given + * min_delta_ns. If that's not working then we increase it * and emit a warning. */ if (++i > 2) { /* Increase the min. delta and try again */ - if (!dev->min_delta_ns) - dev->min_delta_ns = 5000; - else - dev->min_delta_ns += dev->min_delta_ns >> 1; - - printk(KERN_WARNING - "CE: %s increasing min_delta_ns to %llu nsec\n", - dev->name ? dev->name : "?", - (unsigned long long) dev->min_delta_ns << 1); - + if (tick_increase_min_delta(dev)) { + /* + * Get out of the loop if min_delta_ns + * hit the limit already. That's + * better than staying here forever. + * + * We clear next_event so we have a + * chance that the box survives. + */ + printk(KERN_WARNING + "CE: Reprogramming failure. Giving up\n"); + dev->next_event.tv64 = KTIME_MAX; + return -ETIME; + } i = 0; } diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c index bdfb8dd..1a4a7dd 100644 --- a/kernel/time/timer_list.c +++ b/kernel/time/timer_list.c @@ -228,6 +228,7 @@ print_tickdevice(struct seq_file *m, struct tick_device *td, int cpu) SEQ_printf(m, " event_handler: "); print_name_offset(m, dev->event_handler); SEQ_printf(m, "\n"); + SEQ_printf(m, " retries: %lu\n", dev->retries); } static void timer_list_show_tickdevices(struct seq_file *m) @@ -257,7 +258,7 @@ static int timer_list_show(struct seq_file *m, void *v) u64 now = ktime_to_ns(ktime_get()); int cpu; - SEQ_printf(m, "Timer List Version: v0.5\n"); + SEQ_printf(m, "Timer List Version: v0.6\n"); SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES); SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now); -- cgit v1.1 From 829b6c1ef488856c6a46a2f705f5068062d5f34c Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Thu, 11 Mar 2010 14:04:30 -0800 Subject: timer stats: Fix del_timer_sync() and try_to_del_timer_sync() These functions forgot to run timer_stats_timer_clear_start_info(). It's unobvious what effect this has and whether it matters much - we won't be printing it out anyway if the timer's detached. Untested, just an Ingo trollpatch. [ Nevertheless correct - tglx ] Signed-off-by: Andrew Morton Cc: Ingo Molnar Cc: johnstul@us.ibm.com Signed-off-by: Thomas Gleixner --- kernel/timer.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/timer.c b/kernel/timer.c index c61a794..fc965ea 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -880,6 +880,7 @@ int try_to_del_timer_sync(struct timer_list *timer) if (base->running_timer == timer) goto out; + timer_stats_timer_clear_start_info(timer); ret = 0; if (timer_pending(timer)) { detach_timer(timer, 1); -- cgit v1.1 From 15365c108ea27598e265f8c13e7051d99ca5b0b9 Mon Sep 17 00:00:00 2001 From: Stanislaw Gruszka Date: Thu, 11 Mar 2010 14:04:31 -0800 Subject: posix-cpu-timers: Reset expire cache when no timer is running When a process deletes cpu timer or a timer expires we do not clear the expiration cache sig->cputimer_expires. As a result the fastpath_timer_check() which prevents us to loop over all threads in case no timer is active is not working and we run the slow path needlessly on every tick. Zero sig->cputimer_expires in stop_process_timers(). Signed-off-by: Stanislaw Gruszka Cc: Ingo Molnar Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Hidetoshi Seto Cc: Spencer Candland Signed-off-by: Andrew Morton Signed-off-by: Thomas Gleixner --- kernel/posix-cpu-timers.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c index 438ff45..edec25a 100644 --- a/kernel/posix-cpu-timers.c +++ b/kernel/posix-cpu-timers.c @@ -1060,9 +1060,9 @@ static void check_thread_timers(struct task_struct *tsk, } } -static void stop_process_timers(struct task_struct *tsk) +static void stop_process_timers(struct signal_struct *sig) { - struct thread_group_cputimer *cputimer = &tsk->signal->cputimer; + struct thread_group_cputimer *cputimer = &sig->cputimer; unsigned long flags; if (!cputimer->running) @@ -1071,6 +1071,10 @@ static void stop_process_timers(struct task_struct *tsk) spin_lock_irqsave(&cputimer->lock, flags); cputimer->running = 0; spin_unlock_irqrestore(&cputimer->lock, flags); + + sig->cputime_expires.prof_exp = cputime_zero; + sig->cputime_expires.virt_exp = cputime_zero; + sig->cputime_expires.sched_exp = 0; } static u32 onecputick; @@ -1131,7 +1135,7 @@ static void check_process_timers(struct task_struct *tsk, list_empty(&timers[CPUCLOCK_VIRT]) && cputime_eq(sig->it[CPUCLOCK_VIRT].expires, cputime_zero) && list_empty(&timers[CPUCLOCK_SCHED])) { - stop_process_timers(tsk); + stop_process_timers(sig); return; } -- cgit v1.1 From 52fbe9cde7fdb5c6fac196d7ebd2d92d05ef3cd4 Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Mon, 8 Mar 2010 14:50:43 +0800 Subject: ring-buffer: Move disabled check into preempt disable section The ring buffer resizing and resetting relies on a schedule RCU action. The buffers are disabled, a synchronize_sched() is called and then the resize or reset takes place. But this only works if the disabling of the buffers are within the preempt disabled section, otherwise a window exists that the buffers can be written to while a reset or resize takes place. Cc: stable@kernel.org Reported-by: Li Zefan Signed-off-by: Lai Jiangshan LKML-Reference: <4B949E43.2010906@cn.fujitsu.com> Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 8c1b2d2..54191d6 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -2232,12 +2232,12 @@ ring_buffer_lock_reserve(struct ring_buffer *buffer, unsigned long length) if (ring_buffer_flags != RB_BUFFERS_ON) return NULL; - if (atomic_read(&buffer->record_disabled)) - return NULL; - /* If we are tracing schedule, we don't want to recurse */ resched = ftrace_preempt_disable(); + if (atomic_read(&buffer->record_disabled)) + goto out_nocheck; + if (trace_recursive_lock()) goto out_nocheck; @@ -2469,11 +2469,11 @@ int ring_buffer_write(struct ring_buffer *buffer, if (ring_buffer_flags != RB_BUFFERS_ON) return -EBUSY; - if (atomic_read(&buffer->record_disabled)) - return -EBUSY; - resched = ftrace_preempt_disable(); + if (atomic_read(&buffer->record_disabled)) + goto out; + cpu = raw_smp_processor_id(); if (!cpumask_test_cpu(cpu, buffer->cpumask)) -- cgit v1.1 From ea14eb714041d40fcc5180b5a586034503650149 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Fri, 12 Mar 2010 19:41:23 -0500 Subject: function-graph: Init curr_ret_stack with ret_stack If the graph tracer is active, and a task is forked but the allocating of the processes graph stack fails, it can cause crash later on. This is due to the temporary stack being NULL, but the curr_ret_stack variable is copied from the parent. If it is not -1, then in ftrace_graph_probe_sched_switch() the following: for (index = next->curr_ret_stack; index >= 0; index--) next->ret_stack[index].calltime += timestamp; Will cause a kernel OOPS. Found with Li Zefan's ftrace_stress_test. Cc: stable@kernel.org Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index d4d1238..bb53edb 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -3349,6 +3349,7 @@ void ftrace_graph_init_task(struct task_struct *t) { /* Make sure we do not use the parent ret_stack */ t->ret_stack = NULL; + t->curr_ret_stack = -1; if (ftrace_graph_active) { struct ftrace_ret_stack *ret_stack; @@ -3358,7 +3359,6 @@ void ftrace_graph_init_task(struct task_struct *t) GFP_KERNEL); if (!ret_stack) return; - t->curr_ret_stack = -1; atomic_set(&t->tracing_graph_pause, 0); atomic_set(&t->trace_overrun, 0); t->ftrace_timestamp = 0; -- cgit v1.1 From 283740c619d211e34572cc93c8cdba92ccbdb9cc Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Fri, 12 Mar 2010 19:48:41 -0500 Subject: tracing: Use same local variable when resetting the ring buffer In the ftrace code that resets the ring buffer it references the buffer with a local variable, but then uses the tr->buffer as the parameter to reset. If the wakeup tracer is running, which can switch the tr->buffer with the max saved buffer, this can break the requirement of disabling the buffer before the reset. buffer = tr->buffer; ring_buffer_record_disable(buffer); synchronize_sched(); __tracing_reset(tr->buffer, cpu); If the tr->buffer is swapped, then the reset is not happening to the buffer that was disabled. This will cause the ring buffer to fail. Found with Li Zefan's ftrace_stress_test. Cc: stable@kernel.org Reported-by: Lai Jiangshan Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 6af8d7b..60de37b 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -840,10 +840,10 @@ out: mutex_unlock(&trace_types_lock); } -static void __tracing_reset(struct trace_array *tr, int cpu) +static void __tracing_reset(struct ring_buffer *buffer, int cpu) { ftrace_disable_cpu(); - ring_buffer_reset_cpu(tr->buffer, cpu); + ring_buffer_reset_cpu(buffer, cpu); ftrace_enable_cpu(); } @@ -855,7 +855,7 @@ void tracing_reset(struct trace_array *tr, int cpu) /* Make sure all commits have finished */ synchronize_sched(); - __tracing_reset(tr, cpu); + __tracing_reset(buffer, cpu); ring_buffer_record_enable(buffer); } @@ -873,7 +873,7 @@ void tracing_reset_online_cpus(struct trace_array *tr) tr->time_start = ftrace_now(tr->cpu); for_each_online_cpu(cpu) - __tracing_reset(tr, cpu); + __tracing_reset(buffer, cpu); ring_buffer_record_enable(buffer); } -- cgit v1.1 From a2f8071428ed9a0f06865f417c962421c9a6b488 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Fri, 12 Mar 2010 19:56:00 -0500 Subject: tracing: Disable buffer switching when starting or stopping trace When the trace iterator is read, tracing_start() and tracing_stop() is called to stop tracing while the iterator is processing the trace output. These functions disable both the standard buffer and the max latency buffer. But if the wakeup tracer is running, it can switch these buffers between the two disables: buffer = global_trace.buffer; if (buffer) ring_buffer_record_disable(buffer); <<<--------- swap happens here buffer = max_tr.buffer; if (buffer) ring_buffer_record_disable(buffer); What happens is that we disabled the same buffer twice. On tracing_start() we can enable the same buffer twice. All ring_buffer_record_disable() must be matched with a ring_buffer_record_enable() or the buffer can be disable permanently, or enable prematurely, and cause a bug where a reset happens while a trace is commiting. This patch protects these two by taking the ftrace_max_lock to prevent a switch from occurring. Found with Li Zefan's ftrace_stress_test. Cc: stable@kernel.org Reported-by: Lai Jiangshan Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 60de37b..484337d 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -950,6 +950,8 @@ void tracing_start(void) goto out; } + /* Prevent the buffers from switching */ + arch_spin_lock(&ftrace_max_lock); buffer = global_trace.buffer; if (buffer) @@ -959,6 +961,8 @@ void tracing_start(void) if (buffer) ring_buffer_record_enable(buffer); + arch_spin_unlock(&ftrace_max_lock); + ftrace_start(); out: spin_unlock_irqrestore(&tracing_start_lock, flags); @@ -980,6 +984,9 @@ void tracing_stop(void) if (trace_stop_count++) goto out; + /* Prevent the buffers from switching */ + arch_spin_lock(&ftrace_max_lock); + buffer = global_trace.buffer; if (buffer) ring_buffer_record_disable(buffer); @@ -988,6 +995,8 @@ void tracing_stop(void) if (buffer) ring_buffer_record_disable(buffer); + arch_spin_unlock(&ftrace_max_lock); + out: spin_unlock_irqrestore(&tracing_start_lock, flags); } -- cgit v1.1 From b6345879ccbd9b92864fbd7eb8ac48acdb4d6b15 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Fri, 12 Mar 2010 20:03:30 -0500 Subject: tracing: Do not record user stack trace from NMI context A bug was found with Li Zefan's ftrace_stress_test that caused applications to segfault during the test. Placing a tracing_off() in the segfault code, and examining several traces, I found that the following was always the case. The lock tracer was enabled (lockdep being required) and userstack was enabled. Testing this out, I just enabled the two, but that was not good enough. I needed to run something else that could trigger it. Running a load like hackbench did not work, but executing a new program would. The following would trigger the segfault within seconds: # echo 1 > /debug/tracing/options/userstacktrace # echo 1 > /debug/tracing/events/lock/enable # while :; do ls > /dev/null ; done Enabling the function graph tracer and looking at what was happening I finally noticed that all cashes happened just after an NMI. 1) | copy_user_handle_tail() { 1) | bad_area_nosemaphore() { 1) | __bad_area_nosemaphore() { 1) | no_context() { 1) | fixup_exception() { 1) 0.319 us | search_exception_tables(); 1) 0.873 us | } [...] 1) 0.314 us | __rcu_read_unlock(); 1) 0.325 us | native_apic_mem_write(); 1) 0.943 us | } 1) 0.304 us | rcu_nmi_exit(); [...] 1) 0.479 us | find_vma(); 1) | bad_area() { 1) | __bad_area() { After capturing several traces of failures, all of them happened after an NMI. Curious about this, I added a trace_printk() to the NMI handler to read the regs->ip to see where the NMI happened. In which I found out it was here: ffffffff8135b660 : ffffffff8135b660: 48 83 ec 78 sub $0x78,%rsp ffffffff8135b664: e8 97 01 00 00 callq ffffffff8135b800 What was happening is that the NMI would happen at the place that a page fault occurred. It would call rcu_read_lock() which was traced by the lock events, and the user_stack_trace would run. This would trigger a page fault inside the NMI. I do not see where the CR2 register is saved or restored in NMI handling. This means that it would corrupt the page fault handling that the NMI interrupted. The reason the while loop of ls helped trigger the bug, was that each execution of ls would cause lots of pages to be faulted in, and increase the chances of the race happening. The simple solution is to not allow user stack traces in NMI context. After this patch, I ran the above "ls" test for a couple of hours without any issues. Without this patch, the bug would trigger in less than a minute. Cc: stable@kernel.org Reported-by: Li Zefan Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 484337d..e52683f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1284,6 +1284,13 @@ ftrace_trace_userstack(struct ring_buffer *buffer, unsigned long flags, int pc) if (!(trace_flags & TRACE_ITER_USERSTACKTRACE)) return; + /* + * NMIs can not handle page faults, even with fix ups. + * The save user stack can (and often does) fault. + */ + if (unlikely(in_nmi())) + return; + event = trace_buffer_lock_reserve(buffer, TRACE_USER_STACK, sizeof(*entry), flags, pc); if (!event) -- cgit v1.1 From cd3d8031eb4311e516329aee03c79a08333141f1 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Fri, 12 Mar 2010 16:15:36 +0900 Subject: sched: sched_getaffinity(): Allow less than NR_CPUS length [ Note, this commit changes the syscall ABI for > 1024 CPUs systems. ] Recently, some distro decided to use NR_CPUS=4096 for mysterious reasons. Unfortunately, glibc sched interface has the following definition: # define __CPU_SETSIZE 1024 # define __NCPUBITS (8 * sizeof (__cpu_mask)) typedef unsigned long int __cpu_mask; typedef struct { __cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS]; } cpu_set_t; It mean, if NR_CPUS is bigger than 1024, cpu_set_t makes an ABI issue ... More recently, Sharyathi Nagesh reported following test program makes misterious syscall failure: ----------------------------------------------------------------------- #define _GNU_SOURCE #include #include #include int main() { cpu_set_t set; if (sched_getaffinity(0, sizeof(cpu_set_t), &set) < 0) printf("\n Call is failing with:%d", errno); } ----------------------------------------------------------------------- Because the kernel assumes len argument of sched_getaffinity() is bigger than NR_CPUS. But now it is not correct. Now we are faced with the following annoying dilemma, due to the limitations of the glibc interface built in years ago: (1) if we change glibc's __CPU_SETSIZE definition, we lost binary compatibility of _all_ application. (2) if we don't change it, we also lost binary compatibility of Sharyathi's use case. Then, I would propse to change the rule of the len argument of sched_getaffinity(). Old: len should be bigger than NR_CPUS New: len should be bigger than maximum possible cpu id This creates the following behavior: (A) In the real 4096 cpus machine, the above test program still return -EINVAL. (B) NR_CPUS=4096 but the machine have less than 1024 cpus (almost all machines in the world), the above can run successfully. Fortunatelly, BIG SGI machine is mainly used for HPC use case. It means they can rebuild their programs. IOW we hope they are not annoyed by this issue ... Reported-by: Sharyathi Nagesh Signed-off-by: KOSAKI Motohiro Acked-by: Ulrich Drepper Acked-by: Peter Zijlstra Cc: Linus Torvalds Cc: Andrew Morton Cc: Jack Steiner Cc: Russ Anderson Cc: Mike Travis LKML-Reference: <20100312161316.9520.A69D9226@jp.fujitsu.com> Signed-off-by: Ingo Molnar --- kernel/sched.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 9ab3cd7..6eaef3d 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4902,7 +4902,9 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len, int ret; cpumask_var_t mask; - if (len < cpumask_size()) + if (len < nr_cpu_ids) + return -EINVAL; + if (len & (sizeof(unsigned long)-1)) return -EINVAL; if (!alloc_cpumask_var(&mask, GFP_KERNEL)) @@ -4910,10 +4912,12 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len, ret = sched_getaffinity(pid, mask); if (ret == 0) { - if (copy_to_user(user_mask_ptr, mask, cpumask_size())) + int retlen = min(len, cpumask_size()); + + if (copy_to_user(user_mask_ptr, mask, retlen)) ret = -EFAULT; else - ret = cpumask_size(); + ret = retlen; } free_cpumask_var(mask); -- cgit v1.1 From e3818b8dce2a934cd1521dbc4827e5238d8f45d8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 15 Mar 2010 17:03:43 -0700 Subject: rcu: Make rcu_read_lock_bh_held() allow for disabled BH Disabling BH can stand in for rcu_read_lock_bh(), and this patch updates rcu_read_lock_bh_held() to allow for this. In order to avoid include-file hell, this function is moved out of line to kernel/rcupdate.c. This fixes a false positive RCU warning. Reported-by: Arnd Bergmann Reported-by: Eric Dumazet Signed-off-by: Paul E. McKenney Acked-by: Lai Jiangshan Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20100316000343.GA25857@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar --- kernel/rcupdate.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) (limited to 'kernel') diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c index f1125c1..63fe254 100644 --- a/kernel/rcupdate.c +++ b/kernel/rcupdate.c @@ -45,6 +45,7 @@ #include #include #include +#include #ifdef CONFIG_DEBUG_LOCK_ALLOC static struct lock_class_key rcu_lock_key; @@ -66,6 +67,28 @@ EXPORT_SYMBOL_GPL(rcu_sched_lock_map); int rcu_scheduler_active __read_mostly; EXPORT_SYMBOL_GPL(rcu_scheduler_active); +#ifdef CONFIG_DEBUG_LOCK_ALLOC + +/** + * rcu_read_lock_bh_held - might we be in RCU-bh read-side critical section? + * + * Check for bottom half being disabled, which covers both the + * CONFIG_PROVE_RCU and not cases. Note that if someone uses + * rcu_read_lock_bh(), but then later enables BH, lockdep (if enabled) + * will show the situation. + * + * Check debug_lockdep_rcu_enabled() to prevent false positives during boot. + */ +int rcu_read_lock_bh_held(void) +{ + if (!debug_lockdep_rcu_enabled()) + return 1; + return in_softirq(); +} +EXPORT_SYMBOL_GPL(rcu_read_lock_bh_held); + +#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ + /* * This function is invoked towards the end of the scheduler's initialization * process. Before this is called, the idle task might contain -- cgit v1.1 From c890692bf37671b5b78a1870d55d6d87e1c8a509 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Thu, 11 Mar 2010 14:08:43 -0800 Subject: kernel/sched.c: Suppress unused var warning On UP: kernel/sched.c: In function 'wake_up_new_task': kernel/sched.c:2631: warning: unused variable 'cpu' Signed-off-by: Andrew Morton Cc: Peter Zijlstra Signed-off-by: Ingo Molnar --- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 6eaef3d..82975b5 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -2650,7 +2650,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags) { unsigned long flags; struct rq *rq; - int cpu = get_cpu(); + int cpu __maybe_unused = get_cpu(); #ifdef CONFIG_SMP /* -- cgit v1.1 From 8bc037fb89bb3104b9ae290d18c877624cd7d9cc Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Wed, 17 Mar 2010 09:36:58 +0900 Subject: sched: Use proper type in sched_getaffinity() Using the proper type fixes the following compiler warning: kernel/sched.c:4850: warning: comparison of distinct pointer types lacks a cast Signed-off-by: KOSAKI Motohiro Cc: torvalds@linux-foundation.org Cc: travis@sgi.com Cc: peterz@infradead.org Cc: drepper@redhat.com Cc: rja@sgi.com Cc: sharyath@in.ibm.com Cc: steiner@sgi.com LKML-Reference: <20100317090046.4C79.A69D9226@jp.fujitsu.com> Signed-off-by: Ingo Molnar --- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 82975b5..49d2fa7 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4912,7 +4912,7 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len, ret = sched_getaffinity(pid, mask); if (ret == 0) { - int retlen = min(len, cpumask_size()); + size_t retlen = min_t(size_t, len, cpumask_size()); if (copy_to_user(user_mask_ptr, mask, retlen)) ret = -EFAULT; -- cgit v1.1 From dcd5c1662db59a6b82942f47fb6ac9dd63f6d3dd Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 16 Mar 2010 01:05:02 +0100 Subject: perf: Fix unexported generic perf_arch_fetch_caller_regs perf_arch_fetch_caller_regs() is exported for the overriden x86 version, but not for the generic weak version. As a general rule, weak functions should not have their symbol exported in the same file they are defined. So let's export it on trace_event_perf.c as it is used by trace events only. This fixes: ERROR: ".perf_arch_fetch_caller_regs" [fs/xfs/xfs.ko] undefined! ERROR: ".perf_arch_fetch_caller_regs" [arch/powerpc/platforms/cell/spufs/spufs.ko] undefined! -v2: And also only build it if trace events are enabled. -v3: Fix changelog mistake Reported-by: Stephen Rothwell Signed-off-by: Frederic Weisbecker Cc: Peter Zijlstra Cc: Xiao Guangrong Cc: Paul Mackerras LKML-Reference: <1268697902-9518-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar --- kernel/perf_event.c | 2 ++ kernel/trace/trace_event_perf.c | 2 ++ 2 files changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index fb3031c..574ee58 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -2786,10 +2786,12 @@ __weak struct perf_callchain_entry *perf_callchain(struct pt_regs *regs) return NULL; } +#ifdef CONFIG_EVENT_TRACING __weak void perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned long ip, int skip) { } +#endif /* * Output diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index 0709e4f..7d79a10 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -12,6 +12,8 @@ DEFINE_PER_CPU(struct pt_regs, perf_trace_regs); EXPORT_PER_CPU_SYMBOL_GPL(perf_trace_regs); +EXPORT_SYMBOL_GPL(perf_arch_fetch_caller_regs); + static char *perf_trace_buf; static char *perf_trace_buf_nmi; -- cgit v1.1 From 2271048d1b3b0aabf83d25b29c20646dcabedc05 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Thu, 18 Mar 2010 17:54:19 -0400 Subject: ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align The ring buffer uses 4 byte alignment while recording events into the buffer, even on 64bit machines. This saves space when there are lots of events being recorded at 4 byte boundaries. The ring buffer has a zero copy method to write into the buffer, with the reserving of space and then committing it. This may cause problems when writing an 8 byte word into a 4 byte alignment (not 8). For x86 and PPC this is not an issue, but on some architectures this would cause an out-of-alignment exception. This patch uses CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine if it is OK to use 4 byte alignments on 64 bit machines. If it is not, it forces the ring buffer event header to be 8 bytes and not 4, and will align the length of the data to be 8 byte aligned. This keeps the data payload at 8 byte alignments and will allow these machines to run without issue. The trick to this is that the header can be either 4 bytes or 8 bytes depending on the length of the data payload. The 4 byte header has a length field that supports up to 112 bytes. If the length of the data is more than 112, the length field is set to zero, and the actual length is stored in the next 4 bytes after the header. When CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set, the code forces zero in the 4 byte header forcing the length to be stored in the 4 byte array, even with a small data load. It also forces the length of the data load to be 8 byte aligned. The combination of these two guarantee that the data is always at 8 byte alignment. Tested-by: Frederic Weisbecker (on sparc64) Reported-by: Frederic Weisbecker Acked-by: David S. Miller Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 05a9f83..d1187ef 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -207,6 +207,14 @@ EXPORT_SYMBOL_GPL(tracing_is_on); #define RB_MAX_SMALL_DATA (RB_ALIGNMENT * RINGBUF_TYPE_DATA_TYPE_LEN_MAX) #define RB_EVNT_MIN_SIZE 8U /* two 32bit words */ +#if !defined(CONFIG_64BIT) || defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) +# define RB_FORCE_8BYTE_ALIGNMENT 0 +# define RB_ARCH_ALIGNMENT RB_ALIGNMENT +#else +# define RB_FORCE_8BYTE_ALIGNMENT 1 +# define RB_ARCH_ALIGNMENT 8U +#endif + /* define RINGBUF_TYPE_DATA for 'case RINGBUF_TYPE_DATA:' */ #define RINGBUF_TYPE_DATA 0 ... RINGBUF_TYPE_DATA_TYPE_LEN_MAX @@ -1547,7 +1555,7 @@ rb_update_event(struct ring_buffer_event *event, case 0: length -= RB_EVNT_HDR_SIZE; - if (length > RB_MAX_SMALL_DATA) + if (length > RB_MAX_SMALL_DATA || RB_FORCE_8BYTE_ALIGNMENT) event->array[0] = length; else event->type_len = DIV_ROUND_UP(length, RB_ALIGNMENT); @@ -1722,11 +1730,11 @@ static unsigned rb_calculate_event_length(unsigned length) if (!length) length = 1; - if (length > RB_MAX_SMALL_DATA) + if (length > RB_MAX_SMALL_DATA || RB_FORCE_8BYTE_ALIGNMENT) length += sizeof(event.array[0]); length += RB_EVNT_HDR_SIZE; - length = ALIGN(length, RB_ALIGNMENT); + length = ALIGN(length, RB_ARCH_ALIGNMENT); return length; } -- cgit v1.1 From 8c2eb4805d422bdbf60ba00ff233c794d23c3c00 Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Fri, 19 Mar 2010 10:28:02 +0000 Subject: softlockup: Stop spurious softlockup messages due to overflow Ensure additions on touch_ts do not overflow. This can occur when the top 32 bits of the TSC reach 0xffffffff causing additions to touch_ts to overflow and this in turn generates spurious softlockup warnings. Signed-off-by: Colin Ian King Cc: Peter Zijlstra Cc: Eric Dumazet Cc: LKML-Reference: <1268994482.1798.6.camel@lenovo> Signed-off-by: Ingo Molnar --- kernel/softlockup.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/softlockup.c b/kernel/softlockup.c index 0d4c789..4b493f6 100644 --- a/kernel/softlockup.c +++ b/kernel/softlockup.c @@ -155,11 +155,11 @@ void softlockup_tick(void) * Wake up the high-prio watchdog task twice per * threshold timespan. */ - if (now > touch_ts + softlockup_thresh/2) + if (time_after(now - softlockup_thresh/2, touch_ts)) wake_up_process(per_cpu(softlockup_watchdog, this_cpu)); /* Warn about unreasonable delays: */ - if (now <= (touch_ts + softlockup_thresh)) + if (time_before_eq(now - softlockup_thresh, touch_ts)) return; per_cpu(softlockup_print_ts, this_cpu) = touch_ts; -- cgit v1.1 From 830ec0458c390f29c6c99e1ff7feab9e36368d12 Mon Sep 17 00:00:00 2001 From: John Stultz Date: Thu, 18 Mar 2010 14:47:30 -0700 Subject: time: Fix accumulation bug triggered by long delay. The logarithmic accumulation done in the timekeeping has some overflow protection that limits the max shift value. That means it will take more then shift loops to accumulate all of the cycles. This causes the shift decrement to underflow, which causes the loop to never exit. The simplest fix would be simply to do a: if (shift) shift--; However that is not optimal, as we know the cycle offset is larger then the interval << shift, the above would make shift drop to zero, then we would be spinning for quite awhile accumulating at interval chunks at a time. Instead, this patch only decreases shift if the offset is smaller then cycle_interval << shift. This makes sure we accumulate using the largest chunks possible without overflowing tick_length, and limits the number of iterations through the loop. This issue was found and reported by Sonic Zhang, who also tested the fix. Many thanks your explanation and testing! Reported-by: Sonic Zhang Signed-off-by: John Stultz Tested-by: Sonic Zhang LKML-Reference: <1268948850-5225-1-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner --- kernel/time/timekeeping.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 1673637..39f6177 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -818,7 +818,8 @@ void update_wall_time(void) shift = min(shift, maxshift); while (offset >= timekeeper.cycle_interval) { offset = logarithmic_accumulation(offset, shift); - shift--; + if(offset < timekeeper.cycle_interval< Date: Thu, 11 Mar 2010 17:01:09 -0700 Subject: resources: add interfaces that return conflict information request_resource() and insert_resource() only return success or failure, which no information about what existing resource conflicted with the proposed new reservation. This patch adds request_resource_conflict() and insert_resource_conflict(), which return the conflicting resource. Callers may use this for better error messages or to adjust the new resource and retry the request. Signed-off-by: Bjorn Helgaas Signed-off-by: Jesse Barnes --- kernel/resource.c | 44 +++++++++++++++++++++++++++++++++++++------- 1 file changed, 37 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/resource.c b/kernel/resource.c index 2d5be5d..9c358e2 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -219,19 +219,34 @@ void release_child_resources(struct resource *r) } /** - * request_resource - request and reserve an I/O or memory resource + * request_resource_conflict - request and reserve an I/O or memory resource * @root: root resource descriptor * @new: resource descriptor desired by caller * - * Returns 0 for success, negative error code on error. + * Returns 0 for success, conflict resource on error. */ -int request_resource(struct resource *root, struct resource *new) +struct resource *request_resource_conflict(struct resource *root, struct resource *new) { struct resource *conflict; write_lock(&resource_lock); conflict = __request_resource(root, new); write_unlock(&resource_lock); + return conflict; +} + +/** + * request_resource - request and reserve an I/O or memory resource + * @root: root resource descriptor + * @new: resource descriptor desired by caller + * + * Returns 0 for success, negative error code on error. + */ +int request_resource(struct resource *root, struct resource *new) +{ + struct resource *conflict; + + conflict = request_resource_conflict(root, new); return conflict ? -EBUSY : 0; } @@ -474,25 +489,40 @@ static struct resource * __insert_resource(struct resource *parent, struct resou } /** - * insert_resource - Inserts a resource in the resource tree + * insert_resource_conflict - Inserts resource in the resource tree * @parent: parent of the new resource * @new: new resource to insert * - * Returns 0 on success, -EBUSY if the resource can't be inserted. + * Returns 0 on success, conflict resource if the resource can't be inserted. * - * This function is equivalent to request_resource when no conflict + * This function is equivalent to request_resource_conflict when no conflict * happens. If a conflict happens, and the conflicting resources * entirely fit within the range of the new resource, then the new * resource is inserted and the conflicting resources become children of * the new resource. */ -int insert_resource(struct resource *parent, struct resource *new) +struct resource *insert_resource_conflict(struct resource *parent, struct resource *new) { struct resource *conflict; write_lock(&resource_lock); conflict = __insert_resource(parent, new); write_unlock(&resource_lock); + return conflict; +} + +/** + * insert_resource - Inserts a resource in the resource tree + * @parent: parent of the new resource + * @new: new resource to insert + * + * Returns 0 on success, -EBUSY if the resource can't be inserted. + */ +int insert_resource(struct resource *parent, struct resource *new) +{ + struct resource *conflict; + + conflict = insert_resource_conflict(parent, new); return conflict ? -EBUSY : 0; } -- cgit v1.1 From cc8c3b78433222e5dbc1fdfcfdde29e1743f181a Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 23 Mar 2010 22:40:53 +0100 Subject: genirq: Protect access to irq_desc->action in can_request_irq() can_request_irq() accesses and dereferences irq_desc->action w/o holding irq_desc->lock. So action can be freed on another CPU before it's dereferenced. Unlikely, but ... Protect it with desc->lock. Signed-off-by: Thomas Gleixner --- kernel/irq/manage.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 69a3d7b..398fda1 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -382,6 +382,7 @@ int can_request_irq(unsigned int irq, unsigned long irqflags) { struct irq_desc *desc = irq_to_desc(irq); struct irqaction *action; + unsigned long flags; if (!desc) return 0; @@ -389,11 +390,14 @@ int can_request_irq(unsigned int irq, unsigned long irqflags) if (desc->status & IRQ_NOREQUEST) return 0; + raw_spin_lock_irqsave(&desc->lock, flags); action = desc->action; if (action) if (irqflags & action->flags & IRQF_SHARED) action = NULL; + raw_spin_unlock_irqrestore(&desc->lock, flags); + return !action; } -- cgit v1.1 From 860652bfb890bd861c999ec39fcffabe5b712f85 Mon Sep 17 00:00:00 2001 From: Henrik Kretzschmar Date: Wed, 24 Mar 2010 12:59:20 +0100 Subject: genirq: Move two IRQ functions from .init.text to .text Both functions should not be marked as __init, since they be called from modules after the init section is freed. Signed-off-by: Henrik Kretzschmar Cc: Yinghai Lu Cc: Peter Zijlstra Cc: Jiri Kosina LKML-Reference: <1269431961-5731-1-git-send-email-henne@nachtwindheim.de> Signed-off-by: Thomas Gleixner --- kernel/irq/chip.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index 71eba24..3c2d6e7 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -729,7 +729,7 @@ set_irq_chip_and_handler_name(unsigned int irq, struct irq_chip *chip, __set_irq_handler(irq, handle, 0, name); } -void __init set_irq_noprobe(unsigned int irq) +void set_irq_noprobe(unsigned int irq) { struct irq_desc *desc = irq_to_desc(irq); unsigned long flags; @@ -744,7 +744,7 @@ void __init set_irq_noprobe(unsigned int irq) raw_spin_unlock_irqrestore(&desc->lock, flags); } -void __init set_irq_probe(unsigned int irq) +void set_irq_probe(unsigned int irq) { struct irq_desc *desc = irq_to_desc(irq); unsigned long flags; -- cgit v1.1 From 9d34706f42f9b8c15185423d9af98d37ba21d011 Mon Sep 17 00:00:00 2001 From: Li Zefan Date: Tue, 23 Mar 2010 13:35:12 -0700 Subject: cgroups: remove duplicate include commit e6a1105b ("cgroups: subsystem module loading interface") and commit c50cc752 ("sched, cgroups: Fix module export") result in duplicate including of module.h Signed-off-by: Li Zefan Acked-by: Paul Menage Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cgroup.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index ef909a3..e2769e1 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -27,7 +27,6 @@ */ #include -#include #include #include #include -- cgit v1.1 From 5ab116c9349ef52d6fbd2e2917a53f13194b048e Mon Sep 17 00:00:00 2001 From: Miao Xie Date: Tue, 23 Mar 2010 13:35:34 -0700 Subject: cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node cpuset_mem_spread_node() returns an offline node, and causes an oops. This patch fixes it by initializing task->mems_allowed to node_states[N_HIGH_MEMORY], and updating task->mems_allowed when doing memory hotplug. Signed-off-by: Miao Xie Acked-by: David Rientjes Reported-by: Nick Piggin Tested-by: Nick Piggin Cc: Paul Menage Cc: Li Zefan Cc: Ingo Molnar Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 20 ++++++++++++-------- kernel/kthread.c | 2 +- 2 files changed, 13 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index ba401fa..5d38bd7 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -920,9 +920,6 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs, * call to guarantee_online_mems(), as we know no one is changing * our task's cpuset. * - * Hold callback_mutex around the two modifications of our tasks - * mems_allowed to synchronize with cpuset_mems_allowed(). - * * While the mm_struct we are migrating is typically from some * other task, the task_struct mems_allowed that we are hacking * is for our current task, which must allocate new pages for that @@ -1391,11 +1388,10 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont, if (cs == &top_cpuset) { cpumask_copy(cpus_attach, cpu_possible_mask); - to = node_possible_map; } else { guarantee_online_cpus(cs, cpus_attach); - guarantee_online_mems(cs, &to); } + guarantee_online_mems(cs, &to); /* do per-task migration stuff possibly for each in the threadgroup */ cpuset_attach_task(tsk, &to, cs); @@ -2090,15 +2086,23 @@ static int cpuset_track_online_cpus(struct notifier_block *unused_nb, static int cpuset_track_online_nodes(struct notifier_block *self, unsigned long action, void *arg) { + nodemask_t oldmems; + cgroup_lock(); switch (action) { case MEM_ONLINE: - case MEM_OFFLINE: + oldmems = top_cpuset.mems_allowed; mutex_lock(&callback_mutex); top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY]; mutex_unlock(&callback_mutex); - if (action == MEM_OFFLINE) - scan_for_empty_cpusets(&top_cpuset); + update_tasks_nodemask(&top_cpuset, &oldmems, NULL); + break; + case MEM_OFFLINE: + /* + * needn't update top_cpuset.mems_allowed explicitly because + * scan_for_empty_cpusets() will update it. + */ + scan_for_empty_cpusets(&top_cpuset); break; default: break; diff --git a/kernel/kthread.c b/kernel/kthread.c index 82ed0ea..83911c7 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -219,7 +219,7 @@ int kthreadd(void *unused) set_task_comm(tsk, "kthreadd"); ignore_signals(tsk); set_cpus_allowed_ptr(tsk, cpu_all_mask); - set_mems_allowed(node_possible_map); + set_mems_allowed(node_states[N_HIGH_MEMORY]); current->flags |= PF_NOFREEZE | PF_FREEZER_NOSIG; -- cgit v1.1 From 53feb29767c29c877f9d47dcfe14211b5b0f7ebd Mon Sep 17 00:00:00 2001 From: Miao Xie Date: Tue, 23 Mar 2010 13:35:35 -0700 Subject: cpuset: alloc nodemask_t on the heap rather than the stack Signed-off-by: Miao Xie Acked-by: David Rientjes Cc: Nick Piggin Cc: Paul Menage Cc: Li Zefan Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 94 ++++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 66 insertions(+), 28 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 5d38bd7..d109467 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -970,15 +970,20 @@ static void cpuset_change_nodemask(struct task_struct *p, struct cpuset *cs; int migrate; const nodemask_t *oldmem = scan->data; - nodemask_t newmems; + NODEMASK_ALLOC(nodemask_t, newmems, GFP_KERNEL); + + if (!newmems) + return; cs = cgroup_cs(scan->cg); - guarantee_online_mems(cs, &newmems); + guarantee_online_mems(cs, newmems); task_lock(p); - cpuset_change_task_nodemask(p, &newmems); + cpuset_change_task_nodemask(p, newmems); task_unlock(p); + NODEMASK_FREE(newmems); + mm = get_task_mm(p); if (!mm) return; @@ -1048,16 +1053,21 @@ static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem, static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, const char *buf) { - nodemask_t oldmem; + NODEMASK_ALLOC(nodemask_t, oldmem, GFP_KERNEL); int retval; struct ptr_heap heap; + if (!oldmem) + return -ENOMEM; + /* * top_cpuset.mems_allowed tracks node_stats[N_HIGH_MEMORY]; * it's read-only */ - if (cs == &top_cpuset) - return -EACCES; + if (cs == &top_cpuset) { + retval = -EACCES; + goto done; + } /* * An empty mems_allowed is ok iff there are no tasks in the cpuset. @@ -1073,11 +1083,13 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, goto done; if (!nodes_subset(trialcs->mems_allowed, - node_states[N_HIGH_MEMORY])) - return -EINVAL; + node_states[N_HIGH_MEMORY])) { + retval = -EINVAL; + goto done; + } } - oldmem = cs->mems_allowed; - if (nodes_equal(oldmem, trialcs->mems_allowed)) { + *oldmem = cs->mems_allowed; + if (nodes_equal(*oldmem, trialcs->mems_allowed)) { retval = 0; /* Too easy - nothing to do */ goto done; } @@ -1093,10 +1105,11 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, cs->mems_allowed = trialcs->mems_allowed; mutex_unlock(&callback_mutex); - update_tasks_nodemask(cs, &oldmem, &heap); + update_tasks_nodemask(cs, oldmem, &heap); heap_free(&heap); done: + NODEMASK_FREE(oldmem); return retval; } @@ -1381,39 +1394,47 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont, struct cgroup *oldcont, struct task_struct *tsk, bool threadgroup) { - nodemask_t from, to; struct mm_struct *mm; struct cpuset *cs = cgroup_cs(cont); struct cpuset *oldcs = cgroup_cs(oldcont); + NODEMASK_ALLOC(nodemask_t, from, GFP_KERNEL); + NODEMASK_ALLOC(nodemask_t, to, GFP_KERNEL); + + if (from == NULL || to == NULL) + goto alloc_fail; if (cs == &top_cpuset) { cpumask_copy(cpus_attach, cpu_possible_mask); } else { guarantee_online_cpus(cs, cpus_attach); } - guarantee_online_mems(cs, &to); + guarantee_online_mems(cs, to); /* do per-task migration stuff possibly for each in the threadgroup */ - cpuset_attach_task(tsk, &to, cs); + cpuset_attach_task(tsk, to, cs); if (threadgroup) { struct task_struct *c; rcu_read_lock(); list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) { - cpuset_attach_task(c, &to, cs); + cpuset_attach_task(c, to, cs); } rcu_read_unlock(); } /* change mm; only needs to be done once even if threadgroup */ - from = oldcs->mems_allowed; - to = cs->mems_allowed; + *from = oldcs->mems_allowed; + *to = cs->mems_allowed; mm = get_task_mm(tsk); if (mm) { - mpol_rebind_mm(mm, &to); + mpol_rebind_mm(mm, to); if (is_memory_migrate(cs)) - cpuset_migrate_mm(mm, &from, &to); + cpuset_migrate_mm(mm, from, to); mmput(mm); } + +alloc_fail: + NODEMASK_FREE(from); + NODEMASK_FREE(to); } /* The various types of files and directories in a cpuset file system */ @@ -1558,13 +1579,21 @@ static int cpuset_sprintf_cpulist(char *page, struct cpuset *cs) static int cpuset_sprintf_memlist(char *page, struct cpuset *cs) { - nodemask_t mask; + NODEMASK_ALLOC(nodemask_t, mask, GFP_KERNEL); + int retval; + + if (mask == NULL) + return -ENOMEM; mutex_lock(&callback_mutex); - mask = cs->mems_allowed; + *mask = cs->mems_allowed; mutex_unlock(&callback_mutex); - return nodelist_scnprintf(page, PAGE_SIZE, mask); + retval = nodelist_scnprintf(page, PAGE_SIZE, *mask); + + NODEMASK_FREE(mask); + + return retval; } static ssize_t cpuset_common_file_read(struct cgroup *cont, @@ -1993,7 +2022,10 @@ static void scan_for_empty_cpusets(struct cpuset *root) struct cpuset *cp; /* scans cpusets being updated */ struct cpuset *child; /* scans child cpusets of cp */ struct cgroup *cont; - nodemask_t oldmems; + NODEMASK_ALLOC(nodemask_t, oldmems, GFP_KERNEL); + + if (oldmems == NULL) + return; list_add_tail((struct list_head *)&root->stack_list, &queue); @@ -2010,7 +2042,7 @@ static void scan_for_empty_cpusets(struct cpuset *root) nodes_subset(cp->mems_allowed, node_states[N_HIGH_MEMORY])) continue; - oldmems = cp->mems_allowed; + *oldmems = cp->mems_allowed; /* Remove offline cpus and mems from this cpuset. */ mutex_lock(&callback_mutex); @@ -2026,9 +2058,10 @@ static void scan_for_empty_cpusets(struct cpuset *root) remove_tasks_in_empty_cpuset(cp); else { update_tasks_cpumask(cp, NULL); - update_tasks_nodemask(cp, &oldmems, NULL); + update_tasks_nodemask(cp, oldmems, NULL); } } + NODEMASK_FREE(oldmems); } /* @@ -2086,16 +2119,19 @@ static int cpuset_track_online_cpus(struct notifier_block *unused_nb, static int cpuset_track_online_nodes(struct notifier_block *self, unsigned long action, void *arg) { - nodemask_t oldmems; + NODEMASK_ALLOC(nodemask_t, oldmems, GFP_KERNEL); + + if (oldmems == NULL) + return NOTIFY_DONE; cgroup_lock(); switch (action) { case MEM_ONLINE: - oldmems = top_cpuset.mems_allowed; + *oldmems = top_cpuset.mems_allowed; mutex_lock(&callback_mutex); top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY]; mutex_unlock(&callback_mutex); - update_tasks_nodemask(&top_cpuset, &oldmems, NULL); + update_tasks_nodemask(&top_cpuset, oldmems, NULL); break; case MEM_OFFLINE: /* @@ -2108,6 +2144,8 @@ static int cpuset_track_online_nodes(struct notifier_block *self, break; } cgroup_unlock(); + + NODEMASK_FREE(oldmems); return NOTIFY_OK; } #endif -- cgit v1.1 From 4f598458ea4450f53e8ed929ee4e66b3404a7286 Mon Sep 17 00:00:00 2001 From: Xiaotian Feng Date: Wed, 10 Mar 2010 22:59:13 +0100 Subject: Freezer: Only show the state of tasks refusing to freeze show_state will dump all tasks state, so if freezer failed to freeze any task, kernel will dump all tasks state and flood the dmesg log. This patch makes freezer only show state of tasks refusing to freeze. Signed-off-by: Xiaotian Feng Acked-by: Pavel Machek Acked-by: David Rientjes Signed-off-by: Rafael J. Wysocki --- kernel/power/process.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/power/process.c b/kernel/power/process.c index 5ade1bd..a0480cd 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -88,12 +88,11 @@ static int try_to_freeze_tasks(bool sig_only) printk(KERN_ERR "Freezing of tasks failed after %d.%02d seconds " "(%d tasks refusing to freeze):\n", elapsed_csecs / 100, elapsed_csecs % 100, todo); - show_state(); read_lock(&tasklist_lock); do_each_thread(g, p) { task_lock(p); if (freezing(p) && !freezer_should_skip(p)) - printk(KERN_ERR " %s\n", p->comm); + sched_show_task(p); cancel_freezing(p); task_unlock(p); } while_each_thread(g, p); -- cgit v1.1 From 5a7aadfe2fcb0f69e2acc1fbefe22a096e792fc9 Mon Sep 17 00:00:00 2001 From: Matt Helsley Date: Fri, 26 Mar 2010 23:51:44 +0100 Subject: Freezer: Fix buggy resume test for tasks frozen with cgroup freezer When the cgroup freezer is used to freeze tasks we do not want to thaw those tasks during resume. Currently we test the cgroup freezer state of the resuming tasks to see if the cgroup is FROZEN. If so then we don't thaw the task. However, the FREEZING state also indicates that the task should remain frozen. This also avoids a problem pointed out by Oren Ladaan: the freezer state transition from FREEZING to FROZEN is updated lazily when userspace reads or writes the freezer.state file in the cgroup filesystem. This means that resume will thaw tasks in cgroups which should be in the FROZEN state if there is no read/write of the freezer.state file to trigger this transition before suspend. NOTE: Another "simple" solution would be to always update the cgroup freezer state during resume. However it's a bad choice for several reasons: Updating the cgroup freezer state is somewhat expensive because it requires walking all the tasks in the cgroup and checking if they are each frozen. Worse, this could easily make resume run in N^2 time where N is the number of tasks in the cgroup. Finally, updating the freezer state from this code path requires trickier locking because of the way locks must be ordered. Instead of updating the freezer state we rely on the fact that lazy updates only manage the transition from FREEZING to FROZEN. We know that a cgroup with the FREEZING state may actually be FROZEN so test for that state too. This makes sense in the resume path even for partially-frozen cgroups -- those that really are FREEZING but not FROZEN. Reported-by: Oren Ladaan Signed-off-by: Matt Helsley Cc: stable@kernel.org Signed-off-by: Rafael J. Wysocki --- kernel/cgroup_freezer.c | 9 ++++++--- kernel/power/process.c | 2 +- 2 files changed, 7 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c index 59e9ef6..eb3f34d 100644 --- a/kernel/cgroup_freezer.c +++ b/kernel/cgroup_freezer.c @@ -47,17 +47,20 @@ static inline struct freezer *task_freezer(struct task_struct *task) struct freezer, css); } -int cgroup_frozen(struct task_struct *task) +int cgroup_freezing_or_frozen(struct task_struct *task) { struct freezer *freezer; enum freezer_state state; task_lock(task); freezer = task_freezer(task); - state = freezer->state; + if (!freezer->css.cgroup->parent) + state = CGROUP_THAWED; /* root cgroup can't be frozen */ + else + state = freezer->state; task_unlock(task); - return state == CGROUP_FROZEN; + return (state == CGROUP_FREEZING) || (state == CGROUP_FROZEN); } /* diff --git a/kernel/power/process.c b/kernel/power/process.c index a0480cd..71ae290 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -144,7 +144,7 @@ static void thaw_tasks(bool nosig_only) if (nosig_only && should_send_signal(p)) continue; - if (cgroup_frozen(p)) + if (cgroup_freezing_or_frozen(p)) continue; thaw_process(p); -- cgit v1.1 From 259354deaaf03d49a02dbb9975d6ec2a54675672 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Wed, 10 Mar 2010 18:56:10 +0900 Subject: module: encapsulate percpu handling better and record percpu_size Better encapsulate module static percpu area handling so that code outsidef of CONFIG_SMP ifdef doesn't deal with mod->percpu directly and add mod->percpu_size and record percpu_size in it. Both percpu fields are compiled out on UP. While at it, mark mod->percpu w/ __percpu. This is to prepare for is_module_percpu_address(). Signed-off-by: Tejun Heo Acked-by: Rusty Russell --- kernel/module.c | 66 ++++++++++++++++++++++++++++++--------------------------- 1 file changed, 35 insertions(+), 31 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index c968d36..e7a6e53 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -370,27 +370,33 @@ EXPORT_SYMBOL_GPL(find_module); #ifdef CONFIG_SMP -static void *percpu_modalloc(unsigned long size, unsigned long align, - const char *name) +static inline void __percpu *mod_percpu(struct module *mod) { - void *ptr; + return mod->percpu; +} +static int percpu_modalloc(struct module *mod, + unsigned long size, unsigned long align) +{ if (align > PAGE_SIZE) { printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n", - name, align, PAGE_SIZE); + mod->name, align, PAGE_SIZE); align = PAGE_SIZE; } - ptr = __alloc_reserved_percpu(size, align); - if (!ptr) + mod->percpu = __alloc_reserved_percpu(size, align); + if (!mod->percpu) { printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n", size); - return ptr; + return -ENOMEM; + } + mod->percpu_size = size; + return 0; } -static void percpu_modfree(void *freeme) +static void percpu_modfree(struct module *mod) { - free_percpu(freeme); + free_percpu(mod->percpu); } static unsigned int find_pcpusec(Elf_Ehdr *hdr, @@ -400,24 +406,28 @@ static unsigned int find_pcpusec(Elf_Ehdr *hdr, return find_sec(hdr, sechdrs, secstrings, ".data.percpu"); } -static void percpu_modcopy(void *pcpudest, const void *from, unsigned long size) +static void percpu_modcopy(struct module *mod, + const void *from, unsigned long size) { int cpu; for_each_possible_cpu(cpu) - memcpy(pcpudest + per_cpu_offset(cpu), from, size); + memcpy(per_cpu_ptr(mod->percpu, cpu), from, size); } #else /* ... !CONFIG_SMP */ -static inline void *percpu_modalloc(unsigned long size, unsigned long align, - const char *name) +static inline void __percpu *mod_percpu(struct module *mod) { return NULL; } -static inline void percpu_modfree(void *pcpuptr) +static inline int percpu_modalloc(struct module *mod, + unsigned long size, unsigned long align) +{ + return -ENOMEM; +} +static inline void percpu_modfree(struct module *mod) { - BUG(); } static inline unsigned int find_pcpusec(Elf_Ehdr *hdr, Elf_Shdr *sechdrs, @@ -425,8 +435,8 @@ static inline unsigned int find_pcpusec(Elf_Ehdr *hdr, { return 0; } -static inline void percpu_modcopy(void *pcpudst, const void *src, - unsigned long size) +static inline void percpu_modcopy(struct module *mod, + const void *from, unsigned long size) { /* pcpusec should be 0, and size of that section should be 0. */ BUG_ON(size != 0); @@ -1400,8 +1410,7 @@ static void free_module(struct module *mod) /* This may be NULL, but that's OK */ module_free(mod, mod->module_init); kfree(mod->args); - if (mod->percpu) - percpu_modfree(mod->percpu); + percpu_modfree(mod); #if defined(CONFIG_MODULE_UNLOAD) if (mod->refptr) free_percpu(mod->refptr); @@ -1520,7 +1529,7 @@ static int simplify_symbols(Elf_Shdr *sechdrs, default: /* Divert to percpu allocation if a percpu var. */ if (sym[i].st_shndx == pcpuindex) - secbase = (unsigned long)mod->percpu; + secbase = (unsigned long)mod_percpu(mod); else secbase = sechdrs[sym[i].st_shndx].sh_addr; sym[i].st_value += secbase; @@ -1954,7 +1963,7 @@ static noinline struct module *load_module(void __user *umod, unsigned int modindex, versindex, infoindex, pcpuindex; struct module *mod; long err = 0; - void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */ + void *ptr = NULL; /* Stops spurious gcc warning */ unsigned long symoffs, stroffs, *strmap; mm_segment_t old_fs; @@ -2094,15 +2103,11 @@ static noinline struct module *load_module(void __user *umod, if (pcpuindex) { /* We have a special allocation for this section. */ - percpu = percpu_modalloc(sechdrs[pcpuindex].sh_size, - sechdrs[pcpuindex].sh_addralign, - mod->name); - if (!percpu) { - err = -ENOMEM; + err = percpu_modalloc(mod, sechdrs[pcpuindex].sh_size, + sechdrs[pcpuindex].sh_addralign); + if (err) goto free_mod; - } sechdrs[pcpuindex].sh_flags &= ~(unsigned long)SHF_ALLOC; - mod->percpu = percpu; } /* Determine total sizes, and put offsets in sh_entsize. For now @@ -2317,7 +2322,7 @@ static noinline struct module *load_module(void __user *umod, sort_extable(mod->extable, mod->extable + mod->num_exentries); /* Finally, copy percpu area over. */ - percpu_modcopy(mod->percpu, (void *)sechdrs[pcpuindex].sh_addr, + percpu_modcopy(mod, (void *)sechdrs[pcpuindex].sh_addr, sechdrs[pcpuindex].sh_size); add_kallsyms(mod, sechdrs, hdr->e_shnum, symindex, strindex, @@ -2409,8 +2414,7 @@ static noinline struct module *load_module(void __user *umod, module_free(mod, mod->module_core); /* mod will be freed with core. Don't access it beyond this line! */ free_percpu: - if (percpu) - percpu_modfree(percpu); + percpu_modfree(mod); free_mod: kfree(args); kfree(strmap); -- cgit v1.1 From 10fad5e46f6c7bdfb01b1a012380a38e3c6ab346 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Wed, 10 Mar 2010 18:57:54 +0900 Subject: percpu, module: implement and use is_kernel/module_percpu_address() lockdep has custom code to check whether a pointer belongs to static percpu area which is somewhat broken. Implement proper is_kernel/module_percpu_address() and replace the custom code. On UP, percpu variables are regular static variables and can't be distinguished from them. Always return %false on UP. Signed-off-by: Tejun Heo Acked-by: Peter Zijlstra Cc: Rusty Russell Cc: Ingo Molnar --- kernel/lockdep.c | 21 +++++---------------- kernel/module.c | 38 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 43 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/lockdep.c b/kernel/lockdep.c index c927a549..9bbb9c8 100644 --- a/kernel/lockdep.c +++ b/kernel/lockdep.c @@ -582,9 +582,6 @@ static int static_obj(void *obj) unsigned long start = (unsigned long) &_stext, end = (unsigned long) &_end, addr = (unsigned long) obj; -#ifdef CONFIG_SMP - int i; -#endif /* * static variable? @@ -595,24 +592,16 @@ static int static_obj(void *obj) if (arch_is_kernel_data(addr)) return 1; -#ifdef CONFIG_SMP /* - * percpu var? + * in-kernel percpu var? */ - for_each_possible_cpu(i) { - start = (unsigned long) &__per_cpu_start + per_cpu_offset(i); - end = (unsigned long) &__per_cpu_start + PERCPU_ENOUGH_ROOM - + per_cpu_offset(i); - - if ((addr >= start) && (addr < end)) - return 1; - } -#endif + if (is_kernel_percpu_address(addr)) + return 1; /* - * module var? + * module static or percpu var? */ - return is_module_address(addr); + return is_module_address(addr) || is_module_percpu_address(addr); } /* diff --git a/kernel/module.c b/kernel/module.c index e7a6e53..9f8d23d 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -415,6 +415,40 @@ static void percpu_modcopy(struct module *mod, memcpy(per_cpu_ptr(mod->percpu, cpu), from, size); } +/** + * is_module_percpu_address - test whether address is from module static percpu + * @addr: address to test + * + * Test whether @addr belongs to module static percpu area. + * + * RETURNS: + * %true if @addr is from module static percpu area + */ +bool is_module_percpu_address(unsigned long addr) +{ + struct module *mod; + unsigned int cpu; + + preempt_disable(); + + list_for_each_entry_rcu(mod, &modules, list) { + if (!mod->percpu_size) + continue; + for_each_possible_cpu(cpu) { + void *start = per_cpu_ptr(mod->percpu, cpu); + + if ((void *)addr >= start && + (void *)addr < start + mod->percpu_size) { + preempt_enable(); + return true; + } + } + } + + preempt_enable(); + return false; +} + #else /* ... !CONFIG_SMP */ static inline void __percpu *mod_percpu(struct module *mod) @@ -441,6 +475,10 @@ static inline void percpu_modcopy(struct module *mod, /* pcpusec should be 0, and size of that section should be 0. */ BUG_ON(size != 0); } +bool is_module_percpu_address(unsigned long addr) +{ + return false; +} #endif /* CONFIG_SMP */ -- cgit v1.1 From 88be12c440cfa2fa3f5be83507360aac9ea1c54e Mon Sep 17 00:00:00 2001 From: Dave Airlie Date: Mon, 29 Mar 2010 12:01:50 +0100 Subject: slow-work: use get_ref wrapper instead of directly calling get_ref Otherwise we can get an oops if the user has no get_ref/put_ref requirement. Signed-off-by: Dave Airlie Signed-off-by: David Howells Signed-off-by: Linus Torvalds --- kernel/slow-work.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/slow-work.c b/kernel/slow-work.c index 7494bbf..7d3f4fa 100644 --- a/kernel/slow-work.c +++ b/kernel/slow-work.c @@ -637,7 +637,7 @@ int delayed_slow_work_enqueue(struct delayed_slow_work *dwork, goto cancelled; /* the timer holds a reference whilst it is pending */ - ret = work->ops->get_ref(work); + ret = slow_work_get_ref(work); if (ret < 0) goto cant_get_ref; -- cgit v1.1 From a53f4f9efaeb1d87cfae066346979d4d70e1abe9 Mon Sep 17 00:00:00 2001 From: David Howells Date: Mon, 29 Mar 2010 13:08:52 +0100 Subject: SLOW_WORK: CONFIG_SLOW_WORK_PROC should be CONFIG_SLOW_WORK_DEBUG CONFIG_SLOW_WORK_PROC was changed to CONFIG_SLOW_WORK_DEBUG, but not in all instances. Change the remaining instances. This makes the debugfs file display the time mark and the owner's description again. Signed-off-by: David Howells Signed-off-by: Linus Torvalds --- kernel/slow-work.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/slow-work.h b/kernel/slow-work.h index 321f3c5..a29ebd1 100644 --- a/kernel/slow-work.h +++ b/kernel/slow-work.h @@ -43,28 +43,28 @@ extern void slow_work_new_thread_desc(struct slow_work *, struct seq_file *); */ static inline void slow_work_set_thread_pid(int id, pid_t pid) { -#ifdef CONFIG_SLOW_WORK_PROC +#ifdef CONFIG_SLOW_WORK_DEBUG slow_work_pids[id] = pid; #endif } static inline void slow_work_mark_time(struct slow_work *work) { -#ifdef CONFIG_SLOW_WORK_PROC +#ifdef CONFIG_SLOW_WORK_DEBUG work->mark = CURRENT_TIME; #endif } static inline void slow_work_begin_exec(int id, struct slow_work *work) { -#ifdef CONFIG_SLOW_WORK_PROC +#ifdef CONFIG_SLOW_WORK_DEBUG slow_work_execs[id] = work; #endif } static inline void slow_work_end_exec(int id, struct slow_work *work) { -#ifdef CONFIG_SLOW_WORK_PROC +#ifdef CONFIG_SLOW_WORK_DEBUG write_lock(&slow_work_execs_lock); slow_work_execs[id] = NULL; write_unlock(&slow_work_execs_lock); -- cgit v1.1 From eed63519e3e74d515d2007ecd895338d0ba2a85c Mon Sep 17 00:00:00 2001 From: Ian Campbell Date: Sun, 28 Mar 2010 19:42:56 -0700 Subject: x86: Do not free zero sized per cpu areas This avoids an infinite loop in free_early_partial(). Add a warning to free_early_partial() to catch future problems. -v5: put back start > end back into WARN_ONCE() -v6: use one line for warning, suggested by Linus -v7: more tests -v8: remove the function name as suggested by Johannes WARN_ONCE() will print out that function name. Signed-off-by: Ian Campbell Signed-off-by: Yinghai Lu Tested-by: Konrad Rzeszutek Wilk Tested-by: Joel Becker Tested-by: Stanislaw Gruszka Acked-by: Johannes Weiner Cc: Peter Zijlstra Cc: David Miller Cc: Benjamin Herrenschmidt Cc: Linus Torvalds LKML-Reference: <1269830604-26214-4-git-send-email-yinghai@kernel.org> Signed-off-by: Ingo Molnar --- kernel/early_res.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'kernel') diff --git a/kernel/early_res.c b/kernel/early_res.c index 3cb2c66..31aa933 100644 --- a/kernel/early_res.c +++ b/kernel/early_res.c @@ -333,6 +333,12 @@ void __init free_early_partial(u64 start, u64 end) struct early_res *r; int i; + if (start == end) + return; + + if (WARN_ONCE(start > end, " wrong range [%#llx, %#llx]\n", start, end)) + return; + try_next: i = find_overlapped_early(start, end); if (i >= max_early_res) -- cgit v1.1 From e36673ec5126f15a8cddf6049aede7bdcf484c26 Mon Sep 17 00:00:00 2001 From: Li Zefan Date: Wed, 24 Mar 2010 10:57:37 +0800 Subject: tracing: Fix lockdep warning in global_clock() # echo 1 > events/enable # echo global > trace_clock ------------[ cut here ]------------ WARNING: at kernel/lockdep.c:3162 check_flags+0xb2/0x190() ... ---[ end trace 3f86734a89416623 ]--- possible reason: unannotated irqs-on. ... There's no reason to use the raw_local_irq_save() in trace_clock_global. The local_irq_save() version is fine, and does not cause the bug in lockdep. Acked-by: Peter Zijlstra Signed-off-by: Li Zefan LKML-Reference: <4BA97FA1.7030606@cn.fujitsu.com> Signed-off-by: Steven Rostedt --- kernel/trace/trace_clock.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_clock.c b/kernel/trace/trace_clock.c index 6fbfb8f..9d589d8 100644 --- a/kernel/trace/trace_clock.c +++ b/kernel/trace/trace_clock.c @@ -84,7 +84,7 @@ u64 notrace trace_clock_global(void) int this_cpu; u64 now; - raw_local_irq_save(flags); + local_irq_save(flags); this_cpu = raw_smp_processor_id(); now = cpu_clock(this_cpu); @@ -110,7 +110,7 @@ u64 notrace trace_clock_global(void) arch_spin_unlock(&trace_clock_struct.lock); out: - raw_local_irq_restore(flags); + local_irq_restore(flags); return now; } -- cgit v1.1 From 292f60c0c4ab44aa2d589ba03c12e64a3b3c5e38 Mon Sep 17 00:00:00 2001 From: Julia Lawall Date: Mon, 29 Mar 2010 17:37:02 +0200 Subject: ring-buffer: Add missing unlock In some error handling cases the lock is not unlocked. The return is converted to a goto, to share the unlock at the end of the function. A simplified version of the semantic patch that finds this problem is as follows: (http://coccinelle.lip6.fr/) // @r exists@ expression E1; identifier f; @@ f (...) { <+... * spin_lock_irq (E1,...); ... when != E1 * return ...; ...+> } // Signed-off-by: Julia Lawall LKML-Reference: Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index d1187ef..9a0f9bf 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1209,18 +1209,19 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned nr_pages) for (i = 0; i < nr_pages; i++) { if (RB_WARN_ON(cpu_buffer, list_empty(cpu_buffer->pages))) - return; + goto out; p = cpu_buffer->pages->next; bpage = list_entry(p, struct buffer_page, list); list_del_init(&bpage->list); free_buffer_page(bpage); } if (RB_WARN_ON(cpu_buffer, list_empty(cpu_buffer->pages))) - return; + goto out; rb_reset_cpu(cpu_buffer); rb_check_pages(cpu_buffer); +out: spin_unlock_irq(&cpu_buffer->reader_lock); } @@ -1237,7 +1238,7 @@ rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer, for (i = 0; i < nr_pages; i++) { if (RB_WARN_ON(cpu_buffer, list_empty(pages))) - return; + goto out; p = pages->next; bpage = list_entry(p, struct buffer_page, list); list_del_init(&bpage->list); @@ -1246,6 +1247,7 @@ rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer, rb_reset_cpu(cpu_buffer); rb_check_pages(cpu_buffer); +out: spin_unlock_irq(&cpu_buffer->reader_lock); } -- cgit v1.1 From 570b8fb505896e007fd3bb07573ba6640e51851d Mon Sep 17 00:00:00 2001 From: Mathieu Desnoyers Date: Tue, 30 Mar 2010 00:04:00 +0100 Subject: CRED: Fix memory leak in error handling Fix a memory leak on an OOM condition in prepare_usermodehelper_creds(). Signed-off-by: Mathieu Desnoyers Signed-off-by: David Howells Signed-off-by: James Morris --- kernel/cred.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/cred.c b/kernel/cred.c index 1ed8ca1..1b1129d 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -364,7 +364,7 @@ struct cred *prepare_usermodehelper_creds(void) new = kmem_cache_alloc(cred_jar, GFP_ATOMIC); if (!new) - return NULL; + goto free_tgcred; kdebug("prepare_usermodehelper_creds() alloc %p", new); @@ -397,6 +397,10 @@ struct cred *prepare_usermodehelper_creds(void) error: put_cred(new); +free_tgcred: +#ifdef CONFIG_KEYS + kfree(tgcred); +#endif return NULL; } -- cgit v1.1 From 5a0e3ad6af8660be21ca98a971cd00f331318c05 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Wed, 24 Mar 2010 17:04:11 +0900 Subject: include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo Guess-its-ok-by: Christoph Lameter Cc: Ingo Molnar Cc: Lee Schermerhorn --- kernel/async.c | 1 + kernel/audit.c | 1 + kernel/audit_tree.c | 1 + kernel/audit_watch.c | 1 + kernel/auditfilter.c | 1 + kernel/auditsc.c | 1 + kernel/cgroup_freezer.c | 1 + kernel/compat.c | 1 + kernel/cpu.c | 1 + kernel/cred.c | 1 + kernel/irq/numa_migrate.c | 1 + kernel/irq/proc.c | 1 + kernel/kallsyms.c | 1 + kernel/latencytop.c | 1 - kernel/lockdep.c | 1 + kernel/nsproxy.c | 1 + kernel/padata.c | 1 + kernel/perf_event.c | 1 + kernel/pid_namespace.c | 1 + kernel/power/hibernate.c | 1 + kernel/power/hibernate_nvs.c | 1 + kernel/power/snapshot.c | 1 + kernel/power/suspend.c | 1 + kernel/power/swap.c | 1 + kernel/res_counter.c | 1 - kernel/sched.c | 1 + kernel/sched_cpupri.c | 1 + kernel/smp.c | 1 + kernel/srcu.c | 1 - kernel/sys.c | 1 + kernel/sysctl_binary.c | 1 + kernel/taskstats.c | 1 + kernel/time.c | 1 - kernel/time/timecompare.c | 1 + kernel/timer.c | 1 + kernel/trace/blktrace.c | 1 + kernel/trace/ftrace.c | 1 + kernel/trace/power-traces.c | 1 - kernel/trace/ring_buffer.c | 1 + kernel/trace/trace.c | 2 +- kernel/trace/trace_events.c | 1 + kernel/trace/trace_events_filter.c | 1 + kernel/trace/trace_functions_graph.c | 1 + kernel/trace/trace_ksym.c | 1 + kernel/trace/trace_mmiotrace.c | 1 + kernel/trace/trace_selftest.c | 1 + kernel/trace/trace_stat.c | 1 + kernel/trace/trace_syscalls.c | 1 + kernel/trace/trace_workqueue.c | 1 + 49 files changed, 44 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/async.c b/kernel/async.c index 27235f5..15319d6 100644 --- a/kernel/async.c +++ b/kernel/async.c @@ -56,6 +56,7 @@ asynchronous and synchronous parts of the kernel. #include #include #include +#include #include static async_cookie_t next_cookie = 1; diff --git a/kernel/audit.c b/kernel/audit.c index 78f7f86..c71bd26 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -46,6 +46,7 @@ #include #include #include +#include #include #include diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 028e856..46a57b5 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -3,6 +3,7 @@ #include #include #include +#include struct audit_tree; struct audit_chunk; diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c index cc7e879..8df4369 100644 --- a/kernel/audit_watch.c +++ b/kernel/audit_watch.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include "audit.h" diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index a706040..ce08041 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include "audit.h" diff --git a/kernel/auditsc.c b/kernel/auditsc.c index f3a461c..97a3cef 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -49,6 +49,7 @@ #include #include #include +#include #include #include #include diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c index 59e9ef6..d2ccd27 100644 --- a/kernel/cgroup_freezer.c +++ b/kernel/cgroup_freezer.c @@ -15,6 +15,7 @@ */ #include +#include #include #include #include diff --git a/kernel/compat.c b/kernel/compat.c index f6c204f..7f40e92 100644 --- a/kernel/compat.c +++ b/kernel/compat.c @@ -25,6 +25,7 @@ #include #include #include +#include #include diff --git a/kernel/cpu.c b/kernel/cpu.c index f8cced2..25bba73 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -14,6 +14,7 @@ #include #include #include +#include #ifdef CONFIG_SMP /* Serializes the updates to cpu_online_mask, cpu_present_mask */ diff --git a/kernel/cred.c b/kernel/cred.c index 1ed8ca1..d84bdef 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -10,6 +10,7 @@ */ #include #include +#include #include #include #include diff --git a/kernel/irq/numa_migrate.c b/kernel/irq/numa_migrate.c index 963559d..65d3845 100644 --- a/kernel/irq/numa_migrate.c +++ b/kernel/irq/numa_migrate.c @@ -6,6 +6,7 @@ */ #include +#include #include #include #include diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c index 6f50ecc..7a6eb04 100644 --- a/kernel/irq/proc.c +++ b/kernel/irq/proc.c @@ -7,6 +7,7 @@ */ #include +#include #include #include #include diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c index 8e5288a..13aff29 100644 --- a/kernel/kallsyms.c +++ b/kernel/kallsyms.c @@ -21,6 +21,7 @@ #include /* for cond_resched */ #include #include +#include #include diff --git a/kernel/latencytop.c b/kernel/latencytop.c index ca07c5c..877fb30 100644 --- a/kernel/latencytop.c +++ b/kernel/latencytop.c @@ -56,7 +56,6 @@ #include #include #include -#include #include static DEFINE_SPINLOCK(latency_lock); diff --git a/kernel/lockdep.c b/kernel/lockdep.c index c927a549..367f724 100644 --- a/kernel/lockdep.c +++ b/kernel/lockdep.c @@ -43,6 +43,7 @@ #include #include #include +#include #include diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c index 2ab6723..f74e6c0 100644 --- a/kernel/nsproxy.c +++ b/kernel/nsproxy.c @@ -13,6 +13,7 @@ * Pavel Emelianov */ +#include #include #include #include diff --git a/kernel/padata.c b/kernel/padata.c index 93caf65..fd03513 100644 --- a/kernel/padata.c +++ b/kernel/padata.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #define MAX_SEQ_NR INT_MAX - NR_CPUS diff --git a/kernel/perf_event.c b/kernel/perf_event.c index 574ee58..a77266e 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index 79aac93..a5aff94 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -13,6 +13,7 @@ #include #include #include +#include #define BITS_PER_PAGE (PAGE_SIZE*8) diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index da5288e..aa9e916 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include diff --git a/kernel/power/hibernate_nvs.c b/kernel/power/hibernate_nvs.c index 39ac698..fdcad9e 100644 --- a/kernel/power/hibernate_nvs.c +++ b/kernel/power/hibernate_nvs.c @@ -10,6 +10,7 @@ #include #include #include +#include #include /* diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 830cade..be861c2 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c index 44cce10..56e7dbb 100644 --- a/kernel/power/suspend.c +++ b/kernel/power/suspend.c @@ -15,6 +15,7 @@ #include #include #include +#include #include "power.h" diff --git a/kernel/power/swap.c b/kernel/power/swap.c index 1d57573..66824d7 100644 --- a/kernel/power/swap.c +++ b/kernel/power/swap.c @@ -23,6 +23,7 @@ #include #include #include +#include #include "power.h" diff --git a/kernel/res_counter.c b/kernel/res_counter.c index bcdabf3..c7eaa37 100644 --- a/kernel/res_counter.c +++ b/kernel/res_counter.c @@ -10,7 +10,6 @@ #include #include #include -#include #include #include #include diff --git a/kernel/sched.c b/kernel/sched.c index 49d2fa7..86c7cc1 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -71,6 +71,7 @@ #include #include #include +#include #include #include diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c index fccf9fb..e6871cb 100644 --- a/kernel/sched_cpupri.c +++ b/kernel/sched_cpupri.c @@ -27,6 +27,7 @@ * of the License. */ +#include #include "sched_cpupri.h" /* Convert between a 140 based task->prio, and our 102 based cpupri */ diff --git a/kernel/smp.c b/kernel/smp.c index 9867b6b..3fc6973 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include diff --git a/kernel/srcu.c b/kernel/srcu.c index bde4295..2980da3 100644 --- a/kernel/srcu.c +++ b/kernel/srcu.c @@ -30,7 +30,6 @@ #include #include #include -#include #include #include diff --git a/kernel/sys.c b/kernel/sys.c index 8298878..6d1a7e0 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -36,6 +36,7 @@ #include #include #include +#include #include #include diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c index 8cd50d8..5903057 100644 --- a/kernel/sysctl_binary.c +++ b/kernel/sysctl_binary.c @@ -13,6 +13,7 @@ #include #include #include +#include #ifdef CONFIG_SYSCTL_SYSCALL diff --git a/kernel/taskstats.c b/kernel/taskstats.c index 899ca51..11281d5 100644 --- a/kernel/taskstats.c +++ b/kernel/taskstats.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include diff --git a/kernel/time.c b/kernel/time.c index 8047980..656dccf 100644 --- a/kernel/time.c +++ b/kernel/time.c @@ -35,7 +35,6 @@ #include #include #include -#include #include #include diff --git a/kernel/time/timecompare.c b/kernel/time/timecompare.c index 12f5c55..ac38fbb 100644 --- a/kernel/time/timecompare.c +++ b/kernel/time/timecompare.c @@ -19,6 +19,7 @@ #include #include +#include #include /* diff --git a/kernel/timer.c b/kernel/timer.c index fc965ea..aeb6a54 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c index 07f945a9..b3bc91a 100644 --- a/kernel/trace/blktrace.c +++ b/kernel/trace/blktrace.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index d9062f5..2404b59 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c index 9f4f565..a22582a 100644 --- a/kernel/trace/power-traces.c +++ b/kernel/trace/power-traces.c @@ -9,7 +9,6 @@ #include #include #include -#include #define CREATE_TRACE_POINTS #include diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index d1187ef..2c839ca 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 3ec2ee6..44f916a 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -33,10 +33,10 @@ #include #include #include +#include #include #include #include -#include #include #include "trace.h" diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index beab8bf..c697c70 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 4615f62..88c0b6d 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -22,6 +22,7 @@ #include #include #include +#include #include "trace.h" #include "trace_output.h" diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index e6989d9..9aed1a5 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include "trace.h" diff --git a/kernel/trace/trace_ksym.c b/kernel/trace/trace_ksym.c index 94103cd..d59cd68 100644 --- a/kernel/trace/trace_ksym.c +++ b/kernel/trace/trace_ksym.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include "trace_output.h" diff --git a/kernel/trace/trace_mmiotrace.c b/kernel/trace/trace_mmiotrace.c index 0acd834..017fa37 100644 --- a/kernel/trace/trace_mmiotrace.c +++ b/kernel/trace/trace_mmiotrace.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c index 280fea4..81003b4 100644 --- a/kernel/trace/trace_selftest.c +++ b/kernel/trace/trace_selftest.c @@ -3,6 +3,7 @@ #include #include #include +#include static inline int trace_valid_entry(struct trace_entry *entry) { diff --git a/kernel/trace/trace_stat.c b/kernel/trace/trace_stat.c index a4bb239..96cffb2 100644 --- a/kernel/trace/trace_stat.c +++ b/kernel/trace/trace_stat.c @@ -10,6 +10,7 @@ #include +#include #include #include #include "trace_stat.h" diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index 33c2a5b..4d6d711 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -1,5 +1,6 @@ #include #include +#include #include #include #include diff --git a/kernel/trace/trace_workqueue.c b/kernel/trace/trace_workqueue.c index 40cafb0..cc2d2fa 100644 --- a/kernel/trace/trace_workqueue.c +++ b/kernel/trace/trace_workqueue.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include "trace_stat.h" #include "trace.h" -- cgit v1.1 From 753649dbc49345a73a2454c770a3f2d54d11aec6 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 31 Mar 2010 13:30:19 +0200 Subject: genirq: Force MSI irq handlers to run with interrupts disabled Network folks reported that directing all MSI-X vectors of their multi queue NICs to a single core can cause interrupt stack overflows when enough interrupts fire at the same time. This is caused by the fact that we run interrupt handlers by default with interrupts enabled unless the driver reuqests the interrupt with the IRQF_DISABLED set. The NIC handlers do not set this flag, so simultaneous interrupts can nest unlimited and cause the stack overflow. The only safe counter measure is to run the interrupt handlers with interrupts disabled. We can't switch to this mode in general right now, but it is safe to do so for MSI interrupts. Force IRQF_DISABLED for MSI interrupt handlers. Signed-off-by: Thomas Gleixner Cc: Andi Kleen Cc: Linus Torvalds Cc: Andrew Morton Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Alan Cox Cc: David Miller Cc: Greg Kroah-Hartman Cc: Arnaldo Carvalho de Melo Cc: stable@kernel.org --- kernel/irq/manage.c | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 398fda1..704e488 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -757,6 +757,16 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) if (new->flags & IRQF_ONESHOT) desc->status |= IRQ_ONESHOT; + /* + * Force MSI interrupts to run with interrupts + * disabled. The multi vector cards can cause stack + * overflows due to nested interrupts when enough of + * them are directed to a core and fire at the same + * time. + */ + if (desc->msi_desc) + new->flags |= IRQF_DISABLED; + if (!(desc->status & IRQ_NOAUTOEN)) { desc->depth = 0; desc->status &= ~IRQ_DISABLED; -- cgit v1.1 From eb1e79611cc9bfe21978230e3521e77ea2d7874a Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 23 Mar 2010 00:08:59 +0100 Subject: perf: Correctly align perf event tracing buffer The trace event buffer used by perf to record raw sample events is typed as an array of char and may then not be aligned to 8 by alloc_percpu(). But we need it to be aligned to 8 in sparc64 because we cast this buffer into a random structure type built by the TRACE_EVENT() macro to store the traces. So if a random 64 bits field is accessed inside, it may be not under an expected good alignment. Use an array of long instead to force the appropriate alignment, and perform a compile time check to ensure the size in byte of the buffer is a multiple of sizeof(long) so that its actual size doesn't get shrinked under us. This fixes unaligned accesses reported while using perf lock in sparc 64. Suggested-by: David Miller Suggested-by: Tejun Heo Signed-off-by: Frederic Weisbecker Cc: Peter Zijlstra Cc: Arnaldo Carvalho de Melo Cc: Paul Mackerras Cc: Ingo Molnar Cc: David Miller Cc: Steven Rostedt --- kernel/trace/trace_event_perf.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index 81f691e..0565bb4 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -17,7 +17,12 @@ EXPORT_SYMBOL_GPL(perf_arch_fetch_caller_regs); static char *perf_trace_buf; static char *perf_trace_buf_nmi; -typedef typeof(char [PERF_MAX_TRACE_SIZE]) perf_trace_t ; +/* + * Force it to be aligned to unsigned long to avoid misaligned accesses + * suprises + */ +typedef typeof(unsigned long [PERF_MAX_TRACE_SIZE / sizeof(unsigned long)]) + perf_trace_t; /* Count the events in use (per event id, not per instance) */ static int total_ref_count; @@ -130,6 +135,8 @@ __kprobes void *perf_trace_buf_prepare(int size, unsigned short type, char *trace_buf, *raw_data; int pc, cpu; + BUILD_BUG_ON(PERF_MAX_TRACE_SIZE % sizeof(unsigned long)); + pc = preempt_count(); /* Protect the per cpu buffer, begin the rcu read side */ @@ -152,7 +159,7 @@ __kprobes void *perf_trace_buf_prepare(int size, unsigned short type, raw_data = per_cpu_ptr(trace_buf, cpu); /* zero the dead bytes from align to not leak stack to user */ - *(u64 *)(&raw_data[size - sizeof(u64)]) = 0ULL; + memset(&raw_data[size - sizeof(u64)], 0, sizeof(u64)); entry = (struct trace_entry *)raw_data; tracing_generic_entry_update(entry, *irq_flags, pc); -- cgit v1.1 From e49a5bd38159dfb1928fd25b173bc9de4bbadb21 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Mon, 22 Mar 2010 19:40:03 +0100 Subject: perf: Use hot regs with software sched switch/migrate events Scheduler's task migration events don't work because they always pass NULL regs perf_sw_event(). The event hence gets filtered in perf_swevent_add(). Scheduler's context switches events use task_pt_regs() to get the context when the event occured which is a wrong thing to do as this won't give us the place in the kernel where we went to sleep but the place where we left userspace. The result is even more wrong if we switch from a kernel thread. Use the hot regs snapshot for both events as they belong to the non-interrupt/exception based events family. Unlike page faults or so that provide the regs matching the exact origin of the event, we need to save the current context. This makes the task migration event working and fix the context switch callchains and origin ip. Example: perf record -a -e cs Before: 10.91% ksoftirqd/0 0 [k] 0000000000000000 | --- (nil) perf_callchain perf_prepare_sample __perf_event_overflow perf_swevent_overflow perf_swevent_add perf_swevent_ctx_event do_perf_sw_event __perf_sw_event perf_event_task_sched_out schedule run_ksoftirqd kthread kernel_thread_helper After: 23.77% hald-addon-stor [kernel.kallsyms] [k] schedule | --- schedule | |--60.00%-- schedule_timeout | wait_for_common | wait_for_completion | blk_execute_rq | scsi_execute | scsi_execute_req | sr_test_unit_ready | | | |--66.67%-- sr_media_change | | media_changed | | cdrom_media_changed | | sr_block_media_changed | | check_disk_change | | cdrom_open v2: Always build perf_arch_fetch_caller_regs() now that software events need that too. They don't need it from modules, unlike trace events, so we keep the EXPORT_SYMBOL in trace_event_perf.c Signed-off-by: Frederic Weisbecker Cc: Peter Zijlstra Cc: Arnaldo Carvalho de Melo Cc: Paul Mackerras Cc: Ingo Molnar Cc: David Miller --- kernel/perf_event.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index 574ee58..b0feb47 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -1164,11 +1164,9 @@ void perf_event_task_sched_out(struct task_struct *task, struct perf_event_context *ctx = task->perf_event_ctxp; struct perf_event_context *next_ctx; struct perf_event_context *parent; - struct pt_regs *regs; int do_switch = 1; - regs = task_pt_regs(task); - perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, regs, 0); + perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0); if (likely(!ctx || !cpuctx->task_ctx)) return; -- cgit v1.1 From 8bb39f9aa068262732fe44b965d7a6eb5a5a7d67 Mon Sep 17 00:00:00 2001 From: Mike Galbraith Date: Fri, 26 Mar 2010 11:11:33 +0100 Subject: perf: Fix 'perf sched record' deadlock perf sched record can deadlock a box should the holder of handle->data->lock take an interrupt, and then attempt to acquire an rq lock held by a CPU trying to acquire the same lock. Disable interrupts. CPU0 CPU1 sched event with rq->lock held grab handle->data->lock spin on handle->data->lock interrupt try to grab rq->lock Reported-by: Li Zefan Signed-off-by: Mike Galbraith Tested-by: Li Zefan Signed-off-by: Peter Zijlstra Cc: Arnaldo Carvalho de Melo Cc: Frederic Weisbecker LKML-Reference: <1269598293.6174.8.camel@marge.simson.net> Signed-off-by: Ingo Molnar --- kernel/perf_event.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index b0feb47..96aae13 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -3376,15 +3376,23 @@ static void perf_event_task_output(struct perf_event *event, struct perf_task_event *task_event) { struct perf_output_handle handle; - int size; struct task_struct *task = task_event->task; - int ret; + unsigned long flags; + int size, ret; + + /* + * If this CPU attempts to acquire an rq lock held by a CPU spinning + * in perf_output_lock() from interrupt context, it's game over. + */ + local_irq_save(flags); size = task_event->event_id.header.size; ret = perf_output_begin(&handle, event, size, 0, 0); - if (ret) + if (ret) { + local_irq_restore(flags); return; + } task_event->event_id.pid = perf_event_pid(event, task); task_event->event_id.ppid = perf_event_pid(event, current); @@ -3395,6 +3403,7 @@ static void perf_event_task_output(struct perf_event *event, perf_output_put(&handle, task_event->event_id); perf_output_end(&handle); + local_irq_restore(flags); } static int perf_event_task_match(struct perf_event *event) -- cgit v1.1 From 269484a492d9177072ee11ec8c9bff71d256837a Mon Sep 17 00:00:00 2001 From: Mike Galbraith Date: Tue, 30 Mar 2010 11:09:53 +0200 Subject: sched: Fix proc_sched_set_task() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Latencytop clearing sum_exec_runtime via proc_sched_set_task() breaks task_times(). Other places in kernel use nvcsw and nivcsw, which are being cleared as well, Clear task statistics only. Reported-by: Török Edwin Signed-off-by: Mike Galbraith Cc: Hidetoshi Seto Cc: Arjan van de Ven Signed-off-by: Peter Zijlstra LKML-Reference: <1269940193.19286.14.camel@marge.simson.net> Signed-off-by: Ingo Molnar --- kernel/sched_debug.c | 4 ---- 1 file changed, 4 deletions(-) (limited to 'kernel') diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c index 67f95aa..9b49db1 100644 --- a/kernel/sched_debug.c +++ b/kernel/sched_debug.c @@ -518,8 +518,4 @@ void proc_sched_set_task(struct task_struct *p) p->se.nr_wakeups_idle = 0; p->sched_info.bkl_count = 0; #endif - p->se.sum_exec_runtime = 0; - p->se.prev_sum_exec_runtime = 0; - p->nvcsw = 0; - p->nivcsw = 0; } -- cgit v1.1 From 47a70985e5c093ae03d8ccf633c70a93761d86f2 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 30 Mar 2010 18:58:29 +0200 Subject: sched: set_cpus_allowed_ptr(): Don't use rq->migration_thread after unlock Trivial typo fix. rq->migration_thread can be NULL after task_rq_unlock(), this is why we have "mt" which should be used instead. Signed-off-by: Oleg Nesterov Signed-off-by: Peter Zijlstra LKML-Reference: <20100330165829.GA18284@redhat.com> Signed-off-by: Ingo Molnar --- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 49d2fa7..528a105 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -5387,7 +5387,7 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask) get_task_struct(mt); task_rq_unlock(rq, &flags); - wake_up_process(rq->migration_thread); + wake_up_process(mt); put_task_struct(mt); wait_for_completion(&req.done); tlb_migrate_finish(p->mm); -- cgit v1.1 From a0279bd58060ccedbd414edf97d50cfa3778c370 Mon Sep 17 00:00:00 2001 From: Jason Wessel Date: Fri, 2 Apr 2010 11:33:29 -0500 Subject: kgdb: have ebin2mem call probe_kernel_write once Rather than call probe_kernel_write() one byte at a time, process the whole buffer locally and pass the entire result in one go. This way, architectures that need to do special handling based on the length can do so, or we only end up calling memcpy() once. [sonic.zhang@analog.com: Reported original problem and preliminary patch] Signed-off-by: Jason Wessel Signed-off-by: Sonic Zhang Signed-off-by: Mike Frysinger --- kernel/kgdb.c | 23 +++++++++-------------- 1 file changed, 9 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/kgdb.c b/kernel/kgdb.c index 761fdd2..42fd128 100644 --- a/kernel/kgdb.c +++ b/kernel/kgdb.c @@ -391,27 +391,22 @@ int kgdb_mem2hex(char *mem, char *buf, int count) /* * Copy the binary array pointed to by buf into mem. Fix $, #, and - * 0x7d escaped with 0x7d. Return a pointer to the character after - * the last byte written. + * 0x7d escaped with 0x7d. Return -EFAULT on failure or 0 on success. + * The input buf is overwitten with the result to write to mem. */ static int kgdb_ebin2mem(char *buf, char *mem, int count) { - int err = 0; - char c; + int size = 0; + char *c = buf; while (count-- > 0) { - c = *buf++; - if (c == 0x7d) - c = *buf++ ^ 0x20; - - err = probe_kernel_write(mem, &c, 1); - if (err) - break; - - mem++; + c[size] = *buf++; + if (c[size] == 0x7d) + c[size] = *buf++ ^ 0x20; + size++; } - return err; + return probe_kernel_write(mem, c, size); } /* -- cgit v1.1 From 62fae312197a8fbcd3727261d59f5a6bd0dbf158 Mon Sep 17 00:00:00 2001 From: Jason Wessel Date: Fri, 2 Apr 2010 11:47:02 -0500 Subject: kgdb: eliminate kgdb_wait(), all cpus enter the same way This is a kgdb architectural change to have all the cpus (master or slave) enter the same function. A cpu that hits an exception (wants to be the master cpu) will call kgdb_handle_exception() from the trap handler and then invoke a kgdb_roundup_cpu() to synchronize the other cpus and bring them into the kgdb_handle_exception() as well. A slave cpu will enter kgdb_handle_exception() from the kgdb_nmicallback() and set the exception state to note that the processor is a slave. Previously the salve cpu would have called kgdb_wait(). This change allows the debug core to change cpus without resuming the system in order to inspect arch specific cpu information. Signed-off-by: Jason Wessel --- kernel/kgdb.c | 165 +++++++++++++++++++++++++++++----------------------------- 1 file changed, 82 insertions(+), 83 deletions(-) (limited to 'kernel') diff --git a/kernel/kgdb.c b/kernel/kgdb.c index 42fd128..6882c04 100644 --- a/kernel/kgdb.c +++ b/kernel/kgdb.c @@ -69,9 +69,16 @@ struct kgdb_state { struct pt_regs *linux_regs; }; +/* Exception state values */ +#define DCPU_WANT_MASTER 0x1 /* Waiting to become a master kgdb cpu */ +#define DCPU_NEXT_MASTER 0x2 /* Transition from one master cpu to another */ +#define DCPU_IS_SLAVE 0x4 /* Slave cpu enter exception */ +#define DCPU_SSTEP 0x8 /* CPU is single stepping */ + static struct debuggerinfo_struct { void *debuggerinfo; struct task_struct *task; + int exception_state; } kgdb_info[NR_CPUS]; /** @@ -558,49 +565,6 @@ static struct task_struct *getthread(struct pt_regs *regs, int tid) } /* - * CPU debug state control: - */ - -#ifdef CONFIG_SMP -static void kgdb_wait(struct pt_regs *regs) -{ - unsigned long flags; - int cpu; - - local_irq_save(flags); - cpu = raw_smp_processor_id(); - kgdb_info[cpu].debuggerinfo = regs; - kgdb_info[cpu].task = current; - /* - * Make sure the above info reaches the primary CPU before - * our cpu_in_kgdb[] flag setting does: - */ - smp_wmb(); - atomic_set(&cpu_in_kgdb[cpu], 1); - - /* Disable any cpu specific hw breakpoints */ - kgdb_disable_hw_debug(regs); - - /* Wait till primary CPU is done with debugging */ - while (atomic_read(&passive_cpu_wait[cpu])) - cpu_relax(); - - kgdb_info[cpu].debuggerinfo = NULL; - kgdb_info[cpu].task = NULL; - - /* fix up hardware debug registers on local cpu */ - if (arch_kgdb_ops.correct_hw_break) - arch_kgdb_ops.correct_hw_break(); - - /* Signal the primary CPU that we are done: */ - atomic_set(&cpu_in_kgdb[cpu], 0); - touch_softlockup_watchdog_sync(); - clocksource_touch_watchdog(); - local_irq_restore(flags); -} -#endif - -/* * Some architectures need cache flushes when we set/clear a * breakpoint: */ @@ -1395,34 +1359,12 @@ static int kgdb_reenter_check(struct kgdb_state *ks) return 1; } -/* - * kgdb_handle_exception() - main entry point from a kernel exception - * - * Locking hierarchy: - * interface locks, if any (begin_session) - * kgdb lock (kgdb_active) - */ -int -kgdb_handle_exception(int evector, int signo, int ecode, struct pt_regs *regs) +static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs) { - struct kgdb_state kgdb_var; - struct kgdb_state *ks = &kgdb_var; unsigned long flags; int sstep_tries = 100; int error = 0; int i, cpu; - - ks->cpu = raw_smp_processor_id(); - ks->ex_vector = evector; - ks->signo = signo; - ks->ex_vector = evector; - ks->err_code = ecode; - ks->kgdb_usethreadid = 0; - ks->linux_regs = regs; - - if (kgdb_reenter_check(ks)) - return 0; /* Ouch, double exception ! */ - acquirelock: /* * Interrupts will be restored by the 'trap return' code, except when @@ -1430,13 +1372,42 @@ acquirelock: */ local_irq_save(flags); - cpu = raw_smp_processor_id(); + cpu = ks->cpu; + kgdb_info[cpu].debuggerinfo = regs; + kgdb_info[cpu].task = current; + /* + * Make sure the above info reaches the primary CPU before + * our cpu_in_kgdb[] flag setting does: + */ + smp_wmb(); + atomic_set(&cpu_in_kgdb[cpu], 1); /* - * Acquire the kgdb_active lock: + * CPU will loop if it is a slave or request to become a kgdb + * master cpu and acquire the kgdb_active lock: */ - while (atomic_cmpxchg(&kgdb_active, -1, cpu) != -1) + while (1) { + if (kgdb_info[cpu].exception_state & DCPU_WANT_MASTER) { + if (atomic_cmpxchg(&kgdb_active, -1, cpu) == cpu) + break; + } else if (kgdb_info[cpu].exception_state & DCPU_IS_SLAVE) { + if (!atomic_read(&passive_cpu_wait[cpu])) + goto return_normal; + } else { +return_normal: + /* Return to normal operation by executing any + * hw breakpoint fixup. + */ + if (arch_kgdb_ops.correct_hw_break) + arch_kgdb_ops.correct_hw_break(); + atomic_set(&cpu_in_kgdb[cpu], 0); + touch_softlockup_watchdog_sync(); + clocksource_touch_watchdog(); + local_irq_restore(flags); + return 0; + } cpu_relax(); + } /* * For single stepping, try to only enter on the processor @@ -1470,9 +1441,6 @@ acquirelock: if (kgdb_io_ops->pre_exception) kgdb_io_ops->pre_exception(); - kgdb_info[ks->cpu].debuggerinfo = ks->linux_regs; - kgdb_info[ks->cpu].task = current; - kgdb_disable_hw_debug(ks->linux_regs); /* @@ -1484,12 +1452,6 @@ acquirelock: atomic_set(&passive_cpu_wait[i], 1); } - /* - * spin_lock code is good enough as a barrier so we don't - * need one here: - */ - atomic_set(&cpu_in_kgdb[ks->cpu], 1); - #ifdef CONFIG_SMP /* Signal the other CPUs to enter kgdb_wait() */ if ((!kgdb_single_step) && kgdb_do_roundup) @@ -1521,8 +1483,6 @@ acquirelock: if (kgdb_io_ops->post_exception) kgdb_io_ops->post_exception(); - kgdb_info[ks->cpu].debuggerinfo = NULL; - kgdb_info[ks->cpu].task = NULL; atomic_set(&cpu_in_kgdb[ks->cpu], 0); if (!kgdb_single_step) { @@ -1555,13 +1515,52 @@ kgdb_restore: return error; } +/* + * kgdb_handle_exception() - main entry point from a kernel exception + * + * Locking hierarchy: + * interface locks, if any (begin_session) + * kgdb lock (kgdb_active) + */ +int +kgdb_handle_exception(int evector, int signo, int ecode, struct pt_regs *regs) +{ + struct kgdb_state kgdb_var; + struct kgdb_state *ks = &kgdb_var; + int ret; + + ks->cpu = raw_smp_processor_id(); + ks->ex_vector = evector; + ks->signo = signo; + ks->ex_vector = evector; + ks->err_code = ecode; + ks->kgdb_usethreadid = 0; + ks->linux_regs = regs; + + if (kgdb_reenter_check(ks)) + return 0; /* Ouch, double exception ! */ + kgdb_info[ks->cpu].exception_state |= DCPU_WANT_MASTER; + ret = kgdb_cpu_enter(ks, regs); + kgdb_info[ks->cpu].exception_state &= ~DCPU_WANT_MASTER; + return ret; +} + int kgdb_nmicallback(int cpu, void *regs) { #ifdef CONFIG_SMP + struct kgdb_state kgdb_var; + struct kgdb_state *ks = &kgdb_var; + + memset(ks, 0, sizeof(struct kgdb_state)); + ks->cpu = cpu; + ks->linux_regs = regs; + if (!atomic_read(&cpu_in_kgdb[cpu]) && - atomic_read(&kgdb_active) != cpu && - atomic_read(&cpu_in_kgdb[atomic_read(&kgdb_active)])) { - kgdb_wait((struct pt_regs *)regs); + atomic_read(&kgdb_active) != -1 && + atomic_read(&kgdb_active) != cpu) { + kgdb_info[cpu].exception_state |= DCPU_IS_SLAVE; + kgdb_cpu_enter(ks, regs); + kgdb_info[cpu].exception_state &= ~DCPU_IS_SLAVE; return 0; } #endif -- cgit v1.1 From ae6bf53e0255c8ab04b6fe31806e318432570e3c Mon Sep 17 00:00:00 2001 From: Jason Wessel Date: Fri, 2 Apr 2010 14:58:18 -0500 Subject: kgdb: use atomic_inc and atomic_dec instead of atomic_set Memory barriers should be used for the kgdb cpu synchronization. The atomic_set() does not imply a memory barrier. Reported-by: Will Deacon Signed-off-by: Jason Wessel --- kernel/kgdb.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/kgdb.c b/kernel/kgdb.c index 6882c04..2f7f454 100644 --- a/kernel/kgdb.c +++ b/kernel/kgdb.c @@ -1379,8 +1379,7 @@ acquirelock: * Make sure the above info reaches the primary CPU before * our cpu_in_kgdb[] flag setting does: */ - smp_wmb(); - atomic_set(&cpu_in_kgdb[cpu], 1); + atomic_inc(&cpu_in_kgdb[cpu]); /* * CPU will loop if it is a slave or request to become a kgdb @@ -1400,7 +1399,7 @@ return_normal: */ if (arch_kgdb_ops.correct_hw_break) arch_kgdb_ops.correct_hw_break(); - atomic_set(&cpu_in_kgdb[cpu], 0); + atomic_dec(&cpu_in_kgdb[cpu]); touch_softlockup_watchdog_sync(); clocksource_touch_watchdog(); local_irq_restore(flags); @@ -1449,7 +1448,7 @@ return_normal: */ if (!kgdb_single_step) { for (i = 0; i < NR_CPUS; i++) - atomic_set(&passive_cpu_wait[i], 1); + atomic_inc(&passive_cpu_wait[i]); } #ifdef CONFIG_SMP @@ -1483,11 +1482,11 @@ return_normal: if (kgdb_io_ops->post_exception) kgdb_io_ops->post_exception(); - atomic_set(&cpu_in_kgdb[ks->cpu], 0); + atomic_dec(&cpu_in_kgdb[ks->cpu]); if (!kgdb_single_step) { for (i = NR_CPUS-1; i >= 0; i--) - atomic_set(&passive_cpu_wait[i], 0); + atomic_dec(&passive_cpu_wait[i]); /* * Wait till all the CPUs have quit * from the debugger. @@ -1736,11 +1735,11 @@ EXPORT_SYMBOL_GPL(kgdb_unregister_io_module); */ void kgdb_breakpoint(void) { - atomic_set(&kgdb_setting_breakpoint, 1); + atomic_inc(&kgdb_setting_breakpoint); wmb(); /* Sync point before breakpoint */ arch_kgdb_breakpoint(); wmb(); /* Sync point after breakpoint */ - atomic_set(&kgdb_setting_breakpoint, 0); + atomic_dec(&kgdb_setting_breakpoint); } EXPORT_SYMBOL_GPL(kgdb_breakpoint); -- cgit v1.1 From 4da75b9ceac6939cd76830ec9581bef5bb398ad3 Mon Sep 17 00:00:00 2001 From: Jason Wessel Date: Fri, 2 Apr 2010 11:57:18 -0500 Subject: kgdb: Turn off tracing while in the debugger The kernel debugger should turn off kernel tracing any time the debugger is active and restore it on resume. Signed-off-by: Jason Wessel Reviewed-by: Steven Rostedt --- kernel/kgdb.c | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'kernel') diff --git a/kernel/kgdb.c b/kernel/kgdb.c index 2f7f454..11f3515 100644 --- a/kernel/kgdb.c +++ b/kernel/kgdb.c @@ -1365,6 +1365,7 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs) int sstep_tries = 100; int error = 0; int i, cpu; + int trace_on = 0; acquirelock: /* * Interrupts will be restored by the 'trap return' code, except when @@ -1399,6 +1400,8 @@ return_normal: */ if (arch_kgdb_ops.correct_hw_break) arch_kgdb_ops.correct_hw_break(); + if (trace_on) + tracing_on(); atomic_dec(&cpu_in_kgdb[cpu]); touch_softlockup_watchdog_sync(); clocksource_touch_watchdog(); @@ -1474,6 +1477,9 @@ return_normal: kgdb_single_step = 0; kgdb_contthread = current; exception_level = 0; + trace_on = tracing_is_on(); + if (trace_on) + tracing_off(); /* Talk to debugger with gdbserial protocol */ error = gdb_serial_stub(ks); @@ -1505,6 +1511,8 @@ kgdb_restore: else kgdb_sstep_pid = 0; } + if (trace_on) + tracing_on(); /* Free kgdb_active */ atomic_set(&kgdb_active, -1); touch_softlockup_watchdog_sync(); -- cgit v1.1 From 26d80aa782e708c380a47601779d42d30bf016d6 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Sat, 3 Apr 2010 12:22:05 +0200 Subject: perf: Always build the stub perf_arch_fetch_caller_regs version Now that software events use perf_arch_fetch_caller_regs() too, we need the stub version to be always built in for archs that don't implement it. Fixes the following build error in PARISC: kernel/built-in.o: In function `perf_event_task_sched_out': (.text.perf_event_task_sched_out+0x54): undefined reference to `perf_arch_fetch_caller_regs' Reported-by: Ingo Molnar Signed-off-by: Frederic Weisbecker Cc: Peter Zijlstra Cc: Arnaldo Carvalho de Melo Cc: Paul Mackerras --- kernel/perf_event.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/perf_event.c b/kernel/perf_event.c index 96aae13..681af80 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -2784,12 +2784,11 @@ __weak struct perf_callchain_entry *perf_callchain(struct pt_regs *regs) return NULL; } -#ifdef CONFIG_EVENT_TRACING __weak void perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned long ip, int skip) { } -#endif + /* * Output -- cgit v1.1 From 449cedf099b23a250e7d61982e35555ccb871182 Mon Sep 17 00:00:00 2001 From: Eric Paris Date: Mon, 5 Apr 2010 16:16:26 -0400 Subject: audit: preface audit printk with audit There have been a number of reports of people seeing the message: "name_count maxed, losing inode data: dev=00:05, inode=3185" in dmesg. These usually lead to people reporting problems to the filesystem group who are in turn clueless what they mean. Eventually someone finds me and I explain what is going on and that these come from the audit system. The basics of the problem is that the audit subsystem never expects a single syscall to 'interact' (for some wish washy meaning of interact) with more than 20 inodes. But in fact some operations like loading kernel modules can cause changes to lots of inodes in debugfs. There are a couple real fixes being bandied about including removing the fixed compile time limit of 20 or not auditing changes in debugfs (or both) but neither are small and obvious so I am not sending them for immediate inclusion (I hope Al forwards a real solution next devel window). In the meantime this patch simply adds 'audit' to the beginning of the crap message so if a user sees it, they come blame me first and we can talk about what it means and make sure we understand all of the reasons it can happen and make sure this gets solved correctly in the long run. Signed-off-by: Eric Paris Signed-off-by: Linus Torvalds --- kernel/auditsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 97a3cef..3828ad5 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1894,7 +1894,7 @@ static int audit_inc_name_count(struct audit_context *context, { if (context->name_count >= AUDIT_NAMES) { if (inode) - printk(KERN_DEBUG "name_count maxed, losing inode data: " + printk(KERN_DEBUG "audit: name_count maxed, losing inode data: " "dev=%02x:%02x, inode=%lu\n", MAJOR(inode->i_sb->s_dev), MINOR(inode->i_sb->s_dev), -- cgit v1.1 From 5fbfb18d7a5b846946d52c4a10e3aaa213ec31b6 Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Thu, 1 Apr 2010 19:09:40 +1100 Subject: Fix up possibly racy module refcounting Module refcounting is implemented with a per-cpu counter for speed. However there is a race when tallying the counter where a reference may be taken by one CPU and released by another. Reference count summation may then see the decrement without having seen the previous increment, leading to lower than expected count. A module which never has its actual reference drop below 1 may return a reference count of 0 due to this race. Module removal generally runs under stop_machine, which prevents this race causing bugs due to removal of in-use modules. However there are other real bugs in module.c code and driver code (module_refcount is exported) where the callers do not run under stop_machine. Fix this by maintaining running per-cpu counters for the number of module refcount increments and the number of refcount decrements. The increments are tallied after the decrements, so any decrement seen will always have its corresponding increment counted. The final refcount is the difference of the total increments and decrements, preventing a low-refcount from being returned. Signed-off-by: Nick Piggin Acked-by: Rusty Russell Signed-off-by: Linus Torvalds --- kernel/module.c | 35 +++++++++++++++++++++++++++-------- 1 file changed, 27 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 9f8d23d..1016b75 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -521,11 +521,13 @@ static void module_unload_init(struct module *mod) int cpu; INIT_LIST_HEAD(&mod->modules_which_use_me); - for_each_possible_cpu(cpu) - per_cpu_ptr(mod->refptr, cpu)->count = 0; + for_each_possible_cpu(cpu) { + per_cpu_ptr(mod->refptr, cpu)->incs = 0; + per_cpu_ptr(mod->refptr, cpu)->decs = 0; + } /* Hold reference count during initialization. */ - __this_cpu_write(mod->refptr->count, 1); + __this_cpu_write(mod->refptr->incs, 1); /* Backwards compatibility macros put refcount during init. */ mod->waiter = current; } @@ -664,12 +666,28 @@ static int try_stop_module(struct module *mod, int flags, int *forced) unsigned int module_refcount(struct module *mod) { - unsigned int total = 0; + unsigned int incs = 0, decs = 0; int cpu; for_each_possible_cpu(cpu) - total += per_cpu_ptr(mod->refptr, cpu)->count; - return total; + decs += per_cpu_ptr(mod->refptr, cpu)->decs; + /* + * ensure the incs are added up after the decs. + * module_put ensures incs are visible before decs with smp_wmb. + * + * This 2-count scheme avoids the situation where the refcount + * for CPU0 is read, then CPU0 increments the module refcount, + * then CPU1 drops that refcount, then the refcount for CPU1 is + * read. We would record a decrement but not its corresponding + * increment so we would see a low count (disaster). + * + * Rare situation? But module_refcount can be preempted, and we + * might be tallying up 4096+ CPUs. So it is not impossible. + */ + smp_rmb(); + for_each_possible_cpu(cpu) + incs += per_cpu_ptr(mod->refptr, cpu)->incs; + return incs - decs; } EXPORT_SYMBOL(module_refcount); @@ -846,10 +864,11 @@ void module_put(struct module *module) { if (module) { preempt_disable(); - __this_cpu_dec(module->refptr->count); + smp_wmb(); /* see comment in module_refcount */ + __this_cpu_inc(module->refptr->decs); trace_module_put(module, _RET_IP_, - __this_cpu_read(module->refptr->count)); + __this_cpu_read(module->refptr->decs)); /* Maybe they're waiting for us to drop reference? */ if (unlikely(!module_is_live(module))) wake_up_process(module->waiter); -- cgit v1.1 From 84fba5ec91f11c0efb27d0ed6098f7447491f0df Mon Sep 17 00:00:00 2001 From: Anton Blanchard Date: Tue, 6 Apr 2010 17:02:19 +1000 Subject: sched: Fix sched_getaffinity() taskset on 2.6.34-rc3 fails on one of my ppc64 test boxes with the following error: sched_getaffinity(0, 16, 0x10029650030) = -1 EINVAL (Invalid argument) This box has 128 threads and 16 bytes is enough to cover it. Commit cd3d8031eb4311e516329aee03c79a08333141f1 (sched: sched_getaffinity(): Allow less than NR_CPUS length) is comparing this 16 bytes agains nr_cpu_ids. Fix it by comparing nr_cpu_ids to the number of bits in the cpumask we pass in. Signed-off-by: Anton Blanchard Reviewed-by: KOSAKI Motohiro Cc: Sharyathi Nagesh Cc: Ulrich Drepper Cc: Peter Zijlstra Cc: Linus Torvalds Cc: Jack Steiner Cc: Russ Anderson Cc: Mike Travis LKML-Reference: <20100406070218.GM5594@kryten> Signed-off-by: Ingo Molnar --- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 528a105..eaf5c73 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4902,7 +4902,7 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len, int ret; cpumask_var_t mask; - if (len < nr_cpu_ids) + if ((len * BITS_PER_BYTE) < nr_cpu_ids) return -EINVAL; if (len & (sizeof(unsigned long)-1)) return -EINVAL; -- cgit v1.1 From a3a2e76c77fa22b114e421ac11dec0c56c3503fb Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Tue, 6 Apr 2010 14:34:42 -0700 Subject: mm: avoid null-pointer deref in sync_mm_rss() - We weren't zeroing p->rss_stat[] at fork() - Consequently sync_mm_rss() was dereferencing tsk->mm for kernel threads and was oopsing. - Make __sync_task_rss_stat() static, too. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648 [akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)] Reported-by: Troels Liebe Bentsen Signed-off-by: KAMEZAWA Hiroyuki "Michael S. Tsirkin" Cc: Andrea Arcangeli Cc: Rik van Riel Cc: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 3 ++- kernel/fork.c | 3 +++ 2 files changed, 5 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index cce59cb..7f2683a 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -953,7 +953,8 @@ NORET_TYPE void do_exit(long code) acct_update_integrals(tsk); /* sync mm's RSS info before statistics gathering */ - sync_mm_rss(tsk, tsk->mm); + if (tsk->mm) + sync_mm_rss(tsk, tsk->mm); group_dead = atomic_dec_and_test(&tsk->signal->live); if (group_dead) { hrtimer_cancel(&tsk->signal->real_timer); diff --git a/kernel/fork.c b/kernel/fork.c index 4799c5f..44b0791 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1052,6 +1052,9 @@ static struct task_struct *copy_process(unsigned long clone_flags, p->prev_utime = cputime_zero; p->prev_stime = cputime_zero; #endif +#if defined(SPLIT_RSS_COUNTING) + memset(&p->rss_stat, 0, sizeof(p->rss_stat)); +#endif p->default_timer_slack_ns = current->timer_slack_ns; -- cgit v1.1 From d88d4050dcaf09e417aaa9a5024dd9449ef71b2e Mon Sep 17 00:00:00 2001 From: Jiri Slaby Date: Sat, 10 Apr 2010 22:28:56 +0200 Subject: PM / Hibernate: user.c, fix SNAPSHOT_SET_SWAP_AREA handling When CONFIG_DEBUG_BLOCK_EXT_DEVT is set we decode the device improperly by old_decode_dev and it results in an error while hibernating with s2disk. All users already pass the new device number, so switch to new_decode_dev(). Signed-off-by: Jiri Slaby Reported-and-tested-by: Jiri Kosina Signed-off-by: "Rafael J. Wysocki" --- kernel/power/user.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/power/user.c b/kernel/power/user.c index 4d22896..a8c9621 100644 --- a/kernel/power/user.c +++ b/kernel/power/user.c @@ -420,7 +420,7 @@ static long snapshot_ioctl(struct file *filp, unsigned int cmd, * User space encodes device types as two-byte values, * so we need to recode them */ - swdev = old_decode_dev(swap_area.dev); + swdev = new_decode_dev(swap_area.dev); if (swdev) { offset = swap_area.offset; data->swap = swap_type_of(swdev, offset, NULL); -- cgit v1.1 From bc293d62b26ec590afc90a9e0a31c45d355b7bd8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 15 Apr 2010 12:50:39 -0700 Subject: rcu: Make RCU lockdep check the lockdep_recursion variable The lockdep facility temporarily disables lockdep checking by incrementing the current->lockdep_recursion variable. Such disabling happens in NMIs and in other situations where lockdep might expect to recurse on itself. This patch therefore checks current->lockdep_recursion, disabling RCU lockdep splats when this variable is non-zero. In addition, this patch removes the "likely()", as suggested by Lai Jiangshan. Reported-by: Frederic Weisbecker Reported-by: David Miller Tested-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: eric.dumazet@gmail.com LKML-Reference: <20100415195039.GA22623@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar --- kernel/rcupdate.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c index 63fe254..03a7ea1 100644 --- a/kernel/rcupdate.c +++ b/kernel/rcupdate.c @@ -69,6 +69,13 @@ EXPORT_SYMBOL_GPL(rcu_scheduler_active); #ifdef CONFIG_DEBUG_LOCK_ALLOC +int debug_lockdep_rcu_enabled(void) +{ + return rcu_scheduler_active && debug_locks && + current->lockdep_recursion == 0; +} +EXPORT_SYMBOL_GPL(debug_lockdep_rcu_enabled); + /** * rcu_read_lock_bh_held - might we be in RCU-bh read-side critical section? * -- cgit v1.1 From eff30363c0b8b057f773108589bfd8881659fe74 Mon Sep 17 00:00:00 2001 From: David Howells Date: Tue, 20 Apr 2010 22:41:18 +0100 Subject: CRED: Fix double free in prepare_usermodehelper_creds() error handling Patch 570b8fb505896e007fd3bb07573ba6640e51851d: Author: Mathieu Desnoyers Date: Tue Mar 30 00:04:00 2010 +0100 Subject: CRED: Fix memory leak in error handling attempts to fix a memory leak in the error handling by making the offending return statement into a jump down to the bottom of the function where a kfree(tgcred) is inserted. This is, however, incorrect, as it does a kfree() after doing put_cred() if security_prepare_creds() fails. That will result in a double free if 'error' is jumped to as put_cred() will also attempt to free the new tgcred record by virtue of it being pointed to by the new cred record. Signed-off-by: David Howells Signed-off-by: James Morris --- kernel/cred.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/cred.c b/kernel/cred.c index e1dbe9e..ce1a52b 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -398,6 +398,8 @@ struct cred *prepare_usermodehelper_creds(void) error: put_cred(new); + return NULL; + free_tgcred: #ifdef CONFIG_KEYS kfree(tgcred); -- cgit v1.1 From e134d200d57d43b171dcb0b55c178a1a0c7db14a Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 21 Apr 2010 10:28:25 +0100 Subject: CRED: Fix a race in creds_are_invalid() in credentials debugging creds_are_invalid() reads both cred->usage and cred->subscribers and then compares them to make sure the number of processes subscribed to a cred struct never exceeds the refcount of that cred struct. The problem is that this can cause a race with both copy_creds() and exit_creds() as the two counters, whilst they are of atomic_t type, are only atomic with respect to themselves, and not atomic with respect to each other. This means that if creds_are_invalid() can read the values on one CPU whilst they're being modified on another CPU, and so can observe an evolving state in which the subscribers count now is greater than the usage count a moment before. Switching the order in which the counts are read cannot help, so the thing to do is to remove that particular check. I had considered rechecking the values to see if they're in flux if the test fails, but I can't guarantee they won't appear the same, even if they've changed several times in the meantime. Note that this can only happen if CONFIG_DEBUG_CREDENTIALS is enabled. The problem is only likely to occur with multithreaded programs, and can be tested by the tst-eintr1 program from glibc's "make check". The symptoms look like: CRED: Invalid credentials CRED: At include/linux/cred.h:240 CRED: Specified credentials: ffff88003dda5878 [real][eff] CRED: ->magic=43736564, put_addr=(null) CRED: ->usage=766, subscr=766 CRED: ->*uid = { 0,0,0,0 } CRED: ->*gid = { 0,0,0,0 } CRED: ->security is ffff88003d72f538 CRED: ->security {359, 359} ------------[ cut here ]------------ kernel BUG at kernel/cred.c:850! ... RIP: 0010:[] [] __invalid_creds+0x4e/0x52 ... Call Trace: [] copy_creds+0x6b/0x23f Note the ->usage=766 and subscr=766. The values appear the same because they've been re-read since the check was made. Reported-by: Roland McGrath Signed-off-by: David Howells Signed-off-by: James Morris --- kernel/cred.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/cred.c b/kernel/cred.c index ce1a52b..62af181 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -793,8 +793,6 @@ bool creds_are_invalid(const struct cred *cred) { if (cred->magic != CRED_MAGIC) return true; - if (atomic_read(&cred->usage) < atomic_read(&cred->subscribers)) - return true; #ifdef CONFIG_SECURITY_SELINUX if (selinux_is_enabled()) { if ((unsigned long) cred->security < PAGE_SIZE) -- cgit v1.1