aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/exit.c
Commit message (Collapse)AuthorAgeFilesLines
* sched: Fix ancient race in do_exit()Yasunori Goto2012-10-021-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b5740f4b2cb3503b436925eb2242bc3d75cd3dfe upstream. try_to_wake_up() has a problem which may change status from TASK_DEAD to TASK_RUNNING in race condition with SMI or guest environment of virtual machine. As a result, exited task is scheduled() again and panic occurs. Here is the sequence how it occurs: ----------------------------------+----------------------------- | CPU A | CPU B ----------------------------------+----------------------------- TASK A calls exit().... do_exit() exit_mm() down_read(mm->mmap_sem); rwsem_down_failed_common() set TASK_UNINTERRUPTIBLE set waiter.task <= task A list_add to sem->wait_list : raw_spin_unlock_irq() (I/O interruption occured) __rwsem_do_wake(mmap_sem) list_del(&waiter->list); waiter->task = NULL wake_up_process(task A) try_to_wake_up() (task is still TASK_UNINTERRUPTIBLE) p->on_rq is still 1.) ttwu_do_wakeup() (*A) : (I/O interruption handler finished) if (!waiter.task) schedule() is not called due to waiter.task is NULL. tsk->state = TASK_RUNNING : check_preempt_curr(); : task->state = TASK_DEAD (*B) <--- set TASK_RUNNING (*C) schedule() (exit task is running again) BUG_ON() is called! -------------------------------------------------------- The execution time between (*A) and (*B) is usually very short, because the interruption is disabled, and setting TASK_RUNNING at (*C) must be executed before setting TASK_DEAD. HOWEVER, if SMI is interrupted between (*A) and (*B), (*C) is able to execute AFTER setting TASK_DEAD! Then, exited task is scheduled again, and BUG_ON() is called.... If the system works on guest system of virtual machine, the time between (*A) and (*B) may be also long due to scheduling of hypervisor, and same phenomenon can occur. By this patch, do_exit() waits for releasing task->pi_lock which is used in try_to_wake_up(). It guarantees the task becomes TASK_DEAD after waking up. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@jp.fujitsu.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE raceOleg Nesterov2012-01-061-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 50b8d257486a45cba7b65ca978986ed216bbcc10 upstream. Test-case: int main(void) { int pid, status; pid = fork(); if (!pid) { for (;;) { if (!fork()) return 0; if (waitpid(-1, &status, 0) < 0) { printf("ERR!! wait: %m\n"); return 0; } } } assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0); assert(waitpid(-1, NULL, 0) == pid); assert(ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACEFORK) == 0); do { ptrace(PTRACE_CONT, pid, 0, 0); pid = waitpid(-1, NULL, 0); } while (pid > 0); return 1; } It fails because ->real_parent sees its child in EXIT_DEAD state while the tracer is going to change the state back to EXIT_ZOMBIE in wait_task_zombie(). The offending commit is 823b018e which moved the EXIT_DEAD check, but in fact we should not blame it. The original code was not correct as well because it didn't take ptrace_reparented() into account and because we can't really trust ->ptrace. This patch adds the additional check to close this particular race but it doesn't solve the whole problem. We simply can't rely on ->ptrace in this case, it can be cleared if the tracer is multithreaded by the exiting ->parent. I think we should kill EXIT_DEAD altogether, we should always remove the soon-to-be-reaped child from ->children or at least we should never do the DEAD->ZOMBIE transition. But this is too complex for 3.2. Reported-and-tested-by: Denys Vlasenko <vda.linux@googlemail.com> Tested-by: Lukasz Michalik <lmi@ift.uni.wroc.pl> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* memcg: clear mm->owner when last possible owner leavesKAMEZAWA Hiroyuki2011-06-151-16/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following crash was reported: > Call Trace: > [<ffffffff81139792>] mem_cgroup_from_task+0x15/0x17 > [<ffffffff8113a75a>] __mem_cgroup_try_charge+0x148/0x4b4 > [<ffffffff810493f3>] ? need_resched+0x23/0x2d > [<ffffffff814cbf43>] ? preempt_schedule+0x46/0x4f > [<ffffffff8113afe8>] mem_cgroup_charge_common+0x9a/0xce > [<ffffffff8113b6d1>] mem_cgroup_newpage_charge+0x5d/0x5f > [<ffffffff81134024>] khugepaged+0x5da/0xfaf > [<ffffffff81078ea0>] ? __init_waitqueue_head+0x4b/0x4b > [<ffffffff81133a4a>] ? add_mm_counter.constprop.5+0x13/0x13 > [<ffffffff81078625>] kthread+0xa8/0xb0 > [<ffffffff814d13e8>] ? sub_preempt_count+0xa1/0xb4 > [<ffffffff814d5664>] kernel_thread_helper+0x4/0x10 > [<ffffffff814ce858>] ? retint_restore_args+0x13/0x13 > [<ffffffff8107857d>] ? __init_kthread_worker+0x5a/0x5a What happens is that khugepaged tries to charge a huge page against an mm whose last possible owner has already exited, and the memory controller crashes when the stale mm->owner is used to look up the cgroup to charge. mm->owner has never been set to NULL with the last owner going away, but nobody cared until khugepaged came along. Even then it wasn't a problem because the final mmput() on an mm was forced to acquire and release mmap_sem in write-mode, preventing an exiting owner to go away while the mmap_sem was held, and until "692e0b3 mm: thp: optimize memcg charge in khugepaged", the memory cgroup charge was protected by mmap_sem in read-mode. Instead of going back to relying on the mmap_sem to enforce lifetime of a task, this patch ensures that mm->owner is properly set to NULL when the last possible owner is exiting, which the memory controller can handle just fine. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: Dave Jones <davej@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/miscLinus Torvalds2011-05-201-22/+88
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (41 commits) signal: trivial, fix the "timespec declared inside parameter list" warning job control: reorganize wait_task_stopped() ptrace: fix signal->wait_chldexit usage in task_clear_group_stop_trapping() signal: sys_sigprocmask() needs retarget_shared_pending() signal: cleanup sys_sigprocmask() signal: rename signandsets() to sigandnsets() signal: do_sigtimedwait() needs retarget_shared_pending() signal: introduce do_sigtimedwait() to factor out compat/native code signal: sys_rt_sigtimedwait: simplify the timeout logic signal: cleanup sys_rt_sigprocmask() x86: signal: sys_rt_sigreturn() should use set_current_blocked() x86: signal: handle_signal() should use set_current_blocked() signal: sigprocmask() should do retarget_shared_pending() signal: sigprocmask: narrow the scope of ->siglock signal: retarget_shared_pending: optimize while_each_thread() loop signal: retarget_shared_pending: consider shared/unblocked signals only signal: introduce retarget_shared_pending() ptrace: ptrace_check_attach() should not do s/STOPPED/TRACED/ signal: Turn SIGNAL_STOP_DEQUEUED into GROUP_STOP_DEQUEUED signal: do_signal_stop: Remove the unneeded task_clear_group_stop_pending() ...
| * job control: reorganize wait_task_stopped()Tejun Heo2011-05-131-7/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wait_task_stopped() tested task_stopped_code() without acquiring siglock and, if stop condition existed, called wait_task_stopped() and directly returned the result. This patch moves the initial task_stopped_code() testing into wait_task_stopped() and make wait_consider_task() fall through to wait_task_continue() on 0 return. This is for the following two reasons. * Because the initial task_stopped_code() test is done without acquiring siglock, it may race against SIGCONT generation. The stopped condition might have been replaced by continued state by the time wait_task_stopped() acquired siglock. This may lead to unexpected failure of WNOHANG waits. This reorganization addresses this single race case but there are other cases - TASK_RUNNING -> TASK_STOPPED transition and EXIT_* transitions. * Scheduled ptrace updates require changes to the initial test which would fit better inside wait_task_stopped(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
| * Merge branch 'ptrace' of ↵Oleg Nesterov2011-04-071-17/+67
| |\ | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into ptrace
| | * job control: Allow access to job control events through ptraceesTejun Heo2011-03-231-8/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently a real parent can't access job control stopped/continued events through a ptraced child. This utterly breaks job control when the children are ptraced. For example, if a program is run from an interactive shell and then strace(1) attaches to it, pressing ^Z would send SIGTSTP and strace(1) would notice it but the shell has no way to tell whether the child entered job control stop and thus can't tell when to take over the terminal - leading to awkward lone ^Z on the terminal. Because the job control and ptrace stopped states are independent, there is no reason to prevent real parents from accessing the stopped state regardless of ptrace. The continued state isn't separate but ptracers don't have any use for them as ptracees can never resume without explicit command from their ptracers, so as long as ptracers don't consume it, it should be fine. Although this is a behavior change, because the previous behavior is utterly broken when viewed from real parents and the change is only visible to real parents, I don't think it's necessary to make this behavior optional. One situation to be careful about is when a task from the real parent's group is ptracing. The parent group is the recipient of both ptrace and job control stop events and one stop can be reported as both job control and ptrace stops. As this can break the current ptrace users, suppress job control stopped events for these cases. If a real parent ptracer wants to know about both job control and ptrace stops, it can create a separate process to serve the role of real parent. Note that this only updates wait(2) side of things. The real parent can access the states via wait(2) but still is not properly notified (woken up and delivered signal). Test case polls wait(2) with WNOHANG to work around. Notification will be updated by future patches. Test case follows. #include <stdio.h> #include <unistd.h> #include <time.h> #include <errno.h> #include <sys/types.h> #include <sys/ptrace.h> #include <sys/wait.h> int main(void) { const struct timespec ts100ms = { .tv_nsec = 100000000 }; pid_t tracee, tracer; siginfo_t si; int i; tracee = fork(); if (tracee == 0) { while (1) { printf("tracee: SIGSTOP\n"); raise(SIGSTOP); nanosleep(&ts100ms, NULL); printf("tracee: SIGCONT\n"); raise(SIGCONT); nanosleep(&ts100ms, NULL); } } waitid(P_PID, tracee, &si, WSTOPPED | WNOHANG | WNOWAIT); tracer = fork(); if (tracer == 0) { nanosleep(&ts100ms, NULL); ptrace(PTRACE_ATTACH, tracee, NULL, NULL); for (i = 0; i < 11; i++) { si.si_pid = 0; waitid(P_PID, tracee, &si, WSTOPPED); if (si.si_pid && si.si_code == CLD_TRAPPED) ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status); } printf("tracer: EXITING\n"); return 0; } while (1) { si.si_pid = 0; waitid(P_PID, tracee, &si, WSTOPPED | WCONTINUED | WEXITED | WNOHANG); if (si.si_pid) printf("mommy : WAIT status=%02d code=%02d\n", si.si_status, si.si_code); nanosleep(&ts100ms, NULL); } return 0; } Before the patch, while ptraced, the parent can't see any job control events. tracee: SIGSTOP mommy : WAIT status=19 code=05 tracee: SIGCONT tracee: SIGSTOP tracee: SIGCONT tracee: SIGSTOP tracee: SIGCONT tracee: SIGSTOP tracer: EXITING mommy : WAIT status=19 code=05 ^C After the patch, tracee: SIGSTOP mommy : WAIT status=19 code=05 tracee: SIGCONT mommy : WAIT status=18 code=06 tracee: SIGSTOP mommy : WAIT status=19 code=05 tracee: SIGCONT mommy : WAIT status=18 code=06 tracee: SIGSTOP mommy : WAIT status=19 code=05 tracee: SIGCONT mommy : WAIT status=18 code=06 tracee: SIGSTOP tracer: EXITING mommy : WAIT status=19 code=05 ^C -v2: Oleg pointed out that wait(2) should be suppressed for the real parent's group instead of only the real parent task itself. Updated accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>
| | * job control: Fix ptracer wait(2) hang and explain notask_error clearingTejun Heo2011-03-231-10/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wait(2) and friends allow access to stopped/continued states through zombies, which is required as the states are process-wide and should be accessible whether the leader task is alive or undead. wait_consider_task() implements this by always clearing notask_error and going through wait_task_stopped/continued() for unreaped zombies. However, while ptraced, the stopped state is per-task and as such if the ptracee became a zombie, there's no further stopped event to listen to and wait(2) and friends should return -ECHILD on the tracee. Fix it by clearing notask_error only if WCONTINUED | WEXITED is set for ptraced zombies. While at it, document why clearing notask_error is safe for each case. Test case follows. #include <stdio.h> #include <unistd.h> #include <pthread.h> #include <time.h> #include <sys/types.h> #include <sys/ptrace.h> #include <sys/wait.h> static void *nooper(void *arg) { pause(); return NULL; } int main(void) { const struct timespec ts1s = { .tv_sec = 1 }; pid_t tracee, tracer; siginfo_t si; tracee = fork(); if (tracee == 0) { pthread_t thr; pthread_create(&thr, NULL, nooper, NULL); nanosleep(&ts1s, NULL); printf("tracee exiting\n"); pthread_exit(NULL); /* let subthread run */ } tracer = fork(); if (tracer == 0) { ptrace(PTRACE_ATTACH, tracee, NULL, NULL); while (1) { if (waitid(P_PID, tracee, &si, WSTOPPED) < 0) { perror("waitid"); break; } ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status); } return 0; } waitid(P_PID, tracer, &si, WEXITED); kill(tracee, SIGKILL); return 0; } Before the patch, after the tracee becomes a zombie, the tracer's waitid(WSTOPPED) never returns and the program doesn't terminate. tracee exiting ^C After the patch, tracee exiting triggers waitid() to fail. tracee exiting waitid: No child processes -v2: Oleg pointed out that exited in addition to continued can happen for ptraced dead group leader. Clear notask_error for ptraced child on WEXITED too. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>
| | * job control: Small reorganization of wait_consider_task()Tejun Heo2011-03-231-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | Move EXIT_DEAD test in wait_consider_task() above ptrace check. As ptraced tasks can't be EXIT_DEAD, this change doesn't cause any behavior change. This is to prepare for further changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>
* | | ptrace: Prepare to fix racy accesses on task breakpointsFrederic Weisbecker2011-04-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a task is traced and is in a stopped state, the tracer may execute a ptrace request to examine the tracee state and get its task struct. Right after, the tracee can be killed and thus its breakpoints released. This can happen concurrently when the tracer is in the middle of reading or modifying these breakpoints, leading to dereferencing a freed pointer. Hence, to prepare the fix, create a generic breakpoint reference holding API. When a reference on the breakpoints of a task is held, the breakpoints won't be released until the last reference is dropped. After that, no more ptrace request on the task's breakpoints can be serviced for the tracer. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Will Deacon <will.deacon@arm.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: v2.6.33.. <stable@kernel.org> Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com
* | | Fix common misspellingsLucas De Marchi2011-03-311-1/+1
|/ / | | | | | | | | | | Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
* | block: initial patch for on-stack per-task pluggingJens Axboe2011-03-101-0/+1
|/ | | | | | | | | | | | | | | | | | | | | This patch adds support for creating a queuing context outside of the queue itself. This enables us to batch up pieces of IO before grabbing the block device queue lock and submitting them to the IO scheduler. The context is created on the stack of the process and assigned in the task structure, so that we can auto-unplug it if we hit a schedule event. The current queue plugging happens implicitly if IO is submitted to an empty device, yet callers have to remember to unplug that IO when they are going to wait for it. This is an ugly API and has caused bugs in the past. Additionally, it requires hacks in the vm (->sync_page() callback) to handle that logic. By switching to an explicit plugging scheme we make the API a lot nicer and can get rid of the ->sync_page() hack in the vm. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds2011-01-111-5/+9
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (28 commits) perf session: Fix infinite loop in __perf_session__process_events perf evsel: Support perf_evsel__open(cpus > 1 && threads > 1) perf sched: Use PTHREAD_STACK_MIN to avoid pthread_attr_setstacksize() fail perf tools: Emit clearer message for sys_perf_event_open ENOENT return perf stat: better error message for unsupported events perf sched: Fix allocation result check perf, x86: P4 PMU - Fix unflagged overflows handling dynamic debug: Fix build issue with older gcc tracing: Fix TRACE_EVENT power tracepoint creation tracing: Fix preempt count leak tracepoint: Add __rcu annotation tracing: remove duplicate null-pointer check in skb tracepoint tracing/trivial: Add missing comma in TRACE_EVENT comment tracing: Include module.h in define_trace.h x86: Save rbp in pt_regs on irq entry x86, dumpstack: Fix unused variable warning x86, NMI: Clean-up default_do_nmi() x86, NMI: Allow NMI reason io port (0x61) to be processed on any CPU x86, NMI: Remove DIE_NMI_IPI x86, NMI: Add priorities to handlers ...
| * perf_events: Move code around to prepare for cgroupStephane Eranian2011-01-071-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In particular this patch move perf_event_exit_task() before cgroup_exit() to allow for cgroup support. The cgroup_exit() function detaches the cgroups attached to a task. Other movements include hoisting some definitions and inlines at the top of perf_event.c Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4d22058b.cdace30a.4657.ffff95b1@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | core: Replace __get_cpu_var with __this_cpu_read if not used for an address.Christoph Lameter2010-12-171-1/+1
|/ | | | | | | | | | | | | | | | | | __get_cpu_var() can be replaced with this_cpu_read and will then use a single read instruction with implied address calculation to access the correct per cpu instance. However, the address of a per cpu variable passed to __this_cpu_read() cannot be determined (since it's an implied address conversion through segment prefixes). Therefore apply this only to uses of __get_cpu_var where the address of the variable is not used. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hughd@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* do_exit(): make sure that we run with get_fs() == USER_DSNelson Elhage2010-12-021-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not otherwise reset before do_exit(). do_exit may later (via mm_release in fork.c) do a put_user to a user-controlled address, potentially allowing a user to leverage an oops into a controlled write into kernel memory. This is only triggerable in the presence of another bug, but this potentially turns a lot of DoS bugs into privilege escalations, so it's worth fixing. I have proof-of-concept code which uses this bug along with CVE-2010-3849 to write a zero to an arbitrary kernel address, so I've tested that this is not theoretical. A more logical place to put this fix might be when we know an oops has occurred, before we call do_exit(), but that would involve changing every architecture, in multiple places. Let's just stick it in do_exit instead. [akpm@linux-foundation.org: update code comment] Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* posix-cpu-timers: workaround to suppress the problems with mt execOleg Nesterov2010-11-051-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | posix-cpu-timers.c correctly assumes that the dying process does posix_cpu_timers_exit_group() and removes all !CPUCLOCK_PERTHREAD timers from signal->cpu_timers list. But, it also assumes that timer->it.cpu.task is always the group leader, and thus the dead ->task means the dead thread group. This is obviously not true after de_thread() changes the leader. After that almost every posix_cpu_timer_ method has problems. It is not simple to fix this bug correctly. First of all, I think that timer->it.cpu should use struct pid instead of task_struct. Also, the locking should be reworked completely. In particular, tasklist_lock should not be used at all. This all needs a lot of nontrivial and hard-to-test changes. Change __exit_signal() to do posix_cpu_timers_exit_group() when the old leader dies during exec. This is not the fix, just the temporary hack to hide the problem for 2.6.37 and stable. IOW, this is obviously wrong but this is what we currently have anyway: cpu timers do not work after mt exec. In theory this change adds another race. The exiting leader can detach the timers which were attached to the new leader. However, the window between de_thread() and release_task() is small, we can pretend that sys_timer_create() was called before de_thread(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* exit: add lock context annotation on find_new_reaper()Namhyung Kim2010-10-271-0/+2
| | | | | | | | | | | | find_new_reaper() releases and regrabs tasklist_lock but was missing proper annotations. Add it. This remove following sparse warning: warning: context imbalance in 'find_new_reaper' - unexpected unlock Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: add per-mm oom disable countYing Han2010-10-261-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's pointless to kill a task if another thread sharing its mm cannot be killed to allow future memory freeing. A subsequent patch will prevent kills in such cases, but first it's necessary to have a way to flag a task that shares memory with an OOM_DISABLE task that doesn't incur an additional tasklist scan, which would make select_bad_process() an O(n^2) function. This patch adds an atomic counter to struct mm_struct that follows how many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN. They cannot be killed by the kernel, so their memory cannot be freed in oom conditions. This only requires task_lock() on the task that we're operating on, it does not require mm->mmap_sem since task_lock() pins the mm and the operation is atomic. [rientjes@google.com: changelog and sys_unshare() code] [rientjes@google.com: protect oom_disable_count with task_lock in fork] [rientjes@google.com: use old_mm for oom_disable_count in exec] Signed-off-by: Ying Han <yinghan@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* perf: Fix up delayed_put_task_struct()Peter Zijlstra2010-09-091-3/+1
| | | | | | | | | I missed a perf_event_ctxp user when converting it to an array. Pull this last user into perf_event.c as well and fix it up. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Fix unprotected access to task credentials in waitid()Daniel J Blueman2010-08-171-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using a program like the following: #include <stdlib.h> #include <unistd.h> #include <sys/types.h> #include <sys/wait.h> int main() { id_t id; siginfo_t infop; pid_t res; id = fork(); if (id == 0) { sleep(1); exit(0); } kill(id, SIGSTOP); alarm(1); waitid(P_PID, id, &infop, WCONTINUED); return 0; } to call waitid() on a stopped process results in access to the child task's credentials without the RCU read lock being held - which may be replaced in the meantime - eliciting the following warning: =================================================== [ INFO: suspicious rcu_dereference_check() usage. ] --------------------------------------------------- kernel/exit.c:1460 invoked rcu_dereference_check() without protection! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 1 2 locks held by waitid02/22252: #0: (tasklist_lock){.?.?..}, at: [<ffffffff81061ce5>] do_wait+0xc5/0x310 #1: (&(&sighand->siglock)->rlock){-.-...}, at: [<ffffffff810611da>] wait_consider_task+0x19a/0xbe0 stack backtrace: Pid: 22252, comm: waitid02 Not tainted 2.6.35-323cd+ #3 Call Trace: [<ffffffff81095da4>] lockdep_rcu_dereference+0xa4/0xc0 [<ffffffff81061b31>] wait_consider_task+0xaf1/0xbe0 [<ffffffff81061d15>] do_wait+0xf5/0x310 [<ffffffff810620b6>] sys_waitid+0x86/0x1f0 [<ffffffff8105fce0>] ? child_wait_callback+0x0/0x70 [<ffffffff81003282>] system_call_fastpath+0x16/0x1b This is fixed by holding the RCU read lock in wait_task_continued() to ensure that the task's current credentials aren't destroyed between us reading the cred pointer and us reading the UID from those credentials. Furthermore, protect wait_task_stopped() in the same way. We don't need to keep holding the RCU read lock once we've read the UID from the credentials as holding the RCU read lock doesn't stop the target task from changing its creds under us - so the credentials may be outdated immediately after we've read the pointer, lock or no lock. Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ptrace: optimize exit_ptrace() for the likely caseOleg Nesterov2010-08-111-2/+5
| | | | | | | | | | | | | | | | | | | | | | | exit_ptrace() takes tasklist_lock unconditionally. We need this lock to avoid the race with ptrace_traceme(), it acts as a barrier. Change its caller, forget_original_parent(), to call exit_ptrace() under tasklist_lock. Change exit_ptrace() to drop and reacquire this lock if needed. This allows us to add the fastpath list_empty(ptraced) check. In the likely no-tracees case exit_ptrace() just returns and we avoid the lock() + unlock() sequence. "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> suggested to add this check, and he reports that this change adds about 11% improvement in some tests. Suggested-and-tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: turn signal_struct->count into "int nr_threads"Oleg Nesterov2010-05-271-4/+1
| | | | | | | | | | | | | | | | | | | | | | No functional changes, just s/atomic_t count/int nr_threads/. With the recent changes this counter has a single user, get_nr_threads() And, none of its callers need the really accurate number of threads, not to mention each caller obviously races with fork/exit. It is only used to report this value to the user-space, except first_tid() uses it to avoid the unnecessary while_each_thread() loop in the unlikely case. It is a bit sad we need a word in struct signal_struct for this, perhaps we can change get_nr_threads() to approximate the number of threads using signal->live and kill ->nr_threads later. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* exit: move taskstats_tgid_free() from __exit_signal() to free_signal_struct()Oleg Nesterov2010-05-271-1/+0
| | | | | | | | | | | | | | | Move taskstats_tgid_free() from __exit_signal() to free_signal_struct(). This way signal->stats never points to nowhere and we can read ->stats lockless. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* exit: __exit_signal: use thread_group_leader() consistentlyOleg Nesterov2010-05-271-6/+6
| | | | | | | | | | | | | | | | | | | | | | | Cleanup: - Add the boolean, group_dead = thread_group_leader(), for clarity. - Do not test/set sig == NULL to detect the all-dead case, use this boolean. - Pass this boolen to __unhash_process() and use it instead of another thread_group_leader() call which needs ->group_leader. This can be considered as microoptimization, but hopefully this also allows us do do other cleanups later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* signals: kill the awful task_rq_unlock_wait() hackOleg Nesterov2010-05-271-5/+0
| | | | | | | | | | | | | | | | | Now that task->signal can't go away we can revert the horrible hack added by ad474caca3e2a0550b7ce0706527ad5ab389a4d4 ("fix for account_group_exec_runtime(), make sure ->signal can't be freed under rq->lock"). And we can do more cleanups sched_stats.h/posix-cpu-timers.c later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* signals: clear signal->tty when the last thread exitsOleg Nesterov2010-05-271-1/+4
| | | | | | | | | | | | | | | | | | | | | | | When the last thread exits signal->tty is freed, but the pointer is not cleared and points to nowhere. This is OK. Nobody should use signal->tty lockless, and it is no longer possible to take ->siglock. However this looks wrong even if correct, and the nice OOPS is better than subtle and hard to find bugs. Change __exit_signal() to clear signal->tty under ->siglock. Note: __exit_signal() needs more cleanups. It should not check "sig != NULL" to detect the all-dead case and we have the same issues with signal->stats. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* signals: make task_struct->signal immutable/refcountableOleg Nesterov2010-05-271-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have a lot of problems with accessing task_struct->signal, it can "disappear" at any moment. Even current can't use its ->signal safely after exit_notify(). ->siglock helps, but it is not convenient, not always possible, and sometimes it makes sense to use task->signal even after this task has already dead. This patch adds the reference counter, sigcnt, into signal_struct. This reference is owned by task_struct and it is dropped in __put_task_struct(). Perhaps it makes sense to export get/put_signal_struct() later, but currently I don't see the immediate reason. Rename __cleanup_signal() to free_signal_struct() and unexport it. With the previous changes it does nothing except kmem_cache_free(). Change __exit_signal() to not clear/free ->signal, it will be freed when the last reference to any thread in the thread group goes away. Note: - when the last thead exits signal->tty can point to nowhere, see the next patch. - with or without this patch signal_struct->count should go away, or at least it should be "int nr_threads" for fs/proc. This will be addressed later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fork/exit: move tty_kref_put() outside of __cleanup_signal()Oleg Nesterov2010-05-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | tty_kref_put() has two callsites in copy_process() paths, 1. if copy_process() suceeds it is called before we copy signal->tty from parent 2. otherwise it is called from __cleanup_signal() under bad_fork_cleanup_signal: label In both cases tty_kref_put() is not right and unneeded because we don't have the balancing tty_kref_get(). Fortunately, this is harmless because this can only happen without CLONE_THREAD, and in this case signal->tty must be NULL. Remove tty_kref_put() from copy_process() and __cleanup_signal(), and change another caller of __cleanup_signal(), __exit_signal(), to call tty_kref_put() by hand. I hope this change makes sense by itself, but it is also needed to make ->signal refcountable. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Alan Cox <alan@linux.intel.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* exit: avoid sig->count in __exit_signal() to detect the group-dead caseOleg Nesterov2010-05-271-2/+3
| | | | | | | | | | | | | | | | Change __exit_signal() to check thread_group_leader() instead of atomic_dec_and_test(&sig->count). This must be equivalent, the group leader must be released only after all other threads have exited and passed __exit_signal(). Henceforth sig->count is not actually used, except in fs/proc for get_nr_threads/etc. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* exit: avoid sig->count in de_thread/__exit_signal synchronizationOleg Nesterov2010-05-271-1/+1
| | | | | | | | | | | | | | | | | de_thread() and __exit_signal() use signal_struct->count/notify_count for synchronization. We can simplify the code and use ->notify_count only. Instead of comparing these two counters, we can change de_thread() to set ->notify_count = nr_of_sub_threads, then change __exit_signal() to dec-and-test this counter and notify group_exit_task. Note that __exit_signal() checks "notify_count > 0" just for symmetry with exit_notify(), we could just check it is != 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* exit: exit_notify() can trust signal->notify_count < 0Oleg Nesterov2010-05-271-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | signal_struct->count in its current form must die. - it has no reasons to be atomic_t - it looks like a reference counter, but it is not - otoh, we really need to make task->signal refcountable, just look at the extremely ugly task_rq_unlock_wait() called from __exit_signals(). - we should change the lifetime rules for task->signal, it should be pinned to task_struct. We have a lot of code which can be simplified after that. - it is not needed! while the code is correct, any usage of this counter is artificial, except fs/proc uses it correctly to show the number of threads. This series removes the usage of sig->count from exit pathes. This patch: Now that Veaceslav changed copy_signal() to use zalloc(), exit_notify() can just check notify_count < 0 to ensure the execing sub-threads needs the notification from us. No need to do other checks, notify_count != 0 must always mean ->group_exit_task != NULL is waiting for us. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cpuset,mm: fix no node to alloc memory when changing cpuset's memsMiao Xie2010-05-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before applying this patch, cpuset updates task->mems_allowed and mempolicy by setting all new bits in the nodemask first, and clearing all old unallowed bits later. But in the way, the allocator may find that there is no node to alloc memory. The reason is that cpuset rebinds the task's mempolicy, it cleans the nodes which the allocater can alloc pages on, for example: (mpol: mempolicy) task1 task1's mpol task2 alloc page 1 alloc on node0? NO 1 1 change mems from 1 to 0 1 rebind task1's mpol 0-1 set new bits 0 clear disallowed bits alloc on node1? NO 0 ... can't alloc page goto oom This patch fixes this problem by expanding the nodes range first(set newly allowed bits) and shrink it lazily(clear newly disallowed bits). So we use a variable to tell the write-side task that read-side task is reading nodemask, and the write-side task clears newly disallowed nodes after read-side task ends the current memory allocation. [akpm@linux-foundation.org: fix spello] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'linus' into sched/coreIngo Molnar2010-04-151-1/+2
|\ | | | | | | | | | | Merge reason: merge the latest fixes, update to -rc4. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * mm: avoid null-pointer deref in sync_mm_rss()KAMEZAWA Hiroyuki2010-04-071-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - We weren't zeroing p->rss_stat[] at fork() - Consequently sync_mm_rss() was dereferencing tsk->mm for kernel threads and was oopsing. - Make __sync_task_rss_stat() static, too. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648 [akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)] Reported-by: Troels Liebe Bentsen <tlb@rapanden.dk> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> "Michael S. Tsirkin" <mst@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | sched: Remove remaining USER_SCHED codeLi Zefan2010-04-021-1/+0
|/ | | | | | | | | | | This is left over from commit 7c9414385e ("sched: Remove USER_SCHED"") Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Howells <dhowells@redhat.com> LKML-Reference: <4BA9A05F.7010407@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds2010-03-131-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: locking: Make sparse work with inline spinlocks and rwlocks x86/mce: Fix RCU lockdep splats rcu: Increase RCU CPU stall timeouts if PROVE_RCU ftrace: Replace read_barrier_depends() with rcu_dereference_raw() rcu: Suppress RCU lockdep warnings during early boot rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare() rcu: Suppress __mpol_dup() false positive from RCU lockdep rcu: Make rcu_read_lock_sched_held() handle !PREEMPT rcu: Add control variables to lockdep_rcu_dereference() diagnostics rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU rcu: Use wrapper function instead of exporting tasklist_lock sched, rcu: Fix rcu_dereference() for RCU-lockdep rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU x86/gart: Unexport gart_iommu_aperture Fix trivial conflicts in kernel/trace/ftrace.c
| * rcu: Use wrapper function instead of exporting tasklist_lockPaul E. McKenney2010-03-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lockdep-RCU commit d11c563d exported tasklist_lock, which is not a good thing. This patch instead exports a function that uses lockdep to check whether tasklist_lock is held. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: Christoph Hellwig <hch@lst.de> LKML-Reference: <1267631219-8713-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | kernel/exit.c: fix shadows sparse warningThiago Farina2010-03-061-1/+1
| | | | | | | | | | | | | | | | | | kernel/exit.c:1183:26: warning: symbol 'status' shadows an earlier one kernel/exit.c:1173:21: originally declared here Signed-off-by: Thiago Farina <tfransosi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: avoid false sharing of mm_counterKAMEZAWA Hiroyuki2010-03-061-1/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | Considering the nature of per mm stats, it's the shared object among threads and can be a cache-miss point in the page fault path. This patch adds per-thread cache for mm_counter. RSS value will be counted into a struct in task_struct and synchronized with mm's one at events. Now, in this patch, the event is the number of calls to handle_mm_fault. Per-thread value is added to mm at each 64 calls. rough estimation with small benchmark on parallel thread (2threads) shows [before] 4.5 cache-miss/faults [after] 4.0 cache-miss/faults Anyway, the most contended object is mmap_sem if the number of threads grows. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sched: Use lockdep-based checking on rcu_dereference()Paul E. McKenney2010-02-251-3/+11
| | | | | | | | | | | | | | | | | | | | Update the rcu_dereference() usages to take advantage of the new lockdep-based checking. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com> [ -v2: fix allmodconfig missing symbol export build failure on x86 ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
* do_wait() optimization: do not place sub-threads on task_struct->children listOleg Nesterov2009-12-171-19/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Thanks to Roland who pointed out de_thread() issues. Currently we add sub-threads to ->real_parent->children list. This buys nothing but slows down do_wait(). With this patch ->children contains only main threads (group leaders). The only complication is that forget_original_parent() should iterate over sub-threads by hand, and de_thread() needs another list_replace() when it changes ->group_leader. Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD tasks, we can remove this check (and we can unify do_wait_thread() and ptrace_do_wait()). This change can confuse the optimistic search in mm_update_next_owner(), but this is fixable and minor. Perhaps badness() and oom_kill_process() should be updated, but they should be fixed in any case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sched: Convert pi_lock to raw_spinlockThomas Gleixner2009-12-141-1/+1
| | | | | | | | | Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>
* tty: Move the leader test in disassociateAlan Cox2009-12-111-1/+1
| | | | | | | | | There are two call points, both want to check that tty->signal->leader is set. Move the test into disassociate_ctty() as that will make locking changes easier in a bit Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* Merge branch 'for-2.6.33' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2009-12-081-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits) cfq-iosched: Do not access cfqq after freeing it block: include linux/err.h to use ERR_PTR cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit blkio: Allow CFQ group IO scheduling even when CFQ is a module blkio: Implement dynamic io controlling policy registration blkio: Export some symbols from blkio as its user CFQ can be a module block: Fix io_context leak after failure of clone with CLONE_IO block: Fix io_context leak after clone with CLONE_IO cfq-iosched: make nonrot check logic consistent io controller: quick fix for blk-cgroup and modular CFQ cfq-iosched: move IO controller declerations to a header file cfq-iosched: fix compile problem with !CONFIG_CGROUP blkio: Documentation blkio: Wait on sync-noidle queue even if rq_noidle = 1 blkio: Implement group_isolation tunable blkio: Determine async workload length based on total number of queues blkio: Wait for cfq queue to get backlogged if group is empty blkio: Propagate cgroup weight updation to cfq groups blkio: Drop the reference to queue once the task changes cgroup blkio: Provide some isolation between groups ...
| * block: Fix io_context leak after failure of clone with CLONE_IOLouis Rilling2009-12-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | With CLONE_IO, parent's io_context->nr_tasks is incremented, but never decremented whenever copy_process() fails afterwards, which prevents exit_io_context() from calling IO schedulers exit functions. Give a task_struct to exit_io_context(), and call exit_io_context() instead of put_io_context() in copy_process() cleanup path. Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* | Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2009-12-051-9/+13
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits) sched, cputime: Introduce thread_group_times() sched, cputime: Cleanups related to task_times() Revert "sched, x86: Optimize branch hint in __switch_to()" sched: Fix isolcpus boot option sched: Revert 498657a478c60be092208422fefa9c7b248729c2 sched, time: Define nsecs_to_jiffies() sched: Remove task_{u,s,g}time() sched: Introduce task_times() to replace task_{u,s}time() pair sched: Limit the number of scheduler debug messages sched.c: Call debug_show_all_locks() when dumping all tasks sched, x86: Optimize branch hint in __switch_to() sched: Optimize branch hint in context_switch() sched: Optimize branch hint in pick_next_task_fair() sched_feat_write(): Update ppos instead of file->f_pos sched: Sched_rt_periodic_timer vs cpu hotplug sched, kvm: Fix race condition involving sched_in_preempt_notifers sched: More generic WAKE_AFFINE vs select_idle_sibling() sched: Cleanup select_task_rq_fair() sched: Fix granularity of task_u/stime() sched: Fix/add missing update_rq_clock() calls ...
| * | sched, cputime: Introduce thread_group_times()Hidetoshi Seto2009-12-021-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a real fix for problem of utime/stime values decreasing described in the thread: http://lkml.org/lkml/2009/11/3/522 Now cputime is accounted in the following way: - {u,s}time in task_struct are increased every time when the thread is interrupted by a tick (timer interrupt). - When a thread exits, its {u,s}time are added to signal->{u,s}time, after adjusted by task_times(). - When all threads in a thread_group exits, accumulated {u,s}time (and also c{u,s}time) in signal struct are added to c{u,s}time in signal struct of the group's parent. So {u,s}time in task struct are "raw" tick count, while {u,s}time and c{u,s}time in signal struct are "adjusted" values. And accounted values are used by: - task_times(), to get cputime of a thread: This function returns adjusted values that originates from raw {u,s}time and scaled by sum_exec_runtime that accounted by CFS. - thread_group_cputime(), to get cputime of a thread group: This function returns sum of all {u,s}time of living threads in the group, plus {u,s}time in the signal struct that is sum of adjusted cputimes of all exited threads belonged to the group. The problem is the return value of thread_group_cputime(), because it is mixed sum of "raw" value and "adjusted" value: group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time) This misbehavior can break {u,s}time monotonicity. Assume that if there is a thread that have raw values greater than adjusted values (e.g. interrupted by 1000Hz ticks 50 times but only runs 45ms) and if it exits, cputime will decrease (e.g. -5ms). To fix this, we could do: group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time) But task_times() contains hard divisions, so applying it for every thread should be avoided. This patch fixes the above problem in the following way: - Modify thread's exit (= __exit_signal()) not to use task_times(). It means {u,s}time in signal struct accumulates raw values instead of adjusted values. As the result it makes thread_group_cputime() to return pure sum of "raw" values. - Introduce a new function thread_group_times(*task, *utime, *stime) that converts "raw" values of thread_group_cputime() to "adjusted" values, in same calculation procedure as task_times(). - Modify group's exit (= wait_task_zombie()) to use this introduced thread_group_times(). It make c{u,s}time in signal struct to have adjusted values like before this patch. - Replace some thread_group_cputime() by thread_group_times(). This replacements are only applied where conveys the "adjusted" cputime to users, and where already uses task_times() near by it. (i.e. sys_times(), getrusage(), and /proc/<PID>/stat.) This patch have a positive side effect: - Before this patch, if a group contains many short-life threads (e.g. runs 0.9ms and not interrupted by ticks), the group's cputime could be invisible since thread's cputime was accumulated after adjusted: imagine adjustment function as adj(ticks, runtime), {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0. After this patch it will not happen because the adjustment is applied after accumulated. v2: - remove if()s, put new variables into signal_struct. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Spencer Candland <spencer@bluehost.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> LKML-Reference: <4B162517.8040909@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | sched: Remove task_{u,s,g}time()Hidetoshi Seto2009-11-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now all task_{u,s}time() pairs are replaced by task_times(). And task_gtime() is too simple to be an inline function. Cleanup them all. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Spencer Candland <spencer@bluehost.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> LKML-Reference: <4B0E16D1.70902@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | sched: Introduce task_times() to replace task_{u,s}time() pairHidetoshi Seto2009-11-261-2/+5
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Functions task_{u,s}time() are called in pair in almost all cases. However task_stime() is implemented to call task_utime() from its inside, so such paired calls run task_utime() twice. It means we do heavy divisions (div_u64 + do_div) twice to get utime and stime which can be obtained at same time by one set of divisions. This patch introduces a function task_times(*tsk, *utime, *stime) to retrieve utime and stime at once in better, optimized way. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Spencer Candland <spencer@bluehost.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> LKML-Reference: <4B0E16AE.906@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>