aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86
Commit message (Collapse)AuthorAgeFilesLines
* x86/alternatives: Fix p6 nops on non-modular kernelsAvi Kivity2012-10-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | commit cb09cad44f07044d9810f18f6f9a6a6f3771f979 upstream. Probably a leftover from the early days of self-patching, p6nops are marked __initconst_or_module, which causes them to be discarded in a non-modular kernel. If something later triggers patching, it will overwrite kernel code with garbage. Reported-by: Tomas Racek <tracek@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com> Cc: Michael Tokarev <mjt@tls.msk.ru> Cc: Borislav Petkov <borislav.petkov@amd.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: qemu-devel@nongnu.org Cc: Anthony Liguori <anthony@codemonkey.ws> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/5034AE84.90708@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Ben Jencks <ben@bjencks.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* MCE: Fix vm86 handling for 32bit mce handlerAndi Kleen2012-10-021-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit a129a7c84582629741e5fa6f40026efcd7a65bd4 upstream. When running on 32bit the mce handler could misinterpret vm86 mode as ring 0. This can affect whether it does recovery or not; it was possible to panic when recovery was actually possible. Fix this by always forcing vm86 to look like ring 3. [ Backport to 3.0 notes: Things changed there slightly: - move mce_get_rip() up. It fills up m->cs and m->ip values which are evaluated in mce_severity(). Therefore move it up right before the mce_severity call. This seem to be another bug in 3.0? - Place the backport (fix m->cs in V86 case) to where m->cs gets filled which is mce_get_rip() in 3.0 ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Renninger <trenn@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86: Fix boot on Twinhead H12YAlan Cox2012-10-021-0/+17
| | | | | | | | | | | | | | | | | | | | | commit 80b3e557371205566a71e569fbfcce5b11f92dbe upstream. Despite lots of investigation into why this is needed we don't know or have an elegant cure. The only answer found on this laptop is to mark a problem region as used so that Linux doesn't put anything there. Currently all the users add reserve= command lines and anyone not knowing this needs to find the magic page that documents it. Automate it instead. Signed-off-by: Alan Cox <alan@linux.intel.com> Tested-and-bugfixed-by: Arne Fitzenreiter <arne@fitzenreiter.de> Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=10231 Link: http://lkml.kernel.org/r/20120515174347.5109.94551.stgit@bluebook Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xen/boot: Disable NUMA for PV guests.Konrad Rzeszutek Wilk2012-10-021-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8d54db795dfb1049d45dc34f0dddbc5347ec5642 upstream. The hypervisor is in charge of allocating the proper "NUMA" memory and dealing with the CPU scheduler to keep them bound to the proper NUMA node. The PV guests (and PVHVM) have no inkling of where they run and do not need to know that right now. In the future we will need to inject NUMA configuration data (if a guest spans two or more NUMA nodes) so that the kernel can make the right choices. But those patches are not yet present. In the meantime, disable the NUMA capability in the PV guest, which also fixes a bootup issue. Andre says: "we see Dom0 crashes due to the kernel detecting the NUMA topology not by ACPI, but directly from the northbridge (CONFIG_AMD_NUMA). This will detect the actual NUMA config of the physical machine, but will crash about the mismatch with Dom0's virtual memory. Variation of the theme: Dom0 sees what it's not supposed to see. This happens with the said config option enabled and on a machine where this scanning is still enabled (K8 and Fam10h, not Bulldozer class) We have this dump then: NUMA: Warning: node ids are out of bound, from=-1 to=-1 distance=10 Scanning NUMA topology in Northbridge 24 Number of physical nodes 4 Node 0 MemBase 0000000000000000 Limit 0000000040000000 Node 1 MemBase 0000000040000000 Limit 0000000138000000 Node 2 MemBase 0000000138000000 Limit 00000001f8000000 Node 3 MemBase 00000001f8000000 Limit 0000000238000000 Initmem setup node 0 0000000000000000-0000000040000000 NODE_DATA [000000003ffd9000 - 000000003fffffff] Initmem setup node 1 0000000040000000-0000000138000000 NODE_DATA [0000000137fd9000 - 0000000137ffffff] Initmem setup node 2 0000000138000000-00000001f8000000 NODE_DATA [00000001f095e000 - 00000001f0984fff] Initmem setup node 3 00000001f8000000-0000000238000000 Cannot find 159744 bytes in node 3 BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81d220e6>] __alloc_bootmem_node+0x43/0x96 Pid: 0, comm: swapper Not tainted 3.3.6 #1 AMD Dinar/Dinar RIP: e030:[<ffffffff81d220e6>] [<ffffffff81d220e6>] __alloc_bootmem_node+0x43/0x96 .. snip.. [<ffffffff81d23024>] sparse_early_usemaps_alloc_node+0x64/0x178 [<ffffffff81d23348>] sparse_init+0xe4/0x25a [<ffffffff81d16840>] paging_init+0x13/0x22 [<ffffffff81d07fbb>] setup_arch+0x9c6/0xa9b [<ffffffff81683954>] ? printk+0x3c/0x3e [<ffffffff81d01a38>] start_kernel+0xe5/0x468 [<ffffffff81d012cf>] x86_64_start_reservations+0xba/0xc1 [<ffffffff81007153>] ? xen_setup_runstate_info+0x2c/0x36 [<ffffffff81d050ee>] xen_start_kernel+0x565/0x56c " so we just disable NUMA scanning by setting numa_off=1. Reported-and-Tested-by: Andre Przywara <andre.przywara@amd.com> Acked-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: hugetlbfs: correctly populate shared pmdMichal Hocko2012-09-141-5/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit eb48c071464757414538c68a6033c8f8c15196f8 upstream. Each page mapped in a process's address space must be correctly accounted for in _mapcount. Normally the rules for this are straightforward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman: kernel BUG at mm/filemap.c:135! invalid opcode: 0000 [#1] SMP CPU 22 Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 RIP: 0010:[<ffffffff8112cfed>] [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170 Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20) Call Trace: delete_from_page_cache+0x40/0x80 truncate_hugepages+0x115/0x1f0 hugetlbfs_evict_inode+0x18/0x30 evict+0x9f/0x1b0 iput_final+0xe3/0x1e0 iput+0x3e/0x50 d_kill+0xf8/0x110 dput+0xe2/0x1b0 __fput+0x162/0x240 During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram: parent | ------------>pmd src_pte----------> data page ^ other--------->pmd--------------------| ^ child-----------| dst_pte For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race. PROC A PROC B copy_hugetlb_page_range copy_hugetlb_page_range src_pte == huge_pte_offset src_pte == huge_pte_offset !src_pte so no sharing !src_pte so no sharing (time passes) hugetlb_fault hugetlb_fault huge_pte_alloc huge_pte_alloc huge_pmd_share huge_pmd_share LOCK(i_mmap_mutex) find nothing, no sharing UNLOCK(i_mmap_mutex) LOCK(i_mmap_mutex) find nothing, no sharing UNLOCK(i_mmap_mutex) pmd_alloc pmd_alloc LOCK(instantiation_mutex) fault UNLOCK(instantiation_mutex) LOCK(instantiation_mutex) fault UNLOCK(instantiation_mutex) These two processes are not poing to the same data page but are not sharing page tables because the opportunity was missed. When either process later forks, the src_pte == dst pte is potentially insufficient. As the check falls through, the wrong PTE information is copied in (harmless but wrong) and the mapcount is bumped for a page mapped by a shared page table leading to the BUG_ON. This patch addresses the issue by moving pmd_alloc into huge_pmd_share which guarantees that the shared pud is populated in the same critical section as pmd. This also means that huge_pte_offset test in huge_pmd_share is serialized correctly now which in turn means that the success of the sharing will be higher as the racing tasks see the pud and pmd populated together. Race identified and changelog written mostly by Mel Gorman. {akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style] Reported-by: Larry Woodman <lwoodman@redhat.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xen: mark local pages as FOREIGN in the m2p_overrideStefano Stabellini2012-08-261-0/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b9e0d95c041ca2d7ad297ee37c2e9cfab67a188f upstream. When the frontend and the backend reside on the same domain, even if we add pages to the m2p_override, these pages will never be returned by mfn_to_pfn because the check "get_phys_to_machine(pfn) != mfn" will always fail, so the pfn of the frontend will be returned instead (resulting in a deadlock because the frontend pages are already locked). INFO: task qemu-system-i38:1085 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. qemu-system-i38 D ffff8800cfc137c0 0 1085 1 0x00000000 ffff8800c47ed898 0000000000000282 ffff8800be4596b0 00000000000137c0 ffff8800c47edfd8 ffff8800c47ec010 00000000000137c0 00000000000137c0 ffff8800c47edfd8 00000000000137c0 ffffffff82213020 ffff8800be4596b0 Call Trace: [<ffffffff81101ee0>] ? __lock_page+0x70/0x70 [<ffffffff81a0fdd9>] schedule+0x29/0x70 [<ffffffff81a0fe80>] io_schedule+0x60/0x80 [<ffffffff81101eee>] sleep_on_page+0xe/0x20 [<ffffffff81a0e1ca>] __wait_on_bit_lock+0x5a/0xc0 [<ffffffff81101ed7>] __lock_page+0x67/0x70 [<ffffffff8106f750>] ? autoremove_wake_function+0x40/0x40 [<ffffffff811867e6>] ? bio_add_page+0x36/0x40 [<ffffffff8110b692>] set_page_dirty_lock+0x52/0x60 [<ffffffff81186021>] bio_set_pages_dirty+0x51/0x70 [<ffffffff8118c6b4>] do_blockdev_direct_IO+0xb24/0xeb0 [<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00 [<ffffffff8118ca95>] __blockdev_direct_IO+0x55/0x60 [<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00 [<ffffffff811e91c8>] ext3_direct_IO+0xf8/0x390 [<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00 [<ffffffff81004b60>] ? xen_mc_flush+0xb0/0x1b0 [<ffffffff81104027>] generic_file_aio_read+0x737/0x780 [<ffffffff813bedeb>] ? gnttab_map_refs+0x15b/0x1e0 [<ffffffff811038f0>] ? find_get_pages+0x150/0x150 [<ffffffff8119736c>] aio_rw_vect_retry+0x7c/0x1d0 [<ffffffff811972f0>] ? lookup_ioctx+0x90/0x90 [<ffffffff81198856>] aio_run_iocb+0x66/0x1a0 [<ffffffff811998b8>] do_io_submit+0x708/0xb90 [<ffffffff81199d50>] sys_io_submit+0x10/0x20 [<ffffffff81a18d69>] system_call_fastpath+0x16/0x1b The explanation is in the comment within the code: We need to do this because the pages shared by the frontend (xen-blkfront) can be already locked (lock_page, called by do_read_cache_page); when the userspace backend tries to use them with direct_IO, mfn_to_pfn returns the pfn of the frontend, so do_blockdev_direct_IO is going to try to lock the same pages again resulting in a deadlock. A simplified call graph looks like this: pygrub QEMU ----------------------------------------------- do_read_cache_page io_submit | | lock_page ext3_direct_IO | bio_add_page | lock_page Internally the xen-blkback uses m2p_add_override to swizzle (temporarily) a 'struct page' to have a different MFN (so that it can point to another guest). It also can easily find out whether another pfn corresponding to the mfn exists in the m2p, and can set the FOREIGN bit in the p2m, making sure that mfn_to_pfn returns the pfn of the backend. This allows the backend to perform direct_IO on these pages, but as a side effect prevents the frontend from using get_user_pages_fast on them while they are being shared with the backend. Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86, microcode: Sanitize per-cpu microcode reloading interfaceBorislav Petkov2012-08-151-7/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c9fc3f778a6a215ace14ee556067c73982b6d40f upstream. Microcode reloading in a per-core manner is a very bad idea for both major x86 vendors. And the thing is, we have such interface with which we can end up with different microcode versions applied on different cores of an otherwise homogeneous wrt (family,model,stepping) system. So turn off the possibility of doing that per core and allow it only system-wide. This is a minimal fix which we'd like to see in stable too thus the more-or-less arbitrary decision to allow system-wide reloading only on the BSP: $ echo 1 > /sys/devices/system/cpu/cpu0/microcode/reload ... and disable the interface on the other cores: $ echo 1 > /sys/devices/system/cpu/cpu23/microcode/reload -bash: echo: write error: Invalid argument Also, allowing the reload only from one CPU (the BSP in that case) doesn't allow the reload procedure to degenerate into an O(n^2) deal when triggering reloads from all /sys/devices/system/cpu/cpuX/microcode/reload sysfs nodes simultaneously. A more generic fix will follow. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1340280437-7718-2-git-send-email-bp@amd64.org Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86, microcode: microcode_core.c simple_strtoul cleanupShuah Khan2012-08-151-5/+4
| | | | | | | | | | | | | | | commit e826abd523913f63eb03b59746ffb16153c53dc4 upstream. Change reload_for_cpu() in kernel/microcode_core.c to call kstrtoul() instead of calling obsoleted simple_strtoul(). Signed-off-by: Shuah Khan <shuahkhan@gmail.com> Reviewed-by: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/1336324264.2897.9.camel@lorien2 Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86, nops: Missing break resulting in incorrect selection on IntelAlan Cox2012-08-151-1/+1
| | | | | | | | | | | | | | commit d6250a3f12edb3a86db9598ffeca3de8b4a219e9 upstream. The Intel case falls through into the generic case which then changes the values. For cases like the P6 it doesn't do the right thing so this seems to be a screwup. Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/n/tip-lww2uirad4skzjlmrm0vru8o@git.kernel.org Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86: Simplify code by removing a !SMP #ifdefs from 'struct cpuinfo_x86'Kevin Winchester2012-08-158-26/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 141168c36cdee3ff23d9c7700b0edc47cb65479f and commit 3f806e50981825fa56a7f1938f24c0680816be45 upstream. Several fields in struct cpuinfo_x86 were not defined for the !SMP case, likely to save space. However, those fields still have some meaning for UP, and keeping them allows some #ifdef removal from other files. The additional size of the UP kernel from this change is not significant enough to worry about keeping up the distinction: text data bss dec hex filename 4737168 506459 972040 6215667 5ed7f3 vmlinux.o.before 4737444 506459 972040 6215943 5ed907 vmlinux.o.after for a difference of 276 bytes for an example UP config. If someone wants those 276 bytes back badly then it should be implemented in a cleaner way. Signed-off-by: Kevin Winchester <kjwinchester@gmail.com> Cc: Steffen Persvold <sp@numascale.com> Link: http://lkml.kernel.org/r/1324428742-12498-1-git-send-email-kjwinchester@gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ACPI: Remove one board specific WARN when ignoring timer overridingFeng Tang2012-07-161-5/+0
| | | | | | | | | | | | | commit 7f68b4c2e158019c2ec494b5cfbd9c83b4e5b253 upstream. Current WARN msg is only for the ati_ixp4x0 board, while this function is used by mulitple platforms. So this one board specific warning is not appropriate any more. Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ACPI: Make acpi_skip_timer_override cover all source_irq==0 casesFeng Tang2012-07-161-6/+8
| | | | | | | | | | | | | | | | | | | | commit ae10ccdc3093486f8c2369d227583f9d79f628e5 upstream. Currently when acpi_skip_timer_override is set, it only cover the (source_irq == 0 && global_irq == 2) cases. While there is also platform which need use this option and its global_irq is not 2. This patch will extend acpi_skip_timer_override to cover all timer overriding cases as long as the source irq is 0. This is the first part of a fix to kernel bug bugzilla 40002: "IRQ 0 assigned to VGA" https://bugzilla.kernel.org/show_bug.cgi?id=40002 Reported-and-tested-by: Szymon Kowalczyk <fazerxlo@o2.pl> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86, cpufeature: Rename X86_FEATURE_DTS to X86_FEATURE_DTHERMH. Peter Anvin2012-07-162-2/+2
| | | | | | | | | | | | | | | | | | | | | | | commit 4ad33411308596f2f918603509729922a1ec4411 upstream. It makes sense to label "Digital Thermal Sensor" as "DTS", but unfortunately the string "dts" was already used for "Debug Store", and /proc/cpuinfo is a user space ABI. Therefore, rename this to "dtherm". This conflict went into mainline via the hwmon tree without any x86 maintainer ack, and without any kind of hint in the subject. a4659053 x86/hwmon: fix initialization of coretemp Reported-by: Jean Delvare <khali@linux-fr.org> Link: http://lkml.kernel.org/r/4FE34BCB.5050305@linux.intel.com Cc: Jan Beulich <JBeulich@suse.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> [bwh: Backported to 3.2: drop the coretemp device table change] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ACPI, x86: fix Dell M6600 ACPI reboot regression via DMIZhang Rui2012-07-161-0/+8
| | | | | | | | | | | | | | | commit 76eb9a30db4bc8fd172f9155247264b5f2686d7b upstream. Dell Precision M6600 is known to require PCI reboot, so add it to the reboot blacklist in pci_reboot_dmi_table[]. https://bugzilla.kernel.org/show_bug.cgi?id=42749 cc: x86@kernel.org Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ACPI: Add a quirk for "AMILO PRO V2030" to ignore the timer overridingFeng Tang2012-07-161-0/+8
| | | | | | | | | | | | | | | | | | | commit f6b54f083cc66cf9b11d2120d8df3c2ad4e0836d upstream. This is the 2nd part of fix for kernel bugzilla 40002: "IRQ 0 assigned to VGA" https://bugzilla.kernel.org/show_bug.cgi?id=40002 The root cause is the buggy FW, whose ACPI tables assign the GSI 16 to 2 irqs 0 and 16(VGA), and the VGA is the right owner of GSI 16. So add a quirk to ignore the irq0 overriding GSI 16 for the FUJITSU SIEMENS AMILO PRO V2030 platform will solve this issue. Reported-and-tested-by: Szymon Kowalczyk <fazerxlo@o2.pl> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xen/setup: filter APERFMPERF cpuid feature outAndre Przywara2012-06-221-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | commit 5e626254206a709c6e937f3dda69bf26c7344f6f upstream. Xen PV kernels allow access to the APERF/MPERF registers to read the effective frequency. Access to the MSRs is however redirected to the currently scheduled physical CPU, making consecutive read and compares unreliable. In addition each rdmsr traps into the hypervisor. So to avoid bogus readouts and expensive traps, disable the kernel internal feature flag for APERF/MPERF if running under Xen. This will a) remove the aperfmperf flag from /proc/cpuinfo b) not mislead the power scheduler (arch/x86/kernel/cpu/sched.c) to use the feature to improve scheduling (by default disabled) c) not mislead the cpufreq driver to use the MSRs This does not cover userland programs which access the MSRs via the device file interface, but this will be addressed separately. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86, MCE, AMD: Make APIC LVT thresholding interrupt optionalBorislav Petkov2012-06-171-11/+44
| | | | | | | | | | | | | | | | | | commit f227d4306cf30e1d5b6f231e8ef9006c34f3d186 upstream. Currently, the APIC LVT interrupt for error thresholding is implicitly enabled. However, there are models in the F15h range which do not enable it. Make the code machinery which sets up the APIC interrupt support an optional setting and add an ->interrupt_capable member to the bank representation mirroring that capability and enable the interrupt offset programming only if it is true. Simplify code and fixup comment style while at it. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Signed-off-by: Robert Richter <robert.richter@amd.com>
* crypto: aesni-intel - fix unaligned cbc decrypt for x86-32Mathias Krause2012-06-171-2/+4
| | | | | | | | | | | | | | | | | | commit 7c8d51848a88aafdb68f42b6b650c83485ea2f84 upstream. The 32 bit variant of cbc(aes) decrypt is using instructions requiring 128 bit aligned memory locations but fails to ensure this constraint in the code. Fix this by loading the data into intermediate registers with load unaligned instructions. This fixes reported general protection faults related to aesni. References: https://bugzilla.kernel.org/show_bug.cgi?id=43223 Reported-by: Daniel <garkein@mailueberfall.de> Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race conditionAndrea Arcangeli2012-06-101-0/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 26c191788f18129af0eb32a358cdaea0c7479626 upstream. When holding the mmap_sem for reading, pmd_offset_map_lock should only run on a pmd_t that has been read atomically from the pmdp pointer, otherwise we may read only half of it leading to this crash. PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic" #0 [f06a9dd8] crash_kexec at c049b5ec #1 [f06a9e2c] oops_end at c083d1c2 #2 [f06a9e40] no_context at c0433ded #3 [f06a9e64] bad_area_nosemaphore at c043401a #4 [f06a9e6c] __do_page_fault at c0434493 #5 [f06a9eec] do_page_fault at c083eb45 #6 [f06a9f04] error_code (via page_fault) at c083c5d5 EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP: 00000000 DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0 CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246 #7 [f06a9f38] _spin_lock at c083bc14 #8 [f06a9f44] sys_mincore at c0507b7d #9 [f06a9fb0] system_call at c083becd start len EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00 SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033 CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286 This should be a longstanding bug affecting x86 32bit PAE without THP. Only archs with 64bit large pmd_t and 32bit unsigned long should be affected. With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad() would partly hide the bug when the pmd transition from none to stable, by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is enabled a new set of problem arises by the fact could then transition freely in any of the none, pmd_trans_huge or pmd_trans_stable states. So making the barrier in pmd_none_or_trans_huge_or_clear_bad() unconditional isn't good idea and it would be a flakey solution. This should be fully fixed by introducing a pmd_read_atomic that reads the pmd in order with THP disabled, or by reading the pmd atomically with cmpxchg8b with THP enabled. Luckily this new race condition only triggers in the places that must already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix is localized there but this bug is not related to THP. NOTE: this can trigger on x86 32bit systems with PAE enabled with more than 4G of ram, otherwise the high part of the pmd will never risk to be truncated because it would be zero at all times, in turn so hiding the SMP race. This bug was discovered and fully debugged by Ulrich, quote: ---- [..] pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and eax. 496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd) 497 { 498 /* depend on compiler for an atomic pmd read */ 499 pmd_t pmdval = *pmd; // edi = pmd pointer 0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi ... // edx = PTE page table high address 0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx ... // eax = PTE page table low address 0xc0507a8e <sys_mincore+574>: mov (%edi),%eax [..] Please note that the PMD is not read atomically. These are two "mov" instructions where the high order bits of the PMD entry are fetched first. Hence, the above machine code is prone to the following race. - The PMD entry {high|low} is 0x0000000000000000. The "mov" at 0xc0507a84 loads 0x00000000 into edx. - A page fault (on another CPU) sneaks in between the two "mov" instructions and instantiates the PMD. - The PMD entry {high|low} is now 0x00000003fda38067. The "mov" at 0xc0507a8e loads 0xfda38067 into eax. ---- Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/mce: Fix check for processor context when machine check was taken.Tony Luck2012-06-011-6/+10
| | | | | | | | | | | | | | commit 875e26648cf9b6db9d8dc07b7959d7c61fb3f49c upstream. Linus pointed out that there was no value is checking whether m->ip was zero - because zero is a legimate value. If we have a reliable (or faked in the VM86 case) "m->cs" we can use it to tell whether we were in user mode or kernelwhen the machine check hit. Reported-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* perf/x86: Update event scheduling constraints for AMD family 15h modelsRobert Richter2012-06-011-1/+10
| | | | | | | | | | | | | | commit 5bcdf5e4fee3c45e1281c25e4941f2163cb28c65 upstream. This update is for newer family 15h cpu models from 0x02 to 0x1f. Signed-off-by: Robert Richter <robert.richter@amd.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1337337642-1621-1-git-send-email-robert.richter@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* percpu, x86: don't use PMD_SIZE as embedded atom_size on 32bitTejun Heo2012-05-211-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d5e28005a1d2e67833852f4c9ea8ec206ea3ff85 upstream. With the embed percpu first chunk allocator, x86 uses either PAGE_SIZE or PMD_SIZE for atom_size. PMD_SIZE is used when CPU supports PSE so that percpu areas are aligned to PMD mappings and possibly allow using PMD mappings in vmalloc areas in the future. Using larger atom_size doesn't waste actual memory; however, it does require larger vmalloc space allocation later on for !first chunks. With reasonably sized vmalloc area, PMD_SIZE shouldn't be a problem but x86_32 at this point is anything but reasonable in terms of address space and using larger atom_size reportedly leads to frequent percpu allocation failures on certain setups. As there is no reason to not use PMD_SIZE on x86_64 as vmalloc space is aplenty and most x86_64 configurations support PSE, fix the issue by always using PMD_SIZE on x86_64 and PAGE_SIZE on x86_32. v2: drop cpu_has_pse test and make x86_64 always use PMD_SIZE and x86_32 PAGE_SIZE as suggested by hpa. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Yanmin Zhang <yanmin.zhang@intel.com> Reported-by: ShuoX Liu <shuox.liu@intel.com> Acked-by: H. Peter Anvin <hpa@zytor.com> LKML-Reference: <4F97BA98.6010001@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xen/pci: don't use PCI BIOS service for configuration space accessesDavid Vrabel2012-05-211-2/+5
| | | | | | | | | | | | | | | | | | commit 76a8df7b49168509df02461f83fab117a4a86e08 upstream. The accessing PCI configuration space with the PCI BIOS32 service does not work in PV guests. On systems without MMCONFIG or where the BIOS hasn't marked the MMCONFIG region as reserved in the e820 map, the BIOS service is probed (even though direct access is preferred) and this hangs. Acked-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> [v1: Fixed compile error when CONFIG_PCI is not set] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xen/pte: Fix crashes when trying to see non-existent PGD/PMD/PUD/PTEsKonrad Rzeszutek Wilk2012-05-211-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b7e5ffe5d83fa40d702976d77452004abbe35791 upstream. If I try to do "cat /sys/kernel/debug/kernel_page_tables" I end up with: BUG: unable to handle kernel paging request at ffffc7fffffff000 IP: [<ffffffff8106aa51>] ptdump_show+0x221/0x480 PGD 0 Oops: 0000 [#1] SMP CPU 0 .. snip.. RAX: 0000000000000000 RBX: ffffc00000000fff RCX: 0000000000000000 RDX: 0000800000000000 RSI: 0000000000000000 RDI: ffffc7fffffff000 which is due to the fact we are trying to access a PFN that is not accessible to us. The reason (at least in this case) was that PGD[256] is set to __HYPERVISOR_VIRT_START which was setup (by the hypervisor) to point to a read-only linear map of the MFN->PFN array. During our parsing we would get the MFN (a valid one), try to look it up in the MFN->PFN tree and find it invalid and return ~0 as PFN. Then pte_mfn_to_pfn would happilly feed that in, attach the flags and return it back to the caller. 'ptdump_show' bitshifts it and gets and invalid value that it tries to dereference. Instead of doing all of that, we detect the ~0 case and just return !_PAGE_PRESENT. This bug has been in existence .. at least until 2.6.37 (yikes!) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xen/smp: Fix crash when booting with ACPI hotplug CPUs.Konrad Rzeszutek Wilk2012-05-071-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit cf405ae612b0f7e2358db7ff594c0e94846137aa upstream. When we boot on a machine that can hotplug CPUs and we are using 'dom0_max_vcpus=X' on the Xen hypervisor line to clip the amount of CPUs available to the initial domain, we get this: (XEN) Command line: com1=115200,8n1 dom0_mem=8G noreboot dom0_max_vcpus=8 sync_console mce_verbosity=verbose console=com1,vga loglvl=all guest_loglvl=all .. snip.. DMI: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.99.99.x032.072520111118 07/25/2011 .. snip. SMP: Allowing 64 CPUs, 32 hotplug CPUs installing Xen timer for CPU 7 cpu 7 spinlock event irq 361 NMI watchdog: disabled (cpu7): hardware events not enabled Brought up 8 CPUs .. snip.. [acpi processor finds the CPUs are not initialized and starts calling arch_register_cpu, which creates /sys/devices/system/cpu/cpu8/online] CPU 8 got hotplugged CPU 9 got hotplugged CPU 10 got hotplugged .. snip.. initcall 1_acpi_battery_init_async+0x0/0x1b returned 0 after 406 usecs calling erst_init+0x0/0x2bb @ 1 [and the scheduler sticks newly started tasks on the new CPUs, but said CPUs cannot be initialized b/c the hypervisor has limited the amount of vCPUS to 8 - as per the dom0_max_vcpus=8 flag. The spinlock tries to kick the other CPU, but the structure for that is not initialized and we crash.] BUG: unable to handle kernel paging request at fffffffffffffed8 IP: [<ffffffff81035289>] xen_spin_lock+0x29/0x60 PGD 180d067 PUD 180e067 PMD 0 Oops: 0002 [#1] SMP CPU 7 Modules linked in: Pid: 1, comm: swapper/0 Not tainted 3.4.0-rc2upstream-00001-gf5154e8 #1 Intel Corporation S2600CP/S2600CP RIP: e030:[<ffffffff81035289>] [<ffffffff81035289>] xen_spin_lock+0x29/0x60 RSP: e02b:ffff8801fb9b3a70 EFLAGS: 00010282 With this patch, we cap the amount of vCPUS that the initial domain can run, to exactly what dom0_max_vcpus=X has specified. In the future, if there is a hypercall that will allow a running domain to expand past its initial set of vCPUS, this patch should be re-evaluated. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xen: correctly check for pending events when restoring irq flagsDavid Vrabel2012-05-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | commit 7eb7ce4d2e8991aff4ecb71a81949a907ca755ac upstream. In xen_restore_fl_direct(), xen_force_evtchn_callback() was being called even if no events were pending. This resulted in (depending on workload) about a 100 times as many xen_version hypercalls as necessary. Fix this by correcting the sense of the conditional jump. This seems to give a significant performance benefit for some workloads. There is some subtle tricksy "..since the check here is trying to check both pending and masked in a single cmpw, but I think this is correct. It will call check_events now only when the combined mask+pending word is 0x0001 (aka unmasked, pending)." (Ian) Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86, apic: APIC code touches invalid MSR on P5 class machinesBryan O'Donoghue2012-05-071-14/+20
| | | | | | | | | | | | | | | | | | | | | | | commit cbf2829b61c136edcba302a5e1b6b40e97d32c00 upstream. Current APIC code assumes MSR_IA32_APICBASE is present for all systems. Pentium Classic P5 and friends didn't have this MSR. MSR_IA32_APICBASE was introduced as an architectural MSR by Intel @ P6. Code paths that can touch this MSR invalidly are when vendor == Intel && cpu-family == 5 and APIC bit is set in CPUID - or when you simply pass lapic on the kernel command line, on a P5. The below patch stops Linux incorrectly interfering with the MSR_IA32_APICBASE for P5 class machines. Other code paths exist that touch the MSR - however those paths are not currently reachable for a conformant P5. Signed-off-by: Bryan O'Donoghue <bryan.odonoghue@linux.intel.com> Link: http://lkml.kernel.org/r/4F8EEDD3.1080404@linux.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* iommu/amd: Make sure IOMMU interrupts are re-enabled on resumeJoerg Roedel2012-04-131-7/+17
| | | | | | | | | | | | | | | | commit 9ddd592a191b32f2ee6c4b6ed2bd52665c3a49f5 upstream. Unfortunatly the interrupts for the event log and the peripheral page-faults are only enabled at boot but not re-enabled at resume. Fix that. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> [bwh: Backport to 3.0: - Drop change to PPR log which was added in 3.3 - Source is under arch/x86/kernel] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/x86: Fix overflow in cyc2ns_offsetSalman Qazi2012-04-132-7/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9993bc635d01a6ee7f6b833b4ee65ce7c06350b1 upstream. When a machine boots up, the TSC generally gets reset. However, when kexec is used to boot into a kernel, the TSC value would be carried over from the previous kernel. The computation of cycns_offset in set_cyc2ns_scale is prone to an overflow, if the machine has been up more than 208 days prior to the kexec. The overflow happens when we multiply *scale, even though there is enough room to store the final answer. We fix this issue by decomposing tsc_now into the quotient and remainder of division by CYC2NS_SCALE_FACTOR and then performing the multiplication separately on the two components. Refactor code to share the calculation with the previous fix in __cycles_2_ns(). Signed-off-by: Salman Qazi <sqazi@google.com> Acked-by: John Stultz <john.stultz@linaro.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: john stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/20120310004027.19291.88460.stgit@dungbeetle.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Revert "x86/ioapic: Add register level checks to detect bogus io-apic entries"Greg Kroah-Hartman2012-04-131-32/+8
| | | | | | | | | | | | | | | | | | This reverts commit c2ec63edaf48c90c3495eeb0b75bb05102fbf71a [73d63d038ee9f769f5e5b46792d227fe20e442c5 upstream] It causes problems, so needs to be reverted from 3.2-stable for now. Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jon Dufresne <jon@jondufresne.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: <yinghai@kernel.org> Cc: Josh Boyer <jwboyer@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Teck Choon Giam <giamteckchoon@gmail.com> Cc: Ben Guthro <ben@guthro.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/PCI: do not tie MSI MS-7253 use_crs quirk to BIOS versionJonathan Nieder2012-04-131-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a97f4f5e524bcd09a85ef0b8821a14d35e69335f upstream. Carlos was getting WARNING: at drivers/pci/pci.c:118 pci_ioremap_bar+0x24/0x52() when probing his sound card, and sound did not work. After adding pci=use_crs to the kernel command line, no more trouble. Ok, we can add a quirk. dmidecode output reveals that this is an MSI MS-7253, for which we already have a quirk, but the short-sighted author tied the quirk to a single BIOS version, making it not kick in on Carlos's machine with BIOS V1.2. If a later BIOS update makes it no longer necessary to look at the _CRS info it will still be harmless, so let's stop trying to guess which versions have and don't have accurate _CRS tables. Addresses https://bugtrack.alsa-project.org/alsa-bug/view.php?id=5533 Also see <https://bugzilla.kernel.org/show_bug.cgi?id=42619>. Reported-by: Carlos Luna <caralu74@gmail.com> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/PCI: use host bridge _CRS info on MSI MS-7253Jonathan Nieder2012-04-131-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8411371709610c826bf65684f886bfdfb5780ca1 upstream. In the spirit of commit 29cf7a30f8a0 ("x86/PCI: use host bridge _CRS info on ASUS M2V-MX SE"), this DMI quirk turns on "pci_use_crs" by default on a board that needs it. This fixes boot failures and oopses introduced in 3e3da00c01d0 ("x86/pci: AMD one chain system to use pci read out res"). The quirk is quite targetted (to a specific board and BIOS version) for two reasons: (1) to emphasize that this method of tackling the problem one quirk at a time is a little insane (2) to give BIOS vendors an opportunity to use simpler tables and allow us to return to generic behavior (whatever that happens to be) with a later BIOS update In other words, I am not at all happy with having quirks like this. But it is even worse for the kernel not to work out of the box on these machines, so... Reference: https://bugzilla.kernel.org/show_bug.cgi?id=42619 Reported-by: Svante Signell <svante.signell@telia.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86,kgdb: Fix DEBUG_RODATA limitation using text_poke()Jason Wessel2012-04-131-0/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3751d3e85cf693e10e2c47c03c8caa65e171099b upstream. There has long been a limitation using software breakpoints with a kernel compiled with CONFIG_DEBUG_RODATA going back to 2.6.26. For this particular patch, it will apply cleanly and has been tested all the way back to 2.6.36. The kprobes code uses the text_poke() function which accommodates writing a breakpoint into a read-only page. The x86 kgdb code can solve the problem similarly by overriding the default breakpoint set/remove routines and using text_poke() directly. The x86 kgdb code will first attempt to use the traditional probe_kernel_write(), and next try using a the text_poke() function. The break point install method is tracked such that the correct break point removal routine will get called later on. Cc: x86@kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Inspried-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86 bpf_jit: fix a bug in emitting the 16-bit immediate operand of ANDzhuangfeiran@ict.ac.cn2012-04-131-1/+1
| | | | | | | | | | | | | [ Upstream commit 1d24fb3684f347226747c6b11ea426b7b992694e ] When K >= 0xFFFF0000, AND needs the two least significant bytes of K as its operand, but EMIT2() gives it the least significant byte of K and 0x2. EMIT() should be used here to replace EMIT2(). Signed-off-by: Feiran Zhuang <zhuangfeiran@ict.ac.cn> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86, tls: Off by one limit checkDan Carpenter2012-04-021-2/+2
| | | | | | | | | | | | | commit 8f0750f19789cf352d7e24a6cc50f2ab1b4f1372 upstream. These are used as offsets into an array of GDT_ENTRY_TLS_ENTRIES members so GDT_ENTRY_TLS_ENTRIES is one past the end of the array. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: http://lkml.kernel.org/r/20120324075250.GA28258@elgon.mountain Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86, tsc: Skip refined tsc calibration on systems with reliable TSCAlok Kataria2012-04-021-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 57779dc2b3b75bee05ef5d1ada47f615f7a13932 upstream. While running the latest Linux as guest under VMware in highly over-committed situations, we have seen cases when the refined TSC algorithm fails to get a valid tsc_start value in tsc_refine_calibration_work from multiple attempts. As a result the kernel keeps on scheduling the tsc_irqwork task for later. Subsequently after several attempts when it gets a valid start value it goes through the refined calibration and either bails out or uses the new results. Given that the kernel originally read the TSC frequency from the platform, which is the best it can get, I don't think there is much value in refining it. So for systems which get the TSC frequency from the platform we should skip the refined tsc algorithm. We can use the TSC_RELIABLE cpu cap flag to detect this, right now it is set only on VMware and for Moorestown Penwell both of which have there own TSC calibration methods. Signed-off-by: Alok N Kataria <akataria@vmware.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Dirk Brandewie <dirk.brandewie@gmail.com> Cc: Alan Cox <alan@linux.intel.com> [jstultz: Reworked to simply not schedule the refining work, rather then scheduling the work and bombing out later] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: bpf_jit: fix BPF_S_LDX_B_MSH compilationEric Dumazet2012-04-021-9/+5
| | | | | | | | | | | | | | | | | | | | [ Upstream commit dc72d99dabb870ca5bd6d9fff674be853bb4a88d ] Matt Evans spotted that x86 bpf_jit was incorrectly handling negative constant offsets in BPF_S_LDX_B_MSH instruction. We need to abort JIT compilation like we do in common_load so that filter uses the interpreter code and can call __load_pointer() Reference: http://lists.openwall.net/netdev/2011/07/19/11 Thanks to Indan Zupancic to bring back this issue. Reported-by: Matt Evans <matt@ozlabs.org> Reported-by: Indan Zupancic <indan@nul.nu> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86-32: Fix endless loop when processing signals for kernel tasksDmitry Adamushko2012-04-021-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 29a2e2836ff9ea65a603c89df217f4198973a74f upstream. The problem occurs on !CONFIG_VM86 kernels [1] when a kernel-mode task returns from a system call with a pending signal. A real-life scenario is a child of 'khelper' returning from a failed kernel_execve() in ____call_usermodehelper() [ kernel/kmod.c ]. kernel_execve() fails due to a pending SIGKILL, which is the result of "kill -9 -1" (at least, busybox's init does it upon reboot). The loop is as follows: * syscall_exit_work: - work_pending: // start_of_the_loop - work_notify_sig: - do_notify_resume() - do_signal() - if (!user_mode(regs)) return; - resume_userspace // TIF_SIGPENDING is still set - work_pending // so we call work_pending => goto // start_of_the_loop More information can be found in another LKML thread: http://www.serverphorums.com/read.php?12,457826 [1] the problem was also seen on MIPS. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Link: http://lkml.kernel.org/r/1332448765.2299.68.camel@dimm Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: x86: fix missing checks in syscall emulationStephan Bärwolf2012-04-022-0/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c2226fc9e87ba3da060e47333657cd6616652b84 upstream. On hosts without this patch, 32bit guests will crash (and 64bit guests may behave in a wrong way) for example by simply executing following nasm-demo-application: [bits 32] global _start SECTION .text _start: syscall (I tested it with winxp and linux - both always crashed) Disassembly of section .text: 00000000 <_start>: 0: 0f 05 syscall The reason seems a missing "invalid opcode"-trap (int6) for the syscall opcode "0f05", which is not available on Intel CPUs within non-longmodes, as also on some AMD CPUs within legacy-mode. (depending on CPU vendor, MSR_EFER and cpuid) Because previous mentioned OSs may not engage corresponding syscall target-registers (STAR, LSTAR, CSTAR), they remain NULL and (non trapping) syscalls are leading to multiple faults and finally crashs. Depending on the architecture (AMD or Intel) pretended by guests, various checks according to vendor's documentation are implemented to overcome the current issue and behave like the CPUs physical counterparts. [mtosatti: cleanup/beautify code] Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: x86: extend "struct x86_emulate_ops" with "get_cpuid"Stephan Bärwolf2012-04-022-0/+26
| | | | | | | | | | | | | | | | | commit bdb42f5afebe208eae90406959383856ae2caf2b upstream. In order to be able to proceed checks on CPU-specific properties within the emulator, function "get_cpuid" is introduced. With "get_cpuid" it is possible to virtually call the guests "cpuid"-opcode without changing the VM's context. [mtosatti: cleanup/beautify code] Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read modeAndrea Arcangeli2012-04-021-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1a5a9906d4e8d1976b701f889d8f35d54b928f25 upstream. In some cases it may happen that pmd_none_or_clear_bad() is called with the mmap_sem hold in read mode. In those cases the huge page faults can allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a false positive from pmd_bad() that will not like to see a pmd materializing as trans huge. It's not khugepaged causing the problem, khugepaged holds the mmap_sem in write mode (and all those sites must hold the mmap_sem in read mode to prevent pagetables to go away from under them, during code review it seems vm86 mode on 32bit kernels requires that too unless it's restricted to 1 thread per process or UP builds). The race is only with the huge pagefaults that can convert a pmd_none() into a pmd_trans_huge(). Effectively all these pmd_none_or_clear_bad() sites running with mmap_sem in read mode are somewhat speculative with the page faults, and the result is always undefined when they run simultaneously. This is probably why it wasn't common to run into this. For example if the madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page fault, the hugepage will not be zapped, if the page fault runs first it will be zapped. Altering pmd_bad() not to error out if it finds hugepmds won't be enough to fix this, because zap_pmd_range would then proceed to call zap_pte_range (which would be incorrect if the pmd become a pmd_trans_huge()). The simplest way to fix this is to read the pmd in the local stack (regardless of what we read, no need of actual CPU barriers, only compiler barrier needed), and be sure it is not changing under the code that computes its value. Even if the real pmd is changing under the value we hold on the stack, we don't care. If we actually end up in zap_pte_range it means the pmd was not none already and it was not huge, and it can't become huge from under us (khugepaged locking explained above). All we need is to enforce that there is no way anymore that in a code path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad can run into a hugepmd. The overhead of a barrier() is just a compiler tweak and should not be measurable (I only added it for THP builds). I don't exclude different compiler versions may have prevented the race too by caching the value of *pmd on the stack (that hasn't been verified, but it wouldn't be impossible considering pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines and there's no external function called in between pmd_trans_huge and pmd_none_or_clear_bad). if (pmd_trans_huge(*pmd)) { if (next-addr != HPAGE_PMD_SIZE) { VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem)); split_huge_page_pmd(vma->vm_mm, pmd); } else if (zap_huge_pmd(tlb, vma, pmd, addr)) continue; /* fall through */ } if (pmd_none_or_clear_bad(pmd)) Because this race condition could be exercised without special privileges this was reported in CVE-2012-1179. The race was identified and fully explained by Ulrich who debugged it. I'm quoting his accurate explanation below, for reference. ====== start quote ======= mapcount 0 page_mapcount 1 kernel BUG at mm/huge_memory.c:1384! At some point prior to the panic, a "bad pmd ..." message similar to the following is logged on the console: mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7). The "bad pmd ..." message is logged by pmd_clear_bad() before it clears the page's PMD table entry. 143 void pmd_clear_bad(pmd_t *pmd) 144 { -> 145 pmd_ERROR(*pmd); 146 pmd_clear(pmd); 147 } After the PMD table entry has been cleared, there is an inconsistency between the actual number of PMD table entries that are mapping the page and the page's map count (_mapcount field in struct page). When the page is subsequently reclaimed, __split_huge_page() detects this inconsistency. 1381 if (mapcount != page_mapcount(page)) 1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n", 1383 mapcount, page_mapcount(page)); -> 1384 BUG_ON(mapcount != page_mapcount(page)); The root cause of the problem is a race of two threads in a multithreaded process. Thread B incurs a page fault on a virtual address that has never been accessed (PMD entry is zero) while Thread A is executing an madvise() system call on a virtual address within the same 2 MB (huge page) range. virtual address space .---------------------. | | | | .-|---------------------| | | | | | |<-- B(fault) | | | 2 MB | |/////////////////////|-. huge < |/////////////////////| > A(range) page | |/////////////////////|-' | | | | | | '-|---------------------| | | | | '---------------------' - Thread A is executing an madvise(..., MADV_DONTNEED) system call on the virtual address range "A(range)" shown in the picture. sys_madvise // Acquire the semaphore in shared mode. down_read(&current->mm->mmap_sem) ... madvise_vma switch (behavior) case MADV_DONTNEED: madvise_dontneed zap_page_range unmap_vmas unmap_page_range zap_pud_range zap_pmd_range // // Assume that this huge page has never been accessed. // I.e. content of the PMD entry is zero (not mapped). // if (pmd_trans_huge(*pmd)) { // We don't get here due to the above assumption. } // // Assume that Thread B incurred a page fault and .---------> // sneaks in here as shown below. | // | if (pmd_none_or_clear_bad(pmd)) | { | if (unlikely(pmd_bad(*pmd))) | pmd_clear_bad | { | pmd_ERROR | // Log "bad pmd ..." message here. | pmd_clear | // Clear the page's PMD entry. | // Thread B incremented the map count | // in page_add_new_anon_rmap(), but | // now the page is no longer mapped | // by a PMD entry (-> inconsistency). | } | } | v - Thread B is handling a page fault on virtual address "B(fault)" shown in the picture. ... do_page_fault __do_page_fault // Acquire the semaphore in shared mode. down_read_trylock(&mm->mmap_sem) ... handle_mm_fault if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) // We get here due to the above assumption (PMD entry is zero). do_huge_pmd_anonymous_page alloc_hugepage_vma // Allocate a new transparent huge page here. ... __do_huge_pmd_anonymous_page ... spin_lock(&mm->page_table_lock) ... page_add_new_anon_rmap // Here we increment the page's map count (starts at -1). atomic_set(&page->_mapcount, 0) set_pmd_at // Here we set the page's PMD entry which will be cleared // when Thread A calls pmd_clear_bad(). ... spin_unlock(&mm->page_table_lock) The mmap_sem does not prevent the race because both threads are acquiring it in shared mode (down_read). Thread B holds the page_table_lock while the page's map count and PMD table entry are updated. However, Thread A does not synchronize on that lock. ====== end quote ======= [akpm@linux-foundation.org: checkpatch fixes] Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Jones <davej@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Mark Salter <msalter@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/ioapic: Add register level checks to detect bogus io-apic entriesSuresh Siddha2012-04-021-8/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 73d63d038ee9f769f5e5b46792d227fe20e442c5 upstream. With the recent changes to clear_IO_APIC_pin() which tries to clear remoteIRR bit explicitly, some of the users started to see "Unable to reset IRR for apic .." messages. Close look shows that these are related to bogus IO-APIC entries which return's all 1's for their io-apic registers. And the above mentioned error messages are benign. But kernel should have ignored such io-apic's in the first place. Check if register 0, 1, 2 of the listed io-apic are all 1's and ignore such io-apic. Reported-by: Álvaro Castillo <midgoon@gmail.com> Tested-by: Jon Dufresne <jon@jondufresne.org> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: yinghai@kernel.org Cc: kernel-team@fedoraproject.org Cc: Josh Boyer <jwboyer@redhat.com> Link: http://lkml.kernel.org/r/1331577393.31585.94.camel@sbsiddha-desk.sc.intel.com [ Performed minor cleanup of affected code. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86: Derandom delay_tsc for 64 bitThomas Gleixner2012-03-191-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a7f4255f906f60f72e00aad2fb000939449ff32e upstream. Commit f0fbf0abc093 ("x86: integrate delay functions") converted delay_tsc() into a random delay generator for 64 bit. The reason is that it merged the mostly identical versions of delay_32.c and delay_64.c. Though the subtle difference of the result was: static void delay_tsc(unsigned long loops) { - unsigned bclock, now; + unsigned long bclock, now; Now the function uses rdtscl() which returns the lower 32bit of the TSC. On 32bit that's not problematic as unsigned long is 32bit. On 64 bit this fails when the lower 32bit are close to wrap around when bclock is read, because the following check if ((now - bclock) >= loops) break; evaluated to true on 64bit for e.g. bclock = 0xffffffff and now = 0 because the unsigned long (now - bclock) of these values results in 0xffffffff00000001 which is definitely larger than the loops value. That explains Tvortkos observation: "Because I am seeing udelay(500) (_occasionally_) being short, and that by delaying for some duration between 0us (yep) and 491us." Make those variables explicitely u32 again, so this works for both 32 and 64 bit. Reported-by: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/amd: Fix L1i and L2 cache sharing information for AMD family 15h processorsAndreas Herrmann2012-02-291-8/+36
| | | | | | | | | | | | | | | | | | commit 32c3233885eb10ac9cb9410f2f8cd64b8df2b2a1 upstream. For L1 instruction cache and L2 cache the shared CPU information is wrong. On current AMD family 15h CPUs those caches are shared between both cores of a compute unit. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=42607 Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Petkov Borislav <Borislav.Petkov@amd.com> Cc: Dave Jones <davej@redhat.com> Link: http://lkml.kernel.org/r/20120208195229.GA17523@alberich.amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* i387: re-introduce FPU state preloading at context switch timeLinus Torvalds2012-02-294-42/+133
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 34ddc81a230b15c0e345b6b253049db731499f7e upstream. After all the FPU state cleanups and finally finding the problem that caused all our FPU save/restore problems, this re-introduces the preloading of FPU state that was removed in commit b3b0870ef3ff ("i387: do not preload FPU state at task switch time"). However, instead of simply reverting the removal, this reimplements preloading with several fixes, most notably - properly abstracted as a true FPU state switch, rather than as open-coded save and restore with various hacks. In particular, implementing it as a proper FPU state switch allows us to optimize the CR0.TS flag accesses: there is no reason to set the TS bit only to then almost immediately clear it again. CR0 accesses are quite slow and expensive, don't flip the bit back and forth for no good reason. - Make sure that the same model works for both x86-32 and x86-64, so that there are no gratuitous differences between the two due to the way they save and restore segment state differently due to architectural differences that really don't matter to the FPU state. - Avoid exposing the "preload" state to the context switch routines, and in particular allow the concept of lazy state restore: if nothing else has used the FPU in the meantime, and the process is still on the same CPU, we can avoid restoring state from memory entirely, just re-expose the state that is still in the FPU unit. That optimized lazy restore isn't actually implemented here, but the infrastructure is set up for it. Of course, older CPU's that use 'fnsave' to save the state cannot take advantage of this, since the state saving also trashes the state. In other words, there is now an actual _design_ to the FPU state saving, rather than just random historical baggage. Hopefully it's easier to follow as a result. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* i387: move TS_USEDFPU flag from thread_info to task_structLinus Torvalds2012-02-296-32/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f94edacf998516ac9d849f7bc6949a703977a7f3 upstream. This moves the bit that indicates whether a thread has ownership of the FPU from the TS_USEDFPU bit in thread_info->status to a word of its own (called 'has_fpu') in task_struct->thread.has_fpu. This fixes two independent bugs at the same time: - changing 'thread_info->status' from the scheduler causes nasty problems for the other users of that variable, since it is defined to be thread-synchronous (that's what the "TS_" part of the naming was supposed to indicate). So perfectly valid code could (and did) do ti->status |= TS_RESTORE_SIGMASK; and the compiler was free to do that as separate load, or and store instructions. Which can cause problems with preemption, since a task switch could happen in between, and change the TS_USEDFPU bit. The change to TS_USEDFPU would be overwritten by the final store. In practice, this seldom happened, though, because the 'status' field was seldom used more than once, so gcc would generally tend to generate code that used a read-modify-write instruction and thus happened to avoid this problem - RMW instructions are naturally low fat and preemption-safe. - On x86-32, the current_thread_info() pointer would, during interrupts and softirqs, point to a *copy* of the real thread_info, because x86-32 uses %esp to calculate the thread_info address, and thus the separate irq (and softirq) stacks would cause these kinds of odd thread_info copy aliases. This is normally not a problem, since interrupts aren't supposed to look at thread information anyway (what thread is running at interrupt time really isn't very well-defined), but it confused the heck out of irq_fpu_usable() and the code that tried to squirrel away the FPU state. (It also caused untold confusion for us poor kernel developers). It also turns out that using 'task_struct' is actually much more natural for most of the call sites that care about the FPU state, since they tend to work with the task struct for other reasons anyway (ie scheduling). And the FPU data that we are going to save/restore is found there too. Thanks to Arjan Van De Ven <arjan@linux.intel.com> for pointing us to the %esp issue. Cc: Arjan van de Ven <arjan@linux.intel.com> Reported-and-tested-by: Raphael Prevost <raphael@buro.asia> Acked-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com> Tested-by: Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* i387: move AMD K7/K8 fpu fxsave/fxrstor workaround from save to restoreLinus Torvalds2012-02-293-22/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4903062b5485f0e2c286a23b44c9b59d9b017d53 upstream. The AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception is pending. In order to not leak FIP state from one process to another, we need to do a floating point load after the fxsave of the old process, and before the fxrstor of the new FPU state. That resets the state to the (uninteresting) kernel load, rather than some potentially sensitive user information. We used to do this directly after the FPU state save, but that is actually very inconvenient, since it (a) corrupts what is potentially perfectly good FPU state that we might want to lazy avoid restoring later and (b) on x86-64 it resulted in a very annoying ordering constraint, where "__unlazy_fpu()" in the task switch needs to be delayed until after the DS segment has been reloaded just to get the new DS value. Coupling it to the fxrstor instead of the fxsave automatically avoids both of these issues, and also ensures that we only do it when actually necessary (the FP state after a save may never actually get used). It's simply a much more natural place for the leaked state cleanup. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* i387: do not preload FPU state at task switch timeLinus Torvalds2012-02-294-68/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit b3b0870ef3ffed72b92415423da864f440f57ad6 upstream. Yes, taking the trap to re-load the FPU/MMX state is expensive, but so is spending several days looking for a bug in the state save/restore code. And the preload code has some rather subtle interactions with both paravirtualization support and segment state restore, so it's not nearly as simple as it should be. Also, now that we no longer necessarily depend on a single bit (ie TS_USEDFPU) for keeping track of the state of the FPU, we migth be able to do better. If we are really switching between two processes that keep touching the FP state, save/restore is inevitable, but in the case of having one process that does most of the FPU usage, we may actually be able to do much better than the preloading. In particular, we may be able to keep track of which CPU the process ran on last, and also per CPU keep track of which process' FP state that CPU has. For modern CPU's that don't destroy the FPU contents on save time, that would allow us to do a lazy restore by just re-enabling the existing FPU state - with no restore cost at all! Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* i387: don't ever touch TS_USEDFPU directly, use helper functionsLinus Torvalds2012-02-294-23/+58
| | | | | | | | | | | | | | | | | | | | | | | | | commit 6d59d7a9f5b723a7ac1925c136e93ec83c0c3043 upstream. This creates three helper functions that do the TS_USEDFPU accesses, and makes everybody that used to do it by hand use those helpers instead. In addition, there's a couple of helper functions for the "change both CR0.TS and TS_USEDFPU at the same time" case, and the places that do that together have been changed to use those. That means that we have fewer random places that open-code this situation. The intent is partly to clarify the code without actually changing any semantics yet (since we clearly still have some hard to reproduce bug in this area), but also to make it much easier to use another approach entirely to caching the CR0.TS bit for software accesses. Right now we use a bit in the thread-info 'status' variable (this patch does not change that), but we might want to make it a full field of its own or even make it a per-cpu variable. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* i387: move TS_USEDFPU clearing out of __save_init_fpu and into callersLinus Torvalds2012-02-291-3/+6
| | | | | | | | | | | | | commit b6c66418dcad0fcf83cd1d0a39482db37bf4fc41 upstream. Touching TS_USEDFPU without touching CR0.TS is confusing, so don't do it. By moving it into the callers, we always do the TS_USEDFPU next to the CR0.TS accesses in the source code, and it's much easier to see how the two go hand in hand. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>