aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* ipv6: update ip6_rt_last_gc every time GC is runMichal Kubeček2015-10-132-4/+6
| | | | | | | | | | | | | | commit 49a18d86f66d33a20144ecb5a34bba0d1856b260 upstream. As pointed out by Eric Dumazet, net->ipv6.ip6_rt_last_gc should hold the last time garbage collector was run so that we should update it whenever fib6_run_gc() calls fib6_clean_all(), not only if we got there from ip6_dst_gc(). Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
* ipv6: prevent fib6_run_gc() contentionMichal Kubeček2015-10-134-16/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2ac3ac8f86f2fe065d746d9a9abaca867adec577 upstream. On a high-traffic router with many processors and many IPv6 dst entries, soft lockup in fib6_run_gc() can occur when number of entries reaches gc_thresh. This happens because fib6_run_gc() uses fib6_gc_lock to allow only one thread to run the garbage collector but ip6_dst_gc() doesn't update net->ipv6.ip6_rt_last_gc until fib6_run_gc() returns. On a system with many entries, this can take some time so that in the meantime, other threads pass the tests in ip6_dst_gc() (ip6_rt_last_gc is still not updated) and wait for the lock. They then have to run the garbage collector one after another which blocks them for quite long. Resolve this by replacing special value ~0UL of expire parameter to fib6_run_gc() by explicit "force" parameter to choose between spin_lock_bh() and spin_trylock_bh() and call fib6_run_gc() with force=false if gc_thresh is reached but not max_size. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
* perf tools: Fix build with perl 5.18Kirill A. Shutemov2015-10-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 575bf1d04e908469d26da424b52fc1b12a1db9d8 upstream. perl.h from new Perl release doesn't like -Wundef and -Wswitch-default: /usr/lib/perl5/core_perl/CORE/perl.h:548:5: error: "SILENT_NO_TAINT_SUPPORT" is not defined [-Werror=undef] #if SILENT_NO_TAINT_SUPPORT && !defined(NO_TAINT_SUPPORT) ^ /usr/lib/perl5/core_perl/CORE/perl.h:556:5: error: "NO_TAINT_SUPPORT" is not defined [-Werror=undef] #if NO_TAINT_SUPPORT ^ In file included from /usr/lib/perl5/core_perl/CORE/perl.h:3471:0, from util/scripting-engines/trace-event-perl.c:30: /usr/lib/perl5/core_perl/CORE/sv.h:1455:5: error: "NO_TAINT_SUPPORT" is not defined [-Werror=undef] #if NO_TAINT_SUPPORT ^ In file included from /usr/lib/perl5/core_perl/CORE/perl.h:3472:0, from util/scripting-engines/trace-event-perl.c:30: /usr/lib/perl5/core_perl/CORE/regexp.h:436:5: error: "NO_TAINT_SUPPORT" is not defined [-Werror=undef] #if NO_TAINT_SUPPORT ^ In file included from /usr/lib/perl5/core_perl/CORE/hv.h:592:0, from /usr/lib/perl5/core_perl/CORE/perl.h:3480, from util/scripting-engines/trace-event-perl.c:30: /usr/lib/perl5/core_perl/CORE/hv_func.h: In function ‘S_perl_hash_siphash_2_4’: /usr/lib/perl5/core_perl/CORE/hv_func.h:222:3: error: switch missing default case [-Werror=switch-default] switch( left ) ^ /usr/lib/perl5/core_perl/CORE/hv_func.h: In function ‘S_perl_hash_superfast’: /usr/lib/perl5/core_perl/CORE/hv_func.h:274:5: error: switch missing default case [-Werror=switch-default] switch (rem) { \ ^ /usr/lib/perl5/core_perl/CORE/hv_func.h: In function ‘S_perl_hash_murmur3’: /usr/lib/perl5/core_perl/CORE/hv_func.h:398:5: error: switch missing default case [-Werror=switch-default] switch(bytes_in_carry) { /* how many bytes in carry */ ^ Let's disable the warnings for code which uses perl.h. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1372063394-20126-1-git-send-email-kirill@shutemov.name Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Vinson Lee <vlee@twopensource.com>
* fib_rules: fix fib rule dumps across multiple skbsWilson Kok2015-10-131-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 41fc014332d91ee90c32840bf161f9685b7fbf2b ] dump_rules returns skb length and not error. But when family == AF_UNSPEC, the caller of dump_rules assumes that it returns an error. Hence, when family == AF_UNSPEC, we continue trying to dump on -EMSGSIZE errors resulting in incorrect dump idx carried between skbs belonging to the same dump. This results in fib rule dump always only dumping rules that fit into the first skb. This patch fixes dump_rules to return error so that we exit correctly and idx is correctly maintained between skbs that are part of the same dump. Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 3.2: - s/portid/pid/ - Check whether fib_nl_fill_rule() returns < 0, as it may return > 0 on success (thanks to Roland Dreier)] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Roland Dreier <roland@purestorage.com>
* net/ipv6: Correct PIM6 mrt_lock handlingRichard Laing2015-10-131-1/+1
| | | | | | | | | | | | | | | | [ Upstream commit 25b4a44c19c83d98e8c0807a7ede07c1f28eab8b ] In the IPv6 multicast routing code the mrt_lock was not being released correctly in the MFC iterator, as a result adding or deleting a MIF would cause a hang because the mrt_lock could not be acquired. This fix is a copy of the code for the IPv4 case and ensures that the lock is released correctly. Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* bonding: correct the MAC address for "follow" fail_over_mac policydingtianhong2015-10-131-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a951bc1e6ba58f11df5ed5ddc41311e10f5fd20b ] The "follow" fail_over_mac policy is useful for multiport devices that either become confused or incur a performance penalty when multiple ports are programmed with the same MAC address, but the same MAC address still may happened by this steps for this policy: 1) echo +eth0 > /sys/class/net/bond0/bonding/slaves bond0 has the same mac address with eth0, it is MAC1. 2) echo +eth1 > /sys/class/net/bond0/bonding/slaves eth1 is backup, eth1 has MAC2. 3) ifconfig eth0 down eth1 became active slave, bond will swap MAC for eth0 and eth1, so eth1 has MAC1, and eth0 has MAC2. 4) ifconfig eth1 down there is no active slave, and eth1 still has MAC1, eth2 has MAC2. 5) ifconfig eth0 up the eth0 became active slave again, the bond set eth0 to MAC1. Something wrong here, then if you set eth1 up, the eth0 and eth1 will have the same MAC address, it will break this policy for ACTIVE_BACKUP mode. This patch will fix this problem by finding the old active slave and swap them MAC address before change active slave. Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 3.2: - bond_for_each_slave() takes an extra int paramter - Use compare_ether_addr() instead of ether_addr_equal()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ipv6: lock socket in ip6_datagram_connect()Eric Dumazet2015-10-133-9/+28
| | | | | | | | | | | | | | | | [ Upstream commit 03645a11a570d52e70631838cb786eb4253eb463 ] ip6_datagram_connect() is doing a lot of socket changes without socket being locked. This looks wrong, at least for udp_lib_rehash() which could corrupt lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* net: Fix skb csum races when peekingHerbert Xu2015-10-131-1/+2
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 89c22d8c3b278212eef6a8cc66b570bc840a6f5a ] When we calculate the checksum on the recv path, we store the result in the skb as an optimisation in case we need the checksum again down the line. This is in fact bogus for the MSG_PEEK case as this is done without any locking. So multiple threads can peek and then store the result to the same skb, potentially resulting in bogus skb states. This patch fixes this by only storing the result if the skb is not shared. This preserves the optimisations for the few cases where it can be done safely due to locking or other reasons, e.g., SIOCINQ. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* net: pktgen: fix race between pktgen_thread_worker() and kthread_stop()Oleg Nesterov2015-10-131-1/+3
| | | | | | | | | | | | | [ Upstream commit fecdf8be2d91e04b0a9a4f79ff06499a36f5d14f ] pktgen_thread_worker() is obviously racy, kthread_stop() can come between the kthread_should_stop() check and set_current_state(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Jan Stancek <jstancek@redhat.com> Reported-by: Marcelo Leitner <mleitner@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* net/tipc: initialize security state for new connection socketStephen Smalley2015-10-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit fdd75ea8df370f206a8163786e7470c1277a5064 ] Calling connect() with an AF_TIPC socket would trigger a series of error messages from SELinux along the lines of: SELinux: Invalid class 0 type=AVC msg=audit(1434126658.487:34500): avc: denied { <unprintable> } for pid=292 comm="kworker/u16:5" scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=<unprintable> permissive=0 This was due to a failure to initialize the security state of the new connection sock by the tipc code, leaving it with junk in the security class field and an unlabeled secid. Add a call to security_sk_clone() to inherit the security state from the parent socket. Reported-by: Tim Shearer <tim.shearer@overturenetworks.com> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 3.2: adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* Initialize msg/shm IPC objects before doing ipc_addid()Linus Torvalds2015-10-133-19/+20
| | | | | | | | | | | | | | | | | | | | | | | commit b9a532277938798b53178d5a66af6e2915cb27cf upstream. As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before having initialized the IPC object state. Yes, we initialize the IPC object in a locked state, but with all the lockless RCU lookup work, that IPC object lock no longer means that the state cannot be seen. We already did this for the IPC semaphore code (see commit e8577d1f0329: "ipc/sem.c: fully initialize sem_array before making it visible") but we clearly forgot about msg and shm. Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: - Adjust context - The error path being moved looks a little different] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ipc/sem.c: fully initialize sem_array before making it visibleManfred Spraul2015-10-131-8/+9
| | | | | | | | | | | | | | | | | | | | | | | | commit e8577d1f0329d4842e8302e289fb2c22156abef4 upstream. ipc_addid() makes a new ipc identifier visible to everyone. New objects start as locked, so that the caller can complete the initialization after the call. Within struct sem_array, at least sma->sem_base and sma->sem_nsems are accessed without any locks, therefore this approach doesn't work. Thus: Move the ipc_addid() to the end of the initialization. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Rik van Riel <riel@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: - Adjust context - The error path being moved looks a little different] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* RDS: verify the underlying transport exists before creating a connectionSasha Levin2015-10-131-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 74e98eb085889b0d2d4908f59f6e00026063014f upstream. There was no verification that an underlying transport exists when creating a connection, this would cause dereferencing a NULL ptr. It might happen on sockets that weren't properly bound before attempting to send a message, which will cause a NULL ptr deref: [135546.047719] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN [135546.051270] Modules linked in: [135546.051781] CPU: 4 PID: 15650 Comm: trinity-c4 Not tainted 4.2.0-next-20150902-sasha-00041-gbaa1222-dirty #2527 [135546.053217] task: ffff8800835bc000 ti: ffff8800bc708000 task.ti: ffff8800bc708000 [135546.054291] RIP: __rds_conn_create (net/rds/connection.c:194) [135546.055666] RSP: 0018:ffff8800bc70fab0 EFLAGS: 00010202 [135546.056457] RAX: dffffc0000000000 RBX: 0000000000000f2c RCX: ffff8800835bc000 [135546.057494] RDX: 0000000000000007 RSI: ffff8800835bccd8 RDI: 0000000000000038 [135546.058530] RBP: ffff8800bc70fb18 R08: 0000000000000001 R09: 0000000000000000 [135546.059556] R10: ffffed014d7a3a23 R11: ffffed014d7a3a21 R12: 0000000000000000 [135546.060614] R13: 0000000000000001 R14: ffff8801ec3d0000 R15: 0000000000000000 [135546.061668] FS: 00007faad4ffb700(0000) GS:ffff880252000000(0000) knlGS:0000000000000000 [135546.062836] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [135546.063682] CR2: 000000000000846a CR3: 000000009d137000 CR4: 00000000000006a0 [135546.064723] Stack: [135546.065048] ffffffffafe2055c ffffffffafe23fc1 ffffed00493097bf ffff8801ec3d0008 [135546.066247] 0000000000000000 00000000000000d0 0000000000000000 ac194a24c0586342 [135546.067438] 1ffff100178e1f78 ffff880320581b00 ffff8800bc70fdd0 ffff880320581b00 [135546.068629] Call Trace: [135546.069028] ? __rds_conn_create (include/linux/rcupdate.h:856 net/rds/connection.c:134) [135546.069989] ? rds_message_copy_from_user (net/rds/message.c:298) [135546.071021] rds_conn_create_outgoing (net/rds/connection.c:278) [135546.071981] rds_sendmsg (net/rds/send.c:1058) [135546.072858] ? perf_trace_lock (include/trace/events/lock.h:38) [135546.073744] ? lockdep_init (kernel/locking/lockdep.c:3298) [135546.074577] ? rds_send_drop_to (net/rds/send.c:976) [135546.075508] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3795) [135546.076349] ? __might_fault (mm/memory.c:3795) [135546.077179] ? rds_send_drop_to (net/rds/send.c:976) [135546.078114] sock_sendmsg (net/socket.c:611 net/socket.c:620) [135546.078856] SYSC_sendto (net/socket.c:1657) [135546.079596] ? SYSC_connect (net/socket.c:1628) [135546.080510] ? trace_dump_stack (kernel/trace/trace.c:1926) [135546.081397] ? ring_buffer_unlock_commit (kernel/trace/ring_buffer.c:2479 kernel/trace/ring_buffer.c:2558 kernel/trace/ring_buffer.c:2674) [135546.082390] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749) [135546.083410] ? trace_event_raw_event_sys_enter (include/trace/events/syscalls.h:16) [135546.084481] ? do_audit_syscall_entry (include/trace/events/syscalls.h:16) [135546.085438] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749) [135546.085515] rds_ib_laddr_check(): addr 36.74.25.172 ret -99 node type -1 Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* virtio-net: drop NETIF_F_FRAGLISTJason Wang2015-10-131-2/+2
| | | | | | | | | | | | | | | | | | | commit 48900cb6af4282fa0fb6ff4d72a81aa3dadb5c39 upstream. virtio declares support for NETIF_F_FRAGLIST, but assumes that there are at most MAX_SKB_FRAGS + 2 fragments which isn't always true with a fraglist. A longer fraglist in the skb will make the call to skb_to_sgvec overflow the sg array, leading to memory corruption. Drop NETIF_F_FRAGLIST so we only get what we can handle. Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ipv6: addrconf: validate new MTU before applying itMarcelo Leitner2015-10-131-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | commit 77751427a1ff25b27d47a4c36b12c3c8667855ac upstream. Currently we don't check if the new MTU is valid or not and this allows one to configure a smaller than minimum allowed by RFCs or even bigger than interface own MTU, which is a problem as it may lead to packet drops. If you have a daemon like NetworkManager running, this may be exploited by remote attackers by forging RA packets with an invalid MTU, possibly leading to a DoS. (NetworkManager currently only validates for values too small, but not for too big ones.) The fix is just to make sure the new value is valid. That is, between IPV6_MIN_MTU and interface's MTU. Note that similar check is already performed at ndisc_router_discovery(), for when kernel itself parses the RA. Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* md: use kzalloc() when bitmap is disabledBenjamin Randazzo2015-10-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b6878d9e03043695dbf3fa1caa6dfc09db225b16 upstream. In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a mdu_bitmap_file_t called "file". 5769 file = kmalloc(sizeof(*file), GFP_NOIO); 5770 if (!file) 5771 return -ENOMEM; This structure is copied to user space at the end of the function. 5786 if (err == 0 && 5787 copy_to_user(arg, file, sizeof(*file))) 5788 err = -EFAULT But if bitmap is disabled only the first byte of "file" is initialized with zero, so it's possible to read some bytes (up to 4095) of kernel space memory from user space. This is an information leak. 5775 /* bitmap disabled, zero the first byte and copy out */ 5776 if (!mddev->bitmap_info.file) 5777 file->pathname[0] = '\0'; Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr> Signed-off-by: NeilBrown <neilb@suse.com> [bwh: Backported to 3.2: patch both possible allocation calls] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* USB: whiteheat: fix potential null-deref at probeJohan Hovold2015-10-131-0/+31
| | | | | | | | | | | | | | | | | | | commit cbb4be652d374f64661137756b8f357a1827d6a4 upstream. Fix potential null-pointer dereference at probe by making sure that the required endpoints are present. The whiteheat driver assumes there are at least five pairs of bulk endpoints, of which the final pair is used for the "command port". An attempt to bind to an interface with fewer bulk endpoints would currently lead to an oops. Fixes CVE-2015-5257. Reported-by: Moein Ghasemzadeh <moein@istuary.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ocfs2/dlm: fix deadlock when dispatch assert masterJoseph Qi2015-10-132-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | commit 012572d4fc2e4ddd5c8ec8614d51414ec6cae02a upstream. The order of the following three spinlocks should be: dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock But dlm_dispatch_assert_master() is called while holding dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls dlm_grab() which will take dlm_domain_lock. Once another thread (for example, dlm_query_join_handler) has already taken dlm_domain_lock, and tries to take dlm_ctxt->spinlock deadlock happens. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: "Junxiao Bi" <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86/paravirt: Replace the paravirt nop with a bona fide empty functionAndy Lutomirski2015-10-132-4/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit fc57a7c68020dcf954428869eafd934c0ab1536f upstream. PARAVIRT_ADJUST_EXCEPTION_FRAME generates this code (using nmi as an example, trimmed for readability): ff 15 00 00 00 00 callq *0x0(%rip) # 2796 <nmi+0x6> 2792: R_X86_64_PC32 pv_irq_ops+0x2c That's a call through a function pointer to regular C function that does nothing on native boots, but that function isn't protected against kprobes, isn't marked notrace, and is certainly not guaranteed to preserve any registers if the compiler is feeling perverse. This is bad news for a CLBR_NONE operation. Of course, if everything works correctly, once paravirt ops are patched, it gets nopped out, but what if we hit this code before paravirt ops are patched in? This can potentially cause breakage that is very difficult to debug. A more subtle failure is possible here, too: if _paravirt_nop uses the stack at all (even just to push RBP), it will overwrite the "NMI executing" variable if it's called in the NMI prologue. The Xen case, perhaps surprisingly, is fine, because it's already written in asm. Fix all of the cases that default to paravirt_nop (including adjust_exception_frame) with a big hammer: replace paravirt_nop with an asm function that is just a ret instruction. The Xen case may have other problems, so document them. This is part of a fix for some random crashes that Sasha saw. Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Link: http://lkml.kernel.org/r/8f5d2ba295f9d73751c33d97fda03e0495d9ade0.1442791737.git.luto@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* cifs: use server timestamp for ntlmv2 authenticationPeter Seiderer2015-10-131-1/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 98ce94c8df762d413b3ecb849e2b966b21606d04 upstream. Linux cifs mount with ntlmssp against an Mac OS X (Yosemite 10.10.5) share fails in case the clocks differ more than +/-2h: digest-service: digest-request: od failed with 2 proto=ntlmv2 digest-service: digest-request: kdc failed with -1561745592 proto=ntlmv2 Fix this by (re-)using the given server timestamp for the ntlmv2 authentication (as Windows 7 does). A related problem was also reported earlier by Namjae Jaen (see below): Windows machine has extended security feature which refuse to allow authentication when there is time difference between server time and client time when ntlmv2 negotiation is used. This problem is prevalent in embedded enviornment where system time is set to default 1970. Modern servers send the server timestamp in the TargetInfo Av_Pair structure in the challenge message [see MS-NLMP 2.2.2.1] In [MS-NLMP 3.1.5.1.2] it is explicitly mentioned that the client must use the server provided timestamp if present OR current time if it is not Reported-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Peter Seiderer <ps.report@gmx.net> Signed-off-by: Steve French <smfrench@gmail.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* xhci: change xhci 1.0 only restrictions to support xhci 1.1Mathias Nyman2015-10-132-5/+5
| | | | | | | | | | | | | | | commit dca7794539eff04b786fb6907186989e5eaaa9c2 upstream. Some changes between xhci 0.96 and xhci 1.0 specifications forced us to check the hci version in code, some of these checks were implemented as hci_version == 1.0, which will not work with new xhci 1.1 controllers. xhci 1.1 behaves similar to xhci 1.0 in these cases, so change these checks to hci_version >= 1.0 Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* usb: xhci: Clear XHCI_STATE_DYING on startRoger Quadros2015-10-131-1/+2
| | | | | | | | | | | | | commit e5bfeab0ad515b4f6df39fe716603e9dc6d3dfd0 upstream. For whatever reason if XHCI died in the previous instant then it will never recover on the next xhci_start unless we clear the DYING flag. Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* xhci: give command abortion one more chance before killing xhciMathias Nyman2015-10-131-0/+9
| | | | | | | | | | | | | commit a6809ffd1687b3a8c192960e69add559b9d32649 upstream. We want to give the command abortion an additional try to stop the command ring before we completely hose xhci. Tested-by: Vincent Pelletier <plr.vincent@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 3.2: call handshake() rather than xhci_handshake()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* usb: Use the USB_SS_MULT() macro to get the burst multiplier.Mathias Nyman2015-10-131-2/+3
| | | | | | | | | | | | | | | | | | commit ff30cbc8da425754e8ab96904db1d295bd034f27 upstream. Bits 1:0 of the bmAttributes are used for the burst multiplier. The rest of the bits used to be reserved (zero), but USB3.1 takes bit 7 into use. Use the existing USB_SS_MULT() macro instead to make sure the mult value and hence max packet calculations are correct for USB3.1 devices. Note that burst multiplier in bmAttributes is zero based and that the USB_SS_MULT() macro adds one. Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* KVM: x86: trap AMD MSRs for the TSeg base and maskPaolo Bonzini2015-10-132-0/+3
| | | | | | | | | | | | | | | commit 3afb1121800128aae9f5722e50097fcf1a9d4d88 upstream. These have roughly the same purpose as the SMRR, which we do not need to implement in KVM. However, Linux accesses MSR_K8_TSEG_ADDR at boot, which causes problems when running a Xen dom0 under KVM. Just return 0, meaning that processor protection of SMRAM is not in effect. Reported-by: M A Young <m.a.young@durham.ac.uk> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* s390/compat: correct uc_sigmask of the compat signal frameMartin Schwidefsky2015-10-131-4/+26
| | | | | | | | | | | | | | | | | | | | | commit 8d4bd0ed0439dfc780aab801a085961925ed6838 upstream. The uc_sigmask in the ucontext structure is an array of words to keep the 64 signal bits (or 1024 if you ask glibc but the kernel sigset_t only has 64 bits). For 64 bit the sigset_t contains a single 8 byte word, but for 31 bit there are two 4 byte words. The compat signal handler code uses a simple copy of the 64 bit sigset_t to the 31 bit compat_sigset_t. As s390 is a big-endian architecture this is incorrect, the two words in the 31 bit sigset_t array need to be swapped. Reported-by: Stefan Liebler <stli@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [bwh: Backported to 3.2: - Introduce local compat_sigset_t in setup_frame32() - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ASoC: fix broken pxa SoC supportRobert Jarzmik2015-10-132-9/+8
| | | | | | | | | | | | | | | | | | | | | | | commit 3c8f7710c1c44fb650bc29b6ef78ed8b60cfaa28 upstream. The previous fix of pxa library support, which was introduced to fix the library dependency, broke the previous SoC behavior, where a machine code binding pxa2xx-ac97 with a coded relied on : - sound/soc/pxa/pxa2xx-ac97.c - sound/soc/codecs/XXX.c For example, the mioa701_wm9713.c machine code is currently broken. The "select ARM" statement wrongly selects the soc/arm/pxa2xx-ac97 for compilation, as per an unfortunate fate SND_PXA2XX_AC97 is both declared in sound/arm/Kconfig and sound/soc/pxa/Kconfig. Fix this by ensuring that SND_PXA2XX_SOC correctly triggers the correct pxa2xx-ac97 compilation. Fixes: 846172dfe33c ("ASoC: fix SND_PXA2XX_LIB Kconfig warning") Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86/platform: Fix Geode LX timekeeping in the generic x86 buildDavid Woodhouse2015-10-131-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 03da3ff1cfcd7774c8780d2547ba0d995f7dc03d upstream. In 2007, commit 07190a08eef36 ("Mark TSC on GeodeLX reliable") bypassed verification of the TSC on Geode LX. However, this code (now in the check_system_tsc_reliable() function in arch/x86/kernel/tsc.c) was only present if CONFIG_MGEODE_LX was set. OpenWRT has recently started building its generic Geode target for Geode GX, not LX, to include support for additional platforms. This broke the timekeeping on LX-based devices, because the TSC wasn't marked as reliable: https://dev.openwrt.org/ticket/20531 By adding a runtime check on is_geode_lx(), we can also include the fix if CONFIG_MGEODEGX1 or CONFIG_X86_GENERIC are set, thus fixing the problem. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Andres Salomon <dilinger@queued.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1442409003.131189.87.camel@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: fix Thumb2 signal handling when ARMv6 is enabledRussell King2015-10-131-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | commit 9b55613f42e8d40d5c9ccb8970bde6af4764b2ab upstream. When a kernel is built covering ARMv6 to ARMv7, we omit to clear the IT state when entering a signal handler. This can cause the first few instructions to be conditionally executed depending on the parent context. In any case, the original test for >= ARMv7 is broken - ARMv6 can have Thumb-2 support as well, and an ARMv6T2 specific build would omit this code too. Relax the test back to ARMv6 or greater. This results in us always clearing the IT state bits in the PSR, even on CPUs where these bits are reserved. However, they're reserved for the IT state, so this should cause no harm. Fixes: d71e1352e240 ("Clear the IT state when invoking a Thumb-2 signal handler") Acked-by: Tony Lindgren <tony@atomide.com> Tested-by: H. Nikolaus Schaller <hns@goldelico.com> Tested-by: Grazvydas Ignotas <notasas@gmail.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: 7880/1: Clear the IT state independent of the Thumb-2 modeT.J. Purtell2015-10-131-4/+10
| | | | | | | | | | | | | | | commit 6ecf830e5029598732e04067e325d946097519cb upstream. The ARM architecture reference specifies that the IT state bits in the PSR must be all zeros in ARM mode or behavior is unspecified. On the Qualcomm Snapdragon S4/Krait architecture CPUs the processor continues to consider the IT state bits while in ARM mode. This makes it so that some instructions are skipped by the CPU. Signed-off-by: T.J. Purtell <tj@mobisocial.us> [rmk+kernel@arm.linux.org.uk: fixed whitespace formatting in patch] Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: skip waiting on ordered range for special filesJeff Mahoney2015-10-131-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a30e577c96f59b1e1678ea5462432b09bf7d5cbc upstream. In btrfs_evict_inode, we properly truncate the page cache for evicted inodes but then we call btrfs_wait_ordered_range for every inode as well. It's the right thing to do for regular files but results in incorrect behavior for device inodes for block devices. filemap_fdatawrite_range gets called with inode->i_mapping which gets resolved to the block device inode before getting passed to wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens next depends on whether there's an open file handle associated with the inode. If there is, we write to the block device, which is unexpected behavior. If there isn't, we through normally and inode->i_data is used. We can also end up racing against open/close which can result in crashes when i_mapping points to a block device inode that has been closed. Since there can't be any page cache associated with special file inodes, it's safe to skip the btrfs_wait_ordered_range call entirely and avoid the problem. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911 Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* Btrfs: fix read corruption of compressed and shared extentsFilipe Manana2015-10-131-7/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 005efedf2c7d0a270ffbe28d8997b03844f3e3e7 upstream. If a file has a range pointing to a compressed extent, followed by another range that points to the same compressed extent and a read operation attempts to read both ranges (either completely or part of them), the pages that correspond to the second range are incorrectly filled with zeroes. Consider the following example: File layout [0 - 8K] [8K - 24K] | | | | points to extent X, points to extent X, offset 4K, length of 8K offset 0, length 16K [extent X, compressed length = 4K uncompressed length = 16K] If a readpages() call spans the 2 ranges, a single bio to read the extent is submitted - extent_io.c:submit_extent_page() would only create a new bio to cover the second range pointing to the extent if the extent it points to had a different logical address than the extent associated with the first range. This has a consequence of the compressed read end io handler (compression.c:end_compressed_bio_read()) finish once the extent is decompressed into the pages covering the first range, leaving the remaining pages (belonging to the second range) filled with zeroes (done by compression.c:btrfs_clear_biovec_end()). So fix this by submitting the current bio whenever we find a range pointing to a compressed extent that was preceded by a range with a different extent map. This is the simplest solution for this corner case. Making the end io callback populate both ranges (or more, if we have multiple pointing to the same extent) is a much more complex solution since each bio is tightly coupled with a single extent map and the extent maps associated to the ranges pointing to the shared extent can have different offsets and lengths. The following test case for fstests triggers the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full test_clone_and_read_compressed_extent() { local mount_opts=$1 _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount $mount_opts # Create a test file with a single extent that is compressed (the # data we write into it is highly compressible no matter which # compression algorithm is used, zlib or lzo). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \ -c "pwrite -S 0xbb 4K 8K" \ -c "pwrite -S 0xcc 12K 4K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now clone our extent into an adjacent offset. $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo # Same as before but for this file we clone the extent into a lower # file offset. $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \ -c "pwrite -S 0xbb 12K 8K" \ -c "pwrite -S 0xcc 20K 4K" \ $SCRATCH_MNT/bar | _filter_xfs_io $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \ $SCRATCH_MNT/bar $SCRATCH_MNT/bar echo "File digests before unmounting filesystem:" md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch # Evicting the inode or clearing the page cache before reading # again the file would also trigger the bug - reads were returning # all bytes in the range corresponding to the second reference to # the extent with a value of 0, but the correct data was persisted # (it was a bug exclusively in the read path). The issue happened # only if the same readpages() call targeted pages belonging to the # first and second ranges that point to the same compressed extent. _scratch_remount echo "File digests after mounting filesystem again:" # Must match the same digests we got before. md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch } echo -e "\nTesting with zlib compression..." test_clone_and_read_compressed_extent "-o compress=zlib" _scratch_unmount echo -e "\nTesting with lzo compression..." test_clone_and_read_compressed_extent "-o compress=lzo" status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> [bwh: Backported to 3.2: - Maintain prev_em_start in both functions calling __extent_read_full_page() in a loop - Adjust context and order] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* USB: option: add ZTE PIDsLiu.Zhao2015-10-131-0/+24
| | | | | | | | | | | | commit 19ab6bc5674a30fdb6a2436b068d19a3c17dc73e upstream. This is intended to add ZTE device PIDs on kernel. Signed-off-by: Liu.Zhao <lzsos369@163.com> [johan: sort the new entries ] Signed-off-by: Johan Hovold <johan@kernel.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* perf header: Fixup reading of HEADER_NRCPUS featureArnaldo Carvalho de Melo2015-10-131-14/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit caa470475d9b59eeff093ae650800d34612c4379 upstream. The original patch introducing this header wrote the number of CPUs available and online in one order and then swapped those values when reading, fix it. Before: # perf record usleep 1 # perf report --header-only | grep 'nrcpus \(online\|avail\)' # nrcpus online : 4 # nrcpus avail : 4 # echo 0 > /sys/devices/system/cpu/cpu2/online # perf record usleep 1 # perf report --header-only | grep 'nrcpus \(online\|avail\)' # nrcpus online : 4 # nrcpus avail : 3 # echo 0 > /sys/devices/system/cpu/cpu1/online # perf record usleep 1 # perf report --header-only | grep 'nrcpus \(online\|avail\)' # nrcpus online : 4 # nrcpus avail : 2 After the fix, bringing back the CPUs online: # perf report --header-only | grep 'nrcpus \(online\|avail\)' # nrcpus online : 2 # nrcpus avail : 4 # echo 1 > /sys/devices/system/cpu/cpu2/online # perf record usleep 1 # perf report --header-only | grep 'nrcpus \(online\|avail\)' # nrcpus online : 3 # nrcpus avail : 4 # echo 1 > /sys/devices/system/cpu/cpu1/online # perf record usleep 1 # perf report --header-only | grep 'nrcpus \(online\|avail\)' # nrcpus online : 4 # nrcpus avail : 4 Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Wang Nan <wangnan0@huawei.com> Fixes: fbe96f29ce4b ("perf tools: Make perf.data more self-descriptive (v8)") Link: http://lkml.kernel.org/r/20150911153323.GP23511@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> [bwh: Backported to 3.2: print_nrcpus() reads and prints these fields immediately, so read both of them into an array before printing them in reverse order.] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* hfs: fix B-tree corruption after insertion at position 0Hin-Tak Leung2015-10-131-9/+11
| | | | | | | | | | | | | | | | | | | | | | | commit b4cc0efea4f0bfa2477c56af406cfcf3d3e58680 upstream. Fix B-tree corruption when a new record is inserted at position 0 in the node in hfs_brec_insert(). This is an identical change to the corresponding hfs b-tree code to Sergei Antonov's "hfsplus: fix B-tree corruption after insertion at position 0", to keep similar code paths in the hfs and hfsplus drivers in sync, where appropriate. Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net> Cc: Sergei Antonov <saproj@gmail.com> Cc: Joe Perches <joe@perches.com> Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* hfs,hfsplus: cache pages correctly between bnode_create and bnode_freeHin-Tak Leung2015-10-132-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7cb74be6fd827e314f81df3c5889b87e4c87c569 upstream. Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and hfs_bnode_find() for finding or creating pages corresponding to an inode) are immediately kmap()'ed and used (both read and write) and kunmap()'ed, and should not be page_cache_release()'ed until hfs_bnode_free(). This patch fixes a problem I first saw in July 2012: merely running "du" on a large hfsplus-mounted directory a few times on a reasonably loaded system would get the hfsplus driver all confused and complaining about B-tree inconsistencies, and generates a "BUG: Bad page state". Most recently, I can generate this problem on up-to-date Fedora 22 with shipped kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller mounts) and "du /mnt" simultaneously on two windows, where /mnt is a lightly-used QEMU VM image of the full Mac OS X 10.9: $ df -i / /home /mnt Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/fedora-root 3276800 551665 2725135 17% / /dev/mapper/fedora-home 52879360 716221 52163139 2% /home /dev/nbd0p2 4294967295 1387818 4293579477 1% /mnt After applying the patch, I was able to run "du /" (60+ times) and "du /mnt" (150+ times) continuously and simultaneously for 6+ hours. There are many reports of the hfsplus driver getting confused under load and generating "BUG: Bad page state" or other similar issues over the years. [1] The unpatched code [2] has always been wrong since it entered the kernel tree. The only reason why it gets away with it is that the kmap/memcpy/kunmap follow very quickly after the page_cache_release() so the kernel has not had a chance to reuse the memory for something else, most of the time. The current RW driver appears to have followed the design and development of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec 2001) had a B-tree node-centric approach to read_cache_page()/page_cache_release() per bnode_get()/bnode_put(), migrating towards version 0.2 (June 2002) of caching and releasing pages per inode extents. When the current RW code first entered the kernel [2] in 2005, there was an REF_PAGES conditional (and "//" commented out code) to switch between B-node centric paging to inode-centric paging. There was a mistake with the direction of one of the REF_PAGES conditionals in __hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were removed, but a page_cache_release() was mistakenly left in (propagating the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out page_cache_release() in bnode_release() (which should be spanned by !REF_PAGES) was never enabled. References: [1]: Michael Fox, Apr 2013 http://www.spinics.net/lists/linux-fsdevel/msg63807.html ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'") Sasha Levin, Feb 2015 http://lkml.org/lkml/2015/2/20/85 ("use after free") https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887 https://bugzilla.kernel.org/show_bug.cgi?id=42342 https://bugzilla.kernel.org/show_bug.cgi?id=63841 https://bugzilla.kernel.org/show_bug.cgi?id=78761 [2]: http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\ fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db commit d1081202f1d0ee35ab0beb490da4b65d4bc763db Author: Andrew Morton <akpm@osdl.org> Date: Wed Feb 25 16:17:36 2004 -0800 [PATCH] HFS rewrite http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\ fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd commit 91556682e0bf004d98a529bf829d339abb98bbbd Author: Andrew Morton <akpm@osdl.org> Date: Wed Feb 25 16:17:48 2004 -0800 [PATCH] HFS+ support [3]: http://sourceforge.net/projects/linux-hfsplus/ http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/ http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/ http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\ fs/hfsplus/bnode.c?r1=1.4&r2=1.5 Date: Thu Jun 6 09:45:14 2002 +0000 Use buffer cache instead of page cache in bnode.c. Cache inode extents. [4]: http://git.kernel.org/cgit/linux/kernel/git/\ stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d commit a5e3985fa014029eb6795664c704953720cc7f7d Author: Roman Zippel <zippel@linux-m68k.org> Date: Tue Sep 6 15:18:47 2005 -0700 [PATCH] hfs: remove debug code Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net> Signed-off-by: Sergei Antonov <saproj@gmail.com> Reviewed-by: Anton Altaparmakov <anton@tuxera.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Sougata Santra <sougata@tuxera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* powerpc/MSI: Fix race condition in tearing down MSI interruptsPaul Mackerras2015-10-135-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e297c939b745e420ef0b9dc989cb87bda617b399 upstream. This fixes a race which can result in the same virtual IRQ number being assigned to two different MSI interrupts. The most visible consequence of that is usually a warning and stack trace from the sysfs code about an attempt to create a duplicate entry in sysfs. The race happens when one CPU (say CPU 0) is disposing of an MSI while another CPU (say CPU 1) is setting up an MSI. CPU 0 calls (for example) pnv_teardown_msi_irqs(), which calls msi_bitmap_free_hwirqs() to indicate that the MSI (i.e. its hardware IRQ number) is no longer in use. Then, before CPU 0 gets to calling irq_dispose_mapping() to free up the virtal IRQ number, CPU 1 comes in and calls msi_bitmap_alloc_hwirqs() to allocate an MSI, and gets the same hardware IRQ number that CPU 0 just freed. CPU 1 then calls irq_create_mapping() to get a virtual IRQ number, which sees that there is currently a mapping for that hardware IRQ number and returns the corresponding virtual IRQ number (which is the same virtual IRQ number that CPU 0 was using). CPU 0 then calls irq_dispose_mapping() and frees that virtual IRQ number. Now, if another CPU comes along and calls irq_create_mapping(), it is likely to get the virtual IRQ number that was just freed, resulting in the same virtual IRQ number apparently being used for two different hardware interrupts. To fix this race, we just move the call to msi_bitmap_free_hwirqs() to after the call to irq_dispose_mapping(). Since virq_to_hw() doesn't work for the virtual IRQ number after irq_dispose_mapping() has been called, we need to call it before irq_dispose_mapping() and remember the result for the msi_bitmap_free_hwirqs() call. The pattern of calling msi_bitmap_free_hwirqs() before irq_dispose_mapping() appears in 5 places under arch/powerpc, and appears to have originated in commit 05af7bd2d75e ("[POWERPC] MPIC U3/U4 MSI backend") from 2007. Fixes: 05af7bd2d75e ("[POWERPC] MPIC U3/U4 MSI backend") Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> [bwh: Backported to 3.2: - powernv uses a private functions instead of msi_bitmap_free_hwirqs() - Adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* pagemap: hide physical addresses from non-privileged usersKonstantin Khlebnikov2015-10-131-16/+12
| | | | | | | | | | | | | | | | | | | | | | | commit 1c90308e7a77af6742a97d1021cca923b23b7f0d upstream. This patch makes pagemap readable for normal users and hides physical addresses from them. For some use-cases PFN isn't required at all. See http://lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: - Add the same check in the places where we look up a PFN - Add struct pagemapread * parameters where necessary - Open-code file_ns_capable() - Delete pagemap_open() entirely, as it would always return 0] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: 8429/1: disable GCC SRA optimizationArd Biesheuvel2015-10-131-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a077224fd35b2f7fbc93f14cf67074fc792fbac2 upstream. While working on the 32-bit ARM port of UEFI, I noticed a strange corruption in the kernel log. The following snprintf() statement (in drivers/firmware/efi/efi.c:efi_md_typeattr_format()) snprintf(pos, size, "|%3s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]", was producing the following output in the log: | | | | | |WB|WT|WC|UC] | | | | | |WB|WT|WC|UC] | | | | | |WB|WT|WC|UC] |RUN| | | | |WB|WT|WC|UC]* |RUN| | | | |WB|WT|WC|UC]* | | | | | |WB|WT|WC|UC] |RUN| | | | |WB|WT|WC|UC]* | | | | | |WB|WT|WC|UC] |RUN| | | | | | | |UC] |RUN| | | | | | | |UC] As it turns out, this is caused by incorrect code being emitted for the string() function in lib/vsprintf.c. The following code if (!(spec.flags & LEFT)) { while (len < spec.field_width--) { if (buf < end) *buf = ' '; ++buf; } } for (i = 0; i < len; ++i) { if (buf < end) *buf = *s; ++buf; ++s; } while (len < spec.field_width--) { if (buf < end) *buf = ' '; ++buf; } when called with len == 0, triggers an issue in the GCC SRA optimization pass (Scalar Replacement of Aggregates), which handles promotion of signed struct members incorrectly. This is a known but as yet unresolved issue. (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65932). In this particular case, it is causing the second while loop to be executed erroneously a single time, causing the additional space characters to be printed. So disable the optimization by passing -fno-ipa-sra. Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* fs: create and use seq_show_option for escapingKees Cook2015-10-1312-23/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream. Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky: $ BASE="ovl" $ MNT="$BASE/mnt" $ LOW="$BASE/lower" $ UP="$BASE/upper" $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000" $ mkdir -p "$LOW" "$UP" "$WORK" $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt $ cat /proc/mounts none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0 $ fusermount -u /proc $ cat /proc/mounts cat: /proc/mounts: No such file or directory This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [akpm@linux-foundation.org: add lost chunk, per Kees] [keescook@chromium.org: seq_show_option should be using const parameters] Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: J. R. Okajima <hooanon05g@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: - Drop changes to overlayfs, reiserfs - Drop vers option from cifs - ceph changes are all in one file - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* crypto: ghash-clmulni: specify context size for ghash async algorithmAndrey Ryabinin2015-10-131-0/+1
| | | | | | | | | | | | | commit 71c6da846be478a61556717ef1ee1cea91f5d6a8 upstream. Currently context size (cra_ctxsize) doesn't specified for ghash_async_alg. Which means it's zero. Thus crypto_create_tfm() doesn't allocate needed space for ghash_async_ctx, so any read/write to ctx (e.g. in ghash_async_init_tfm()) is not valid. Signed-off-by: Andrey Ryabinin <aryabinin@odin.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* Input: evdev - do not report errors form flush()Takashi Iwai2015-10-131-9/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit eb38f3a4f6e86f8bb10a3217ebd85ecc5d763aae upstream. We've got bug reports showing the old systemd-logind (at least system-210) aborting unexpectedly, and this turned out to be because of an invalid error code from close() call to evdev devices. close() is supposed to return only either EINTR or EBADFD, while the device returned ENODEV. logind was overreacting to it and decided to kill itself when an unexpected error code was received. What a tragedy. The bad error code comes from flush fops, and actually evdev_flush() returns ENODEV when device is disconnected or client's access to it is revoked. But in these cases the fact that flush did not actually happen is not an error, but rather normal behavior. For non-disconnected devices result of flush is also not that interesting as there is no potential of data loss and even if it fails application has no way of handling the error. Because of that we are better off always returning success from evdev_flush(). Also returning EINTR from flush()/close() is discouraged (as it is not clear how application should handle this error), so let's stop taking evdev->mutex interruptibly. Bugzilla: http://bugzilla.suse.com/show_bug.cgi?id=939834 Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> [bwh: Backported to 3.2: there's no revoked flag to test] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* IB/uverbs: reject invalid or unknown opcodesChristoph Hellwig2015-10-131-1/+9
| | | | | | | | | | | | | | | commit b632ffa7cee439ba5dce3b3bc4a5cbe2b3e20133 upstream. We have many WR opcodes that are only supported in kernel space and/or require optional information to be copied into the WR structure. Reject all those not explicitly handled so that we can't pass invalid information to drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* Add radeon suspend/resume quirk for HP Compaq dc5750.Jeffery Miller2015-10-131-0/+8
| | | | | | | | | | | | | commit 09bfda10e6efd7b65bcc29237bee1765ed779657 upstream. With the radeon driver loaded the HP Compaq dc5750 Small Form Factor machine fails to resume from suspend. Adding a quirk similar to other devices avoids the problem and the system resumes properly. Signed-off-by: Jeffery Miller <jmiller@neverware.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* drm/i915: Always mark the object as dirty when used by the GPUChris Wilson2015-10-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 51bc140431e233284660b1d22c47dec9ecdb521e upstream. There have been many hard to track down bugs whereby userspace forgot to flag a write buffer and then cause graphics corruption or a hung GPU when that buffer was later purged under memory pressure (as the buffer appeared clean, its pages would have been evicted rather than preserved and any changes more recent than in the backing storage would be lost). In retrospect this is a rare optimisation against memory pressure, already the slow path. If we always mark the buffer as dirty when accessed by the GPU, anything not used can still be evicted cheaply (ideal behaviour for mark-and-sweep eviction) but we do not run the risk of corruption. For correct read serialisation, userspace still has to notify when the GPU writes to an object. However, there are certain situations under which userspace may wish to tell white lies to the kernel... Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Kristian Høgsberg <krh@bitplanet.net> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: "Goel, Akash" <akash.goel@intel.co> Cc: Michał Winiarski <michal.winiarski@intel.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Jani Nikula <jani.nikula@intel.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* spi: spi-pxa2xx: Check status register to determine if SSSR_TINT is disabledTan, Jui Nee2015-10-131-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | commit 02bc933ebb59208f42c2e6305b2c17fd306f695d upstream. On Intel Baytrail, there is case when interrupt handler get called, no SPI message is captured. The RX FIFO is indeed empty when RX timeout pending interrupt (SSSR_TINT) happens. Use the BIOS version where both HSUART and SPI are on the same IRQ. Both drivers are using IRQF_SHARED when calling the request_irq function. When running two separate and independent SPI and HSUART application that generate data traffic on both components, user will see messages like below on the console: pxa2xx-spi pxa2xx-spi.0: bad message state in interrupt handler This commit will fix this by first checking Receiver Time-out Interrupt, if it is disabled, ignore the request and return without servicing. Signed-off-by: Tan, Jui Nee <jui.nee.tan@intel.com> Acked-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* IB/uverbs: Fix race between ib_uverbs_open and remove_oneYishai Hadas2015-10-132-14/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 35d4a0b63dc0c6d1177d4f532a9deae958f0662c upstream. Fixes: 2a72f212263701b927559f6850446421d5906c41 ("IB/uverbs: Remove dev_table") Before this commit there was a device look-up table that was protected by a spin_lock used by ib_uverbs_open and by ib_uverbs_remove_one. When it was dropped and container_of was used instead, it enabled the race with remove_one as dev might be freed just after: dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev) but before the kref_get. In addition, this buggy patch added some dead code as container_of(x,y,z) can never be NULL and so dev can never be NULL. As a result the comment above ib_uverbs_open saying "the open method will either immediately run -ENXIO" is wrong as it can never happen. The solution follows Jason Gunthorpe suggestion from below URL: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg25692.html cdev will hold a kref on the parent (the containing structure, ib_uverbs_device) and only when that kref is released it is guaranteed that open will never be called again. In addition, fixes the active count scheme to use an atomic not a kref to prevent WARN_ON as pointed by above comment from Jason. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* IB/mlx4: Use correct SL on AH query under RoCENoa Osherovich2015-10-131-1/+5
| | | | | | | | | | | | | | commit 5e99b139f1b68acd65e36515ca347b03856dfb5a upstream. The mlx4 IB driver implementation for ib_query_ah used a wrong offset (28 instead of 29) when link type is Ethernet. Fixed to use the correct one. Fixes: fa417f7b520e ('IB/mlx4: Add support for IBoE') Signed-off-by: Shani Michaeli <shanim@mellanox.com> Signed-off-by: Noa Osherovich <noaos@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* SUNRPC: xs_reset_transport must mark the connection as disconnectedTrond Myklebust2015-10-131-0/+2
| | | | | | | | | | commit 0c78789e3a030615c6650fde89546cadf40ec2cc upstream. In case the reconnection attempt fails. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> [bwh: Backported to 3.2: add local variable xprt] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* IB/qib: Change lkey table allocation to support more MRsMike Marciniszyn2015-10-134-4/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | commit d6f1c17e162b2a11e708f28fa93f2f79c164b442 upstream. The lkey table is allocated with with a get_user_pages() with an order based on a number of index bits from a module parameter. The underlying kernel code cannot allocate that many contiguous pages. There is no reason the underlying memory needs to be physically contiguous. This patch: - switches the allocation/deallocation to vmalloc/vfree - caps the number of bits to 23 to insure at least 1 generation bit o this matches the module parameter description Reviewed-by: Vinit Agnihotri <vinit.abhay.agnihotri@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com> [bwh: Backported to 3.2: - Adjust context - Add definition of qib_dev_warn(), added upstream by commit ddb887658970 ("IB/qib: Convert opcode counters to per-context")] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>