path: root/include
Commit message [Author, Age, Files, Lines]
* Backport mac80211 from 3.4 kernel [Wolfgang Wiedmeyer, 2017-01-21, 7 files, -107/+850]
    The ath9k_htc driver depends on mac80211, but mac80211 can't be built. The reason is that net/wireless is almost completely backported from a 3.4 kernel. To follow suit, mac80211 is also backported from 3.4, more precisely from 3.4.113. This makes mac80211 build.

    Signed-off-by: Wolfgang Wiedmeyer <wolfgit@wiedmeyer.de>
* Merge branch 'cm-13.0' of https://github.com/CyanogenMod/android_kernel_samsung_smdk4412 into replicant-6.0 [Wolfgang Wiedmeyer, 2016-12-13, 17 files, -111/+234]
|\
| * perf: protect group_leader from races that cause ctx double-free [John Dias, 2016-12-13, 1 file, -0/+6]
    When moving a group_leader perf event from a software-context to a hardware-context, there's a race in checking and updating that context. The existing locking solution doesn't work; note that it tries to grab a lock inside the group_leader's context object, which you can only get at by going through a pointer that should be protected from these races. To avoid that problem, and to produce a simple solution, we can just use a lock per group_leader to protect all checks on the group_leader's context. The new lock is grabbed and released when no context locks are held.

    RM-290
    Bug: 30955111
    Bug: 31095224
    Change-Id: If37124c100ca6f4aa962559fba3bd5dbbec8e052
| * BACKPORT: lockdep: Silence warning if CONFIG_LOCKDEP isn't set [Paul Bolle, 2016-12-13, 1 file, -1/+1]
    Since commit c9a4962881929df7f1ef6e63e1b9da304faca4dd ("nfsd: make client_lock per net") compiling nfs4state.o without CONFIG_LOCKDEP set, triggers this GCC warning:

        fs/nfsd/nfs4state.c: In function ‘free_client’:
        fs/nfsd/nfs4state.c:1051:19: warning: unused variable ‘nn’ [-Wunused-variable]

    The cause of that warning is that lockdep_assert_held() compiles away if CONFIG_LOCKDEP is not set. Silence this warning by using the argument to lockdep_assert_held() as a nop if CONFIG_LOCKDEP is not set.

    Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
    Cc: J. Bruce Fields <bfields@redhat.com>
    Link: http://lkml.kernel.org/r/1359060797.1325.33.camel@x61.thuisdomein
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    --
     include/linux/lockdep.h | 2 +-
     1 file changed, 1 insertion(+), 1 deletion(-)

    Change-Id: I4a4e78fd92dccffe5fc7c3a2617ef7d4cf59f738
| * BACKPORT: perf: Introduce perf_pmu_migrate_context() [Yan, Zheng, 2016-12-13, 1 file, -0/+2]
    Originally from Peter Zijlstra. The helper migrates perf events from one cpu to another cpu.

    Conflicts (perf: Fix race in removing an event):
        kernel/events/core.c

    Change-Id: I7885fe36c9e2803b10477d556163197085be3d19
    Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Link: http://lkml.kernel.org/r/1339741902-8449-5-git-send-email-zheng.z.yan@intel.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * mm: remove gup_flags FOLL_WRITE games from __get_user_pages() [Linus Torvalds, 2016-10-22, 1 file, -0/+1]
    This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago in commit 4ceb5db9757a ("Fix get_user_pages() race for write access") but that was then undone due to problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug").

    In the meantime, the s390 situation has long been fixed, and we can now fix it by checking the pte_dirty() bit properly (and do it better). The s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement software dirty bits") which made it into v3.9. Earlier kernels will have to look at the page state itself.

    Also, the VM has become more scalable, and what used to be a purely theoretical race back then has become easier to trigger.

    To fix it, we introduce a new internal FOLL_COW flag to mark the "yes, we already did a COW" rather than play racy games with FOLL_WRITE that is very fundamental, and then use the pte dirty flag to validate that the FOLL_COW flag is still valid.

    Change-Id: Id9bec3722797dff7d0ff0d9f6097c4229e31fd62
    Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Michal Hocko <mhocko@suse.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Willy Tarreau <w@1wt.eu>
    Cc: Nick Piggin <npiggin@gmail.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    [wt: s/gup.c/memory.c; s/follow_page_pte/follow_page_mask; s/faultin_page/__get_user_page]
    Signed-off-by: Willy Tarreau <w@1wt.eu>
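
The FOLL_COW idea in the message above can be modelled in a few lines. This is a userspace Python sketch, not the kernel code: the numeric flag values and the helper name `can_follow_write` are assumptions made for illustration, and the model is simplified. Only the flag names and the use of the pte dirty bit to validate FOLL_COW come from the commit message.

```python
# Illustrative flag bits; the numeric values are assumptions for this model.
FOLL_WRITE = 0x01
FOLL_COW = 0x02


def can_follow_write(pte_writable: bool, pte_dirty: bool, flags: int) -> bool:
    """Hypothetical helper: may a lookup with these flags use this entry?"""
    if not flags & FOLL_WRITE:
        return True  # read-only lookups need no write check
    if pte_writable:
        return True  # the entry already permits writing
    # FOLL_COW records "yes, we already did a COW". The dirty bit confirms
    # that the private copy is still the page behind this entry.
    return bool(flags & FOLL_COW) and pte_dirty
```

The point of the design is that FOLL_WRITE itself is never cleared: a read-only entry is accepted for a write lookup only when both the COW marker and the dirty bit agree.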
| * tcp: fix use after free in tcp_xmit_retransmit_queue() [Eric Dumazet, 2016-10-19, 1 file, -0/+2]
    When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the tail of the write queue using tcp_add_write_queue_tail().

    Then it attempts to copy user data into this fresh skb.

    If the copy fails, we undo the work and remove the fresh skb.

    Unfortunately, this undo lacks the change done to tp->highest_sack and we can leave a dangling pointer (to a freed skb).

    Later, tcp_xmit_retransmit_queue() can dereference this pointer and access freed memory. For regular kernels where memory is not unmapped, this might cause SACK bugs because tcp_highest_sack_seq() is buggy, returning garbage instead of tp->snd_nxt, but with various debug features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.

    This bug was found by Marco Grassi thanks to syzkaller.

    Change-Id: I264f97d30d0a623011d9ee811c63fa0e0c2149a2
    Fixes: 6859d49475d4 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
    Reported-by: Marco Grassi <marco.gra@gmail.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
    Cc: Yuchung Cheng <ycheng@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
| * mm: add a field to store names for private anonymous memory [Colin Cross, 2016-08-23, 3 files, -1/+19]
    Userspace processes often have multiple allocators that each do anonymous mmaps to get memory. When examining memory usage of individual processes or systems as a whole, it is useful to be able to break down the various heaps that were allocated by each layer and examine their size, RSS, and physical memory usage.

    This patch adds a user pointer to the shared union in vm_area_struct that points to a null terminated string inside the user process containing a name for the vma. vmas that point to the same address will be merged, but vmas that point to equivalent strings at different addresses will not be merged.

    Userspace can set the name for a region of memory by calling

        prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);

    Setting the name to NULL clears it.

    The names of named anonymous vmas are shown in /proc/pid/maps as [anon:<name>] and in /proc/pid/smaps in a new "Name" field that is only present for named vmas. If the userspace pointer is no longer valid all or part of the name will be replaced with "<fault>".

    The idea to store a userspace pointer to reduce the complexity within mm (at the expense of the complexity of reading /proc/pid/mem) came from Dave Hansen. This results in no runtime overhead in the mm subsystem other than comparing the anon_name pointers when considering vma merging. The pointer is stored in a union with fields that are only used on file-backed mappings, so it does not increase memory usage.

    Change-Id: I53b093d98dc24f41377824f34e076edced4a6f07
| * power: max17042_battery: Set type to UNKNOWN [Zhao Wei Liew, 2016-08-17, 1 file, -0/+1]
    This is a fuelgauge driver, not an actual battery driver. Setting its type to 'Battery' will confuse healthd, causing healthd to pick this driver instead of the actual battery driver for reading battery stats.

    Issue-Id: NIGHTLIES-3279
    Change-Id: Ia45e74599d391a90cb526aa07a2525b64c3eec96
| * staging: android: lowmemorykiller: implement task's adj rbtree [Hong-Mei Li, 2016-06-13, 1 file, -0/+10]
    Based on the current LMK implementation, LMK has to scan all processes to select the correct task to kill during low memory. The basic idea for the optimization is to queue all tasks by oom_score_adj priority, so that LMK just selects the proper task from the queue (rbtree) to kill.

    Performance improvement:
        current implementation:   average time to find a task to kill: 1004us
        optimized implementation: average time to find a task to kill: 43us

    Change-Id: I4dbbdd5673314dbbdabb71c3eff0dc229ce4ea91
    Signed-off-by: Hong-Mei Li <a21834@motorola.com>
    Reviewed-on: http://gerrit.pcs.mot.com/548917
    SLT-Approved: Slta Waiver <sltawvr@motorola.com>
    Tested-by: Jira Key <jirakey@motorola.com>
    Reviewed-by: Yi-Wei Zhao <gbjc64@motorola.com>
    Submit-Approved: Jira Key <jirakey@motorola.com>
    Signed-off-by: D. Andrei Măceș <dmaces@nd.edu>

    Conflicts: drivers/staging/android/Kconfig drivers/staging/android/lowmemorykiller.c fs/proc/base.c mm/oom_kill.c
    Conflicts: drivers/staging/android/lowmemorykiller.c mm/oom_kill.c
    Conflicts: mm/oom_kill.c
    Conflicts: drivers/staging/android/lowmemorykiller.c mm/oom_kill.c
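
The difference between scanning every task and keeping tasks ordered by oom_score_adj can be sketched as follows. This is a rough Python model under stated assumptions: a sorted list stands in for the kernel rbtree, the class and method names (`AdjIndex`, `pick_victim`) are hypothetical, and the victim choice is reduced to "highest adj at or above a threshold".

```python
import bisect


class AdjIndex:
    """Keeps (oom_score_adj, pid) pairs sorted; a stand-in for the rbtree."""

    def __init__(self):
        self._items = []

    def add(self, adj, pid):
        # Insert while keeping the list ordered by adj, then pid.
        bisect.insort(self._items, (adj, pid))

    def remove(self, adj, pid):
        i = bisect.bisect_left(self._items, (adj, pid))
        if i < len(self._items) and self._items[i] == (adj, pid):
            del self._items[i]

    def pick_victim(self, min_adj):
        """Return the pid with the highest adj that is >= min_adj, or None."""
        if self._items and self._items[-1][0] >= min_adj:
            return self._items[-1][1]
        return None


idx = AdjIndex()
idx.add(0, 100)
idx.add(900, 200)
idx.add(500, 300)
first = idx.pick_victim(600)   # looks only at the last element
idx.remove(900, 200)
second = idx.pick_victim(600)  # nothing left at or above 600
```

Selection only inspects the largest element instead of walking all tasks, which is the effect the timing figures above describe. Unlike an rbtree, a Python list pays a linear cost on insert and delete, so this shows the ordering idea rather than the exact complexity.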
| * kernel: avoid adding non-thread-group task to LMK rbtree [Hong-Mei Li, 2016-06-13, 1 file, -4/+6]
    To maintain the task adj RB tree, we add a task to the RB tree on fork and delete it on exit. The place is exactly the same as for the linear p->tasks list, that is, only when the task is thread_group_leader.

    But the oom_score_adj change path does not check thread_group_leader, so it may delete and re-add a non-leader task in the RB tree. That task is then left in the RB tree, since a non-leader task is never really deleted from the tree. The orphan task is eventually freed, which causes a later use-after-free panic when the RB tree is accessed.

    Solution: Move the rbtree adj_node to signal_struct, which is shared between the task and all its threads. This makes sure we only add one node per thread group.

    Change-Id: I1e8dfe490656408863b3726c7bc9e4ee6dc5abc1
    Signed-off-by: Hong-Mei Li <a21834@motorola.com>
    Reviewed-on: http://gerrit.mot.com/754224
    SLTApproved: Slta Waiver <sltawvr@motorola.com>
    SME-Granted: SME Approvals Granted
    Tested-by: Jira Key <jirakey@motorola.com>
    Reviewed-by: Zhi-Ming Yuan <a14194@motorola.com>
    Reviewed-by: Yi-Wei Zhao <gbjc64@motorola.com>
    Submit-Approved: Jira Key <jirakey@motorola.com>
    (cherry picked from commit b3f12a2465542888ec5c868c38022e0e5f7631ca)
    Signed-off-by: Abdul Salam <salamab@motorola.com>
    Reviewed-on: http://gerrit.mot.com/766108
    Reviewed-by: Sudharsan Yettapu <sudharsan.yettapu@motorola.com>
    Reviewed-by: Ravikumar Vembu <raviv@motorola.com>
    (cherry picked from commit 558ef1fceae5d4c8509cb2a40d98c841525f7ea3)
    Reviewed-on: http://gerrit.mot.com/768300

    Conflicts: kernel/fork.c
| * mm: implement WasActive page flag (for improving cleancache) [Dan Magenheimer, 2016-06-12, 1 file, -0/+10]
    (Feedback welcome if there is a different/better way to do this without using a page flag!)

    Since about 2.6.27, the page replacement algorithm maintains an "active" bit to help decide which pages are most eligible to reclaim, see http://linux-mm.org/PageReplacementDesign

    This "active" information is also useful to cleancache but is lost by the time that cleancache has the opportunity to preserve the pageful of data. This patch adds a new page flag "WasActive" to retain the state. The flag may possibly be useful elsewhere.

    It is up to each cleancache backend to utilize the bit as it desires. The matching patch for zcache is included here for clarification/discussion purposes, though it will need to go through GregKH and the staging tree.

    The patch resolves issues reported with cleancache which occur especially during streaming workloads on older processors, see https://lkml.org/lkml/2011/8/17/351

    Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>

    Conflicts: include/linux/page-flags.h
    Change-Id: I0fcb2302a7b9c5e66db005229f679baee90f262f
    Conflicts: include/linux/page-flags.h
| * mm: Need page_swap_info() helper method from upstream [D. Andrei Măceș, 2016-06-12, 1 file, -0/+1]
    Stolen from commit f981c5950fa85916ba49bea5d9a7a5078f47e569: "mm: methods for teaching filesystems about PG_swapcache pages"

    Change-Id: I6673913f9c825d3a6de88a652e99bcaf04eb1dd6
| * mm: swap: don't delay swap free for fast swap devices [Vinayak Menon, 2016-06-12, 2 files, -3/+19]
    There are a couple of issues with swapcache usage when ZRAM is used as swap device.

    1) Kernel does a swap readahead which can be around 6 to 8 pages depending on total ram, which is not required for zram since accesses are fast.
    2) Kernel delays the freeing up of swapcache expecting a later hit, which again is useless in the case of zram.
    3) This is not related to swapcache, but zram usage itself. As mentioned in (2) kernel delays freeing of swapcache, but along with that it delays zram compressed page free also. i.e. there can be 2 copies, though one is compressed.

    This patch addresses these issues using two new flags, QUEUE_FLAG_FAST and SWP_FAST, to indicate that accesses to the device will be fast and cheap, and instructs the swap layer to free up swap space aggressively, and not to do read ahead.

    Change-Id: I5d2d5176a5f9420300bb2f843f6ecbdb25ea80e4
    Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
    Signed-off-by: D. Andrei Măceș <dmaces@nd.edu>

    Conflicts: include/linux/blkdev.h include/linux/swap.h mm/swap_state.c mm/swapfile.c
    Conflicts: include/linux/blkdev.h
| * zsmalloc: change return value unit of zs_get_total_size_bytes [Minchan Kim, 2016-06-12, 1 file, -1/+1]
    zs_get_total_size_bytes returns the amount of memory zsmalloc consumed in *byte units*, but zsmalloc operates in *page units* rather than byte units, so let's change the API. The benefit is that it removes unnecessary overhead (i.e., converting page units to byte units) in zsmalloc.

    Since the return type is pages, "zs_get_total_pages" is better than "zs_get_total_size_bytes".

    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Reviewed-by: Dan Streetman <ddstreet@ieee.org>
    Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
    Cc: Jerome Marchand <jmarchan@redhat.com>
    Cc: <juno.choi@lge.com>
    Cc: <seungho1.park@lge.com>
    Cc: Luigi Semenzato <semenzato@google.com>
    Cc: Nitin Gupta <ngupta@vflare.org>
    Cc: Seth Jennings <sjennings@variantweb.net>
    Cc: David Horner <ds2horner@gmail.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Conflicts: mm/zsmalloc.c
    Change-Id: If5697d7b7f8ebaab3b58c1f9f84de747eb909ca3
| * lz4: fix compression/decompression signedness mismatch [Sergey Senozhatsky, 2016-06-12, 1 file, -4/+4]
    LZ4 compression and decompression functions require input/output parameters of different signedness: unsigned char for compression and signed char for decompression.

    Change decompression API to require "(const) unsigned char *".

    Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
    Cc: Kyungsik Lee <kyungsik.lee@lge.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Yann Collet <yann.collet.73@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * lib: add lz4 compressor module [Chanho Min, 2016-06-12, 1 file, -0/+36]
    This patchset is for supporting LZ4 compression and the crypto API using it.

    As shown below, the size of data is a little bit bigger but compressing speed is faster under the enabled unaligned memory access. We can use lz4 de/compression through crypto API as well. Also, it will be useful for another potential user of lz4 compression.

    lz4 Compression Benchmark:
    Compiler: ARM gcc 4.6.4
    ARMv7, 1 GHz based board
    Kernel: linux 3.4
    Uncompressed data Size: 101 MB

               Compressed Size   Compression Speed
        LZO    72.1MB            32.1MB/s, 33.0MB/s(UA)
        LZ4    75.1MB            30.4MB/s, 35.9MB/s(UA)
        LZ4HC  59.8MB            2.4MB/s, 2.5MB/s(UA)

    - UA: Unaligned memory Access support
    - Latest patch set for LZO applied

    This patch:

    Add support for LZ4 compression in the Linux Kernel. LZ4 Compression APIs for kernel are based on LZ4 implementation by Yann Collet and were changed for kernel coding style.

    LZ4 homepage : http://fastcompression.blogspot.com/p/lz4.html
    LZ4 source repository : http://code.google.com/p/lz4/
    svn revision : r90

    Two APIs are added: lz4_compress() supports basic lz4 compression, whereas lz4hc_compress() supports high compression, where CPU performance gets lower but the compression ratio gets higher. Also, we require pre-allocated working memory of the defined size, and the destination buffer must be allocated with the size of lz4_compressbound.

    [akpm@linux-foundation.org: make lz4_compresshcctx() static]
    Signed-off-by: Chanho Min <chanho.min@lge.com>
    Cc: "Darrick J. Wong" <djwong@us.ibm.com>
    Cc: Bob Pearson <rpearson@systemfabricworks.com>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Herbert Xu <herbert@gondor.hengli.com.au>
    Cc: Yann Collet <yann.collet.73@gmail.com>
    Cc: Kyungsik Lee <kyungsik.lee@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * lib: add support for LZ4-compressed kernel [Kyungsik Lee, 2016-06-12, 1 file, -0/+10]
    Add support for extracting LZ4-compressed kernel images, as well as LZ4-compressed ramdisk images in the kernel boot process.

    Signed-off-by: Kyungsik Lee <kyungsik.lee@lge.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Russell King <rmk@arm.linux.org.uk>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Florian Fainelli <florian@openwrt.org>
    Cc: Yann Collet <yann.collet.73@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Conflicts: scripts/Makefile.lib
    Change-Id: I2ad2607d9edf0f41c7e7a621f1da72174b142e2d
| * decompressor: add LZ4 decompressor module [Kyungsik Lee, 2016-06-12, 1 file, -0/+51]
    Add support for LZ4 decompression in the Linux Kernel. LZ4 Decompression APIs for kernel are based on LZ4 implementation by Yann Collet.

    Benchmark Results (PATCH v3)
    Compiler: Linaro ARM gcc 4.6.2

    1. ARMv7, 1.5GHz based board
       Kernel: linux 3.4
       Uncompressed Kernel Size: 14MB

               Compressed Size   Decompression Speed
        LZO    6.7MB             20.1MB/s, 25.2MB/s(UA)
        LZ4    7.3MB             29.1MB/s, 45.6MB/s(UA)

    2. ARMv7, 1.7GHz based board
       Kernel: linux 3.7
       Uncompressed Kernel Size: 14MB

               Compressed Size   Decompression Speed
        LZO    6.0MB             34.1MB/s, 52.2MB/s(UA)
        LZ4    6.5MB             86.7MB/s

    - UA: Unaligned memory Access support
    - Latest patch set for LZO applied

    This patch set is for adding support for LZ4-compressed Kernel. LZ4 is a very fast lossless compression algorithm and it also features an extremely fast decoder [1].

    But we have five decompressors already, and one question which does arise, however, is that of where do we stop adding new ones? This issue had been discussed and came to a conclusion [2].

    Russell King said that we should have:

    - one decompressor which is the fastest
    - one decompressor for the highest compression ratio
    - one popular decompressor (eg conventional gzip)

    If we have a replacement one for one of these, then it should do exactly that: replace it.

    The benchmark shows an 8% increase in image size vs a 66% increase in decompression speed compared to LZO (which has been known as the fastest decompressor in the Kernel). Therefore the "fast but may not be small" compression title has clearly been taken by LZ4 [3].

    [1] http://code.google.com/p/lz4/
    [2] http://thread.gmane.org/gmane.linux.kbuild.devel/9157
    [3] http://thread.gmane.org/gmane.linux.kbuild.devel/9347

    LZ4 homepage: http://fastcompression.blogspot.com/p/lz4.html
    LZ4 source repository: http://code.google.com/p/lz4/

    Signed-off-by: Kyungsik Lee <kyungsik.lee@lge.com>
    Signed-off-by: Yann Collet <yann.collet.73@gmail.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Russell King <rmk@arm.linux.org.uk>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Florian Fainelli <florian@openwrt.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
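
The 8% and 66% figures quoted in the message above can be checked against the benchmark numbers. The short Python calculation below is only a worked example. It assumes the percentages refer to the second board (LZ4 at 86.7MB/s against LZO with unaligned access at 52.2MB/s), which is the pairing that reproduces them; the helper name `pct_increase` is hypothetical.

```python
def pct_increase(new: float, old: float) -> float:
    """Relative increase of new over old, in percent."""
    return (new - old) / old * 100.0


# Board 2 (ARMv7, 1.7GHz, linux 3.7), figures taken from the table above.
size_increase = pct_increase(6.5, 6.0)     # compressed size, LZ4 vs LZO
speed_increase = pct_increase(86.7, 52.2)  # LZ4 vs LZO with UA support

print(f"size: +{size_increase:.1f}%  speed: +{speed_increase:.1f}%")
```

Rounded to whole percentages this gives the 8% and 66% stated in the message. The first board gives different ratios (about 9% larger and, with unaligned access, about 81% faster).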
| * zsmalloc: add copyright [Minchan Kim, 2016-06-12, 1 file, -0/+1]
    Add my copyright to the zsmalloc source code which I maintain.

    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Cc: Nitin Gupta <ngupta@vflare.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * zsmalloc: move it under mm [Minchan Kim, 2016-06-12, 1 file, -0/+50]
    This patch moves zsmalloc under the mm directory.

    Before that, this description explains why we have needed a custom allocator.

    Zsmalloc is a new slab-based memory allocator for storing compressed pages. It is designed for low fragmentation and high allocation success rate on large object, but <= PAGE_SIZE allocations.

    zsmalloc differs from the kernel slab allocator in two primary ways to achieve these design goals.

    zsmalloc never requires high order page allocations to back slabs, or "size classes" in zsmalloc terms. Instead it allows multiple single-order pages to be stitched together into a "zspage" which backs the slab. This allows for higher allocation success rate under memory pressure.

    Also, zsmalloc allows objects to span page boundaries within the zspage. This allows for lower fragmentation than could be had with the kernel slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE. With the kernel slab allocator, if a page compresses to 60% of its original size, the memory savings gained through compression are lost in fragmentation because another object of the same size can't be stored in the leftover space.

    This ability to span pages results in zsmalloc allocations not being directly addressable by the user. The user is given a non-dereferenceable handle in response to an allocation request. That handle must be mapped, using zs_map_object(), which returns a pointer to the mapped region that can be used. The mapping is necessary since the object data may reside in two different noncontiguous pages.

    The zsmalloc fulfills the allocation needs for zram perfectly.

    [sjenning@linux.vnet.ibm.com: borrow Seth's quote]
    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Acked-by: Nitin Gupta <ngupta@vflare.org>
    Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Luigi Semenzato <semenzato@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Change-Id: Ib026c17143131089494dc394c4a35e230220ec83
    Conflicts: drivers/staging/Kconfig drivers/staging/Makefile
    Conflicts: mm/Kconfig mm/Makefile
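
The 60% example in the message above can be worked through numerically. The following Python sketch is a simplified model, not zsmalloc's actual sizing logic: it assumes a 4096-byte page and a three-page zspage, and the function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size for this model


def slab_objects(obj_size: int, pages: int) -> int:
    """Objects that fit when no object may cross a page boundary."""
    return (PAGE_SIZE // obj_size) * pages


def zspage_objects(obj_size: int, pages: int) -> int:
    """Objects that fit when the pages are used as one contiguous span."""
    return (PAGE_SIZE * pages) // obj_size


obj = PAGE_SIZE * 6 // 10  # a page that compressed to 60% of its size
pages = 3                  # assumed zspage size, chosen for illustration

per_page_layout = slab_objects(obj, pages)
spanning_layout = zspage_objects(obj, pages)
```

With these assumptions the per-page layout holds three objects in three pages, so nothing is saved, while the spanning layout holds five. That is the fragmentation gain the message describes, and it is also why an object may sit in two noncontiguous pages and has to be mapped before use.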
| * mm: zcache/tmem/cleancache: s/flush/invalidate/ [Dan Magenheimer, 2016-06-12, 1 file, -3/+3]
    Complete the renaming from "flush" to "invalidate" across both tmem frontends (cleancache and frontswap) and both tmem backends (Xen and zcache), as required by akpm. This change is completely cosmetic.

    [v10: no change]
    [v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 3]
    Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
    Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Jan Beulich <JBeulich@novell.com>
    Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
    Cc: Jeremy Fitzhardinge <jeremy@goop.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Nitin Gupta <ngupta@vflare.org>
    Cc: Matthew Wilcox <matthew@wil.cx>
    Cc: Chris Mason <chris.mason@oracle.com>
    Cc: Rik Riel <riel@redhat.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    [v11: Remove the frontswap part]
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

    Conflicts: drivers/xen/tmem.c include/linux/cleancache.h
    Change-Id: Id9661e5fc4bb6f416129f38c1e3df80319653041
| * Revert "Add ZRAM_FOR_ANDROID" [Andreas Blaesius, 2016-06-12, 1 file, -6/+0]
    Change-Id: I6aff6a484dd94730f2032ceb838e0741ca6fa878
| * Revert "smdk4412 : modem_if KK driver from N5100ZTCNL4" [Roberto Gibellini, 2016-05-27, 1 file, -157/+87]
    This reverts commit 540f1d84d4f8fb27e73dfcb6f3b13c39fa667041.

    Change-Id: I8671bdc7f46a11375b6f710efa4af6bf32aea908
| * smdk4412 : modem_if KK driver from N5100ZTCNL4 [RGIB, 2016-05-25, 1 file, -87/+157]
    Change-Id: I903a0f614751f374e1705df5c35f4e1e21190b13
| * remove pmem [Simon Shields, 2016-05-20, 1 file, -93/+0]
    Change-Id: I53ceca9c1e0896241513e166de39684d3654f068
| * pipe: limit the per-user amount of pages allocated in pipes [Willy Tarreau, 2016-05-03, 2 files, -0/+5]
    On not-so-small systems, it is possible for a single process to cause an OOM condition by filling large pipes with data that are never read. A typical process filling 4000 pipes with 1 MB of data will use 4 GB of memory. On small systems it may be tricky to set the pipe max size to prevent this from happening.

    This patch makes it possible to enforce a per-user soft limit above which new pipes will be limited to a single page, effectively limiting them to 4 kB each, as well as a hard limit above which no new pipes may be created for this user. This has the effect of protecting the system against memory abuse without hurting other users, and still allowing pipes to work correctly though with less data at once.

    The limits are controlled by two new sysctls: pipe-user-pages-soft and pipe-user-pages-hard. Both may be disabled by setting them to zero. The default soft limit allows the default number of FDs per process (1024) to create pipes of the default size (64kB), thus reaching a limit of 64MB before starting to create only smaller pipes. With 256 processes limited to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB = 1084 MB of memory allocated for a user. The hard limit is disabled by default to avoid breaking existing applications that make intensive use of pipes (eg: for splicing).

    Reported-by: socketpair@gmail.com
    Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Mitigates: CVE-2013-4312 (Linux 2.0+)
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Willy Tarreau <w@1wt.eu>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

    Conflicts: Documentation/sysctl/fs.txt fs/pipe.c include/linux/sched.h
    Change-Id: Ic7c678af18129943e16715fdaa64a97a7f0854be
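
The memory figures in the message above follow from simple arithmetic, reproduced here as a worked Python example. All inputs are taken from the message; the variable names are illustrative only.

```python
KB = 1024
MB = 1024 * KB

fds_per_process = 1024       # default FD limit per process
default_pipe = 64 * KB       # default pipe size
single_page = 4 * KB         # size of a pipe once the soft limit is exceeded
processes = 256

# The soft limit is reached once one process worth of default pipes exists.
soft_limit_bytes = fds_per_process * default_pipe
soft_limit_pages = soft_limit_bytes // single_page

# Every further pipe for this user is limited to a single page.
remaining_pipes = processes * fds_per_process - fds_per_process
worst_case_bytes = soft_limit_bytes + remaining_pipes * single_page

# Unlimited case from the first paragraph: 4000 pipes of 1 MB each.
unlimited_bytes = 4000 * MB

print(soft_limit_bytes // MB, soft_limit_pages, worst_case_bytes // MB)
```

This yields a 64 MB soft limit (16384 pages) and a worst case of 1084 MB for the user, matching the message, against roughly 4 GB for a single process when no limit applies.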
| * Input: add infrastructure for selecting clockid for event time stamps [John Stultz, 2016-03-16, 1 file, -0/+2]
    As noted by Arve and others, since wall time can jump backwards, it is difficult to use for input because one cannot determine if one event occurred before another or for how long a key was pressed.

    However, the timestamp field is part of the kernel ABI, and cannot be changed without possibly breaking existing users.

    This patch adds a new IOCTL that allows a clockid to be set in the evdev_client struct that will specify which time base to use for event timestamps (ie: CLOCK_MONOTONIC instead of CLOCK_REALTIME).

    For now we only support CLOCK_MONOTONIC and CLOCK_REALTIME, but in the future we could support other clockids if appropriate.

    The default remains CLOCK_REALTIME, so we don't change the ABI.

    Signed-off-by: John Stultz <john.stultz@linaro.org>
    Reviewed-by: Daniel Kurtz <djkurtz@google.com>
    Signed-off-by: Dmitry Torokhov <dtor@mail.ru>

    Conflicts: include/linux/input.h
    Change-Id: I7b9b442dcd7930a1e72c688327e6fb7275107128
| * net: add length argument to skb_copy_and_csum_datagram_iovec [Sabrina Dubroca, 2016-03-15, 1 file, -1/+2]
    Without this length argument, we can read past the end of the iovec in memcpy_toiovec because we have no way of knowing the total length of the iovec's buffers.

    This is needed for stable kernels where 89c22d8c3b27 ("net: Fix skb csum races when peeking") has been backported but that don't have the ioviter conversion, which is almost all the stable trees <= 3.18.

    This also fixes a kernel crash for NFS servers when the client uses -onfsvers=3,proto=udp to mount the export.

    Change-Id: I1865e3d7a1faee42a5008a9ad58c4d3323ea4bab
    Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
    Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
    (cherry picked from commit c91234366e4cfd4f70c73e7d79ede92a6e462a88)
| * mm: Fix NULL pointer dereference in madvise(MADV_WILLNEED) support [Kirill A. Shutemov, 2016-03-15, 1 file, -3/+2]
    Sasha Levin found a NULL pointer dereference that is due to a missing page table lock, which in turn is due to the pmd entry in question being a transparent huge-table entry.

    The code - introduced in commit 1998cc048901 ("mm: make madvise(MADV_WILLNEED) support swap file prefetch") - correctly checks for this situation using pmd_none_or_trans_huge_or_clear_bad(), but it turns out that that function doesn't work correctly.

    pmd_none_or_trans_huge_or_clear_bad() expected that pmd_bad() would trigger if the transparent hugepage bit was set, but it doesn't do that if pmd_numa() is also set. Note that the NUMA bit only gets set on real NUMA machines, so people trying to reproduce this on most normal development systems would never actually trigger this.

    Fix it by removing the very subtle (and subtly incorrect) expectation, and instead just checking pmd_trans_huge() explicitly.

    Reported-by: Sasha Levin <sasha.levin@oracle.com>
    Acked-by: Andrea Arcangeli <aarcange@redhat.com>
    [ Additionally remove the now stale test for pmd_trans_huge() inside the pmd_bad() case - Linus ]
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Change-Id: I3f3763f236ef102de735297cd175cf514d40d28f
| * mnt: Only change user settable mount flags in remount  Eric W. Biederman  2016-03-15  1  -1/+3
| |
| | commit a6138db815df5ee542d848318e5dae681590fccd upstream.
| |
| | Kenton Varda <kenton@sandstorm.io> discovered that by remounting a read-only bind mount read-only in a user namespace, the MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user to remount a read-only mount read-write.
| |
| | Correct this by replacing the mask of mount flags to preserve with a mask of mount flags that may be changed, and preserve all others. This ensures that any future bugs with this mask and remount will fail in an easy-to-detect way where new mount flags simply won't change.
| |
| | Change-Id: I8ab8bda03a14b9b43e78f1dc6c818bbec048e986
| | Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
| | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| | Cc: Francis Moreau <francis.moro@gmail.com>
| | Signed-off-by: Zefan Li <lizefan@huawei.com>
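The difference between the two masking strategies described in this commit can be sketched in plain C. This is a minimal illustration only: the flag names, their bit values, and both helper functions are hypothetical stand-ins, not the kernel's definitions.

```c
/* Hypothetical flag bits, chosen only for this sketch. */
#define F_NOSUID        0x01u  /* may be changed by the user on remount */
#define F_RDONLY        0x02u  /* may be changed by the user on remount */
#define F_LOCK_READONLY 0x04u  /* internal: must survive every remount */
#define F_INTERNAL      0x08u  /* internal flag the old mask did list */

/* Old approach: list the flags to preserve; anything forgotten is lost. */
#define OLD_PRESERVE_MASK  (F_INTERNAL)
/* New approach: list the flags that may change; everything else is kept. */
#define USER_SETTABLE_MASK (F_NOSUID | F_RDONLY)

/* Old behaviour: F_LOCK_READONLY is not in the preserve list, so it is dropped. */
unsigned int remount_preserve_list(unsigned int old_flags, unsigned int requested)
{
	return requested | (old_flags & OLD_PRESERVE_MASK);
}

/* New behaviour: only user-settable bits come from the request. */
unsigned int remount_settable_list(unsigned int old_flags, unsigned int requested)
{
	return (requested & USER_SETTABLE_MASK) | (old_flags & ~USER_SETTABLE_MASK);
}
```

With old flags `F_RDONLY | F_LOCK_READONLY` and a request of `F_RDONLY`, the first helper loses the lock bit while the second keeps it. A flag missing from the settable mask simply cannot change, which is the "easy to detect" failure mode the message refers to.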
| * include/linux/poison.h: fix LIST_POISON{1,2} offset  Vasily Kulikov  2016-03-10  1  -2/+2
| |
| | Poison pointer values should be small enough to fit in non-mmap'able/hardly-mmap'able space. E.g. on x86 "poison pointer space" is located starting from 0x0. Given unprivileged users cannot mmap anything below mmap_min_addr, it should be safe to use poison pointers lower than mmap_min_addr.
| |
| | The current poison pointer values of LIST_POISON{1,2} might be too big for mmap_min_addr values equal to or less than 1 MB (common case, e.g. Ubuntu uses only 0x10000). There is little point in using such a big value given the "poison pointer space" below 1 MB is not yet exhausted. Changing it to a smaller value solves the problem for small mmap_min_addr setups.
| |
| | The values are suggested by Solar Designer:
| | http://www.openwall.com/lists/oss-security/2015/05/02/6
| |
| | Bug: 26186802
| | Change-Id: I2663f4e4d8725547c90ea14e082f10ae0cf80679
| | Signed-off-by: Yuan Lin <yualin@google.com>
| * ipv4: try to cache dst_entries which would cause a redirect  Hannes Frederic Sowa  2016-02-22  1  -5/+6
| |
| | Not caching dst_entries which cause redirects could be exploited by hosts on the same subnet, causing a severe DoS attack. This effect has been aggravated since commit f88649721268999 ("ipv4: fix dst race in sk_dst_get()").
| |
| | Lookups causing redirects will be allocated with DST_NOCACHE set, which will force dst_release to free them via RCU. Unfortunately, waiting for an RCU grace period just takes too long: we can end up with >1M dst_entries waiting to be released and the system will run OOM. rcuos threads cannot catch up under high softirq load.
| |
| | Attaching the flag to emit a redirect later on to the specific skb allows us to cache those dst_entries, thus reducing the pressure on allocation and deallocation.
| |
| | This issue was discovered by Marcelo Leitner.
| |
| | Cc: Julian Anastasov <ja@ssi.bg>
| | Signed-off-by: Marcelo Leitner <mleitner@redhat.com>
| | Signed-off-by: Florian Westphal <fw@strlen.de>
| | Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
| | Signed-off-by: Julian Anastasov <ja@ssi.bg>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| |
| | Conflicts:
| |     include/net/ip.h
| |     net/ipv4/route.c
| |
| | Change-Id: I53e4b500a4db2f5fece937a42a3bd810b2640c44
| * netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len  Andrey Vagin  2016-02-22  1  -2/+2
| |
| | "len" contains sizeof(nf_ct_ext) and size of extensions. In a worst case it can contain all extensions. Below you can find sizes for all types of extensions. Their sum is definitely bigger than 256.
| |
| |     nf_ct_ext_types[0]->len = 24
| |     nf_ct_ext_types[1]->len = 32
| |     nf_ct_ext_types[2]->len = 24
| |     nf_ct_ext_types[3]->len = 32
| |     nf_ct_ext_types[4]->len = 152
| |     nf_ct_ext_types[5]->len = 2
| |     nf_ct_ext_types[6]->len = 16
| |     nf_ct_ext_types[7]->len = 8
| |
| | I have seen "len" up to 280 and my host crashes without this patch.
| |
| | The right way to fix this problem is reducing the size of the ecache extension (4) and Florian is going to do this, but these changes will be too large to be appropriate for a stable tree.
| |
| | Change-Id: Id44470ab1d54526993927cdda68342e591a5d6c3
| | Fixes: 5b423f6a40a0 (netfilter: nf_conntrack: fix racy timer handling with reliable)
| | Cc: Pablo Neira Ayuso <pablo@netfilter.org>
| | Cc: Patrick McHardy <kaber@trash.net>
| | Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | Cc: "David S. Miller" <davem@davemloft.net>
| | Signed-off-by: Andrey Vagin <avagin@openvz.org>
| | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
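The arithmetic behind this commit can be checked with a small userspace sketch. It assumes the length was previously kept in a one-byte field and is widened to two bytes by the patch, as the subject line suggests; the function names are hypothetical and only the sizes come from the message.

```c
#include <stddef.h>
#include <stdint.h>

/* Extension sizes quoted in the commit message. */
static const unsigned int ext_len[] = { 24, 32, 24, 32, 152, 2, 16, 8 };

/* Sum of all extension sizes; sizeof(struct nf_ct_ext) itself is not included. */
unsigned int total_ext_len(void)
{
	unsigned int sum = 0;

	for (size_t i = 0; i < sizeof(ext_len) / sizeof(ext_len[0]); i++)
		sum += ext_len[i];
	return sum;
}

/* Value a one-byte length field would end up holding (wraps modulo 256). */
uint8_t len_as_u8(unsigned int len)
{
	return (uint8_t)len;
}

/* Value a two-byte length field would hold. */
uint16_t len_as_u16(unsigned int len)
{
	return (uint16_t)len;
}
```

The eight sizes add up to 290, so a one-byte field would wrap around to 34, while a two-byte field stores the value intact.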
| * net: add validation for the socket syscall protocol argument  Hannes Frederic Sowa  2016-02-20  1  -0/+1
| |
| | 郭永刚 reported that one could simply crash the kernel as root by using a simple program:
| |
| |     int socket_fd;
| |     struct sockaddr_in addr;
| |     addr.sin_port = 0;
| |     addr.sin_addr.s_addr = INADDR_ANY;
| |     addr.sin_family = 10;
| |
| |     socket_fd = socket(10,3,0x40000000);
| |     connect(socket_fd , &addr,16);
| |
| | AF_INET, AF_INET6 sockets actually only support 8-bit protocol identifiers. inet_sock's skc_protocol field is sized accordingly, so larger protocol identifiers have their higher bits cut off and a zero is stored in the protocol fields.
| |
| | This could lead to e.g. a NULL function pointer call, because as a result of the cut-off inet_num is zero and we call down to inet_autobind, which is NULL for raw sockets.
| |
| | kernel: Call Trace:
| | kernel: [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70
| | kernel: [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80
| | kernel: [<ffffffff81645069>] SYSC_connect+0xd9/0x110
| | kernel: [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80
| | kernel: [<ffffffff810236d8>] ? syscall_trace_enter_phase2+0x108/0x200
| | kernel: [<ffffffff81645e0e>] SyS_connect+0xe/0x10
| | kernel: [<ffffffff81779515>] tracesys_phase2+0x84/0x89
| |
| | I found no particular commit which introduced this problem.
| |
| | Change-Id: If01a1f7d3c652e8e67d5090eb8ea91389829b2ea
| | CVE: CVE-2015-8543
| | Cc: Cong Wang <cwang@twopensource.com>
| | Reported-by: 郭永刚 <guoyonggang@360.cn>
| | Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
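The truncation described here is easy to reproduce outside the kernel. The following is a minimal sketch, not the kernel code: `SKETCH_PROTO_MAX` and both function names are hypothetical, and the bound of 256 is an assumption that simply matches an 8-bit protocol field.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed upper bound: one past the largest value an 8-bit field can hold. */
#define SKETCH_PROTO_MAX 256

/* What an 8-bit protocol field stores when handed a wider value. */
uint8_t stored_protocol(int protocol)
{
	return (uint8_t)protocol;
}

/* Range check of the kind this commit adds before the value is stored. */
int validate_protocol(int protocol)
{
	if (protocol < 0 || protocol >= SKETCH_PROTO_MAX)
		return -EINVAL;
	return 0;
}
```

The reproducer's `0x40000000` has all of its low eight bits clear, so the stored protocol becomes 0; rejecting out-of-range values up front means the zero never reaches the connect path.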
| * block: separate priority boosting from REQ_META  Christoph Hellwig  2016-02-16  1  -2/+4
| |
| | Add a new REQ_PRIO to let requests preempt others in the cfq I/O scheduler, and leave REQ_META purely for marking requests as metadata in blktrace.
| |
| | All existing callers of REQ_META except for XFS are updated to also set REQ_PRIO for now.
| |
| | Backported to 3.0.x by Ketut Putu Kumajaya <ketut.kumajaya@gmail.com>
| |
| | Change-Id: Iad5ba7a105438776f74788c0aedaf85210c613f9
* | kernel: add support for gcc 5  Sasha Levin  2016-12-09  1  -0/+66
| |
| | commit 71458cfc782eafe4b27656e078d379a34e472adf upstream.
| |
| | We're missing include/linux/compiler-gcc5.h which is required now because gcc branched off to v5 in trunk.
| |
| | Just copy the relevant bits out of include/linux/compiler-gcc4.h, no new code is added as of now.
| |
| | This fixes a build error when using gcc 5.
| |
| | Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
| | Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* | Input: add infrastructure for selecting clockid for event time stamps  John Stultz  2016-03-18  1  -0/+2
| |
| | As noted by Arve and others, since wall time can jump backwards, it is difficult to use for input because one cannot determine if one event occurred before another or for how long a key was pressed.
| |
| | However, the timestamp field is part of the kernel ABI, and cannot be changed without possibly breaking existing users.
| |
| | This patch adds a new IOCTL that allows a clockid to be set in the evdev_client struct that will specify which time base to use for event timestamps (ie: CLOCK_MONOTONIC instead of CLOCK_REALTIME).
| |
| | For now we only support CLOCK_MONOTONIC and CLOCK_REALTIME, but in the future we could support other clockids if appropriate.
| |
| | The default remains CLOCK_REALTIME, so we don't change the ABI.
| |
| | Signed-off-by: John Stultz <john.stultz@linaro.org>
| | Reviewed-by: Daniel Kurtz <djkurtz@google.com>
| | Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
| |
| | Conflicts:
| |     include/linux/input.h
| |
| | Change-Id: I7b9b442dcd7930a1e72c688327e6fb7275107128
* | net: add length argument to skb_copy_and_csum_datagram_iovec  Sabrina Dubroca  2016-03-18  1  -1/+2
| |
| | Without this length argument, we can read past the end of the iovec in memcpy_toiovec because we have no way of knowing the total length of the iovec's buffers.
| |
| | This is needed for stable kernels where 89c22d8c3b27 ("net: Fix skb csum races when peeking") has been backported but that don't have the ioviter conversion, which is almost all the stable trees <= 3.18.
| |
| | This also fixes a kernel crash for NFS servers when the client uses -onfsvers=3,proto=udp to mount the export.
| |
| | Change-Id: I1865e3d7a1faee42a5008a9ad58c4d3323ea4bab
| | Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
| | Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
| | (cherry picked from commit c91234366e4cfd4f70c73e7d79ede92a6e462a88)
* | mm: Fix NULL pointer dereference in madvise(MADV_WILLNEED) support  Kirill A. Shutemov  2016-03-18  1  -3/+2
| |
| | Sasha Levin found a NULL pointer dereference that is due to a missing page table lock, which in turn is due to the pmd entry in question being a transparent huge-table entry.
| |
| | The code - introduced in commit 1998cc048901 ("mm: make madvise(MADV_WILLNEED) support swap file prefetch") - correctly checks for this situation using pmd_none_or_trans_huge_or_clear_bad(), but it turns out that that function doesn't work correctly.
| |
| | pmd_none_or_trans_huge_or_clear_bad() expected that pmd_bad() would trigger if the transparent hugepage bit was set, but it doesn't do that if pmd_numa() is also set. Note that the NUMA bit only gets set on real NUMA machines, so people trying to reproduce this on most normal development systems would never actually trigger this.
| |
| | Fix it by removing the very subtle (and subtly incorrect) expectation, and instead just checking pmd_trans_huge() explicitly.
| |
| | Reported-by: Sasha Levin <sasha.levin@oracle.com>
| | Acked-by: Andrea Arcangeli <aarcange@redhat.com>
| | [ Additionally remove the now stale test for pmd_trans_huge() inside the pmd_bad() case - Linus ]
| | Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | Change-Id: I3f3763f236ef102de735297cd175cf514d40d28f
* | mnt: Only change user settable mount flags in remount  Eric W. Biederman  2016-03-18  1  -1/+3
| |
| | commit a6138db815df5ee542d848318e5dae681590fccd upstream.
| |
| | Kenton Varda <kenton@sandstorm.io> discovered that by remounting a read-only bind mount read-only in a user namespace, the MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user to remount a read-only mount read-write.
| |
| | Correct this by replacing the mask of mount flags to preserve with a mask of mount flags that may be changed, and preserve all others. This ensures that any future bugs with this mask and remount will fail in an easy-to-detect way where new mount flags simply won't change.
| |
| | Change-Id: I8ab8bda03a14b9b43e78f1dc6c818bbec048e986
| | Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
| | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| | Cc: Francis Moreau <francis.moro@gmail.com>
| | Signed-off-by: Zefan Li <lizefan@huawei.com>
* | include/linux/poison.h: fix LIST_POISON{1,2} offset  Vasily Kulikov  2016-03-18  1  -2/+2
| |
| | Poison pointer values should be small enough to fit in non-mmap'able/hardly-mmap'able space. E.g. on x86 "poison pointer space" is located starting from 0x0. Given unprivileged users cannot mmap anything below mmap_min_addr, it should be safe to use poison pointers lower than mmap_min_addr.
| |
| | The current poison pointer values of LIST_POISON{1,2} might be too big for mmap_min_addr values equal to or less than 1 MB (common case, e.g. Ubuntu uses only 0x10000). There is little point in using such a big value given the "poison pointer space" below 1 MB is not yet exhausted. Changing it to a smaller value solves the problem for small mmap_min_addr setups.
| |
| | The values are suggested by Solar Designer:
| | http://www.openwall.com/lists/oss-security/2015/05/02/6
| |
| | Bug: 26186802
| | Change-Id: I2663f4e4d8725547c90ea14e082f10ae0cf80679
| | Signed-off-by: Yuan Lin <yualin@google.com>
* | ipv4: try to cache dst_entries which would cause a redirect  Hannes Frederic Sowa  2016-03-18  1  -5/+6
| |
| | Not caching dst_entries which cause redirects could be exploited by hosts on the same subnet, causing a severe DoS attack. This effect has been aggravated since commit f88649721268999 ("ipv4: fix dst race in sk_dst_get()").
| |
| | Lookups causing redirects will be allocated with DST_NOCACHE set, which will force dst_release to free them via RCU. Unfortunately, waiting for an RCU grace period just takes too long: we can end up with >1M dst_entries waiting to be released and the system will run OOM. rcuos threads cannot catch up under high softirq load.
| |
| | Attaching the flag to emit a redirect later on to the specific skb allows us to cache those dst_entries, thus reducing the pressure on allocation and deallocation.
| |
| | This issue was discovered by Marcelo Leitner.
| |
| | Cc: Julian Anastasov <ja@ssi.bg>
| | Signed-off-by: Marcelo Leitner <mleitner@redhat.com>
| | Signed-off-by: Florian Westphal <fw@strlen.de>
| | Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
| | Signed-off-by: Julian Anastasov <ja@ssi.bg>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| |
| | Conflicts:
| |     include/net/ip.h
| |     net/ipv4/route.c
| |
| | Change-Id: I53e4b500a4db2f5fece937a42a3bd810b2640c44
* | netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len  Andrey Vagin  2016-03-18  1  -2/+2
| |
| | "len" contains sizeof(nf_ct_ext) and size of extensions. In a worst case it can contain all extensions. Below you can find sizes for all types of extensions. Their sum is definitely bigger than 256.
| |
| |     nf_ct_ext_types[0]->len = 24
| |     nf_ct_ext_types[1]->len = 32
| |     nf_ct_ext_types[2]->len = 24
| |     nf_ct_ext_types[3]->len = 32
| |     nf_ct_ext_types[4]->len = 152
| |     nf_ct_ext_types[5]->len = 2
| |     nf_ct_ext_types[6]->len = 16
| |     nf_ct_ext_types[7]->len = 8
| |
| | I have seen "len" up to 280 and my host crashes without this patch.
| |
| | The right way to fix this problem is reducing the size of the ecache extension (4) and Florian is going to do this, but these changes will be too large to be appropriate for a stable tree.
| |
| | Change-Id: Id44470ab1d54526993927cdda68342e591a5d6c3
| | Fixes: 5b423f6a40a0 (netfilter: nf_conntrack: fix racy timer handling with reliable)
| | Cc: Pablo Neira Ayuso <pablo@netfilter.org>
| | Cc: Patrick McHardy <kaber@trash.net>
| | Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | Cc: "David S. Miller" <davem@davemloft.net>
| | Signed-off-by: Andrey Vagin <avagin@openvz.org>
| | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | net: add validation for the socket syscall protocol argument  Hannes Frederic Sowa  2016-03-18  1  -0/+1
| |
| | 郭永刚 reported that one could simply crash the kernel as root by using a simple program:
| |
| |     int socket_fd;
| |     struct sockaddr_in addr;
| |     addr.sin_port = 0;
| |     addr.sin_addr.s_addr = INADDR_ANY;
| |     addr.sin_family = 10;
| |
| |     socket_fd = socket(10,3,0x40000000);
| |     connect(socket_fd , &addr,16);
| |
| | AF_INET, AF_INET6 sockets actually only support 8-bit protocol identifiers. inet_sock's skc_protocol field is sized accordingly, so larger protocol identifiers have their higher bits cut off and a zero is stored in the protocol fields.
| |
| | This could lead to e.g. a NULL function pointer call, because as a result of the cut-off inet_num is zero and we call down to inet_autobind, which is NULL for raw sockets.
| |
| | kernel: Call Trace:
| | kernel: [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70
| | kernel: [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80
| | kernel: [<ffffffff81645069>] SYSC_connect+0xd9/0x110
| | kernel: [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80
| | kernel: [<ffffffff810236d8>] ? syscall_trace_enter_phase2+0x108/0x200
| | kernel: [<ffffffff81645e0e>] SyS_connect+0xe/0x10
| | kernel: [<ffffffff81779515>] tracesys_phase2+0x84/0x89
| |
| | I found no particular commit which introduced this problem.
| |
| | Change-Id: If01a1f7d3c652e8e67d5090eb8ea91389829b2ea
| | CVE: CVE-2015-8543
| | Cc: Cong Wang <cwang@twopensource.com>
| | Reported-by: 郭永刚 <guoyonggang@360.cn>
| | Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | block: separate priority boosting from REQ_META  Christoph Hellwig  2016-03-18  1  -2/+4
|/
|   Add a new REQ_PRIO to let requests preempt others in the cfq I/O scheduler, and leave REQ_META purely for marking requests as metadata in blktrace.
|
|   All existing callers of REQ_META except for XFS are updated to also set REQ_PRIO for now.
|
|   Backported to 3.0.x by Ketut Putu Kumajaya <ketut.kumajaya@gmail.com>
|
|   Change-Id: Iad5ba7a105438776f74788c0aedaf85210c613f9
* f2fs: support 3.0  arter97  2016-02-13  4  -4/+49
|
| Initial backporting done by nowcomputing,
| (https://github.com/nowcomputing/f2fs-backports.git)
|
| Additional patches required by upstream jaegeuk/f2fs.git/linux-3.4 done by arter97.
|
| Change-Id: Ibbd3a608857338482f974fa4b1a8d3c02c267d9f
| Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
* fs: import f2fs from ↵  arter97  2016-02-13  3  -0/+1222
|
| git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git (branch dev)
|
| Up-to-date with git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git @04a17fb17fafada39f96bfb41ceb2dc1c11b2af6 (f2fs: avoid to read inline data except first page)
|
| Change-Id: I1fc76a61defd530c4e97587980ba43e98db6119e
| Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
* cputime: Clean up cputime_to_usecs and usecs_to_cputime macros  Michal Hocko  2015-12-02  1  -2/+2
|
| Get rid of the trailing semicolon so that these expressions can also be used somewhere other than in an assignment.
|
| Signed-off-by: Michal Hocko <mhocko@suse.cz>
| Acked-by: Arnd Bergmann <arnd@arndb.de>
| Cc: Dave Jones <davej@redhat.com>
| Cc: Alexey Dobriyan <adobriyan@gmail.com>
| Link: http://lkml.kernel.org/r/7565417ce30d7e6b1ddc169843af0777dbf66e75.1314172057.git.mhocko@suse.cz
| Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| Change-Id: I0ffcd25ee16589fd98906d3d9f5ee20200542175
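Why a semicolon inside a function-like macro restricts where it can be used is shown in the sketch below. The macro and function names are hypothetical stand-ins for the cputime helpers, and the conversion factor is arbitrary.

```c
/* Hypothetical stand-in for the real conversion helper; the factor is arbitrary. */
static inline unsigned long ticks_to_usecs(unsigned long ticks)
{
	return ticks * 10000UL;
}

/* Old style: the macro body carries its own semicolon. */
#define TO_USECS_OLD(t) ticks_to_usecs(t);
/* New style: a plain expression, usable anywhere an expression is allowed. */
#define TO_USECS_NEW(t) ticks_to_usecs(t)

/* The old form still works as the right-hand side of an assignment. */
unsigned long assign_only(unsigned long ticks)
{
	unsigned long usecs = TO_USECS_OLD(ticks);  /* expands to "...;;" which is harmless */

	return usecs;
}

/* Inside a condition only the new form compiles. */
int exceeds(unsigned long ticks, unsigned long limit_usecs)
{
	/* "if (TO_USECS_OLD(ticks) > limit_usecs)" would expand to
	 * "if (ticks_to_usecs(ticks); > limit_usecs)", a syntax error. */
	if (TO_USECS_NEW(ticks) > limit_usecs)
		return 1;
	return 0;
}
```

Leaving the semicolon to the caller is the usual convention for expression-like macros, since it lets them appear in conditions, arguments, and larger expressions.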
* hashtable: introduce a small and naive hashtable  Sasha Levin  2015-12-02  1  -0/+192
|
| This hashtable implementation uses hlist buckets to provide a simple hashtable, so that it does not get reimplemented all over the kernel.
|
| Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
| [ Merging this now, so that subsystems can start applying Sasha's patches that use this - Linus ]
| Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| Change-Id: I08357176e20fb805170de4736915cde9103db7d2
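The general idea of a fixed-size table of list buckets can be sketched in userspace C. This is not the kernel header: every type and function name below is hypothetical, the buckets are plain singly linked lists rather than the kernel's hlist, and the multiplicative hash is an arbitrary choice for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_BITS 3
#define TABLE_SIZE (1u << TABLE_BITS)  /* power-of-two number of buckets */

struct node {
	struct node *next;  /* next entry in the same bucket */
	uint32_t key;
	int value;
};

struct table {
	struct node *buckets[TABLE_SIZE];  /* each bucket is a list head */
};

/* Map a key to a bucket index using the top TABLE_BITS bits of a 32-bit product. */
static unsigned int bucket_of(uint32_t key)
{
	return (uint32_t)(key * 2654435761u) >> (32 - TABLE_BITS);
}

/* Push a caller-owned node onto the front of its bucket. */
void table_add(struct table *t, struct node *n)
{
	unsigned int b = bucket_of(n->key);

	n->next = t->buckets[b];
	t->buckets[b] = n;
}

/* Walk only the one bucket the key can live in. */
struct node *table_find(const struct table *t, uint32_t key)
{
	for (struct node *n = t->buckets[bucket_of(key)]; n != NULL; n = n->next)
		if (n->key == key)
			return n;
	return NULL;
}
```

A zero-initialized `struct table` is an empty table. Nodes are embedded in or owned by the caller, so the table itself never allocates memory.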