aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/net/mpls
AgeCommit message (Collapse)AuthorFilesLines
2025-06-17mpls: Use rcu_dereference_rtnl() in mpls_route_input_rcu().Kuniyuki Iwashima1-2/+2
As syzbot reported [0], mpls_route_input_rcu() can be called from mpls_getroute(), where is under RTNL. net->mpls.platform_label is only updated under RTNL. Let's use rcu_dereference_rtnl() in mpls_route_input_rcu() to silence the splat. [0]: WARNING: suspicious RCU usage 6.15.0-rc7-syzkaller-00082-g5cdb2c77c4c3 #0 Not tainted ---------------------------- net/mpls/af_mpls.c:84 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by syz.2.4451/17730: #0: ffffffff9012a3e8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock net/core/rtnetlink.c:80 [inline] #0: ffffffff9012a3e8 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg+0x371/0xe90 net/core/rtnetlink.c:6961 stack backtrace: CPU: 1 UID: 0 PID: 17730 Comm: syz.2.4451 Not tainted 6.15.0-rc7-syzkaller-00082-g5cdb2c77c4c3 #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x16c/0x1f0 lib/dump_stack.c:120 lockdep_rcu_suspicious+0x166/0x260 kernel/locking/lockdep.c:6865 mpls_route_input_rcu+0x1d4/0x200 net/mpls/af_mpls.c:84 mpls_getroute+0x621/0x1ea0 net/mpls/af_mpls.c:2381 rtnetlink_rcv_msg+0x3c9/0xe90 net/core/rtnetlink.c:6964 netlink_rcv_skb+0x16d/0x440 net/netlink/af_netlink.c:2534 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] netlink_unicast+0x53a/0x7f0 net/netlink/af_netlink.c:1339 netlink_sendmsg+0x8d1/0xdd0 net/netlink/af_netlink.c:1883 sock_sendmsg_nosec net/socket.c:712 [inline] __sock_sendmsg net/socket.c:727 [inline] ____sys_sendmsg+0xa98/0xc70 net/socket.c:2566 ___sys_sendmsg+0x134/0x1d0 net/socket.c:2620 __sys_sendmmsg+0x200/0x420 net/socket.c:2709 __do_sys_sendmmsg net/socket.c:2736 [inline] __se_sys_sendmmsg net/socket.c:2733 [inline] __x64_sys_sendmmsg+0x9c/0x100 net/socket.c:2733 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0x230 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f0a2818e969 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f0a28f52038 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00007f0a283b5fa0 RCX: 00007f0a2818e969 RDX: 0000000000000003 RSI: 0000200000000080 RDI: 0000000000000003 RBP: 00007f0a28210ab1 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00007f0a283b5fa0 R15: 00007ffce5e9f268 </TASK> Fixes: 0189197f4416 ("mpls: Basic routing support") Reported-by: syzbot+8a583bdd1a5cc0b0e068@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68507981.a70a0220.395abc.01ef.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250616201532.1036568-1-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15mpls: Use nlmsg_payload in mpls_valid_getroute_reqBreno Leitao1-2/+2
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-8-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15mpls: Use nlmsg_payload in mpls_valid_fib_dump_reqBreno Leitao1-2/+2
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-5-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-16percpu: use TYPEOF_UNQUAL() in variable declarationsUros Bizjak1-2/+2
Use TYPEOF_UNQUAL() to declare variables as a corresponding type without named address space qualifier to avoid "`__seg_gs' specified for auto variable `var'" errors. Link: https://lkml.kernel.org/r/20250127160709.80604-4-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Acked-by: Nadav Amit <nadav.amit@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-22rtnetlink: Return int from rtnl_af_register().Kuniyuki Iwashima1-1/+4
The next patch will add init_srcu_struct() in rtnl_af_register(), then we need to handle its error. Let's add the error handling in advance to make the following patch cleaner. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15neighbour: Remove NEIGH_DN_TABLE.Kuniyuki Iwashima1-1/+1
Since commit 1202cdd66531 ("Remove DECnet support from kernel"), NEIGH_DN_TABLE is no longer used. MPLS has implicit dependency on it in nla_put_via(), but nla_get_via() does not support DECnet. Let's remove NEIGH_DN_TABLE. Now, neigh_tables[] has only 2 elements and no extra iteration for DECnet in many places. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241014235216.10785-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-10mpls: Handle error of rtnl_register_module().Kuniyuki Iwashima1-11/+21
Since introduced, mpls_init() has been ignoring the returned value of rtnl_register_module(), which could fail silently. Handling the error allows users to view a module as an all-or-nothing thing in terms of the rtnetlink functionality. This prevents syzkaller from reporting spurious errors from its tests, where OOM often occurs and module is automatically loaded. Let's handle the errors by rtnl_register_many(). Fixes: 03c0566542f4 ("mpls: Netlink commands to add, remove, and dump routes") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-23net: mpls: delete redundant judgment statementsLi Zetao1-4/+2
The initial value of err is -ENOBUFS, and err is guaranteed to be less than 0 before all goto errout. Therefore, on the error path of errout, there is no need to repeatedly judge that err is less than 0, and delete redundant judgments to make the code more concise. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-16mpls: Reduce skb re-allocations due to skb_cow()Christoph Paasch1-1/+1
mpls_xmit() needs to prepend the MPLS-labels to the packet. That implies one needs to make sure there is enough space for it in the headers. Calling skb_cow() implies however that one wants to change even the playload part of the packet (which is not true for MPLS). Thus, call skb_cow_head() instead, which is what other tunnelling protocols do. Running a server with this comm it entirely removed the calls to pskb_expand_head() from the callstack in mpls_xmit() thus having significant CPU-reduction, especially at peak times. Cc: Roopa Prabhu <roopa@nvidia.com> Reported-by: Craig Taylor <cmtaylor@apple.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240815161201.22021-1-cpaasch@apple.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-24sysctl: treewide: constify the ctl_table argument of proc_handlersJoel Granados1-2/+2
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified. This patch has been generated by the following coccinelle script: ``` virtual patch @r1@ identifier ctl, write, buffer, lenp, ppos; identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)"; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos); @r2@ identifier func, ctl, write, buffer, lenp, ppos; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos) { ... } @r3@ identifier func; @@ int func( - struct ctl_table * + const struct ctl_table * ,int , void *, size_t *, loff_t *); @r4@ identifier func, ctl; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int , void *, size_t *, loff_t *); @r5@ identifier func, write, buffer, lenp, ppos; @@ int func( - struct ctl_table * + const struct ctl_table * ,int write, void *buffer, size_t *lenp, loff_t *ppos); ``` * Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where adjusted. * The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration. Co-developed-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Co-developed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-05-03net: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados1-7/+6
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) * Remove sentinel element from ctl_table structs. * Remove the zeroing out of an array element (to make it look like a sentinel) in neigh_sysctl_register and lowpan_frags_ns_sysctl_register This is not longer needed and is safe after commit c899710fe7f9 ("networking: Update to register_net_sysctl_sz") added the array size to the ctl_table registration. * Replace the for loop stop condition in sysctl_core_net_init that tests for procname == NULL with one that depends on array size * Removed the "-1" in mpls_net_init that adjusted for having an extra empty element when looping over ctl_table arrays * Use a table_size variable to keep the value of ARRAY_SIZE Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-30inet: introduce dst_rtable() helperEric Dumazet1-1/+1
I added dst_rt6_info() in commit e8dfd42c17fa ("ipv6: introduce dst_rt6_info() helper") This patch does a similar change for IPv4. Instead of (struct rtable *)dst casts, we can use : #define dst_rtable(_ptr) \ container_of_const(_ptr, struct rtable, dst) Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-29ipv6: introduce dst_rt6_info() helperEric Dumazet1-1/+1
Instead of (struct rt6_info *)dst casts, we can use : #define dst_rt6_info(_ptr) \ container_of_const(_ptr, struct rt6_info, dst) Some places needed missing const qualifiers : ip6_confirm_neigh(), ipv6_anycast_destination(), ipv6_unicast_destination(), has_gateway() v2: added missing parts (David Ahern) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-22sysctl: treewide: constify ctl_table_header::ctl_table_argThomas Weißschuh1-2/+2
To be able to constify instances of struct ctl_tables it is necessary to remove ways through which non-const versions are exposed from the sysctl core. One of these is the ctl_table_arg member of struct ctl_table_header. Constify this reference as a prerequisite for the full constification of struct ctl_table instances. No functional change. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-11mpls: no longer hold RTNL in mpls_netconf_dump_devconf()Eric Dumazet1-37/+22
- Use for_each_netdev_dump() to no longer rely on net->dev_index_head hash table. - No longer care of net->dev_base_seq - Fix return value at the end of a dump, so that NLMSG_DONE can be appended to current skb, saving one recvmsg() system call. - No longer grab RTNL, RCU protection is enough, afer adding one READ_ONCE(mdev->input_enabled) in mpls_netconf_fill_devconf() Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240410111951.2673193-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-08ipv4: Set scope explicitly in ip_route_output().Guillaume Nault1-1/+1
Add a "scope" parameter to ip_route_output() so that callers don't have to override the tos parameter with the RTO_ONLINK flag if they want a local scope. This will allow converting flowi4_tos to dscp_t in the future, thus allowing static analysers to flag invalid interactions between "tos" (the DSCP bits) and ECN. Only three users ask for local scope (bonding, arp and atm). The others continue to use RT_SCOPE_UNIVERSE. While there, add a comment to warn users about the limitations of ip_route_output(). Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: Leon Romanovsky <leonro@nvidia.com> # infiniband Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-14Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. * tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) nilfs2: prevent kernel bug at submit_bh_wbc() nilfs2: fix failure to detect DAT corruption in btree and direct mappings ocfs2: enable ocfs2_listxattr for special files ocfs2: remove SLAB_MEM_SPREAD flag usage assoc_array: fix the return value in assoc_array_insert_mid_shortcut() buildid: use kmap_local_page() watchdog/core: remove sysctl handlers from public header nilfs2: use div64_ul() instead of do_div() mul_u64_u64_div_u64: increase precision by conditionally swapping a and b kexec: copy only happens before uchunk goes to zero get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig get_signal: don't abuse ksig->info.si_signo and ksig->sig const_structs.checkpatch: add device_type Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>" dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace() list: leverage list_is_head() for list_entry_is_head() nilfs2: MAINTAINERS: drop unreachable project mirror site smp: make __smp_processor_id() 0-argument macro fat: fix uninitialized field in nostale filehandles ...
2024-03-07mpls: Do not orphan the skbChristoph Paasch1-2/+0
We observed that TCP-pacing was falling back to the TCP-layer pacing instead of utilizing sch_fq for the pacing. This causes significant CPU-usage due to the hrtimer running on a per-TCP-connection basis. The issue is that mpls_xmit() calls skb_orphan() and thus sets skb->sk to NULL. Which implies that many of the goodies of TCP won't work. Pacing falls back to TCP-layer pacing. TCP Small Queues does not work, ... It is safe to remove this call to skb_orphan() in mpls_xmit() as there really is not reason for it to be there. It appears that this call to skb_orphan comes from the very initial implementation of MPLS. Cc: Roopa Prabhu <roopa@nvidia.com> Reported-by: Craig Taylor <cmtaylor@apple.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Link: https://lore.kernel.org/r/20240306181117.77419-1-cpaasch@apple.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-06Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>"Ahelenia Ziemiańska1-1/+1
Found with git grep 'MODULE_AUTHOR(".*([^)]*@' Fixed with sed -i '/MODULE_AUTHOR(".*([^)]*@/{s/ (/ </g;s/)"/>"/;s/)and/> and/}' \ $(git grep -l 'MODULE_AUTHOR(".*([^)]*@') Also: in drivers/media/usb/siano/smsusb.c normalise ", INC" to ", Inc"; this is what every other MODULE_AUTHOR for this company says, and it's what the header says in drivers/sbus/char/openprom.c normalise a double-spaced separator; this is clearly copied from the copyright header, where the names are aligned on consecutive lines thusly: * Linux/SPARC PROM Configuration Driver * Copyright (C) 1996 Thomas K. Dyas (tdyas@noc.rutgers.edu) * Copyright (C) 1996 Eddie C. Dost (ecd@skynet.be) but the authorship branding is single-line Link: https://lkml.kernel.org/r/mk3geln4azm5binjjlfsgjepow4o73domjv6ajybws3tz22vb3@tarta.nabijaczleweli.xyz Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-26inet: allow ip_valid_fib_dump_req() to be called with RTNL or RCUEric Dumazet1-1/+3
Add a new field into struct fib_dump_filter, to let callers tell if they use RTNL locking or RCU. This is used in the following patch, when inet_dump_fib() no longer holds RTNL. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-23net: mpls: error out if inner headers are not setFlorian Westphal1-0/+3
mpls_gso_segment() assumes skb_inner_network_header() returns a valid result: mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb); if (unlikely(!mpls_hlen || mpls_hlen % MPLS_HLEN)) goto out; if (unlikely(!pskb_may_pull(skb, mpls_hlen))) With syzbot reproducer, skb_inner_network_header() yields 0, skb_network_header() returns 108, so this will "pskb_may_pull(skb, -108)))" which triggers a newly added DEBUG_NET_WARN_ON_ONCE() check: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 5068 at include/linux/skbuff.h:2723 pskb_may_pull_reason include/linux/skbuff.h:2723 [inline] WARNING: CPU: 0 PID: 5068 at include/linux/skbuff.h:2723 pskb_may_pull include/linux/skbuff.h:2739 [inline] WARNING: CPU: 0 PID: 5068 at include/linux/skbuff.h:2723 mpls_gso_segment+0x773/0xaa0 net/mpls/mpls_gso.c:34 [..] skb_mac_gso_segment+0x383/0x740 net/core/gso.c:53 nsh_gso_segment+0x40a/0xad0 net/nsh/nsh.c:108 skb_mac_gso_segment+0x383/0x740 net/core/gso.c:53 __skb_gso_segment+0x324/0x4c0 net/core/gso.c:124 skb_gso_segment include/net/gso.h:83 [inline] [..] sch_direct_xmit+0x11a/0x5f0 net/sched/sch_generic.c:327 [..] packet_sendmsg+0x46a9/0x6130 net/packet/af_packet.c:3113 [..] First iteration of this patch made mpls_hlen signed and changed test to error out to "mpls_hlen <= 0 || ..". Eric Dumazet said: > I was thinking about adding a debug check in skb_inner_network_header() > if inner_network_header is zero (that would mean it is not 'set' yet), > but this would trigger even after your patch. So add new skb_inner_network_header_was_set() helper and use that. The syzbot reproducer injects data via packet socket. The skb that gets allocated and passed down the stack has ->protocol set to NSH (0x894f) and gso_type set to SKB_GSO_UDP | SKB_GSO_DODGY. This gets passed to skb_mac_gso_segment(), which sees NSH as ptype to find a callback for. nsh_gso_segment() retrieves next type: proto = tun_p_to_eth_p(nsh_hdr(skb)->np); ... which is MPLS (TUN_P_MPLS_UC). It updates skb->protocol and then calls mpls_gso_segment(). Inner offsets are all 0, so mpls_gso_segment() ends up with a negative header size. In case more callers rely on silent handling of such large may_pull values we could also 'legalize' this behaviour, either replacing the debug check with (len > INT_MAX) test or removing it and instead adding a comment before existing if (unlikely(len > skb->len)) return SKB_DROP_REASON_PKT_TOO_SMALL; test in pskb_may_pull_reason(), saying that this check also implicitly takes care of callers that miscompute header sizes. Cc: Simon Horman <horms@kernel.org> Fixes: 219eee9c0d16 ("net: skbuff: add overflow debug check to pull/push helpers") Reported-by: syzbot+99d15fcdb0132a1e1a82@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/00000000000043b1310611e388aa@google.com/raw Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240222140321.14080-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15networking: Update to register_net_sysctl_szJoel Granados1-2/+4
Move from register_net_sysctl to register_net_sysctl_sz for all the networking related files. Do this while making sure to mirror the NULL assignments with a table_size of zero for the unprivileged users. We need to move to the new function in preparation for when we change SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do so would erroneously allow ARRAY_SIZE() to be called on a pointer. We hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all the relevant net sysctl registering functions to register_net_sysctl_sz in subsequent commits. An additional size function was added to the following files in order to calculate the size of an array that is defined in another file: include/net/ipv6.h net/ipv6/icmp.c net/ipv6/route.c net/ipv6/sysctl_net_ipv6.c Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-06-10net: move gso declarations and functions to their own filesEric Dumazet2-0/+2
Move declarations into include/net/gso.h and code into net/core/gso.c Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stanislav Fomichev <sdf@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-15net: mpls: fix stale pointer if allocation fails during device renameJakub Kicinski1-0/+4
lianhui reports that when MPLS fails to register the sysctl table under new location (during device rename) the old pointers won't get overwritten and may be freed again (double free). Handle this gracefully. The best option would be unregistering the MPLS from the device completely on failure, but unfortunately mpls_ifdown() can fail. So failing fully is also unreliable. Another option is to register the new table first then only remove old one if the new one succeeds. That requires more code, changes order of notifications and two tables may be visible at the same time. sysctl point is not used in the rest of the code - set to NULL on failures and skip unregister if already NULL. Reported-by: lianhui tang <bluetlh@gmail.com> Fixes: 0fae3bf018d9 ("mpls: handle device renames for per-device sysctls") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-28net: Remove the obsolte u64_stats_fetch_*_irq() users (net).Thomas Gleixner1-2/+2
Now that the 32bit UP oddity is gone and 32bit uses always a sequence count, there is no need for the fetch_irq() variants anymore. Convert to the regular interface. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-29net: Use u64_stats_fetch_begin_irq() for stats fetch.Sebastian Andrzej Siewior1-2/+2
On 32bit-UP u64_stats_fetch_begin() disables only preemption. If the reader is in preemptible context and the writer side (u64_stats_update_begin*()) runs in an interrupt context (IRQ or softirq) then the writer can update the stats during the read operation. This update remains undetected. Use u64_stats_fetch_begin_irq() to ensure the stats fetch on 32bit-UP are not interrupted by a writer. 32bit-SMP remains unaffected by this change. Cc: "David S. Miller" <davem@davemloft.net> Cc: Catherine Sullivan <csully@google.com> Cc: David Awogbemila <awogbemila@google.com> Cc: Dimitris Michailidis <dmichail@fungible.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Hans Ulli Kroll <ulli.kroll@googlemail.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jeroen de Borst <jeroendb@google.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <simon.horman@corigine.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-wireless@vger.kernel.org Cc: netdev@vger.kernel.org Cc: oss-drivers@corigine.com Cc: stable@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-07net: mpls: fix memdup.cocci warningGONG, Ruiqi1-2/+1
Simply use kmemdup instead of explicitly allocating and copying memory. Generated by: scripts/coccinelle/api/memdup.cocci Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com> Link: https://lore.kernel.org/r/20220406114629.182833-1-gongruiqi1@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-10net: mpls: Fix GCC 12 warningVictor Erminpour1-1/+1
When building with automatic stack variable initialization, GCC 12 complains about variables defined outside of switch case statements. Move the variable outside the switch, which silences the warning: ./net/mpls/af_mpls.c:1624:21: error: statement will never be executed [-Werror=switch-unreachable] 1624 | int err; | ^~~ Signed-off-by: Victor Erminpour <victor.erminpour@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-36/+63
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-29net: mpls: Make for_nexthops iterator constBenjamin Poirier2-5/+5
There are separate for_nexthops and change_nexthops iterators. The for_nexthops variant should use const. Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-29net: mpls: Remove duplicate variable from iterator macroBenjamin Poirier1-7/+6
__nh is just a copy of nh with a different type. Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-29net: mpls: Remove rcu protection from nh_devBenjamin Poirier2-25/+16
Following the previous commit, nh_dev can no longer be accessed and modified concurrently. Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-29net: mpls: Fix notifications when deleting a deviceBenjamin Poirier1-16/+52
There are various problems related to netlink notifications for mpls route changes in response to interfaces being deleted: * delete interface of only nexthop DELROUTE notification is missing RTA_OIF attribute * delete interface of non-last nexthop NEWROUTE notification is missing entirely * delete interface of last nexthop DELROUTE notification is missing nexthop All of these problems stem from the fact that existing routes are modified in-place before sending a notification. Restructure mpls_ifdown() to avoid changing the route in the DELROUTE cases and to create a copy in the NEWROUTE case. Fixes: f8efb73c97e2 ("mpls: multipath route support") Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23mpls: defer ttl decrement in mpls_forward()Kangmin Park1-1/+1
Defer ttl decrement to optimize in tx_err case. There is no need to decrease ttl in the case of goto tx_err. Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-27mpls: Remove redundant assignment to errJiapeng Chong1-1/+0
Variable err is set to -ENOMEM but this value is never read as it is overwritten with a new value later on, hence it is a redundant assignment and can be removed. Cleans up the following clang-analyzer warning: net/mpls/af_mpls.c:1022:2: warning: Value stored to 'err' is never read [clang-analyzer-deadcode.DeadStores]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-09net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0Balazs Nemeth1-0/+3
A packet with skb_inner_network_header(skb) == skb_network_header(skb) and ETH_P_MPLS_UC will prevent mpls_gso_segment from pulling any headers from the packet. Subsequently, the call to skb_mac_gso_segment will again call mpls_gso_segment with the same packet leading to an infinite loop. In addition, ensure that the header length is a multiple of four, which should hold irrespective of the number of stacked labels. Signed-off-by: Balazs Nemeth <bnemeth@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-11-03mpls: drop skb's dst in mpls_forward()Guillaume Nault1-0/+2
Commit 394de110a733 ("net: Added pointer check for dst->ops->neigh_lookup in dst_neigh_lookup_skb") added a test in dst_neigh_lookup_skb() to avoid a NULL pointer dereference. The root cause was the MPLS forwarding code, which doesn't call skb_dst_drop() on incoming packets. That is, if the packet is received from a collect_md device, it has a metadata_dst attached to it that doesn't implement any dst_ops function. To align the MPLS behaviour with IPv4 and IPv6, let's drop the dst in mpls_forward(). This way, dst_neigh_lookup_skb() doesn't need to test ->neigh_lookup any more. Let's keep a WARN condition though, to document the precondition and to ease detection of such problems in the future. Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://lore.kernel.org/r/f8c2784c13faa54469a2aac339470b1049ca6b63.1604102750.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20mpls: load mpls_gso after mpls_iptunnelAlexander Ovechkin1-0/+1
mpls_iptunnel is used only for mpls encapsuation, and if encaplusated packet is larger than MTU we need mpls_gso for segmentation. Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru> Acked-by: Dmitry Yakunin <zeil@yandex-team.ru> Reviewed-by: David Ahern <dsahern@gmail.com> Link: https://lore.kernel.org/r/20201020114333.26866-1-ovov@yandex-team.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva1-1/+1
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-07-27net: Removed the device type check to add mpls support for devicesMartin Varghese1-14/+3
MPLS has no dependency with the device type of underlying devices. Hence the device type check to add mpls support for devices can be avoided. Signed-off-by: Martin Varghese <martin.varghese@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-14treewide: replace '---help---' in Kconfig files with 'help'Masahiro Yamada1-3/+3
Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over '---help---'"), the number of '---help---' has been gradually decreasing, but there are still more than 2400 instances. This commit finishes the conversion. While I touched the lines, I also fixed the indentation. There are a variety of indentation styles found. a) 4 spaces + '---help---' b) 7 spaces + '---help---' c) 8 spaces + '---help---' d) 1 space + 1 tab + '---help---' e) 1 tab + '---help---' (correct indentation) f) 1 tab + 1 space + '---help---' g) 1 tab + 2 spaces + '---help---' In order to convert all of them to 1 tab + 'help', I ran the following commend: $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/' Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-05-29net: Make mpls_entry_encode() available for generic usersEli Cohen1-11/+0
Move mpls_entry_encode() from net/mpls/internal.h to include/net/mpls.h and make it available for other users. Specifically, hardware driver that offload MPLS can benefit from that. Suggested-by: Jakub Kicinski <kuba@kernel.org> Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-22mpls: Add support for IPv6 tunnelsVadim Fedorenko1-1/+2
Add support for IPv6 tunnel devices in AF_MPLS. Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27sysctl: pass kernel pointers to ->proc_handlerChristoph Hellwig1-3/+2
Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-29net: add net available in build_stateAlexander Aring1-1/+1
The build_state callback of lwtunnel doesn't contain the net namespace structure yet. This patch will add it so we can check on specific address configuration at creation time of rpl source routes. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28net: mpls: Replace zero-length array with flexible-array memberGustavo A. R. Silva1-2/+2
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-04net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookupSabrina Dubroca1-4/+3
ipv6_stub uses the ip6_dst_lookup function to allow other modules to perform IPv6 lookups. However, this function skips the XFRM layer entirely. All users of ipv6_stub->ip6_dst_lookup use ip_route_output_flow (via the ip_route_output_key and ip_route_output helpers) for their IPv4 lookups, which calls xfrm_lookup_route(). This patch fixes this inconsistent behavior by switching the stub to ip6_dst_lookup_flow, which also calls xfrm_lookup_route(). This requires some changes in all the callers, as these two functions take different arguments and have different return types. Fixes: 5f81bd2e5d80 ("ipv6: export a stub for IPv6 symbols used by vxlan") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-25ipv4: mpls: fix mpls_xmit for iptunnelAlexey Kodanev1-4/+4
When using mpls over gre/gre6 setup, rt->rt_gw4 address is not set, the same for rt->rt_gw_family. Therefore, when rt->rt_gw_family is checked in mpls_xmit(), neigh_xmit() call is skipped. As a result, such setup doesn't work anymore. This issue was found with LTP mpls03 tests. Fixes: 1550c171935d ("ipv4: Prepare rtable for IPv6 gateway") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-18proc/sysctl: add shared variables for range checkMatteo Croce1-6/+4
In the sysctl code the proc_dointvec_minmax() function is often used to validate the user supplied value between an allowed range. This function uses the extra1 and extra2 members from struct ctl_table as minimum and maximum allowed value. On sysctl handler declaration, in every source file there are some readonly variables containing just an integer which address is assigned to the extra1 and extra2 members, so the sysctl range is enforced. The special values 0, 1 and INT_MAX are very often used as range boundary, leading duplication of variables like zero=0, one=1, int_max=INT_MAX in different source files: $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l 248 Add a const int array containing the most commonly used values, some macros to refer more easily to the correct array member, and use them instead of creating a local one for every object file. This is the bloat-o-meter output comparing the old and new binary compiled with the default Fedora config: # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164) Data old new delta sysctl_vals - 12 +12 __kstrtab_sysctl_vals - 12 +12 max 14 10 -4 int_max 16 - -16 one 68 - -68 zero 128 28 -100 Total: Before=20583249, After=20583085, chg -0.00% [mcroce@redhat.com: tipc: remove two unused variables] Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c] [arnd@arndb.de: proc/sysctl: make firmware loader table conditional] Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de [akpm@linux-foundation.org: fix fs/eventpoll.c] Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-12mpls: fix af_mpls dependencies for realMatteo Croce1-1/+1
Randy reported that selecting MPLS_ROUTING without PROC_FS breaks the build, because since commit c1a9d65954c6 ("mpls: fix af_mpls dependencies"), MPLS_ROUTING selects PROC_SYSCTL, but Kconfig's select doesn't recursively handle dependencies. Change the select into a dependency. Fixes: c1a9d65954c6 ("mpls: fix af_mpls dependencies") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09mpls: fix af_mpls dependenciesMatteo Croce1-0/+1
MPLS routing code relies on sysctl to work, so let it select PROC_SYSCTL. Reported-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09mpls: fix warning with multi-label encapGeorge Wilkie1-1/+1
If you configure a route with multiple labels, e.g. ip route add 10.10.3.0/24 encap mpls 16/100 via 10.10.2.2 dev ens4 A warning is logged: kernel: [ 130.561819] netlink: 'ip': attribute type 1 has an invalid length. This happens because mpls_iptunnel_policy has set the type of MPLS_IPTUNNEL_DST to fixed size NLA_U32. Change it to a minimum size. nla_get_labels() does the remaining validation. Fixes: e3e4712ec096 ("mpls: ip tunnel support") Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner2-11/+2
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21treewide: Add SPDX license identifier - Makefile/KconfigThomas Gleixner2-0/+2
Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21treewide: Add SPDX license identifier for more missed filesThomas Gleixner1-0/+1
Add SPDX license identifiers to all files which: - Have no license information of any form - Have MODULE_LICENCE("GPL*") inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27netlink: make validation more configurable for future strictnessJohannes Berg2-14/+16
We currently have two levels of strict validation: 1) liberal (default) - undefined (type >= max) & NLA_UNSPEC attributes accepted - attribute length >= expected accepted - garbage at end of message accepted 2) strict (opt-in) - NLA_UNSPEC attributes accepted - attribute length >= expected accepted Split out parsing strictness into four different options: * TRAILING - check that there's no trailing data after parsing attributes (in message or nested) * MAXTYPE - reject attrs > max known type * UNSPEC - reject attributes with NLA_UNSPEC policy entries * STRICT_ATTRS - strictly validate attribute size The default for future things should be *everything*. The current *_strict() is a combination of TRAILING and MAXTYPE, and is renamed to _deprecated_strict(). The current regular parsing has none of this, and is renamed to *_parse_deprecated(). Additionally it allows us to selectively set one of the new flags even on old policies. Notably, the UNSPEC flag could be useful in this case, since it can be arranged (by filling in the policy) to not be an incompatible userspace ABI change, but would then going forward prevent forgetting attribute entries. Similar can apply to the POLICY flag. We end up with the following renames: * nla_parse -> nla_parse_deprecated * nla_parse_strict -> nla_parse_deprecated_strict * nlmsg_parse -> nlmsg_parse_deprecated * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict * nla_parse_nested -> nla_parse_nested_deprecated * nla_validate_nested -> nla_validate_nested_deprecated Using spatch, of course: @@ expression TB, MAX, HEAD, LEN, POL, EXT; @@ -nla_parse(TB, MAX, HEAD, LEN, POL, EXT) +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression TB, MAX, NLA, POL, EXT; @@ -nla_parse_nested(TB, MAX, NLA, POL, EXT) +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT) @@ expression START, MAX, POL, EXT; @@ -nla_validate_nested(START, MAX, POL, EXT) +nla_validate_nested_deprecated(START, MAX, POL, EXT) @@ expression NLH, HDRLEN, MAX, POL, EXT; @@ -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT) +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT) For this patch, don't actually add the strict, non-renamed versions yet so that it breaks compile if I get it wrong. Also, while at it, make nla_validate and nla_parse go down to a common __nla_validate_parse() function to avoid code duplication. Ultimately, this allows us to have very strict validation for every new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the next patch, while existing things will continue to work as is. In effect then, this adds fully strict validation for any new command. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27netlink: make nla_nest_start() add NLA_F_NESTED flagMichal Kubecek1-1/+1
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most netlink based interfaces (including recently added ones) are still not setting it in kernel generated messages. Without the flag, message parsers not aware of attribute semantics (e.g. wireshark dissector or libmnl's mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display the structure of their contents. Unfortunately we cannot just add the flag everywhere as there may be userspace applications which check nlattr::nla_type directly rather than through a helper masking out the flags. Therefore the patch renames nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start() as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually are rewritten to use nla_nest_start(). Except for changes in include/net/netlink.h, the patch was generated using this semantic patch: @@ expression E1, E2; @@ -nla_nest_start(E1, E2) +nla_nest_start_noflag(E1, E2) @@ expression E1, E2; @@ -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED) +nla_nest_start(E1, E2) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22net: Rename net/nexthop.h net/rtnh.hDavid Ahern1-1/+1
The header contains rtnh_ macros so rename the file accordingly. Allows a later patch to use the nexthop.h name for the new nexthop code. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Add support to rtable for ipv6 gatewayDavid Ahern1-2/+3
Add support for an IPv6 gateway to rtable. Since a gateway is either IPv4 or IPv6, make it a union with rt_gw4 where rt_gw_family decides which address is in use. When dumping the route data, encode an ipv6 nexthop using RTA_VIA. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Prepare rtable for IPv6 gatewayDavid Ahern1-4/+7
To allow the gateway to be either an IPv4 or IPv6 address, remove rt_uses_gateway from rtable and replace with rt_gw_family. If rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway to rt_gw4 to represent the IPv4 version. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29ipv6: Move ipv6 stubs to a separate header fileDavid Ahern1-1/+1
The number of stubs is growing and has nothing to do with addrconf. Move the definition of the stubs to a separate header file and update users. In the move, drop the vxlan specific comment before ipv6_stub. Code move only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19mpls: Fix 6PE forwardingVinay K Nallamothu1-3/+9
This patch adds support for 6PE (RFC 4798) which uses IPv4-mapped IPv6 nexthop to connect IPv6 islands over IPv4 only MPLS network core. Prior to this fix, to find the link-layer destination mac address, 6PE enabled host/router was sending IPv6 ND requests for IPv4-mapped IPv6 nexthop address over the interface facing the IPv4 only core which wouldn't success as the core is IPv6 free. This fix changes that behavior on 6PE host to treat the nexthop as IPv4 address and send ARP requests whenever the next-hop address is an IPv4-mapped IPv6 address. Below topology illustrates the issue and how the patch addresses it. abcd::1.1.1.1 (lo) abcd::2.2.2.2 (lo) R0 (PE/host)------------------------R1--------------------------------R2 (PE/host) <--- IPv4 MPLS core ---> <------ IPv4 MPLS core --------> eth1 eth2 eth3 eth4 172.18.0.10 172.18.0.11 172.19.0.11 172.19.0.12 ffff::172.18.0.10 ffff::172.19.0.12 <------------------IPv6 MPLS tunnel ----------------------> R0 and R2 act as 6PE routers of IPv6 islands. R1 is IPv4 only with MPLS tunnels between R0,R1 and R1,R2. docker exec r0 ip -f inet6 route add abcd::2.2.2.2/128 nexthop encap mpls 100 via ::ffff:172.18.0.11 dev eth1 docker exec r2 ip -f inet6 route add abcd::1.1.1.1/128 nexthop encap mpls 200 via ::ffff:172.19.0.11 dev eth4 docker exec r1 ip -f mpls route add 100 via inet 172.19.0.12 dev eth3 docker exec r1 ip -f mpls route add 200 via inet 172.18.0.10 dev eth2 With the change, when R0 sends an IPv6 packet over MPLS tunnel to abcd::2.2.2.2, using ::ffff:172.18.0.11 as the nexthop, it does neighbor discovery for 172.18.18.0.11. Signed-off-by: Vinay K Nallamothu <nvinay@juniper.net> Tested-by: Avinash Lingala <ar977m@att.com> Tested-by: Aravind Srinivas Srinivasa Prabhakar <aprabh@juniper.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+3
2019-02-26mpls: Return error for RTA_GATEWAY attributeDavid Ahern1-0/+3
MPLS does not support nexthops with an MPLS address family. Specifically, it does not handle RTA_GATEWAY attribute. Make it clear by returning an error. Fixes: 03c0566542f4c ("mpls: Netlink commands to add, remove, and dump routes") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08mpls_iptunnel: use struct_size() helperGustavo A. R. Silva1-2/+2
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = alloc(sizeof(struct foo) + count * sizeof(struct boo)); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = alloc(struct_size(instance, entry, count)); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-19net: mpls: netconf: perform strict checks also for doit handlersJakub Kicinski1-3/+39
Make RTM_GETNETCONF's doit handler use strict checks when NETLINK_F_STRICT_CHK is set. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-19net: mpls: route: perform strict checks also for doit handlersJakub Kicinski1-2/+59
Make RTM_GETROUTE's doit handler use strict checks when NETLINK_F_STRICT_CHK is set. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net/mpls: Handle kernel side filtering of route dumpsDavid Ahern1-5/+28
Update the dump request parsing in MPLS for the non-INET case to enable kernel side filtering. If INET is disabled the only filters that make sense for MPLS are protocol and nexthop device. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net: Enable kernel side filtering of route dumpsDavid Ahern1-4/+5
Update parsing of route dump request to enable kernel side filtering. Allow filtering results by protocol (e.g., which routing daemon installed the route), route type (e.g., unicast), table id and nexthop device. These amount to the low hanging fruit, yet a huge improvement, for dumping routes. ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can be used to look up the device index without taking a reference. From there filter->dev is only used during dump loops with the lock still held. Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results have been filtered should no entries be returned. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net/mpls: Plumb support for filtering route dumpsDavid Ahern1-1/+41
Implement kernel side filtering of routes by egress device index and protocol. MPLS uses only a single table and route type. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net: Add struct for fib dump filterDavid Ahern1-4/+8
Add struct fib_dump_filter for options on limiting which routes are returned in a dump request. The current list is table id, protocol, route type, rtm_flags and nexthop device index. struct net is needed to lookup the net_device from the index. Declare the filter for each route dump handler and plumb the new arguments from dump handlers to ip_valid_fib_dump_req. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10net/mpls: Implement handler for strict data checking on dumpsDavid Ahern1-1/+35
Without CONFIG_INET enabled compiles fail with: net/mpls/af_mpls.o: In function `mpls_dump_routes': af_mpls.c:(.text+0xed0): undefined reference to `ip_valid_fib_dump_req' The preference is for MPLS to use the same handler as ipv4 and ipv6 to allow consistency when doing a dump for AF_UNSPEC which walks all address families invoking the route dump handler. If INET is disabled then fallback to an MPLS version which can be tighter on the data checks. Fixes: e8ba330ac0c5 ("rtnetlink: Update fib dumps for strict data checking") Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-08net: Update netconf dump handlers for strict data checkingDavid Ahern1-1/+17
Update inet_netconf_dump_devconf, inet6_netconf_dump_devconf, and mpls_netconf_dump_devconf for strict data checking. If the flag is set, the dump request is expected to have an netconfmsg struct as the header. The struct only has the family member and no attributes can be appended. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-08rtnetlink: Update fib dumps for strict data checkingDavid Ahern1-0/+8
Add helper to check netlink message for route dumps. If the strict flag is set the dump request is expected to have an rtmsg struct as the header. All elements of the struct are expected to be 0 with the exception of rtm_flags (which is used by both ipv4 and ipv6 dumps) and no attributes can be appended. rtm_flags can only have RTM_F_CLONED and RTM_F_PREFIX set. Update inet_dump_fib, inet6_dump_fib, mpls_dump_routes, ipmr_rtm_dumproute, and ip6mr_rtm_dumproute to call this helper if strict data checking is enabled. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-08net: Add extack to nlmsg_parseDavid Ahern1-1/+1
Make sure extack is passed to nlmsg_parse where easy to do so. Most of these are dump handlers and leveraging the extack in the netlink_callback. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24mpls: allow routes on ip6gre devicesSaif Hasan1-1/+5
Summary: This appears to be necessary and sufficient change to enable `MPLS` on `ip6gre` tunnels (RFC4023). This diff allows IP6GRE devices to be recognized by MPLS kernel module and hence user can configure interface to accept packets with mpls headers as well setup mpls routes on them. Test Plan: Test plan consists of multiple containers connected via GRE-V6 tunnel. Then carrying out testing steps as below. - Carry out necessary sysctl settings on all containers ``` sysctl -w net.mpls.platform_labels=65536 sysctl -w net.mpls.ip_ttl_propagate=1 sysctl -w net.mpls.conf.lo.input=1 ``` - Establish IP6GRE tunnels ``` ip -6 tunnel add name if_1_2_1 mode ip6gre \ local 2401:db00:21:6048:feed:0::1 \ remote 2401:db00:21:6048:feed:0::2 key 1 ip link set dev if_1_2_1 up sysctl -w net.mpls.conf.if_1_2_1.input=1 ip -4 addr add 169.254.0.2/31 dev if_1_2_1 scope link ip -6 tunnel add name if_1_3_1 mode ip6gre \ local 2401:db00:21:6048:feed:0::1 \ remote 2401:db00:21:6048:feed:0::3 key 1 ip link set dev if_1_3_1 up sysctl -w net.mpls.conf.if_1_3_1.input=1 ip -4 addr add 169.254.0.4/31 dev if_1_3_1 scope link ``` - Install MPLS encap rules on node-1 towards node-2 ``` ip route add 192.168.0.11/32 nexthop encap mpls 32/64 \ via inet 169.254.0.3 dev if_1_2_1 ``` - Install MPLS forwarding rules on node-2 and node-3 ``` // node2 ip -f mpls route add 32 via inet 169.254.0.7 dev if_2_4_1 // node3 ip -f mpls route add 64 via inet 169.254.0.12 dev if_4_3_1 ``` - Ping 192.168.0.11 (node4) from 192.168.0.1 (node1) (where routing towards 192.168.0.1 is via IP route directly towards node1 from node4) ``` ping 192.168.0.11 ``` - tcpdump on interface to capture ping packets wrapped within MPLS header which inturn wrapped within IP6GRE header ``` 16:43:41.121073 IP6 2401:db00:21:6048:feed::1 > 2401:db00:21:6048:feed::2: DSTOPT GREv0, key=0x1, length 100: MPLS (label 32, exp 0, ttl 255) (label 64, exp 0, [S], ttl 255) IP 192.168.0.1 > 192.168.0.11: ICMP echo request, id 1208, seq 45, length 64 0x0000: 6000 2cdb 006c 3c3f 2401 db00 0021 6048 `.,..l<?$....!`H 0x0010: feed 0000 0000 0001 2401 db00 0021 6048 ........$....!`H 0x0020: feed 0000 0000 0002 2f00 0401 0401 0100 ......../....... 0x0030: 2000 8847 0000 0001 0002 00ff 0004 01ff ...G............ 0x0040: 4500 0054 3280 4000 ff01 c7cb c0a8 0001 E..T2.@......... 0x0050: c0a8 000b 0800 a8d7 04b8 002d 2d3c a05b ...........--<.[ 0x0060: 0000 0000 bcd8 0100 0000 0000 1011 1213 ................ 0x0070: 1415 1617 1819 1a1b 1c1d 1e1f 2021 2223 .............!"# 0x0080: 2425 2627 2829 2a2b 2c2d 2e2f 3031 3233 $%&'()*+,-./0123 0x0090: 3435 3637 4567 ``` Signed-off-by: Saif Hasan <has@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-24mpls: remove trailing whitepaceStephen Hemminger1-1/+1
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04net: rename skb_gso_validate_mtu -> skb_gso_validate_network_lenDaniel Axtens1-1/+1
If you take a GSO skb, and split it into packets, will the network length (L3 headers + L4 headers + payload) of those packets be small enough to fit within a given MTU? skb_gso_validate_mtu gives you the answer to that question. However, we recently added to add a way to validate the MAC length of a split GSO skb (L2+L3+L4+payload), and the names get confusing, so rename skb_gso_validate_mtu to skb_gso_validate_network_len Signed-off-by: Daniel Axtens <dja@axtens.net> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-08mpls, nospec: Sanitize array index in mpls_label_ok()Dan Williams1-10/+14
mpls_label_ok() validates that the 'platform_label' array index from a userspace netlink message payload is valid. Under speculation the mpls_label_ok() result may not resolve in the CPU pipeline until after the index is used to access an array element. Sanitize the index to zero to prevent userspace-controlled arbitrary out-of-bounds speculation, a precursor for a speculative execution side channel vulnerability. Cc: <stable@vger.kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-04net: use rtnl_register_module where neededFlorian Westphal1-6/+9
all of these can be compiled as a module, so use new _module version to make sure module can no longer be removed while callback/dump is in use. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+1
Files removed in 'net-next' had their license header updated in 'net'. We take the remove from 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman1-0/+1
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12ip_tunnel: fix building with NET_IP_TUNNEL=mArnd Bergmann1-0/+1
When af_mpls is built-in but the tunnel support is a module, we get a link failure: net/mpls/af_mpls.o: In function `mpls_init': af_mpls.c:(.init.text+0xdc): undefined reference to `ip_tunnel_encap_add_ops' This adds a Kconfig statement to prevent the broken configuration and force mpls to be a module as well in this case. Fixes: bdc476413dcd ("ip_tunnel: add mpls over gre support") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Amine Kherbouche <amine.kherbouche@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11net: mpls: make function ipgre_mpls_encap_hlen staticColin Ian King1-1/+1
The function ipgre_mpls_encap_hlen is local to the source and does not need to be in global scope, so make it static. Cleans up sparse warning: symbol 'ipgre_mpls_encap_hlen' was not declared. Should it be static? Fixes: bdc476413dcdb ("ip_tunnel: add mpls over gre support") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07ip_tunnel: add mpls over gre supportAmine Kherbouche1-0/+36
This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel API by simply adding ipgre_tunnel_encap_(add|del)_mpls_ops() and the new tunnel type TUNNEL_ENCAP_MPLS. Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09rtnetlink: make rtnl_register accept a flags parameterFlorian Westphal1-4/+4
This change allows us to later indicate to rtnetlink core that certain doit functions should be called without acquiring rtnl_mutex. This change should have no effect, we simply replace the last (now unused) calcit argument with the new flag. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08mpls: fix uninitialized in_label var warning in mpls_getrouteRoopa Prabhu1-4/+8
Fix the below warning generated by static checker: net/mpls/af_mpls.c:2111 mpls_getroute() error: uninitialized symbol 'in_label'." Fixes: 397fc9e5cefe ("mpls: route get support") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-05mpls: fix rtm policy in mpls_getrouteRoopa Prabhu1-1/+1
fix rtm policy name typo in mpls_getroute and also remove export of rtm_ipv4_policy Fixes: 397fc9e5cefe ("mpls: route get support") Reported-by: David S. Miller <davem@davemloft.net> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-04mpls: route get supportRoopa Prabhu1-1/+162
This patch adds RTM_GETROUTE doit handler for mpls routes. Input: RTA_DST - input label RTA_NEWDST - labels in packet for multipath selection By default the getroute handler returns matched nexthop label, via and oif With RTM_F_FIB_MATCH flag, full matched route is returned. example (with patched iproute2): $ip -f mpls route show 101 nexthop as to 102/103 via inet 172.16.2.2 dev virt1-2 nexthop as to 302/303 via inet 172.16.12.2 dev virt1-12 201 nexthop as to 202/203 via inet6 2001:db8:2::2 dev virt1-2 nexthop as to 402/403 via inet6 2001:db8:12::2 dev virt1-12 $ip -f mpls route get 103 RTNETLINK answers: Network is unreachable $ip -f mpls route get 101 101 as to 102/103 via inet 172.16.2.2 dev virt1-2 $ip -f mpls route get as to 302/303 101 101 as to 302/303 via inet 172.16.12.2 dev virt1-12 $ip -f mpls route get fibmatch 103 RTNETLINK answers: Network is unreachable $ip -f mpls route get fibmatch 101 101 nexthop as to 102/103 via inet 172.16.2.2 dev virt1-2 nexthop as to 302/303 via inet 172.16.12.2 dev virt1-12 Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
Just some simple overlapping changes in marvell PHY driver and the DSA core code. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-31mpls: fix clearing of dead nh_flags on link upRoopa Prabhu1-1/+1
recent fixes to use WRITE_ONCE for nh_flags on link up, accidently ended up leaving the deadflags on a nh. This patch fixes the WRITE_ONCE to use freshly evaluated nh_flags. Fixes: 39eb8cd17588 ("net: mpls: rt_nhn_alive and nh_flags should be accessed using READ_ONCE") Reported-by: Satish Ashok <sashok@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: mpls: remove unnecessary initialization of errDavid Ahern1-2/+0
err is initialized to EINVAL and not used before it is set again. Remove the unnecessary initialization. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: mpls: Make nla_get_via in af_mpls.cDavid Ahern2-50/+48
nla_get_via is only used in af_mpls.c. Remove declaration from internal.h and move up in af_mpls.c before first use. Code move only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: mpls: Add extack messages for route add and delete failuresDavid Ahern2-40/+87
Add error messages for failures in adding and deleting mpls routes. This covers most of the annoying EINVAL errors. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: mpls: Pull common label check into helperDavid Ahern1-15/+17
mpls_route_add and mpls_route_del have the same checks on the label. Move to a helper. Avoid duplicate extack messages in the next patch. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: Fill in extack for mpls lwt encapDavid Ahern3-16/+47
Fill in extack for errors in build_state for mpls lwt encap including passing extack to nla_get_labels and adding error messages for failures in it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: add extack arg to lwtunnel build stateDavid Ahern1-2/+3
Pass extack arg down to lwtunnel_build_state and the build_state callbacks. Add messages for failures in lwtunnel_build_state, and add the extarg to nla_parse where possible in the build_state callbacks. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08treewide: use kv[mz]alloc* rather than opencoded variantsMichal Hocko1-4/+1
There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-17net: rtnetlink: plumb extended ack to doit functionDavid Ahern1-3/+6
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse for doit functions that call it directly. This is the first step to using extended error reporting in rtnetlink. >From here individual subsystems can be updated to set netlink_ext_ack as needed. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13netlink: pass extended ACK struct to parsing functionsJohannes Berg2-3/+4
Pass the new extended ACK reporting struct to all of the generic netlink parsing functions. For now, pass NULL in almost all callers (except for some in the core.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01net: mpls: Increase max number of labels for lwt encapDavid Ahern3-8/+15
Alow users to push down more labels per MPLS encap. Similar to LSR case, move label array to the end of mpls_iptunnel_encap and allocate based on the number of labels for the route. For consistency with the LSR case, re-use the same maximum number of labels. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01net: mpls: bump maximum number of labelsDavid Ahern2-34/+71
Allow users to push down more labels per MPLS route. With the previous patches, no memory allocations are based on MAX_NEW_LABELS; the limit is only used to keep userspace in check. At this point MAX_NEW_LABELS is only used for mpls_route_config (copying route data from userspace) and processing nexthops looking for the max number of labels across the route spec. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01net: mpls: Limit memory allocation for mpls_routeDavid Ahern1-10/+21
Limit memory allocation size for mpls_route to 4096. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01net: mpls: change mpls_route layoutDavid Ahern2-31/+51
Move labels to the end of mpls_nh as a 0-sized array and within mpls_route move the via for a nexthop after the mpls_nh. The new layout becomes: +----------------------+ | mpls_route | +----------------------+ | mpls_nh 0 | +----------------------+ | alignment padding | 4 bytes for odd number of labels; 0 for even +----------------------+ | via[rt_max_alen] 0 | +----------------------+ | alignment padding | via's aligned on sizeof(unsigned long) +----------------------+ | ... | +----------------------+ | mpls_nh n-1 | +----------------------+ | via[rt_max_alen] n-1 | +----------------------+ Memory allocated for nexthop + via is constant across all nexthops and their via. It is based on the maximum number of labels across all nexthops and the maximum via length. The size is saved in the mpls_route as rt_nh_size. Accessing a nexthop becomes rt->rt_nh + index * rt->rt_nh_size. The offset of the via address from a nexthop is saved as rt_via_offset so that given an mpls_nh pointer the via for that hop is simply nh + rt->rt_via_offset. With prior code, memory allocated per mpls_route with 1 nexthop: via is an ethernet address - 64 bytes via is an ipv4 address - 64 via is an ipv6 address - 72 With this patch set, memory allocated per mpls_route with 1 nexthop and 1 or 2 labels: via is an ethernet address - 56 bytes via is an ipv4 address - 56 via is an ipv6 address - 64 The 8-byte reduction is due to the previous patch; the change introduced by this patch has no impact on the size of allocations for 1 or 2 labels. Performance impact of this change was examined using network namespaces with veth pairs connecting namespaces. ns0 inserts the packet to the label-switched path using an lwt route with encap mpls. ns1 adds 1 or 2 labels depending on test, ns2 (and ns3 for 2-label test) pops the label and forwards. ns3 (or ns4) for a 2-label is the destination. Similar series of namespaces used for 2-nexthop test. Intent is to measure changes to latency (overhead in manipulating the packet) in the forwarding path. Tests used netperf with UDP_RR. IPv4: current patches 1 label, 1 nexthop 29908 30115 2 label, 1 nexthop 29071 29612 1 label, 2 nexthop 29582 29776 2 label, 2 nexthop 29086 29149 IPv6: current patches 1 label, 1 nexthop 24502 24960 2 label, 1 nexthop 24041 24407 1 label, 2 nexthop 23795 23899 2 label, 2 nexthop 23074 22959 In short, the change has no effect to a modest increase in performance. This is expected since this patch does not really have an impact on routes with 1 or 2 labels (the current limit) and 1 or 2 nexthops. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01net: mpls: Convert number of nexthops to u8David Ahern2-13/+20
Number of nexthops and number of alive nexthops are tracked using an unsigned int. A route should never have more than 255 nexthops so convert both to u8. Update all references and intermediate variables to consistently use u8 as well. Shrinks the size of mpls_route from 32 bytes to 24 bytes with a 2-byte hole before the nexthops. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01net: mpls: rt_nhn_alive and nh_flags should be accessed using READ_ONCEDavid Ahern2-11/+33
The number of alive nexthops for a route (rt->rt_nhn_alive) and the flags for a next hop (nh->nh_flags) are modified by netdev event handlers. The event handlers run with rtnl_lock held so updates are always done with the lock held. The packet path accesses the fields under the rcu lock. Since those fields can change at any moment in the packet path, both fields should be accessed using READ_ONCE. Updates to both fields should use WRITE_ONCE. Update mpls_select_multipath (packet path) and mpls_ifdown and mpls_ifup (event handlers) accordingly. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29net: mpls: Update lfib_nlmsg_size to skip deleted nexthopsDavid Ahern1-0/+2
A recent commit skips nexthops in a route if the device has been deleted. Update lfib_nlmsg_size accordingly. Reported-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28net: mpls: Send netconf messages on device register and unregisterDavid Ahern1-5/+11
Send netconf notifications for MPLS when the device registers and unregisters. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28net:mpls: Refactor mpls_netconf_notify_devconf to take eventDavid Ahern1-7/+5
Refactor mpls_netconf_notify_devconf to take the event as an input arg. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27net: mpls: Delete route when all nexthops have been deletedDavid Ahern1-1/+8
When all devices for all nexthops in a route have been deleted, the route is effectively dead, so remove it. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27net: mpls: Don't show nexthop if device has been deletedDavid Ahern1-3/+5
If the device for a nexthop in a multipath route is deleted, the nexthop is effectively removed from the route. Currently, a route dump still returns the nexhop though without the device set: $ ip -f mpls ro ls 100 nexthopvia inet 10.11.1.2 dev br0 nexthopvia inet 10.100.3.1 dev eth3 $ ip li del br0 $ ip -f mpls ro ls 100 nexthopvia inet 10.11.1.2 dev * dead linkdown nexthopvia inet 10.100.3.1 dev eth3 Since the nexthop is effectively deleted, drop the hop from the route dump. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: mpls: Fix setting ttl_propagate for rt2David Ahern1-1/+1
Fix copy and paste error setting rt_ttl_propagate. Fixes: 5b441ac8784c1 ("mpls: allow TTL propagation to IP packets to be configured") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-3/+10
Conflicts: drivers/net/ethernet/broadcom/genet/bcmmii.c drivers/net/hyperv/netvsc.c kernel/bpf/hashtab.c Almost entirely overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16net: mpls: Fix nexthop alive tracking on down eventsDavid Ahern1-3/+10
Alive tracking of nexthops can account for a link twice if the carrier goes down followed by an admin down of the same link rendering multipath routes useless. This is similar to 79099aab38c8 for UNREGISTER events and DOWN events. Fix by tracking number of alive nexthops in mpls_ifdown similar to the logic in mpls_ifup. Checking the flags per nexthop once after all events have been processed is simpler than trying to maintian a running count through all event combinations. Also, WRITE_ONCE is used instead of ACCESS_ONCE to set rt_nhn_alive per a comment from checkpatch: WARNING: Prefer WRITE_ONCE(<FOO>, <BAR>) over ACCESS_ONCE(<FOO>) = <BAR> Fixes: c89359a42e2a4 ("mpls: support for dead routes") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+3
Conflicts: drivers/net/ethernet/broadcom/genet/bcmgenet.c net/core/sock.c Conflicts were overlapping changes in bcmgenet and the lockdep handling of sockets. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13mpls: allow TTL propagation from IP packets to be configuredRobert Shearman2-13/+71
Allow TTL propagation from IP packets to MPLS packets to be configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which allows the TTL to be set in the resulting MPLS packet, with the value of 0 having the semantics of enabling propagation of the TTL from the IP header (i.e. non-zero values disable propagation). Also allow the configuration to be overridden globally by reusing the same sysctl to control whether the TTL is propagated from IP packets into the MPLS header. If the per-LWT attribute is set then it overrides the global configuration. If the TTL isn't propagated then a default TTL value is used which can be configured via a new sysctl, "net.mpls.default_ttl". This is kept separate from the configuration of whether IP TTL propagation is enabled as it can be used in the future when non-IP payloads are supported (i.e. where there is no payload TTL that can be propagated). Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13mpls: allow TTL propagation to IP packets to be configuredRobert Shearman2-8/+86
Provide the ability to control on a per-route basis whether the TTL value from an MPLS packet is propagated to an IPv4/IPv6 packet when the last label is popped as per the theoretical model in RFC 3443 through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to mean disable propagation and 1 to mean enable propagation. In order to provide the ability to change the behaviour for packets arriving with IPv4/IPv6 Explicit Null labels and to provide an easy way for a user to change the behaviour for all existing routes without having to reprogram them, a global knob is provided. This is done through the addition of a new per-namespace sysctl, "net.mpls.ip_ttl_propagate", which defaults to enabled. If the per-route attribute is set (either enabled or disabled) then it overrides the global configuration. Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12mpls: Do not decrement alive counter for unregister eventsDavid Ahern1-1/+2
Multipath routes can be rendered usesless when a device in one of the paths is deleted. For example: $ ip -f mpls ro ls 100 nexthop as to 200 via inet 172.16.2.2 dev virt12 nexthop as to 300 via inet 172.16.3.2 dev br0 101 nexthop as to 201 via inet6 2000:2::2 dev virt12 nexthop as to 301 via inet6 2000:3::2 dev br0 $ ip li del br0 When br0 is deleted the other hop is not considered in mpls_select_multipath because of the alive check -- rt_nhn_alive is 0. rt_nhn_alive is decremented once in mpls_ifdown when the device is taken down (NETDEV_DOWN) and again when it is deleted (NETDEV_UNREGISTER). For a 2 hop route, deleting one device drops the alive count to 0. Since devices are taken down before unregistering, the decrement on NETDEV_UNREGISTER is redundant. Fixes: c89359a42e2a4 ("mpls: support for dead routes") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12mpls: Send route delete notifications when router module is unloadedDavid Ahern1-0/+1
When the mpls_router module is unloaded, mpls routes are deleted but notifications are not sent to userspace leaving userspace caches out of sync. Add the call to mpls_notify_route in mpls_net_exit as routes are freed. Fixes: 0189197f44160 ("mpls: Basic routing support") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20net: mpls: Add support for netconfDavid Ahern2-3/+211
Add netconf support to MPLS. Allows userpsace to learn and be notified of changes to 'input' enable setting per interface. Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30lwtunnel: remove device arg to lwtunnel_build_stateDavid Ahern1-1/+1
Nothing about lwt state requires a device reference, so remove the input argument. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-23/+26
Two trivial overlapping changes conflicts in MPLS and mlx5. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: Specify the owning module for lwtunnel opsRobert Shearman1-0/+1
Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so specify the owning module for all lwtunnel ops. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-23net: mpls: Fix multipath selection for LSR use caseDavid Ahern1-23/+25
MPLS multipath for LSR is broken -- always selecting the first nexthop in the one label case. For example: $ ip -f mpls ro ls 100 nexthop as to 200 via inet 172.16.2.2 dev virt12 nexthop as to 300 via inet 172.16.3.2 dev virt13 101 nexthop as to 201 via inet6 2000:2::2 dev virt12 nexthop as to 301 via inet6 2000:3::2 dev virt13 In this example incoming packets have a single MPLS labels which means BOS bit is set. The BOS bit is passed from mpls_forward down to mpls_multipath_hash which never processes the hash loop because BOS is 1. Update mpls_multipath_hash to process the entire label stack. mpls_hdr_len tracks the total mpls header length on each pass (on pass N mpls_hdr_len is N * sizeof(mpls_shim_hdr)). When the label is found with the BOS set it verifies the skb has sufficient header for ipv4 or ipv6, and find the IPv4 and IPv6 header by using the last mpls_hdr pointer and adding 1 to advance past it. With these changes I have verified the code correctly sees the label, BOS, IPv4 and IPv6 addresses in the network header and icmp/tcp/udp traffic for ipv4 and ipv6 are distributed across the nexthops. Fixes: 1c78efa8319ca ("mpls: flow-based multipath selection") Acked-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17mpls: Packet statsRobert Shearman3-28/+222
Having MPLS packet stats is useful for observing network operation and for diagnosing network problems. In the absence of anything better, RFC2863 and RFC3813 are used for guidance for which stats to expose and the semantics of them. In particular rx_noroutes maps to in unknown protos in RFC2863. The stats are exposed to userspace via AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of RTM_GETSTATS messages. All the introduced fields are 64-bit, even error ones, to ensure no overflow with long uptimes. Per-CPU counters are used to avoid cache-line contention on the commonly used fields. The other fields have also been made per-CPU for code to avoid performance problems in error conditions on the assumption that on some platforms the cost of atomic operations could be more expensive than sending the packet (which is what would be done in the success case). If that's not the case, we could instead not use per-CPU counters for these fields. Only unicast and non-fragment are exposed at the moment, but other counters can be exposed in the future either by adding to the end of struct mpls_link_stats or by additional netlink attributes in the AF_MPLS IFLA_STATS_AF_SPEC nested attribute. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
2016-12-05net: af_mpls.c add space before open parenthesisSuraj Deshmukh1-1/+1
Adding space after switch keyword before open parenthesis for readability purpose. This patch fixes the checkpatch.pl warning: space required before the open parenthesis '(' Signed-off-by: Suraj Deshmukh <surajssd009005@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23lwt: Remove unused len fieldThomas Graf1-4/+1
The field is initialized by ILA and MPLS but never used. Remove it. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03mpls: move mpls_hdr to a common locationJiri Benc1-9/+1
This will be also used by openvswitch. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01mpls: get rid of trivial returnsstephen hemminger1-5/+0
return at end of function is useless. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30net: mpls: Fixups for GSODavid Ahern2-11/+33
As reported by Lennert the MPLS GSO code is failing to properly segment large packets. There are a couple of problems: 1. the inner protocol is not set so the gso segment functions for inner protocol layers are not getting run, and 2 MPLS labels for packets that use the "native" (non-OVS) MPLS code are not properly accounted for in mpls_gso_segment. The MPLS GSO code was added for OVS. It is re-using skb_mac_gso_segment to call the gso segment functions for the higher layer protocols. That means skb_mac_gso_segment is called twice -- once with the network protocol set to MPLS and again with the network protocol set to the inner protocol. This patch sets the inner skb protocol addressing item 1 above and sets the network_header and inner_network_header to mark where the MPLS labels start and end. The MPLS code in OVS is also updated to set the two network markers. >From there the MPLS GSO code uses the difference between the network header and the inner network header to know the size of the MPLS header that was pushed. It then pulls the MPLS header, resets the mac_len and protocol for the inner protocol and then calls skb_mac_gso_segment to segment the skb. Afterward the inner protocol segmentation is done the skb protocol is set to mpls for each segment and the network and mac headers restored. Reported-by: Lennert Buytenhek <buytenh@wantstofly.org> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30net: lwtunnel: Handle fragmentationRoopa Prabhu1-4/+5
Today mpls iptunnel lwtunnel_output redirect expects the tunnel output function to handle fragmentation. This is ok but can be avoided if we did not do the mpls output redirect too early. ie we could wait until ip fragmentation is done and then call mpls output for each ip fragment. To make this work we will need, 1) the lwtunnel state to carry encap headroom 2) and do the redirect to the encap output handler on the ip fragment (essentially do the output redirect after fragmentation) This patch adds tunnel headroom in lwtstate to make sure we account for tunnel data in mtu calculations during fragmentation and adds new xmit redirect handler to redirect to lwtunnel xmit func after ip fragmentation. This includes IPV6 and some mtu fixes and testing from David Ahern. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09mpls: allow routes on ipip and sit devicesSimon Horman1-2/+4
Allow MPLS routes on IPIP and SIT devices now that they support forwarding MPLS packets. Signed-off-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16mpls: allow routes on ipgre devicesSimon Horman1-3/+4
This appears to be necessary and sufficient to provide MPLS in GRE (RFC4023) support. This can be used by establishing an ipgre tunnel device and then routing MPLS over it. The following example will forward MPLS frames received with an outermost MPLS label 100 over tun1, a GRE tunnel. The forwarded packet will have the outermost MPLS LSE removed and two new LSEs added with labels 200 (outermost) and 300 (next). ip link add name tun1 type gre remote 10.0.99.193 local 10.0.99.192 ttl 225 ip link set up dev tun1 ip addr add 10.0.98.192/24 dev tun1 ip route sh echo 1 > /proc/sys/net/mpls/conf/eth0/input echo 101 > /proc/sys/net/mpls/platform_labels ip -f mpls route add 100 as 200/300 via inet 10.0.98.193 ip -f mpls route sh Also remove unnecessary braces. Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Acked-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-03skbuff: introduce skb_gso_validate_mtuMarcelo Ricardo Leitner1-1/+1
skb_gso_network_seglen is not enough for checking fragment sizes if skb is using GSO_BY_FRAGS as we have to check frag per frag. This patch introduces skb_gso_validate_mtu, based on the former, which will wrap the use case inside it as all calls to skb_gso_network_seglen were to validate if it fits on a given TMU, and improve the check. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20gso: Remove arbitrary checks for unsupported GSOTom Herbert1-10/+1
In several gso_segment functions there are checks of gso_type against a seemingly arbitrary list of SKB_GSO_* flags. This seems like an attempt to identify unsupported GSO types, but since the stack is the one that set these GSO types in the first place this seems unnecessary to do. If a combination isn't valid in the first place that stack should not allow setting it. This is a code simplication especially for add new GSO types. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14GSO: Add GSO type for fixed IPv4 IDAlexander Duyck1-0/+1
This patch adds support for TSO using IPv4 headers with a fixed IP ID field. This is meant to allow us to do a lossless GRO in the case of TCP flows that use a fixed IP ID such as those that convert IPv6 header to IPv4 headers. In addition I am adding a feature that for now I am referring to TSO with IP ID mangling. Basically when this flag is enabled the device has the option to either output the flow with incrementing IP IDs or with a fixed IP ID regardless of what the original IP ID ordering was. This is useful in cases where the DF bit is set and we do not care if the original IP ID value is maintained. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-08mpls: find_outdev: check for err ptr in addition to NULL checkRoopa Prabhu1-0/+3
find_outdev calls inet{,6}_fib_lookup_dev() or dev_get_by_index() to find the output device. In case of an error, inet{,6}_fib_lookup_dev() returns error pointer and dev_get_by_index() returns NULL. But the function only checks for NULL and thus can end up calling dev_put on an ERR_PTR. This patch adds an additional check for err ptr after the NULL check. Before: Trying to add an mpls route with no oif from user, no available path to 10.1.1.8 and no default route: $ip -f mpls route add 100 as 200 via inet 10.1.1.8 [ 822.337195] BUG: unable to handle kernel NULL pointer dereference at 00000000000003a3 [ 822.340033] IP: [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182 [ 822.340033] PGD 1db38067 PUD 1de9e067 PMD 0 [ 822.340033] Oops: 0000 [#1] SMP [ 822.340033] Modules linked in: [ 822.340033] CPU: 0 PID: 11148 Comm: ip Not tainted 4.5.0-rc7+ #54 [ 822.340033] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014 [ 822.340033] task: ffff88001db82580 ti: ffff88001dad4000 task.ti: ffff88001dad4000 [ 822.340033] RIP: 0010:[<ffffffff8148781e>] [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182 [ 822.340033] RSP: 0018:ffff88001dad7a88 EFLAGS: 00010282 [ 822.340033] RAX: ffffffffffffff9b RBX: ffffffffffffff9b RCX: 0000000000000002 [ 822.340033] RDX: 00000000ffffff9b RSI: 0000000000000008 RDI: 0000000000000000 [ 822.340033] RBP: ffff88001ddc9ea0 R08: ffff88001e9f1768 R09: 0000000000000000 [ 822.340033] R10: ffff88001d9c1100 R11: ffff88001e3c89f0 R12: ffffffff8187e0c0 [ 822.340033] R13: ffffffff8187e0c0 R14: ffff88001ddc9e80 R15: 0000000000000004 [ 822.340033] FS: 00007ff9ed798700(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000 [ 822.340033] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 822.340033] CR2: 00000000000003a3 CR3: 000000001de89000 CR4: 00000000000006f0 [ 822.340033] Stack: [ 822.340033] 0000000000000000 0000000100000000 0000000000000000 0000000000000000 [ 822.340033] 0000000000000000 0801010a00000000 0000000000000000 0000000000000000 [ 822.340033] 0000000000000004 ffffffff8148749b ffffffff8187e0c0 000000000000001c [ 822.340033] Call Trace: [ 822.340033] [<ffffffff8148749b>] ? mpls_rt_alloc+0x2b/0x3e [ 822.340033] [<ffffffff81488e66>] ? mpls_rtm_newroute+0x358/0x3e2 [ 822.340033] [<ffffffff810e7bbc>] ? get_page+0x5/0xa [ 822.340033] [<ffffffff813b7d94>] ? rtnetlink_rcv_msg+0x17e/0x191 [ 822.340033] [<ffffffff8111794e>] ? __kmalloc_track_caller+0x8c/0x9e [ 822.340033] [<ffffffff813c9393>] ? rht_key_hashfn.isra.20.constprop.57+0x14/0x1f [ 822.340033] [<ffffffff813b7c16>] ? __rtnl_unlock+0xc/0xc [ 822.340033] [<ffffffff813cb794>] ? netlink_rcv_skb+0x36/0x82 [ 822.340033] [<ffffffff813b4507>] ? rtnetlink_rcv+0x1f/0x28 [ 822.340033] [<ffffffff813cb2b1>] ? netlink_unicast+0x106/0x189 [ 822.340033] [<ffffffff813cb5b3>] ? netlink_sendmsg+0x27f/0x2c8 [ 822.340033] [<ffffffff81392ede>] ? sock_sendmsg_nosec+0x10/0x1b [ 822.340033] [<ffffffff81393df1>] ? ___sys_sendmsg+0x182/0x1e3 [ 822.340033] [<ffffffff810e4f35>] ? __alloc_pages_nodemask+0x11c/0x1e4 [ 822.340033] [<ffffffff8110619c>] ? PageAnon+0x5/0xd [ 822.340033] [<ffffffff811062fe>] ? __page_set_anon_rmap+0x45/0x52 [ 822.340033] [<ffffffff810e7bbc>] ? get_page+0x5/0xa [ 822.340033] [<ffffffff810e85ab>] ? __lru_cache_add+0x1a/0x3a [ 822.340033] [<ffffffff81087ea9>] ? current_kernel_time64+0x9/0x30 [ 822.340033] [<ffffffff813940c4>] ? __sys_sendmsg+0x3c/0x5a [ 822.340033] [<ffffffff8148f597>] ? entry_SYSCALL_64_fastpath+0x12/0x6a [ 822.340033] Code: 83 08 04 00 00 65 ff 00 48 8b 3c 24 e8 40 7c f2 ff eb 13 48 c7 c3 9f ff ff ff eb 0f 89 ce e8 f1 ae f1 ff 48 89 c3 48 85 db 74 15 <48> 8b 83 08 04 00 00 65 ff 08 48 81 fb 00 f0 ff ff 76 0d eb 07 [ 822.340033] RIP [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182 [ 822.340033] RSP <ffff88001dad7a88> [ 822.340033] CR2: 00000000000003a3 [ 822.435363] ---[ end trace 98cc65e6f6b8bf11 ]--- After patch: $ip -f mpls route add 100 as 200 via inet 10.1.1.8 RTNETLINK answers: Network is unreachable Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reported-by: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21mpls: autoload lwt moduleRobert Shearman1-0/+1
Avoid users having to manually load the module by adding a module alias allowing it to be autoloaded by the lwt infra. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-14/+33
Conflicts: drivers/net/geneve.c Here we had an overlapping change, where in 'net' the extraneous stats bump was being removed whilst in 'net-next' the final argument to udp_tunnel6_xmit_skb() was being changed. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-12mpls: make via address optional for multipath routesRobert Shearman1-9/+13
The via address is optional for a single path route, yet is mandatory when the multipath attribute is used: # ip -f mpls route add 100 dev lo # ip -f mpls route add 101 nexthop dev lo RTNETLINK answers: Invalid argument Make them consistent by making the via address optional when the RTA_MULTIPATH attribute is being parsed so that both forms of specifying the route work. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-12mpls: fix out-of-bounds access when via address not specifiedRobert Shearman1-5/+12
When a via address isn't specified, the via table is left initialised to 0 (NEIGH_ARP_TABLE), and the via address length also left initialised to 0. This results in a via address array of length 0 being allocated (contiguous with route and nexthop array), meaning that when a packet is sent using neigh_xmit the neighbour lookup and creation will cause an out-of-bounds access when accessing the 4 bytes of the IPv4 address it assumes it has been given a pointer to. This could be fixed by allocating the 4 bytes of via address necessary and leaving it as all zeroes. However, it seems wrong to me to use an ipv4 nexthop (including possibly ARPing for 0.0.0.0) when the user didn't specify to do so. Instead, set the via address table to NEIGH_NR_TABLES to signify it hasn't been specified and use this at forwarding time to signify a neigh_xmit using an L2 address consisting of the device address. This mechanism is the same as that used for both ARP and ND for loopback interfaces and those flagged as no-arp, which are all we can really support in this case. Fixes: cf4b24f0024f ("mpls: reduce memory usage of routes") Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-12mpls: don't dump RTA_VIA attribute if not specifiedRobert Shearman1-2/+6
The problem seen is that when adding a route with a nexthop with no via address specified, iproute2 generates bogus output: # ip -f mpls route add 100 dev lo # ip -f mpls route list 100 via inet 0.0.8.0 dev lo The reason for this is that the kernel generates an RTA_VIA attribute with the family set to AF_INET, but the via address data having zero length. The cause of family being AF_INET is that on route insert cfg->rc_via_table is left set to 0, which just happens to be NEIGH_ARP_TABLE which is then translated into AF_INET. iproute2 doesn't validate the length prior to printing and so prints garbage. Although it could be fixed to do the validation, I would argue that AF_INET addresses should always be exactly 4 bytes so the kernel is really giving userspace bogus data. Therefore, avoid generating the RTA_VIA attribute when dumping the route if the via address wasn't specified on add/modify. This is indicated by NEIGH_ARP_TABLE and a zero via address length - if the user specified a via address the address length would have been validated such that it was 4 bytes. Although this is a change in behaviour that is visible to userspace, I believe that what was generated before was invalid and as such userspace wouldn't be expecting it. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-12mpls: validate L2 via address lengthRobert Shearman1-0/+4
If an L2 via address for an mpls nexthop is specified, the length of the L2 address must match that expected by the output device, otherwise it could access memory beyond the end of the via address buffer in the route. This check was present prior to commit f8efb73c97e2 ("mpls: multipath route support"), but got lost in the refactoring, so add it back, applying it to all nexthops in multipath routes. Fixes: f8efb73c97e2 ("mpls: multipath route support") Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-11mpls_iptunnel: add static qualifier to mpls_outputRoopa Prabhu1-1/+1
This gets rid of the following compile warn: net/mpls/mpls_iptunnel.c:40:5: warning: no previous prototype for mpls_output [-Wmissing-prototypes] Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-07mpls: fix sending of local encapped packetsRobert Shearman1-2/+2
Locally generated IPv4 and (probably) IPv6 packets are dropped because skb->protocol isn't set. We could write wrappers to lwtunnel_output for IPv4 and IPv6 that set the protocol accordingly and then call lwtunnel_output, but mpls_output relies on the AF-specific type of dst anyway to get the via address. Therefore, make use of dst->dst_ops->family in mpls_output to determine the type of nexthop and thus protocol of the packet instead of checking skb->protocol. Fixes: 61adedf3e3f1 ("route: move lwtunnel state to dst_entry") Reported-by: Sam Russell <sam.h.russell@gmail.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03mpls: support for dead routesRoopa Prabhu2-28/+159
Adds support for RTNH_F_DEAD and RTNH_F_LINKDOWN flags on mpls routes due to link events. Also adds code to ignore dead routes during route selection. Unlike ip routes, mpls routes are not deleted when the route goes dead. This is current mpls behaviour and this patch does not change that. With this patch however, routes will be marked dead. dead routes are not notified to userspace (this is consistent with ipv4 routes). dead routes: ----------- $ip -f mpls route show 100 nexthop as to 200 via inet 10.1.1.2 dev swp1 nexthop as to 700 via inet 10.1.1.6 dev swp2 $ip link set dev swp1 down $ip link show dev swp1 4: swp1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff $ip -f mpls route show 100 nexthop as to 200 via inet 10.1.1.2 dev swp1 dead linkdown nexthop as to 700 via inet 10.1.1.6 dev swp2 linkdown routes: ---------------- $ip -f mpls route show 100 nexthop as to 200 via inet 10.1.1.2 dev swp1 nexthop as to 700 via inet 10.1.1.6 dev swp2 $ip link show dev swp1 4: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff /* carrier goes down */ $ip link show dev swp1 4: swp1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff $ip -f mpls route show 100 nexthop as to 200 via inet 10.1.1.2 dev swp1 linkdown nexthop as to 700 via inet 10.1.1.6 dev swp2 Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-27mpls: reduce memory usage of routesRobert Shearman2-38/+111
Nexthops for MPLS routes have a via address field sized for the largest via address that is expected, which is 32 bytes. This means that in the most common case of having ipv4 via addresses, 28 bytes of memory more than required are used per nexthop. In the other common case of an ipv6 nexthop then 16 bytes more than required are used. With large numbers of MPLS routes this extra memory usage could start to become significant. To avoid allocating memory for a maximum length via address when not all of it is required and to allow for ease of iterating over nexthops, then the via addresses are changed to be stored in the same memory block as the route and nexthops, but in an array after the end of the array of nexthops. New accessors are provided to retrieve a pointer to the via address. To allow for O(1) access without having to store a pointer or offset per nh, the via address for each nexthop is sized according to the maximum via address for any nexthop in the route, which is stored in a new route field, rt_max_alen, but this is in an existing hole in struct mpls_route so it doesn't increase the size of the structure. Each via address is ensured to be aligned to VIA_ALEN_ALIGN to account for architectures that don't allow unaligned accesses. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-27mpls: fix forwarding using v4/v6 explicit nullRobert Shearman1-0/+2
Fill in the via address length for the predefined IPv4 and IPv6 explicit-null label routes. Fixes: f8efb73c97e2 ("mpls: multipath route support") Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23mpls: flow-based multipath selectionRobert Shearman1-4/+83
Change the selection of a multipath route to use a flow-based hash. This more suitable for traffic sensitive to reordering within a flow (e.g. TCP, L2VPN) and whilst still allowing a good distribution of traffic given enough flows. Selection of the path for a multipath route is done using a hash of: 1. Label stack up to MAX_MP_SELECT_LABELS labels or up to and including entropy label, whichever is first. 2. 3-tuple of (L3 src, L3 dst, proto) from IPv4/IPv6 header in MPLS payload, if present. Naturally, a 5-tuple hash using L4 information in addition would be possible and be better in some scenarios, but there is a tradeoff between looking deeper into the packet to achieve good distribution, and packet forwarding performance, and I have erred on the side of the latter as the default. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23mpls: multipath route supportRoopa Prabhu2-148/+400
This patch adds support for MPLS multipath routes. Includes following changes to support multipath: - splits struct mpls_route into 'struct mpls_route + struct mpls_nh' - 'struct mpls_nh' represents a mpls nexthop label forwarding entry - moves mpls route and nexthop structures into internal.h - A mpls_route can point to multiple mpls_nh structs - the nexthops are maintained as a array (similar to ipv4 fib) - In the process of restructuring, this patch also consistently changes all labels to u8 - Adds support to parse/fill RTA_MULTIPATH netlink attribute for multipath routes similar to ipv4/v6 fib - In this patch, the multipath route nexthop selection algorithm simply returns the first nexthop. It is replaced by a hash based algorithm from Robert Shearman in the next patch - mpls_route_update cleanup: remove 'dev' handling in mpls_route_update. mpls_route_update though implemented to update based on dev, it was never used that way. And the dev handling gets tricky with multiple nexthops. Cannot match against any single nexthops dev. So, this patch removes the unused 'dev' handling in mpls_route_update. - dead route/path handling will be implemented in a subsequent patch Example: $ip -f mpls route add 100 nexthop as 200 via inet 10.1.1.2 dev swp1 \ nexthop as 700 via inet 10.1.1.6 dev swp2 \ nexthop as 800 via inet 40.1.1.2 dev swp3 $ip -f mpls route show 100 nexthop as to 200 via inet 10.1.1.2 dev swp1 nexthop as to 700 via inet 10.1.1.6 dev swp2 nexthop as to 800 via inet 40.1.1.2 dev swp3 Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08dst: Pass net into dst->outputEric W. Biederman1-1/+1
The network namespace is already passed into dst_output pass it into dst->output lwt->output and friends. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31mpls: fix mpls_net_init memory leakNikolay Aleksandrov1-1/+3
Fix a memory leak in the mpls netns init function in case of failure. If register_net_sysctl fails then we need to free the ctl_table. Fixes: 7720c01f3f59 ("mpls: Add a sysctl to control the size of the mpls label table") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-24lwt: Add cfg argument to build_stateTom Herbert1-0/+1
Add cfg and family arguments to lwt build state functions. cfg is a void pointer and will either be a pointer to a fib_config or fib6_config structure. The family parameter indicates which one (either AF_INET or AF_INET6). LWT encpasulation implementation may use the fib configuration to build the LWT state. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20route: move lwtunnel state to dst_entryJiri Benc1-5/+2
Currently, the lwtunnel state resides in per-protocol data. This is a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa). The xmit function of the tunnel does not know whether the packet has been routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the lwtstate data to dst_entry makes such inter-protocol tunneling possible. As a bonus, this brings a nice diffstat. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-09mpls: Enforce payload type of traffic sent using explicit NULLRobert Shearman1-27/+44
RFC 4182 s2 states that if an IPv4 Explicit NULL label is the only label on the stack, then after popping the resulting packet must be treated as a IPv4 packet and forwarded based on the IPv4 header. The same is true for IPv6 Explicit NULL with an IPv6 packet following. Therefore, when installing the IPv4/IPv6 Explicit NULL label routes, add an attribute that specifies the expected payload type for use at forwarding time for determining the type of the encapsulated packet instead of inspecting the first nibble of the packet. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06af_mpls: add null dev check in find_outdevRoopa Prabhu1-0/+3
This patch adds null dev check for the 'cfg->rc_via_table == NEIGH_LINK_TABLE or dev_get_by_index() failed' case Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06mpls: small cleanup in inet/inet6_fib_lookup_dev()Dan Carpenter1-9/+4
We recently changed this code from returning NULL to returning ERR_PTR. There are some left over NULL assignments which we can remove. We can preserve the error code from ip_route_output() instead of always returning -ENODEV. Also these functions use a mix of gotos and direct returns. There is no cleanup necessary so I changed the gotos to direct returns. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03mpls: Use definition for reserved label checksRobert Shearman1-10/+11
In multiple locations there are checks for whether the label in hand is a reserved label or not using the arbritray value of 16. Factor this out into a #define for better maintainability and for documentation. Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31af_mpls: fix undefined reference to ip6_route_outputRoopa Prabhu1-8/+33
Undefined reference to ip6_route_output and ip_route_output was reported with CONFIG_INET=n and CONFIG_IPV6=n. This patch uses ipv6_stub_impl.ipv6_dst_lookup instead of ip6_route_output. And wraps affected code under IS_ENABLED(CONFIG_INET) and IS_ENABLED(CONFIG_IPV6). Reported-by: kbuild test robot <fengguang.wu@intel.com> Reported-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22mpls_iptunnel: fix sparse warn: remove incorrect rcu_dereferenceRoopa Prabhu1-1/+1
fix for: net/mpls/mpls_iptunnel.c:73:19: sparse: incompatible types in comparison expression (different address spaces) remove incorrect rcu_dereference possibly left over from earlier revisions of the code. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21mpls: make RTA_OIF optionalRoopa Prabhu1-1/+66
If user did not specify an oif, try and get it from the via address. If failed to get device, return with -ENODEV. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21mpls: ip tunnel supportRoopa Prabhu3-1/+241
This implementation uses lwtunnel infrastructure to register hooks for mpls tunnel encaps. It picks cues from iptunnel_encaps infrastructure and previous mpls iptunnel RFC patches from Eric W. Biederman and Robert Shearman Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21mpls: export mpls functions for use by mpls iptunnelsRoopa Prabhu2-5/+15
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+11
2015-06-11mpls: handle device renames for per-device sysctlsRobert Shearman1-0/+11
If a device is renamed and the original name is subsequently reused for a new device, the following warning is generated: sysctl duplicate entry: /net/mpls/conf/veth0//input CPU: 3 PID: 1379 Comm: ip Not tainted 4.1.0-rc4+ #20 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 0000000000000000 0000000000000000 ffffffff81566aaf 0000000000000000 ffffffff81236279 ffff88002f7d7f00 0000000000000000 ffff88000db336d8 ffff88000db33698 0000000000000005 ffff88002e046000 ffff8800168c9280 Call Trace: [<ffffffff81566aaf>] ? dump_stack+0x40/0x50 [<ffffffff81236279>] ? __register_sysctl_table+0x289/0x5a0 [<ffffffffa051a24f>] ? mpls_dev_notify+0x1ff/0x300 [mpls_router] [<ffffffff8108db7f>] ? notifier_call_chain+0x4f/0x70 [<ffffffff81470e72>] ? register_netdevice+0x2b2/0x480 [<ffffffffa0524748>] ? veth_newlink+0x178/0x2d3 [veth] [<ffffffff8147f84c>] ? rtnl_newlink+0x73c/0x8e0 [<ffffffff8147f27a>] ? rtnl_newlink+0x16a/0x8e0 [<ffffffff81459ff2>] ? __kmalloc_reserve.isra.30+0x32/0x90 [<ffffffff8147ccfd>] ? rtnetlink_rcv_msg+0x8d/0x250 [<ffffffff8145b027>] ? __alloc_skb+0x47/0x1f0 [<ffffffff8149badb>] ? __netlink_lookup+0xab/0xe0 [<ffffffff8147cc70>] ? rtnetlink_rcv+0x30/0x30 [<ffffffff8149e7a0>] ? netlink_rcv_skb+0xb0/0xd0 [<ffffffff8147cc64>] ? rtnetlink_rcv+0x24/0x30 [<ffffffff8149df17>] ? netlink_unicast+0x107/0x1a0 [<ffffffff8149e4be>] ? netlink_sendmsg+0x50e/0x630 [<ffffffff8145209c>] ? sock_sendmsg+0x3c/0x50 [<ffffffff81452beb>] ? ___sys_sendmsg+0x27b/0x290 [<ffffffff811bd258>] ? mem_cgroup_try_charge+0x88/0x110 [<ffffffff811bd5b6>] ? mem_cgroup_commit_charge+0x56/0xa0 [<ffffffff811d7700>] ? do_filp_open+0x30/0xa0 [<ffffffff8145336e>] ? __sys_sendmsg+0x3e/0x80 [<ffffffff8156c3f2>] ? system_call_fastpath+0x16/0x75 Fix this by unregistering the previous sysctl table (registered for the path containing the original device name) and re-registering the table for the path containing the new device name. Fixes: 37bde79979c3 ("mpls: Per-device enabling of packet input") Reported-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-1/+2
2015-06-07mpls: fix possible use after free of deviceRobert Shearman2-1/+2
The mpls device is used in an RCU read context without a lock being held. As the memory is freed without waiting for the RCU grace period to elapse, the freed memory could still be in use. Address this by using kfree_rcu to free the memory for the mpls device after the RCU grace period has elapsed. Fixes: 03c57747a702 ("mpls: Per-device MPLS state") Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01net: Add priority to packet_offload objects.David S. Miller1-0/+2
When we scan a packet for GRO processing, we want to see the most common packet types in the front of the offload_base list. So add a priority field so we can handle this properly. IPv4/IPv6 get the highest priority with the implicit zero priority field. Next comes ethernet with a priority of 10, and then we have the MPLS types with a priority of 15. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09mpls: Change reserved label names to be consistent with netbsdTom Herbert1-9/+9
Since these are now visible to userspace it is nice to be consistent with BSD (sys/netmpls/mpls.h in netBSD). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05mpls: Move reserved label definitionsTom Herbert2-19/+9
Move to include/uapi/linux/mpls.h to be externally visibile. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22mpls: Prevent use of implicit NULL label as outgoing labelRobert Shearman1-0/+9
The reserved implicit-NULL label isn't allowed to appear in the label stack for packets, so make it an error for the control plane to specify it as an outgoing label. Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22mpls: Per-device enabling of packet inputRobert Shearman2-2/+69
An MPLS network is a single trust domain where the edges must be in control of what labels make their way into the core. The simplest way of ensuring this is for the edge device to always impose the labels, and not allow forward labeled traffic from untrusted neighbours. This is achieved by allowing a per-device configuration of whether MPLS traffic input from that interface should be processed or not. To be secure by default, the default state is changed to MPLS being disabled on all interfaces unless explicitly enabled and no global option is provided to change the default. Whilst this differs from other protocols (e.g. IPv6), network operators are used to explicitly enabling MPLS forwarding on interfaces, and with the number of links to the MPLS core typically fairly low this doesn't present too much of a burden on operators. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22mpls: Per-device MPLS stateRobert Shearman2-2/+51
Add per-device MPLS state to supported interfaces. Use the presence of this state in mpls_route_add to determine that this is a supported interface. Use the presence of mpls_dev to drop packets that arrived on an unsupported interface - previously they were allowed through. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12mpls: In mpls_egress verify the packet length.Eric W. Biederman1-1/+16
Reobert Shearman noticed that mpls_egress is failing to verify that the bytes to be examined are in fact present in the packet before mpls_egress reads those bytes. As suggested by David Miller reduce this to a single pskb_may_pull call so that we don't do unnecessary work in the fast path. Reported-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11mpls: Allow mpls_gso and mpls_router to be built as modulesRobert Shearman2-4/+6
CONFIG_MPLS=m doesn't result in a kernel module being built because it applies to the net/mpls directory, rather than to .o files. So revert the MPLS menuitem to being a boolean and make MPLS_GSO and MPLS_ROUTING tristates to allow mpls_gso and mpls_router modules to be produced as desired. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09mpls: Spelling: s/conceved/conceived/, s/as/a/Geert Uytterhoeven1-2/+2
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-08neigh: Use neigh table index for neigh_packet_xmitEric W. Biederman1-13/+22
Remove a little bit of unnecessary work when transmitting a packet with neigh_packet_xmit. Use the neighbour table index not the address family as a parameter. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-08mpls: Correct the ttl decrement.Eric W. Biederman1-1/+1
According to RFC3032 section 2.4.2 packets with an outgoing ttl of 0 MUST NOT be forwarded. According to section 2.4.1 an outgoing TTL of 0 comes from an incomming TTL <= 1. Therefore any packets that is received with a ttl <= 1 should not have it's ttl decremented and forwarded. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-08mpls: Better error code for unsupported option.Eric W. Biederman1-1/+1
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-08mpls: Cleanup the rcu usage in the code.Eric W. Biederman1-26/+47
Sparse was generating a lot of warnings mostly from missing annotations in the code. Add missing annotations and in a few cases tweak the code for performance by moving work before loops. This also fixes a problematic ommision of rcu_assign_pointer and rcu_dereference. Hopefully with complete rcu annotations any new rcu errors will stick out like a sore thumb. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-08mpls: Fix the kzalloc argument order in mpls_rt_allocEric W. Biederman1-1/+1
*Blink* I got the argument order wrong to kzalloc and the code was working properly when tested. *Blink* Fix that. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06mpls: Properly validate RTA_VIA payload lengthRobert Shearman1-1/+4
If the nla length is less than 2 then the nla data could be accessed beyond the accessible bounds. So ensure that the nla is big enough to at least read the via_family before doing so. Replace magic value of 2. Fixes: 03c0566542f4 ("mpls: Basic support for adding and removing routes") Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-05mpls: using vzalloc requires including vmalloc.hStephen Rothwell1-0/+1
Fixes this build error: net/mpls/af_mpls.c: In function 'resize_platform_label_table': net/mpls/af_mpls.c:767:4: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration] labels = vzalloc(size); ^ Fixes: 7720c01f3f59 ("mpls: Add a sysctl to control the size of the mpls label table") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04mpls: rtm_mpls_policy[] can be staticWu Fengguang1-1/+1
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04mpls: Multicast route table change notificationsEric W. Biederman1-0/+60
Unlike IPv4 this code notifies on all cases where mpls routes are added or removed and it never automatically removes routes. Avoiding both the userspace confusion that is caused by omitting route updates and the possibility of a flood of netlink traffic when an interface goes doew. For now reserved labels are handled automatically and userspace is not notified. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04mpls: Netlink commands to add, remove, and dump routesEric W. Biederman1-0/+229
This change adds two new netlink routing attributes: RTA_VIA and RTA_NEWDST. RTA_VIA specifies the specifies the next machine to send a packet to like RTA_GATEWAY. RTA_VIA differs from RTA_GATEWAY in that it includes the address family of the address of the next machine to send a packet to. Currently the MPLS code supports addresses in AF_INET, AF_INET6 and AF_PACKET. For AF_INET and AF_INET6 the destination mac address is acquired from the neighbour table. For AF_PACKET the destination mac_address is specified in the netlink configuration. I think raw destination mac address support with the family AF_PACKET will prove useful. There is MPLS-TP which is defined to operate on machines that do not support internet packets of any flavor. Further seem to be corner cases where it can be useful. At this point I don't care much either way. RTA_NEWDST specifies the destination address to forward the packet with. MPLS typically changes it's destination address at every hop. For a swap operation RTA_NEWDST is specified with a length of one label. For a push operation RTA_NEWDST is specified with two or more labels. For a pop operation RTA_NEWDST is not specified or equivalently an emtpy RTAN_NEWDST is specified. Those new netlink attributes are used to implement handling of rt-netlink RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE messages, to maintain the MPLS label table. rtm_to_route_config parses a netlink RTM_NEWROUTE or RTM_DELROUTE message, verify no unhandled attributes or unhandled values are present and sets up the data structures for mpls_route_add and mpls_route_del. I did my best to match up with the existing conventions with the caveats that MPLS addresses are all destination-specific-addresses, and so don't properly have a scope. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04mpls: Functions for reading and wrinting mpls labels over netlinkEric W. Biederman2-0/+60
Reading and writing addresses in network byte order in netlink is traditional and I see no reason to change that. MPLS is interesting as effectively it has variabely length addresses (the MPLS label stack). To represent these variable length addresses in netlink I use a valid MPLS label stack (complete with stop bit). This achieves two things: a well defined existing format is used, and the data can be interpreted without looking at it's length. Not needed to look at the length to decode the variable length network representation allows existing userspace functions such as inet_ntop to be used without needed to change their prototype. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04mpls: Basic support for adding and removing routesEric W. Biederman1-0/+133
mpls_route_add and mpls_route_del implement the basic logic for adding and removing Next Hop Label Forwarding Entries from the MPLS input label map. The addition and subtraction is done in a way that is consistent with how the existing routing table in Linux are maintained. Thus all of the work to deal with NLM_F_APPEND, NLM_F_EXCL, NLM_F_REPLACE, and NLM_F_CREATE. Cases that are not clearly defined such as changing the interpretation of the mpls reserved labels is not allowed. Because it seems like the right thing to do adding an MPLS route without specifying an input label and allowing the kernel to pick a free label table entry is supported. The implementation is currently less than optimal but that can be changed. As I don't have anything else to test with only ethernet and the loopback device are the only two device types currently supported for forwarding MPLS over. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04mpls: Add a sysctl to control the size of the mpls label tableEric W. Biederman1-0/+146
This sysctl gives two benefits. By defaulting the table size to 0 mpls even when compiled in and enabled defaults to not forwarding any packets. This prevents unpleasant surprises for users. The other benefit is that as mpls labels are allocated locally a dense table a small dense label table may be used which saves memory and is extremely simple and efficient to implement. This sysctl allows userspace to choose the restrictions on the label table size userspace applications need to cope with. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04mpls: Basic routing supportEric W. Biederman4-0/+411
This change adds a new Kconfig option MPLS_ROUTING. The core of this change is the code to look at an mpls packet received from another machine. Look that packet up in a routing table and forward the packet on. Support of MPLS over ATM is not considered or attempted here. This implemntation follows RFC3032 and implements the MPLS shim header that can pass over essentially any network. What RFC3021 refers to as the as the Incoming Label Map (ILM) I call net->mpls.platform_label[]. What RFC3031 refers to as the Next Label Hop Forwarding Entry (NHLFE) I call mpls_route. Though calling it the label fordwarding information base (lfib) might also be valid. Further the implemntation forwards packets as described in RFC3032. There is no need and given the original motivation for MPLS a strong discincentive to have a flexible label forwarding path. In essence the logic is the topmost label is read, looked up, removed, and replaced by 0 or more new lables and the sent out the specified interface to it's next hop. Quite a few optional features are not implemented here. Among them are generation of ICMP errors when the TTL is exceeded or the packet is larger than the next hop MTU (those conditions are detected and the packets are dropped instead of generating an icmp error). The traffic class field is always set to 0. The implementation focuses on IP over MPLS and does not handle egress of other kinds of protocols. Instead of implementing coordination with the neighbour table and sorting out how to input next hops in a different address family (for which there is value). I was lazy and implemented a next hop mac address instead. The code is simpler and there are flavor of MPLS such as MPLS-TP where neither an IPv4 nor an IPv6 next hop is appropriate so a next hop by mac address would need to be implemented at some point. Two new definitions AF_MPLS and PF_MPLS are exposed to userspace. Decoding the mpls header must be done by first byeswapping a 32bit bit endian word into the local cpu endian and then bit shifting to extract the pieces. There is no C bit-field that can represent a wire format mpls header on a little endian machine as the low bits of the 20bit label wind up in the wrong half of third byte. Therefore internally everything is deal with in cpu native byte order except when writing to and reading from a packet. For management simplicity if a label is configured to forward out an interface that is down the packet is dropped early. Similarly if an network interface is removed rt_dev is updated to NULL (so no reference is preserved) and any packets for that label are dropped. Keeping the label entries in the kernel allows the kernel label table to function as the definitive source of which labels are allocated and which are not. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04mpls: Refactor how the mpls module is builtEric W. Biederman1-1/+17
This refactoring is needed to allow more than just mpls gso support to be built into the mpls moddule. Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30net: mark some potential candidates __read_mostlyDaniel Borkmann1-2/+2
They are all either written once or extremly rarely (e.g. from init code), so we can move them to the .data..read_mostly section. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23mpls: Fix allowed protocols for mpls gsoPravin B Shelar1-4/+1
MPLS and Tunnel GSO does not work together. Reject packet which request such GSO. Fixes: 0d89d2035f ("MPLS: Add limited GSO support"). Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05net: Remove MPLS GSO feature.Pravin B Shelar1-2/+1
Device can export MPLS GSO support in dev->mpls_features same way it export vlan features in dev->vlan_features. So it is safe to remove NETIF_F_GSO_MPLS redundant flag. Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-10-31mpls: Allow mpls_gso to be built as modulePravin B Shelar1-1/+1
Kconfig already allows mpls to be built as module. Following patch fixes Makefile to do same. CC: Simon Horman <simon.horman@netronome.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-31mpls: Fix mpls_gso handler.Pravin B Shelar1-2/+1
mpls gso handler needs to pull skb after segmenting skb. CC: Simon Horman <simon.horman@netronome.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-20net: gso: use feature flag argument in all protocol gso handlersFlorian Westphal1-1/+1
skb_gso_segment() has a 'features' argument representing offload features available to the output path. A few handlers, e.g. GRE, instead re-fetch the features of skb->dev and use those instead of the provided ones when handing encapsulation/tunnels. Depending on dev->hw_enc_features of the output device skb_gso_segment() can then return NULL even when the caller has disabled all GSO feature bits, as segmentation of inner header thinks device will take care of segmentation. This e.g. affects the tbf scheduler, which will silently drop GRE-encap GSO skbs that did not fit the remaining token quota as the segmentation does not work when device supports corresponding hw offload capabilities. Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: Remove gso_send_check as an offload callbackTom Herbert1-7/+0
The send_check logic was only interesting in cases of TCP offload and UDP UFO where the checksum needed to be initialized to the pseudo header checksum. Now we've moved that logic into the related gso_segment functions so gso_send_check is no longer needed. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-04gre: Call gso_make_checksumTom Herbert1-0/+1
Call gso_make_checksum. This should have the benefit of using a checksum that may have been previously computed for the packet. This also adds NETIF_F_GSO_GRE_CSUM to differentiate devices that offload GRE GSO with and without the GRE checksum offloaed. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>