Search Results (809 CVEs found)

CVE Vendors Products Updated CVSS v3.1
CVE-2026-31583 1 Linux 1 Linux Kernel 2026-06-01 7.8 High
In the Linux kernel, the following vulnerability has been resolved: media: em28xx: fix use-after-free in em28xx_v4l2_open() em28xx_v4l2_open() reads dev->v4l2 without holding dev->lock, creating a race with em28xx_v4l2_init()'s error path and em28xx_v4l2_fini(), both of which free the em28xx_v4l2 struct and set dev->v4l2 to NULL under dev->lock. This race leads to two issues: - use-after-free in v4l2_fh_init() when accessing vdev->ctrl_handler, since the video_device is embedded in the freed em28xx_v4l2 struct. - NULL pointer dereference in em28xx_resolution_set() when accessing v4l2->norm, since dev->v4l2 has been set to NULL. Fix this by moving the mutex_lock() before the dev->v4l2 read and adding a NULL check for dev->v4l2 under the lock.
CVE-2026-31580 1 Linux 1 Linux Kernel 2026-06-01 7.8 High
In the Linux kernel, the following vulnerability has been resolved: bcache: fix cached_dev.sb_bio use-after-free and crash In our production environment, we have received multiple crash reports regarding libceph, which have caught our attention: ``` [6888366.280350] Call Trace: [6888366.280452] blk_update_request+0x14e/0x370 [6888366.280561] blk_mq_end_request+0x1a/0x130 [6888366.280671] rbd_img_handle_request+0x1a0/0x1b0 [rbd] [6888366.280792] rbd_obj_handle_request+0x32/0x40 [rbd] [6888366.280903] __complete_request+0x22/0x70 [libceph] [6888366.281032] osd_dispatch+0x15e/0xb40 [libceph] [6888366.281164] ? inet_recvmsg+0x5b/0xd0 [6888366.281272] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph] [6888366.281405] ceph_con_process_message+0x79/0x140 [libceph] [6888366.281534] ceph_con_v1_try_read+0x5d7/0xf30 [libceph] [6888366.281661] ceph_con_workfn+0x329/0x680 [libceph] ``` After analyzing the coredump file, we found that the address of dc->sb_bio has been freed. We know that cached_dev is only freed when it is stopped. Since sb_bio is a part of struct cached_dev, rather than an alloc every time. If the device is stopped while writing to the superblock, the released address will be accessed at endio. This patch hopes to wait for sb_write to complete in cached_dev_free. It should be noted that we analyzed the cause of the problem, then tell all details to the QWEN and adopted the modifications it made.
CVE-2026-23394 1 Linux 1 Linux Kernel 2026-06-01 4.7 Medium
In the Linux kernel, the following vulnerability has been resolved: af_unix: Give up GC if MSG_PEEK intervened. Igor Ushakov reported that GC purged the receive queue of an alive socket due to a race with MSG_PEEK with a nice repro. This is the exact same issue previously fixed by commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK"). After GC was replaced with the current algorithm, the cited commit removed the locking dance in unix_peek_fds() and reintroduced the same issue. The problem is that MSG_PEEK bumps a file refcount without interacting with GC. Consider an SCC containing sk-A and sk-B, where sk-A is close()d but can be recv()ed via sk-B. The bad thing happens if sk-A is recv()ed with MSG_PEEK from sk-B and sk-B is close()d while GC is checking unix_vertex_dead() for sk-A and sk-B. GC thread User thread --------- ----------- unix_vertex_dead(sk-A) -> true <------. \ `------ recv(sk-B, MSG_PEEK) invalidate !! -> sk-A's file refcount : 1 -> 2 close(sk-B) -> sk-B's file refcount : 2 -> 1 unix_vertex_dead(sk-B) -> true Initially, sk-A's file refcount is 1 by the inflight fd in sk-B recvq. GC thinks sk-A is dead because the file refcount is the same as the number of its inflight fds. However, sk-A's file refcount is bumped silently by MSG_PEEK, which invalidates the previous evaluation. At this moment, sk-B's file refcount is 2; one by the open fd, and one by the inflight fd in sk-A. The subsequent close() releases one refcount by the former. Finally, GC incorrectly concludes that both sk-A and sk-B are dead. One option is to restore the locking dance in unix_peek_fds(), but we can resolve this more elegantly thanks to the new algorithm. The point is that the issue does not occur without the subsequent close() and we actually do not need to synchronise MSG_PEEK with the dead SCC detection. When the issue occurs, close() and GC touch the same file refcount. If GC sees the refcount being decremented by close(), it can just give up garbage-collecting the SCC. Therefore, we only need to signal the race during MSG_PEEK with a proper memory barrier to make it visible to the GC. Let's use seqcount_t to notify GC when MSG_PEEK occurs and let it defer the SCC to the next run. This way no locking is needed on the MSG_PEEK side, and we can avoid imposing a penalty on every MSG_PEEK unnecessarily. Note that we can retry within unix_scc_dead() if MSG_PEEK is detected, but we do not do so to avoid hung task splat from abusive MSG_PEEK calls.
CVE-2026-45942 1 Linux 1 Linux Kernel 2026-05-30 7.8 High
In the Linux kernel, the following vulnerability has been resolved: ext4: fix e4b bitmap inconsistency reports A bitmap inconsistency issue was observed during stress tests under mixed huge-page workloads. Ext4 reported multiple e4b bitmap check failures like: ext4_mb_complex_scan_group:2508: group 350, 8179 free clusters as per group info. But got 8192 blocks Analysis and experimentation confirmed that the issue is caused by a race condition between page migration and bitmap modification. Although this timing window is extremely narrow, it is still hit in practice: folio_lock ext4_mb_load_buddy __migrate_folio check ref count folio_mc_copy __filemap_get_folio folio_try_get(folio) ...... mb_mark_used ext4_mb_unload_buddy __folio_migrate_mapping folio_ref_freeze folio_unlock The root cause of this issue is that the fast path of load_buddy only increments the folio's reference count, which is insufficient to prevent concurrent folio migration. We observed that the folio migration process acquires the folio lock. Therefore, we can determine whether to take the fast path in load_buddy by checking the lock status. If the folio is locked, we opt for the slow path (which acquires the lock) to close this concurrency window. Additionally, this change addresses the following issues: When the DOUBLE_CHECK macro is enabled to inspect bitmap-related issues, the following error may be triggered: corruption in group 324 at byte 784(6272): f in copy != ff on disk/prealloc Analysis reveals that this is a false positive. There is a specific race window where the bitmap and the group descriptor become momentarily inconsistent, leading to this error report: ext4_mb_load_buddy ext4_mb_load_buddy __filemap_get_folio(create|lock) folio_lock ext4_mb_init_cache folio_mark_uptodate __filemap_get_folio(no lock) ...... mb_mark_used mb_mark_used_double mb_cmp_bitmaps mb_set_bits(e4b->bd_bitmap) folio_unlock The original logic assumed that since mb_cmp_bitmaps is called when the bitmap is newly loaded from disk, the folio lock would be sufficient to prevent concurrent access. However, this overlooks a specific race condition: if another process attempts to load buddy and finds the folio is already in an uptodate state, it will immediately begin using it without holding folio lock.
CVE-2026-45894 1 Linux 1 Linux Kernel 2026-05-30 7.8 High
In the Linux kernel, the following vulnerability has been resolved: iommu/vt-d: Clear Present bit before tearing down PASID entry The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64 bytes). When tearing down an entry, the current implementation zeros the entire 64-byte structure immediately using multiple 64-bit writes. Since the IOMMU hardware may fetch these 64 bytes using multiple internal transactions (e.g., four 128-bit bursts), updating or zeroing the entire entry while it is active (P=1) risks a "torn" read. If a hardware fetch occurs simultaneously with the CPU zeroing the entry, the hardware could observe an inconsistent state, leading to unpredictable behavior or spurious faults. Follow the "Guidance to Software for Invalidations" in the VT-d spec (Section 6.5.3.3) by implementing the recommended ownership handshake: 1. Clear only the 'Present' (P) bit of the PASID entry. 2. Use a dma_wmb() to ensure the cleared bit is visible to hardware before proceeding. 3. Execute the required invalidation sequence (PASID cache, IOTLB, and Device-TLB flush) to ensure the hardware has released all cached references. 4. Only after the flushes are complete, zero out the remaining fields of the PASID entry. Also, add a dma_wmb() in pasid_set_present() to ensure that all other fields of the PASID entry are visible to the hardware before the Present bit is set.
CVE-2026-45859 1 Linux 1 Linux Kernel 2026-05-30 7.5 High
In the Linux kernel, the following vulnerability has been resolved: netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation Ulrich reports a regression with nfqueue: If an application did not set the 'F_GSO' capability flag and a gso packet with an unconfirmed nf_conn entry is received all packets are now dropped instead of queued, because the check happens after skb_gso_segment(). In that case, we did have exclusive ownership of the skb and its associated conntrack entry. The elevated use count is due to skb_clone happening via skb_gso_segment(). Move the check so that its peformed vs. the aggregated packet. Then, annotate the individual segments except the first one so we can do a 2nd check at reinject time. For the normal case, where userspace does in-order reinjects, this avoids packet drops: first reinjected segment continues traversal and confirms entry, remaining segments observe the confirmed entry. While at it, simplify nf_ct_drop_unconfirmed(): We only care about unconfirmed entries with a refcnt > 1, there is no need to special-case dying entries. This only happens with UDP. With TCP, the only unconfirmed packet will be the TCP SYN, those aren't aggregated by GRO. Next patch adds a udpgro test case to cover this scenario.
CVE-2026-45892 1 Linux 1 Linux Kernel 2026-05-30 7.0 High
In the Linux kernel, the following vulnerability has been resolved: ext4: drop extent cache after doing PARTIAL_VALID1 zeroout When splitting an unwritten extent in the middle and converting it to initialized in ext4_split_extent() with the EXT4_EXT_MAY_ZEROOUT and EXT4_EXT_DATA_VALID2 flags set, it could leave a stale unwritten extent. Assume we have an unwritten file and buffered write in the middle of it without dioread_nolock enabled, it will allocate blocks as written extent. 0 A B N [UUUUUUUUUUUU] on-disk extent U: unwritten extent [UUUUUUUUUUUU] extent status tree [--DDDDDDDD--] D: valid data |<- ->| ----> this range needs to be initialized ext4_split_extent() first try to split this extent at B with EXT4_EXT_DATA_PARTIAL_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but ext4_split_extent_at() failed to split this extent due to temporary lack of space. It zeroout B to N and leave the entire extent as unwritten. 0 A B N [UUUUUUUUUUUU] on-disk extent [UUUUUUUUUUUU] extent status tree [--DDDDDDDDZZ] Z: zeroed data ext4_split_extent() then try to split this extent at A with EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and leave an written extent from A to N. 0 A B N [UUWWWWWWWWWW] on-disk extent W: written extent [UUUUUUUUUUUU] extent status tree [--DDDDDDDDZZ] Finally ext4_map_create_blocks() only insert extent A to B to the extent status tree, and leave an stale unwritten extent in the status tree. 0 A B N [UUWWWWWWWWWW] on-disk extent W: written extent [UUWWWWWWWWUU] extent status tree [--DDDDDDDDZZ] Fix this issue by always cached extent status entry after zeroing out the second part.
CVE-2026-23287 1 Linux 1 Linux Kernel 2026-05-29 5.5 Medium
In the Linux kernel, the following vulnerability has been resolved: irqchip/sifive-plic: Fix frozen interrupt due to affinity setting PLIC ignores interrupt completion message for disabled interrupt, explained by the specification: The PLIC signals it has completed executing an interrupt handler by writing the interrupt ID it received from the claim to the claim/complete register. The PLIC does not check whether the completion ID is the same as the last claim ID for that target. If the completion ID does not match an interrupt source that is currently enabled for the target, the completion is silently ignored. This caused problems in the past, because an interrupt can be disabled while still being handled and plic_irq_eoi() had no effect. That was fixed by checking if the interrupt is disabled, and if so enable it, before sending the completion message. That check is done with irqd_irq_disabled(). However, that is not sufficient because the enable bit for the handling hart can be zero despite irqd_irq_disabled(d) being false. This can happen when affinity setting is changed while a hart is still handling the interrupt. This problem is easily reproducible by dumping a large file to uart (which generates lots of interrupts) and at the same time keep changing the uart interrupt's affinity setting. The uart port becomes frozen almost instantaneously. Fix this by checking PLIC's enable bit instead of irqd_irq_disabled().
CVE-2026-46106 1 Linux 1 Linux Kernel 2026-05-29 5.5 Medium
In the Linux kernel, the following vulnerability has been resolved: eventfs: Hold eventfs_mutex and SRCU when remount walks events Commit 340f0c7067a9 ("eventfs: Update all the eventfs_inodes from the events descriptor") had eventfs_set_attrs() recurse through ei->children on remount. The walk only holds the rcu_read_lock() taken by tracefs_apply_options() over tracefs_inodes, which is wrong: - list_for_each_entry over ei->children races with the list_del_rcu() in eventfs_remove_rec() -- LIST_POISON1 deref, same shape as d2603279c7d6. - eventfs_inodes are freed via call_srcu(&eventfs_srcu, ...). rcu_read_lock() does not extend an SRCU grace period, so ti->private can be reclaimed under the walk. - The writes to ei->attr race with eventfs_set_attr(), which holds eventfs_mutex. Reproducer: while :; do mount -o remount,uid=$((RANDOM%1000)) /sys/kernel/tracing; done & while :; do echo "p:kp submit_bio" > /sys/kernel/tracing/kprobe_events echo > /sys/kernel/tracing/kprobe_events done Wrap the events portion of tracefs_apply_options() in eventfs_remount_lock()/_unlock() that take eventfs_mutex and srcu_read_lock(&eventfs_srcu). eventfs_set_attrs() doesn't sleep so the nested rcu_read_lock() is fine; lockdep_assert_held() pins the contract. Comment in tracefs_drop_inode() said "RCU cycle" -- it is SRCU.
CVE-2026-45917 1 Linux 1 Linux Kernel 2026-05-28 5.5 Medium
In the Linux kernel, the following vulnerability has been resolved: ipvs: do not keep dest_dst if dev is going down There is race between the netdev notifier ip_vs_dst_event() and the code that caches dst with dev that is going down. As the FIB can be notified for the closed device after our handler finishes, it is possible valid route to be returned and cached resuling in a leaked dev reference until the dest is not removed. To prevent new dest_dst to be attached to dest just after the handler dropped the old one, add a netif_running() check to make sure the notifier handler is not currently running for device that is closing.
CVE-2026-45905 1 Linux 1 Linux Kernel 2026-05-28 5.5 Medium
In the Linux kernel, the following vulnerability has been resolved: xfrm: fix ip_rt_bug race in icmp_route_lookup reverse path icmp_route_lookup() performs multiple route lookups to find a suitable route for sending ICMP error messages, with special handling for XFRM (IPsec) policies. The lookup sequence is: 1. First, lookup output route for ICMP reply (dst = original src) 2. Pass through xfrm_lookup() for policy check 3. If blocked (-EPERM) or dst is not local, enter "reverse path" 4. In reverse path, call xfrm_decode_session_reverse() to get fl4_dec which reverses the original packet's flow (saddr<->daddr swapped) 5. If fl4_dec.saddr is local (we are the original destination), use __ip_route_output_key() for output route lookup 6. If fl4_dec.saddr is NOT local (we are a forwarding node), use ip_route_input() to simulate the reverse packet's input path 7. Finally, pass rt2 through xfrm_lookup() with XFRM_LOOKUP_ICMP flag The bug occurs in step 6: ip_route_input() is called with fl4_dec.daddr (original packet's source) as destination. If this address becomes local between the initial check and ip_route_input() call (e.g., due to concurrent "ip addr add"), ip_route_input() returns a LOCAL route with dst.output set to ip_rt_bug. This route is then used for ICMP output, causing dst_output() to call ip_rt_bug(), triggering a WARN_ON: ------------[ cut here ]------------ WARNING: net/ipv4/route.c:1275 at ip_rt_bug+0x21/0x30, CPU#1 Call Trace: <TASK> ip_push_pending_frames+0x202/0x240 icmp_push_reply+0x30d/0x430 __icmp_send+0x1149/0x24f0 ip_options_compile+0xa2/0xd0 ip_rcv_finish_core+0x829/0x1950 ip_rcv+0x2d7/0x420 __netif_receive_skb_one_core+0x185/0x1f0 netif_receive_skb+0x90/0x450 tun_get_user+0x3413/0x3fb0 tun_chr_write_iter+0xe4/0x220 ... Fix this by checking rt2->rt_type after ip_route_input(). If it's RTN_LOCAL, the route cannot be used for output, so treat it as an error. The reproducer requires kernel modification to widen the race window, making it unsuitable as a selftest. It is available at: https://gist.github.com/mrpre/eae853b72ac6a750f5d45d64ddac1e81
CVE-2026-45927 1 Linux 1 Linux Kernel 2026-05-28 6.3 Medium
In the Linux kernel, the following vulnerability has been resolved: bpf: Require frozen map for calculating map hash Currently, bpf_map_get_info_by_fd calculates and caches the hash of the map regardless of the map's frozen state. This leads to a TOCTOU bug where userspace can call BPF_OBJ_GET_INFO_BY_FD to cache the hash and then modify the map contents before freezing. Therefore, a trusted loader can be tricked into verifying the stale hash while loading the modified contents. Fix this by returning -EPERM if the map is not frozen when the hash is requested. This ensures the hash is only generated for the final, immutable state of the map.
CVE-2026-46121 1 Linux 1 Linux Kernel 2026-05-28 7.0 High
In the Linux kernel, the following vulnerability has been resolved: mm/damon/sysfs-schemes: protect memcg_path kfree() with damon_sysfs_lock Patch series "mm/damon/sysfs-schemes: fix use-after-free for [memcg_]path". Reads of 'memcg_path' and 'path' files in DAMON sysfs interface could race with their writes, results in use-after-free. Fix those. This patch (of 2): damon_sysfs_scheme_filter->mmecg_path can be read and written by users, via DAMON sysfs memcg_path file. It can also be indirectly read, for the parameters {on,off}line committing to DAMON. The reads for parameters committing are protected by damon_sysfs_lock to avoid the sysfs files being destroyed while any of the parameters are being read. But the user-driven direct reads and writes are not protected by any lock, while the write is deallocating the memcg_path-pointing buffer. As a result, the readers could read the already freed buffer (user-after-free). Note that the user-reads don't race when the same open file is used by the writer, due to kernfs's open file locking. Nonetheless, doing the reads and writes with separate open files would be common. Fix it by protecting both the user-direct reads and writes with damon_sysfs_lock.
CVE-2026-42336 1 1panel 1 Maxkb 2026-05-27 N/A
MaxKB is an open-source AI assistant for enterprise. MaxKB 2.8.0 and prior are vulnerable to a server-side request forgery (SSRF) bypass in the OSS file service URL fetch functionality due to inconsistent DNS resolution between validation and actual request execution, allowing attackers to access internal network services. This vulnerability is fixed in 2.8.1.
CVE-2025-71303 1 Linux 1 Linux Kernel 2026-05-27 N/A
In the Linux kernel, the following vulnerability has been resolved: accel/amdxdna: Fix race condition when checking rpm_on When autosuspend is triggered, driver rpm_on flag is set to indicate that a suspend/resume is already in progress. However, when a userspace application submits a command during this narrow window, amdxdna_pm_resume_get() may incorrectly skip the resume operation because the rpm_on flag is still set. This results in commands being submitted while the device has not actually resumed, causing unexpected behavior. The set_dpm() is called by suspend/resume, it relied on rpm_on flag to avoid calling into rpm suspend/resume recursivly. So to fix this, remove the use of the rpm_on flag entirely. Instead, introduce aie2_pm_set_dpm() which explicitly resumes the device before invoking set_dpm(). With this change, set_dpm() is called directly inside the suspend or resume execution path. Otherwise, aie2_pm_set_dpm() is called.
CVE-2026-43381 1 Linux 1 Linux Kernel 2026-05-26 5.5 Medium
In the Linux kernel, the following vulnerability has been resolved: nouveau/dpcd: return EBUSY for aux xfer if the device is asleep If we have runtime suspended, and userspace wants to use /dev/drm_dp_* then just tell it the device is busy instead of crashing in the GSP code. WARNING: CPU: 2 PID: 565741 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/rpc.c:164 r535_gsp_msgq_wait+0x9a/0xb0 [nouveau] CPU: 2 UID: 0 PID: 565741 Comm: fwupd Not tainted 6.18.10-200.fc43.x86_64 #1 PREEMPT(lazy) Hardware name: LENOVO 20QTS0PQ00/20QTS0PQ00, BIOS N2OET65W (1.52 ) 08/05/2024 RIP: 0010:r535_gsp_msgq_wait+0x9a/0xb0 [nouveau] This is a simple fix to get backported. We should probably engineer a proper power domain solution to wake up devices and keep them awake while fw updates are happening.
CVE-2026-29518 2 Rsync Project, Samba 2 Rsync, Rsync 2026-05-26 7 High
Rsync versions before 3.4.3 contain a time-of-check to time-of-use (TOCTOU) race condition in daemon file handling that allows attackers to redirect file writes outside intended directories by replacing parent directory components with symbolic links. Attackers with write access to a module path can exploit this race condition to create or overwrite arbitrary files, potentially modifying sensitive system files and achieving privilege escalation when the daemon runs with elevated privileges. This vulnerability can only be triggered if the chroot setting is false.
CVE-2026-23460 1 Linux 1 Linux Kernel 2026-05-26 5.5 Medium
In the Linux kernel, the following vulnerability has been resolved: net/rose: fix NULL pointer dereference in rose_transmit_link on reconnect syzkaller reported a bug [1], and the reproducer is available at [2]. ROSE sockets use four sk->sk_state values: TCP_CLOSE, TCP_LISTEN, TCP_SYN_SENT, and TCP_ESTABLISHED. rose_connect() already rejects calls for TCP_ESTABLISHED (-EISCONN) and TCP_CLOSE with SS_CONNECTING (-ECONNREFUSED), but lacks a check for TCP_SYN_SENT. When rose_connect() is called a second time while the first connection attempt is still in progress (TCP_SYN_SENT), it overwrites rose->neighbour via rose_get_neigh(). If that returns NULL, the socket is left with rose->state == ROSE_STATE_1 but rose->neighbour == NULL. When the socket is subsequently closed, rose_release() sees ROSE_STATE_1 and calls rose_write_internal() -> rose_transmit_link(skb, NULL), causing a NULL pointer dereference. Per connect(2), a second connect() while a connection is already in progress should return -EALREADY. Add this missing check for TCP_SYN_SENT to complete the state validation in rose_connect(). [1] https://syzkaller.appspot.com/bug?extid=d00f90e0af54102fb271 [2] https://gist.github.com/mrpre/9e6779e0d13e2c66779b1653fef80516
CVE-2026-23272 1 Linux 1 Linux Kernel 2026-05-23 7.8 High
In the Linux kernel, the following vulnerability has been resolved: netfilter: nf_tables: unconditionally bump set->nelems before insertion In case that the set is full, a new element gets published then removed without waiting for the RCU grace period, while RCU reader can be walking over it already. To address this issue, add the element transaction even if set is full, but toggle the set_full flag to report -ENFILE so the abort path safely unwinds the set to its previous state. As for element updates, decrement set->nelems to restore it. A simpler fix is to call synchronize_rcu() in the error path. However, with a large batch adding elements to already maxed-out set, this could cause noticeable slowdown of such batches.
CVE-2026-43420 1 Linux 1 Linux Kernel 2026-05-22 4.7 Medium
In the Linux kernel, the following vulnerability has been resolved: ceph: fix i_nlink underrun during async unlink During async unlink, we drop the `i_nlink` counter before we receive the completion (that will eventually update the `i_nlink`) because "we assume that the unlink will succeed". That is not a bad idea, but it races against deletions by other clients (or against the completion of our own unlink) and can lead to an underrun which emits a WARNING like this one: WARNING: CPU: 85 PID: 25093 at fs/inode.c:407 drop_nlink+0x50/0x68 Modules linked in: CPU: 85 UID: 3221252029 PID: 25093 Comm: php-cgi8.1 Not tainted 6.14.11-cm4all1-ampere #655 Hardware name: Supermicro ARS-110M-NR/R12SPD-A, BIOS 1.1b 10/17/2023 pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : drop_nlink+0x50/0x68 lr : ceph_unlink+0x6c4/0x720 sp : ffff80012173bc90 x29: ffff80012173bc90 x28: ffff086d0a45aaf8 x27: ffff0871d0eb5680 x26: ffff087f2a64a718 x25: 0000020000000180 x24: 0000000061c88647 x23: 0000000000000002 x22: ffff07ff9236d800 x21: 0000000000001203 x20: ffff07ff9237b000 x19: ffff088b8296afc0 x18: 00000000f3c93365 x17: 0000000000070000 x16: ffff08faffcbdfe8 x15: ffff08faffcbdfec x14: 0000000000000000 x13: 45445f65645f3037 x12: 34385f6369706f74 x11: 0000a2653104bb20 x10: ffffd85f26d73290 x9 : ffffd85f25664f94 x8 : 00000000000000c0 x7 : 0000000000000000 x6 : 0000000000000002 x5 : 0000000000000081 x4 : 0000000000000481 x3 : 0000000000000000 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff08727d3f91e8 Call trace: drop_nlink+0x50/0x68 (P) vfs_unlink+0xb0/0x2e8 do_unlinkat+0x204/0x288 __arm64_sys_unlinkat+0x3c/0x80 invoke_syscall.constprop.0+0x54/0xe8 do_el0_svc+0xa4/0xc8 el0_svc+0x18/0x58 el0t_64_sync_handler+0x104/0x130 el0t_64_sync+0x154/0x158 In ceph_unlink(), a call to ceph_mdsc_submit_request() submits the CEPH_MDS_OP_UNLINK to the MDS, but does not wait for completion. Meanwhile, between this call and the following drop_nlink() call, a worker thread may process a CEPH_CAP_OP_IMPORT, CEPH_CAP_OP_GRANT or just a CEPH_MSG_CLIENT_REPLY (the latter of which could be our own completion). These will lead to a set_nlink() call, updating the `i_nlink` counter to the value received from the MDS. If that new `i_nlink` value happens to be zero, it is illegal to decrement it further. But that is exactly what ceph_unlink() will do then. The WARNING can be reproduced this way: 1. Force async unlink; only the async code path is affected. Having no real clue about Ceph internals, I was unable to find out why the MDS wouldn't give me the "Fxr" capabilities, so I patched get_caps_for_async_unlink() to always succeed. (Note that the WARNING dump above was found on an unpatched kernel, without this kludge - this is not a theoretical bug.) 2. Add a sleep call after ceph_mdsc_submit_request() so the unlink completion gets handled by a worker thread before drop_nlink() is called. This guarantees that the `i_nlink` is already zero before drop_nlink() runs. The solution is to skip the counter decrement when it is already zero, but doing so without a lock is still racy (TOCTOU). Since ceph_fill_inode() and handle_cap_grant() both hold the `ceph_inode_info.i_ceph_lock` spinlock while set_nlink() runs, this seems like the proper lock to protect the `i_nlink` updates. I found prior art in NFS and SMB (using `inode.i_lock`) and AFS (using `afs_vnode.cb_lock`). All three have the zero check as well.