| Age | Commit message (Collapse) | Author | Files | Lines |
|
* kvm-arm64/vgic-v5-PPI-fixes:
: .
: Substantial cleanup of the vgic-v5 PPI support. From the original
: cover letter:
:
: "With the GICv5 PPi support merged in, it has become obvious that a few
: things could be improved, both from the correctness and maintainability
: angles."
: .
KVM: arm64: Fix arch timer interrupts for GICv3-on-GICv5 guests
irqchip/gic-v5: Immediately exec priority drop following activate
Documentation: KVM: Clarify that PMU_V3_IRQ IntID requirements for GICv5
Documentation: KVM: Fix typos in VGICv5 documentation
KVM: arm64: selftests: Improve error handling for GICv5 PPI selftest
KVM: arm64: selftests: Cleanup unused vars in GICv5 PPI selftest
KVM: arm64: selftests: Add missing GIC CDEN to no-vgic-v5 selftest
KVM: arm64: vgic-v5: Atomically assign bits to PPI DVI bitmap
KVM: arm64: vgic-v5: Add missing trap handing for NV triage
KVM: arm64: vgic-v5: Limit support to 64 PPIs
KVM: arm64: vgic: Rationalise per-CPU irq accessor
KVM: arm64: vgic-v5: Drop defensive checks from vgic_v5_ppi_queue_irq_unlock()
KVM: arm64: vgic: Consolidate vgic_allocate_private_irqs_locked()
KVM: arm64: vgic: Constify struct irq_ops usage
KVM: arm64: vgic-v5: Drop pointless ARM64_HAS_GICV5_CPUIF check
KVM: arm64: vgic-v5: Remove use of __assign_bit() with a constant
KVM: arm64: vgic-v5: Move PPI caps into kvm_vgic_global_state
KVM: arm64: vgic-v5: Add for_each_visible_v5_ppi() iterator
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
vgic-v5 has introduced much more prevalent usage of the struct
irq_ops mechanism.
In the process, it becomes evident that suffers from two related
problems:
- it contains flags, rather than only callbacks
- it is mutable, because we need to update the above flags
Swap the flags for a helper retrieving the flags, and make all
irq_ops const, something that is slightly satisfying.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20260520091949.542365-6-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Constant vgic properties are usually kept in kvm_vgic_global_state,
but the vgic-v5 code does its own thing.
Move the ppi_caps data into the global structure, which has the
modest additional advantage of making it ro_after_init.
Link: https://lore.kernel.org/r/20260520091949.542365-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Just like the timer, the PMU has an interrupt cache that serves little
purpose. Drop it.
Reviewed-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260520100200.543845-5-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The timer code makes use of a per-timer irq level cache, which
looks like a very minor optimisation to avoid taking a lock upon
updating the GIC view of the interrupt when it is unchanged from
the previous state.
This is coming in the way of more important correctness issues,
so get rid of the cache, which simplifies a couple of minor things.
Reviewed-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260520100200.543845-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The userspace notification of interrupts is has a few problems:
- it is utterly pointless
- it is annoyingly split between detecting the need for notification
and the population of the interrupts in the run structure
We can't do anything about the former (yet), but the latter can be
addressed. If we detect that we must notify userspace, we know that
we are going to exit, as we populate the exit status. Which means
we can also populate the interrupt state at this stage and be done
with it.
This simplifies the structure of the code.
Reviewed-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260520100200.543845-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The newly introduced arch_timer_context::direct field is a bit pointless,
as it is always set on timers that are... err... direct, while
we already have a way to get to that by doing a get_map() operation.
Additionally, this field is:
- only set when get_map() is called
- never cleared
and the single point where it is actually checked doesn't call get_map()
at all.
At this stage, it is probably better to just kill it, and rely on
get_map() to give us the correct information.
Reviewed-by: Sascha Bischoff <sascha.bischoff@arm.com>
Fixes: 9491c63b6cd7b ("KVM: arm64: gic-v5: Enlighten arch timer for GICv5")
Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com
Link: https://patch.msgid.link/20260401103611.357092-12-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
GICv5 systems will likely not support the full set of PPIs. The
presence of any virtual PPI is tied to the presence of the physical
PPI. Therefore, the available PPIs will be limited by the physical
host. Userspace cannot drive any PPIs that are not implemented.
Moreover, it is not desirable to expose all PPIs to the guest in the
first place, even if they are supported in hardware. Some devices,
such as the arch timer, are implemented in KVM, and hence those PPIs
shouldn't be driven by userspace, either.
Provided a new UAPI:
KVM_DEV_ARM_VGIC_GRP_CTRL => KVM_DEV_ARM_VGIC_USERPSPACE_PPIs
This allows userspace to query which PPIs it is able to drive via
KVM_IRQ_LINE.
Additionally, introduce a check in kvm_vm_ioctl_irq_line() to reject
any PPIs not in the userspace mask.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-40-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Make it mandatory to use the architected PPI when running a GICv5
guest. Attempts to set anything other than the architected PPI (23)
are rejected.
Additionally, KVM_ARM_VCPU_PMU_V3_INIT is relaxed to no longer require
KVM_ARM_VCPU_PMU_V3_IRQ to be called for GICv5-based guests. In this
case, the architectued PPI is automatically used.
Documentation is bumped accordingly.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-33-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Now that GICv5 has arrived, the arch timer requires some TLC to
address some of the key differences introduced with GICv5.
For PPIs on GICv5, the queue_irq_unlock irq_op is used as AP lists are
not required at all for GICv5. The arch timer also introduces an
irq_op - get_input_level. Extend the arch-timer-provided irq_ops to
include the PPI op for vgic_v5 guests.
When possible, DVI (Direct Virtual Interrupt) is set for PPIs when
using a vgic_v5, which directly inject the pending state into the
guest. This means that the host never sees the interrupt for the guest
for these interrupts. This has three impacts.
* First of all, the kvm_cpu_has_pending_timer check is updated to
explicitly check if the timers are expected to fire.
* Secondly, for mapped timers (which use DVI) they must be masked on
the host prior to entering a GICv5 guest, and unmasked on the return
path. This is handled in set_timer_irq_phys_masked.
* Thirdly, it makes zero sense to attempt to inject state for a DVI'd
interrupt. Track which timers are direct, and skip the call to
kvm_vgic_inject_irq() for these.
The final, but rather important, change is that the architected PPIs
for the timers are made mandatory for a GICv5 guest. Attempts to set
them to anything else are actively rejected. Once a vgic_v5 is
initialised, the arch timer PPIs are also explicitly reinitialised to
ensure the correct GICv5-compatible PPIs are used - this also adds in
the GICv5 PPI type to the intid.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-32-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Update kvm_vgic_create to create a vgic_v5 device. When creating a
vgic, FEAT_GCIE in the ID_AA64PFR2 is only exposed to vgic_v5-based
guests, and is hidden otherwise. GIC in ~ID_AA64PFR0_EL1 is never
exposed for a vgic_v5 guest.
When initialising a vgic_v5, skip kvm_vgic_dist_init as GICv5 doesn't
support one. The current vgic_v5 implementation only supports PPIs, so
no SPIs are initialised either.
The current vgic_v5 support doesn't extend to nested guests. Therefore,
the init of vgic_v5 for a nested guest is failed in vgic_v5_init.
As the current vgic_v5 doesn't require any resources to be mapped,
vgic_v5_map_resources is simply used to check that the vgic has indeed
been initialised. Again, this will change as more GICv5 support is
merged in.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-29-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
GICv5 adds support for directly injected PPIs. The mechanism for
setting this up is GICv5 specific, so rather than adding
GICv5-specific code to the common vgic code, we introduce a new
irq_op.
This new irq_op is intended to be used to enable or disable direct
injection for interrupts that support it. As it is an irq_op, it has
no effect unless explicitly populated in the irq_ops structure for a
particular interrupt. The usage is demonstracted in the subsequent
change.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-26-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
There are times when the default behaviour of vgic_queue_irq_unlock()
is undesirable. This is because some GICs, such a GICv5 which is the
main driver for this change, handle the majority of the interrupt
lifecycle in hardware. In this case, there is no need for a per-VCPU
AP list as the interrupt can be made pending directly. This is done
either via the ICH_PPI_x_EL2 registers for PPIs, or with the VDPEND
system instruction for SPIs and LPIs.
The vgic_queue_irq_unlock() function is made overridable using a new
function pointer in struct irq_ops. vgic_queue_irq_unlock() is
overridden if the function pointer is non-null.
This new irq_op is unused in this change - it is purely providing the
infrastructure itself. The subsequent PPI injection changes provide a
demonstration of the usage of the queue_irq_unlock irq_op.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-20-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
We only want to expose a subset of the PPIs to a guest. If a PPI does
not have an owner, it is not being actively driven by a device. The
SW_PPI is a special case, as it is likely for userspace to wish to
inject that.
Therefore, just prior to running the guest for the first time, we need
to finalize the PPIs. A mask is generated which, when combined with
trapping a guest's PPI accesses, allows for the guest's view of the
PPI to be filtered. This mask is global to the VM as all VCPUs PPI
configurations must match.
In addition, the PPI HMR is calculated.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-19-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
This change introduces GICv5 load/put. Additionally, it plumbs in
save/restore for:
* PPIs (ICH_PPI_x_EL2 regs)
* ICH_VMCR_EL2
* ICH_APR_EL2
* ICC_ICSR_EL1
A GICv5-specific enable bit is added to struct vgic_vmcr as this
differs from previous GICs. On GICv5-native systems, the VMCR only
contains the enable bit (driven by the guest via ICC_CR0_EL1.EN) and
the priority mask (PCR).
A struct gicv5_vpe is also introduced. This currently only contains a
single field - bool resident - which is used to track if a VPE is
currently running or not, and is used to avoid a case of double load
or double put on the WFI path for a vCPU. This struct will be extended
as additional GICv5 support is merged, specifically for VPE doorbells.
Co-authored-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-18-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Introduce the following hyp functions to save/restore GICv5 state:
* __vgic_v5_save_apr()
* __vgic_v5_restore_vmcr_apr()
* __vgic_v5_save_ppi_state() - no hypercall required
* __vgic_v5_restore_ppi_state() - no hypercall required
* __vgic_v5_save_state() - no hypercall required
* __vgic_v5_restore_state() - no hypercall required
Note that the functions tagged as not requiring hypercalls are always
called directly from the same context. They are either called via the
vgic_save_state()/vgic_restore_state() path when running with VHE, or
via __hyp_vgic_save_state()/__hyp_vgic_restore_state() otherwise. This
mimics how vgic_v3_save_state()/vgic_v3_restore_state() are
implemented.
Overall, the state of the following registers is saved/restored:
* ICC_ICSR_EL1
* ICH_APR_EL2
* ICH_PPI_ACTIVERx_EL2
* ICH_PPI_DVIRx_EL2
* ICH_PPI_ENABLERx_EL2
* ICH_PPI_PENDRx_EL2
* ICH_PPI_PRIORITYRx_EL2
* ICH_VMCR_EL2
All of these are saved/restored to/from the KVM vgic_v5 CPUIF shadow
state, with the exception of the PPI active, pending, and enable
state. The pending state is saved and restored from kvm_host_data as
any changes here need to be tracked and propagated back to the
vgic_irq shadow structures (coming in a future commit). Therefore, an
entry and an exit copy is required. The active and enable state is
restored from the vgic_v5 CPUIF, but is saved to kvm_host_data. Again,
this needs to by synced back into the shadow data structures.
The ICSR must be save/restored as this register is shared between host
and guest. Therefore, to avoid leaking host state to the guest, this
must be saved and restored. Moreover, as this can by used by the host
at any time, it must be save/restored eagerly. Note: the host state is
not preserved as the host should only use this register when
preemption is disabled.
As with GICv3, the VMCR is eagerly saved as this is required when
checking if interrupts can be injected or not, and therefore impacts
things such as WFI.
As part of restoring the ICH_VMCR_EL2 and ICH_APR_EL2, GICv3-compat
mode is also disabled by setting the ICH_VCTLR_EL2.V3 bit to 0. The
correspoinding GICv3-compat mode enable is part of the VMCR & APR
restore for a GICv3 guest as it only takes effect when actually
running a guest.
Co-authored-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-17-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Add in a sanitization function for ID_AA64PFR2_EL1, preserving the
already-present behaviour for the FPMR, MTEFAR, and MTESTOREONLY
fields. Add sanitisation for the GCIE field, which is set to IMP if
the host supports a GICv5 guest and NI, otherwise.
Extend the sanitisation that takes place in kvm_vgic_create() to zero
the ID_AA64PFR2.GCIE field when a non-GICv5 GIC is created. More
importantly, move this sanitisation to a separate function,
kvm_vgic_finalize_sysregs(), and call it from kvm_finalize_sys_regs().
We are required to finalize the GIC and GCIE fields a second time in
kvm_finalize_sys_regs() due to how QEMU blindly reads out then
verbatim restores the system register state. This avoids the issue
where both the GCIE and GIC features are marked as present (an
architecturally invalid combination), and hence guests fall over. See
the comment in kvm_finalize_sys_regs() for more details.
Overall, the following happens:
* Before an irqchip is created, FEAT_GCIE is presented if the host
supports GICv5-based guests.
* Once an irqchip is created, all other supported irqchips are hidden
from the guest; system register state reflects the guest's irqchip.
* Userspace is allowed to set invalid irqchip feature combinations in
the system registers, but...
* ...invalid combinations are removed a second time prior to the first
run of the guest, and things hopefully just work.
All of this extra work is required to make sure that "legacy" GICv3
guests based on QEMU transparently work on compatible GICv5 hosts
without modification.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-13-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
As part of booting the system and initialising KVM, create and
populate a mask of the implemented PPIs. This mask allows future PPI
operations (such as save/restore or state, or syncing back into the
shadow state) to only consider PPIs that are actually implemented on
the host.
The set of implemented virtual PPIs matches the set of implemented
physical PPIs for a GICv5 host. Therefore, this mask represents all
PPIs that could ever by used by a GICv5-based guest on a specific
host, albeit pre-filtered by what we support in KVM (see next
paragraph).
Only architected PPIs are currently supported in KVM with
GICv5. Moreover, as KVM only supports a subset of all possible PPIS
(Timers, PMU, GICv5 SW_PPI) the PPI mask only includes these PPIs, if
present. The timers are always assumed to be present; if we have KVM
we have EL2, which means that we have the EL1 & EL2 Timer PPIs. If we
have a PMU (v3), then the PMUIRQ is present. The GICv5 SW_PPI is
always assumed to be present.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-12-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
GICv5 has moved from using interrupt ranges for different interrupt
types to using some of the upper bits of the interrupt ID to denote
the interrupt type. This is not compatible with older GICs (which rely
on ranges of interrupts to determine the type), and hence a set of
helpers is introduced. These helpers take a struct kvm*, and use the
vgic model to determine how to interpret the interrupt ID.
Helpers are introduced for PPIs, SPIs, and LPIs. Additionally, a
helper is introduced to determine if an interrupt is private - SGIs
and PPIs for older GICs, and PPIs only for GICv5.
Additionally, vgic_is_v5() is introduced (which unsurpisingly returns
true when running a GICv5 guest), and the existing vgic_is_v3() check
is moved from vgic.h to arm_vgic.h (to live alongside the vgic_is_v5()
one), and has been converted into a macro.
The helpers are plumbed into the core vgic code, as well as the Arch
Timer and PMU code.
There should be no functional changes as part of this change.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-10-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Prior to this change, the act of mapping a virtual IRQ to a physical
one also set the irq_ops. Unmapping then reset the irq_ops to NULL. So
far, this has been fine and hasn't caused any major issues.
Now, however, as GICv5 support is being added to KVM, it has become
apparent that conflating mapping/unmapping IRQs and setting/clearing
irq_ops can cause issues. The reason is that the upcoming GICv5
support introduces a set of default irq_ops for PPIs, and removing
this when unmapping will cause things to break rather horribly.
Split out the mapping/unmapping of IRQs from the setting/clearing of
irq_ops. The arch timer code is updated to set the irq_ops following a
successful map. The irq_ops are intentionally not removed again on an
unmap as the only irq_op introduced by the arch timer only takes
effect if the hw bit in struct vgic_irq is set. Therefore, it is safe
to leave this in place, and it avoids additional complexity when GICv5
support is introduced.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-6-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The current implementation uses `vgic_state_iter` in `struct
vgic_dist` to track the sequence position. This effectively makes the
iterator shared across all open file descriptors for the VM.
This approach has significant drawbacks:
- It enforces mutual exclusion, preventing concurrent reads of the
debugfs file (returning -EBUSY).
- It relies on storing transient iterator state in the long-lived
VM structure (`vgic_dist`).
Refactor the implementation to use the standard `seq_file` iterator.
Instead of storing state in `kvm_arch`, rely on the `pos` argument
passed to the `start` and `next` callbacks, which tracks the logical
index specific to the file descriptor.
This change enables concurrent access and eliminates the
`vgic_state_iter` field from `struct vgic_dist`.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202085721.3954942-4-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The vgic-debug interface implementation uses XArray marks
(`LPI_XA_MARK_DEBUG_ITER`) to "snapshot" LPIs at the start of iteration.
This modifies global state for a read-only operation and complicates
reference counting, leading to leaks if iteration is aborted or fails.
Reimplement the iterator to use dynamic iteration logic:
- Remove `lpi_idx` from `struct vgic_state_iter`.
- Replace the XArray marking mechanism with dynamic iteration using
`xa_find_after(..., XA_PRESENT)`.
- Wrap XArray traversals in `rcu_read_lock()`/`rcu_read_unlock()` to
ensure safety against concurrent modifications (e.g., LPI unmapping).
- Handle potential races where an LPI is removed during iteration by
gracefully skipping it in `show()`, rather than warning.
- Remove the unused `LPI_XA_MARK_DEBUG_ITER` definition.
This simplifies the lifecycle management of the iterator and prevents
resource leaks associated with the marking mechanism, and paves the way
for using a standard seq_file iterator.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202085721.3954942-3-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Add the plumbing of GICv2 interrupt deactivation via GICV_DIR.
This requires adding a new device so that we can easily decode
the DIR address.
The deactivation itself is very similar to the GICv3 version.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-39-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
SPIs are specially annpying, as they can be activated on a CPU and
deactivated on another. WHich means that when an SPI is in flight
anywhere, all CPUs need to have their TDIR trap bit set.
This translates into broadcasting an IPI across all CPUs to make sure
they set their trap bit, The number of in-flight SPIs is kept in
an atomic variable so that CPUs can turn the trap bit off as soon
as possible.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-32-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
Deactivation via ICV_DIR_EL1 is both relatively straightforward
(we have the interrupt that needs deactivation) and really awkward.
The main issue is that the interrupt may either be in an LR on
another CPU, or ourside of any LR.
In the former case, we process the deactivation is if ot was
a write to GICD_CACTIVERn, which is already implemented as a big
hammer IPI'ing all vcpus. In the latter case, we just perform
a normal deactivation, similar to what we do for EOImode==0.
Another annoying aspect is that we need to tell the CPU owning
the interrupt that its ap_list needs laudering. We use a brand new
vcpu request to that effect.
Note that this doesn't address deactivation via the GICV MMIO view,
which will be taken care of in a later change.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-29-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
We currently cannot identify whether an interrupt is queued into
a LR. It wasn't needed until now, but that's about to change.
Add yet another flag to track that state.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-9-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
struct vgic_irq has grown over the years, in a rather bad way.
Repack it using bitfields so that the individual flags, and move
things around a bit so that it a bit smaller.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-8-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
Future changes will require KVM to be able to perform deactivations
by writing to the physical CPU interface. Add the corresponding
VA to the kvm_info structure, and let KVM stash it.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-3-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
Now that the whole timer infrastructure is handled as system register
accesses, get rid of the now unused ad-hoc infrastructure.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Move the timer_set_offset() helper to arm_arch_timer.h, so that it
is next to timer_get_offset(), and accessible by the rest of KVM.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Having to follow a pointer to a vcpu is pretty dumb, when the timers
are are a fixed offset in the vcpu structure itself.
Trade the vcpu pointer for a timer_id, which can then be used to
compute the vcpu address as needed.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
We currently have a vcpu pointer nested into each timer context.
As we are about to remove this pointer, introduce a helper (aptly
named timer_context_to_vcpu()) that returns this pointer, at least
until we repaint the data structure.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.18
- Add support for FF-A 1.2 as the secure memory conduit for pKVM,
allowing more registers to be used as part of the message payload.
- Change the way pKVM allocates its VM handles, making sure that the
privileged hypervisor is never tricked into using uninitialised
data.
- Speed up MMIO range registration by avoiding unnecessary RCU
synchronisation, which results in VMs starting much quicker.
- Add the dump of the instruction stream when panic-ing in the EL2
payload, just like the rest of the kernel has always done. This will
hopefully help debugging non-VHE setups.
- Add 52bit PA support to the stage-1 page-table walker, and make use
of it to populate the fault level reported to the guest on failing
to translate a stage-1 walk.
- Add NV support to the GICv3-on-GICv5 emulation code, ensuring
feature parity for guests, irrespective of the host platform.
- Fix some really ugly architecture problems when dealing with debug
in a nested VM. This has some bad performance impacts, but is at
least correct.
- Add enough infrastructure to be able to disable EL2 features and
give effective values to the EL2 control registers. This then allows
a bunch of features to be turned off, which helps cross-host
migration.
- Large rework of the selftest infrastructure to allow most tests to
transparently run at EL2. This is the first step towards enabling
NV testing.
- Various fixes and improvements all over the map, including one BE
fix, just in time for the removal of the feature.
|
|
* kvm-arm64/gic-v5-nv:
: .
: Add NV support to GICv5 in GICv3 emulation mode, ensuring that the v3
: guest support is identical to that of a pure v3 platform.
:
: Patches courtesy of Sascha Bischoff (20250828105925.3865158-1-sascha.bischoff@arm.com)
: .
irqchip/gic-v5: Drop has_gcie_v3_compat from gic_kvm_info
KVM: arm64: Use ARM64_HAS_GICV5_LEGACY for GICv5 probing
arm64: cpucaps: Add GICv5 Legacy vCPU interface (GCIE_LEGACY) capability
KVM: arm64: Enable nested for GICv5 host with FEAT_GCIE_LEGACY
KVM: arm64: Don't access ICC_SRE_EL2 if GICv3 doesn't support v2 compatibility
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
We currently access ICC_SRE_EL2 at each load/put on VHE, and on each
entry/exit on nVHE. Both are quite onerous on NV, as this register
always traps.
We do this to make sure the EL1 guest doesn't flip between v2 and v3
behind our back. But all modern implementations have dropped v2,
and this is just overhead.
At the same time, the GICv5 spec has been fixed to allow access to
ICC_SRE_EL2 in legacy mode. Use this opportunity to replace the
GICv5 checks for v2 compat checks, with an ad-hoc static key.
Co-developed-by: Sascha Bischoff <sascha.bischoff@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
It is now used only within kvm_vgic_map_resources(). vgic_dist::ready
is already written directly by this function, so it is clearer to
bypass the macro for reads as well.
Signed-off-by: Keir Fraser <keirf@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
syzkaller has caught us red-handed once more, this time nesting regular
spinlocks behind raw spinlocks:
=============================
[ BUG: Invalid wait context ]
6.16.0-rc3-syzkaller-g7b8346bd9fce #0 Not tainted
-----------------------------
syz.0.29/3743 is trying to lock:
a3ff80008e2e9e18 (&xa->xa_lock#20){....}-{3:3}, at: vgic_put_irq+0xb4/0x190 arch/arm64/kvm/vgic/vgic.c:137
other info that might help us debug this:
context-{5:5}
3 locks held by syz.0.29/3743:
#0: a3ff80008e2e90a8 (&kvm->slots_lock){+.+.}-{4:4}, at: kvm_vgic_destroy+0x50/0x624 arch/arm64/kvm/vgic/vgic-init.c:499
#1: a3ff80008e2e9fa0 (&kvm->arch.config_lock){+.+.}-{4:4}, at: kvm_vgic_destroy+0x5c/0x624 arch/arm64/kvm/vgic/vgic-init.c:500
#2: 58f0000021be1428 (&vgic_cpu->ap_list_lock){....}-{2:2}, at: vgic_flush_pending_lpis+0x3c/0x31c arch/arm64/kvm/vgic/vgic.c:150
stack backtrace:
CPU: 0 UID: 0 PID: 3743 Comm: syz.0.29 Not tainted 6.16.0-rc3-syzkaller-g7b8346bd9fce #0 PREEMPT
Hardware name: linux,dummy-virt (DT)
Call trace:
show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:466 (C)
__dump_stack+0x30/0x40 lib/dump_stack.c:94
dump_stack_lvl+0xd8/0x12c lib/dump_stack.c:120
dump_stack+0x1c/0x28 lib/dump_stack.c:129
print_lock_invalid_wait_context kernel/locking/lockdep.c:4833 [inline]
check_wait_context kernel/locking/lockdep.c:4905 [inline]
__lock_acquire+0x978/0x299c kernel/locking/lockdep.c:5190
lock_acquire+0x14c/0x2e0 kernel/locking/lockdep.c:5871
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x5c/0x7c kernel/locking/spinlock.c:162
vgic_put_irq+0xb4/0x190 arch/arm64/kvm/vgic/vgic.c:137
vgic_flush_pending_lpis+0x24c/0x31c arch/arm64/kvm/vgic/vgic.c:158
__kvm_vgic_vcpu_destroy+0x44/0x500 arch/arm64/kvm/vgic/vgic-init.c:455
kvm_vgic_destroy+0x100/0x624 arch/arm64/kvm/vgic/vgic-init.c:505
kvm_arch_destroy_vm+0x80/0x138 arch/arm64/kvm/arm.c:244
kvm_destroy_vm virt/kvm/kvm_main.c:1308 [inline]
kvm_put_kvm+0x800/0xff8 virt/kvm/kvm_main.c:1344
kvm_vm_release+0x58/0x78 virt/kvm/kvm_main.c:1367
__fput+0x4ac/0x980 fs/file_table.c:465
____fput+0x20/0x58 fs/file_table.c:493
task_work_run+0x1bc/0x254 kernel/task_work.c:227
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
do_notify_resume+0x1b4/0x270 arch/arm64/kernel/entry-common.c:151
exit_to_user_mode_prepare arch/arm64/kernel/entry-common.c:169 [inline]
exit_to_user_mode arch/arm64/kernel/entry-common.c:178 [inline]
el0_svc+0xb4/0x160 arch/arm64/kernel/entry-common.c:768
el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600
This is of course no good, but is at odds with how LPI refcounts are
managed. Solve the locking mess by deferring the release of unreferenced
LPIs after the ap_list_lock is released. Mark these to-be-released LPIs
specially to avoid racing with vgic_put_irq() and causing a double-free.
Since references can only be taken on LPIs with a nonzero refcount,
extending the lifetime of freed LPIs is still safe.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reported-by: syzbot+cef594105ac7e60c6d93@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/kvmarm/68acd0d9.a00a0220.33401d.048b.GAE@google.com/
Link: https://lore.kernel.org/r/20250905100531.282980-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
KVM's use of krefs to manage LPIs isn't adding much, especially
considering vgic_irq_release() is a noop due to the lack of sufficient
context.
Switch to using a regular refcount in anticipation of adding a
meaningful release concept for LPIs.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250905100531.282980-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
While LPIs lack an active state, KVM unconditionally folds the active
state from the LR into the vgic_irq struct meaning this field cannot be
'creatively' reused for something else.
Drop the misleading comment to reflect this.
Link: https://lore.kernel.org/r/20250905100531.282980-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 changes for 6.17, round #1
- Host driver for GICv5, the next generation interrupt controller for
arm64, including support for interrupt routing, MSIs, interrupt
translation and wired interrupts.
- Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
GICv5 hardware, leveraging the legacy VGIC interface.
- Userspace control of the 'nASSGIcap' GICv3 feature, allowing
userspace to disable support for SGIs w/o an active state on hardware
that previously advertised it unconditionally.
- Map supporting endpoints with cacheable memory attributes on systems
with FEAT_S2FWB and DIC where KVM no longer needs to perform cache
maintenance on the address range.
- Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest
hypervisor to inject external aborts into an L2 VM and take traps of
masked external aborts to the hypervisor.
- Convert more system register sanitization to the config-driven
implementation.
- Fixes to the visibility of EL2 registers, namely making VGICv3 system
registers accessible through the VGIC device instead of the ONE_REG
vCPU ioctls.
- Various cleanups and minor fixes.
|
|
* kvm-arm64/vgic-v4-ctl:
: Userspace control of nASSGIcap, courtesy of Raghavendra Rao Ananta
:
: Allow userspace to decide if support for SGIs without an active state is
: advertised to the guest, allowing VMs from GICv3-only hardware to be
: migrated to to GICv4.1 capable machines.
Documentation: KVM: arm64: Describe VGICv3 registers writable pre-init
KVM: arm64: selftests: Add test for nASSGIcap attribute
KVM: arm64: vgic-v3: Allow userspace to write GICD_TYPER2.nASSGIcap
KVM: arm64: vgic-v3: Allow access to GICD_IIDR prior to initialization
KVM: arm64: vgic-v3: Consolidate MAINT_IRQ handling
KVM: arm64: Disambiguate support for vSGIs v. vLPIs
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
KVM unconditionally advertises GICD_TYPER2.nASSGIcap (which internally
implies vSGIs) on GICv4.1 systems. Allow userspace to change whether a
VM supports the feature. Only allow changes prior to VGIC initialization
as at that point vPEs need to be allocated for the VM.
For convenience, bundle support for vLPIs and vSGIs behind this feature,
allowing userspace to control vPE allocation for VMs in environments
that may be constrained on vPE IDs.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250724062805.2658919-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Add support for GICv3 compat mode (FEAT_GCIE_LEGACY) which allows a
GICv5 host to run GICv3-based VMs. This change enables the
VHE/nVHE/hVHE/protected modes, but does not support nested
virtualization.
A lazy-disable approach is taken for compat mode; it is enabled on the
vgic_v3_load path but not disabled on the vgic_v3_put path. A
non-GICv3 VM, i.e., one based on GICv5, is responsible for disabling
compat mode on the corresponding vgic_v5_load path. Currently, GICv5
is not supported, and hence compat mode is not disabled again once it
is enabled, and this function is intentionally omitted from the code.
Co-authored-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://lore.kernel.org/r/20250627100847.1022515-5-sascha.bischoff@arm.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Don't bother WARNing if updating an IRTE route fails now that vendor code
provides much more precise WARNs. The generic WARN doesn't provide enough
information to actually debug the problem, and has obviously done nothing
to surface the myriad bugs in KVM x86's implementation.
Drop all of the associated return code plumbing that existed just so that
common KVM could WARN.
Link: https://lore.kernel.org/r/20250611224604.313496-34-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The virtual mapping and "GSI" routing of a particular vLPI is subject to
change in response to the guest / userspace. This can be pretty annoying
to deal with when KVM needs to track the physical state that's managed
for vLPI direct injection.
Make vgic_v4_unset_forwarding() resilient by using the host IRQ to
resolve the vgic IRQ. Since this uses the LPI xarray directly, finding
the ITS by doorbell address + grabbing it's its_lock is no longer
necessary. Note that matching the right ITS / ITE is already handled in
vgic_v4_set_forwarding(), and unless there's a bug in KVM's VGIC ITS
emulation the virtual mapping that should remain stable for the lifetime
of the vLPI mapping.
Tested-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250523194722.4066715-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/pmu-fixes:
: vPMU fixes for 6.15 courtesy of Akihiko Odaki
:
: Various fixes to KVM's vPMU implementation, notably ensuring
: userspace-directed changes to the PMCs are reflected in the backing perf
: events.
KVM: arm64: PMU: Reload when resetting
KVM: arm64: PMU: Reload when user modifies registers
KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/pmuv3-asahi:
: Support PMUv3 for KVM guests on Apple silicon
:
: Take advantage of some IMPLEMENTATION DEFINED traps available on Apple
: parts to trap-and-emulate the PMUv3 registers on behalf of a KVM guest.
: Constrain the vPMU to a cycle counter and single event counter, as the
: Apple PMU has events that cannot be counted on every counter.
:
: There is a small new interface between the ARM PMU driver and KVM, where
: the PMU driver owns the PMUv3 -> hardware event mappings.
arm64: Enable IMP DEF PMUv3 traps on Apple M*
KVM: arm64: Provide 1 event counter on IMPDEF hardware
drivers/perf: apple_m1: Provide helper for mapping PMUv3 events
KVM: arm64: Remap PMUv3 events onto hardware
KVM: arm64: Advertise PMUv3 if IMPDEF traps are present
KVM: arm64: Compute synthetic sysreg ESR for Apple PMUv3 traps
KVM: arm64: Move PMUVer filtering into KVM code
KVM: arm64: Use guard() to cleanup usage of arm_pmus_lock
KVM: arm64: Drop kvm_arm_pmu_available static key
KVM: arm64: Use a cpucap to determine if system supports FEAT_PMUv3
KVM: arm64: Always support SW_INCR PMU event
KVM: arm64: Compute PMCEID from arm_pmu's event bitmaps
drivers/perf: apple_m1: Support host/guest event filtering
drivers/perf: apple_m1: Refactor event select/filter configuration
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Replace kvm_pmu_vcpu_reset() with the generic PMU reloading mechanism to
ensure the consistency with system registers and to reduce code size.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250315-pmc-v5-5-ecee87dab216@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Reload the perf event when setting the vPMU counter (vPMC) registers
(PMCCNTR_EL0 and PMEVCNTR<n>_EL0). This is a change corresponding to
commit 9228b26194d1 ("KVM: arm64: PMU: Fix GET_ONE_REG
for vPMC regs to return the current value") but for SET_ONE_REG.
Values of vPMC registers are saved in sysreg files on certain occasions.
These saved values don't represent the current values of the vPMC
registers if the perf events for the vPMCs count events after the save.
The current values of those registers are the sum of the sysreg file
value and the current perf event counter value. But, when userspace
writes those registers (using KVM_SET_ONE_REG), KVM only updates the
sysreg file value and leaves the current perf event counter value as is.
It is also important to keep the correct state even if userspace writes
them after first run, specifically when debugging Windows on QEMU with
GDB; QEMU tries to write back all visible registers when resuming the VM
execution with GDB, corrupting the PMU state. Windows always uses the
PMU so this can cause adverse effects on that particular OS.
Fix this by releasing the current perf event and trigger recreating one
with KVM_REQ_RELOAD_PMU.
Fixes: 051ff581ce70 ("arm64: KVM: Add access handler for event counter register")
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250315-pmc-v5-3-ecee87dab216@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
With the PMUv3 cpucap, kvm_arm_pmu_available is no longer used in the
hot path of guest entry/exit. On top of that, guest support for PMUv3
may not correlate with host support for the feature, e.g. on IMPDEF
hardware.
Throw out the static key and just inspect the list of PMUs to determine
if PMUv3 is supported for KVM guests.
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-7-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
KVM is about to learn some new tricks to virtualize PMUv3 on IMPDEF
hardware. As part of that, we now need to differentiate host support
from guest support for PMUv3.
Add a cpucap to determine if an architectural PMUv3 is present to guard
host usage of PMUv3 controls.
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-6-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Emulating the vGIC means emulating the dreaded Maintenance Interrupt.
This is a two-pronged problem:
- while running L2, getting an MI translates into an MI injected
in the L1 based on the state of the HW.
- while running L1, we must accurately reflect the state of the
MI line, based on the in-memory state.
The MI INTID is added to the distributor, as expected on any
virtualisation-capable implementation, and further patches
will allow its configuration.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-11-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
When entering a nested VM, we set up the hypervisor control interface
based on what the guest hypervisor has set. Especially, we investigate
each list register written by the guest hypervisor whether HW bit is
set. If so, we translate hw irq number from the guest's point of view
to the real hardware irq number if there is a mapping.
Co-developed-by: Jintack Lim <jintack@cs.columbia.edu>
Signed-off-by: Jintack Lim <jintack@cs.columbia.edu>
[Christoffer: Redesigned execution flow around vcpu load/put]
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
[maz: Rewritten to support GICv3 instead of GICv2, NV2 support]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-9-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Wire the handling of all GICv3 EL2 registers, and provide emulation
for all the non memory-backed registers (ICC_SRE_EL2, ICH_VTR_EL2,
ICH_MISR_EL2, ICH_ELRSR_EL2, and ICH_EISR_EL2).
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-7-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/pkvm-memshare-declutter:
: .
: pKVM memory transition simplifications, courtesy of Quentin Perret.
:
: From the cover letter:
: "Since its early days, pKVM has formalized memory 'transitions' (shares
: and donations) using 'struct pkvm_mem_transition' and bunch of helpers
: to manipulate it. The intention was for all transitions to use this
: machinery to ensure we're checking things consistently. However, as
: development progressed, it became clear that the rigidity of this model
: made it really difficult to use in some use-cases which ended-up
: side-stepping it entirely. That is the case for the
: hyp_{un}pin_shared_mem() and host_{un}share_guest() paths upstream which
: use lower level helpers directly, as well as for several other pKVM
: features that should land upstream in the future (ex: when a guest
: relinquishes a page during ballooning, when annotating a page that is
: being DMA'd to, ...). On top of this, the pkvm_mem_transition machinery
: requires a lot of boilerplate which makes the code hard to read, but
: also adds layers of indirection that no compilers seems to see through,
: hence leading to suboptimal generated code.
:
: Given all the above, this series removes the pkvm_mem_transition
: machinery from mem_protect.c, and converts all its users to use
: __*_{check,set}_page_state_range() low-level helpers directly."
: .
KVM: arm64: Drop pkvm_mem_transition for host/hyp donations
KVM: arm64: Drop pkvm_mem_transition for host/hyp sharing
KVM: arm64: Drop pkvm_mem_transition for FF-A
KVM: arm64: Only apply PMCR_EL0.P to the guest range of counters
KVM: arm64: nv: Reload PMU events upon MDCR_EL2.HPME change
KVM: arm64: Use KVM_REQ_RELOAD_PMU to handle PMCR_EL0.E change
KVM: arm64: Add unified helper for reprogramming counters by mask
KVM: arm64: Always check the state from hyp_ack_unshare()
KVM: arm64: Fix set_id_regs selftest for ASIDBITS becoming unwritable
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
It appears that on Qualcomm's x1e CPU, CNTVOFF_EL2 doesn't really
work, specially with HCR_EL2.E2H=1.
A non-zero offset results in a screaming virtual timer interrupt,
to the tune of a few 100k interrupts per second on a 4 vcpu VM.
This is also evidenced by this CPU's inability to correctly run
any of the timer selftests.
The only case this doesn't break is when this register is set to 0,
which breaks VM migration.
When HCR_EL2.E2H=0, the timer seems to behave normally, and does
not result in an interrupt storm.
As a workaround, use the fact that this CPU implements FEAT_ECV,
and trap all accesses to the virtual timer and counter, keeping
CNTVOFF_EL2 set to zero, and emulate accesses to CVAL/TVAL/CTL
and the counter itself, fixing up the timer to account for the
missing offset.
And if you think this is disgusting, you'd probably be right.
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-12-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Although FEAT_ECV allows us to correctly emulate the timers, it also
reduces performances pretty badly.
Mitigate this by emulating the CTL/CVAL register reads in the
inner run loop, without returning to the general kernel.
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-6-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Emulating the timers with FEAT_NV2 is a bit odd, as the timers
can be reconfigured behind our back without the hypervisor even
noticing. In the VHE case, that's an actual regression in the
architecture...
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Having separate helpers for enabling/disabling counters provides the
wrong abstraction, as the state of each counter needs to be evaluated
independently and, in some cases, use a different global enable bit.
Collapse the enable/disable accessors into a single, common helper that
reconfigures every counter set in @mask, leaving the complexity of
determining if an event is actually enabled in
kvm_pmu_counter_is_enabled().
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175513.3658056-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
VGIC_MAX_PRIVATE is a pretty useless definition, and is better
replaced with VGIC_NR_PRIVATE_IRQS.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241117165757.247686-4-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/nv-pmu:
: Support for vEL2 PMU controls
:
: Align the vEL2 PMU support with the current state of non-nested KVM,
: including:
:
: - Trap routing, with the annoying complication of EL2 traps that apply
: in Host EL0
:
: - PMU emulation, using the correct configuration bits depending on
: whether a counter falls in the hypervisor or guest range of PMCs
:
: - Perf event swizzling across nested boundaries, as the event filtering
: needs to be remapped to cope with vEL2
KVM: arm64: nv: Reprogram PMU events affected by nested transition
KVM: arm64: nv: Apply EL2 event filtering when in hyp context
KVM: arm64: nv: Honor MDCR_EL2.HLP
KVM: arm64: nv: Honor MDCR_EL2.HPME
KVM: arm64: Add helpers to determine if PMC counts at a given EL
KVM: arm64: nv: Adjust range of accessible PMCs according to HPMN
KVM: arm64: Rename kvm_pmu_valid_counter_mask()
KVM: arm64: nv: Advertise support for FEAT_HPMN0
KVM: arm64: nv: Describe trap behaviour of MDCR_EL2.HPMN
KVM: arm64: nv: Honor MDCR_EL2.{TPM, TPMCR} in Host EL0
KVM: arm64: nv: Reinject traps that take effect in Host EL0
KVM: arm64: nv: Rename BEHAVE_FORWARD_ANY
KVM: arm64: nv: Allow coarse-grained trap combos to use complex traps
KVM: arm64: Describe RES0/RES1 bits of MDCR_EL2
arm64: sysreg: Add new definitions for ID_AA64DFR0_EL1
arm64: sysreg: Migrate MDCR_EL2 definition to table
arm64: sysreg: Describe ID_AA64DFR2_EL1 fields
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/psci-1.3:
: PSCI v1.3 support, courtesy of David Woodhouse
:
: Bump KVM's PSCI implementation up to v1.3, with the added bonus of
: implementing the SYSTEM_OFF2 call. Like other system-scoped PSCI calls,
: this gets relayed to userspace for further processing with a new
: KVM_SYSTEM_EVENT_SHUTDOWN flag.
:
: As an added bonus, implement client-side support for hibernation with
: the SYSTEM_OFF2 call.
arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate
KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call
KVM: selftests: Add test for PSCI SYSTEM_OFF2
KVM: arm64: Add support for PSCI v1.2 and v1.3
KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
firmware/psci: Add definitions for PSCI v1.3 specification
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Start reprogramming PMU events at nested boundaries now that everything
is in place to handle the EL2 event filter. Only repaint events where
the filter differs between EL1 and EL2 as a slight optimization.
PMU now 'works' for nested VMs, albeit slow.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182559.3364829-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The value of MDCR_EL2.HPMN controls the number of event counters made
visible to EL0 and EL1. This means it is possible for the guest
hypervisor to allow direct access to event counters to the L2.
Rework KVM's PMU register emulation to take the effects of HPMN into
account when handling a trap. For bitmask-style registers, writes only
affect accessible registers.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-14-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Nested PMU support requires dynamically changing the visible range of
PMU counters based on the exception level and value of MDCR_EL2.HPMN. At
the same time, the PMU emulation code needs to know the absolute number
of implemented counters, regardless of context.
Rename the existing helper to make it obvious that it returns the number
of implemented counters and not anything else.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-13-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
MDCR_EL2.HPMN splits the PMU event counters into two ranges: the first
range is accessible from all ELs, and the second range is accessible
only to EL2/3. Supposing the guest hypervisor allows direct access to
the PMU counters from the L2, KVM needs to locally handle those
accesses.
Add a new complex trap configuration for HPMN that checks if the counter
index is accessible to the current context. As written, the architecture
suggests HPMN only causes PMEVCNTR<n>_EL0 to trap, though intuition (and
the pseudocode) suggest that the trap applies to PMEVTYPER<n>_EL0 as
well.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-11-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Accessing CNTHCTL_EL2 is fraught with danger if running with
HCR_EL2.E2H=1: half of the bits are held in CNTKCTL_EL1, and
thus can be changed behind our back, while the rest lives
in the CNTHCTL_EL2 shadow copy that is memory-based.
Yes, this is a lot of fun!
Make sure that we merge the two on read access, while we can
write to CNTKCTL_EL1 in a more straightforward manner.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-7-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
As with PSCI v1.1 in commit 512865d83fd9 ("KVM: arm64: Bump guest PSCI
version to 1.1"), expose v1.3 to the guest by default. The SYSTEM_OFF2
call which is exposed by doing so is compatible for userspace because
it's just a new flag in the event that KVM raises, in precisely the same
way that SYSTEM_RESET2 was compatible when v1.1 was enabled by default.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Miguel Luis <miguel.luis@oracle.com>
Link: https://lore.kernel.org/r/20241019172459.2241939-4-dwmw2@infradead.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
There are 2 defines for the number of PMU counters:
ARMV8_PMU_MAX_COUNTERS and ARMPMU_MAX_HWEVENTS. Both are the same
currently, but Armv9.4/8.9 increases the number of possible counters
from 32 to 33. With this change, the maximum number of counters will
differ for KVM's PMU emulation which is PMUv3.4. Give KVM PMU emulation
its own define to decouple it from the rest of the kernel's number PMU
counters.
The VHE PMU code needs to match the PMU driver, so switch it to use
ARMPMU_MAX_HWEVENTS instead.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20240731-arm-pmu-3-9-icntr-v3-6-280a8d7ff465@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The PMUv3 and KVM code each have a define for the PMU cycle counter
index. Move KVM's define to a shared location and use it for PMUv3
driver.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20240731-arm-pmu-3-9-icntr-v3-5-280a8d7ff465@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Various PMUv3 registers which are a mask of counters are 64-bit
registers, but the accessor functions take a u32. This has been fine as
the upper 32-bits have been RES0 as there has been a maximum of 32
counters prior to Armv9.4/8.9. With Armv9.4/8.9, a 33rd counter is
added. Update the accessor functions to use a u64 instead.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20240731-arm-pmu-3-9-icntr-v3-2-280a8d7ff465@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
* kvm-arm64/misc-6.10:
: .
: Misc fixes and updates targeting 6.10
:
: - Improve boot-time diagnostics when the sysreg tables
: are not correctly sorted
:
: - Allow FFA_MSG_SEND_DIRECT_REQ in the FFA proxy
:
: - Fix duplicate XNX field in the ID_AA64MMFR1_EL1
: writeable mask
:
: - Allocate PPIs and SGIs outside of the vcpu structure, allowing
: for smaller EL2 mapping and some flexibility in implementing
: more or less than 32 private IRQs.
:
: - Use bitmap_gather() instead of its open-coded equivalent
:
: - Make protected mode use hVHE if available
:
: - Purge stale mpidr_data if a vcpu is created after the MPIDR
: map has been created
: .
KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
KVM: arm64: Fix hvhe/nvhe early alias parsing
KVM: arm64: Convert kvm_mpidr_index() to bitmap_gather()
KVM: arm64: vgic: Allocate private interrupts on demand
KVM: arm64: Remove duplicated AA64MMFR1_EL1 XNX
KVM: arm64: Remove FFA_MSG_SEND_DIRECT_REQ from the denylist
KVM: arm64: Improve out-of-order sysreg table diagnostics
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/pkvm-6.10: (25 commits)
: .
: At last, a bunch of pKVM patches, courtesy of Fuad Tabba.
: From the cover letter:
:
: "This series is a bit of a bombay-mix of patches we've been
: carrying. There's no one overarching theme, but they do improve
: the code by fixing existing bugs in pKVM, refactoring code to
: make it more readable and easier to re-use for pKVM, or adding
: functionality to the existing pKVM code upstream."
: .
KVM: arm64: Force injection of a data abort on NISV MMIO exit
KVM: arm64: Restrict supported capabilities for protected VMs
KVM: arm64: Refactor setting the return value in kvm_vm_ioctl_enable_cap()
KVM: arm64: Document the KVM/arm64-specific calls in hypercalls.rst
KVM: arm64: Rename firmware pseudo-register documentation file
KVM: arm64: Reformat/beautify PTP hypercall documentation
KVM: arm64: Clarify rationale for ZCR_EL1 value restored on guest exit
KVM: arm64: Introduce and use predicates that check for protected VMs
KVM: arm64: Add is_pkvm_initialized() helper
KVM: arm64: Simplify vgic-v3 hypercalls
KVM: arm64: Move setting the page as dirty out of the critical section
KVM: arm64: Change kvm_handle_mmio_return() return polarity
KVM: arm64: Fix comment for __pkvm_vcpu_init_traps()
KVM: arm64: Prevent kmemleak from accessing .hyp.data
KVM: arm64: Do not map the host fpsimd state to hyp in pKVM
KVM: arm64: Rename __tlb_switch_to_{guest,host}() in VHE
KVM: arm64: Support TLB invalidation in guest context
KVM: arm64: Avoid BBM when changing only s/w bits in Stage-2 PTE
KVM: arm64: Check for PTE validity when checking for executable/cacheable
KVM: arm64: Avoid BUG-ing from the host abort path
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Private interrupts are currently part of the CPU interface structure
that is part of each and every vcpu we create.
Currently, we have 32 of them per vcpu, resulting in a per-vcpu array
that is just shy of 4kB. On its own, that's no big deal, but it gets
in the way of other things:
- each vcpu gets mapped at EL2 on nVHE/hVHE configurations. This
requires memory that is physically contiguous. However, the EL2
code has no purpose looking at the interrupt structures and
could do without them being mapped.
- supporting features such as EPPIs, which extend the number of
private interrupts past the 32 limit would make the array
even larger, even for VMs that do not use the EPPI feature.
Address these issues by moving the private interrupt array outside
of the vcpu, and replace it with a simple pointer. We take this
opportunity to make it obvious what gets initialised when, as
that path was remarkably opaque, and tighten the locking.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240502154545.3012089-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Consolidate the GICv3 VMCR accessor hypercalls into the APR save/restore
hypercalls so that all of the EL2 GICv3 state is covered by a single pair
of hypercalls.
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-17-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The last genuine use case for the lpi_list_lock was the global LPI
translation cache, which has been removed in favor of a per-ITS xarray.
Remove a layer from the locking puzzle by getting rid of it.
vgic_add_lpi() still has a critical section that needs to protect
against the insertion of other LPIs; change it to take the LPI xarray's
xa_lock to retain this property.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-13-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The MSI injection fast path has been transitioned away from the global
translation cache. Rip it out.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-12-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Within the context of a single ITS, it is possible to use an xarray to
cache the device ID & event ID translation to a particular irq
descriptor. Take advantage of this to build a translation cache capable
of fitting all valid translations for a given ITS.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-9-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The last user has been transitioned to walking the LPI xarray directly.
Cut the wart off, and get rid of the now unneeded lpi_count while doing
so.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-7-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The vgic debug iterator is the final user of vgic_copy_lpi_list(), but
is a bit more complicated to transition to something else. Use a mark
in the LPI xarray to record the indices 'known' to the debug iterator.
Protect against the LPIs from being freed by associating an additional
reference with the xarray mark.
Rework iter_next() to let the xarray walk 'drive' the iteration after
visiting all of the SGIs, PPIs, and SPIs.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-6-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Programming PMU events in the host that count during guest execution is
a feature supported by perf, e.g.
perf stat -e cpu_cycles:G ./lkvm run
While this works for VHE, the guest/host event bitmaps are not carried
through to the hypervisor in the nVHE configuration. Make
kvm_pmu_update_vcpu_events() conditional on whether or not _hardware_
supports PMUv3 rather than if the vCPU as vPMU enabled.
Cc: stable@vger.kernel.org
Fixes: 84d751a019a9 ("KVM: arm64: Pass pmu events to hyp via vcpu")
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240305184840.636212-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/lpi-xarray:
: xarray-based representation of vgic LPIs
:
: KVM's linked-list of LPI state has proven to be a bottleneck in LPI
: injection paths, due to lock serialization when acquiring / releasing a
: reference on an IRQ.
:
: Start the tedious process of reworking KVM's LPI injection by replacing
: the LPI linked-list with an xarray, leveraging this to allow RCU readers
: to walk it outside of the spinlock.
KVM: arm64: vgic: Don't acquire the lpi_list_lock in vgic_put_irq()
KVM: arm64: vgic: Ensure the irq refcount is nonzero when taking a ref
KVM: arm64: vgic: Rely on RCU protection in vgic_get_lpi()
KVM: arm64: vgic: Free LPI vgic_irq structs in an RCU-safe manner
KVM: arm64: vgic: Use atomics to count LPIs
KVM: arm64: vgic: Get rid of the LPI linked-list
KVM: arm64: vgic-its: Walk the LPI xarray in vgic_copy_lpi_list()
KVM: arm64: vgic-v3: Iterate the xarray to find pending LPIs
KVM: arm64: vgic: Use xarray to find LPI in vgic_get_lpi()
KVM: arm64: vgic: Store LPIs in an xarray
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Free the vgic_irq structs in an RCU-safe manner to allow reads of the
LPI configuration data to happen in parallel with the release of LPIs.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240221054253.3848076-8-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Switch to using atomics for LPI accounting, allowing vgic_irq references
to be dropped in parallel.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240221054253.3848076-7-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
All readers of LPI configuration have been transitioned to use the LPI
xarray. Get rid of the linked-list altogether.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240221054253.3848076-6-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Using a linked-list for LPIs is less than ideal as it of course requires
iterative searches to find a particular entry. An xarray is a better
data structure for this use case, as it provides faster searches and can
still handle a potentially sparse range of INTID allocations.
Start by storing LPIs in an xarray, punting usage of the xarray to a
subsequent change. The observant among you will notice that we added yet
another lock to the chain of locking order rules; document the ordering
of the xa_lock. Don't worry, we'll get rid of the lpi_list_lock one
day...
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240221054253.3848076-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
In order to make it easier to check whether a particular feature
is exposed to a guest, add a new set of helpers, with kvm_has_feat()
being the most useful.
Let's start making use of them in the PMU code (courtesy of Oliver).
Follow-up changes will introduce additional use patterns.
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Co-developed--by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240214131827.2856277-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.7
- Generalized infrastructure for 'writable' ID registers, effectively
allowing userspace to opt-out of certain vCPU features for its guest
- Optimization for vSGI injection, opportunistically compressing MPIDR
to vCPU mapping into a table
- Improvements to KVM's PMU emulation, allowing userspace to select
the number of PMCs available to a VM
- Guest support for memory operation instructions (FEAT_MOPS)
- Cleanups to handling feature flags in KVM_ARM_VCPU_INIT, squashing
bugs and getting rid of useless code
- Changes to the way the SMCCC filter is constructed, avoiding wasted
memory allocations when not in use
- Load the stage-2 MMU context at vcpu_load() for VHE systems, reducing
the overhead of errata mitigations
- Miscellaneous kernel and selftest fixes
|
|
* kvm-arm64/pmu_pmcr_n:
: User-defined PMC limit, courtesy Raghavendra Rao Ananta
:
: Certain VMMs may want to reserve some PMCs for host use while running a
: KVM guest. This was a bit difficult before, as KVM advertised all
: supported counters to the guest. Userspace can now limit the number of
: advertised PMCs by writing to PMCR_EL0.N, as KVM's sysreg and PMU
: emulation enforce the specified limit for handling guest accesses.
KVM: selftests: aarch64: vPMU test for validating user accesses
KVM: selftests: aarch64: vPMU register test for unimplemented counters
KVM: selftests: aarch64: vPMU register test for implemented counters
KVM: selftests: aarch64: Introduce vpmu_counter_access test
tools: Import arm_pmuv3.h
KVM: arm64: PMU: Allow userspace to limit PMCR_EL0.N for the guest
KVM: arm64: Sanitize PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR} before first run
KVM: arm64: Add {get,set}_user for PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
KVM: arm64: PMU: Set PMCR_EL0.N for vCPU based on the associated PMU
KVM: arm64: PMU: Add a helper to read a vCPU's PMCR_EL0
KVM: arm64: Select default PMU in KVM_ARM_VCPU_INIT handler
KVM: arm64: PMU: Introduce helpers to set the guest's PMU
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/sgi-injection:
: vSGI injection improvements + fixes, courtesy Marc Zyngier
:
: Avoid linearly searching for vSGI targets using a compressed MPIDR to
: index a cache. While at it, fix some egregious bugs in KVM's mishandling
: of vcpuid (user-controlled value) and vcpu_idx.
KVM: arm64: Clarify the ordering requirements for vcpu/RD creation
KVM: arm64: vgic-v3: Optimize affinity-based SGI injection
KVM: arm64: Fast-track kvm_mpidr_to_vcpu() when mpidr_data is available
KVM: arm64: Build MPIDR to vcpu index cache at runtime
KVM: arm64: Simplify kvm_vcpu_get_mpidr_aff()
KVM: arm64: Use vcpu_idx for invalidation tracking
KVM: arm64: vgic: Use vcpu_idx for the debug information
KVM: arm64: vgic-v2: Use cpuid from userspace as vcpu_id
KVM: arm64: vgic-v3: Refactor GICv3 SGI generation
KVM: arm64: vgic-its: Treat the collection target address as a vcpu_id
KVM: arm64: vgic: Make kvm_vgic_inject_irq() take a vcpu pointer
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/pmevtyper-filter:
: Fixes to KVM's handling of the PMUv3 exception level filtering bits
:
: - NSH (count at EL2) and M (count at EL3) should be stateful when the
: respective EL is advertised in the ID registers but have no effect on
: event counting.
:
: - NSU and NSK modify the event filtering of EL0 and EL1, respectively.
: Though the kernel may not use these bits, other KVM guests might.
: Implement these bits exactly as written in the pseudocode if EL3 is
: advertised.
KVM: arm64: Add PMU event filter bits required if EL3 is implemented
KVM: arm64: Make PMEVTYPER<n>_EL0.NSH RES0 if EL2 isn't advertised
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
For unimplemented counters, the registers PM{C,I}NTEN{SET,CLR}
and PMOVS{SET,CLR} are expected to have the corresponding bits RAZ.
Hence to ensure correct KVM's PMU emulation, mask out the RES0 bits.
Defer this work to the point that userspace can no longer change the
number of advertised PMCs.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231020214053.2144305-7-rananta@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The number of PMU event counters is indicated in PMCR_EL0.N.
For a vCPU with PMUv3 configured, the value is set to the same
value as the current PE on every vCPU reset. Unless the vCPU is
pinned to PEs that has the PMU associated to the guest from the
initial vCPU reset, the value might be different from the PMU's
PMCR_EL0.N on heterogeneous PMU systems.
Fix this by setting the vCPU's PMCR_EL0.N to the PMU's PMCR_EL0.N
value. Track the PMCR_EL0.N per guest, as only one PMU can be set
for the guest (PMCR_EL0.N must be the same for all vCPUs of the
guest), and it is convenient for updating the value.
To achieve this, the patch introduces a helper,
kvm_arm_pmu_get_max_counters(), that reads the maximum number of
counters from the arm_pmu associated to the VM. Make the function
global as upcoming patches will be interested to know the value
while setting the PMCR.N of the guest from userspace.
KVM does not yet support userspace modifying PMCR_EL0.N.
The following patch will add support for that.
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Co-developed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Link: https://lore.kernel.org/r/20231020214053.2144305-5-rananta@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Add a helper to read a vCPU's PMCR_EL0, and use it whenever KVM
reads a vCPU's PMCR_EL0.
Currently, the PMCR_EL0 value is tracked per vCPU. The following
patches will make (only) PMCR_EL0.N track per guest. Having the
new helper will be useful to combine the PMCR_EL0.N field
(tracked per guest) and the other fields (tracked per vCPU)
to provide the value of PMCR_EL0.
No functional change intended.
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231020214053.2144305-4-rananta@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Future changes to KVM's sysreg emulation will rely on having a valid PMU
instance to determine the number of implemented counters (PMCR_EL0.N).
This is earlier than when userspace is expected to modify the vPMU
device attributes, where the default is selected today.
Select the default PMU when handling KVM_ARM_VCPU_INIT such that it is
available in time for sysreg emulation.
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Co-developed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Link: https://lore.kernel.org/r/20231020214053.2144305-3-rananta@google.com
[Oliver: rewrite changelog]
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The NSH bit, which filters event counting at EL2, is required by the
architecture if an implementation has EL2. Even though KVM doesn't
support nested virt yet, it makes no effort to hide the existence of EL2
from the ID registers. Userspace can, however, change the value of PFR0
to hide EL2. Align KVM's sysreg emulation with the architecture and make
NSH RES0 if EL2 isn't advertised. Keep in mind the bit is ignored when
constructing the backing perf event.
While at it, build the event type mask using explicit field definitions
instead of relying on ARMV8_PMU_EVTYPE_MASK. KVM probably should've been
doing this in the first place, as it avoids changes to the
aforementioned mask affecting sysreg emulation.
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20231019185618.3442949-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Contrary to common belief, HCR_EL2.TGE has a direct and immediate
effect on the way the EL0 physical counter is offset. Flipping
TGE from 1 to 0 while at EL2 immediately changes the way the counter
compared to the CVAL limit.
This means that we cannot directly save/restore the guest's view of
CVAL, but that we instead must treat it as if CNTPOFF didn't exist.
Only in the world switch, once we figure out that we do have CNTPOFF,
can we must the offset back and forth depending on the polarity of
TGE.
Fixes: 2b4825a86940 ("KVM: arm64: timers: Use CNTPOFF_EL2 to offset the physical timer")
Reported-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Tested-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Passing a vcpu_id to kvm_vgic_inject_irq() is silly for two reasons:
- we often confuse vcpu_id and vcpu_idx
- we eventually have to convert it back to a vcpu
- we can't count
Instead, pass a vcpu pointer, which is unambiguous. A NULL vcpu
is also allowed for interrupts that are not private to a vcpu
(such as SPIs).
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230927090911.3355209-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The vCPU-scoped feature bitmap was left in place a couple of releases
ago in case the change to VM-scoped vCPU features broke anyone. Nobody
has complained and the interop between VM and vCPU bitmaps is pretty
gross. Throw it out.
Link: https://lore.kernel.org/r/20230920195036.1169791-9-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Get rid of the return value for kvm_reset_vcpu() as there are no longer
any cases where it returns a nonzero value.
Link: https://lore.kernel.org/r/20230920195036.1169791-8-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Most of the internal definitions for PMU emulation are guarded with
CONFIG_HW_PERF_EVENTS. However, this isn't enough, and leads to
these definitions leaking if CONFIG_KVM isn't enabled.
This leads to some compilation breakage in this exact configuration.
Fix it by falling back to the dummy stubs if either perf or KVM
isn't selected.
Reported-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Huang Shijie reports that, when profiling a guest from the host
with a number of events that exceeds the number of available
counters, the reported counts are wildly inaccurate. Without
the counter oversubscription, the reported counts are correct.
Their investigation indicates that upon counter rotation (which
takes place on the back of a timer interrupt), we fail to
re-apply the guest EL0 enabling, leading to the counting of host
events instead of guest events.
In order to solve this, add yet another hook between the host PMU
driver and KVM, re-applying the guest EL0 configuration if the
right conditions apply (the host is VHE, we are in interrupt
context, and we interrupted a running vcpu). This triggers a new
vcpu request which will apply the correct configuration on guest
reentry.
With this, we have the correct counts, even when the counters are
oversubscribed.
Reported-by: Huang Shijie <shijie@os.amperecomputing.com>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Tested_by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230809013953.7692-1-shijie@os.amperecomputing.com
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20230820090108.177817-1-maz@kernel.org
|
|
Xiang reports that VMs occasionally fail to boot on GICv4.1 systems when
running a preemptible kernel, as it is possible that a vCPU is blocked
without requesting a doorbell interrupt.
The issue is that any preemption that occurs between vgic_v4_put() and
schedule() on the block path will mark the vPE as nonresident and *not*
request a doorbell irq. This occurs because when the vcpu thread is
resumed on its way to block, vcpu_load() will make the vPE resident
again. Once the vcpu actually blocks, we don't request a doorbell
anymore, and the vcpu won't be woken up on interrupt delivery.
Fix it by tracking that we're entering WFI, and key the doorbell
request on that flag. This allows us not to make the vPE resident
when going through a preempt/schedule cycle, meaning we don't lose
any state.
Cc: stable@vger.kernel.org
Fixes: 8e01d9a396e6 ("KVM: arm64: vgic-v4: Move the GICv4 residency flow to be driven by vcpu_load/put")
Reported-by: Xiang Chen <chenxiang66@hisilicon.com>
Suggested-by: Zenghui Yu <yuzenghui@huawei.com>
Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20230713070657.3873244-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Common KVM changes for 6.5:
- Fix unprotected vcpu->pid dereference via debugfs
- Fix KVM_BUG() and KVM_BUG_ON() macros with 64-bit conditionals
- Refactor failure path in kvm_io_bus_unregister_dev() to simplify the code
- Misc cleanups
|
|
There's no longer a need for the baggage of the old scheme for handling
configurable ID register fields. Rip it all out in favor of the
generalized infrastructure.
Link: https://lore.kernel.org/r/20230609190054.1542113-12-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Current usage of kvm_io_device requires users to destruct it with an extra
call of kvm_iodevice_destructor after the device gets unregistered from
kvm_io_bus. This is not necessary and can cause errors if a user forgot
to make the extra call.
Simplify the usage by combining kvm_iodevice_destructor into
kvm_io_bus_unregister_dev. This reduces LOCs a bit for users and can
avoid the leakage of destructing the device explicitly.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230207123713.3905-2-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
KVM allows userspace to write an IMPDEF PMU version to the corresponding
32bit and 64bit ID register fields for the sake of backwards
compatibility with kernels that lacked commit 3d0dba5764b9 ("KVM: arm64:
PMU: Move the ID_AA64DFR0_EL1.PMUver limit to VM creation"). Plumbing
that IMPDEF PMU version through to the gues is getting in the way of
progress, and really doesn't any sense in the first place.
Bite the bullet and reinterpret the IMPDEF PMU version as NI (0) for
userspace writes. Additionally, spill the dirty details into a comment.
Link: https://lore.kernel.org/r/20230609190054.1542113-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Pull kvm updates from Paolo Bonzini:
"s390:
- More phys_to_virt conversions
- Improvement of AP management for VSIE (nested virtualization)
ARM64:
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features being
moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one. This
last part allows the NV timer code to be implemented on top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
x86:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
enabled, and by giving the guest control of CR0.WP when EPT is
enabled on VMX (VMX-only because SVM doesn't support per-bit
controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
return as a bool
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Avoid unnecessary writes+flushes when the guest is only adding new
PTEs
- Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
optimizations when emulating invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
single A/D bit using a LOCK AND instead of XCHG, and skip all of
the "handle changed SPTE" overhead associated with writing the
entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid
having to walk (potentially) all descriptors during insertion and
deletion, which gets quite expensive if the guest is spamming
fork()
- Disallow virtualizing legacy LBRs if architectural LBRs are
available, the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably
PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features
- Overhaul the vmx_pmu_caps selftest to better validate
PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- AMD SVM:
- Add support for virtual NMIs
- Fixes for edge cases related to virtual interrupts
- Intel AMX:
- Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
XTILE_DATA is not being reported due to userspace not opting in
via prctl()
- Fix a bug in emulation of ENCLS in compatibility mode
- Allow emulation of NOP and PAUSE for L2
- AMX selftests improvements
- Misc cleanups
MIPS:
- Constify MIPS's internal callbacks (a leftover from the hardware
enabling rework that landed in 6.3)
Generic:
- Drop unnecessary casts from "void *" throughout kvm_main.c
- Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
struct size by 8 bytes on 64-bit kernels by utilizing a padding
hole
Documentation:
- Fix goof introduced by the conversion to rST"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
KVM: s390: pci: fix virtual-physical confusion on module unload/load
KVM: s390: vsie: clarifications on setting the APCB
KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
KVM: selftests: Test the PMU event "Instructions retired"
KVM: selftests: Copy full counter values from guest in PMU event filter test
KVM: selftests: Use error codes to signal errors in PMU event filter test
KVM: selftests: Print detailed info in PMU event filter asserts
KVM: selftests: Add helpers for PMC asserts in PMU event filter test
KVM: selftests: Add a common helper for the PMU event filter guest code
KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
KVM: arm64: vhe: Drop extra isb() on guest exit
KVM: arm64: vhe: Synchronise with page table walker on MMU update
KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
KVM: arm64: nvhe: Synchronise with page table walker on TLBI
KVM: arm64: Handle 32bit CNTPCTSS traps
KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
KVM: arm64: vgic: Don't acquire its_lock before config_lock
KVM: selftests: Add test to verify KVM's supported XCR0
...
|
|
* kvm-arm64/smccc-filtering:
: .
: SMCCC call filtering and forwarding to userspace, courtesy of
: Oliver Upton. From the cover letter:
:
: "The Arm SMCCC is rather prescriptive in regards to the allocation of
: SMCCC function ID ranges. Many of the hypercall ranges have an
: associated specification from Arm (FF-A, PSCI, SDEI, etc.) with some
: room for vendor-specific implementations.
:
: The ever-expanding SMCCC surface leaves a lot of work within KVM for
: providing new features. Furthermore, KVM implements its own
: vendor-specific ABI, with little room for other implementations (like
: Hyper-V, for example). Rather than cramming it all into the kernel we
: should provide a way for userspace to handle hypercalls."
: .
KVM: selftests: Fix spelling mistake "KVM_HYPERCAL_EXIT_SMC" -> "KVM_HYPERCALL_EXIT_SMC"
KVM: arm64: Test that SMC64 arch calls are reserved
KVM: arm64: Prevent userspace from handling SMC64 arch range
KVM: arm64: Expose SMC/HVC width to userspace
KVM: selftests: Add test for SMCCC filter
KVM: selftests: Add a helper for SMCCC calls with SMC instruction
KVM: arm64: Let errors from SMCCC emulation to reach userspace
KVM: arm64: Return NOT_SUPPORTED to guest for unknown PSCI version
KVM: arm64: Introduce support for userspace SMCCC filtering
KVM: arm64: Add support for KVM_EXIT_HYPERCALL
KVM: arm64: Use a maple tree to represent the SMCCC filter
KVM: arm64: Refactor hvc filtering to support different actions
KVM: arm64: Start handling SMCs from EL1
KVM: arm64: Rename SMC/HVC call handler to reflect reality
KVM: arm64: Add vm fd device attribute accessors
KVM: arm64: Add a helper to check if a VM has ran once
KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/timer-vm-offsets: (21 commits)
: .
: This series aims at satisfying multiple goals:
:
: - allow a VMM to atomically restore a timer offset for a whole VM
: instead of updating the offset each time a vcpu get its counter
: written
:
: - allow a VMM to save/restore the physical timer context, something
: that we cannot do at the moment due to the lack of offsetting
:
: - provide a framework that is suitable for NV support, where we get
: both global and per timer, per vcpu offsetting, and manage
: interrupts in a less braindead way.
:
: Conflict resolution involves using the new per-vcpu config lock instead
: of the home-grown timer lock.
: .
KVM: arm64: Handle 32bit CNTPCTSS traps
KVM: arm64: selftests: Augment existing timer test to handle variable offset
KVM: arm64: selftests: Deal with spurious timer interrupts
KVM: arm64: selftests: Add physical timer registers to the sysreg list
KVM: arm64: nv: timers: Support hyp timer emulation
KVM: arm64: nv: timers: Add a per-timer, per-vcpu offset
KVM: arm64: Document KVM_ARM_SET_CNT_OFFSETS and co
KVM: arm64: timers: Abstract the number of valid timers per vcpu
KVM: arm64: timers: Fast-track CNTPCT_EL0 trap handling
KVM: arm64: Elide kern_hyp_va() in VHE-specific parts of the hypervisor
KVM: arm64: timers: Move the timer IRQs into arch_timer_vm_data
KVM: arm64: timers: Abstract per-timer IRQ access
KVM: arm64: timers: Rationalise per-vcpu timer init
KVM: arm64: timers: Allow save/restoring of the physical timer
KVM: arm64: timers: Allow userspace to set the global counter offset
KVM: arm64: Expose {un,}lock_all_vcpus() to the rest of KVM
KVM: arm64: timers: Allow physical offset without CNTPOFF_EL2
KVM: arm64: timers: Use CNTPOFF_EL2 to offset the physical timer
arm64: Add HAS_ECV_CNTPOFF capability
arm64: Add CNTPOFF_EL2 register definition
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
As the SMCCC (and related specifications) march towards an 'everything
and the kitchen sink' interface for interacting with a system it becomes
less likely that KVM will support every related feature. We could do
better by letting userspace have a crack at it instead.
Allow userspace to define an 'SMCCC filter' that applies to both HVCs
and SMCs initiated by the guest. Supporting both conduits with this
interface is important for a couple of reasons. Guest SMC usage is table
stakes for a nested guest, as HVCs are always taken to the virtual EL2.
Additionally, guests may want to interact with a service on the secure
side which can now be proxied by userspace.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-10-oliver.upton@linux.dev
|
|
Maple tree is an efficient B-tree implementation that is intended for
storing non-overlapping intervals. Such a data structure is a good fit
for the SMCCC filter as it is desirable to sparsely allocate the 32 bit
function ID space.
To that end, add a maple tree to kvm_arch and correctly init/teardown
along with the VM. Wire in a test against the hypercall filter for HVCs
which does nothing until the controls are exposed to userspace.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-8-oliver.upton@linux.dev
|
|
KVM handles SMCCC calls from virtual EL2 that use the SMC instruction
since commit bd36b1a9eb5a ("KVM: arm64: nv: Handle SMCs taken from
virtual EL2"). Thus, the function name of the handler no longer reflects
reality.
Normalize the name on SMCCC, since that's the only hypercall interface
KVM supports in the first place. No fuctional change intended.
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-5-oliver.upton@linux.dev
|
|
Emulating EL2 also means emulating the EL2 timers. To do so, we expand
our timer framework to deal with at most 4 timers. At any given time,
two timers are using the HW timers, and the two others are purely
emulated.
The role of deciding which is which at any given time is left to a
mapping function which is called every time we need to make such a
decision.
Reviewed-by: Colton Lewis <coltonlewis@google.com>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-18-maz@kernel.org
|
|
Being able to set a global offset isn't enough.
With NV, we also need to a per-vcpu, per-timer offset (for example,
CNTVCT_EL0 being offset by CNTVOFF_EL2).
Use a similar method as the VM-wide offset to have a timer point
to the shadow register that contains the offset value.
Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-17-maz@kernel.org
|
|
Having the timer IRQs duplicated into each vcpu isn't great, and
becomes absolutely awful with NV. So let's move these into
the per-VM arch_timer_vm_data structure.
This simplifies a lot of code, but requires us to introduce a
mutex so that we can reason about userspace trying to change
an interrupt number while another vcpu is running, something
that wasn't really well handled so far.
Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-12-maz@kernel.org
|
|
As we are about to move the location of the per-timer IRQ into
the VM structure, abstract the location of the IRQ behind an
accessor. This will make the repainting sligntly less painful.
Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-11-maz@kernel.org
|
|
The way we initialise our timer contexts may be satisfactory
for two timers, but will be getting pretty annoying with four.
Cleanup the whole thing by removing the code duplication and
getting rid of unused IRQ configuration elements.
Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-10-maz@kernel.org
|
|
With ECV and CNTPOFF_EL2, it is very easy to offer an offset for
the physical timer. So let's do just that.
Nothing can set the offset yet, so this should have no effect
whatsoever (famous last words...).
Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-5-maz@kernel.org
|
|
Instead of accumulating the fractional ns value generated every time
we compute a ns delta in a global variable, use a per-vcpu, per-timer
variable. This keeps the fractional ns local to the timer instead of
contributing to any odd, unrelated timer.
Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-2-maz@kernel.org
|
|
Having the ARM PMUv3 driver sitting in arch/arm64/kernel is getting
in the way of being able to use perf on ARMv8 cores running a 32bit
kernel, such as 32bit KVM guests.
This patch moves it into drivers/perf/arm_pmuv3.c, with an include
file in include/linux/perf/arm_pmuv3.h. The only thing left in
arch/arm64 is some mundane perf stuff.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Zaid Al-Bassam <zalbassam@google.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230317195027.3746949-2-zalbassam@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Having a per-vcpu virtual offset is a pain. It needs to be synchronized
on each update, and expands badly to a setup where different timers can
have different offsets, or have composite offsets (as with NV).
So let's start by replacing the use of the CNTVOFF_EL2 shadow register
(which we want to reclaim for NV anyway), and make the virtual timer
carry a pointer to a VM-wide offset.
This simplifies the code significantly. It also addresses two terrible bugs:
- The use of CNTVOFF_EL2 leads to some nice offset corruption
when the sysreg gets reset, as reported by Joey.
- The kvm mutex is taken from a vcpu ioctl, which goes against
the locking rules...
Reported-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230224173915.GA17407@e124191.cambridge.arm.com
Tested-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20230224191640.3396734-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
KVM/riscv changes for 6.3
- Fix wrong usage of PGDIR_SIZE to check page sizes
- Fix privilege mode setting in kvm_riscv_vcpu_trap_redirect()
- Redirect illegal instruction traps to guest
- SBI PMU support for guest
|
|
Currently, the unknown no-running-vcpu sites are reported when a
dirty page is tracked by mark_page_dirty_in_slot(). Until now, the
only known no-running-vcpu site is saving vgic/its tables through
KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} command on KVM device
"kvm-arm-vgic-its". Unfortunately, there are more unknown sites to
be handled and no-running-vcpu context will be allowed in these
sites: (1) KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_RESTORE_TABLES} command
on KVM device "kvm-arm-vgic-its" to restore vgic/its tables. The
vgic3 LPI pending status could be restored. (2) Save vgic3 pending
table through KVM_DEV_ARM_{VGIC_GRP_CTRL, VGIC_SAVE_PENDING_TABLES}
command on KVM device "kvm-arm-vgic-v3".
In order to handle those unknown cases, we need a unified helper
vgic_write_guest_lock(). struct vgic_dist::save_its_tables_in_progress
is also renamed to struct vgic_dist::save_tables_in_progress.
No functional change intended.
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230126235451.469087-3-gshan@redhat.com
|
|
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks
use consistent formatting across common x86, Intel, and AMD code. In
addition to providing consistent print formatting, using KBUILD_MODNAME,
e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and
SGX and ...) as technologies without generating weird messages, and
without causing naming conflicts with other kernel code, e.g. "SEV: ",
"tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems.
Opportunistically move away from printk() for prints that need to be
modified anyways, e.g. to drop a manual "kvm: " prefix.
Opportunistically convert a few SGX WARNs that are similarly modified to
WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good
that they would fire repeatedly and spam the kernel log without providing
unique information in each print.
Note, defining pr_fmt yields undesirable results for code that uses KVM's
printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem
as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's
wrappers is relatively limited in KVM x86 code.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Message-Id: <20221130230934.1014142-35-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
For a number of historical reasons, the KVM/arm64 hotplug setup is pretty
complicated, and we have two extra CPUHP notifiers for vGIC and timers.
It looks pretty pointless, and gets in the way of further changes.
So let's just expose some helpers that can be called from the core
CPUHP callback, and get rid of everything else.
This gives us the opportunity to drop a useless notifier entry,
as well as tidy-up the timer enable/disable, which was a bit odd.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
* kvm-arm64/pmu-unchained:
: .
: PMUv3 fixes and improvements:
:
: - Make the CHAIN event handling strictly follow the architecture
:
: - Add support for PMUv3p5 (64bit counters all the way)
:
: - Various fixes and cleanups
: .
KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
KVM: arm64: PMU: Sanitise PMCR_EL0.LP on first vcpu run
KVM: arm64: PMU: Simplify PMCR_EL0 reset handling
KVM: arm64: PMU: Replace version number '0' with ID_AA64DFR0_EL1_PMUVer_NI
KVM: arm64: PMU: Make kvm_pmc the main data structure
KVM: arm64: PMU: Simplify vcpu computation on perf overflow notification
KVM: arm64: PMU: Allow PMUv3p5 to be exposed to the guest
KVM: arm64: PMU: Implement PMUv3p5 long counter support
KVM: arm64: PMU: Allow ID_DFR0_EL1.PerfMon to be set from userspace
KVM: arm64: PMU: Allow ID_AA64DFR0_EL1.PMUver to be set from userspace
KVM: arm64: PMU: Move the ID_AA64DFR0_EL1.PMUver limit to VM creation
KVM: arm64: PMU: Do not let AArch32 change the counters' top 32 bits
KVM: arm64: PMU: Simplify setting a counter to a specific value
KVM: arm64: PMU: Add counter_index_to_*reg() helpers
KVM: arm64: PMU: Only narrow counters that are not 64bit wide
KVM: arm64: PMU: Narrow the overflow checking when required
KVM: arm64: PMU: Distinguish between 64bit counter and 64bit overflow
KVM: arm64: PMU: Always advertise the CHAIN event
KVM: arm64: PMU: Align chained counter implementation with architecture pseudocode
arm64: Add ID_DFR0_EL1.PerfMon values for PMUv3p7 and IMP_DEF
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
PMUv3p5 (which is mandatory with ARMv8.5) comes with some extra
features:
- All counters are 64bit
- The overflow point is controlled by the PMCR_EL0.LP bit
Add the required checks in the helpers that control counter
width and overflow, as well as the sysreg handling for the LP
bit. A new kvm_pmu_is_3p5() helper makes it easy to spot the
PMUv3p5 specific handling.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-14-maz@kernel.org
|
|
As further patches will enable the selection of a PMU revision
from userspace, sample the supported PMU revision at VM creation
time, rather than building each time the ID_AA64DFR0_EL1 register
is accessed.
This shouldn't result in any change in behaviour.
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-11-maz@kernel.org
|
|
pseudocode
Ricardo recently pointed out that the PMU chained counter emulation
in KVM wasn't quite behaving like the one on actual hardware, in
the sense that a chained counter would expose an overflow on
both halves of a chained counter, while KVM would only expose the
overflow on the top half.
The difference is subtle, but significant. What does the architecture
say (DDI0087 H.a):
- Up to PMUv3p4, all counters but the cycle counter are 32bit
- A 32bit counter that overflows generates a CHAIN event on the
adjacent counter after exposing its own overflow status
- The CHAIN event is accounted if the counter is correctly
configured (CHAIN event selected and counter enabled)
This all means that our current implementation (which uses 64bit
perf events) prevents us from emulating this overflow on the lower half.
How to fix this? By implementing the above, to the letter.
This largely results in code deletion, removing the notions of
"counter pair", "chained counters", and "canonical counter".
The code is further restructured to make the CHAIN handling similar
to SWINC, as the two are now extremely similar in behaviour.
Reported-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Link: https://lore.kernel.org/r/20221113163832.3154370-3-maz@kernel.org
|
|
Enable ring-based dirty memory tracking on ARM64:
- Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL.
- Enable CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP.
- Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page
offset.
- Add ARM64 specific kvm_arch_allow_write_without_running_vcpu() to
keep the site of saving vgic/its tables out of the no-running-vcpu
radar.
Signed-off-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110104914.31280-5-gshan@redhat.com
|
|
Align kvm_vgic_addr() with the rest of the code by moving the
userspace accesses into it. kvm_vgic_addr() is also made static.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
We carry a legacy interface to set the base addresses for GICv2.
As this is currently plumbed into the same handling code as
the modern interface, it limits the evolution we can make there.
Add a helper dedicated to this handling, with a view of maybe
removing this in the future.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/per-vcpu-host-pmu-data:
: .
: Pass the host PMU state in the vcpu to avoid the use of additional
: shared memory between EL1 and EL2 (this obviously only applies
: to nVHE and Protected setups).
:
: Patches courtesy of Fuad Tabba.
: .
KVM: arm64: pmu: Restore compilation when HW_PERF_EVENTS isn't selected
KVM: arm64: Reenable pmu in Protected Mode
KVM: arm64: Pass pmu events to hyp via vcpu
KVM: arm64: Repack struct kvm_pmu to reduce size
KVM: arm64: Wrapper for getting pmu_events
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/vgic-invlpir:
: .
: Implement MMIO-based LPI invalidation for vGICv3.
: .
KVM: arm64: vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision
KVM: arm64: vgic-v3: Implement MMIO-based LPI invalidation
KVM: arm64: vgic-v3: Expose GICR_CTLR.RWP when disabling LPIs
irqchip/gic-v3: Exposes bit values for GICR_CTLR.{IR, CES}
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/hcall-selection:
: .
: Introduce a new set of virtual sysregs for userspace to
: select the hypercalls it wants to see exposed to the guest.
:
: Patches courtesy of Raghavendra and Oliver.
: .
KVM: arm64: Fix hypercall bitmap writeback when vcpus have already run
KVM: arm64: Hide KVM_REG_ARM_*_BMAP_BIT_COUNT from userspace
Documentation: Fix index.rst after psci.rst renaming
selftests: KVM: aarch64: Add the bitmap firmware registers to get-reg-list
selftests: KVM: aarch64: Introduce hypercall ABI test
selftests: KVM: Create helper for making SMCCC calls
selftests: KVM: Rename psci_cpu_on_test to psci_test
tools: Import ARM SMCCC definitions
Docs: KVM: Add doc for the bitmap firmware registers
Docs: KVM: Rename psci.rst to hypercalls.rst
KVM: arm64: Add vendor hypervisor firmware register
KVM: arm64: Add standard hypervisor firmware register
KVM: arm64: Setup a framework for hypercall bitmap firmware registers
KVM: arm64: Factor out firmware register handling from psci.c
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Moving kvm_pmu_events into the vcpu (and refering to it) broke the
somewhat unusual case where the kernel has no support for a PMU
at all.
In order to solve this, move things around a bit so that we can
easily avoid refering to the pmu structure outside of PMU-aware
code. As a bonus, pmu.c isn't compiled in when HW_PERF_EVENTS
isn't selected.
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/202205161814.KQHpOzsJ-lkp@intel.com
|
|
Instead of the host accessing hyp data directly, pass the pmu
events of the current cpu to hyp via the vcpu.
This adds 64 bits (in two fields) to the vcpu that need to be
synced before every vcpu run in nvhe and protected modes.
However, it isolates the hypervisor from the host, which allows
us to use pmu in protected mode in a subsequent patch.
No visible side effects in behavior intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220510095710.148178-4-tabba@google.com
|
|
struct kvm_pmu has 2 holes using 10 bytes. This is instantiated
in all vcpus, so it adds up. Repack the structures to remove the
holes.
No functional change intended.
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220510095710.148178-3-tabba@google.com
|
|
Since adversising GICR_CTLR.{IC,CES} is directly observable from
a guest, we need to make it selectable from userspace.
For that, bump the default GICD_IIDR revision and let userspace
downgrade it to the previous default. For GICv2, the two distributor
revisions are strictly equivalent.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220405182327.205520-5-maz@kernel.org
|
|
Since GICv4.1, it has become legal for an implementation to advertise
GICR_{INVLPIR,INVALLR,SYNCR} while having an ITS, allowing for a more
efficient invalidation scheme (no guest command queue contention when
multiple CPUs are generating invalidations).
Provide the invalidation registers as a primitive to their ITS
counterpart. Note that we don't advertise them to the guest yet
(the architecture allows an implementation to do this).
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oupton@google.com>
Link: https://lore.kernel.org/r/20220405182327.205520-4-maz@kernel.org
|
|
When disabling LPIs, a guest needs to poll GICR_CTLR.RWP in order
to be sure that the write has taken effect. We so far reported it
as 0, as we didn't advertise that LPIs could be turned off the
first place.
Start tracking this state during which LPIs are being disabled,
and expose the 'in progress' state via the RWP bit.
We also take this opportunity to disallow enabling LPIs and programming
GICR_{PEND,PROP}BASER while LPI disabling is in progress, as allowed by
the architecture (UNPRED behaviour).
We don't advertise the feature to the guest yet (which is allowed by
the architecture).
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220405182327.205520-3-maz@kernel.org
|
|
KVM regularly introduces new hypercall services to the guests without
any consent from the userspace. This means, the guests can observe
hypercall services in and out as they migrate across various host
kernel versions. This could be a major problem if the guest
discovered a hypercall, started using it, and after getting migrated
to an older kernel realizes that it's no longer available. Depending
on how the guest handles the change, there's a potential chance that
the guest would just panic.
As a result, there's a need for the userspace to elect the services
that it wishes the guest to discover. It can elect these services
based on the kernels spread across its (migration) fleet. To remedy
this, extend the existing firmware pseudo-registers, such as
KVM_REG_ARM_PSCI_VERSION, but by creating a new COPROC register space
for all the hypercall services available.
These firmware registers are categorized based on the service call
owners, but unlike the existing firmware pseudo-registers, they hold
the features supported in the form of a bitmap.
During the VM initialization, the registers are set to upper-limit of
the features supported by the corresponding registers. It's expected
that the VMMs discover the features provided by each register via
GET_ONE_REG, and write back the desired values using SET_ONE_REG.
KVM allows this modification only until the VM has started.
Some of the standard features are not mapped to any bits of the
registers. But since they can recreate the original problem of
making it available without userspace's consent, they need to
be explicitly added to the case-list in
kvm_hvc_call_default_allowed(). Any function-id that's not enabled
via the bitmap, or not listed in kvm_hvc_call_default_allowed, will
be returned as SMCCC_RET_NOT_SUPPORTED to the guest.
Older userspace code can simply ignore the feature and the
hypercall services will be exposed unconditionally to the guests,
thus ensuring backward compatibility.
In this patch, the framework adds the register only for ARM's standard
secure services (owner value 4). Currently, this includes support only
for ARM True Random Number Generator (TRNG) service, with bit-0 of the
register representing mandatory features of v1.0. Other services are
momentarily added in the upcoming patches.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
[maz: reduced the scope of some helpers, tidy-up bitmap max values,
dropped error-only fast path]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220502233853.1233742-3-rananta@google.com
|
|
Common hypercall firmware register handing is currently employed
by psci.c. Since the upcoming patches add more of these registers,
it's better to move the generic handling to hypercall.c for a
cleaner presentation.
While we are at it, collect all the firmware registers under
fw_reg_ids[] to help implement kvm_arm_get_fw_num_regs() and
kvm_arm_copy_fw_reg_indices() in a generic way. Also, define
KVM_REG_FEATURE_LEVEL_MASK using a GENMASK instead.
No functional change intended.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
[maz: fixed KVM_REG_FEATURE_LEVEL_MASK]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220502233853.1233742-2-rananta@google.com
|
|
kvm_cpu_has_pending_timer() ends up checking all the possible
timers for a wake-up cause. However, we already check for
pending interrupts whenever we try to wake-up a vcpu, including
the timer interrupts.
Obviously, doing the same work twice is once too many. Reduce
this helper to almost nothing, but keep it around, as we are
going to make use of it soon.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220419182755.601427-4-maz@kernel.org
|
|
* kvm-arm64/psci-1.1:
: .
: Limited PSCI-1.1 support from Will Deacon:
:
: This small series exposes the PSCI SYSTEM_RESET2 call to guests, which
: allows the propagation of a "reset_type" and a "cookie" back to the VMM.
: Although Linux guests only ever pass 0 for the type ("SYSTEM_WARM_RESET"),
: the vendor-defined range can be used by a bootloader to provide additional
: information about the reset, such as an error code.
: .
KVM: arm64: Remove unneeded semicolons
KVM: arm64: Indicate SYSTEM_RESET2 in kvm_run::system_event flags field
KVM: arm64: Expose PSCI SYSTEM_RESET2 call to the guest
KVM: arm64: Bump guest PSCI version to 1.1
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Expose PSCI version v1.1 to the guest by default. The only difference
for now is that an updated version number is reported by PSCI_VERSION.
Cc: Marc Zyngier <maz@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220221153524.15397-2-will@kernel.org
|
|
* kvm-arm64/pmu-bl:
: .
: Improve PMU support on heterogeneous systems, courtesy of Alexandru Elisei
: .
KVM: arm64: Refuse to run VCPU if the PMU doesn't match the physical CPU
KVM: arm64: Add KVM_ARM_VCPU_PMU_V3_SET_PMU attribute
KVM: arm64: Keep a list of probed PMUs
KVM: arm64: Keep a per-VM pointer to the default PMU
perf: Fix wrong name in comment for struct perf_cpu_context
KVM: arm64: Do not change the PMU event filter after a VCPU has run
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The ARM PMU driver calls kvm_host_pmu_init() after probing to tell KVM that
a hardware PMU is available for guest emulation. Heterogeneous systems can
have more than one PMU present, and the callback gets called multiple
times, once for each of them. Keep track of all the PMUs available to KVM,
as they're going to be needed later.
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220127161759.53553-5-alexandru.elisei@arm.com
|
|
kvm_psci_version() consumes a pointer to struct kvm in addition to a
vcpu pointer. Drop the kvm pointer as it is unused. While the comment
suggests the explicit kvm pointer was useful for calling from hyp, there
exist no such callsite in hyp.
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220208012705.640444-1-oupton@google.com
|
|
Pull kvm updates from Paolo Bonzini:
"RISCV:
- Use common KVM implementation of MMU memory caches
- SBI v0.2 support for Guest
- Initial KVM selftests support
- Fix to avoid spurious virtual interrupts after clearing hideleg CSR
- Update email address for Anup and Atish
ARM:
- Simplification of the 'vcpu first run' by integrating it into KVM's
'pid change' flow
- Refactoring of the FP and SVE state tracking, also leading to a
simpler state and less shared data between EL1 and EL2 in the nVHE
case
- Tidy up the header file usage for the nvhe hyp object
- New HYP unsharing mechanism, finally allowing pages to be unmapped
from the Stage-1 EL2 page-tables
- Various pKVM cleanups around refcounting and sharing
- A couple of vgic fixes for bugs that would trigger once the vcpu
xarray rework is merged, but not sooner
- Add minimal support for ARMv8.7's PMU extension
- Rework kvm_pgtable initialisation ahead of the NV work
- New selftest for IRQ injection
- Teach selftests about the lack of default IPA space and page sizes
- Expand sysreg selftest to deal with Pointer Authentication
- The usual bunch of cleanups and doc update
s390:
- fix sigp sense/start/stop/inconsistency
- cleanups
x86:
- Clean up some function prototypes more
- improved gfn_to_pfn_cache with proper invalidation, used by Xen
emulation
- add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery
- completely remove potential TOC/TOU races in nested SVM consistency
checks
- update some PMCs on emulated instructions
- Intel AMX support (joint work between Thomas and Intel)
- large MMU cleanups
- module parameter to disable PMU virtualization
- cleanup register cache
- first part of halt handling cleanups
- Hyper-V enlightened MSR bitmap support for nested hypervisors
Generic:
- clean up Makefiles
- introduce CONFIG_HAVE_KVM_DIRTY_RING
- optimize memslot lookup using a tree
- optimize vCPU array usage by converting to xarray"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (268 commits)
x86/fpu: Fix inline prefix warnings
selftest: kvm: Add amx selftest
selftest: kvm: Move struct kvm_x86_state to header
selftest: kvm: Reorder vcpu_load_state steps for AMX
kvm: x86: Disable interception for IA32_XFD on demand
x86/fpu: Provide fpu_sync_guest_vmexit_xfd_state()
kvm: selftests: Add support for KVM_CAP_XSAVE2
kvm: x86: Add support for getting/setting expanded xstate buffer
x86/fpu: Add uabi_size to guest_fpu
kvm: x86: Add CPUID support for Intel AMX
kvm: x86: Add XCR0 support for Intel AMX
kvm: x86: Disable RDMSR interception of IA32_XFD_ERR
kvm: x86: Emulate IA32_XFD_ERR for guest
kvm: x86: Intercept #NM for saving IA32_XFD_ERR
x86/fpu: Prepare xfd_err in struct fpu_guest
kvm: x86: Add emulation for IA32_XFD
x86/fpu: Provide fpu_update_guest_xfd() for IA32_XFD emulation
kvm: x86: Enable dynamic xfeatures at KVM_SET_CPUID2
x86/fpu: Provide fpu_enable_guest_xfd_features() for KVM
x86/fpu: Add guest support to xfd_enable_feature()
...
|
|
arm_vgic.h does not require all the stuff that kernel.h provides.
Replace kernel.h inclusion with the list of what is really being used.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220104151940.55399-1-andriy.shevchenko@linux.intel.com
|
|
Move the definition of kvm_arm_pmu_available to pmu-emul.c and, out of
"necessity", hide it behind CONFIG_HW_PERF_EVENTS. Provide a stub for
the key's wrapper, kvm_arm_support_pmu_v3(). Moving the key's definition
out of perf.c will allow a future commit to delete perf.c entirely.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20211111020738.2512932-16-seanjc@google.com
|
|
Russell reported that since 5.13, KVM's probing of the PMU has
started to fail on his HW. As it turns out, there is an implicit
ordering dependency between the architectural PMU probing code and
and KVM's own probing. If, due to probe ordering reasons, KVM probes
before the PMU driver, it will fail to detect the PMU and prevent it
from being advertised to guests as well as the VMM.
Obviously, this is one probing too many, and we should be able to
deal with any ordering.
Add a callback from the PMU code into KVM to advertise the registration
of a host CPU PMU, allowing for any probing order.
Fixes: 5421db1be3b1 ("KVM: arm64: Divorce the perf code from oprofile helpers")
Reported-by: "Russell King (Oracle)" <linux@armlinux.org.uk>
Tested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/YUYRKVflRtUytzy5@shell.armlinux.org.uk
Cc: stable@vger.kernel.org
|
|
In order to deal with these systems that do not offer HW-based
deactivation of interrupts, let implement a SW-based approach:
- When the irq is queued into a LR, treat it as a pure virtual
interrupt and set the EOI flag in the LR.
- When the interrupt state is read back from the LR, force a
deactivation when the state is invalid (neither active nor
pending)
Interrupts requiring such treatment get the VGIC_SW_RESAMPLE flag.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
We already have the option to attach a callback to an interrupt
to retrieve its pending state. As we are planning to expand this
facility, move this callback into its own data structure.
This will limit the size of individual interrupts as the ops
structures can be shared across multiple interrupts.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The vGIC, as architected by ARM, allows a virtual interrupt to
trigger the deactivation of a physical interrupt. This allows
the following interrupt to be delivered without requiring an exit.
However, some implementations have choosen not to implement this,
meaning that we will need some unsavoury workarounds to deal with this.
On detecting such a case, taint the kernel and spit a nastygram.
We'll deal with this in later patches.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
KVM/arm64 is the sole user of perf_num_counters(), and really
could do without it. Stop using the obsolete API by relying on
the existing probing code.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210414134409.1266357-2-maz@kernel.org
|
|
Commit 23bde34771f1 ("KVM: arm64: vgic-v3: Drop the
reporting of GICR_TYPER.Last for userspace") temporarily fixed
a bug identified when attempting to access the GICR_TYPER
register before the redistributor region setting, but dropped
the support of the LAST bit.
Emulating the GICR_TYPER.Last bit still makes sense for
architecture compliance though. This patch restores its support
(if the redistributor region was set) while keeping the code safe.
We introduce a new helper, vgic_mmio_vcpu_rdist_is_last() which
computes whether a redistributor is the highest one of a series
of redistributor contributor pages.
With this new implementation we do not need to have a uaccess
read accessor anymore.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210405163941.510258-9-eric.auger@redhat.com
|
|
We currently find out about the presence of a HW PMU (or the handling
of that PMU by perf, which amounts to the same thing) in a fairly
roundabout way, by checking the number of counters available to perf.
That's good enough for now, but we will soon need to find about about
that on paths where perf is out of reach (in the world switch).
Instead, let's turn kvm_arm_support_pmu_v3() into a static key.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210209114844.3278746-2-maz@kernel.org
Message-Id: <20210305185254.3730990-5-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM_ARM_PMU only existed for the benefit of 32bit ARM hosts,
and makes no sense now that we are 64bit only. Get rid of it.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
kvmarm-master/queue
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
In order to reduce the impact of the VPT parsing happening on the GIC,
we can split the vcpu reseidency in two phases:
- programming GICR_VPENDBASER: this still happens in vcpu_load()
- checking for the VPT parsing to be complete: this can happen
on vcpu entry (in kvm_vgic_flush_hwstate())
This allows the GIC and the CPU to work in parallel, rewmoving some
of the entry overhead.
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Shenming Lu <lushenming@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20201128141857.983-3-lushenming@huawei.com
|
|
The PMU ready state has no user left. Goodbye.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Pull KVM updates from Paolo Bonzini:
"For x86, there is a new alternative and (in the future) more scalable
implementation of extended page tables that does not need a reverse
map from guest physical addresses to host physical addresses.
For now it is disabled by default because it is still lacking a few of
the existing MMU's bells and whistles. However it is a very solid
piece of work and it is already available for people to hammer on it.
Other updates:
ARM:
- New page table code for both hypervisor and guest stage-2
- Introduction of a new EL2-private host context
- Allow EL2 to have its own private per-CPU variables
- Support of PMU event filtering
- Complete rework of the Spectre mitigation
PPC:
- Fix for running nested guests with in-kernel IRQ chip
- Fix race condition causing occasional host hard lockup
- Minor cleanups and bugfixes
x86:
- allow trapping unknown MSRs to userspace
- allow userspace to force #GP on specific MSRs
- INVPCID support on AMD
- nested AMD cleanup, on demand allocation of nested SVM state
- hide PV MSRs and hypercalls for features not enabled in CPUID
- new test for MSR_IA32_TSC writes from host and guest
- cleanups: MMU, CPUID, shared MSRs
- LAPIC latency optimizations ad bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (232 commits)
kvm: x86/mmu: NX largepage recovery for TDP MMU
kvm: x86/mmu: Don't clear write flooding count for direct roots
kvm: x86/mmu: Support MMIO in the TDP MMU
kvm: x86/mmu: Support write protection for nesting in tdp MMU
kvm: x86/mmu: Support disabling dirty logging for the tdp MMU
kvm: x86/mmu: Support dirty logging for the TDP MMU
kvm: x86/mmu: Support changed pte notifier in tdp MMU
kvm: x86/mmu: Add access tracking for tdp_mmu
kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU
kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
kvm: x86/mmu: Add TDP MMU PF handler
kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
kvm: x86/mmu: Support zapping SPTEs in the TDP MMU
KVM: Cache as_id in kvm_memory_slot
kvm: x86/mmu: Add functions to handle changed TDP SPTEs
kvm: x86/mmu: Allocate and free TDP MMU roots
kvm: x86/mmu: Init / Uninit the TDP MMU
kvm: x86/mmu: Introduce tdp_iter
KVM: mmu: extract spte.h and spte.c
KVM: mmu: Separate updating a PTE from kvm_set_pte_rmapp
...
|
|
As we can now hide events from the guest, let's also adjust its view of
PCMEID{0,1}_EL1 so that it can figure out why some common events are not
counting as they should.
The astute user can still look into the TRM for their CPU and find out
they've been cheated, though. Nobody's perfect.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
kvm_vcpu_kick() is not NMI safe. When the overflow handler is called from
NMI context, defer waking the vcpu to an irq_work queue.
A vcpu can be freed while it's not running by kvm_destroy_vm(). Prevent
running the irq_work for a non-existent vcpu by calling irq_work_sync() on
the PMU destroy path.
[Alexandru E.: Added irq_work_sync()]
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Tested-by: Sumit Garg <sumit.garg@linaro.org> (Developerbox)
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Pouloze <suzuki.poulose@arm.com>
Cc: kvm@vger.kernel.org
Cc: kvmarm@lists.cs.columbia.edu
Link: https://lore.kernel.org/r/20200924110706.254996-6-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Move the timer gsisters to the sysreg file. This will further help when
they are directly changed by a nesting hypervisor in the VNCR page.
This requires moving the initialisation of the timer struct so that some
of the helpers (such as arch_timer_ctx_index) can work correctly at an
early stage.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
kvm_timer_sync_hwstate() has nothing to do with the timer HW state,
but more to do with the state of a userspace interrupt controller.
Change the suffix from _hwstate to_user, in keeping with the rest
of the code.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
If we move the used_lrs field to the version-specific cpu interface
structure, the following functions only operate on the struct
vgic_v3_cpu_if and not the full vcpu:
__vgic_v3_save_state
__vgic_v3_restore_state
__vgic_v3_activate_traps
__vgic_v3_deactivate_traps
__vgic_v3_save_aprs
__vgic_v3_restore_aprs
This is going to be very useful for nested virt, so move the used_lrs
field and change the prototypes and implementations of these functions to
take the cpu_if parameter directly.
No functional change.
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
In order to let a guest buy in the new, active-less SGIs, we
need to be able to switch between the two modes.
Handle this by stopping all guest activity, transfer the state
from one mode to the other, and resume the guest. Nothing calls
this code so far, but a later patch will plug it into the MMIO
emulation.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20200304203330.4967-20-maz@kernel.org
|
|
In order to hide some of the differences between v4.0 and v4.1, move
the doorbell management out of the KVM code, and into the GICv4-specific
layer. This allows the calling code to ask for the doorbell when blocking,
and otherwise to leave the doorbell permanently disabled.
This matches the v4.1 code perfectly, and only results in a minor
refactoring of the v4.0 code.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20200304203330.4967-14-maz@kernel.org
|
|
|
|
Fix various comments, including wrong function names, grammar mistakes
and specification references.
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20191029071919.177-3-yuzenghui@huawei.com
|
|
The callsite of kvm_send_userspace_msi() is currently arch agnostic.
There seems no reason to keep an extra declaration of it in arm_vgic.h
(we already have one in include/linux/kvm_host.h).
Remove it.
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20191029071919.177-2-yuzenghui@huawei.com
|
|
When the VHE code was reworked, a lot of the vgic stuff was moved around,
but the GICv4 residency code did stay untouched, meaning that we come
in and out of residency on each flush/sync, which is obviously suboptimal.
To address this, let's move things around a bit:
- Residency entry (flush) moves to vcpu_load
- Residency exit (sync) moves to vcpu_put
- On blocking (entry to WFI), we "put"
- On unblocking (exit from WFI), we "load"
Because these can nest (load/block/put/load/unblock/put, for example),
we now have per-VPE tracking of the residency state.
Additionally, vgic_v4_put gains a "need doorbell" parameter, which only
gets set to true when blocking because of a WFI. This allows a finer
control of the doorbell, which now also gets disabled as soon as
it gets signaled.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20191027144234.8395-2-maz@kernel.org
|
|
We currently intertwine the KVM PSCI implementation with the general
dispatch of hypercall handling, which makes perfect sense because PSCI
is the only category of hypercalls we support.
However, as we are about to support additional hypercalls, factor out
this functionality into a separate hypercall handler file.
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
[steven.price@arm.com: rebased]
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
At the moment we use 2 IO devices per GICv3 redistributor: one
one for the RD_base frame and one for the SGI_base frame.
Instead we can use a single IO device per redistributor (the 2
frames are contiguous). This saves slots on the KVM_MMIO_BUS
which is currently limited to NR_IOBUS_DEVS (1000).
This change allows to instantiate up to 512 redistributors and may
speed the guest boot with a large number of VCPUs.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Add the basic data structure that expresses an MSI to LPI
translation as well as the allocation/release hooks.
The size of the cache is arbitrarily defined as 16*nr_vcpus.
Tested-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Since commit commit 328e56647944 ("KVM: arm/arm64: vgic: Defer
touching GICH_VMCR to vcpu_load/put"), we leave ICH_VMCR_EL2 (or
its GICv2 equivalent) loaded as long as we can, only syncing it
back when we're scheduled out.
There is a small snag with that though: kvm_vgic_vcpu_pending_irq(),
which is indirectly called from kvm_vcpu_check_block(), needs to
evaluate the guest's view of ICC_PMR_EL1. At the point were we
call kvm_vcpu_check_block(), the vcpu is still loaded, and whatever
changes to PMR is not visible in memory until we do a vcpu_put().
Things go really south if the guest does the following:
mov x0, #0 // or any small value masking interrupts
msr ICC_PMR_EL1, x0
[vcpu preempted, then rescheduled, VMCR sampled]
mov x0, #ff // allow all interrupts
msr ICC_PMR_EL1, x0
wfi // traps to EL2, so samping of VMCR
[interrupt arrives just after WFI]
Here, the hypervisor's view of PMR is zero, while the guest has enabled
its interrupts. kvm_vgic_vcpu_pending_irq() will then say that no
interrupts are pending (despite an interrupt being received) and we'll
block for no reason. If the guest doesn't have a periodic interrupt
firing once it has blocked, it will stay there forever.
To avoid this unfortuante situation, let's resync VMCR from
kvm_arch_vcpu_blocking(), ensuring that a following kvm_vcpu_check_block()
will observe the latest value of PMR.
This has been found by booting an arm64 Linux guest with the pseudo NMI
feature, and thus using interrupt priorities to mask interrupts instead
of the usual PSTATE masking.
Cc: stable@vger.kernel.org # 4.12
Fixes: 328e56647944 ("KVM: arm/arm64: vgic: Defer touching GICH_VMCR to vcpu_load/put")
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
We use "pmc->idx" and the "chained" bitmap to determine if the pmc is
chained, in kvm_pmu_pmc_is_chained(). But idx might be uninitialized
(and random) when we doing this decision, through a KVM_ARM_VCPU_INIT
ioctl -> kvm_pmu_vcpu_reset(). And the test_bit() against this random
idx will potentially hit a KASAN BUG [1].
In general, idx is the static property of a PMU counter that is not
expected to be modified across resets, as suggested by Julien. It
looks more reasonable if we can setup the PMU counter idx for a vcpu
in its creation time. Introduce a new function - kvm_pmu_vcpu_init()
for this basic setup. Oh, and the KASAN BUG will get fixed this way.
[1] https://www.spinics.net/lists/kvm-arm/msg36700.html
Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
Suggested-by: Andrew Murray <andrew.murray@arm.com>
Suggested-by: Julien Thierry <julien.thierry@arm.com>
Acked-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
ARMv8 provides support for chained PMU counters, where an event type
of 0x001E is set for odd-numbered counters, the event counter will
increment by one for each overflow of the preceding even-numbered
counter. Let's emulate this in KVM by creating a 64 bit perf counter
when a user chains two emulated counters together.
For chained events we only support generating an overflow interrupt
on the high counter. We use the attributes of the low counter to
determine the attributes of the perf event.
Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
We currently use pmc->bitmask to determine the width of the pmc - however
it's superfluous as the pmc index already describes if the pmc is a cycle
counter or event counter. The architecture clearly describes the widths of
these counters.
Let's remove the bitmask to simplify the code.
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
The kvm_pmu_{enable/disable}_counter functions can enable/disable
multiple counters at once as they operate on a bitmask. Let's
make this clearer by renaming the function.
Suggested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not see http www gnu org
licenses
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 503 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Enrico Weigelt <info@metux.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204653.811534538@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license this
program is distributed in the hope that it will be useful but
without any warranty without even the implied warranty of
merchantability or fitness for a particular purpose see the gnu
general public license for more details you should have received a
copy of the gnu general public license along with this program if
not see http www gnu org licenses
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 15 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000437.237481593@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not write to the free
software foundation inc 59 temple place suite 330 boston ma 02111
1307 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 136 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000436.384967451@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Pull KVM updates from Paolo Bonzini:
"ARM:
- some cleanups
- direct physical timer assignment
- cache sanitization for 32-bit guests
s390:
- interrupt cleanup
- introduction of the Guest Information Block
- preparation for processor subfunctions in cpu models
PPC:
- bug fixes and improvements, especially related to machine checks
and protection keys
x86:
- many, many cleanups, including removing a bunch of MMU code for
unnecessary optimizations
- AVIC fixes
Generic:
- memcg accounting"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (147 commits)
kvm: vmx: fix formatting of a comment
KVM: doc: Document the life cycle of a VM and its resources
MAINTAINERS: Add KVM selftests to existing KVM entry
Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()"
KVM: PPC: Book3S: Add count cache flush parameters to kvmppc_get_cpu_char()
KVM: PPC: Fix compilation when KVM is not enabled
KVM: Minor cleanups for kvm_main.c
KVM: s390: add debug logging for cpu model subfunctions
KVM: s390: implement subfunction processor calls
arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
KVM: arm/arm64: Remove unused timer variable
KVM: PPC: Book3S: Improve KVM reference counting
KVM: PPC: Book3S HV: Fix build failure without IOMMU support
Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()"
x86: kvmguest: use TSC clocksource if invariant TSC is exposed
KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start
KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter
KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns
KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()
KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
...
|
|
We are currently emulating two timers in two different ways. When we
add support for nested virtualization in the future, we are going to be
emulating either two timers in two diffferent ways, or four timers in a
single way.
We need a unified data structure to keep track of how we map virtual
state to physical state and we need to cleanup some of the timer code to
operate more independently on a struct arch_timer_context instead of
trying to consider the global state of the VCPU and recomputing all
state.
Co-written with Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
|
|
VHE systems don't have to emulate the physical timer, we can simply
assign the EL1 physical timer directly to the VM as the host always
uses the EL2 timers.
In order to minimize the amount of cruft, AArch32 gets definitions for
the physical timer too, but is should be generally unused on this
architecture.
Co-written with Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
|
|
Prepare for having 4 timer data structures (2 for now).
Move loaded to the cpu data structure and not the individual timer
structure, in preparation for assigning the EL1 phys timer as well.
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
At the moment we have separate system register emulation handlers for
each timer register. Actually they are quite similar, and we rely on
kvm_arm_timer_[gs]et_reg() for the actual emulation anyways, so let's
just merge all of those handlers into one function, which just marshalls
the arguments and then hands off to a set of common accessors.
This makes extending the emulation to include EL2 timers much easier.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
[Fixed 32-bit VM breakage and reduced to reworking existing code]
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
[Fixed 32bit host, general cleanup]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Instead of calling into kvm_timer_[un]schedule from the main kvm
blocking path, test if the VCPU is on the wait queue from the load/put
path and perform the background timer setup/cancel in this path.
This has the distinct advantage that we no longer race between load/put
and schedule/unschedule and programming and canceling of the bg_timer
always happens when the timer state is not loaded.
Note that we must now remove the checks in kvm_timer_blocking that do
not schedule a background timer if one of the timers can fire, because
we no longer have a guarantee that kvm_vcpu_check_block() will be called
before kvm_timer_blocking.
Reported-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
vgic_cpu->ap_list_lock must always be taken with interrupts disabled as
it is used in interrupt context.
For configurations such as PREEMPT_RT_FULL, this means that it should
be a raw_spinlock since RT spinlocks are interruptible.
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
|
|
vgic_dist->lpi_list_lock must always be taken with interrupts disabled as
it is used in interrupt context.
For configurations such as PREEMPT_RT_FULL, this means that it should
be a raw_spinlock since RT spinlocks are interruptible.
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
|
|
vgic_irq->irq_lock must always be taken with interrupts disabled as
it is used in interrupt context.
For configurations such as PREEMPT_RT_FULL, this means that it should
be a raw_spinlock since RT spinlocks are interruptible.
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
|
|
The use of a work queue in the hrtimer expire function for the bg_timer
is a leftover from the time when we would inject interrupts when the
bg_timer expired.
Since we are no longer doing that, we can instead call
kvm_vcpu_wake_up() directly from the hrtimer function and remove all
workqueue functionality from the arch timer code.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Although vgic-v3 now supports Group0 interrupts, it still doesn't
deal with Group0 SGIs. As usually with the GIC, nothing is simple:
- ICC_SGI1R can signal SGIs of both groups, since GICD_CTLR.DS==1
with KVM (as per 8.1.10, Non-secure EL1 access)
- ICC_SGI0R can only generate Group0 SGIs
- ICC_ASGI1R sees its scope refocussed to generate only Group0
SGIs (as per the note at the bottom of Table 8-14)
We only support Group1 SGIs so far, so no material change.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Simply letting IGROUPR be writable from userspace would break
migration from old kernels to newer kernels, because old kernels
incorrectly report interrupt groups as group 1. This would not be a big
problem if userspace wrote GICD_IIDR as read from the kernel, because we
could detect the incompatibility and return an error to userspace.
Unfortunately, this is not the case with current userspace
implementations and simply letting IGROUPR be writable from userspace for
an emulated GICv2 silently breaks migration and causes the destination
VM to no longer run after migration.
We now encourage userspace to write the read and expected value of
GICD_IIDR as the first part of a GIC register restore, and if we observe
a write to GICD_IIDR we know that userspace has been updated and has had
a chance to cope with older kernels (VGICv2 IIDR.Revision == 0)
incorrectly reporting interrupts as group 1, and therefore we now allow
groups to be user writable.
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|