Commit Graph

1497 Commits (dcad73ca5ad3e1fe011c52a24036f67ad69fadc1)

Author SHA1 Message Date
Michael Brown dcad73ca5a [efi] Add support for driving EFI_MANAGED_NETWORK_PROTOCOL devices
We want exclusive access to the network device, both for performance
reasons and because we perform operations such as EAPoL that affect
the entire link.  We currently drive the network card via either a
native hardware driver or via the SNP or NII/UNDI interfaces, both of
which grant us this exclusive access.

Add an alternative driver that drives the network card non-exclusively
via the EFI_MANAGED_NETWORK_PROTOCOL interface.  This can function as
a fallback for situations where neither SNP nor NII/UNDI interfaces
are functional, and also opens up the possibility of non-destructively
installing a temporary network device over which to download the
autoexec.ipxe script.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2024-03-25 17:58:33 +00:00
Michael Brown a15ce00182 [efi] Match chainloaded device by uppermost matching handle
Commit 4c5b794 ("[efi] Use the SNP protocol instance to match the SNP
chainloading device") switched the chainloaded device matching logic
to use a target protocol instance rather than the loaded image's
device handle, on the basis that we want to bind to the parent SNP
device rather than to a duplicate SNP protocol instance installed onto
an IPv4 or IPv6 child device handle.

It is possible that our calls to DisconnectController() and
ConnectController() will cause the target protocol instance to be
uninstalled and reinstalled, which may change the value of the
protocol instance pointer.  Allow for this by identifying and matching
against the uppermost handle that initially has this target protocol
instance installed.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2024-03-25 17:58:33 +00:00
Michael Brown 926816c58f [efi] Pad transmit buffer length to work around vendor driver bugs
The Mellanox/Nvidia UEFI driver is built from the same codebase as the
iPXE driver, and appears to contain the bug that was fixed in commit
c11734e ("[golan] Use ETH_HLEN for inline header size").  This results
in identical failures when using the SNP or NII interface (via
e.g. snponly.efi) to drive a Mellanox card while EAPoL is enabled.

Work around the underlying UEFI driver bug by padding transmit I/O
buffers to the minimum Ethernet frame length before passing them to
the underlying driver's transmit function.

This padding is not technically necessary, since almost all modern
hardware will insert transmit padding as necessary (and where the
hardware does not support doing so, the underlying UEFI driver is
responsible for adding any necessary padding).  However, it is
guaranteed to be harmless (other than a miniscule performance impact):
the Ethernet specification requires zero padding up to the minimum
frame length for packets that are transmitted onto the wire, and so
the receiver will see the same packet whether or not we manually
insert this padding in software.

The additional padding causes the underlying Mellanox driver to avoid
its faulty code path, since it will never be asked to transmit a very
short packet.

Tested-by: Eric Hagberg <ehagberg@janestreet.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2024-03-18 22:52:05 +00:00
Rabia Manaa c11734eee0 [golan] Use ETH_HLEN for inline header size
The driver does not correctly handle very short transmitted packets
such as EAPoL-Start where the entire DMA content lies within the
current send work queue entry inline header length of 18 bytes.

Fix by reducing the inline header length to the Ethernet frame header
length of 14 bytes.

Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2024-03-17 22:55:32 +00:00
Michael Brown bac967d51a [snp] Allocate additional padding for receive buffers
Some SNP implementations (observed with a wifi adapter in a Dell
Latitude 3440 laptop) seem to require additional space in the
allocated receive buffers, otherwise full-length packets will be
silently dropped.

The EDK2 MnpDxe driver happens to allocate an additional 8 bytes of
padding (4 for a VLAN tag, 4 for the Ethernet frame checksum).  Match
this behaviour since drivers are very likely to have been tested
against MnpDxe.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2024-03-16 23:28:34 +00:00
Geert Stappers e5f3ba0ca7 [drivers] Sort PCI_ROM() entries numerically
Done with the help of this Perl script:

$MARKER = 'PCI_ROM';  # a regex
$AB = 1;  # At Begin
@HEAD = ();
@ITEMS = ();
@TAIL = ();

foreach $fn (@ARGV) {
    open(IN, $fn) or die "Can't open file '$fn': $!\n";
    while (<IN>) {
        if (/$MARKER/) {
            push @ITEMS, $_;
            $AB = 0;  # not anymore at begin
        }
        else {
            if ($AB) {
                push @HEAD, $_;
            }
            else {
                push @TAIL, $_;
            }
        }
    }
} continue {
    close IN;
    open(OUT, ">$fn") or die "Can't open file '$fn' for output: $!\n";
    print OUT @HEAD;
    print OUT sort @ITEMS;
    print OUT @TAIL;
    close OUT;
    # For a next file
    $AB = 1;
    @HEAD = ();
    @ITEMS = ();
    @TAIL = ();
}

Executed that script while src/drivers/ as current working directory,
provided '$(grep -rl PCI_ROM)' as argument.

Signed-off-by: Geert Stappers <stappers@stappers.it>
2024-02-22 14:19:04 +00:00
Joseph Wong a846c4ccfc [bnxt] Add support for BCM957608
Add support for BCM957608 device.  Add support for additional link
speeds supported by BCM957608.

Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>
2024-02-08 15:10:12 +00:00
Joseph Wong de8a0821c7 [bnxt] Add support for additional chip IDs
Add additional chip IDs that can be recognized as part of the thor
family.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2024-01-19 22:08:48 +00:00
Michael Brown 4b7d9a6af0 [libc] Replace linker_assert() with build_assert()
We currently implement build-time assertions via a mechanism that
generates a call to an undefined external function that will cause the
link to fail unless the compiler can prove that the asserted condition
is true (and thereby eliminate the undefined function call).

This assertion mechanism can be used for conditions that are not
amenable to the use of static_assert(), since static_assert() will not
allow for proofs via dead code elimination.

Add __attribute__((error(...))) to the undefined external function, so
that the error is raised at compile time rather than at link time.
This allows us to provide a more meaningful error message (which will
include the file name and line number, as with any other compile-time
error), and avoids the need for the caller to specify a unique symbol
name for the external function.

Change the name from linker_assert() to build_assert(), since the
assertion now takes place at compile time rather than at link time.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2024-01-16 13:35:08 +00:00
Michael Brown 6d29415c89 [libc] Make static_assert() available via assert.h
Expose static_assert() via assert.h and migrate link-time assertions
to build-time assertions where possible.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2024-01-16 13:35:08 +00:00
Christian Helmuth 119c415ee4 [intel] Add PCI ID for I219-LM (23)
Successfully tested on FUJITSU LIFEBOOK U7413.

Signed-off-by: Christian Helmuth <christian.helmuth@genode-labs.com>
2023-12-21 13:53:24 +01:00
Michael Brown 36e1a559a2 [pci] Check that ECAM configuration space is within reachable memory
Some machines (observed with an AWS EC2 m7a.large instance) will place
the ECAM configuration space window above 4GB, thereby making it
unreachable from non-paged 32-bit code.  This problem is currently
ignored by iPXE, since the address is silently truncated in the call
to ioremap().  (Note that other uses of ioremap() are not affected
since the PCI core will already have checked for unreachable 64-bit
BARs when retrieving the physical address to be mapped.)

Fix by adding an explicit check that the region to be mapped starts
within the reachable memory address space.  (Assume that no machines
will be sufficiently peverse to provide a region that straddles the
4GB boundary.)

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-11-02 15:38:08 +00:00
Michael Brown 1f3a37e342 [pci] Cache ECAM mapping errors
When an error occurs during ECAM configuration space mapping, preserve
the error within the existing cached mapping (instead of invalidating
the cached mapping) in order to avoid flooding the debug log with
repeated identical mapping errors.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-11-02 15:20:27 +00:00
Michael Brown 74ec00a9f3 [pci] Handle non-zero starting bus in ECAM allocations
The base address provided in the PCI ECAM allocation within the ACPI
MCFG table is the base address for the segment as a whole, not for the
starting bus within that allocation.  On machines that provide ECAM
allocations with a non-zero starting bus number (observed with an AWS
EC2 m7a.large instance), this will result in iPXE accessing the wrong
memory addresses within the ECAM region.

Fix by adding the appropriate starting bus offset to the base address.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-11-02 15:05:15 +00:00
Michael Brown f883203132 [pci] Force completion of ECAM configuration space writes
The PCIe specification requires that "processor and host bridge
implementations must ensure that a method exists for the software to
determine when the write using the ECAM is completed by the completer"
but does not specify any particular method to be used.  Some platforms
might treat writes to the ECAM region as non-posted, others might
require reading back from a dedicated (and implementation-specific)
completion register to determine when the configuration space write
has completed.

Since PCI configuration space writes will never be used for any
performance-critical datapath operations (on any sane hardware), a
simple and platform-independent solution is to always read back from
the written register in order to guarantee that the write must have
completed.  This is safe to do, since the PCIe specification defines a
limited set of configuration register types, none of which have read
side effects.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-11-01 22:32:21 +00:00
Michael Brown 115707c0ed [iphone] Add missing va_start()/va_end() around reused argument list
The ipair_tx() function uses a va_list twice (first to calculate the
formatted string length before allocation, then to construct the
string in the allocated buffer) but is missing the va_start() and
va_end() around the second usage.  This is undefined behaviour that
happens to work on some build platforms.

Fix by adding the missing va_start() and va_end() around the second
usage of the variadic argument list.

Reported-by: Andreas Hammarskjöld <andreas@2PintSoftware.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-10-24 11:43:56 +01:00
Michael Brown ae4e85bde9 [netdevice] Allocate private data for each network upper-layer driver
Allow network upper-layer drivers (such as LLDP, which attaches to
each network device in order to provide a corresponding LLDP settings
block) to specify a size for private data, which will be allocated as
part of the network device structure (as with the existing private
data allocated for the underlying device driver).

This will allow network upper-layer drivers to be simplified by
omitting memory allocation and freeing code.  If the upper-layer
driver requires a reference counter (e.g. for interface
initialisation), then it may use the network device's existing
reference counter, since this is now the reference counter for the
containing block of memory.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-09-13 20:23:46 +01:00
Michael Brown eeb7cd56e5 [netdevice] Remove netdev_priv() helper function
Some network device drivers use the trivial netdev_priv() helper
function while others use the netdev->priv pointer directly.

Standardise on direct use of netdev->priv, in order to free up the
function name netdev_priv() for reuse.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-09-13 16:29:48 +01:00
Alexander Eichner 9e99a55b31 [virtio] Fix implementation of vpm_ioread32()
The current implementation of vpm_ioread32() erroneously reads only 16
bits of data, which fails when used with the (stricter) virtio device
emulation in VirtualBox.

Fix by using the correct readl()/inl() I/O wrappers.

Reworded-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-08-22 13:45:44 +01:00
Michael Brown f3036fc213 [linux] Set a default MAC address for tap devices
Avoid the need to always specify a local MAC address on the command
line by setting a default hardware MAC address (using the same default
address as for slirp devices).

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-07-05 15:24:32 +01:00
Michael Brown 59d065c9ac [linux] Fix error control flow in af_packet_nic_probe()
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-07-05 15:17:58 +01:00
Michael Brown 48ae5d5361 [linux] Fix error control flow in tap_probe()
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-07-05 14:47:13 +01:00
Michael Brown 3ef4f7e2ef [console] Avoid overlap between special keys and Unicode characters
The special key range (from KEY_MIN upwards) currently overlaps with
the valid range for Unicode characters, and therefore prohibits the
use of Unicode key values outside the ASCII range.

Create space for Unicode key values by moving the special keys to the
range immediately above the maximum valid Unicode character.  This
allows the existing encoding of special keys as an efficiently packed
representation of the equivalent ANSI escape sequence to be maintained
almost as-is.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-07-04 14:33:43 +01:00
Michael Brown 2689a6e776 [efi] Always poll for TX completions
Polling for TX completions is arguably redundant when there are no
transmissions currently in progress.  Commit c6c7e78 ("[efi] Poll for
TX completions only when there is an outstanding TX buffer") switched
to setting the PXE_OPFLAGS_GET_TRANSMITTED_BUFFERS flag only when
there is an in-progress transmission awaiting completion, in order to
reduce reported TX errors and debug message noise from buggy NII
implementations that report spurious TX completions whenever the
transmit queue is empty.

Some other NII implementations (observed with the Realtek driver in a
Dell Latitude 3440) seem to have a bug in the transmit datapath
handling which results in the transmit ring freezing after sending a
few hundred packets under heavy load.  The symptoms are that the
TPPoll register's NPQ bit remains set and the 256-entry transmit ring
contains a large number of uncompleted descriptors (with the OWN bit
set), the first two of which have identical data buffer addresses.

Though iPXE will submit at most one in-progress transmission via NII,
the Dell/Realtek driver seems to make a page-aligned copy of each
transmit data buffer and to report TX completions immediately without
waiting for the packet to actually be transmitted.  These synthetic TX
completions continue even after the hardware transmit ring freezes.

Setting PXE_OPFLAGS_GET_TRANSMITTED_BUFFERS on every poll reduces the
probability of this Dell/Realtek driver bug being triggered by a
factor of around 500, which brings the failure rate down to the point
that it can sensibly be managed by external logic such as the
"--timeout" option for image downloads.  Closing and reopening the
interface (via "ifclose"/"ifopen") will clear the error condition and
allow transmissions to resume.

Revert to setting PXE_OPFLAGS_GET_TRANSMITTED_BUFFERS on every poll,
and silently ignore situations in which the hardware reports a
completion when no transmission is in progress.  This approximately
matches the behaviour of the SnpDxe driver, which will also generally
set PXE_OPFLAGS_GET_TRANSMITTED_BUFFERS on every poll.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-06-21 11:49:53 +01:00
Matt Parrella bf25e23d07 [intel] Add workaround for I210 reset hardware bugs
The Intel I210's packet buffer size registers reset only on power up,
not when a reset signal is asserted.  This can lead to the inability
to pass traffic in the event that the DMA TX Maximum Packet Size
(which does reset to its default value on reset) is bigger than the TX
Packet Buffer Size.

For example, an operating system may be using the time sensitive
networking features of the I210 and the registers may be programmed
correctly, but then a reset signal is asserted and iPXE on the next
boot will be unable to use the I210.

Mimic what Linux does and forcibly set the registers to their default
values.

Signed-off-by: Matt Parrella <parrella.matthew@gmail.com>
2023-03-14 14:44:32 +00:00
Forest Crossman 523788ccda [intelx] Add PCI IDs for Intel 82599 10GBASE-T NIC
Signed-off-by: Forest Crossman <cyrozap@gmail.com>
2023-03-05 18:22:18 -06:00
Michael Brown 2733c4763a [iscsi] Limit maximum transfer size to MaxBurstLength
We currently specify only the iSCSI default value for MaxBurstLength
and ignore any negotiated value, since our internal block device API
allows only for receiving directly into caller-allocated buffers and
so we have no intrinsic limit on burst length.

A conscientious target may however refuse to attempt a transfer that
we request for a number of blocks that would exceed the negotiated
maximum burst length.

Fix by recording the negotiated maximum burst length and using it to
limit the maximum number of blocks per transfer as reported by the
SCSI layer.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-02-16 13:27:25 +00:00
Michael Brown be8ecaf805 [eisa] Check for system board presence before probing for slots
EISA expansion slot I/O port addresses overlap space that may be
assigned to PCI devices, which can lead to register reads and writes
with unwanted side effects during EISA probing.

Reduce the chances of performing EISA probing on PCI devices by
probing EISA slot vendor and product ID registers only if the EISA
system board vendor ID register indicates that the motherboard
supports EISA.

Debugged-by: Václav Ovsík <vaclav.ovsik@gmail.com>
Tested-by: Václav Ovsík <vaclav.ovsik@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-02-10 23:34:59 +00:00
Michael Brown 4e456d9928 [efi] Do not attempt to drive PCI bridge devices
The "bridge" driver introduced in 3aa6b79 ("[pci] Add minimal PCI
bridge driver") is required only for BIOS builds using the ENA driver,
where experimentation shows that we cannot rely on the BIOS to fully
assign MMIO addresses.

Since the driver is a valid PCI driver, it will end up binding to all
PCI bridge devices even on a UEFI platform, where the firmware is
likely to have completed MMIO address assignment correctly.  This has
no impact on most systems since there is generally no UEFI driver for
PCI bridges: the enumeration of the whole PCI bus is handled by the
PciBusDxe driver bound to the root bridge.

Experimentation shows that at least one laptop will freeze at the
point that iPXE attempts to bind to the bridge device.  No deeper
investigation has been carried out to find the root cause.

Fix by causing efipci_supported() to return an error unless the
configuration space header type indicates a non-bridge device.

Reported-by: Marcel Petersen <mp@sbe.de>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-02-03 16:10:31 +00:00
Michael Brown b6304f2984 [realtek] Explicitly disable VLAN offload
Some cards seem to have the receive VLAN tag stripping feature enabled
by default, which causes received VLAN packets to be misinterpreted as
being received by the trunk device.

Fix by disabling VLAN tag stripping in the C+ Command Register.

Debugged-by: Xinming Lai <yiyihu@gmail.com>
Tested-by: Xinming Lai <yiyihu@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-02-01 19:09:30 +00:00
Mohammed Taha c5426cdaa9 [golan] Add new PCI ID for NVIDIA BlueField-3 network device
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-01-23 22:52:30 +00:00
Michael Brown 68734b9a4d [efi] Bind to only the topmost instance of the SNP or NII protocols
UEFI has the mildly annoying habit of installing copies of the
EFI_SIMPLE_NETWORK_PROTOCOL instance on the IPv4 and IPv6 child device
handles.  This can cause iPXE's SNP driver to attempt to bind to a
copy of the EFI_SIMPLE_NETWORK_PROTOCOL that iPXE itself provided on a
different handle.

Fix by refusing to bind to an SNP (or NII) handle if there exists
another instance of the same protocol further up the device path (on
the basis that we always want to bind to the highest possible device).

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-01-23 19:27:13 +00:00
Michael Brown 2fef0c541e [efi] Extend efi_locate_device() to allow searching up the device path
Extend the functionality of efi_locate_device() to allow callers to
find instances of the protocol that may exist further up the device
path.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-01-23 19:27:13 +00:00
Alexander Graf 6b977d1250 [ena] Allocate an unused Asynchronous Event Notification Queue (AENQ)
We currently don't allocate an Asynchronous Event Notification Queue
(AENQ) because we don't actually care about any of the events that may
come in.

The ENA firmware found on Graviton instances requires the AENQ to
exist, otherwise all admin queue commands will fail.

Fix by allocating an AENQ and disabling all events (so that we do not
need to include code to acknowledge any events that may arrive).

Signed-off-by: Alexander Graf <graf@amazon.com>
2023-01-18 22:47:58 +00:00
Michael Brown c4c03e5be8 [netdevice] Allow duplicate MAC addresses
Many laptops now include the ability to specify a "system-specific MAC
address" (also known as "pass-through MAC"), which is supposed to be
used for both the onboard NIC and for any attached docking station or
other USB NIC.  This is intended to simplify interoperability with
software or hardware that relies on a MAC address to recognise an
individual machine: for example, a deployment server may associate the
MAC address with a particular operating system image to be deployed.
This therefore creates legitimate situations in which duplicate MAC
addresses may exist within the same system.

As described in commit 98d09a1 ("[netdevice] Avoid registering
duplicate network devices"), the Xen netfront driver relies on the
rejection of duplicate MAC addresses in order to inhibit registration
of the emulated PCI devices that a Xen PV-HVM guest will create to
shadow each of the paravirtual network devices.

Move the code that rejects duplicate MAC addresses from the network
device core to the Xen netfront driver, to allow for the existence of
duplicate MAC addresses in non-Xen setups.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-01-15 00:42:52 +00:00
Michael Brown ab19546386 [efi] Disable receive filters to work around buggy UNDI drivers
Some UNDI drivers (such as the AMI UsbNetworkPkg currently in the
process of being upstreamed into EDK2) have a bug that will prevent
any packets from being received unless at least one attempt has been
made to disable some receive filters.

Work around these buggy drivers by attempting to disable receive
filters before enabling them.  Ignore any errors, since we genuinely
do not care whether or not the disabling succeeds.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2023-01-11 00:18:18 +00:00
Christian I. Nilsson 563bff4722 [intel] Add PCI ID for I219-V and -LM 16,17
Signed-off-by: Christian I. Nilsson <nikize@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-11-15 13:05:28 +00:00
Michael Brown 2ae5355321 [pci] Backup and restore standard config space across PCIe FLR
The behaviour of PCI devices across a function-level reset seems to be
inconsistent in practice: some devices will preserve PCI BARs, some
will not.

Fix the behaviour of FLR on devices that do not preserve PCI BARs by
backing up and restoring PCI configuration space across the reset.
Preserve only the standard portion of the configuration space, since
there may be registers with unexpected side effects in the remaining
non-standardised space.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-11-13 21:38:41 +00:00
Michael Brown ca2be7e094 [pci] Allow PCI config space backup to be limited by maximum offset
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-11-13 20:42:09 +00:00
Michael Brown 081b3eefc4 [ena] Assign memory BAR if left empty by BIOS
Some BIOSes in AWS EC2 (observed with a c6i.metal instance in
eu-west-2) will fail to assign an MMIO address to the ENA device,
which causes ioremap() to fail.

Experiments show that the ENA device is the only device behind its
bridge, even when multiple ENA devices are present, and that the BIOS
does assign a memory window to the bridge.

We may therefore choose to assign the device an MMIO address at the
start of the bridge's memory window.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-09-19 17:49:25 +01:00
Michael Brown 3aa6b79c8d [pci] Add minimal PCI bridge driver
Add a minimal driver for PCI bridges that can be used to locate the
bridge to which a PCI device is attached.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-09-19 17:47:57 +01:00
Michael Brown 649176cd60 [pci] Select PCI I/O API at runtime for cloud images
Pretty much all physical machines and off-the-shelf virtual machines
will provide a functional PCI BIOS.  We therefore default to using
only the PCI BIOS, with no fallback to an alternative mechanism if the
PCI BIOS fails.

AWS EC2 provides the opportunity to experience some exceptions to this
rule.  For example, the t3a.nano instances in eu-west-1 have no
functional PCI BIOS at all.  As of commit 83516ba ("[cloud] Use
PCIAPI_DIRECT for cloud images") we therefore use direct Type 1
configuration space accesses in the images built and published for use
in the cloud.

Recent experience has discovered yet more variation in AWS EC2
instances.  For example, some of the metal instance types have
multiple PCI host bridges and the direct Type 1 accesses therefore
see only a subset of the PCI devices.

Attempt to accommodate future such variations by making the PCI I/O
API selectable at runtime and choosing ECAM (if available), falling
back to the PCI BIOS (if available), then finally falling back to
direct Type 1 accesses.

This is implemented as a dedicated PCIAPI_CLOUD API, rather than by
having the PCI core select a suitable API at runtime (as was done for
timers in commit 302f1ee ("[time] Allow timer to be selected at
runtime").  The common case will remain that only the PCI BIOS API is
required, and we would prefer to retain the optimisations that come
from inlining the configuration space accesses in this common case.
Cloud images are (at present) disk images rather than ROM images, and
so the increased code size required for this design approach in the
PCIAPI_CLOUD case is acceptable.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-09-18 13:41:21 +01:00
Michael Brown be667ba948 [pci] Add support for the Enhanced Configuration Access Mechanism (ECAM)
The ACPI MCFG table describes a direct mapping of PCI configuration
space into MMIO space.  This mapping allows access to extended
configuration space (up to 4096 bytes) and also provides for the
existence of multiple host bridges.

Add support for the ECAM mechanism described by the ACPI MCFG table,
as a selectable PCI I/O API alongside the existing PCI BIOS and Type 1
mechanisms.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-09-16 01:05:47 +01:00
Michael Brown ff228f745c [pci] Generalise pci_num_bus() to pci_discover()
Allow pci_find_next() to discover devices beyond the first PCI
segment, by generalising pci_num_bus() (which implicitly assumes that
there is only a single PCI segment) with pci_discover() (which has the
ability to return an arbitrary contiguous chunk of PCI bus:dev.fn
address space).

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-09-15 16:49:47 +01:00
Michael Brown 56b30364c5 [pci] Check for wraparound in callers of pci_find_next()
The semantics of the bus:dev.fn parameter passed to pci_find_next()
are "find the first existent PCI device at this address or higher",
with the caller expected to increment the address between finding
devices.  This does not allow the parameter to distinguish between the
two cases "start from address zero" and "wrapped after incrementing
maximal possible address", which could therefore lead to an infinite
loop in the degenerate case that a device with address ffff:ff:1f.7
really exists.

Fix by checking for wraparound in the caller (which is already
responsible for performing the increment).

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-09-15 15:20:58 +01:00
Michael Brown 8fc3c26eae [pci] Allow pci_find_next() to return non-zero PCI segments
Separate the return status code from the returned PCI bus:dev.fn
address, in order to allow pci_find_next() to be used to find devices
with a non-zero PCI segment number.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-09-15 15:20:58 +01:00
Michael Brown a80124456e [ena] Increase receive ring size to 128 entries
Some versions of the ENA hardware (observed on a c6i.large instance in
eu-west-2) seem to require a receive ring containing at least 128
entries: any smaller ring will never see receive completions or will
stall after the first few completions.

Increase the receive ring size to 128 entries (determined empirically)
for compatibility with these hardware versions.  Limit the receive
ring fill level to 16 (as at present) to avoid consuming more memory
than will typically be available in the internal heap.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-08-26 19:38:27 +01:00
Michael Brown 3b81a4e256 [ena] Provide a host information page
Some versions of the ENA firmware (observed on a c6i.large instance in
eu-west-2) seem to require a host information page, without which the
CREATE_CQ command will fail with ENA_ADMIN_UNKNOWN_ERROR.

These firmware versions also seem to require us to claim that we are a
Linux kernel with a specific driver major version number.  This
appears to be a firmware bug, as revealed by Linux kernel commit
1a63443af ("net/amazon: Ensure that driver version is aligned to the
linux kernel"): this commit changed the value of the driver version
number field to be the Linux kernel version, and was hastily reverted
in commit 92040c6da ("net: ena: fix broken interface between ENA
driver and FW") which clarified that the version number field does
actually have some undocumented significance to some versions of the
firmware.

Fix by providing a host information page via the SET_FEATURE command,
incorporating the apparently necessary lies about our identity.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-08-26 19:38:27 +01:00
Michael Brown 9f81e97af5 [ena] Specify the unused completion queue MSI-X vector as 0xffffffff
Some versions of the ENA firmware (observed on a c6i.large instance in
eu-west-2) will complain if the completion queue's MSI-X vector field
is left empty, even though the queue configuration specifies that
interrupts are not used.

Work around these firmware versions by passing in what appears to be
the magic "no MSI-X vector" value in this field.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-08-26 19:38:27 +01:00
Michael Brown 6d2cead461 [ena] Allow for out-of-order completions
The ENA data path design has separate submission and completion
queues.  Submission queues must be refilled in strict order (since
there is only a single linear tail pointer used to communicate the
existence of new entries to the hardware), and completion queue
entries include a request identifier copied verbatim from the
submission queue entry.  Once the submission queue doorbell has been
rung, software never again reads from the submission queue entry and
nothing ever needs to write back to the submission queue entry since
completions are reported via the separate completion queue.

This design allows the hardware to complete submission queue entries
out of order, provided that it internally caches at least as many
entries as it leaves gaps.

Record and identify I/O buffers by request identifier (using a
circular ring buffer of unique request identifiers), and remove the
assumption that submission queue entries will be completed in order.

Signed-off-by: Michael Brown <mcb30@ipxe.org>
2022-08-26 19:38:25 +01:00