TLS connections will almost always create background connections to
perform cross-signed certificate downloads and OCSP checks. There is
currently no direct visibility into which checks are taking place,
which makes troubleshooting difficult in the absence of either a
packet capture or a debug build.
Use the job progress message buffer to report the current cross-signed
certificate download or OCSP status check, where applicable.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Record the session ID (if any) provided by the server and attempt to
reuse it for any concurrent connections to the same server.
If multiple connections are initiated concurrently (e.g. when using
PeerDist) then defer sending the ClientHello for all but the first
connection, to allow time for the first connection to potentially
obtain a session ID (and thereby speed up the negotiation for all
remaining connections).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Devices that support jumbo frames will currently default to the
largest possible MTU. This assumption is valid for virtual adapters
such as virtio-net, where the MTU must have been configured by a
system administrator, but is unsafe in the general case of a physical
adapter.
Default to the standard Ethernet MTU, unless explicitly overridden
either by the driver or via the ${netX/mtu} setting.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Avoid calling rndis_halt() and rndis->op->close() twice if the call to
register_netdev() fails.
Reported-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
register_netdev expects ->hw_addr and ->ll_addr to be already filled,
so move it towards the end of register_rndis, after the respective
fields have been successfully queried from the underlying device.
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
As pointedly documented in RFC7230 section 2.3, HTTP is a stateless
protocol: each request message can be understood in isolation from any
other requests or responses. Various authentication schemes such as
NTLM break this fundamental property of HTTP and rely on the same TCP
connection being reused.
Work around these broken authentication schemes by ensuring that the
most recently pooled connection is reused for the subsequent
authentication retry.
Reported-by: Andreas Hammarskjöld <junior@2PintSoftware.com>
Tested-by: Andreas Hammarskjöld <junior@2PintSoftware.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The cipherstream xfer_window_changed() message is used to retrigger
the TLS transmit state machine. If the transmit state machine is
idle, then the window change message will not be propagated to the
plainstream interface. This can potentially cause the plainstream
interface peer (e.g. httpcore) to block waiting for a window change
message that will never arrive.
Fix by ensuring that the window change message is propagated to the
plainstream interface if the transmit state machine is idle. (If the
transmit state machine is not idle then the plainstream window will be
zero anyway.)
Signed-off-by: Michael Brown <mcb30@ipxe.org>
In TLS terminology a session conceptually spans multiple individual
connections, and essentially represents the stored cryptographic state
(master secret and cipher suite) required to establish communication
without going through the certificate and key exchange handshakes.
Rename tls_session to tls_connection in order to make the name
tls_session available to represent the session state.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
A failure in tls_generate_random() will result in a call to ref_put()
before the received data list has been initialised, which will cause
free_tls() to attempt to traverse an uninitialised list.
Fix by ensuring that all fields referenced by free_tls() are
initialised before any of the potential failure paths.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The blocked link test in eth_slow_lacp_rx() is performed before the
actor TLV is copied to the partner TLV, and so must test the actor
state field rather than the partner state field.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Mark the link as blocked if the LACP partner is not reporting itself
as being in sync, collecting, and distributing.
This matches the behaviour for STP: we mark the link as blocked if we
detect that the switch is actively blocking traffic, in order to
extend the DHCP discovery period and so prevent boot failures on
switches that take an excessively long time to enable ports.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The iSCSI root path may contain a literal IPv6 address. Update the
parser to handle this address format correctly.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The UNDI layer uses the NETDEV_IRQ_ENABLED flag to choose whether to
return PXENV_UNDI_ISR_OUT_OURS or PXENV_UNDI_ISR_OUT_NOT_OURS for a
given interrupt. For a network device that does not support
interrupts, the flag will never be set and so pxenv_undi_isr() will
always return PXENV_UNDI_ISR_OUT_NOT_OURS. This causes some NBPs
(such as lpxelinux.0) to hang.
Redefine NETDEV_IRQ_ENABLED as a simple administrative flag which can
be set even on network devices that do not support interrupts. This
allows pxenv_undi_isr() (which is the sole user of NETDEV_IRQ_ENABLED)
to function as expected by lpxelinux.0.
Signed-off-by: Martin Habets <mhabets@solarflare.com>
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The precise HTTP response status code is currently visible only at
DBGLVL_EXTRA. Allow for easier debugging by reporting the whole
status line at DBGLVL_LOG for any unsuccessful responses.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow individual authentication schemes to parse WWW-Authenticate
headers that do not comply with RFC2617.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Servers may provide multiple WWW-Authenticate headers, each offering a
different authentication scheme. We currently fail the request as
soon as we encounter an unrecognised scheme, which prevents subsequent
offers from succeeding.
Fix by silently ignoring headers for schemes that we do not recognise.
If no schemes are recognised then the request will eventually fail
anyway due to the 401 response code.
If multiple schemes are supported, arbitrarily choose the scheme
appearing first within the response headers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
In fully self-contained deployments it may be desirable to build iPXE
with an empty CROSSCERT source to avoid talking to external services.
Add an explicit check for this case and make validator_start_download
fail immediately if the base URI is empty.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Expose the underlying hardware address as a setting. For IPoIB
devices, this provides scripts with access to the Infiniband GUID.
Requested-by: Allen, Benjamin S. <bsallen@alcf.anl.gov>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Record and report the number of peers (calculated as the maximum
number of peers discovered for a block's segment at the time that the
block download is complete), and the percentage of blocks retrieved
from peers rather than from the origin server.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some external code (such as the UEFI UNDI driver for the Realtek USB
NIC on a Microsoft Surface Book) will block during transmission
attempts and can take several seconds to report a transmit error. If
there is a large queue of pending transmissions, then the accumulated
time from a series of such failures can easily exceed the EFI watchdog
timeout, resulting in what appears to be a system lockup followed by a
reboot.
Work around this problem by immediately cancelling any pending
transmissions as soon as any transmit error occurs.
The only expected transmit error under normal operation is ENOBUFS
arising when the hardware transmit queue is full. By definition, this
can happen only for drivers that do not utilise deferred
transmissions, and so this new behaviour will not affect these
drivers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Support renegotiation with servers supporting RFC5746. This allows
for the use of per-directory client certificates.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When performing a SAN boot, the plainstream window size will be zero
(since this is the mechanism used internally to indicate that no data
should be fetched via the initial request). This zero value currently
propagates to the advertised TCP window size, which prevents the TLS
negotiation from completing.
Fix by ensuring that the cipherstream window is held open until TLS
negotiation is complete, and only then falling back to passing through
the plainstream window size.
Reported-by: John Wigley <johnwigley#ipxe@acorna.co.uk>
Tested-by: John Wigley <johnwigley#ipxe@acorna.co.uk>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
As of kernel 4.11, the LIO target will propose a value for
FirstBurstLength if the initiator did not do so. This is entirely
redundant in our case, since FirstBurstLength is defined by RFC 3720
to be
"Irrelevant when: ( InitialR2T=Yes and ImmediateData=No )"
and we already enforce both InitialR2T=Yes and ImmediateData=No in our
initial proposal. However, LIO (arguably correctly) complains when we
do not respond to its redundant proposal of an already-irrelevant
value.
Fix by always proposing the default value for FirstBurstLength.
Debugged-by: Patrick Seeburger <info@8bit.de>
Tested-by: Patrick Seeburger <info@8bit.de>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
HTTP implements xfer_window_changed() on the underlying server
connection using http_step(), which does not propagate the window
change notification to the data transfer interface. This breaks the
multipath-capable SAN boot code, which relies on the window change
notification to discover that the HTTP block device is ready for
commands to be issued.
Fix by sending xfer_window_changed() in http_step() once the
underlying connection has been determined to be ready.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Describe all SAN devices via ACPI tables such as the iBFT. For tables
that can describe only a single device (i.e. the aBFT and sBFT), one
table is installed per device. For multi-device tables (i.e. the
iBFT), all devices are described in a single table.
An underlying SAN device connection may be closed at the time that we
need to construct an ACPI table. We therefore introduce the concept
of an "ACPI descriptor" which enables the SAN boot code to maintain an
opaque pointer to the underlying object, and an "ACPI model" which can
build tables from a list of such descriptors. This separates the
lifecycles of ACPI descriptions from the lifecycles of the block
device interfaces, and allows for construction of the ACPI tables even
if the block device interface has been closed.
For a multipath SAN device, iPXE will wait until sufficient
information is available to describe all devices but will not wait for
all paths to connect successfully. For example: with a multipath
iSCSI boot iPXE will wait until at least one path has become available
and name resolution has completed on all other paths. We do this
since the iBFT has to include IP addresses rather than DNS names. We
will commence booting without waiting for the inactive paths to either
become available or close; this avoids unnecessary boot delays.
Note that the Linux kernel will refuse to accept an iBFT with more
than two NIC or target structures. We therefore describe only the
NICs that are actually required in order to reach the described
targets. Any iBFT with at most two targets is therefore guaranteed to
describe at most two NICs.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Any underlying errors arising during ib_create_cq() or ib_create_qp()
are lost since the functions simply return NULL on error. This makes
debugging harder, since a debug-enabled build is required to discover
the root cause of the error.
Fix by returning a status code from these functions, thereby allowing
any underlying errors to be propagated.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some iSCSI targets send NOP-In. Rather than closing the connection
when we receive one, it is more user friendly to log a debug message
and keep the connection open. Eventually, it would be nice if iPXE
supported replying to NOP-Ins, but we might as well keep the
connection open until the target disconnects us.
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use intfs_shutdown() and intfs_restart() to cleanly shut down multiple
interfaces that may loop back to the same object.
This fixes a regression introduced by commit daa8ed9 ("[interface]
Provide intf_reinit() to reinitialise nullified interfaces") which
broke the use of HTTP Basic and Digest authentication.
Reported-by: murmansk <murmansk@hotmail.com>
Reported-by: Brett Waldo <brettwaldo@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow the active timer (providing udelay() and currticks()) to be
selected at runtime based on probing during the INIT_EARLY stage of
initialisation.
TICKS_PER_SEC is now a fixed compile-time constant for all builds, and
is independent of the underlying clock tick rate. We choose the value
1024 to allow multiplications and divisions on seconds to be converted
to bit shifts.
TICKS_PER_MS is defined as 1, allowing multiplications and divisions
on milliseconds to be omitted entirely. The 2% inaccuracy in this
definition is negligible when using the standard BIOS timer (running
at around 18.2Hz).
TIMER_RDTSC now checks for a constant TSC before claiming to be a
usable timer. (This timer can be tested in KVM via the command-line
option "-cpu host,+invtsc".)
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Separate out the concept of "hardware maximum supported frame length"
and "configured link MTU", and limit the latter according to the
former.
In networks where the DHCP-supplied link MTU is inconsistent with the
hardware or driver capabilities (e.g. a network using jumbo frames),
this will result in iPXE advertising a TCP MSS consistent with a size
that can actually be received.
Note that the term "MTU" is typically used to refer to the maximum
length excluding the link-layer headers; we adopt this usage.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide a settings applicator to modify netdev->max_pkt_len in
response to changes to the "mtu" setting (DHCP option 26).
Note that as with MAC address changes, drivers are permitted to
completely ignore any changes in the MTU value. The net result will
be that iPXE effectively uses the smaller of either the hardware
default MTU or the software configured MTU.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
For some unspecified "security" reason, the Google Compute Engine
metadata server will refuse any requests that do not include the
non-standard HTTP header "Metadata-Flavor: Google".
Attempt to autodetect such requests (by comparing the hostname against
"metadata.google.internal"), and add the "Metadata-Flavor: Google"
header if applicable.
Enable this feature in the CONFIG=cloud build, and include a sample
embedded script allowing iPXE to boot from a script configured as
metadata via e.g.
# Create shared boot image
make bin/ipxe.usb CONFIG=cloud EMBED=config/cloud/gce.ipxe
# Configure per-instance boot script
gcloud compute instances add-metadata <instance> \
--metadata-from-file ipxeboot=boot.ipxe
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The ISC Kea DHCP server transmits its DHCPOFFER as a unicast packet
with a broadcast IPv4 destination address (255.255.255.255). This
combination is currently rejected by iPXE.
Fix by explicitly accepting the local network broadcast address
(255.255.255.255) as a valid unicast destination address.
Reported-by: Roy Ledochowski <roy.ledochowski@hpe.com>
Tested-by: Roy Ledochowski <roy.ledochowski@hpe.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The command and data interfaces may be connected to the same object.
Nullify the data interface before shutting down the control interface
to avoid potential infinite loops.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Select the IPv6 source address and corresponding router (if any) using
a very simplified version of the algorithm from RFC6724:
- Ignore any source address that has a smaller scope than the
destination address. For example, do not use a link-local source
address when sending to a global destination address.
- If we have a source address which is on the same link as the
destination address, then use that source address.
- If we are left with multiple possible source addresses, then choose
the address with the smallest scope. For example, if we are sending
to a site-local destination address and we have both a global source
address and a site-local source address, then use the site-local
source address.
- If we are still left with multiple possible source addresses, then
choose the address with the longest matching prefix.
For the purposes of this algorithm, we treat RFC4193 Unique Local
Addresses as having organisation-local scope. Since we use only
link-local scope for our multicast transmissions, this approximation
should remain valid in all practical situations.
Originally-implemented-by: Thomas Bächler <thomas@archlinux.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use the IPv6 settings to construct the routing table, in a matter
analogous to the construction of the IPv4 routing table.
This allows for manual assignment of IPv6 addresses via e.g.
set net0/ip6 2001:ba8:0:1d4::6950:5845
set net0/len6 64
set net0/gateway6 fe80::226:bff:fedd:d3c0
The prefix length ("len6") may be omitted, in which case a default
prefix length of 64 will be assumed.
Multiple IPv6 addresses may be assigned manually by implicitly
creating child settings blocks. For example:
set net0/ip6 2001:ba8:0:1d4::6950:5845
set net0.ula/ip6 fda4:2496:e992::6950:5845
Signed-off-by: Michael Brown <mcb30@ipxe.org>
A reasonable user expectation is that ${net0/ip6} should show the
"highest-priority" of the IPv6 addresses, even when multiple IPv6
addresses are active. The expected order of priority is likely to be
manually-assigned addresses first, then stateful DHCPv6 addresses,
then SLAAC addresses, and lastly link-local addresses.
Using ${priority} to enforce an ordering is undesirable since that
would affect the priority assigned to each of the net<N> blocks as a
whole, so use the sibling ordering capability instead.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Originally-implemented-by: Hannes Reinecke <hare@suse.de>
Originally-implemented-by: Marin Hannache <git@mareo.fr>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Originally-implemented-by: Hannes Reinecke <hare@suse.de>
Originally-implemented-by: Marin Hannache <git@mareo.fr>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Expose the IPv6 address (or prefix) as ${ip6}, the prefix length as
${len6}, and the router address as ${gateway6}.
Originally-implemented-by: Hannes Reinecke <hare@suse.de>
Originally-implemented-by: Marin Hannache <git@mareo.fr>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The settings scope ipv6_scope refers specifically to IPv6 settings
that have a corresponding DHCPv6 option. Rename to dhcpv6_scope to
more accurately reflect this purpose.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently perform IPv6 stateless address autoconfiguration (SLAAC)
in response to any router advertisement with the relevant flags set.
This can result in the local IPv6 source address changing midway
through a TCP connection, since our connections bind only to a local
port number and do not store a local network address.
In addition, this behaviour for SLAAC is inconsistent with that for
DHCPv4 and stateful DHCPv6, both of which will be performed only as a
result of an explicit autoconfiguration action (e.g. via the default
autoboot sequence, or the "ifconf" command).
Fix by ignoring router advertisements arriving outside the context of
an ongoing autoconfiguration attempt.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
In a busy network (such as a public cloud), IPv4 addresses may be
recycled rapidly. When this happens, unidirectional traffic (such as
UDP syslog) will succeed, but bidirectional traffic (such as TCP
connections) may fail due to stale ARP cache entries on other nodes.
The remote ARP cache expiry timeout is likely to exceed iPXE's
connection timeout, meaning that boot attempts can fail before the
problem is automatically resolved.
Fix by sending gratuitous ARPs whenever an IPv4 address is changed, to
attempt to update stale remote ARP cache entries. Note that this is
not a guaranteed fix, since ARP is an unreliable protocol.
We avoid sending gratuitous ARPs unconditionally, since otherwise any
unrelated settings change (e.g. "set dns 192.168.0.1") would cause
unexpected gratuitous ARPs to be sent.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The vendor class identifier strings in DHCP_ARCH_VENDOR_CLASS_ID are
out of sync with the (correct) client architecture values in
DHCP_ARCH_CLIENT_ARCHITECTURE.
Fix by removing all definitions of DHCP_ARCH_VENDOR_CLASS_ID, and
instead generating the vendor class identifier string automatically
based on DHCP_ARCH_CLIENT_ARCHITECTURE and DHCP_ARCH_CLIENT_NDI.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
RFC3315 defines DHCPv6 option 16 (vendor class identifier) but does
not define any direct relationship with the roughly equivalent DHCPv4
option 60.
The PXE specification predates IPv6, and the UEFI specification is
expectedly vague on the subject. Examination of the reference EDK2
codebase suggests that the DHCPv6 vendor class identifier will be
formatted in accordance with RFC3315, using a single vendor-class-data
item in which the opaque-data field is the string as would appear in
DHCPv4 option 60.
RFC3315 requires the vendor class identifier to specify an IANA
enterprise number, as a way of disambiguating the vendor-class-data
namespace. The EDK2 code uses the value 343, described as:
// TODO: IANA TBD: temporarily using Intel's
Since this "TODO" has been present since at least 2010, it is probably
safe to assume that it has now become a de facto standard.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
RFC5970 defines DHCPv6 options 61 (client system architecture type)
and 62 (client network interface identifier), with contents equivalent
to DHCPv4 options 93 and 94 respectively.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
DHCPv4 and DHCPv6 share some values in common for the architecture-
specific options (such as the client system architecture type), but
use different encapsulations: DHCPv4 has a single byte for the option
length while DHCPv6 has a 16-bit field for the option length.
Move the containing DHCP_OPTION() and related wrappers from the
individual dhcp_arch.h files to dhcp.c, thus allowing for the
architecture-specific values to be reused in dhcpv6.c.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
In some circumstances, intermediate devices may lose state in a way
that temporarily prevents the successful delivery of packets from a
TCP peer. For example, a firewall may drop a NAT forwarding table
entry.
Since iPXE spends most of its time downloading files (and hence purely
receiving data, sending only TCP ACKs), this can easily happen in a
situation in which there is no reason for iPXE's TCP stack to generate
any retransmissions. The temporary loss of connectivity can therefore
effectively become permanent.
Work around this problem by sending TCP keepalives after a period of
inactivity on an established connection.
TCP keepalives usually send a single garbage byte in sequence number
space that has already been ACKed by the peer. Since we do not need
to elicit a response from the peer, we instead send pure ACKs (with no
garbage data) in order to keep the transmit code path simple.
Originally-implemented-by: Ladi Prosek <lprosek@redhat.com>
Debugged-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some HTTP/2 servers send the header "Connection: upgrade, close". This
currently causes iPXE to fail due to the unrecognised "upgrade" token.
Fix by ignoring any unrecognised tokens in the "Connection" header.
Reported-by: Ján ONDREJ (SAL) <ondrejj@salstar.sk>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add a build configuration option NET_PROTO_LACP to control whether or
not LACP support is included for Ethernet devices.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
It is possible for the preloaded UNDI device to end up with no
specified bus type, since it may not be recognised as either a PCI or
an ISAPnP device. This will result in a bus type value of zero, which
currently results in NULL being treated as a string pointer by
netdev_fetch_bustype().
Fix by returning ENOENT if an unknown bus type is specified.
Reported-by: Todd Stansell <todd@stansell.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide a build option CROSSCERT in config/crypto.h to allow the
default cross-signed certificate source to be configured at build
time. The ${crosscert} setting may still be used to reconfigure the
cross-signed certificate source at runtime.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
There is no practical way to generate an underlength ARP packet since
an ARP packet is always padded up to the minimum Ethernet frame length
(or dropped by the receiving Ethernet hardware if incorrectly padded),
but the absence of an explicit check causes warnings from some
analysis tools.
Fix by adding an explicit check on the I/O buffer length.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Many TLS records contain variable-length fields. We currently
validate the overall record length, but do so only after reading the
length of the variable-length field. If the record is too short to
even contain the length field, then we may read uninitialised data
from beyond the end of the record.
This is harmless in practice (since the subsequent overall record
length check would fail regardless of the value read from the
uninitialised length field), but causes warnings from some analysis
tools.
Fix by validating that the overall record length is sufficient to
contain the length field before reading from the length field.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add a build configuration option VNIC_IPOIB to control whether or not
IPoIB support is included for Infiniband devices.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When a CMRC connection is closed, the deferred shutdown process calls
ib_destroy_qp(). This will cause the receive work queue entries to
complete in error (since they are being cancelled), which will in turn
reschedule the deferred shutdown process. This eventually leads to
ib_destroy_conn() being called on a connection that has already been
freed.
Fix by explicitly cancelling any pending shutdown process after the
shutdown process has completed.
Ironically, this almost exactly reverts commit 019d4c1 ("[infiniband]
Use a one-shot process for CMRC shutdown"); prior to the introduction
of one-shot processes the only way to achieve a one-shot process was
for the process to cancel itself.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
TFTP URIs are intrinsically problematic, since:
- TFTP servers may use either normal slashes or backslashes as a
directory separator,
- TFTP servers allow filenames to be specified using relative paths
(with no initial directory separator),
- TFTP filenames present in a DHCP filename field may use special
characters such as "?" or "#" that prevent parsing as a generic URI.
As of commit 7667536 ("[uri] Refactor URI parsing and formatting"), we
have directly constructed TFTP URIs from DHCP next-server and filename
pairs, avoiding the generic URI parser. This eliminated the problems
related to special characters, but indirectly made it impossible to
parse a "tftp://..." URI string into a TFTP URI with a non-absolute
path.
Re-introduce the convention of requiring an extra slash in a
"tftp://..." URI string in order to specify a TFTP URI with an initial
slash in the filename. For example:
tftp://192.168.0.1/boot/pxelinux.0 => RRQ "boot/pxelinux.0"
tftp://192.168.0.1//boot/pxelinux.0 => RRQ "/boot/pxelinux.0"
This is ugly, but there seems to be no other sensible way to provide
the ability to specify all possible TFTP filenames.
A side-effect of this change is that format_uri() will no longer add a
spurious initial "/" when formatting a relative URI string. This
improves the console output when fetching an image specified via a
relative URI.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Expose the network interface name (e.g. "net0") as a setting. This
allows a script to obtain the name of the most recently opened network
interface via ${netX/ifname}.
Signed-off-by: Andrew Widdersheim <amwiddersheim@gmail.com>
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The three nominally-disambiguated ENOTSUP errors accidentally all used
the same error disambiguator, rendering them identical. Fix by
changing all three values. We avoid reusing the 0x01 disambiguator
value, since that remains ambiguous in older binaries.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
For historical reasons, iPXE sets the current working URI to the root
of the TFTP server whenever the TFTP server address is changed. This
was originally implemented in the hope of allowing a DHCP-provided
TFTP filename to be treated simply as a relative URI. This usage
turns out to be impractical since DHCP-provided TFTP filenames may
include characters which would have special significance to the URI
parser, and so the DHCP next-server+filename combination is now
handled by the dedicated pxe_uri() function instead.
The practice of setting the current working URI to the root of the
TFTP server is potentially helpful for interactive uses of iPXE,
allowing a user to type e.g.
iPXE> dhcp
Configuring (net0 52:54:00:12:34:56)... ok
iPXE> chain pxelinux.0
and have the URI "pxelinux.0" interpreted as being relative to the
root of the TFTP server provided via DHCP.
The current implementation of tftp_apply_settings() has an unintended
flaw. When the "dhcp" command is used to renew a DHCP lease (or to
pick up potentially modified DHCP options), the old settings block
will be unregistered before the new settings block is registered.
This causes tftp_apply_settings() to believe that the TFTP server has
been changed twice (to 0.0.0.0 and back again), and so the current
working URI will always be set to the root of the TFTP server, even if
the DHCP response provides exactly the same TFTP server as previously.
Fix by doing nothing in tftp_apply_settings() whenever there is no
TFTP server address.
Debugged-by: Andrew Widdersheim <awiddersheim@inetu.net>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Resolve redirection URIs as being relative to the original HTTP
request URI, rather than treating them as being implicitly relative to
the current working URI.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
For switches which remain permanently in the non-forwarding state (or
which erroneously report a non-forwarding state), ensure that iPXE
will eventually give up waiting for the link to become unblocked.
Originally-fixed-by: Wissam Shoukair <wissams@mellanox.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
If we detect (via STP) that a switch port is in a non-forwarding
state, then the link is marked as being temporarily blocked and DHCP
discovery will be deferred until the link becomes unblocked.
The timer used to decide when to give up waiting for ProxyDHCPOFFERs
is currently based on the time that DHCP discovery was started, and
makes no allowances for any time spent waiting for the link to become
unblocked. Consequently, if STP is used then the timeout for
ProxyDHCPOFFERs becomes essentially zero.
Fix by resetting the recorded start time whenever DHCP discovery is
deferred due to a blocked link.
Debugged-by: Sebastian Roth <sebastian.roth@zoho.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Avoid accidentally dereferencing a NULL cipher context pointer for
plaintext blocks (which are usually messages with a block length of
zero, indicating a missing block).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
TCP/IP checksum fields are one's complement values and therefore have
two possible representations of zero: positive zero (0x0000) and
negative zero (0xffff).
In RFC768, UDP over IPv4 exploits this redundancy to repurpose the
positive representation of zero (0x0000) to mean "no checksum
calculated"; checksums are optional for UDP over IPv4.
In RFC2460, checksums are made mandatory for UDP over IPv4. The
wording of the RFC is such that the UDP header is mandated to use only
the negative representation of zero (0xffff), rather than simply
requiring the checksum to be correct but allowing for either
representation of zero to be used.
In RFC1071, an example algorithm is given for calculating the TCP/IP
checksum. This algorithm happens to produce only the positive
representation of zero (0x0000); this is an artifact of the way that
unsigned arithmetic is used to calculate a signed one's complement
sum (and its final negation).
A common misconception has developed (exemplified in RFC1624) that
this artifact is part of the specification. Many people have assumed
that the checksum field should never contain the negative
representation of zero (0xffff).
A sensible receiver will calculate the checksum over the whole packet
and verify that the result is zero (in whichever representation of
zero happens to be generated by the receiver's algorithm). Such a
receiver will not care which representation of zero happens to be used
in the checksum field.
However, there are receivers in existence which will verify the
received checksum the hard way: by calculating the checksum over the
remainder of the packet and comparing the result against the checksum
field. If the representation of zero used by the receiver's algorithm
does not match the representation of zero used by the transmitter (and
so placed in the checksum field), and if the receiver does not
explicitly allow for both representations to compare as equal, then
the receiver may reject packets with a valid checksum.
For UDP, the combined RFCs effectively mandate that we should generate
only the negative representation of zero in the checksum field.
For IP, TCP and ICMP, the RFCs do not mandate which representation of
zero should be used, but the misconceptions which have grown up around
RFC1071 and RFC1624 suggest that it would be least surprising to
generate only the positive representation of zero in the checksum
field.
Fix by ensuring that all of our checksum algorithms generate only the
positive representation of zero, and explicitly inverting this in the
case of transmitted UDP packets.
Reported-by: Wissam Shoukair <wissams@mellanox.com>
Tested-by: Wissam Shoukair <wissams@mellanox.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently do not populate the ciaddr field in the constructed PXE
Boot Server ACK packet. This causes a WDS server to respond with a
broadcast packet, which is then ignored by wdsmgfw.efi since it does
not match the specified IP address filter.
Fix by populating ciaddr within the constructed PXE Boot Server ACK
packet.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We attempt to mimic the behaviour of Intel's PXE ROM by skipping the
separate ProxyDHCPREQUEST if the ProxyDHCPOFFER already contains a
boot filename or a PXE boot menu.
Experimentation reveals that Intel's PXE ROM will also check for a
non-empty next-server address alongside the boot filename. Update our
test to match this behaviour.
Reported-by: Wissam Shoukair <wissams@mellanox.com>
Tested-by: Wissam Shoukair <wissams@mellanox.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit 09b057c ("[settings] Remove "uristring" setting type") removed
support for URI-encoded settings via the "uristring" setting type, on
the basis that such encoding was no longer necessary to avoid problems
with the command line parser.
Other valid use cases for the "uristring" setting type do exist: for
example, a password containing a '/' character expanded via
chain http://username:${password:uristring}@server.name/boot.php
Restore the existence of the "uristring" setting, avoiding the
potentially large stack allocations that were used in the old code
prior to commit 09b057c ("[settings] Remove "uristring" setting
type").
Requested-by: Robin Smidsrød <robin@smidsrod.no>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some ProxyDHCP servers and PXE boot servers do not specify a DHCP
server identifier via option 54. We currently work around this in a
variety of ad-hoc ways:
- if a ProxyDHCPACK has no server identifier then we treat it as
having the correct server identifier,
- if a boot server ACK has no server identifier then we use the
packet's source IP address as the server identifier.
Introduce the concept of a DHCP server pseudo-identifier, defined as
being:
- the server identifier (option 54), or
- if there is no server identifier, then the next-server address
(siaddr),
- if there is no server identifier or next-server address, then the
DHCP packet's source IP address.
Use the pseudo-identifier in place of the server identifier when
handling ProxyDHCP and PXE boot server responses.
Originally-fixed-by: Wissam Shoukair <wissams@mellanox.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Infiniband link status change callback ipoib_link_state_changed()
may be called while the IPoIB device is closed, in which case there
will not be an IPoIB queue pair to be joined to the IPv4 broadcast
group. This leads to NULL pointer dereferences in ib_mcast_attach()
and ib_mcast_detach().
Fix by not attempting to join (or leave) the broadcast group unless we
actually have an IPoIB queue pair.
Signed-off-by: Wissam Shoukair <wissams@mellanox.com>
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Rewrite the HTTP core to allow for the addition of arbitrary content
encoding mechanisms, such as PeerDist and gzip.
The core now exposes http_open() which can be used to create requests
with an explicitly selected HTTP method, an optional requested content
range, and an optional request body. A simple wrapper provides the
preexisting behaviour of creating either a GET request or an
application/x-www-form-urlencoded POST request (if the URI includes
parameters).
The HTTP SAN interface is now implemented using the generic block
device translator. Individual blocks are requested using http_open()
to create a range request.
Server connections are now managed via a connection pool; this allows
for multiple requests to the same server (e.g. for SAN blocks) to be
completely unaware of each other. Repeated HTTPS connections to the
same server can reuse a pooled connection, avoiding the per-connection
overhead of establishing a TLS session (which can take several seconds
if using a client certificate).
Support for HTTP SAN booting and for the Basic and Digest
authentication schemes is now optional and can be controlled via the
SANBOOT_PROTO_HTTP, HTTP_AUTH_BASIC, and HTTP_AUTH_DIGEST build
configuration options in config/general.h.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add support for SHA-224, SHA-384, and SHA-512 as digest algorithms in
X.509 certificates, and allow the choice of public-key, cipher, and
digest algorithms to be configured at build time via config/crypto.h.
Originally-implemented-by: Tufan Karadere <tufank@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The current implementation handles big-endian 24-bit integers (which
occur in several TLS record types) by treating them as big-endian
32-bit integers which are shifted by 8 bits. This can result in
"Invalid read" errors when running under valgrind, if the 24-bit field
happens to be exactly at the end of an I/O buffer.
Fix by ensuring that we touch only the three bytes which comprise the
24-bit integer.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
VLAN and 802.11 devices use a network device operations structure that
wraps an underlying structure. For example, the vlan_operations
structure wraps the network device operations structure of the
underlying trunk device. This can cause false positives from the
current implementation of netdev_irq_supported(), which will always
report that VLAN devices support interrupts since it has no visibility
into the support provided by the underlying trunk device.
Fix by allowing network devices to explicitly flag that interrupts are
not supported, despite the presence of an irq() method.
Originally-fixed-by: Wissam Shoukair <wissams@mellanox.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
iscsi_tx_done() is missing "break" statements at the end of each case.
(Fortunately, this happens not to cause a bug in practice, since
iscsi_login_request_done() is effectively a no-op when completing a
data-out PDU.)
Reported-by: Wissam Shoukair <wissams@mellanox.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Extend the IPv6 concept of "scope ID" (indicating the network device
index) to IPv4 socket addresses, so that IPv4 multicast transmissions
may specify the transmitting network device.
The scope ID is not (currently) exposed via the string representation
of the socket address, since IPv4 does not use the IPv6 concept of
link-local addresses (which could legitimately be specified in a URI).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Redefine various IPv4 address constants and testing macros to avoid
unnecessary byte swapping at runtime, and slightly rename the macros
to prevent code from accidentally using the old definitions.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Avoid using zero as a network device index, so that a zero
sin6_scope_id can be used to mean "unspecified" (rather than
unintentionally meaning "net0").
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When an IPv6 socket address string specifies a link-local or multicast
address but does not specify the requisite network device name
(e.g. "fe80::69ff:fe50:5845" rather than "fe80::69ff:fe50:5845%net0"),
assume the use of "netX".
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide a generic inject_fault() function that can be used to inject
random faults with configurable probabilities.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently do not wait for a received FIN before exiting to boot a
loaded OS. In the common case of booting from an HTTP server, this
means that the TCP connection is left consuming resources on the
server side: the server will retransmit the FIN several times before
giving up.
Fix by initiating a graceful close of all TCP connections and waiting
(for up to one second) for all connections to finish closing
gracefully (i.e. for the outgoing FIN to have been sent and ACKed, and
for the incoming FIN to have been received and ACKed at least once).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The only way to map an eIPoIB MAC address (REMAC) to an IPoIB MAC
address is to intercept an incoming ARP request or reply.
If we do not have an REMAC cache entry for a particular destination
MAC address, then we cannot transmit the packet. This can arise in at
least two situations:
- An external program (e.g. a PXE NBP using the UNDI API) may attempt
to transmit to a destination MAC address that has been obtained by
some method other than ARP.
- Memory pressure may have caused REMAC cache entries to be
discarded. This is fairly likely on a busy network, since REMAC
cache entries are created for all received (broadcast) ARP
requests. (We can't sensibly avoid creating these cache entries,
since they are required in order to send an ARP reply, and when we
are being used via the UNDI API we may have no knowledge of which
IP addresses are "ours".)
Attempt to ameliorate the situation by generating a semi-spurious ARP
request whenever we find a missing REMAC cache entry. This will
hopefully trigger an ARP reply, which would then provide us with the
information required to populate the REMAC cache.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
If the link is blocked (e.g. due to a Spanning Tree Protocol port not
yet forwarding packets) then defer DHCP discovery until the link
becomes unblocked.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
A fairly common end-user problem is that the default configuration of
a switch may leave the port in a non-forwarding state for a
substantial length of time (tens of seconds) after link up. This can
cause iPXE to time out and give up attempting to boot.
We cannot force the switch to start forwarding packets sooner, since
any attempt to send a Spanning Tree Protocol bridge PDU may cause the
switch to disable our port (if the switch happens to have the Bridge
PDU Guard feature enabled for the port).
For non-ancient versions of the Spanning Tree Protocol, we can detect
whether or not the port is currently forwarding and use this to inform
the network device core that the link is currently blocked.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When Spanning Tree Protocol (STP) is used, there may be a substantial
delay (tens of seconds) from the time that the link goes up to the
time that the port starts forwarding packets.
Add a generic concept of a "blocked link" (i.e. a link which is up but
which is not expected to communicate successfully), and allow "ifstat"
to indicate when a link is blocked.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
In some Ethernet framing variants the two-byte protocol field is used
as a length, with the Ethernet header being followed by an IEEE 802.2
LLC header. The first two bytes of the LLC header are the DSAP and
SSAP.
If the received Ethernet packet appears to use this framing, then
interpret the two-byte DSAP and SSAP as being the network-layer
protocol. This allows support for receiving Spanning Tree Protocol
frames (which use an LLC header with {DSAP,SSAP}=0x4242) to be added
without requiring a full LLC protocol layer.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently shrink the TCP window permanently if we are ever forced
(by a low-memory condition) to discard a previously received TCP
packet. This behaviour was intended to reduce the number of
retransmissions in a lossy network, since lost packets might
potentially result in the entire window contents being retransmitted.
Since commit e0fc8fe ("[tcp] Implement support for TCP Selective
Acknowledgements (SACK)") the cost of lost packets has been reduced by
around one order of magnitude, and the reduction in the window size
(which affects the maximum throughput) is now the more significant
cost.
Remove the code which reduces the TCP maximum window size when a
received packet is discarded.
Reported-by: Wissam Shoukair <wissams@mellanox.com>
Tested-by: Wissam Shoukair <wissams@mellanox.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>