This code largely inspired by tap.c. Allows for testing iPXE on real
NICs from within Linux. For example:
make bin-x86_64-linux/af_packet.linux
valgrind ./bin-x86_64-linux/af_packet.linux --net af_packet,if=eth3
Tested as x86_64 and i386 binary.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Virtio 0.9 implementation was limited to the maximum virtqueue size of
MAX_QUEUE_NUM and the virtio-net driver would fail to initialize on hosts
exceeding this limit.
This commit lifts the restriction by allocating the queue memory based on
the actual queue size instead of using a fixed maximum. Note that virtio
1.0 still uses the MAX_QUEUE_NUM constant to cap the size (unfortunately
this functionality is not available in virtio 0.9).
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
This commit introduces virtnet_free_virtqueues called on all virtqueue
error and shutdown paths. vpm_find_vqs no longer cleans up after itself
and instead expects virtnet_free_virtqueues to be always called to undo
its effect.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
vpm_find_vqs incorrectly accepted the host provided queue size with no
regard to iPXE's internal limitations. Virtio 1.0 makes it possible for
the driver to override the queue size to reduce memory requirements and
iPXE is a great use case for this feature.
Also removing the extra vq->vring.num assignment which is already
handled in vring_init.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Updates:
- Nodnic: Support for arm cq doorbell via the UAR BAR
- Ensure hardware is quiescent when no interface is open - WinPE WA
- Support for clear interrupt via BAR
- Nodnic: Support for send TX doorbells via the UAR BAR
- Added ConnectX-5EX device
- Added ConnectX-5 device
Signed-off-by: Raed Salem <raeds@mellanox.com>
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit db34436 ("[intel] Strip spurious VLAN tags received by virtual
function NICs") accidentally introduced two copies of the
intel[x]vf_mbox_queues() function. Remove the unintended copy.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The physical function may be configured to transparently insert a VLAN
tag into all transmitted packets. Unfortunately, it does not
equivalently strip this same VLAN tag from all received packets. This
behaviour may be observed in some Amazon EC2 instances with Enhanced
Networking enabled: transmissions work as expected but all packets
received by iPXE appear to have a spurious VLAN tag.
We can configure the receive queue to strip VLAN tags via the
RXDCTL.VME bit. We need to find out from the PF driver whether or not
we should do so.
There exists a "get queue configuration" mailbox message which
contains a field labelled IXGBE_VF_TRANS_VLAN in the Linux driver.
A comment in the Linux PF driver describes this field as "notify VF of
need for VLAN tag stripping, and correct queue". It will be filled
with a non-zero value if the PF is enforcing the use of a single VLAN
tag. It will also be filled with a non-zero value if the PF is using
multiple traffic classes.
The Linux VF driver seems to treat this field as being simply the
number of traffic classes, and gives it no VLAN-related
interpretation. The Linux VF driver instead handles the VLAN tag
stripping by simply assuming that any unrecognised VLAN tag ought to
be silently dropped.
We choose to strip and ignore the VLAN tag if the IXGBE_VF_TRANS_VLAN
field has a non-zero value.
Reported-by: Leonid Vasetsky <leonidv@velostrata.com>
Tested-by: Leonid Vasetsky <leonidv@velostrata.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
iPXE debug logging doesn't support %u. This commit replaces it with
%d in virtio-pci debug format strings.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
ARM64 has a weaker memory order model than x86. The missing memory
barrier caused phy initialization notification to be delayed beyond
the link-wait timeout (15 secs).
Signed-off-by: Leendert van Doorn <leendert@paramecium.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Extend the 16-bit PCI bus:dev.fn address to a 32-bit seg🚌dev.fn
address, assuming a segment value of zero in contexts where multiple
segments are unsupported by the underlying data structures (e.g. in
the iBFT or BOFM tables).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
This backport is from linux kernel upstream commit 83d6f1f ("ath9k:
fix buffer overrun for ar9287").
Signed-off-by: Christian Hesse <mail@eworm.de>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The UEFI specification requires the EFI_SIMPLE_NETWORK_PROTOCOL
GetStatus() method to set TxBuf to NULL if there are no transmit
buffers to recycle.
Some implementations (observed with Lan9118Dxe in EDK2) fill in TxBuf
only when there is a transmit buffer to recycle, which leads to large
numbers of "spurious TX completion" errors.
Work around this problem by initialising TxBuf to NULL before calling
the GetStatus() method.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit 86f96a4 ("[tg3] Remove x86-specific inline assembly")
introduced a regression in _tg3_flag() in 64-bit builds, since any
flags in the upper 32 bits of a 64-bit unsigned long would be
discarded when truncating to a 32-bit int.
Debugged-by: Shane Thompson <shane.thompson@aeontech.com.au>
Tested-by: Shane Thompson <shane.thompson@aeontech.com.au>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
This commit makes virtio-net support devices with VEN 0x1af4 and DEV
0x1041, which is how non-transitional (modern-only) virtio-net devices
are exposed on the PCI bus.
Transitional devices supporting both the old 0.9.5 and new 1.0 version
of the virtio spec are driven using the new protocol. Legacy devices
are driven using the old protocol, same as before this commit.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
This commit adds support for driving virtio 1.0 PCI devices. In
addition to various helpers, a number of vpm_ functions are introduced
to be used instead of their legacy vp_ counterparts when accessing
virtio 1.0 (aka modern) devices.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Virtio 1.0 introduces new constants and data structures, common to all
devices as well as specific to virtio-net. This commit adds a subset
of these to be able to drive the virtio-net 1.0 network device.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
PCI devices may support more capabilities of the same type (for
example PCI_CAP_ID_VNDR) and there was no way to discover all of them.
This commit adds a new API pci_find_next_capability which provides
this functionality. It would typically be used like so:
for (pos = pci_find_capability(pci, PCI_CAP_ID_VNDR);
pos > 0;
pos = pci_find_next_capability(pci, pos, PCI_CAP_ID_VNDR)) {
...
}
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
There is no way for the hardware to give us an invalid length in the
LRH, since it must have parsed this length field in order to perform
header splitting. However, this is difficult to prove conclusively.
Add an unnecessary length check to explicitly reject any packets
larger than the posted receive I/O buffer.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
There is no way for the hardware to give us an invalid length in the
LRH, since it must have parsed this length field in order to perform
header splitting. However, this is difficult to prove conclusively.
Add an unnecessary length check to explicitly reject any packets
larger than the posted receive I/O buffer.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some versions of gcc complain that "'__bswap_variable_32' is static
but used in inline function 'golan_check_rc_and_cmd_status' which is
not static".
Fix by making golan_check_rc_and_cmd_status() a static inline.
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Infiniband specification (volume 1, section 11.4.1.2 "Post Receive
Request") notes that for UD QPs, the GRH will be placed in the first
40 bytes of the receive buffer if present. (If no GRH is present,
which is normal, then the first 40 bytes of the receive buffer will be
unused.)
Mellanox hardware performs this placement automatically: other headers
will be stripped (and their values returned via the CQE), but the
first 40 bytes of the data buffer will be consumed by the (probably
non-existent) GRH.
This does not fit neatly into iPXE's internal abstraction, which
expects the data buffer to represent just the data payload with the
addresses from the GRH (if present) passed as additional parameters to
ib_complete_recv().
The end result of this discrepancy is that attempts to receive
full-sized 2048-byte IPoIB packets on Mellanox hardware will fail.
Fix by allocating a separate ring buffer to hold the received GRHs.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
This driver is the original source of the current readq() and writeq()
implementations for 32-bit iPXE. Switch to using the now-centralised
definitions, to avoid including architecture-specific code in an
otherwise architecture-independent driver.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some EoIB implementations utilise an EoIB-to-Ethernet gateway device
that does not perform a FullMember join to the multicast group for the
EoIB broadcast domain. This has various exciting side-effects, such
as requiring every EoIB node to send every broadcast packet twice.
As an added bonus, the gateway may also break the EoIB MAC address to
GID mapping protocol by sending Ethernet-sourced packets from the
wrong QPN.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some EoIB implementations require each individual EoIB node to create
the multicast group for the EoIB broadcast domain.
It is left as an exercise for the interested reader to determine how
such an implementation might ever allow the parameters of such a
multicast group to be changed without requiring a simultaneous upgrade
of every driver on every operating system on every machine currently
attached to the fabric.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some EoIB implementations transmit a vendor-proprietary heartbeat
packet on the same multicast group used to provide the EoIB broadcast
domain.
Silently ignore these heartbeat packets, to avoid cluttering up the
network interface error statistics.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
EoIB is a fairly simple protocol in which raw Ethernet frames
(excluding the CRC) are encapsulated within Infiniband Unreliable
Datagrams, with a four-byte fixed EoIB header (which conveys no actual
information). The Ethernet broadcast domain is provided by a
multicast group, similar to the IPoIB IPv4 multicast group.
The mapping from Ethernet MAC addresses to Infiniband address vectors
is achieved by snooping incoming traffic and building a peer cache
which can then be used to map a MAC address into a port GID. The
address vector is completed using a path record lookup, as for IPoIB.
Note that this requires every packet to include a GRH.
Add basic support for EoIB devices. This driver is substantially
derived from the IPoIB driver. There is currently no mechanism for
automatically creating EoIB devices.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit e62e52b ("[ipoib] Simplify test for received broadcast
packets") relies upon the multicast LID being present in the
destination address vector as passed to ipoib_complete_recv().
Unfortunately, this information is not present in many Infiniband
devices' completion queue entries.
Fix by testing instead for the presence of a multicast GID.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
ath_rx_init() demonstrates some serious confusion over how to use
pointers, resulting in (uint32_t*)NULL being used as a temporary
variable. This does not end well.
The broken code in question is performing manual alignment of I/O
buffers, which can now be achieved more simply using alloc_iob_raw().
Fix by removing ath_rxbuf_alloc() entirely.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some protocols (such as ARP) may modify the received packet and re-use
the same I/O buffer for transmission of a reply. The SMSC95XX
transmit header is larger than the receive header: the re-used I/O
buffer therefore does not have sufficient headroom for the transmit
header, and the ARP reply will therefore fail to be transmitted. This
is essentially the same problem as in commit 2e72d10 ("[ncm] Reserve
headroom in received packets").
Fix by reserving sufficient space at the start of each received packet
to allow for the difference between the lengths of the transmit and
receive headers.
This problem is not caught by the current driver development test
suite (documented at http://ipxe.org/dev/driver), since even the large
file transfer tests tend to completely sufficiently quickly that there
is no need for the server to ever send an ARP request. The failure
shows up only when using a very slow protocol such as RFC7440-enhanced
TFTP (as used by Windows Deployment Services).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The LED pins are configured by default as GPIO inputs. While it is
conceivable that a board might actually use these pins as GPIOs, no
such board is known to exist.
The Linux smsc95xx driver configures these pins unconditionally as LED
outputs. Assume that it is safe to do likewise.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The SMSC95xx devices tend to be used in embedded systems with a
variety of ad-hoc mechanisms for storing the MAC address.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some xHCI controllers (such as qemu's emulated xHCI controller) do not
correctly handle zero-length packets that are part of a TRB chain.
The zero-length TRB ends up being squashed and does not result in a
zero-length packet as seen by the device.
Work around this problem by marking the zero-length packet as
belonging to a separate transfer descriptor.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some hubs (e.g. the Avocent Corp. Virtual Hub on a Lenovo x3550
Integrated Management Module) have been observed to require more than
the standard 200ms for ports to stabilise, with the result that
devices appear to disconnect and immediately reconnect during the
initial bus enumeration.
Work around this problem by allowing specific hubs an extra 500ms of
settling time.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Record the speed of a USB device based on the port's speed at the time
that the device was enabled. This allows us to remember the device's
speed even after the device has been disconnected (and so the port's
current speed has changed).
In particular, this allows us to correctly identify the transaction
translator for a low-speed or full-speed device after the device has
been disconnected.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The usb_message() and usb_stream() functions currently check for
port->speed==USB_SPEED_NONE to determine whether or not a device has
been unplugged. This test will give a false negative result if a new
device has been plugged in before the hotplug mechanism has finished
handling the removal of the old device.
Fix by checking instead the port->disconnected flag, which is now
cleared only after completing the removal of the old device.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Tested using QEMU and usbredir to expose the LAN9512 chip present on a
Raspberry Pi.
There is a known issue with the LAN9512: an extra two bytes are
appended to every transmitted packet. These two bytes comprise:
{ 0x00, 0x08 } if packet length == 0 (mod 8)
{ CRC[0], 0x00 } if packet length == 7 (mod 8)
{ CRC[0], CRC[1] } otherwise
The extra bytes are appended whether the Ethernet CRC is generated
manually or added automatically by the hardware. The issue occurs
with the Linux kernel driver as well as the iPXE driver. It appears
to be an undocumented hardware errata.
TCP/IP traffic is not affected, since the IP header length field
causes the extraneous bytes to be discarded by the receiver. However,
protocols that rely on the length of the Ethernet frame (such as FCoE
or iPXE's "lotest" protocol) will be unusable on this hardware.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On some models (notably ICH), the PHY reset mechanism appears to be
broken. In particular, the PHY_CTRL register will be correctly loaded
from NVM but the values will not be propagated to the "OEM bits" PHY
register. This typically has the effect of dropping the link speed to
10Mbps.
Since the original version of this driver in commit 945e428 ("[intel]
Replace driver for Intel Gigabit NICs"), we have always worked around
this problem by skipping the PHY reset if the link is already up.
Enhance this workaround by explicitly checking for known-broken PCI
IDs.
Reported-by: Robin Smidsrød <robin@smidsrod.no>
Tested-by: Robin Smidsrød <robin@smidsrod.no>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Make the class ID a property of the USB driver (rather than a property
of the USB device ID), and allow USB drivers to specify a wildcard ID
for any of the three component IDs (class, subclass, or protocol).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Generate a score for each possible USB device configuration based on
the available driver support, and select the configuration with the
highest score. This will allow us to prefer ECM over RNDIS (for
devices which support both) and will allow us to meaningfully select a
configuration even when we have drivers available for all functions
(e.g. when exposing unused functions via EFI_USB_IO_PROTOCOL).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The decision on whether or not a zero-length packet needs to be
transmitted is independent of the host controller and belongs in the
USB core.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow iPXE to coexist with other USB device drivers, by attaching to
the EFI_USB_IO_PROTOCOL instances provided by the UEFI platform
firmware.
The EFI_USB_IO_PROTOCOL is an unsurprisingly badly designed
abstraction of a USB device. The poor design choices intrinsic in the
UEFI specification prevent efficient operation as a network device,
with the result that devices operated using the EFI_USB_IO_PROTOCOL
operate approximately two orders of magnitude slower than devices
operated using our native EHCI or xHCI host controller drivers.
Since the performance is so abysmally slow, and since the underlying
problems are due to fundamental architectural mistakes in the UEFI
specification, support for the EFI_USB_IO_PROTOCOL host controller
driver is left as disabled by default. Users are advised to use the
native iPXE host controller drivers instead.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The raw EFI_HANDLE value is almost never useful to know, and simply
adds noise to the already verbose debug messages. Improve the
legibility of debug messages by using only the name generated by
efi_handle_name().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Infiniband link status change callback ipoib_link_state_changed()
may be called while the IPoIB device is closed, in which case there
will not be an IPoIB queue pair to be joined to the IPv4 broadcast
group. This leads to NULL pointer dereferences in ib_mcast_attach()
and ib_mcast_detach().
Fix by not attempting to join (or leave) the broadcast group unless we
actually have an IPoIB queue pair.
Signed-off-by: Wissam Shoukair <wissams@mellanox.com>
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Multicast MAC addresses will never have REMAC cache entries, and the
corresponding multicast IPoIB MAC address cannot be obtained simply by
issuing an ARP request.
For the trivial volume of multicast packets that we expect to send in
any realistic scenario, the simplest solution is to send them as
broadcasts instead.
Reported-by: Wissam Shoukair <wissams@mellanox.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The only way to map an eIPoIB MAC address (REMAC) to an IPoIB MAC
address is to intercept an incoming ARP request or reply.
If we do not have an REMAC cache entry for a particular destination
MAC address, then we cannot transmit the packet. This can arise in at
least two situations:
- An external program (e.g. a PXE NBP using the UNDI API) may attempt
to transmit to a destination MAC address that has been obtained by
some method other than ARP.
- Memory pressure may have caused REMAC cache entries to be
discarded. This is fairly likely on a busy network, since REMAC
cache entries are created for all received (broadcast) ARP
requests. (We can't sensibly avoid creating these cache entries,
since they are required in order to send an ARP reply, and when we
are being used via the UNDI API we may have no knowledge of which
IP addresses are "ours".)
Attempt to ameliorate the situation by generating a semi-spurious ARP
request whenever we find a missing REMAC cache entry. This will
hopefully trigger an ARP reply, which would then provide us with the
information required to populate the REMAC cache.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
As with the neighbour cache, discarding an REMAC cache entry is
potentially very disruptive.
Originally-fixed-by: Wissam Shoukair <wissams@mellanox.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some Intel Skylake platforms (observed on a prototype Lenovo ThinkPad)
report the list of available USB3 protocol speed ID values as {1,2,3}
but then report a port's speed using ID value 4.
The value 4 happens to be the default value for SuperSpeed (when no
protocol speed ID value list is explicitly defined), and the hardware
seems to function correctly if we simply ignore its protocol speed ID
table and assume that it uses the default values.
Fix by adding a "broken PSI values" quirk for this controller.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
gcc 4.8.2 fails to report this erroneous comparison unless assertions
are enabled.
Reported-by: Mary-Ann Johnson <MaryAnn.Johnson@displaylink.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The xHCI slot ID is one-based, not zero-based. Fix the length of the
xhci->slot[] array to account for this, and add assertions to check
that the hardware returns a valid slot ID in response to the Enable
Slot command.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When jumbo frames are enabled, the Linux ixgbe physical function
driver will disable the virtual function's receive datapath by
default, and will enable it only if the virtual function negotiates
API version 1.1 (or higher) and explicitly selects an MTU.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Virtual functions use a mailbox to communicate with the physical
function driver: this covers functionality such as obtaining the MAC
address.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Intel virtual function NICs almost work with the use of "legacy"
transmit and receive descriptors (which are backwards compatible right
back to the original Intel Gigabit NICs).
Unfortunately the "TX switching" feature (which allows for VM<->VM
traffic to be looped back within the NIC itself) does not work when a
legacy TX descriptor is used: the packet is instead sent onto the
wire.
Fix by allowing for the use of an "advanced" TX descriptor (containing
exactly the same information as is found in the "legacy" descriptor).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The recorded disconnections (in port->disconnected) will currently be
left uncleared if usb_attached() returns an error (e.g. because there
are no drivers for a particular USB device). This is incorrect
behaviour: the disconnection has been handled and the record should be
cleared until the next physical disconnection is detected (via the CSC
bit).
The problem is masked for EHCI, UHCI, and USB hubs, since these will
report a changed port (via usb_port_changed()) only when the
underlying hardware reports a change. xHCI will call
usb_port_changed() in response to any port status event, at which
point the stale value of port->disconnected will be erroneously acted
upon. This can lead to an endless loop of repeatedly enumerating the
same device when a driverless device is attached to an xHCI root hub
port.
Fix by unconditionally clearing port->disconnected in usb_hotplugged().
Reported-by: Robin Smidsrød <robin@smidsrod.no>
Tested-by: Robin Smidsrød <robin@smidsrod.no>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The action of registering a new hub can itself happen in only two
ways: either a new USB hub has been created (in which case we are
already inside a call to usb_hotplug()), or a new root hub has been
created.
In the former case, we do not need to issue a further call to
usb_hotplug(), since the hub's ports will all be marked as changed and
so will be handled after the return from register_usb_hub() anyway.
Calling usb_hotplug() within register_usb_hub() leads to a confusing
order of events, such as:
- root hub port 1 detects a change
- root hub port 2 detects a change
- usb_hotplug() is called
- root hub port 1 finds a USB hub
- usb_hotplug() is called
- this inner call to usb_hotplug() handles root hub port 2
Fix by calling usb_hotplug() only from usb_step() and from
register_usb_bus(). This avoids recursive calls to usb_hotplug() and
ensures that devices are enumerated in the order of detection.
Tested-by: Robin Smidsrød <robin@smidsrod.no>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When USB network card drivers are used, the BIOS' legacy USB
capability is necessarily disabled since there is no way to share the
host controller between the BIOS and iPXE. This currently results in
USB keyboards becoming non-functional in USB-enabled builds of iPXE.
Fix by adding basic support for USB keyboards, enabled by default in
iPXE builds which include USB support.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When an EHCI hotplug action results in the controller disowning the
port, it will result in a hotplug action on the corresponding UHCI or
OHCI controller. Allow such hotplug actions to be carried out as part
of the same call to usb_step() or usb_register_bus(), by maintaining a
single central list of changed ports.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The USB core will currently fail to detect disconnections if a new
device has attached by the time the port is examined in
usb_hotplug().
Fix by recording the fact that a disconnection has taken place
whenever the "connection status changed" (CSC) bit is observed to be
set. (Whether the change represents a disconnection or a
reconnection, it indicates that the port has experienced some time of
being disconnected.)
Note that the time at which a disconnection can be detected varies by
hub type. In particular: root hubs can observe the CSC bit when
polling, and so will record the disconnection before calling
usb_port_changed(), but USB hubs read the port status (and hence the
CSC bit) only during the call to hub_speed(), long after the call to
usb_port_changed().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Rename PCI_CLASS() (which constructs a struct pci_class_id) to
PCI_CLASS_ID(), and provide PCI_CLASS() as a macro which constructs
the 24-bit scalar value of a PCI class code.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The USB API currently assumes that host controllers will have
immediate data buffer space available in which to store the setup
packet. This is true for xHCI, partially true for EHCI (which happens
to have 12 bytes of padding in each transfer descriptor due to
alignment requirements), and not true at all for UHCI.
Include the setup packet within the I/O buffer passed to the host
controller's message() method, thereby eliminating the requirement for
host controllers to provide immediate data buffers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The current API for Base16 (and Base64) encoding requires the caller
to always provide sufficient buffer space. This prevents the use of
the generic encoding/decoding functionality in some situations, such
as in formatting the hex setting types.
Implement a generic hex_encode() (based on the existing
format_hex_setting()), implement base16_encode() and base16_decode()
in terms of the more generic hex_encode() and hex_decode(), and update
all callers to provide the additional buffer length parameter.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
This changed in Linux kernel the same way in commit 7067e701
("ath9k_hw: remove confusing logic inversion in an ANI variable") by
Felix Fietkau.
Additionally this fixes "error: logical not is only applied to the
left hand side of comparison" with GCC 5.1.0.
Signed-off-by: Christian Hesse <mail@eworm.de>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
I218-LM (rev 3) is found in Lenovo Thinkpad X250. The remaining
device IDs are from linux/drivers/net/ethernet/intel/e1000e/hw.h
Signed-off-by: Christian Hesse <mail@eworm.de>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On some RTL8169 onboard NICs (observed with a Lenovo ThinkPad 11e),
the EEPROM is not merely not present: any attempt to read from the
non-existent EEPROM will crash and reboot the system.
The equivalent code to read from the EEPROM was removed from the Linux
r8169 driver in 2009 with a comment suggesting that it was similarly
found to be unreliable on some systems.
Fix by accessing the EEPROM only on RTL8139 NICs, and assuming that
the MAC address will always be correctly preset on RTL8169 NICs.
Reported-by: Evan Prohaska <eprohaska@edkey.org>
Tested-by: Evan Prohaska <eprohaska@edkey.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The emulated Intel 82545em in some versions of VMware (observed with
ESXi v5.1) seems to sometimes fail to set the RXT0 bit in the
interrupt cause register (ICR), causing iPXE to stop receiving
packets. Work around this problem (for the 82545em only) by always
polling the receive queue regardless of the state of the ICR.
Reported-by: Slava Bendersky <volga629@networklab.ca>
Tested-by: Slava Bendersky <volga629@networklab.ca>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
At least one NII implementation (in a Microsoft Surface tablet) seems
to fail to report the absence (sic) of TX completions properly. Work
around this by checking for TX completions only when we expect to see
one.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some NII implementations will fail the GET_STATUS operation if we
request the media status. Fix by doing so only if GET_INIT_INFO
reported that media status is supported.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
In theory USB3 ports do not require a reset to enable the port.
Experimentation shows that this is sometimes required, particularly
when rerouting ports from EHCI to xHCI and switching speeds.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
xHCI provides a somewhat convoluted mechanism for specifying details
of a transaction translator. Hubs must be marked as such in the
device slot context. The only opportunity to do so is as part of a
Configure Endpoint command, which can be executed only when opening
the hub's interrupt endpoint.
We add a mechanism for host controllers to intercept the opening of
hub devices, providing xHCI with an opportunity to update the internal
device slot structure for the corresponding USB device to indicate
that the device is a hub. We then include the hub-specific details in
the input context whenever any Configure Endpoint command is issued.
When a device is opened, we record the device slot and port for its
transaction translator (if any), and supply these as part of the
Address Device command.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Support low-speed and full-speed devices attached to a USB2 hub. Such
devices use a transaction translator (TT) within the USB2 hub, which
asynchronously initiates transactions on the lower-speed bus and
returns the result via a split completion on the high-speed bus.
We make the simplifying assumption that there will never be more than
sixteen active interrupt endpoints behind a single transaction
translator; this assumption allows us to schedule all periodic start
splits in microframe 0 and all periodic split completions in
microframes 2 and 3. (We do not handle isochronous endpoints.)
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The current endpoint reset logic defers the reset until the caller
attempts to enqueue a new transfer to that endpoint. This is
insufficient when dealing with endpoints behind a transaction
translator, since the transaction translator is a resource shared
between multiple endpoints.
We cannot reset the endpoint as part of the completion handling, since
that would introduce recursive calls to usb_poll(). Instead, we
add the endpoint to a list of halted endpoints, and perform the reset
on the next call to usb_step().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The endpoint may already have enqueued TRBs at the time that
xhci_endpoint_reset() is called. Ring the doorbell to resume
processing these TRBs immediately, rather than waiting until the next
call to xhci_endpoint_message() or xhci_endpoint_stream().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Several of the USB timeouts were chosen on the principle of "pick an
arbitrary but ridiculously large value, just to be safe". It turns
out that some of the timeouts permitted by the USB specification are
even larger: for example, control transactions are allowed to take up
to five seconds to complete.
Fix up these USB timeout values to match those found in the USB2
specification.
Debugged-by: Robin Smidsrød <robin@smidsrod.no>
Tested-by: Robin Smidsrød <robin@smidsrod.no>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
xHCI (and EHCI) nominally provide a mechanism for releasing ownership
of the host controller back to the BIOS, which can then potentially
restore legacy USB keyboard functionality.
This is a rarely used code path, since most operating systems claim
ownership and never attempt to later return to the BIOS. On some
systems (observed with a Lenovo X1 Carbon), this code path leads to
obscure and interesting bugs: if the xHCI and EHCI controllers are
both claimed and later released back to the BIOS, then a subsequent
call to INT 16,0305 to set the keyboard repeat rate to a non-default
value will lock the system.
Obscure though this sequence of operations may sound, it is exactly
what happens when using iPXE to boot a Linux kernel via a USB network
card. There is old and probably unwanted code in Linux's
arch/x86/boot/main.c which sets the keyboard repeat rate (with the
accompanying comment "Set keyboard repeat rate (why?)"). When booting
Linux via a USB network card on a Lenovo X1 Carbon, the system
therefore locks up immediately after jumping to the kernel's entry
point.
Work around this problem by preventing the release of ownership back
to the BIOS if it is known that we are shutting down to boot an OS.
This should allow legacy USB keyboard functionality to be restored if
the user chooses to exit iPXE, while avoiding the rarely used code
paths (and corresponding BIOS bugs) if the user chooses instead to
boot an OS.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
If the BIOS fails to gracefully release ownership of the xHCI
controller, we can forcibly claim it by disabling all SMIs via the
USB legacy support control/status register.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
RX FIFO overflow is almost inevitable since the (usable) USB2 bus
bandwidth is approximately one quarter of the Ethernet bandwidth.
Avoid flooding the console with RX FIFO overflow messages in a
standard debug build.
With TCP SACK implemented, the RX FIFO overflow no longer causes a
catastrophic drop in throughput. Experimentation shows that HTTP
downloads now progress at a fairly smooth 250Mbps, which is around the
maximum speed attainable for a USB2 NIC.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
This driver is functional but any downloads via a TCP-based protocol
tend to perform poorly. The 1Gbps Ethernet line rate is substantially
higher than the 480Mbps (in practice around 280Mbps) provided by USB2,
and the device has only 32kB of internal buffer memory. Our 256kB TCP
receive window therefore rapidly overflows the RX FIFO, leading to
multiple dropped packets (usually within the same TCP window) and
hence a low overall throughput.
Reducing the TCP window size so that the RX FIFO does not overflow
greatly increases throughput, but is not a general-purpose solution.
Further investigation is required to determine how other OSes
(e.g. Linux) cope with this scenario. It is possible that
implementing TCP SACK would provide some benefit.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Most devices expose at least the link up/down status via a bit in a
MAC register, since the MAC generally already needs to know whether or
not the link is up. Some devices (e.g. the SMSC75xx USB NIC) expose
this information to software only via the MII registers.
Provide a generic mii_check_link() implementation to check the BMSR
and report the link status via netdev_link_{up,down}().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
iPXE already sends RX notifications to the backend when needed, but
does not set the "feature-rx-notify" flag. As of XenServer 6.5, this
flag is mandatory and omitting it will cause the backend to fail.
Fix by setting the "feature-rx-notify" flag, to inform the backend
that we will send notifications.
Reported-by: Shalom Bhooshi <shalom.bhooshi@citrix.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Restore the original values of XUSB2PR and USB3PSSEN, in case we are
booting an OS with no support for xHCI.
Suggested-by: Dan Ellis <Dan.Ellis@displaylink.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Intel PCH controllers default to routing USB2 ports to EHCI rather
than xHCI, and default to disabling SuperSpeed connections.
Manipulate the PCI configuration space registers as necessary to
reroute ports and enable SuperSpeed.
Originally-fixed-by: Dan Ellis <Dan.Ellis@displaylink.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Relicense files with kind permission from
Stefan Hajnoczi <stefanha@redhat.com>
alongside the contributors who have already granted such relicensing
permission.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
At some point in the past few years, binutils became more aggressive
at removing unused symbols. To function as a symbol requirement, a
relocation record must now be in a section marked with @progbits and
must not be in a section which gets discarded during the link (either
via --gc-sections or via /DISCARD/).
Update REQUIRE_SYMBOL() to generate relocation records meeting these
criteria. To minimise the impact upon the final binary size, we use
existing symbols (specified via the REQUIRING_SYMBOL() macro) as the
relocation targets where possible. We use R_386_NONE or R_X86_64_NONE
relocation types to prevent any actual unwanted relocation taking
place. Where no suitable symbol exists for REQUIRING_SYMBOL() (such
as in config.c), the macro PROVIDE_REQUIRING_SYMBOL() can be used to
generate a one-byte-long symbol to act as the relocation target.
If there are versions of binutils for which this approach fails, then
the fallback will probably involve killing off REQUEST_SYMBOL(),
redefining REQUIRE_SYMBOL() to use the current definition of
REQUEST_SYMBOL(), and postprocessing the linked ELF file with
something along the lines of "nm -u | wc -l" to check that there are
no undefined symbols remaining.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
These files cannot be automatically relicensed by util/relicense.pl
since they either contain unusual but trivial contributions (such as
the addition of __nonnull function attributes), or contain lines
dating back to the initial git revision (and so require manual
knowledge of the code's origin).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Relicence files with kind permission from the following contributors:
Alex Williamson <alex.williamson@redhat.com>
Eduardo Habkost <ehabkost@redhat.com>
Greg Jednaszewski <jednaszewski@gmail.com>
H. Peter Anvin <hpa@zytor.com>
Marin Hannache <git@mareo.fr>
Robin Smidsrød <robin@smidsrod.no>
Shao Miller <sha0.miller@gmail.com>
Thomas Horsten <thomas@horsten.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add the standard warranty disclaimer and Free Software Foundation
address paragraphs to the licence text where these are not currently
present.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When a command times out, abort it (via the Command Abort bit in the
Command Ring Control Register) so that subsequent commands may execute
as expected.
This improves robustness when a device fails to respond to the Set
Address command, since the subsequent Disable Slot command will now
succeed.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
If the Disable Slot command fails then the hardware may continue to
write to the slot context. Leak the memory used by the slot context
to avoid future memory corruption.
This situation has been observed in practice when a Set Address
command fails, causing the command ring to become temporarily
unresponsive.
Note that there is no need to similarly leak memory on the failure
path in xhci_device_open(), since in the event of a failure the
hardware is never informed of the slot context address.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide a generic framework for allocating, refilling, and optionally
recycling I/O buffers used by bulk IN and interrupt endpoints.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit a60f2dd ("[usb] Try multiple USB device configurations")
changed the behaviour of register_usb() such that if no drivers are
found then the device will be closed and the memory used will be
freed.
If a port status change subsequently occurs while the device is still
physically attached, then usb_hotplug() will see this as a new device
having been attached, since there is no device recorded as being
currently attached to the port. This can lead to spurious hotplug
events (or even endless loops of hotplug events, if the process of
opening and closing the device happens to generate a port status
change).
Fix by using a separate flag to indicate that a device is physically
attached (even if we have no corresponding struct usb_device).
Reported-by: Dan Ellis <Dan.Ellis@displaylink.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some xHCI controllers (observed with a Renesas Electronics PCIe USB3
card) seem to require a delay after forcing the link state of USB3
ports to RxDetect. Omitting this delay causes strange behaviour
including system lockups.
Add an unconditional 20ms delay after writing the port link states.
This seems to be sufficient to avoid the problem.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some USB endpoints require that a short packet be used to terminate
transfers, since they have no other way to determine message
boundaries. If the message length happens to be an exact multiple of
the USB packet size, then this requires the use of an additional
zero-length packet.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
USB Communications Device Class devices may use a union functional
descriptor to group several interfaces into a function.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Iterate over a USB device's available configurations until we find one
for which we have working drivers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some protocols (such as ARP) may modify the received packet and re-use
the same I/O buffer for transmission of a reply. To allow this,
reserve sufficient headroom at the start of each received packet
buffer for our transmit datapath headers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some devices have a very small number of internal buffers, and rely on
being able to pack multiple packets into each buffer. Using 2048-byte
buffers on such devices produces throughput of around 100Mbps. Using
a small number of much larger buffers (e.g. 32kB) increases the
throughput to around 780Mbps. (The full 1Gbps is not reached because
the high RTT induced by the use of multi-packet buffers causes us to
saturate our 256kB TCP window.)
Since allocation of large buffers is very likely to fail, allocate the
buffer set only once when the device is opened and recycle buffers
immediately after use. Received data is now always copied to
per-packet buffers.
If allocation of large buffers fails, fall back to allocating a larger
number of smaller buffers. This will give reduced performance, but
the device will at least still be functional.
Share code between the interrupt and bulk IN endpoint handlers, since
the buffer handling is now very similar.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow drivers to specify a supported PCI class code. To save space in
the final binary, make this an attribute of the driver rather than an
attribute of a PCI device ID list entry.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The (undocumented) VMBus protocol seems to allow for transfer
page-based packets where the data payload is split into an arbitrary
set of ranges within the transfer page set.
The RNDIS protocol includes a length field within the header of each
message, and it is known from observation that multiple RNDIS messages
can be concatenated into a single VMBus message.
iPXE currently assumes that the transfer page range boundaries are
entirely arbitrary, and uses the RNDIS header length to determine the
RNDIS message boundaries.
Windows Server 2012 R2 generates an RNDIS_INDICATE_STATUS_MSG for an
undocumented and unknown status code (0x40020006) with a malformed
RNDIS header length: the length does not cover the StatusBuffer
portion of the message. This causes iPXE to report a malformed RNDIS
message and to discard any further RNDIS messages within the same
VMBus message.
The Linux Hyper-V driver assumes that the transfer page range
boundaries correspond to RNDIS message boundaries, and so does not
notice the malformed length field in the RNDIS header.
Match the behaviour of the Linux Hyper-V driver: assume that the
transfer page range boundaries correspond to the RNDIS message
boundaries and ignore the RNDIS header length. This avoids triggering
the "malformed packet" error and also avoids unnecessary data copying:
since we now have one I/O buffer per RNDIS message, there is no longer
any need to use iob_split().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Empirical observation suggests that 32 is a sensible size to minimise
the number of deferred packet transmissions without overflowing the
VMBus transmit ring buffer.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for elision of transmitted TCP ACKs by handling all received
VMBus messages in each network device poll operation.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On Windows Server 2012 R2, the receive buffer teardown completion
message seems to occasionally be deferred until after the VMBus
channel has been closed. This happens even if there are no packets
currently in the receive buffer.
Work around this problem by separating the revocation and teardown of
the receive buffer, and deferring the teardown until after the VMBus
channel has been closed.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The i350 (and possibly other Intel NICs) have a non-trivial
correspondence between the PCI function number and the external
physical port number. For example, the i350 has a "LAN Function Sel"
bit within the EEPROM which can invert the mapping so that function 0
becomes port 3, function 1 becomes port 2, etc.
Unfortunately the MAC addresses within the EEPROM are indexed by
physical port number rather than PCI function number. The end result
is that when anything other than the default mapping is used, iPXE
will use the wrong address as the base MAC address.
Fix by using the autoloaded MAC address if it is valid, and falling
back to reading the MAC address directly from the EEPROM only if no
autoloaded address is available.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
End users almost certainly don't care whether the underlying interface
is SNP or NII/UNDI. Try to minimise surprise and unnecessary
documentation by including the NII driver whenever the SNP driver is
requested.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
iPXE itself exposes a dummy NII protocol with no UNDI. Avoid
potentially dereferencing a NULL pointer by checking for a non-zero
UNDI address.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some UEFI network drivers provide a software UNDI interface which is
exposed via the Network Interface Identifier Protocol (NII), rather
than providing a Simple Network Protocol (SNP).
The UEFI platform firmware will usually include the SnpDxe driver,
which attaches to NII and provides an SNP interface. The SNP
interface is usually provided on the same handle as the underlying NII
device. This causes problems for our EFI driver model: when
efi_driver_connect() detaches existing drivers from the handle it will
cause the SNP interface to be uninstalled, and so our SNP driver will
not be able to attach to the handle. The platform firmware will
eventually reattach the SnpDxe driver and may attach us to the SNP
handle, but we have no way to prevent other drivers from attaching
first.
Fix by providing a driver which can attach directly to the NII
protocol, using the software UNDI interface to drive the network
device.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The snpnet driver uses netdev_tx_defer() and so must ensure that space
in the (single-entry) transmit descriptor ring is freed up before
calling netdev_tx_complete().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add the ID for the LM variant and differentiate it from the I217-V.
Signed-off-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently require information about the underlying PCI device to
populate the snpnet device's name and description. If the underlying
device is not a PCI device, this will fail and prevent the device from
being registered.
Fix by falling back to populating the device description with
information based on the EFI handle, if no PCI device information is
available.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some systems will install a child of the SNP device and use this as
our loaded image's device handle, duplicating the installation of the
underlying SNP protocol onto the child device handle. On such
systems, we want to end up driving the parent device (and
disconnecting any other drivers, such as MNP, which may be attached to
the parent device).
Fix by recording the SNP protocol instance at initialisation time, and
using this to match against device handles (rather than simply
comparing the handles themselves).
Reported-by: Jarrod Johnson <jarrod.b.johnson@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
ICH8 devices have an errata which requires us to reconfigure the
packet buffer size (PBS) register, and correspondingly adjust the
packet buffer allocation (PBA) register. The "Intel I/O Controller
Hub ICH8/9/10 and 82566/82567/82562V Software Developer's Manual"
notes for the PBS register that:
10.4.20 Packet Buffer Size - PBS (01008h; R/W)
Note: The default setting of this register is 20 KB and is
incorrect. This register must be programmed to 16 KB.
Initial value: 0014h
0018h (ICH9/ICH10)
It is unclear from this comment precisely which devices require the
workaround to be applied. We currently attempt to err on the side of
caution: if we detect an initial value of either 0x14 or 0x18 then the
workaround will be applied. If the workaround is applied
unnecessarily, then the effect should be just that we use less than
the full amount of the available packet buffer memory.
Unfortunately this approach does not play nicely with other device
drivers. For example, the Linux e1000e driver will rewrite PBA while
assuming that PBS still contains the default value, which can result
in inconsistent values between the two registers, and a corresponding
inability to transmit or receive packets. Even more unfortunately,
the contents of PBS and PBA are not reset by anything less than a
power cycle, meaning that this error condition will survive a hardware
reset.
The Linux driver (written and maintained by Intel) applies the PBS/PBA
errata workaround only for devices in the ICH8 family, identified via
the PCI device ID. Adopt a similar approach, using the PCI_ROM()
driver data field to indicate when the workaround is required.
Reported-by: Donald Bindner <dbindner@truman.edu>
Debugged-by: Donald Bindner <dbindner@truman.edu>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Under some circumstances (e.g. if iPXE itself is booted via iSCSI, or
after an unclean reboot), the backend may not be in the expected
InitWait state when iPXE starts up.
There is no generic reset mechanism for Xenbus devices. Recent
versions of xen-netback will gracefully perform all of the required
steps if the frontend sets its state to Initialising. Older versions
(such as that found in XenServer 6.2.0) require the frontend to
transition through Closed before reaching Initialising.
Add a reset mechanism for netfront devices which does the following:
- read current backend state
- if backend state is anything other than InitWait, then set the
frontend state to Closed and wait for the backend to also reach
Closed
- set the frontend state to Initialising and wait for the backend to
reach InitWait.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Using version 1 grant tables limits guests to using 16TB of grantable
RAM, and prevents the use of subpage grants. Some versions of the Xen
hypervisor refuse to allow the grant table version to be set after the
first grant references have been created, so the loaded operating
system may be stuck with whatever choice we make here. We therefore
currently use version 2 grant tables, since they give the most
flexibility to the loaded OS.
Current versions (7.2.0) of the Windows PV drivers have no support for
version 2 grant tables, and will merrily create version 1 entries in
what the hypervisor believes to be a version 2 table. This causes
some confusion.
Avoid this problem by attempting to use version 1 tables, since
otherwise we may render Windows unable to boot.
Play nicely with other potential bootloaders by accepting either
version 1 or version 2 grant tables (if we are unable to set our
requested version).
Note that the use of version 1 tables on a 64-bit system introduces a
possible failure path in which a frame number cannot fit into the
32-bit field within the v1 structure. This in turn introduces
additional failure paths into netfront_transmit() and
netfront_refill_rx().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The behavior observed in the Apple EFI (1.10) RecieveFilters() call
is:
- failure if any of the PROMISCUOUS or MULTICAST filters are
included
- success if only UNICAST is included, however the result is
UNICAST|BROADCAST
- success if only UNICAST and BROADCAST are included
- if UNICAST, or UNICAST|BROADCAST are used, but the previous call
tried (and failed) to set UNICAST|BROADCAST|MULTICAST, then the
result is UNICAST|BROADCAST|MULTICAST
Work around this apparently broken SNP implementation by trying
RecieveFilterMask, then falling back to UNICAST|BROADCAST|MULTICAST,
then UNICAST|BROADCAST, and finally UNICAST.
Modified-by: Michael Brown <mcb30@ipxe.org>
Tested-by: Curtis Larsen <larsen@dixie.edu>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some EFI 1.10 systems (observed on an Apple iMac) do not allow us to
open the device path protocol with an attribute of
EFI_OPEN_PROTOCOL_BY_DRIVER and so we cannot maintain a safe,
long-lived pointer to the device path. Work around this by instead
opening the device path protocol with an attribute of
EFI_OPEN_PROTOCOL_GET_PROTOCOL whenever we need to use it.
Debugged-by: Curtis Larsen <larsen@dixie.edu>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
According to the UEFI specification, the MCastFilter parameter (which
we currently pass as NULL, along with a zero MCastFilterCnt) is
optional only if ResetMCastFilter is true.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Dump the existing openers of a protocol whenever we are unable to open
a protocol using attributes of BY_DEVICE, EXCLUSIVE, or
BY_CHILD_CONTROLLER.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Using efi_devpath_text() is marginally more efficient if we already
have the device path protocol available, but the mild increase in
efficiency is not worth compromising the clarity of the pattern:
DBGC ( device, "THING %p %s ...", device, efi_handle_name ( device ) );
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Rewrite the SNP NIC driver to use non-blocking and deferrable
transmissions, to provide link status detection, to provide
information about the underlying (PCI) hardware device, and to avoid
unnecessary I/O buffer allocations during receive polling.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The VF might not have assigned a MAC address upon startup, and will
end up with a random MAC address during probe(). With this patch the
MAC address can be changed later on.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
If the VF doesn't have a MAC address assigned we should create a
random MAC address.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The iBFT includes an "origin" field to indicate the source of the IP
address. We use the heuristic of assuming that the source should be
"manual" if the IP address originates directly from the network device
settings block, and "DHCP" otherwise. This is an imperfect guess, but
is likely to be correct in most common situations.
Originally-implemented-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Parse the sense data to extract the reponse code, the sense key, the
additional sense code, and the additional sense code qualifier.
Originally-implemented-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
As of commit d28bb51 ("[tcp] Defer sending ACKs until all received
packets have been processed"), increasing the RX ring size will
increase the number of received packets per transmitted ACK (since
each poll will process up to one complete receive ring). Under KVM,
this can make a substantial (up to ~200%) difference to the overall
download speed, since transmissions are very expensive.
Increase the ring fill level from four to eight packets: this
increases the download speed by around 50% at a cost of around 8kB of
heap space. Further speedups are possible by increasing the ring size
further, but it would be preferable to find alternative methods which
do not use noticeable amounts of heap space.
Tested-by: Robin Smidsrød <robin@smidsrod.no>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When profiling, exclude any time spent inside the hypervisor
responding to our MMIO accesses. This substantially reduces the
variance accumulated on many other profilers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Inside a virtual machine, writing the RX ring tail pointer may incur a
substantial overhead of processing inside the hypervisor. Minimise
this overhead by writing the tail pointer once per batch of
descriptors, rather than once per descriptor.
Profiling under qemu-kvm (version 1.6.2) shows that this reduces the
amount of time taken to refill the RX descriptor ring by around 90%.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Operations which are negligible on physical hardware (such as issuing
a posted write to the transmit ring tail register) may involve
substantial amounts of processing within the hypervisor if running in
a virtual machine.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
It is unclear from the datasheets whether or not the TX ring can be
completely filled (i.e. whether writing the tail value as equal to the
current head value will cause the ring to be treated as completely
full or completely empty). It is very plausible that this edge case
could differ in behaviour between real hardware and the many
implementations of an emulated Intel NIC found in various virtual
machines. Err on the side of caution and always leave at least one
ring entry empty.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On an Asus Z87-K motherboard with an onboard 8168 NIC, booting into
Windows 7 and then warm rebooting into iPXE results in a broken RX
datapath: packets can be transmitted successfully but garbage is
received. A cold reboot clears the problem.
A dump of the PHY registers reveals only one difference: in the
failure case the bits ADVERTISE_PAUSE_CAP and ADVERTISE_PAUSE_ASYM are
cleared. Explicitly setting these bits does not fix the problem.
A dump of the MAC registers reveals a few differences, of which the
most obvious culprit is the undocumented bit 24 of the Receive
Configuration Register (RCR), which is set in the failure case.
Explicitly clearing this bit does fix the problem.
Reported-by: Sebastian Nielsen <ipxe@sebbe.eu>
Reported-by: Oliver Rath <rath@mglug.de>
Debugged-by: Sebastian Nielsen <ipxe@sebbe.eu>
Tested-by: Sebastian Nielsen <ipxe@sebbe.eu>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The fetch_setting() family of functions may currently modify the
definition of the specified setting (e.g. to add missing type
information). Clean up this interface by requiring callers to provide
an explicit buffer to contain the completed definition of the fetched
setting, if required.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Give tap devices a meaningful name, and avoid segmentation faults when
attempting to retrieve ${net0/bustype} by assigning a new bus type for
tap devices.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Prevent the card from flagging packets of 1518 bytes length as
overlength.
This fixes the High-MTU loopback test.
Signed-off-by: Thomas Miletich <thomas.miletich@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The 3c90x B and C revisions support rounding up the packet length to a
specific boundary. Disable this feature to avoid overlength packets.
This fixes the loopback test.
Signed-off-by: Thomas Miletich <thomas.miletich@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
According to the 3c90x datasheet we have to stall the upload (receive)
engine before setting the receive ring address.
Signed-off-by: Thomas Miletich <thomas.miletich@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some hardware (observed with an onboard RTL8168) will erroneously
report a buffer overflow error if the received packet exactly fills
the receive buffer.
Fix by adding an extra four bytes of padding to each receive buffer.
Debugged-by: Thomas Miletich <thomas.miletich@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Adrian Jamróz <adrian.jamroz@gmail.com>
Modified-by: Thomas Miletich <thomas.miletich@gmail.com>
Signed-off-by: Thomas Miletich <thomas.miletich@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Replace the old via-rhine driver with a new version using the iPXE
API.
Includes fixes by Thomas Miletich for:
- MMIO access
- Link detection
- RX completion in RX overflow case
- Reset and EEPROM reloading
- CRC stripping
- Missing cpu_to_le32() calls
- Missing memory barriers
Signed-off-by: Adrian Jamróz <adrian.jamroz@gmail.com>
Modified-by: Thomas Miletich <thomas.miletich@gmail.com>
Tested-by: Thomas Miletich <thomas.miletich@gmail.com>
Tested-by: Robin Smidsrød <robin@smidsrod.no>
Modified-by: Michael Brown <mcb30@ipxe.org>
Tested-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow values to be read from PCI configuration space using the syntax
${pci/<busdevfn>.<offset>.<length>}
where <busdevfn> is the bus:dev.fn address of the PCI device
(expressed as a single integer, as returned by ${net0/busloc}),
<offset> is the offset within PCI configuration space, and <length> is
the length within PCI configuration space.
Values are returned in reverse byte order, since PCI configuration
space is little-endian by definition.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
realtek_destroy_ring() currently does nothing if the card is operating
in legacy (pre-RTL8139C+) mode. In particular, the producer and
consumer counters are incorrectly left holding their current values.
Virtual hardware (e.g. the emulated RTL8139 in qemu and similar VMs)
is tolerant of this behaviour, but real hardware will fail to transmit
if the descriptors are not used in the correct order.
Fix by resetting the producer and consumer counters in
realtek_destroy_ring() even if the card is operating in legacy mode.
Reported-by: Gelip <mrgelip@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Create an explicit concept of "settings scope" and eliminate the magic
values used for numerical setting tags.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On some systems, it appears to be possible for writes to the EEPROM
registers to be delayed for long enough that the EEPROM's setup and
hold times are violated, resulting in invalid data being read from the
EEPROM.
Fix by inserting a PCI read cycle immediately after writes to
RTL_9346CR, to ensure that the write has completed before starting the
udelay() used to time the SPI bus transitions.
Reported-by: Gelip <mrgelip@gmail.com>
Tested-by: Gelip <mrgelip@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some older RTL8139 chips seem to not immediately update the
RTL_CR.BUFE bit in response to a write to RTL_CAPR. This results in
iPXE seeing a spurious zero-length received packet, and thereafter
being out of sync with the hardware's RX ring offset.
Fix by inserting an extra PCI read cycle after writing to RTL_CAPR, to
give the chip time to react before we next read RTL_CR.
Reported-by: Gelip <mrgelip@gmail.com>
Tested-by: Gelip <mrgelip@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some onboard RTL8169 NICs seem to leave the EEPROM pins disconnected.
The existing is_valid_ether_addr() test will not necessarily catch
this, since it expects a missing EEPROM to show up as a MAC address of
00:00:00:00:00:00 or ff:ff:ff:ff:ff:ff. When the EEPROM pins are
floating the MAC address may read as e.g. 00:00:00:00:0f:00, which
will not be detected as invalid.
Check the ID word in the first two bytes of the EEPROM (which should
have the value 0x8129 for all RTL8139 and RTL8169 chips), and use this
to determine whether or not an EEPROM is present.
Reported-by: Carl Karsten <carl@nextdayvideo.com>
Tested-by: Carl Karsten <carl@nextdayvideo.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Exploit the redefinition of iPXE error codes to include a "platform
error code" to allow for meaningful conversion of EFI_STATUS values to
iPXE errors and vice versa.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Intel 10 Gigabit NICs have a datapath that is almost
register-compatible with the Intel 1 Gigabit NICs. Expose common
functionality to avoid duplication of code in the new "intelx" driver.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Intel 10 Gigabit NICs use the same simplified (aka "legacy")
descriptor format and the same layout for descriptor register blocks
as the Intel 1 Gigabit NICs. The offsets of the descriptor register
blocks are not the same.
Simplify reuse of the existing code by removing all hardcoded offsets
for registers within descriptor register blocks, and ensuring that all
offsets are calculated using the descriptor register block base
address provided via intel_init_ring().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Avoid using UINT16 and similar typedefs, which are non-standard in the
iPXE codebase and generate conflicts when trying to include any of the
EFI headers.
Also fix trailing whitespace in the affected files, to prevent
complaints from git.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Remove macros which aren't used anywhere in the driver, and which
conflict with macros of the same name used in the EFI headers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Remove macros which aren't used anywhere in the driver, and which
conflict with macros of the same name used in the EFI headers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The iBFT NIC section has a VLAN field which must be filled in so that
iSCSI booting works over VLANs.
Unfortunately it is unclear from the IBM specification linked in
ibft.c whether the VLAN field is just the 802.1Q VLAN Identifier or
the full 802.1Q TCI. For now just fill in the VID, the Priority Code
Point and Drop Eligible Indicator could be set in the future if it
turns out they should be present too.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit 947976d ("[netdevice] Do not force a poll on net_tx()")
requires network devices to have TX rings that are sufficiently large
to allow a transmitted response to all packets received during a
single poll.
Reported-by: Robin Smidsrød <robin@smidsrod.no>
Tested-by: Robin Smidsrød <robin@smidsrod.no>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Intel NIC emulation in some versions of VMware seems to suffer
from a flaw whereby the Interrupt Cause Register (ICR) fails to assert
the usual "packet received" bit (ICR.RXT0) if a receive overflow
(ICR.RXO) has also occurred.
Work around this flaw by polling for completed descriptors whenever
either ICR.RXT0 or ICR.RXO is asserted.
Reported-by: Miroslav Halas <miroslav.halas@bankofamerica.com>
Debugged-by: Miroslav Halas <miroslav.halas@bankofamerica.com>
Tested-by: Miroslav Halas <miroslav.halas@bankofamerica.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Almost all clients of the raw-packet interfaces (UNDI and SNP) can
handle only Ethernet link layers. Expose an Ethernet-compatible link
layer to local clients, while remaining compatible with IPoIB on the
wire. This requires manipulation of ARP (but not DHCP) packets within
the IPoIB driver.
This is ugly, but it's the only viable way to allow IPoIB devices to
be driven via the raw-packet interfaces.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some RTL8169 cards seem to drive the EEPROM CS line high (i.e. active)
when 9346CR.EEM is set to "normal operating mode", with the result
that the CS line is never deasserted. The symptom of this is that the
first read from the EEPROM will work, while all subsequent reads will
return garbage data.
Reported-by: Thomas Miletich <thomas.miletich@gmail.com>
Debugged-by: Thomas Miletich <thomas.miletich@gmail.com>
Tested-by: Thomas Miletich <thomas.miletich@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some RTL8169 cards (observed with an RTL8169SC) power up advertising
only 100Mbps, despite being capable of 1000Mbps. Forcibly enable
advertisement of 1000Mbps on any RTL8169-like card.
This change relies on the assumption that the CTRL1000 register will
not exist on 100Mbps-only RTL8169 cards such as the RTL8101.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some RTL8169 cards (observed with an RTL8169SC) crash and burn if DAC
is enabled, even if only 32-bit addresses are used. Observed
behaviour includes system lockups and repeated transmission of garbage
data onto the wire.
This seems to be a known problem. The Linux r8169 driver disables DAC
by default and provides a "use_dac" module parameter.
There appears to be no known test for determining whether or not DAC
will work. As a workaround, enable DAC only if we are built as as
64-bit binary. This at least eliminates the problem in the common
case of a 32-bit build, which will never use 64-bit addresses anyway.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some bits in the C+ Command register are always one. Testing for the
presence of the register must allow for this.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some RTL8169 cards (observed with an RTL8169SC) power up with
TCR.MXDMA set to 16 bytes. While this does not prevent proper
operation, it almost certainly degrades performance.
Fix by explicitly setting TCR.MXDMA to "unlimited".
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some RTL8169 cards (observed with an RTL8169SC) power up with invalid
values in RCR.RXFTH and RCR.MXDMA, causing receive DMA to fail. Fix
by setting explicit values for both fields.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some RTL8169 cards (observed with an RTL8169SC) power up with garbage
values in the ring address registers, and do not clear the registers
on reset.
Fix by always setting the high dword of the ring address registers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Change the DMA alignment from 4096 bytes to 16 bytes, to conserve
available DMA memory. The hardware doesn't have any specific
alignment requirements.
Signed-off-by: Thomas Miletich <thomas.miletich@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Datasheet pp. 41-42 defines 'rx packet length' as upper word of
'status' dword field of the receive descriptor table.
http://www.smsc.com/media/Downloads_Archive/discontinued/83c171.pdf
Tested on SMC EtherPower II.
Signed-off-by: Alexey Smazhenko <darkover@corbina.com.ua>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
gcc 4.7 produces a spurious warning about an array subscript being out
of bounds. Use a pointer dereference instead of an array lookup to
inhibit this spurious warning.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On i350 the datasheet contradicts itself in stating that the default
value of RXDCTL.ENABLE for queue zero is both set (according to the
"Receive Initialization" section) and unset (according to the "Receive
Descriptor Control - RXDCTL" section). Empirical evidence suggests
that the default value is unset.
Explicitly enable both transmit and receive queues to avoid any
ambiguity.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On 82576 (and probably others), the datasheet states that "the tail
register of the queue (RDT[n]) should not be bumped until the queue is
enabled". There is some confusion over exactly what constitutes
"enabled": the initialisation blurb says that we should "poll the
RXDCTL register until the ENABLE bit is set", while the description
for the RXDCTL register says that the ENABLE bit is set by default
(for queue zero). Empirical evidence suggests that the ENABLE bit
reads as set immediately after writing to RCTL.EN, and so polling is
not necessary.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use hw pointer in PCI driver data as expected by sky2_remove().
Signed-off-by: Valentine Barshak <gvaxon@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
RTL8139C+ cards use essentially the same datapath as RTL8169, which is
zerocopy and 64-bit capable. Older RTL8139 cards use a single receive
ring buffer rather than a descriptor ring, but still share substantial
amounts of functionality with RTL8169.
Include support for RTL8139 cards within the generic Realtek driver,
since there is no way to differentiate between RTL8139 and RTL8139C+
cards based on the PCI IDs alone.
Many thanks to all the people who worked on the rtl8139 driver over
the years.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The link state is currently set at probe time, and updated only when
the device is polled. This results in the user seeing a misleading
stale "Link: down" message, if autonegotiation did not complete within
the short timespan of the probe routine.
Fix by updating the link state when the device is opened, so that the
message that ends up being displayed to the user reflects the real
link state at device open time.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Tested-by: Thomas Miletich <thomas.miletich@gmail.com>
Debugged-by: Thomas Miletich <thomas.miletich@gmail.com>
Tested-by: Robin Smidsrød <robin@smidsrod.no>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
iPXE provides no support for manually configuring the link speed.
Provide a generic routine which should be able to reset any MII/GMII
PHY and enable autonegotiation.
Prototyped-by: Thomas Miletich <thomas.miletich@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add support for 82579-based chips such as those found on Sandy Bridge
motherboards. Based on d3738bb8203acf8552c3ec8b3447133fc0938ddd in
Linux.
Signed-off-by: Daniel Hokka Zakrisson <daniel@hozac.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>