Checking back with the forum, I see some advice that the ENC28J60 is returning wildly invalid data (I added low-level logging of device register actions). The thing is, sometimes it works fine, and stays working fine. So I grope forward with some greybox tests:
- reproducibility: when it fails, it fails on the first packet. When it doesn’t fail, it can go on indefinitely (which here means about 2 hrs before I give up waiting).
- frequency: given the failure happens on the first packet, I can boot and see if the first packet passes or fails. I do about 40 boots, and get 45% failure.
- since that seems to suggest something setup-y, I add code to dump all the ENC28J60 registers right after init, and again right before pulling the first packet, looking for discrepancies and correlations.
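The register-dump comparison from the last test can be sketched roughly as below. This is a minimal illustration, not the code I actually ran: the `reg_sample_t` type and `diff_snapshots` function are hypothetical names, and the snapshots are assumed to have already been read off the chip by whatever SPI register-read routine the driver provides.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot entry: a register name and the value read back. */
typedef struct {
    const char *name;
    uint16_t    value;
} reg_sample_t;

/* Print every register whose value differs between two snapshots and
 * return how many differed. Both arrays are assumed to list the same
 * registers in the same order. */
size_t diff_snapshots(const reg_sample_t *before,
                      const reg_sample_t *after,
                      size_t count)
{
    size_t changed = 0;
    for (size_t i = 0; i < count; i++) {
        if (before[i].value != after[i].value) {
            printf("%-8s 0x%04X -> 0x%04X\n", before[i].name,
                   (unsigned)before[i].value, (unsigned)after[i].value);
            changed++;
        }
    }
    return changed;
}
```

Taking one snapshot right after init and one right before the first packet read, then diffing them on both passing and failing boots, is what makes a stray value stand out.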
The last test showed that all the registers were fine, but there is a slight complication. One of the registers (well, at least one) is a shadow register, and you don’t get the real value during initialization. The real value is only available after some activity. This is the ERXWRPT register, which is the receive buffer write pointer, i.e. where incoming network data is written. In the failure cases, I noticed that the pointer was not pointing within the receive buffer, but into the transmit buffer. That makes no sense, and should not be possible, because the hardware is supposed to constrain the pointer to lie between the receive buffer start and end.
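The sanity check on the write pointer amounts to a simple bounds test. A minimal sketch, assuming the three 16-bit values have already been read from the chip (the function name `rx_write_ptr_valid` is made up for illustration, and the example addresses in the usage note are assumptions, not my actual layout):

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the receive write pointer lies inside the receive
 * buffer, i.e. between the configured start (ERXST) and end (ERXND),
 * both inclusive. */
bool rx_write_ptr_valid(uint16_t erxst, uint16_t erxnd, uint16_t erxwrpt)
{
    return erxwrpt >= erxst && erxwrpt <= erxnd;
}
```

With a hypothetical layout that puts the transmit buffer at the bottom of memory and the receive buffer at, say, 0x0600 to 0x1FFF, a write pointer of 0x0000 fails this check, which matches what I saw in the failing boots.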
While reading the datasheet to figure out how the chip works, I did notice in the intro section (who reads those, right?) that you should consult the errata for post-ship bug notices. Why not? I went into it thinking that maybe I didn’t wait long enough after flicking on the powerctl before writing config, because that is something specific to this board, but noticed instead that there is an erratum 5 related to spurious reset of the ERXWRPT to 0, instead of to the receive buffer start. The workaround is to place your receive FIFO at address 0, so that a pointer reset to 0 still lands on the receive buffer start and the problem is masked. I did this and the problem went away. I left the board running overnight with continual network traffic.
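For illustration, a buffer layout that follows the workaround might look like the sketch below. The macro names and the exact split point are assumptions chosen for the example; the only fixed facts are that the ENC28J60 has 8 KB of buffer memory (0x0000 to 0x1FFF) and that the receive buffer has to start at 0x0000.

```c
#include <stdint.h>

/* Hypothetical layout of the 8 KB on-chip buffer (0x0000..0x1FFF). */
#define ENC_BUF_SIZE  0x2000u
#define RX_START      0x0000u  /* receive FIFO starts at 0, per the workaround */
#define RX_END        0x19FFu  /* assumed split point, not a required value */
#define TX_START      0x1A00u  /* transmit buffer takes the remaining space */
#define TX_END        0x1FFFu  /* 1536 bytes, enough for one full-size frame */

/* Compile-time checks so a later edit cannot quietly undo the workaround. */
_Static_assert(RX_START == 0x0000u, "receive buffer must start at address 0");
_Static_assert(RX_END < TX_START, "receive and transmit buffers must not overlap");
_Static_assert(TX_END < ENC_BUF_SIZE, "transmit buffer must fit in chip memory");
```

With this arrangement, a write pointer that is wrongly reset to 0 ends up exactly where it should have been anyway.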