To improve this, I have added the delays that the Tube protocol requires. Each transfer type has a specified minimum delay after setting up the transfer before the first byte should be fetched, and a further specified minimum delay between subsequent byte reads.
For the Type 0 transfer this routine uses, the delay in both cases is 24µs, which is equivalent to 48 cycles on a host running the code at 2MHz.
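
The conversion from a protocol delay to a cycle count is simple arithmetic, sketched below in Python. The function names `delay_cycles` and `nops_needed` are hypothetical helpers for illustration, not part of the routine itself. Padding with NOPs is only one possible way to burn the time, and a real loop would also count the cycles already spent in its own load, store and branch instructions.

```python
import math


def delay_cycles(delay_us: float, clock_mhz: float) -> int:
    """Minimum whole number of CPU cycles that covers delay_us at clock_mhz."""
    # One MHz is one cycle per microsecond, so cycles = microseconds * MHz.
    return math.ceil(delay_us * clock_mhz)


def nops_needed(cycles: int, nop_cycles: int = 2) -> int:
    """Number of NOPs needed to cover the given cycle count.

    Assumes padding purely with NOPs (2 cycles each on a 6502); this is an
    illustrative assumption, not necessarily how the routine pads its delay.
    """
    return math.ceil(cycles / nop_cycles)


# Type 0 transfer: 24 microseconds on a 2MHz host.
print(delay_cycles(24, 2))  # 48
```

The same function applies to the other transfer types once their specified delays are substituted for the 24µs figure.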