* Remove unnecessary wait in pico_divider.
There is no need to wait if there are more than 8 cycles between setup and result readout.
Dividend/divisor readout should be correct without delay. Update comment to reflect that.
* Optimize hw_divider_save_state/hw_divider_restore_state.
Doing multiple pushes to avoid stack usage is faster.
The wait loop in hw_divider_save_state had an incorrect branch.
This didn't matter since the wait wasn't necessary to begin with.
* Remove pointless aligns in hardware_divider.
The regular_func_with_section macro starts a new section, so if alignment
is desired it should be placed in the macro after the section start.
* Save a few bytes in hardware_divider.
Signed and unsigned code can use the same exit code.
Branching to the common code is free since we need the 8-cycle
delay anyway.
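
For reference, the timing argument behind the first and last items can be sketched as a small host-side model. This is only an illustration under stated assumptions: sim_divider_t, sim_div_start, sim_div_tick, sim_div_ready and SIM_DIV_LATENCY are hypothetical names, not SDK or hardware identifiers, and the 8-cycle latency is taken from the text above rather than measured. The model also does not attempt to mimic hardware divide-by-zero behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side model of the divider timing; not the SDK API. */
#define SIM_DIV_LATENCY 8u /* assumed cycles from operand write to valid result */

typedef struct {
    uint32_t dividend;
    uint32_t divisor;
    uint32_t quotient;
    uint32_t remainder;
    unsigned busy_cycles; /* cycles remaining until the result is valid */
} sim_divider_t;

/* Writing the operands starts a calculation. The operands themselves can be
 * read back immediately; only quotient/remainder need the latency to elapse. */
static void sim_div_start(sim_divider_t *d, uint32_t dividend, uint32_t divisor) {
    assert(divisor != 0u); /* divide-by-zero is outside this sketch */
    d->dividend = dividend;
    d->divisor = divisor;
    d->busy_cycles = SIM_DIV_LATENCY;
}

/* Advance the model by n cycles of unrelated work (for example a branch to
 * shared exit code). The result becomes valid when the counter reaches zero. */
static void sim_div_tick(sim_divider_t *d, unsigned n) {
    unsigned before = d->busy_cycles;
    d->busy_cycles = (n >= before) ? 0u : before - n;
    if (before != 0u && d->busy_cycles == 0u) {
        d->quotient = d->dividend / d->divisor;
        d->remainder = d->dividend % d->divisor;
    }
}

/* True once enough cycles have passed that no explicit wait is needed. */
static bool sim_div_ready(const sim_divider_t *d) {
    return d->busy_cycles == 0u;
}
```

In this model, any code path that already spends 8 cycles between sim_div_start and the readout sees sim_div_ready return true, so an extra wait loop adds nothing.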