But unless you’re running bs=1 it will be painful vs 8x GPU as you’re almost certain to be activating most/all of the experts in a batch.
But unless you’re running bs=1 it will be painful vs 8x GPU as you’re almost certain to be activating most/all of the experts in a batch.