How Vision-Language Models Generate Text