Gemini 2.0 Flash Lite accepts multimodal inputs (text, images, audio, and documents) but produces text output only. The restriction is deliberate: the model targets tasks where understanding rich input and producing structured text (descriptions, labels, or summaries) is the entire job.
At $0.075 per million input tokens, Gemini 2.0 Flash Lite fits workloads where unit economics drive architecture decisions. Large image batches stay cheap enough for annotation pipelines, moderation queues, accessibility text generation, and visual extraction at scale.
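To make the unit economics concrete, here is a minimal back-of-the-envelope sketch. Only the $0.075 input price comes from the text above; the batch size and the tokens-per-image figure are hypothetical placeholders, and output-token charges are not modeled. Real token counts depend on image size and prompt length, so measure them before budgeting.

```python
# Input price quoted above: $0.075 per million input tokens.
PRICE_PER_MILLION_INPUT_TOKENS_USD = 0.075


def input_cost_usd(num_items: int, tokens_per_item: int,
                   price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS_USD) -> float:
    """Estimate input-side cost only; output tokens are billed separately."""
    total_tokens = num_items * tokens_per_item
    return total_tokens / 1_000_000 * price_per_million


# Hypothetical batch: 100,000 images at an ASSUMED 300 input tokens each
# (image plus a short prompt). The 300 figure is illustrative, not measured.
batch_cost = input_cost_usd(num_items=100_000, tokens_per_item=300)
print(f"Estimated input cost: ${batch_cost:.2f}")
```

Under those assumptions the batch consumes 30 million input tokens, so the input side of the bill comes to a few dollars. Swapping in measured token counts is the only change needed to reuse the estimate.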
The 1M-token context window accommodates long audio transcripts, multi-page documents, and extended image sequences within a single request. For ETL-style pipelines that process batches of mixed-modality records and need structured text output from each record, Gemini 2.0 Flash Lite offers the input flexibility of a multimodal model at a price closer to that of a text-only model.
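One way an ETL-style pipeline might use the large context window is to group records into as few requests as possible. The sketch below is a plain greedy packer, not part of any Gemini SDK: the function name, the record tuples, the per-record token estimates, and the reserve for instructions are all hypothetical, and the window is treated as a round 1,000,000 tokens.

```python
from typing import List, Tuple

# Context window from the text above, treated as a round figure (assumption).
CONTEXT_WINDOW_TOKENS = 1_000_000


def pack_records(records: List[Tuple[str, int]],
                 budget: int = CONTEXT_WINDOW_TOKENS,
                 reserve: int = 50_000) -> List[List[str]]:
    """Greedily group (record_id, estimated_tokens) pairs, preserving order,
    so each group fits the budget minus a reserve for prompt instructions."""
    limit = budget - reserve
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for record_id, tokens in records:
        if tokens > limit:
            raise ValueError(f"{record_id} alone exceeds the per-request limit")
        # Start a new request when the next record would overflow this one.
        if current and used + tokens > limit:
            batches.append(current)
            current, used = [], 0
        current.append(record_id)
        used += tokens
    if current:
        batches.append(current)
    return batches


# Hypothetical mixed-modality records with ASSUMED token estimates.
records = [
    ("audio-1", 400_000),
    ("pdf-1", 300_000),
    ("img-seq-1", 350_000),
    ("pdf-2", 100_000),
]
batches = pack_records(records)
print(batches)  # [['audio-1', 'pdf-1'], ['img-seq-1', 'pdf-2']]
```

Greedy packing keeps records in order and is simple to reason about, at the cost of occasionally leaving unused room in a request. In practice the token estimates would come from a token-counting step rather than hard-coded values.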