Thesis. The paper argues that large, generative video models trained at scale are beginning to function as generalist vision systems—capable of performing many downstream tasks without explicit fine‑tuning—much as LLMs did for language. Using Veo 3 as a case study, the authors present qualitative breadth and quantitative benchmarks that together suggest emergent zero‑shot perception, manipulation, and early forms of visual reasoning. Video models are zero-shot lear…

Context and framing. The introduction explicitly draws a parallel to NLP’s shift from “one model per task” to unified LLMs, noting that video generators share the same training primitives (large scale, generative objectives, web data). The central question posed: Do video models develop general‑purpose vision abilities? The authors answer “yes,” supported by a study spanning 18,384 generated videos across 62 qualitative and 7 quantitative tasks (the count breakdown in Table 1 on p. 31 sums to 17,640 quantitative videos plus 744 qualitative samples = 18,384).

Method in brief. The team treats the deployed system as a black box with two pieces: an LLM-based prompt rewriter and Veo’s video generator. They seed each task with a static image (first frame) plus a text instruction; Veo then produces an 8‑second, 24 FPS, 720p video. Importantly, to isolate visual reasoning, they verify that a standalone LLM cannot reliably solve key tasks (e.g., specific maze/symmetry/navigation prompts) from the input image alone. (See method details on p. 2.)

Four tiers of capability

The work organizes results into a hierarchy (Perception → Modeling → Manipulation → Reasoning) and visualizes breadth in Figure 1 (p. 2), which plots per‑task success rates across 62 diverse tasks. The montage in Figure 2 (p. 3) illustrates exemplar successes at each tier.

  1. Perception.
    Zero-shot edge detection, segmentation, keypoints, super‑resolution, deblurring, denoising, and low‑light enhancement are shown, along with more cognitively interesting cases (conjunctive search, Dalmatian illusion, cue‑conflict, Rorschach blots). Quantitatively, Veo 3 achieves OIS (best‑frame pass@10) ≈ 0.77 on BIPEDv2 edge detection (Figure 3, p. 5), well short of a tuned SOTA model but noteworthy without task-specific training; many apparent “false positives” were real detail that the ground truth does not annotate (e.g., foliage/tire treads, shown on p. 31, Fig. 60). For class‑agnostic instance segmentation on easy LVIS images, Veo 3 reaches mIoU ≈ 0.74 best‑frame pass@10 with a green background (Figure 4, p. 6), underscoring prompt/visual‑context sensitivity (green‑screen prior).
  2. Modeling (intuitive physics & world state).
    The model displays proto‑physics: flammability; rigid/soft body dynamics; air resistance; buoyancy; optical refraction/reflectance; additive/subtractive color mixing; and simple categorical distinctions. See Figures 21–29 (pp. 19–22) for qualitative sequences, including buoyancy (cap floats, rock sinks; Figure 24, p. 21) and a glass sphere that flips the background (refraction; Figure 27, p. 22). It also retains scene state across zooming (Figure 31, p. 23).
  3. Manipulation (editing & imagination).
    Veo 3 can remove backgrounds, inpaint/outpaint, colorize, transfer style, perform text manipulations, compose scenes, generate novel views, re‑pose 3D characters, and even produce professional portrait variants (Figures 32–43, pp. 23–26). In a quantitative “object extraction” test (count and align animals), Veo 3 reaches 93% pass@10 on the last frame (Figure 5, p. 6). In a small human study on Emu‑Edit samples, raters preferred Veo 3 over Veo 2 in both edit fidelity and precision (Figure 6, p. 7).
  4. Reasoning across space and time (“chain‑of‑frames”).
    The paper introduces Chain‑of‑Frames (CoF) as an analog to LLMs’ Chain‑of‑Thought: because the model must generate a temporally coherent sequence, it can “think” by doing—stepwise, visually. Tasks include tree BFS, graph traversal, sequence completion, symmetry completion, tool use, toy Sudoku, mazes, and rule extrapolation (Figures 48–59, pp. 28–30). Quantitatively:
    • Mazes: On 5×5 grids, Veo 3 attains pass@10 ≈ 78%, far above Veo 2 (~16%); on irregular mazes, Veo 3 succeeds ~75% while a reference image editor fails (Figure 7, p. 7).
    • Visual symmetry: Veo 3 strongly outperforms Veo 2 and an image‑editor baseline on both shaped and random patterns; prompt choice shifts pass@1 by 40–64 points across splits (Figure 8, p. 8; Table 2, p. 40).
    • Analogies: Strong for color & resize; below chance (0.33) on reflect/rotate, revealing systematic biases (Figure 9, p. 8; majority‑vote dynamics in Figure 61, p. 38).
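
For readers unfamiliar with the mIoU figure quoted under Perception, the following is a minimal sketch of intersection-over-union on binary masks and its mean over several images. It is a simplified illustration, not the paper's evaluation code: the names `iou` and `mean_iou` are hypothetical, masks are plain nested lists, and how the authors extract masks from frames or match predicted instances to ground truth is not covered here.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # row-major binary mask (1 = object pixel)


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += bool(p) and bool(t)
            union += bool(p) or bool(t)
    # Assumption: two empty masks agree perfectly, so report IoU = 1.
    return inter / union if union else 1.0


def mean_iou(pairs: Sequence[tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, ground truth) pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
example_iou = iou(pred, target)  # 2 shared pixels out of 4 in the union
```

Under this definition, a score near 0.74 means that predicted and reference masks overlap on roughly three quarters of their combined area on average.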

Interpreting the results

Two methodological choices matter. First, the authors report both best‑frame performance (an upper bound that assumes one can pick the right frame) and last‑frame performance (a pre‑specified target frame), acknowledging that the model often “keeps animating” beyond task completion, which depresses last‑frame scores. Second, they show consistent gains from Veo 2 → Veo 3 across tasks, and that pass@10 ≫ pass@1, implying room for test‑time compute strategies (e.g., sample‑and‑rank). Both patterns mirror the trajectory that turned early LLMs into practical generalists.
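
The difference between best‑frame and last‑frame scoring, and between pass@1 and pass@10, can be made concrete with a small sketch. It assumes each generated video has already been reduced to per‑frame success flags and that pass@k means "at least one of the first k attempts succeeds"; the paper may define or estimate pass@k differently (for continuous metrics such as OIS or mIoU it would presumably be the best score over k attempts). The names `solved` and `pass_at_k` and the toy data are hypothetical.

```python
from typing import Sequence

Attempt = Sequence[bool]  # per-frame success flags for one generated video


def solved(attempt: Attempt, frame: str) -> bool:
    """Score one video on its best frame (any frame succeeds) or its last frame."""
    if frame == "best":
        return any(attempt)
    if frame == "last":
        return bool(attempt[-1])
    raise ValueError(f"unknown frame mode: {frame!r}")


def pass_at_k(tasks: Sequence[Sequence[Attempt]], k: int, frame: str) -> float:
    """Fraction of tasks solved by at least one of the first k attempts."""
    hits = sum(any(solved(a, frame) for a in attempts[:k]) for attempts in tasks)
    return hits / len(tasks)


# Toy data: two tasks, two attempts each, three frames per video.
tasks = [
    # Task 0: the first video is correct mid-way, then keeps animating.
    [[False, True, False], [False, False, False]],
    # Task 1: only the second video succeeds, and it does so on the final frame.
    [[False, False, False], [False, False, True]],
]

best_at_1 = pass_at_k(tasks, 1, "best")
last_at_1 = pass_at_k(tasks, 1, "last")
best_at_2 = pass_at_k(tasks, 2, "best")
last_at_2 = pass_at_k(tasks, 2, "last")
```

In this toy case best‑frame scores are never below last‑frame scores, and both rise with k, which is the qualitative pattern the paragraph above describes.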

Conclusion. The core claim stands: with only prompting, a modern video model displays an unexpectedly wide breadth of competence across the vision stack and early forms of reasoning that require multi‑step, spatial‑temporal consistency. While per‑task ceilings lag bespoke systems, the breadth, upward trend with scale, and CoF behaviors support the view that video models are converging toward general‑purpose vision foundation models.
