0) What we’re optimizing for

Footprint & speed: 270M INT4 QAT keeps RAM/VRAM needs and power low (≈240 MB to load at Q4_0; additional memory for tokens/KV cache), making it a great fit for edge devices.  

Battery: Google’s internal test on a Pixel 9 Pro: ~0.75% battery for ~25 short chats with the INT4 variant—useful as a power envelope target.  

Context: Up to 32k tokens on 270M/1B (plan defaults below use 1–4k on mobile for stability).  

Distribution: Official LiteRT + MediaPipe LLM Inference give production‑grade Android/iOS/Web runtimes, with 270M IT builds already published in the LiteRT Community on Hugging Face.   

1) Architecture at a glance

On‑device first, cloud when needed

App (Android | iOS | Web)
├─ Local Inference (Gemma 3 270M INT4, default context 2k–4k)
│   ├─ JSON-mode prompts for structured outputs
│   ├─ LiteRT / MediaPipe LLM Inference runtime
│   └─ Local adapters (optional LoRA)
├─ Tooling (optional): Function Calling SDK for actions
└─ Escalation (router):
    – Larger local model (1B) on capable devices
    – Cloud (e.g., Gemini APIs) for long/complex tasks
  • Why this split: Empirically, 270M is excellent for structured extraction, classification, policy checks, routing, templated copy, smart‑reply—and avoids network and cost. Use a router for “hard” prompts (long context, multi‑hop). Google Developers Blog
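A minimal Kotlin sketch of that escalation decision; the RouteTarget/TaskKind names, the ~4 characters/token estimate, and the thresholds are illustrative assumptions, not an SDK API:

// Illustrative routing heuristic: small structured tasks stay on 270M, bigger prompts
// escalate to a 1B local model (if present) and then to the cloud.
enum class RouteTarget { LOCAL_270M, LOCAL_1B, CLOUD }
enum class TaskKind { EXTRACTION, CLASSIFICATION, ROUTING, TEMPLATED_COPY, OPEN_ENDED }

fun estimateTokens(text: String): Int = maxOf(1, text.length / 4) // crude ~4 chars/token proxy

fun route(prompt: String, task: TaskKind, has1B: Boolean): RouteTarget {
    val tokens = estimateTokens(prompt)
    return when {
        tokens <= 2_048 && task != TaskKind.OPEN_ENDED -> RouteTarget.LOCAL_270M
        tokens <= 4_096 && has1B -> RouteTarget.LOCAL_1B
        else -> RouteTarget.CLOUD // long-context or multi-hop work
    }
}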

2) Model artifacts & packaging

  • Primary: gemma-3-270m-it QAT/Q4_0 (instruction‑tuned) from Google/LiteRT channels. Prefer official LiteRT/MediaPipe‑ready packages when available; otherwise convert. Hugging Face
  • Where to fetch:
    LiteRT Community (Hugging Face): published 270M IT artifacts and guidance for Android/iOS/Web. Hugging Face
    Gemma 3 release overview (context windows, sizes, QAT): Google AI for Developers
  • Licensing: Weights are open under Gemma Terms of Use; require users to accept terms (HF gate) at first download. Wire this into the first‑run flow. Hugging Face

3) Android (Kotlin/Java) — production path

Runtime: MediaPipe LLM Inference API on LiteRT with GPU delegate fallback to CPU. Google AI for Developers

Steps

  1. Add dependency in build.gradle (versions per docs):
dependencies {
    implementation 'com.google.mediapipe:tasks-genai:0.10.24'
}

This provides LlmInference and options for temperature/top‑k, streaming, etc. Google AI for Developers

  2. Model distribution
  • Host .task or .bin (LiteRT/MediaPipe‑compatible) on your CDN; on first run, present the HF license screen, then download. Cache to app‑private storage and keep the on‑disk optimized layout cache (LiteRT performs a one‑time optimization at first load). Google Developers Blog
  • For MVP, you can side‑load during dev with adb push, but ship via runtime download for production. Google AI for Developers
  3. Initialization (simplified)
val opts = LlmInferenceOptions.builder()
    .setModelPath(localPath) // downloaded model
    .setMaxTokens(1024)      // include input+output
    .setTopK(40).setTemperature(0.8f).build()

val llm = LlmInference.createFromOptions(context, opts)
val result = llm.generateResponse(prompt)

Use generateResponseAsync for streaming. Google AI for Developers

  4. Performance defaults
  • Context: cap at 2k tokens by default; expose a setting for 4k on high‑end devices (S24U, Pixel 9).
  • Backend: allow users to pick GPU or CPU (UI toggle); GPU delegate improves throughput; KV‑cache layout and GPU weight‑sharing reduce latency/memory under the hood. Google Developers Blog
  • Battery budget: target ≤0.03% per turn for short prompts (0.75% ÷ 25 ≈ 0.03%); profile with your own prompt templates. Pixel 9 Pro reference: 0.75% per 25 short conversations. Google Developers Blog
  5. Optional features
  • RAG SDK (Edge): for simple doc QA with local embeddings. Google Developers Blog
  • Function Calling SDK (Edge): map model outputs to local actions (search, alarms). GitHub
  • LoRA adapters: if you fine‑tune for vertical tasks; Android supports LoRA (GPU path) with MediaPipe convertor. Google AI for Developers
  6. Sample apps to crib from: AI Edge Gallery and the LLM Demo. Google AI for Developers

4) iOS (Swift) — production path

Runtime: MediaPipe LLM Inference API on LiteRT with Metal acceleration; CocoaPods package. Google AI for Developers

Steps

  1. Pods
target 'MyLlmInferenceApp' do
  use_frameworks!
  pod 'MediaPipeTasksGenAI'
  pod 'MediaPipeTasksGenAIC'
end

Google AI for Developers

  2. Model distribution
  • Bundle tiny stubs only; download the real model on first run after the user accepts the Gemma terms (HF gate). Store it in app‑private documents, enable resumable downloads, and verify the hash before load. LiteRT will cache optimized tensor layouts to cut future load time. Google Developers Blog
  3. Init (simplified)
import MediaPipeTasksGenAI

let options = LlmInferenceOptions()
options.baseOptions.modelPath = localPath
options.maxTokens = 1024
options.topk = 40
options.temperature = 0.8
let llm = try LlmInference(options: options)
let out = try llm.generateResponse(inputText: prompt)

Use generateResponseAsync for streaming. Google AI for Developers

  4. Performance defaults
  • Context: default 2k; offer 4k on iPad Pro/modern iPhones.
  • Memory heads‑up: iOS enforces strict memory limits; keep KV‑cache conservative, and stream outputs.
  • LoRA: iOS supports static LoRA at init via converted adapters for Gemma classes; conversion via MediaPipe tools. Google AI for Developers
  5. Alt path (if you prefer an OSS stack): llama.cpp GGUF with Metal is viable, but MediaPipe/LiteRT will be simpler to maintain across OS updates. GitHub
  6. Sample: iOS MediaPipe LLM Inference sample app. Google AI for Developers

5) Web — two good options

Option A — MediaPipe LLM Inference (WebGPU)

import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

const genai = await FilesetResolver.forGenAiTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm'
);

const llm = await LlmInference.createFromOptions(genai, {
  baseOptions: { modelAssetPath: '/assets/gemma-3-270m-it.task' },
  maxTokens: 768, topK: 40, temperature: 0.8, randomSeed: 101
});
const text = await llm.generateResponse(prompt);

Option B — WebLLM (MLC)

  • Why B: Mature in‑browser LLM engine with OpenAI‑compatible API surface; broad model zoo (MLC format). Good if you already use MLC builds. GitHub, webllm.mlc.ai

Google’s launch post even highlights a Transformers.js demo using 270M in a browser—handy for very small apps. Google Developers Blog


6) Prompting & outputs (consistent across platforms)

  • JSON‑first templates (require strict keys, minimal prose).
  • Guardrails: temperature ≤0.4 for extraction/classification; ≤0.8 for copywriting.
  • Context management: sliding‑window (truncate oldest), summarize tails if >2k tokens on mobile; reserve ~30–40% of maxTokens for output (see the sketch after this list).
  • Function calling (Android today): use Edge FC SDK to map JSON to actions (search, reminders). GitHub
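
To make the template, guardrail, and budgeting bullets concrete, here is a minimal Kotlin sketch; the key names, the ~4 characters/token estimate, and the 35% output share are assumptions to tune (org.json.JSONObject is the parser bundled with Android):

import org.json.JSONObject // bundled with Android; any JSON parser works elsewhere

// JSON-first template: strict keys, no prose.
fun extractionPrompt(document: String): String = """
Answer with JSON only, using exactly these keys:
{"title": string, "date": string or null, "category": string}
No prose, no markdown.
Document:
$document
""".trimIndent()

// Reserve ~35% of maxTokens for output; when truncating, keep the newest text (sliding window).
fun fitToBudget(input: String, maxTokens: Int, outputShare: Double = 0.35): String {
    val inputBudgetChars = ((maxTokens * (1 - outputShare)) * 4).toInt() // ~4 chars per token
    return if (input.length <= inputBudgetChars) input else input.takeLast(inputBudgetChars)
}

// Unparseable output is a signal to retry at lower temperature or escalate to the cloud path.
fun parseOrNull(modelOutput: String): JSONObject? =
    runCatching { JSONObject(modelOutput.trim()) }.getOrNull()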

7) CI/CD for models

  1. Fine‑tune (full or LoRA) → export to LiteRT/MediaPipe (.task/.bin). (AI Edge Torch supports PyTorch→TFLite and integrates with LLM Inference.) GitHub
  2. Quantize: prefer QAT INT4 checkpoints to preserve quality at 4‑bit. Google Developers Blog
  3. Virus scan & hash artifacts; upload to private bucket + HF gated mirror if desired.
  4. Release train: semantic version model IDs, A/B via remote config, roll back by ID.
  5. Client downloads + verifies SHA‑256; keep per‑version caches for instant rollback.
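
A minimal Kotlin sketch of step 5, assuming you publish a SHA‑256 alongside each model version; the function names and per‑version directory layout are illustrative, not an SDK API:

import java.io.File
import java.security.MessageDigest

// Stream the file through SHA-256 and hex-encode the digest.
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buf = ByteArray(1 shl 16)
        while (true) {
            val n = input.read(buf)
            if (n <= 0) break
            digest.update(buf, 0, n)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

// Verify the download, then install it under a per-version directory so older
// versions stay cached for instant rollback.
fun installModel(download: File, modelsDir: File, versionId: String, expectedSha256: String): File {
    require(sha256Hex(download).equals(expectedSha256, ignoreCase = true)) { "Model hash mismatch" }
    val target = File(File(modelsDir, versionId), download.name)
    target.parentFile?.mkdirs()
    download.copyTo(target, overwrite = true)
    return target
}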

8) Performance budgets & test matrix

  • Load memory (model only): ~240 MB (Q4_0). Plan headroom for the KV cache, which scales with batch × heads × layers × context (see the sketch after this list). Google AI for Developers
  • TTFT target: <600 ms (warm) on flagship phones; <250 ms on desktop WebGPU. (Use MediaPipe samples to benchmark.) Google Developers Blog
  • Throughput: prioritize prefill speed (tokenizing + initial attention); keep max input <2k on mobile.
  • Matrix: Pixel 9 / S24U / mid‑range Android; iPhone 15/14; iPad Pro; Chrome/Safari/Edge (WebGPU on). Google AI for Developers
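
A rough Kotlin sizing helper for the KV‑cache headroom mentioned above; the layer/head/dimension values in the example call are placeholders, so read the real ones from the config of the checkpoint you ship:

// KV cache bytes ≈ 2 (K and V) × layers × kvHeads × headDim × context × bytes/element × batch.
fun kvCacheBytes(
    layers: Int, kvHeads: Int, headDim: Int,
    contextTokens: Int, bytesPerElement: Int = 2, batch: Int = 1
): Long = 2L * layers * kvHeads * headDim * contextTokens * bytesPerElement * batch

fun main() {
    // Placeholder dimensions; substitute the values from your model's config.
    val mb = kvCacheBytes(layers = 28, kvHeads = 4, headDim = 64, contextTokens = 2048) / (1024.0 * 1024.0)
    println("Approx KV cache at 2k context: %.1f MB".format(mb)) // budget this on top of ~240 MB of weights
}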

9) Privacy, safety, and compliance

  • On‑device by default: no payload leaves device for 270M path.
  • Content policy: enforce Gemma Prohibited Use + your own constraints; show the terms gate when fetching from HF. Google AI for Developers
  • Telemetry: opt‑in, coarse device class only; never log raw prompts.
  • Eval sets: per domain (e.g., classification/extraction for your content sites) and run pre‑/post‑deploy checks (precision/recall, JSON validity rate).
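
A small Kotlin sketch of the JSON‑validity check from the last bullet; the required key set is whatever your schema demands, and org.json.JSONObject (bundled with Android) stands in for any parser:

import org.json.JSONObject

// Share of eval outputs that parse as JSON and contain every required key.
fun jsonValidityRate(outputs: List<String>, requiredKeys: Set<String>): Double {
    if (outputs.isEmpty()) return 0.0
    val valid = outputs.count { raw ->
        runCatching { JSONObject(raw.trim()) }
            .map { obj -> requiredKeys.all(obj::has) }
            .getOrDefault(false)
    }
    return valid.toDouble() / outputs.size
}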

10) Rollout plan (4 sprints)

Sprint 0—Plumbing

  • Pick Option A (MediaPipe) for all three platforms to minimize divergence.
  • Wire model download + license acceptance.
  • Ship a hidden diagnostics screen: device info, backend, TTFT, tokens/s.
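
One possible shape for those diagnostics, counting streamed partial results as a rough token proxy; the GenerationStats name is illustrative, not an SDK type:

class GenerationStats {
    private val start = System.nanoTime()
    private var firstPartialAt = 0L
    private var partials = 0

    // Call from the streaming callback for every partial result.
    fun onPartial() {
        if (partials == 0) firstPartialAt = System.nanoTime()
        partials++
    }

    fun ttftMs(): Long = if (partials == 0) -1 else (firstPartialAt - start) / 1_000_000

    fun tokensPerSecond(): Double {
        val elapsedSec = (System.nanoTime() - start) / 1e9
        return if (elapsedSec > 0) partials / elapsedSec else 0.0
    }
}

Wire onPartial() into the streaming listener on each platform (Android setResultListener, the iOS async sequence, the web progress callback) and surface ttftMs() and tokensPerSecond() alongside device info and backend.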

Sprint 1—MVP

  • Android + Web MVP (shared prompts).
  • Tasks: classification / extraction / short rewrites; cap maxTokens to 768–1024.
  • Add cloud fallback for >2k input or low confidence.

Sprint 2—iOS + adapters

  • iOS client with streaming; polish memory handling.
  • Optional LoRA adapter for one high‑value task (e.g., policy/scoring). Google AI for Developers

Sprint 3—Hardening

  • A/B model versions; battery soak; offline‑first UX (progress bars, resume).
  • Add RAG for on‑device doc snippets where helpful. Google Developers Blog


11) Nice‑to‑have extensions

  • WebLLM variant (OpenAI‑compatible local endpoint in browser) if you want one codepath that also works in Node/Electron. GitHub
  • Edge Function Calling on Android for “agentic” local actions. GitHub

Three minimal starters (Android, iOS, Web)

Below are three minimal starters (Android, iOS, Web) with the dependencies wired, a license gate, a model downloader, and a router stub (on‑device → cloud) so you can drop in your task prompts.

# Create a zip archive with Android, iOS, and Web starter code for Gemma 3 270M on-device inference.
import os, textwrap, json, zipfile, hashlib, pathlib

root = "/mnt/data/gemma270m_starters"
os.makedirs(root, exist_ok=True)

def write(path, content):
    path = os.path.join(root, path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content.lstrip("\n"))

# ---------- TOP-LEVEL README ----------
write("README.md", f"""
# Gemma 3 270M — On‑Device Starters (Android • iOS • Web)

These are **minimal starter skeletons** to run **Gemma 3 270M (INT4/INT8 .task)** on‑device via **MediaPipe LLM Inference**.
They include: a **license gate**, a **model downloader**, and a tiny **router** that prefers local inference and falls back to a cloud endpoint for oversized prompts.

> ⚠️ **Licensing**: Gemma weights are under the **Gemma Terms of Use**. You must ensure users accept the terms before download, and you must host the model artifacts yourself or use gated distribution. Do *not* embed a Hugging Face token in client apps.

## What’s here
- **android/**: Kotlin Activity + downloader + Local LLM wrapper (MediaPipe) + cloud stub
- **ios/**: SwiftUI app + downloader + Local LLM wrapper (MediaPipe) + cloud stub
- **web/**: Vanilla HTML/JS using `@mediapipe/tasks-genai` over WebGPU + router stub

## Quick links
- MediaPipe LLM Inference (Android): https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android
- MediaPipe LLM Inference (iOS): https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/ios
- MediaPipe LLM Inference (Web): https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js
- LiteRT Community (Gemma 3 270M IT): https://huggingface.co/litert-community/gemma-3-270m-it
- Gemma Terms of Use: https://ai.google.dev/gemma/terms

## Model artifacts
Recommended to mirror one of these to **your CDN** and update the URLs in each starter:
- Android/iOS: `gemma3-270m-it-q8.task` (or Q4_0/Q8 variant that matches your performance target)
- Web: `gemma3-270m-it-q8-web.task`

""")

# ---------- ANDROID ----------
write("android/README.md", """
# Android Starter (MediaPipe LLM Inference)

## 1) Create a new Android Studio project
- Template: **Empty Views Activity** (Kotlin), minSdk ≥ 26.
- Add the dependency in `app/build.gradle`:
```gradle
dependencies {
implementation 'com.google.mediapipe:tasks-genai:0.10.24'
implementation 'androidx.appcompat:appcompat:1.6.1'
implementation 'androidx.core:core-ktx:1.12.0'
implementation 'com.google.android.material:material:1.11.0'
}
```

Add permission in app/src/main/AndroidManifest.xml:

<uses-permission android:name="android.permission.INTERNET"/>

## 2) Drop the code in app/src/main/java/com/example/gemma/

MainActivity.kt, LicenseGate.kt, ModelManager.kt, LocalGemma.kt, CloudClient.kt, Router.kt from this folder.

## 3) Set your model URL

In ModelManager.kt, set MODEL_URL to your CDN URL for gemma3-270m-it-q8.task.
Do not ship HF access tokens inside apps.

## 4) Run on a real device (GPU preferred)

The MediaPipe LLM Inference API is optimized for real devices; emulators are not reliable.

""")

write("android/MainActivity.kt", r"""
package com.example.gemma

import android.os.Bundle
import android.text.method.ScrollingMovementMethod
import android.widget.*
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.lifecycleScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext

class MainActivity : AppCompatActivity() {

private lateinit var txtOutput: TextView
private lateinit var edtPrompt: EditText
private lateinit var btnSend: Button
private lateinit var radioRoute: RadioGroup
private lateinit var btnDownload: Button

private val modelManager by lazy { ModelManager(this) }
private var local: LocalGemma? = null
private val cloud = CloudClient(baseUrl = "https://YOUR_CLOUD_ENDPOINT") // TODO: replace

override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)

if (!LicenseGate.hasAccepted(this)) {
LicenseGate.show(this) {
// user accepted
setupUi()
}
} else {
setupUi()
}
}

private fun setupUi() {
setContentView(R.layout.activity_main)

txtOutput = findViewById(R.id.txtOutput)
edtPrompt = findViewById(R.id.edtPrompt)
btnSend = findViewById(R.id.btnSend)
radioRoute = findViewById(R.id.radioRoute)
btnDownload = findViewById(R.id.btnDownload)

txtOutput.movementMethod = ScrollingMovementMethod()

btnDownload.setOnClickListener {
lifecycleScope.launch {
appendLine("Downloading model…")
val path = withContext(Dispatchers.IO) { modelManager.ensureModel() }
appendLine("Model ready at: $path")
loadLocal(path)
}
}

btnSend.setOnClickListener {
val prompt = edtPrompt.text.toString().trim()
if (prompt.isEmpty()) return@setOnClickListener
lifecycleScope.launch { routeAndGenerate(prompt) }
}

// Autoload if already present
lifecycleScope.launch {
modelManager.getLocalModelPath()?.let { loadLocal(it) }
}
}

private suspend fun loadLocal(path: String) = withContext(Dispatchers.IO) {
try {
local?.close()
local = LocalGemma(this@MainActivity, path).also { it.load() }
appendLine("Local LLM loaded.")
} catch (e: Exception) {
appendLine("Failed to init local LLM: ${e.message}")
}
}

private suspend fun routeAndGenerate(prompt: String) = withContext(Dispatchers.IO) {
val routing = when (radioRoute.checkedRadioButtonId) {
R.id.optLocal -> Routing.LOCAL_ONLY
R.id.optCloud -> Routing.CLOUD_ONLY
else -> Routing.AUTO
}
val router = Router(local, cloud, maxLocalInputTokens = 2048)
appendLine("Routing: $routing")
try {
router.generate(prompt, routing, onToken = { token ->
runOnUiThread { txtOutput.append(token) }
}, onDone = { ok, source ->
runOnUiThread { appendLine("\n\n[done: $ok via $source]") }
})
} catch (e: Exception) {
appendLine("Error: ${e.message}")
}
}

private fun appendLine(msg: String) = runOnUiThread {
txtOutput.append("\n$msg")
}

override fun onDestroy() {
super.onDestroy()
local?.close()
}
}
""")

write("android/LicenseGate.kt", r"""
package com.example.gemma

import android.app.AlertDialog
import android.content.Context
import android.content.Intent
import android.net.Uri
import android.preference.PreferenceManager

object LicenseGate {
private const val KEY = "gemma_terms_accepted"

fun hasAccepted(context: Context): Boolean =
PreferenceManager.getDefaultSharedPreferences(context).getBoolean(KEY, false)

fun show(context: Context, onAccepted: () -> Unit) {
val dlg = AlertDialog.Builder(context)
.setTitle("Gemma Terms of Use")
.setMessage("You must accept the Gemma Terms of Use to download and run the model on this device.")
.setPositiveButton("View Terms") { _, _ ->
val i = Intent(Intent.ACTION_VIEW, Uri.parse("https://ai.google.dev/gemma/terms"))
context.startActivity(i)
}
.setNeutralButton("I Accept") { d, _ ->
PreferenceManager.getDefaultSharedPreferences(context)
.edit().putBoolean(KEY, true).apply()
d.dismiss()
onAccepted()
}
.setNegativeButton("Exit", null)
.create()
dlg.show()
}
}
""")

write("android/ModelManager.kt", r"""
package com.example.gemma

import android.content.Context
import java.io.File
import java.io.FileOutputStream
import java.net.HttpURLConnection
import java.net.URL

class ModelManager(private val context: Context) {
companion object {
// TODO: host the model yourself; do not embed gated URLs/tokens in apps.
private const val MODEL_URL = "https://YOUR-CDN/gemma3-270m-it-q8.task"
private const val MODEL_FILE = "gemma3-270m-it-q8.task"
}
fun getLocalModelPath(): String? {
val f = File(context.filesDir, MODEL_FILE)
return if (f.exists()) f.absolutePath else null
}

/** Download model if missing, return absolute path. */
fun ensureModel(): String {
val out = File(context.filesDir, MODEL_FILE)
if (out.exists()) return out.absolutePath
download(MODEL_URL, out)
return out.absolutePath
}

private fun download(url: String, outFile: File) {
val conn = URL(url).openConnection() as HttpURLConnection
conn.connectTimeout = 30000
conn.readTimeout = 30000
conn.inputStream.use { input ->
FileOutputStream(outFile).use { output ->
val buf = ByteArray(1 shl 16)
while (true) {
val n = input.read(buf)
if (n <= 0) break
output.write(buf, 0, n)
}
}
}
}

}
""")

write("android/LocalGemma.kt", r"""
package com.example.gemma

import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceOptions

class LocalGemma(private val context: Context, private val modelPath: String) : AutoCloseable {
private var llm: LlmInference? = null
fun load() {
val options = LlmInferenceOptions.builder()
.setModelPath(modelPath)
.setMaxTokens(1024) // input + output
.setTopK(40)
.setTemperature(0.8f)
.setRandomSeed(101)
.build()
llm = LlmInference.createFromOptions(context, options)
}

fun generateStream(prompt: String, onToken: (String) -> Unit, onDone: (Boolean) -> Unit) {
check(llm != null) { "LLM not loaded" }
// This sample wires the streaming result/error listeners via options, so it builds a
// dedicated LlmInference instance for the streaming call.
val opts = LlmInferenceOptions.builder()
.setModelPath(modelPath)
.setMaxTokens(1024)
.setTopK(40).setTemperature(0.8f).setRandomSeed(101)
.setResultListener { part, done ->
onToken(part ?: "")
if (done) onDone(true)
}
.setErrorListener { e ->
onToken("\n[error: ${'$'}e]"); onDone(false)
}
.build()
val streaming = LlmInference.createFromOptions(context, opts)
streaming.generateResponseAsync(prompt)
}

fun generate(prompt: String): String {
val inst = llm ?: throw IllegalStateException("LLM not loaded")
return inst.generateResponse(prompt)
}

override fun close() {
llm?.close()
llm = null
}
}
""")

write("android/CloudClient.kt", r"""
package com.example.gemma

import java.io.BufferedReader
import java.io.InputStreamReader
import java.net.HttpURLConnection
import java.net.URL

class CloudClient(private val baseUrl: String) {
/** Blocking demo GET endpoint: /generate?prompt=... Replace with your own. */
fun generate(prompt: String): String {
val url = URL("${baseUrl.trimEnd('/')}/generate?prompt=" + java.net.URLEncoder.encode(prompt, "UTF-8"))
val conn = url.openConnection() as HttpURLConnection
conn.connectTimeout = 30000
conn.readTimeout = 30000
return conn.inputStream.use { input ->
BufferedReader(InputStreamReader(input)).readText()
}
}
}
""")

write("android/Router.kt", r"""
package com.example.gemma

enum class Routing { AUTO, LOCAL_ONLY, CLOUD_ONLY }

class Router(
private val local: LocalGemma?,
private val cloud: CloudClient?,
private val maxLocalInputTokens: Int = 2048,
) {
/** Very crude token estimator (~4 characters per token). Replace with a real tokenizer if needed. */
private fun estimateTokens(s: String) = (s.length / 4).coerceAtLeast(1)
fun generate(
prompt: String,
routing: Routing,
onToken: (String) -> Unit,
onDone: (Boolean, String) -> Unit
) {
when (routing) {
Routing.LOCAL_ONLY -> {
local ?: return onDone(false, "local-unavailable")
local.generateStream(prompt, onToken) { ok -> onDone(ok, "local") }
}
Routing.CLOUD_ONLY -> {
val out = cloud?.generate(prompt) ?: "[cloud unavailable]"
onToken(out); onDone(true, "cloud")
}
Routing.AUTO -> {
val tokens = estimateTokens(prompt)
val useLocal = local != null && tokens <= maxLocalInputTokens
if (useLocal) {
local!!.generateStream(prompt, onToken) { ok -> onDone(ok, "local") }
} else {
val out = cloud?.generate(prompt) ?: "[cloud unavailable]"
onToken(out); onDone(true, "cloud")
}
}
}
}
}
""")

write("android/res/layout/activity_main.xml", r"""

<?xml version="1.0" encoding="utf-8"?>

<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
android:layout_width="match_parent"
android:layout_height="match_parent"
android:orientation="vertical"
android:padding="16dp">
<Button
android:id="@+id/btnDownload"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:text="Download / Load Model" />

<EditText
android:id="@+id/edtPrompt"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:hint="Enter prompt"
android:minLines="3"
android:gravity="top|start" />

<RadioGroup
android:id="@+id/radioRoute"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:orientation="horizontal">
<RadioButton android:id="@+id/optAuto" android:checked="true"
android:layout_width="wrap_content" android:layout_height="wrap_content"
android:text="Auto" />
<RadioButton android:id="@+id/optLocal"
android:layout_width="wrap_content" android:layout_height="wrap_content"
android:text="Local" />
<RadioButton android:id="@+id/optCloud"
android:layout_width="wrap_content" android:layout_height="wrap_content"
android:text="Cloud" />
</RadioGroup>

<Button
android:id="@+id/btnSend"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:text="Send" />

<TextView
android:id="@+id/txtOutput"
android:layout_width="match_parent"
android:layout_height="0dp"
android:layout_weight="1"
android:paddingTop="12dp"
android:textIsSelectable="true"
android:scrollbars="vertical"
android:textAppearance="?android:attr/textAppearanceSmall"
android:text="Ready." />
</LinearLayout> """)

write("android/AndroidManifest.xml", r"""
<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.gemma">
<uses-permission android:name="android.permission.INTERNET"/>
<application android:label="Gemma270M Starter" android:theme="@style/Theme.AppCompat.Light.NoActionBar">
<activity android:name=".MainActivity">
<intent-filter>
<action android:name="android.intent.action.MAIN" />
<category android:name="android.intent.category.LAUNCHER" />
</intent-filter>
</activity>
</application>
</manifest>
""")

# ---------- iOS ----------

write("ios/README.md", """
# iOS Starter (MediaPipe LLM Inference, SwiftUI)

## 1) Create a new App in Xcode (SwiftUI)

- Add CocoaPods to the project and a Podfile like below, then pod install and open the .xcworkspace.

target 'Gemma270MStarter' do
use_frameworks!
pod 'MediaPipeTasksGenAI'
pod 'MediaPipeTasksGenAIC'
end

## 2) Add these files to your app target

ContentView.swift, LocalLlm.swift, ModelManager.swift, CloudClient.swift

## 3) Set your model URL

In ModelManager.swift, set MODEL_URL to your CDN URL for gemma3-270m-it-q8.task.

## 4) Run on device (Metal)

The LLM Inference API is optimized for real devices.
""")

write("ios/ContentView.swift", r"""
import SwiftUI

struct ContentView: View {
@State private var accepted = UserDefaults.standard.bool(forKey: "gemma_terms_accepted")
@State private var prompt: String = ""
@State private var output: String = "Ready."
@State private var useRoute: Int = 0 // 0 = Auto, 1 = Local, 2 = Cloud
@StateObject private var local = LocalLlm()
private let cloud = CloudClient(baseUrl: "https://YOUR_CLOUD_ENDPOINT") // TODO

var body: some View {
VStack(alignment: .leading, spacing: 12) {
HStack {
Button("Download / Load Model") {
output.append("\nDownloading model…")
Task {
let path = try? await ModelManager.shared.ensureModel()
output.append("\nModel ready at: \(path ?? "n/a")")
if let p = path {
try? await local.load(modelPath: p)
output.append("\nLocal LLM loaded.")
}
}
}
Spacer()
}
TextEditor(text: $prompt).frame(height: 120).border(.secondary)

Picker("", selection: $useRoute) {
Text("Auto").tag(0)
Text("Local").tag(1)
Text("Cloud").tag(2)
}.pickerStyle(.segmented)

Button("Send") {
output.append("\nRouting…")
Task {
await routeAndGenerate()
}
}

ScrollView { Text(output).font(.system(size: 12, design: .monospaced))
.frame(maxWidth: .infinity, alignment: .leading) }
}
.padding()
.sheet(isPresented: .constant(!accepted)) {
TermsSheet(accepted: $accepted)
}
.onChange(of: accepted) { _, v in
UserDefaults.standard.set(v, forKey: "gemma_terms_accepted")
}
}

func routeAndGenerate() async {
let router = Router(local: local, cloud: cloud, maxLocalInputTokens: 2048)
let routing: Routing = [Routing.auto, .localOnly, .cloudOnly][useRoute]
do {
output.append("\n---\n")
try await router.generate(prompt: prompt, routing: routing,
onToken: { token in
output.append(token)
}, onDone: { ok, src in
output.append("\n\n[done: \(ok) via \(src)]")
})
} catch {
output.append("\nError: \(error.localizedDescription)")
}
}
}

struct TermsSheet: View {
@Binding var accepted: Bool
var body: some View {
VStack(spacing: 12) {
Text("Gemma Terms of Use").font(.title3).bold()
Text("You must accept the Gemma Terms of Use to download and run the model.")
HStack {
Link("View Terms", destination: URL(string: "https://ai.google.dev/gemma/terms")!)
Spacer()
Button("I Accept") { accepted = true }
}
}.padding()
}
}
""")

write("ios/LocalLlm.swift", r"""
import Foundation
import MediaPipeTasksGenAI

@MainActor
final class LocalLlm: ObservableObject {
private var llm: LlmInference? = nil
private var modelPath: String? = nil
func load(modelPath: String) async throws {
self.modelPath = modelPath
let opts = LlmInferenceOptions()
opts.baseOptions.modelPath = modelPath
opts.maxTokens = 1024
opts.topk = 40
opts.temperature = 0.8
opts.randomSeed = 101
self.llm = try LlmInference(options: opts)
}

func generate(prompt: String) async throws -> String {
guard let llm else { throw NSError(domain: "LocalLlm", code: 1, userInfo: [NSLocalizedDescriptionKey: "LLM not loaded"]) }
return try llm.generateResponse(inputText: prompt)
}

func generateStream(prompt: String,
onToken: @escaping (String) -> Void,
onDone: @escaping (Bool) -> Void) async throws {
guard let modelPath else { throw NSError(domain: "LocalLlm", code: 2, userInfo: [NSLocalizedDescriptionKey: "Model not loaded"]) }
let opts = LlmInferenceOptions()
opts.baseOptions.modelPath = modelPath
opts.maxTokens = 1024
opts.topk = 40
opts.temperature = 0.8
opts.randomSeed = 101
let streaming = try LlmInference(options: opts)
let stream = try streaming.generateResponseAsync(inputText: prompt)
Task {
do {
for try await part in stream {
onToken(part)
}
onDone(true)
} catch {
onToken("\n[error: \(error.localizedDescription)]")
onDone(false)
}
}
}
}
""")

write("ios/ModelManager.swift", r"""
import Foundation

actor ModelManager {
static let shared = ModelManager()
private init() {}
// TODO: host the model yourself; do not embed gated URLs/tokens in apps.
private let MODEL_URL = URL(string: "https://YOUR-CDN/gemma3-270m-it-q8.task")!
private let MODEL_FILE = "gemma3-270m-it-q8.task"

func localModelPath() -> String? {
let url = getDocumentsDir().appendingPathComponent(MODEL_FILE)
return FileManager.default.fileExists(atPath: url.path) ? url.path : nil
}

func ensureModel() async throws -> String {
let url = getDocumentsDir().appendingPathComponent(MODEL_FILE)
if FileManager.default.fileExists(atPath: url.path) {
return url.path
}
let (tmp, _) = try await URLSession.shared.download(from: MODEL_URL)
try FileManager.default.moveItem(at: tmp, to: url)
return url.path
}

private func getDocumentsDir() -> URL {
FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
}
}

enum Routing { case auto, localOnly, cloudOnly }

@MainActor
struct Router {
let local: LocalLlm?
let cloud: CloudClient?
let maxLocalInputTokens: Int
private func estimateTokens(_ s: String) -> Int { max(1, s.count / 4) }

func generate(prompt: String, routing: Routing,
onToken: @escaping (String) -> Void,
onDone: @escaping (Bool, String) -> Void) async throws {
switch routing {
case .localOnly:
guard let local else { onDone(false, "local-unavailable"); return }
try await local.generateStream(prompt: prompt, onToken: onToken) { ok in onDone(ok, "local") }
case .cloudOnly:
let out = try await cloud?.generate(prompt: prompt) ?? "[cloud unavailable]"
onToken(out); onDone(true, "cloud")
case .auto:
let tokens = estimateTokens(prompt)
if let local, tokens <= maxLocalInputTokens {
try await local.generateStream(prompt: prompt, onToken: onToken) { ok in onDone(ok, "local") }
} else {
let out = try await cloud?.generate(prompt: prompt) ?? "[cloud unavailable]"
onToken(out); onDone(true, "cloud")
}
}
}
}
""")

# ---------- WEB ----------

write("web/README.md", """
# Web Starter (MediaPipe LLM Inference + WebGPU)

## 1) Serve this folder locally

Any static server will do (Vite, python -m http.server, etc.).

## 2) Put your model on a CDN

Update config.js with a URL to gemma3-270m-it-q8-web.task on your CDN.

## 3) Open in a WebGPU-capable browser (Chrome/Edge recent).

""")

write("web/index.html", r"""

<!doctype html>
<html>
<head>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Gemma 3 270M — Web Starter</title>
</head>
<body>
<div id="gate" style="padding:16px;border-bottom:1px solid #ccc;">
<strong>Gemma Terms of Use</strong> — You must accept to download and run the model.
<a target="_blank" href="https://ai.google.dev/gemma/terms">View Terms</a>
<button id="accept">I Accept</button>
</div>
<div style="padding:16px;">
<button id="load">Download / Load Model</button>
<textarea id="prompt" rows="6" style="width:100%;margin-top:8px;" placeholder="Enter prompt"></textarea>
<div style="margin:8px 0;">
<label><input type="radio" name="route" value="auto" checked> Auto</label>
<label><input type="radio" name="route" value="local"> Local</label>
<label><input type="radio" name="route" value="cloud"> Cloud</label>
</div>
<button id="send">Send</button>
<pre id="out" style="min-height:240px;border:1px solid #ddd;padding:8px;white-space:pre-wrap;"></pre>
</div>
<!-- Use ESM import from jsDelivr -->
<script type="module" src="./main.js"></script>
</body>
</html>
""")

write("web/config.js", r"""
export const TERMS_URL = "https://ai.google.dev/gemma/terms";
// Host the model yourself; do not embed gated HF URLs in production apps.
export const MODEL_URL = "https://YOUR-CDN/gemma3-270m-it-q8-web.task";
export const WASM_ROOT = "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm";
export const CLOUD_ENDPOINT = "https://YOUR_CLOUD_ENDPOINT";
""")

write("web/router.js", r"""
export class Router {
constructor(local, cloud, maxLocalInputTokens = 2048) {
this.local = local;
this.cloud = cloud;
this.max = maxLocalInputTokens;
}
estimateTokens(s) { return Math.max(1, Math.floor(s.length / 4)); }

async generate(prompt, routing, onToken, onDone) {
if (routing === "local") {
if (!this.local) return onDone(false, "local-unavailable");
await this.local.generateStream(prompt, onToken); onDone(true, "local"); return;
}
if (routing === "cloud") {
const out = await this.cloud.generate(prompt);
onToken(out); onDone(true, "cloud"); return;
}
// AUTO
if (this.local && this.estimateTokens(prompt) <= this.max) {
await this.local.generateStream(prompt, onToken); onDone(true, "local");
} else {
const out = await this.cloud.generate(prompt);
onToken(out); onDone(true, "cloud");
}
}
}
""")

write("web/main.js", r"""
import { FilesetResolver, LlmInference } from "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest";
import { TERMS_URL, MODEL_URL, WASM_ROOT, CLOUD_ENDPOINT } from "./config.js";
import { Router } from "./router.js";

const out = document.getElementById("out");
const promptBox = document.getElementById("prompt");
const btnLoad = document.getElementById("load");
const btnSend = document.getElementById("send");
const gate = document.getElementById("gate");
document.getElementById("accept").onclick = () => gate.style.display = "none";

let llm = null;

async function initLocal() {
const genai = await FilesetResolver.forGenAiTasks(WASM_ROOT);
llm = await LlmInference.createFromOptions(genai, {
baseOptions: { modelAssetPath: MODEL_URL },
maxTokens: 768, topK: 40, temperature: 0.8, randomSeed: 101
});
}

class Local {
async generateStream(prompt, onToken) {
return new Promise((resolve, reject) => {
llm.generateResponse(prompt, (partial, done) => {
if (partial) onToken(partial);
if (done) resolve(true);
});
});
}
}

class Cloud {
async generate(prompt) {
const u = new URL(CLOUD_ENDPOINT + "/generate");
u.searchParams.set("prompt", prompt);
const res = await fetch(u, { method: "GET" });
return await res.text();
}
}

btnLoad.onclick = async () => {
out.textContent += "\nDownloading / loading model…";
await initLocal();
out.textContent += "\nLocal LLM loaded.";
};

btnSend.onclick = async () => {
const routing = document.querySelector("input[name=route]:checked")?.value ?? "auto";
const router = new Router(llm ? new Local() : null, new Cloud(), 2048);
out.textContent += "\n---\n";
await router.generate(promptBox.value, routing,
(t) => out.textContent += t,
(ok, src) => out.textContent += `\n\n[done: ${ok} via ${src}]`);
};
""")

# Zip it

zip_path = "/mnt/data/gemma270m_starters.zip"
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as z:
    for base, _, files in os.walk(root):
        for fn in files:
            p = os.path.join(base, fn)
            z.write(p, arcname=os.path.relpath(p, root))

zip_path

'/mnt/data/gemma270m_starters.zip'

Here are copy‑pasteable, cross‑platform starters for Gemma 3 270M on‑device with a license gate, model downloader, and a tiny local→cloud router (Android • iOS • Web).

What you’re getting

gemma270m_starters/
  README.md
  android/   # Kotlin + MediaPipe LLM Inference + simple UI + router
  ios/       # SwiftUI + MediaPipe LLM Inference + router
  web/       # Vanilla HTML/JS + @mediapipe/tasks-genai (WebGPU) + router

Model artifacts to host on your CDN (update the URLs in each starter):

  • Android/iOS: gemma3-270m-it-q8.task
  • Web: gemma3-270m-it-q8-web.task (web‑optimized) — both visible on the LiteRT Community model page. Hugging Face

Licensing: Add a first‑run “I accept” gate before download (included). Gemma’s Terms of Use require passing along restrictions when you redistribute. Google AI for Developers

Android (Kotlin, MediaPipe LLM Inference)

  1. Add dependency in your app module:
dependencies {
    implementation 'com.google.mediapipe:tasks-genai:0.10.24'
    implementation 'androidx.appcompat:appcompat:1.6.1'
    implementation 'androidx.core:core-ktx:1.12.0'
    implementation 'com.google.android.material:material:1.11.0'
}

This is the official package/version from the Android LLM Inference guide. Google AI for Developers

  2. Add internet permission in AndroidManifest.xml:
<uses-permission android:name="android.permission.INTERNET"/>
  3. Drop in the files from android/:
  • MainActivity.kt – minimal chat-like UI; streams tokens
  • LicenseGate.kt – first‑run acceptance dialog linking to the Gemma Terms (opens browser) Google AI for Developers
  • ModelManager.kt – downloads gemma3-270m-it-q8.task to app storage (set your CDN URL)
  • LocalGemma.kt – wraps LlmInference (generateResponse / generateResponseAsync)
    Import path example: com.google.mediapipe.tasks.genai.llminference.LlmInference (as used by Google samples/issues). GitHub
  • Router.kt + CloudClient.kt – simple “Auto / Local only / Cloud only” routing

Notes:

  • The official guide recommends hosting the model and downloading at runtime, not bundling in the APK. Google AI for Developers
  • The API is optimized for real devices (Pixel 8+/S23+) and “does not reliably support emulators.” Google AI for Developers

Core init (already in LocalGemma.kt):

val options = LlmInferenceOptions.builder()
    .setModelPath(modelPath)
    .setMaxTokens(1024).setTopK(40).setTemperature(0.8f).setRandomSeed(101)
    .build()
val llm = LlmInference.createFromOptions(context, options)

Use generateResponseAsync for streaming with a resultListener. Google AI for Developers


iOS (SwiftUI, MediaPipe LLM Inference)

  1. CocoaPods (Podfile):
target 'Gemma270MStarter' do
  use_frameworks!
  pod 'MediaPipeTasksGenAI'
  pod 'MediaPipeTasksGenAIC'
end

These are the iOS packages from the official guide. Google AI for Developers

  2. Files in ios/:
  • ContentView.swift – SwiftUI UI + license sheet + router
  • LocalLlm.swift – wraps LlmInference (blocking & async streaming)
  • ModelManager.swift – downloads gemma3-270m-it-q8.task to Documents
  • CloudClient.swift – trivial GET endpoint client + Router enum/struct

Core init (already in LocalLlm.swift):

let options = LlmInferenceOptions()
options.baseOptions.modelPath = modelPath
options.maxTokens = 1024
options.topk = 40
options.temperature = 0.8
options.randomSeed = 101
let llm = try LlmInference(options: options)

Use generateResponseAsync(inputText:) to stream and iterate the async sequence. Google AI for Developers


Web (Vanilla JS + WebGPU)

Your starter is a single‑page app using the official @mediapipe/tasks-genai package and WebGPU:

  • index.html – terms gate, prompt box, route picker
  • main.js – loads WASM + LlmInference, streams tokens
  • router.js – same simple router pattern
  • config.js – set MODEL_URL to your CDN path for gemma3-270m-it-q8-web.task

Key bits (already in main.js):

import { FilesetResolver, LlmInference } from
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest";

const genai = await FilesetResolver.forGenAiTasks(WASM_ROOT);
const llm = await LlmInference.createFromOptions(genai, {
  baseOptions: { modelAssetPath: MODEL_URL },
  maxTokens: 768, topK: 40, temperature: 0.8, randomSeed: 101
});

// Streaming:
llm.generateResponse(inputPrompt, (partial, done) => {
  if (partial) append(partial);
});

This follows the official Web quickstart (WebGPU required). Google AI for Developers


Why these choices map cleanly to Gemma 3 270M

  • LiteRT/MediaPipe‑ready artifacts are published for 270M IT, including a web‑optimized .task variant—so you don’t need to convert to get started. Hugging Face
  • Android/iOS/Web LLM Inference APIs expose the same knobs (maxTokens, topK, temperature, randomSeed) and built‑in streaming methods, which keeps your app logic symmetric across platforms. Google AI for Developers

What you still need to plug in

  • Host the model on your CDN and update MODEL_URL in each starter. Don’t embed gated HF URLs or tokens in a client app. (The Android docs also note the model is too large to bundle—download at runtime.) Google AI for Developers
  • Cloud fallback: replace the simple GET /generate?prompt=... stubs with your real endpoint (or remove cloud entirely if you want on‑device‑only).
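
If you want a slightly more realistic starting point than the GET stub, here is a hedged Kotlin sketch of a POST‑based client; the /v1/generate path, request JSON shape, and bearer‑token auth are placeholders for whatever your backend actually exposes:

import java.net.HttpURLConnection
import java.net.URL

class CloudClientV2(private val baseUrl: String, private val apiKey: String) {
    fun generate(prompt: String, maxTokens: Int = 512): String {
        val conn = URL("${baseUrl.trimEnd('/')}/v1/generate").openConnection() as HttpURLConnection
        conn.requestMethod = "POST"
        conn.doOutput = true
        conn.connectTimeout = 30_000
        conn.readTimeout = 60_000
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setRequestProperty("Authorization", "Bearer $apiKey")
        // JSONObject.quote escapes the prompt for embedding in the request body.
        val body = """{"prompt": ${org.json.JSONObject.quote(prompt)}, "max_tokens": $maxTokens}"""
        conn.outputStream.use { it.write(body.toByteArray(Charsets.UTF_8)) }
        return conn.inputStream.bufferedReader().use { it.readText() } // parse per your API's schema
    }
}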
