0) What we’re optimizing for
• Footprint & speed: 270M INT4 QAT keeps RAM/VRAM needs and power low (≈240 MB to load at Q4_0; additional memory for tokens/KV cache), making it a great fit for edge devices.
• Battery: Google’s internal test on a Pixel 9 Pro: ~0.75% battery for ~25 short chats with the INT4 variant—useful as a power envelope target.
• Context: Up to 32k tokens on 270M/1B (plan defaults below use 1–4k on mobile for stability).
• Distribution: Official LiteRT + MediaPipe LLM Inference give production‑grade Android/iOS/Web runtimes, with 270M IT builds already published in the LiteRT Community on Hugging Face.
⸻
1) Architecture at a glance
On‑device first, cloud when needed
App (Android | iOS | Web)
├─ Local Inference (Gemma 3 270M INT4, default context 2k–4k)
│ ├─ JSON-mode prompts for structured outputs
│ ├─ LiteRT / MediaPipe LLM Inference runtime
│ └─ Local adapters (optional LoRA)
├─ Tooling (optional): Function Calling SDK for actions
└─ Escalation (router):
– Larger local model (1B) on capable devices
– Cloud (e.g., Gemini APIs) for long/complex tasks
- Why this split: Empirically, 270M is excellent for structured extraction, classification, policy checks, routing, templated copy, smart‑reply—and avoids network and cost. Use a router for “hard” prompts (long context, multi‑hop). Google Developers Blog
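To make the routing rule concrete, here is a minimal Kotlin sketch. The token threshold, the ~4-characters-per-token estimate, and the multi-hop keyword check are illustrative assumptions to tune for your own tasks, not part of any SDK.

```kotlin
// Minimal routing sketch: prefer on-device 270M, escalate "hard" prompts.
// Thresholds and the keyword heuristic below are illustrative placeholders.
enum class Route { LOCAL_270M, LOCAL_1B, CLOUD }

fun chooseRoute(prompt: String, deviceHasHeadroom: Boolean): Route {
    val estTokens = (prompt.length / 4).coerceAtLeast(1)          // crude char-count token proxy
    val multiHopMarkers = listOf("step by step", "compare", "first", "then")
    val looksMultiHop = multiHopMarkers.count { prompt.contains(it, ignoreCase = true) } >= 2
    return when {
        estTokens > 2048 || looksMultiHop -> Route.CLOUD          // long or multi-hop: escalate
        estTokens > 1024 && deviceHasHeadroom -> Route.LOCAL_1B   // mid-size: larger local model if available
        else -> Route.LOCAL_270M                                  // default: short structured tasks stay local
    }
}
```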
2) Model artifacts & packaging
- Primary: gemma-3-270m-it QAT/Q4_0 (instruction‑tuned) from Google/LiteRT channels. Prefer official LiteRT/MediaPipe‑ready packages when available; otherwise convert. Hugging Face+1
- Where to fetch:
– LiteRT Community (Hugging Face): published 270M IT artifacts and guidance for Android/iOS/Web. Hugging Face
– Gemma 3 release overview (context windows, sizes, QAT): Google AI for Developers
- Licensing: Weights are open under the Gemma Terms of Use; require users to accept the terms (HF gate) at first download. Wire this into the first‑run flow. Hugging Face
3) Android (Kotlin/Java) — production path
Runtime: MediaPipe LLM Inference API on LiteRT with GPU delegate fallback to CPU. Google AI for Developers+1
Steps
- Add the dependency in build.gradle (versions per docs):
dependencies {
implementation 'com.google.mediapipe:tasks-genai:0.10.24'
}
This provides LlmInference and options for temperature/top‑k, streaming, etc. Google AI for Developers
- Model distribution
- Host the .task or .bin (LiteRT/MediaPipe‑compatible) artifact on your CDN; on first run, present the HF license screen, then download. Cache to app‑private storage; keep an on‑disk optimized layout cache (LiteRT does a one‑time optimization at first load). Google Developers Blog
- For MVP, you can side‑load during dev with adb push, but ship via runtime download for production. Google AI for Developers
- Initialization (simplified)
val opts = LlmInferenceOptions.builder()
.setModelPath(localPath) // downloaded model
.setMaxTokens(1024) // include input+output
.setTopK(40).setTemperature(0.8f).build()
val llm = LlmInference.createFromOptions(context, opts)
val result = llm.generateResponse(prompt)
Use generateResponseAsync for streaming. Google AI for Developers
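A minimal streaming sketch is shown below; it mirrors the listener pattern used in the starter code later in this document. render() and onGenerationFinished() are placeholder callbacks for your UI, and exact option/method names should be checked against the tasks-genai version you pin.

```kotlin
// Streaming sketch: register a result listener, then call generateResponseAsync.
// render() and onGenerationFinished() are placeholders for your own UI code.
val streamingOpts = LlmInferenceOptions.builder()
    .setModelPath(localPath)
    .setMaxTokens(1024)
    .setTopK(40).setTemperature(0.8f)
    .setResultListener { partial, done ->
        render(partial ?: "")          // append partial text as it arrives
        if (done) onGenerationFinished()
    }
    .build()
val streamingLlm = LlmInference.createFromOptions(context, streamingOpts)
streamingLlm.generateResponseAsync(prompt)
```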
- Performance defaults
- Context: cap at 2k tokens by default; expose a setting for 4k on high‑end devices (S24U, Pixel 9).
- Backend: allow users to pick GPU or CPU (UI toggle); GPU delegate improves throughput; KV‑cache layout and GPU weight‑sharing reduce latency/memory under the hood. Google Developers Blog
- Battery budget: target ≤0.03%/turn for short prompts; profile with your prompt templates. Pixel 9 Pro reference: 0.75% per 25 short conversations. Google Developers Blog
- Optional features
- RAG SDK (Edge): for simple doc QA with local embeddings. Google Developers Blog
- Function Calling SDK (Edge): map model outputs to local actions (search, alarms). GitHub
- LoRA adapters: if you fine‑tune for vertical tasks; Android supports LoRA (GPU path) with MediaPipe convertor. Google AI for Developers
- Sample app to crib from: AI Edge Gallery and LLM Demo. Google AI for Developers
4) iOS (Swift) — production path
Runtime: MediaPipe LLM Inference API on LiteRT with Metal acceleration; CocoaPods package. Google AI for Developers
Steps
- Pods
target 'MyLlmInferenceApp' do
use_frameworks!
pod 'MediaPipeTasksGenAI'
pod 'MediaPipeTasksGenAIC'
end
- Model distribution
- Bundle tiny stubs only; download the real model on first run after the user accepts the Gemma terms (HF gate). Store in app‑private documents; enable resumable downloads; verify the hash before load. LiteRT will cache optimized tensor layouts to cut future load times. Google Developers Blog
- Init (simplified)
import MediaPipeTasksGenAI
let options = LlmInferenceOptions()
options.baseOptions.modelPath = localPath
options.maxTokens = 1024
options.topk = 40
options.temperature = 0.8
let llm = try LlmInference(options: options)
let out = try llm.generateResponse(inputText: prompt)
Use generateResponseAsync for streaming. Google AI for Developers
- Performance defaults
- Context: default 2k; offer 4k on iPad Pro/modern iPhones.
- Memory heads‑up: iOS enforces strict memory limits; keep KV‑cache conservative, and stream outputs.
- LoRA: iOS supports static LoRA at init via converted adapters for Gemma classes; conversion via MediaPipe tools. Google AI for Developers
- Alt path (if you prefer an OSS stack): llama.cpp with GGUF and Metal is viable, but MediaPipe/LiteRT will be simpler to maintain across OS updates. GitHub
- Sample: iOS MediaPipe LLM Inference sample app. Google AI for Developers
5) Web — two good options
Option A — MediaPipe LLM Inference (WebGPU)
- Requirements: a modern browser with WebGPU. Google AI for Developers, MDN Web Docs
- Install: npm i @mediapipe/tasks-genai (or via CDN).
- Init (simplified):
import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';
const genai = await FilesetResolver.forGenAiTasks(
'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm'
);
const llm = await LlmInference.createFromOptions(genai, {
baseOptions: { modelAssetPath: '/assets/gemma-3-270m-it.task' },
maxTokens: 768, topK: 40, temperature: 0.8, randomSeed: 101
});
const text = await llm.generateResponse(prompt);
- Why A: One API family across Android/iOS/Web; supports dynamic LoRA on Web. Google AI for Developers
Option B — WebLLM (MLC)
- Why B: Mature in‑browser LLM engine with OpenAI‑compatible API surface; broad model zoo (MLC format). Good if you already use MLC builds. GitHub, webllm.mlc.ai
Google’s launch post even highlights a Transformers.js demo using 270M in a browser—handy for very small apps. Google Developers Blog
6) Prompting & outputs (consistent across platforms)
- JSON‑first templates (require strict keys, minimal prose).
- Guardrails: temperature ≤0.4 for extraction/classification; ≤0.8 for copywriting.
- Context mgmt: sliding‑window (truncate oldest), summarize tails if >2k tokens on mobile; reserve ~30–40% of maxTokens for output (see the sketch after this list).
- Function calling (Android today): use the Edge FC SDK to map JSON to actions (search, reminders). GitHub
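A minimal Kotlin sketch of the JSON-first template plus the context budget described above; the schema, the 35% output reserve, and the ~4-characters-per-token estimate are assumptions to adapt, not fixed requirements.

```kotlin
// Sketch: build a JSON-mode prompt and trim history so ~35% of maxTokens stays free for output.
// Schema, budget split, and the 4-chars-per-token estimate are illustrative assumptions.
fun estimateTokens(s: String) = (s.length / 4).coerceAtLeast(1)

fun buildExtractionPrompt(history: List<String>, userText: String, maxTokens: Int = 1024): String {
    val schema = """{"category": "<string>", "sentiment": "positive|neutral|negative", "entities": ["<string>"]}"""
    val instruction = "Return ONLY valid JSON matching this schema, no prose:\n$schema\n"
    val outputReserve = (maxTokens * 0.35).toInt()                 // reserve ~35% of maxTokens for the answer
    var inputBudget = maxTokens - outputReserve - estimateTokens(instruction) - estimateTokens(userText)

    // Sliding window: keep only the most recent turns that still fit the input budget.
    val kept = mutableListOf<String>()
    for (turn in history.asReversed()) {
        val cost = estimateTokens(turn)
        if (cost > inputBudget) break
        kept.add(0, turn)
        inputBudget -= cost
    }
    return instruction + kept.joinToString("\n") + "\nInput: " + userText
}
```

Pair this with temperature ≤0.4 (per the guardrails above) for extraction and classification tasks.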
7) CI/CD for models
- Fine‑tune (full or LoRA) → export to LiteRT/MediaPipe (.task/.bin). (AI Edge Torch supports PyTorch→TFLite and integrates with LLM Inference.) GitHub+1
- Quantize: prefer QAT INT4 checkpoints to preserve quality at 4‑bit. Google Developers Blog
- Virus scan & hash artifacts; upload to private bucket + HF gated mirror if desired.
- Release train: semantic version model IDs, A/B via remote config, roll back by ID.
- Client downloads + verifies SHA‑256; keep per‑version caches for instant rollback.
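A hedged Kotlin sketch of the client-side end of this pipeline: a versioned manifest entry plus SHA-256 verification of the downloaded artifact before first load. The manifest fields and file layout are placeholders for your own release pipeline.

```kotlin
import java.io.File
import java.security.MessageDigest

// Sketch: versioned model manifest + integrity check before first load.
// Field names and the expected hash are placeholders.
data class ModelManifest(val id: String, val version: String, val url: String, val sha256: String)

fun sha256Of(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buf = ByteArray(1 shl 16)
        while (true) {
            val n = input.read(buf)
            if (n <= 0) break
            digest.update(buf, 0, n)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

fun verifyArtifact(file: File, manifest: ModelManifest): Boolean =
    sha256Of(file).equals(manifest.sha256, ignoreCase = true)
```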
8) Performance budgets & test matrix
- Load memory (model only): ~240 MB (Q4_0). Plan headroom for the KV cache (it scales with batch × heads × layers × context; see the sketch after this list). Google AI for Developers
- TTFT target: <600 ms (warm) on flagship phones; <250 ms on desktop WebGPU. (Use MediaPipe samples to benchmark.) Google Developers Blog
- Throughput: prioritize prefill speed (tokenizing + initial attention); keep max input <2k on mobile.
- Matrix: Pixel 9 / S24U / mid‑range Android; iPhone 15/14; iPad Pro; Chrome/Safari/Edge (WebGPU on). Google AI for Developers
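To plan KV-cache headroom, a rough back-of-the-envelope helper in Kotlin; the parameter values in the example are placeholders, not Gemma 3 270M's published configuration, so substitute the real layer/head/dimension counts from the model card.

```kotlin
// Rough KV-cache size: 2 (K and V) × layers × kvHeads × headDim × contextTokens × bytesPerElement × batch.
// The example values below are placeholders; read the real ones from the model config.
fun kvCacheBytes(
    layers: Int, kvHeads: Int, headDim: Int,
    contextTokens: Int, bytesPerElement: Int = 2, batch: Int = 1
): Long = 2L * layers * kvHeads * headDim * contextTokens * bytesPerElement * batch

fun main() {
    val bytes = kvCacheBytes(layers = 20, kvHeads = 1, headDim = 256, contextTokens = 2048)
    println("KV cache ≈ %.1f MB".format(bytes / (1024.0 * 1024.0)))   // ~40 MB with these placeholder values
}
```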
9) Privacy, safety, and compliance
- On‑device by default: no payload leaves device for 270M path.
- Content policy: enforce Gemma Prohibited Use + your own constraints; show the terms gate when fetching from HF. Google AI for Developers+1
- Telemetry: opt‑in, coarse device class only; never log raw prompts.
- Eval sets: per domain (e.g., classification/extraction for your content sites) and run pre‑/post‑deploy checks (precision/recall, JSON validity rate).
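One of the pre/post-deploy checks above (JSON validity rate) as a small Kotlin sketch; it assumes org.json is available (it is bundled on Android) and that the required keys are known per task.

```kotlin
import org.json.JSONObject

// Sketch: fraction of model outputs that parse as JSON and contain the required keys.
// Assumes org.json on the classpath; adapt to your eval harness.
fun jsonValidityRate(outputs: List<String>, requiredKeys: List<String>): Double {
    if (outputs.isEmpty()) return 0.0
    val valid = outputs.count { raw ->
        try {
            val obj = JSONObject(raw.trim())
            requiredKeys.all { obj.has(it) }
        } catch (e: Exception) {
            false
        }
    }
    return valid.toDouble() / outputs.size
}
```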
10) Rollout plan (4 sprints)
Sprint 0—Plumbing
- Pick Option A (MediaPipe) for all three platforms to minimize divergence.
- Wire model download + license acceptance.
- Ship a hidden diagnostics screen: device info, backend, TTFT, tokens/s.
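A sketch of how the diagnostics screen could compute TTFT and tokens/s around the streaming listener used elsewhere in this plan; token counts are approximated from characters, so treat the throughput figure as indicative only.

```kotlin
// Sketch: wrap the streaming listener to record TTFT and tokens/s for the diagnostics screen.
// Token counts are approximated (~4 chars/token); swap in a real tokenizer if you need accuracy.
class GenerationStats {
    private var startNs = 0L
    private var firstTokenNs = 0L
    private var charCount = 0

    fun onRequestSent() { startNs = System.nanoTime() }

    fun onPartial(text: String) {
        if (firstTokenNs == 0L) firstTokenNs = System.nanoTime()
        charCount += text.length
    }

    fun report(): String {
        val ttftMs = if (firstTokenNs == 0L) -1 else (firstTokenNs - startNs) / 1_000_000
        val totalSec = (System.nanoTime() - startNs) / 1_000_000_000.0
        val approxTokens = charCount / 4
        return "TTFT=${ttftMs}ms, ~%.1f tok/s".format(approxTokens / totalSec)
    }
}
```

Call onRequestSent() just before generateResponseAsync and onPartial() inside the result listener, then surface report() on the hidden screen.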
Sprint 1—MVP
- Android + Web MVP (shared prompts).
- Tasks: classification / extraction / short rewrites; cap maxTokens to 768–1024.
- Add cloud fallback for >2k input or low confidence.
Sprint 2—iOS + adapters
- iOS client with streaming; polish memory handling.
- Optional LoRA adapter for one high‑value task (e.g., policy/scoring). Google AI for Developers
Sprint 3—Hardening
- A/B model versions; battery soak; offline‑first UX (progress bars, resume).
- Add RAG for on‑device doc snippets where helpful. Google Developers Blog
11) Quick links you’ll actually need
- Gemma 3 270M announcement (battery, QAT, positioning). Google Developers Blog
- Gemma 3 overview (sizes, context, memory table). Google AI for Developers
- LiteRT overview (Android/iOS/Web support). Google AI for Developers
- Android LLM Inference guide (API, Gradle dep, code). Google AI for Developers
- iOS LLM Inference guide (Pods, Swift code). Google AI for Developers
- Web LLM Inference guide (WebGPU, npm/cdn, code). Google AI for Developers
- LiteRT Community 270M IT model (ready to use). Hugging Face
12) Nice‑to‑have extensions
- WebLLM variant (OpenAI‑compatible local endpoint in browser) if you want one codepath that also works in Node/Electron. GitHub
- Edge Function Calling on Android for “agentic” local actions. GitHub
Three minimal starters (Android, iOS, Web)
Below are three minimal starters (Android, iOS, Web) with the dependencies wired, a license gate, a model downloader, and a router stub (on‑device → cloud) so you can drop in your task prompts.
# Create a zip archive with Android, iOS, and Web starter code for Gemma 3 270M on-device inference.
import os, textwrap, json, zipfile, hashlib, pathlib
root = "/mnt/data/gemma270m_starters"
os.makedirs(root, exist_ok=True)
def write(path, content):
path = os.path.join(root, path)
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w", encoding="utf-8") as f:
f.write(content.lstrip("\n"))
# ---------- TOP-LEVEL README ----------
write("README.md", f"""
# Gemma 3 270M — On‑Device Starters (Android • iOS • Web)
These are **minimal starter skeletons** to run **Gemma 3 270M (INT4/INT8 .task)** on‑device via **MediaPipe LLM Inference**.
They include: a **license gate**, a **model downloader**, and a tiny **router** that prefers local inference and falls back to a cloud endpoint for oversized prompts.
> ⚠️ **Licensing**: Gemma weights are under the **Gemma Terms of Use**. You must ensure users accept the terms before download, and you must host the model artifacts yourself or use gated distribution. Do *not* embed a Hugging Face token in client apps.
## What’s here
- **android/**: Kotlin Activity + downloader + Local LLM wrapper (MediaPipe) + cloud stub
- **ios/**: SwiftUI app + downloader + Local LLM wrapper (MediaPipe) + cloud stub
- **web/**: Vanilla HTML/JS using `@mediapipe/tasks-genai` over WebGPU + router stub
## Quick links
- MediaPipe LLM Inference (Android): https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android
- MediaPipe LLM Inference (iOS): https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/ios
- MediaPipe LLM Inference (Web): https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js
- LiteRT Community (Gemma 3 270M IT): https://huggingface.co/litert-community/gemma-3-270m-it
- Gemma Terms of Use: https://ai.google.dev/gemma/terms
## Model artifacts
Recommended to mirror one of these to **your CDN** and update the URLs in each starter:
- Android/iOS: `gemma3-270m-it-q8.task` (or Q4_0/Q8 variant that matches your performance target)
- Web: `gemma3-270m-it-q8-web.task`
""")
# ---------- ANDROID ----------
write("android/README.md", """
# Android Starter (MediaPipe LLM Inference)
## 1) Create a new Android Studio project
- Template: **Empty Views Activity** (Kotlin), minSdk ≥ 26.
- Add the dependency in `app/build.gradle`:
```gradle
dependencies {
implementation 'com.google.mediapipe:tasks-genai:0.10.24'
implementation 'androidx.appcompat:appcompat:1.6.1'
implementation 'androidx.core:core-ktx:1.12.0'
implementation 'com.google.android.material:material:1.11.0'
}
```
Add permission in app/src/main/AndroidManifest.xml:
<uses-permission android:name="android.permission.INTERNET"/>
2) Drop the code in app/src/main/java/com/example/gemma/
MainActivity.kt, LicenseGate.kt, ModelManager.kt, LocalGemma.kt, CloudClient.kt, Router.kt from this folder.
3) Set your model URL
In ModelManager.kt, set MODEL_URL to your CDN URL for gemma3-270m-it-q8.task.
Do not ship HF access tokens inside apps.
4) Run on a real device (GPU preferred)
The MediaPipe LLM Inference API is optimized for real devices; emulators are not reliable.
""")
write("android/MainActivity.kt", r"""
package com.example.gemma
import android.os.Bundle
import android.text.method.ScrollingMovementMethod
import android.widget.*
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.lifecycleScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext
class MainActivity : AppCompatActivity() {
private lateinit var txtOutput: TextView
private lateinit var edtPrompt: EditText
private lateinit var btnSend: Button
private lateinit var radioRoute: RadioGroup
private lateinit var btnDownload: Button
private val modelManager by lazy { ModelManager(this) }
private var local: LocalGemma? = null
private val cloud = CloudClient(baseUrl = "https://YOUR_CLOUD_ENDPOINT") // TODO: replace
override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)
if (!LicenseGate.hasAccepted(this)) {
LicenseGate.show(this) {
// user accepted
setupUi()
}
} else {
setupUi()
}
}
private fun setupUi() {
setContentView(R.layout.activity_main)
txtOutput = findViewById(R.id.txtOutput)
edtPrompt = findViewById(R.id.edtPrompt)
btnSend = findViewById(R.id.btnSend)
radioRoute = findViewById(R.id.radioRoute)
btnDownload = findViewById(R.id.btnDownload)
txtOutput.movementMethod = ScrollingMovementMethod()
btnDownload.setOnClickListener {
lifecycleScope.launch {
appendLine("Downloading model…")
val path = withContext(Dispatchers.IO) { modelManager.ensureModel() }
appendLine("Model ready at: $path")
loadLocal(path)
}
}
btnSend.setOnClickListener {
val prompt = edtPrompt.text.toString().trim()
if (prompt.isEmpty()) return@setOnClickListener
lifecycleScope.launch { routeAndGenerate(prompt) }
}
// Autoload if already present
lifecycleScope.launch {
modelManager.getLocalModelPath()?.let { loadLocal(it) }
}
}
private suspend fun loadLocal(path: String) = withContext(Dispatchers.IO) {
try {
local?.close()
local = LocalGemma(this@MainActivity, path).also { it.load() }
appendLine("Local LLM loaded.")
} catch (e: Exception) {
appendLine("Failed to init local LLM: ${e.message}")
}
}
private suspend fun routeAndGenerate(prompt: String) = withContext(Dispatchers.IO) {
val routing = when (radioRoute.checkedRadioButtonId) {
R.id.optLocal -> Routing.LOCAL_ONLY
R.id.optCloud -> Routing.CLOUD_ONLY
else -> Routing.AUTO
}
val router = Router(local, cloud, maxLocalInputTokens = 2048)
appendLine("Routing: $routing")
try {
router.generate(prompt, routing, onToken = { token ->
runOnUiThread { txtOutput.append(token) }
}, onDone = { ok, source ->
runOnUiThread { appendLine("\n\n[done: $ok via $source]") }
})
} catch (e: Exception) {
appendLine("Error: ${e.message}")
}
}
private fun appendLine(msg: String) = runOnUiThread {
txtOutput.append("\n$msg")
}
override fun onDestroy() {
super.onDestroy()
local?.close()
}
}
""")
write("android/LicenseGate.kt", r"""
package com.example.gemma
import android.app.AlertDialog
import android.content.Context
import android.content.Intent
import android.net.Uri
import android.preference.PreferenceManager
object LicenseGate {
private const val KEY = "gemma_terms_accepted"
fun hasAccepted(context: Context): Boolean =
PreferenceManager.getDefaultSharedPreferences(context).getBoolean(KEY, false)
fun show(context: Context, onAccepted: () -> Unit) {
val dlg = AlertDialog.Builder(context)
.setTitle("Gemma Terms of Use")
.setMessage("You must accept the Gemma Terms of Use to download and run the model on this device.")
.setPositiveButton("View Terms") { _, _ ->
val i = Intent(Intent.ACTION_VIEW, Uri.parse("https://ai.google.dev/gemma/terms"))
context.startActivity(i)
}
.setNeutralButton("I Accept") { d, _ ->
PreferenceManager.getDefaultSharedPreferences(context)
.edit().putBoolean(KEY, true).apply()
d.dismiss()
onAccepted()
}
.setNegativeButton("Exit", null)
.create()
dlg.show()
}
}
""")
write("android/ModelManager.kt", r"""
package com.example.gemma
import android.content.Context
import java.io.File
import java.io.FileOutputStream
import java.net.HttpURLConnection
import java.net.URL
class ModelManager(private val context: Context) {
companion object {
// TODO: host the model yourself; do not embed gated URLs/tokens in apps.
private const val MODEL_URL = "https://YOUR-CDN/gemma3-270m-it-q8.task"
private const val MODEL_FILE = "gemma3-270m-it-q8.task"
}
fun getLocalModelPath(): String? {
val f = File(context.filesDir, MODEL_FILE)
return if (f.exists()) f.absolutePath else null
}
/** Download model if missing, return absolute path. */
fun ensureModel(): String {
val out = File(context.filesDir, MODEL_FILE)
if (out.exists()) return out.absolutePath
download(MODEL_URL, out)
return out.absolutePath
}
private fun download(url: String, outFile: File) {
val conn = URL(url).openConnection() as HttpURLConnection
conn.connectTimeout = 30000
conn.readTimeout = 30000
conn.inputStream.use { input ->
FileOutputStream(outFile).use { output ->
val buf = ByteArray(1 shl 16)
while (true) {
val n = input.read(buf)
if (n <= 0) break
output.write(buf, 0, n)
}
}
}
}
}
""")
write("android/LocalGemma.kt", r"""
package com.example.gemma
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceOptions
class LocalGemma(private val context: Context, private val modelPath: String) : AutoCloseable {
private var llm: LlmInference? = null
fun load() {
val options = LlmInferenceOptions.builder()
.setModelPath(modelPath)
.setMaxTokens(1024) // input + output
.setTopK(40)
.setTemperature(0.8f)
.setRandomSeed(101)
.build()
llm = LlmInference.createFromOptions(context, options)
}
fun generateStream(prompt: String, onToken: (String) -> Unit, onDone: (Boolean) -> Unit) {
val inst = llm ?: throw IllegalStateException("LLM not loaded")
val opts = LlmInferenceOptions.builder()
.setModelPath(modelPath)
.setMaxTokens(1024)
.setTopK(40).setTemperature(0.8f).setRandomSeed(101)
.setResultListener { part, done ->
onToken(part ?: "")
if (done) onDone(true)
}
.setErrorListener { e ->
onToken("\n[error: $e]"); onDone(false)
}
.build()
val streaming = LlmInference.createFromOptions(context, opts)
streaming.generateResponseAsync(prompt)
}
fun generate(prompt: String): String {
val inst = llm ?: throw IllegalStateException("LLM not loaded")
return inst.generateResponse(prompt)
}
override fun close() {
llm?.close()
llm = null
}
}
""")
write("android/CloudClient.kt", r"""
package com.example.gemma
import java.io.BufferedReader
import java.io.InputStreamReader
import java.net.HttpURLConnection
import java.net.URL
class CloudClient(private val baseUrl: String) {
/** Blocking demo GET endpoint: /generate?prompt=... Replace with your own. */
fun generate(prompt: String): String {
val url = URL("${baseUrl.trimEnd('/')}/generate?prompt=" + java.net.URLEncoder.encode(prompt, "UTF-8"))
val conn = url.openConnection() as HttpURLConnection
conn.connectTimeout = 30000
conn.readTimeout = 30000
return conn.inputStream.use { input ->
BufferedReader(InputStreamReader(input)).readText()
}
}
}
""")
write("android/Router.kt", r"""
package com.example.gemma
enum class Routing { AUTO, LOCAL_ONLY, CLOUD_ONLY }
class Router(
private val local: LocalGemma?,
private val cloud: CloudClient?,
private val maxLocalInputTokens: Int = 2048,
) {
/** Very crude token estimator (~4 characters per token). Replace with a real tokenizer if needed. */
private fun estimateTokens(s: String) = (s.length / 4).coerceAtLeast(1)
fun generate(
prompt: String,
routing: Routing,
onToken: (String) -> Unit,
onDone: (Boolean, String) -> Unit
) {
when (routing) {
Routing.LOCAL_ONLY -> {
local ?: return onDone(false, "local-unavailable")
local.generateStream(prompt, onToken) { ok -> onDone(ok, "local") }
}
Routing.CLOUD_ONLY -> {
val out = cloud?.generate(prompt) ?: "[cloud unavailable]"
onToken(out); onDone(true, "cloud")
}
Routing.AUTO -> {
val tokens = estimateTokens(prompt)
val useLocal = local != null && tokens <= maxLocalInputTokens
if (useLocal) {
local!!.generateStream(prompt, onToken) { ok -> onDone(ok, "local") }
} else {
val out = cloud?.generate(prompt) ?: "[cloud unavailable]"
onToken(out); onDone(true, "cloud")
}
}
}
}
}
""")
write("android/res/layout/activity_main.xml", r"""
<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android" android:layout_width="match_parent" android:layout_height="match_parent" android:orientation="vertical" android:padding="16dp">
<Button
android:id="@+id/btnDownload"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:text="Download / Load Model" />
<EditText
android:id="@+id/edtPrompt"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:hint="Enter prompt"
android:minLines="3"
android:gravity="top|start" />
<RadioGroup
android:id="@+id/radioRoute"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:orientation="horizontal">
<RadioButton android:id="@+id/optAuto" android:checked="true"
android:layout_width="wrap_content" android:layout_height="wrap_content"
android:text="Auto" />
<RadioButton android:id="@+id/optLocal"
android:layout_width="wrap_content" android:layout_height="wrap_content"
android:text="Local" />
<RadioButton android:id="@+id/optCloud"
android:layout_width="wrap_content" android:layout_height="wrap_content"
android:text="Cloud" />
</RadioGroup>
<Button
android:id="@+id/btnSend"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:text="Send" />
<TextView
android:id="@+id/txtOutput"
android:layout_width="match_parent"
android:layout_height="0dp"
android:layout_weight="1"
android:paddingTop="12dp"
android:textIsSelectable="true"
android:scrollbars="vertical"
android:textAppearance="?android:attr/textAppearanceSmall"
android:text="Ready." />
</LinearLayout>
""")
write("android/AndroidManifest.xml", r"""
<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.gemma">
<uses-permission android:name="android.permission.INTERNET"/>
<application android:label="Gemma270M Starter" android:theme="@style/Theme.AppCompat.Light.NoActionBar">
<activity android:name=".MainActivity">
<intent-filter>
<action android:name="android.intent.action.MAIN" />
<category android:name="android.intent.category.LAUNCHER" />
</intent-filter>
</activity>
</application>
</manifest>
""")
# ---------- iOS ----------
write("ios/README.md", """
# iOS Starter (MediaPipe LLM Inference, SwiftUI)
## 1) Create a new App in Xcode (SwiftUI)
- Add CocoaPods to the project and a Podfile like the one below, then run pod install and open the .xcworkspace.
target 'Gemma270MStarter' do
use_frameworks!
pod 'MediaPipeTasksGenAI'
pod 'MediaPipeTasksGenAIC'
end
2) Add these files to your app target
ContentView.swift, LocalLlm.swift, ModelManager.swift, CloudClient.swift
3) Set your model URL
In ModelManager.swift, set MODEL_URL to your CDN URL for gemma3-270m-it-q8.task.
4) Run on device (Metal)
The LLM Inference API is optimized for real devices.
""")
write("ios/ContentView.swift", r"""
import SwiftUI
struct ContentView: View {
@State private var accepted = UserDefaults.standard.bool(forKey: "gemma_terms_accepted")
@State private var prompt: String = ""
@State private var output: String = "Ready."
@State private var useRoute: Int = 0 // 0 = Auto, 1 = Local, 2 = Cloud
@StateObject private var local = LocalLlm()
private let cloud = CloudClient(baseUrl: "https://YOUR_CLOUD_ENDPOINT") // TODO
var body: some View {
VStack(alignment: .leading, spacing: 12) {
HStack {
Button("Download / Load Model") {
output.append("\nDownloading model…")
Task {
let path = try? await ModelManager.shared.ensureModel()
output.append("\nModel ready at: \(path ?? "n/a")")
if let p = path {
try? await local.load(modelPath: p)
output.append("\nLocal LLM loaded.")
}
}
}
Spacer()
}
TextEditor(text: $prompt).frame(height: 120).border(.secondary)
Picker("", selection: $useRoute) {
Text("Auto").tag(0)
Text("Local").tag(1)
Text("Cloud").tag(2)
}.pickerStyle(.segmented)
Button("Send") {
output.append("\nRouting…")
Task {
await routeAndGenerate()
}
}
ScrollView { Text(output).font(.system(size: 12, design: .monospaced))
.frame(maxWidth: .infinity, alignment: .leading) }
}
.padding()
.sheet(isPresented: .constant(!accepted)) {
TermsSheet(accepted: $accepted)
}
.onChange(of: accepted) { _, v in
UserDefaults.standard.set(v, forKey: "gemma_terms_accepted")
}
}
func routeAndGenerate() async {
let router = Router(local: local, cloud: cloud, maxLocalInputTokens: 2048)
let routing: Routing = [Routing.auto, .localOnly, .cloudOnly][useRoute]
do {
output.append("\n---\n")
try await router.generate(prompt: prompt, routing: routing,
onToken: { token in
output.append(token)
}, onDone: { ok, src in
output.append("\n\n[done: \(ok) via \(src)]")
})
} catch {
output.append("\nError: \(error.localizedDescription)")
}
}
}
struct TermsSheet: View {
@Binding var accepted: Bool
var body: some View {
VStack(spacing: 12) {
Text("Gemma Terms of Use").font(.title3).bold()
Text("You must accept the Gemma Terms of Use to download and run the model.")
HStack {
Link("View Terms", destination: URL(string: "https://ai.google.dev/gemma/terms")!)
Spacer()
Button("I Accept") { accepted = true }
}
}.padding()
}
}
""")
write("ios/LocalLlm.swift", r"""
import Foundation
import MediaPipeTasksGenAI
@MainActor
final class LocalLlm: ObservableObject {
private var llm: LlmInference? = nil
private var modelPath: String? = nil
func load(modelPath: String) async throws {
self.modelPath = modelPath
let opts = LlmInferenceOptions()
opts.baseOptions.modelPath = modelPath
opts.maxTokens = 1024
opts.topk = 40
opts.temperature = 0.8
opts.randomSeed = 101
self.llm = try LlmInference(options: opts)
}
func generate(prompt: String) async throws -> String {
guard let llm else { throw NSError(domain: "LocalLlm", code: 1, userInfo: [NSLocalizedDescriptionKey: "LLM not loaded"]) }
return try llm.generateResponse(inputText: prompt)
}
func generateStream(prompt: String,
onToken: @escaping (String) -> Void,
onDone: @escaping (Bool) -> Void) async throws {
guard let modelPath else { throw NSError(domain: "LocalLlm", code: 2, userInfo: [NSLocalizedDescriptionKey: "Model not loaded"]) }
let opts = LlmInferenceOptions()
opts.baseOptions.modelPath = modelPath
opts.maxTokens = 1024
opts.topk = 40
opts.temperature = 0.8
opts.randomSeed = 101
let streaming = try LlmInference(options: opts)
let stream = try streaming.generateResponseAsync(inputText: prompt)
Task {
do {
for try await part in stream {
onToken(part)
}
onDone(true)
} catch {
onToken("\n[error: \(error.localizedDescription)]")
onDone(false)
}
}
}
}
""")
write("ios/ModelManager.swift", r"""
import Foundation
actor ModelManager {
static let shared = ModelManager()
private init() {}
// TODO: host the model yourself; do not embed gated URLs/tokens in apps.
private let MODEL_URL = URL(string: "https://YOUR-CDN/gemma3-270m-it-q8.task")!
private let MODEL_FILE = "gemma3-270m-it-q8.task"
func localModelPath() -> String? {
let url = getDocumentsDir().appendingPathComponent(MODEL_FILE)
return FileManager.default.fileExists(atPath: url.path) ? url.path : nil
}
func ensureModel() async throws -> String {
let url = getDocumentsDir().appendingPathComponent(MODEL_FILE)
if FileManager.default.fileExists(atPath: url.path) {
return url.path
}
let (tmp, _) = try await URLSession.shared.download(from: MODEL_URL)
try FileManager.default.moveItem(at: tmp, to: url)
return url.path
}
private func getDocumentsDir() -> URL {
FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
}
}
enum Routing { case auto, localOnly, cloudOnly }
@MainActor
struct Router {
let local: LocalLlm?
let cloud: CloudClient?
let maxLocalInputTokens: Int
private func estimateTokens(_ s: String) -> Int { max(1, s.count / 4) }
func generate(prompt: String, routing: Routing,
onToken: @escaping (String) -> Void,
onDone: @escaping (Bool, String) -> Void) async throws {
switch routing {
case .localOnly:
guard let local else { onDone(false, "local-unavailable"); return }
try await local.generateStream(prompt: prompt, onToken: onToken) { ok in onDone(ok, "local") }
case .cloudOnly:
let out = try await cloud?.generate(prompt: prompt) ?? "[cloud unavailable]"
onToken(out); onDone(true, "cloud")
case .auto:
let tokens = estimateTokens(prompt)
if let local, tokens <= maxLocalInputTokens {
try await local.generateStream(prompt: prompt, onToken: onToken) { ok in onDone(ok, "local") }
} else {
let out = try await cloud?.generate(prompt: prompt) ?? "[cloud unavailable]"
onToken(out); onDone(true, "cloud")
}
}
}
}
""")
# ---------- WEB ----------
write("web/README.md", """
# Web Starter (MediaPipe LLM Inference + WebGPU)
## 1) Serve this folder locally
Any static server will do (Vite, python -m http.server, etc.).
2) Put your model on a CDN
Update config.js with a URL to gemma3-270m-it-q8-web.task on your CDN.
3) Open in a WebGPU-capable browser (Chrome/Edge recent).
""")
write("web/index.html", r"""
<!doctype html>
<html>
<head>
  <meta charset="utf-8"/>
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Gemma 3 270M — Web Starter</title>
</head>
<body>
  <div id="gate" style="padding:16px;border-bottom:1px solid #ccc;">
    <strong>Gemma Terms of Use</strong> — You must accept to download and run the model.
    <a target="_blank" href="https://ai.google.dev/gemma/terms">View Terms</a>
    <button id="accept">I Accept</button>
  </div>
  <div style="padding:16px;">
    <button id="load">Download / Load Model</button>
    <textarea id="prompt" rows="6" style="width:100%;margin-top:8px;" placeholder="Enter prompt"></textarea>
    <div style="margin:8px 0;">
      <label><input type="radio" name="route" value="auto" checked> Auto</label>
      <label><input type="radio" name="route" value="local"> Local</label>
      <label><input type="radio" name="route" value="cloud"> Cloud</label>
    </div>
    <button id="send">Send</button>
    <pre id="out" style="min-height:240px;border:1px solid #ddd;padding:8px;white-space:pre-wrap;"></pre>
  </div>
  <!-- Use ESM import from jsDelivr -->
  <script type="module" src="./main.js"></script>
</body>
</html>
""")
write("web/config.js", r"""
export const TERMS_URL = "https://ai.google.dev/gemma/terms";
// Host the model yourself; do not embed gated HF URLs in production apps.
export const MODEL_URL = "https://YOUR-CDN/gemma3-270m-it-q8-web.task";
export const WASM_ROOT = "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm";
export const CLOUD_ENDPOINT = "https://YOUR_CLOUD_ENDPOINT";
""")
write("web/router.js", r"""
export class Router {
constructor(local, cloud, maxLocalInputTokens = 2048) {
this.local = local;
this.cloud = cloud;
this.max = maxLocalInputTokens;
}
estimateTokens(s) { return Math.max(1, Math.floor(s.length / 4)); }
async generate(prompt, routing, onToken, onDone) {
if (routing === "local") {
if (!this.local) return onDone(false, "local-unavailable");
await this.local.generateStream(prompt, onToken); onDone(true, "local"); return;
}
if (routing === "cloud") {
const out = await this.cloud.generate(prompt);
onToken(out); onDone(true, "cloud"); return;
}
// AUTO
if (this.local && this.estimateTokens(prompt) <= this.max) {
await this.local.generateStream(prompt, onToken); onDone(true, "local");
} else {
const out = await this.cloud.generate(prompt);
onToken(out); onDone(true, "cloud");
}
}
}
""")
write("web/main.js", r"""
import { FilesetResolver, LlmInference } from "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest";
import { TERMS_URL, MODEL_URL, WASM_ROOT, CLOUD_ENDPOINT } from "./config.js";
import { Router } from "./router.js";
const out = document.getElementById("out");
const promptBox = document.getElementById("prompt");
const btnLoad = document.getElementById("load");
const btnSend = document.getElementById("send");
const gate = document.getElementById("gate");
document.getElementById("accept").onclick = () => gate.style.display = "none";
let llm = null;
async function initLocal() {
const genai = await FilesetResolver.forGenAiTasks(WASM_ROOT);
llm = await LlmInference.createFromOptions(genai, {
baseOptions: { modelAssetPath: MODEL_URL },
maxTokens: 768, topK: 40, temperature: 0.8, randomSeed: 101
});
}
class Local {
async generateStream(prompt, onToken) {
return new Promise((resolve, reject) => {
llm.generateResponse(prompt, (partial, done) => {
if (partial) onToken(partial);
if (done) resolve(true);
});
});
}
}
class Cloud {
async generate(prompt) {
const u = new URL(CLOUD_ENDPOINT + "/generate");
u.searchParams.set("prompt", prompt);
const res = await fetch(u, { method: "GET" });
return await res.text();
}
}
btnLoad.onclick = async () => {
out.textContent += "\nDownloading / loading model…";
await initLocal();
out.textContent += "\nLocal LLM loaded.";
};
btnSend.onclick = async () => {
const routing = document.querySelector("input[name=route]:checked")?.value ?? "auto";
const router = new Router(llm ? new Local() : null, new Cloud(), 2048);
out.textContent += "\n---\n";
await router.generate(promptBox.value, routing,
(t) => out.textContent += t,
    (ok, src) => out.textContent += `\n\n[done: ${ok} via ${src}]`);
};
""")
# ---------- Zip it ----------
zip_path = "/mnt/data/gemma270m_starters.zip"
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as z:
for base, _, files in os.walk(root):
for fn in files:
p = os.path.join(base, fn)
z.write(p, arcname=os.path.relpath(p, root))
zip_path
'/mnt/data/gemma270m_starters.zip'
Awesome—here are copy‑pasteable, cross‑platform starters for Gemma 3 270M on‑device with a license gate, model downloader, and a tiny local→cloud router (Android • iOS • Web).
What you’re getting
gemma270m_starters/
README.md
android/ # Kotlin + MediaPipe LLM Inference + simple UI + router
ios/ # SwiftUI + MediaPipe LLM Inference + router
web/ # Vanilla HTML/JS + @mediapipe/tasks-genai (WebGPU) + router
Model artifacts to host on your CDN (update the URLs in each starter):
- Android/iOS: gemma3-270m-it-q8.task
- Web: gemma3-270m-it-q8-web.task (web‑optimized) — both visible on the LiteRT Community model page. Hugging Face
Licensing: Add a first‑run “I accept” gate before download (included). Gemma’s Terms of Use require passing along restrictions when you redistribute. Google AI for Developers
Android (Kotlin, MediaPipe LLM Inference)
- Add dependency in your app module:
dependencies {
implementation 'com.google.mediapipe:tasks-genai:0.10.24'
implementation 'androidx.appcompat:appcompat:1.6.1'
implementation 'androidx.core:core-ktx:1.12.0'
implementation 'com.google.android.material:material:1.11.0'
}
This is the official package/version from the Android LLM Inference guide. Google AI for Developers
- Add the internet permission in AndroidManifest.xml:
<uses-permission android:name="android.permission.INTERNET"/>
- Drop in the files from android/:
  - MainActivity.kt – minimal chat-like UI; streams tokens
  - LicenseGate.kt – first‑run acceptance dialog linking to the Gemma Terms (opens browser) Google AI for Developers
  - ModelManager.kt – downloads gemma3-270m-it-q8.task to app storage (set your CDN URL)
  - LocalGemma.kt – wraps LlmInference (generateResponse / generateResponseAsync); import path example: com.google.mediapipe.tasks.genai.llminference.LlmInference (as used by Google samples/issues). GitHub
  - Router.kt + CloudClient.kt – simple “Auto / Local only / Cloud only” routing
Notes:
- The official guide recommends hosting the model and downloading at runtime, not bundling in the APK. Google AI for Developers
- The API is optimized for real devices (Pixel 8+/S23+) and “does not reliably support emulators.” Google AI for Developers
Core init (already in LocalGemma.kt):
val options = LlmInferenceOptions.builder()
.setModelPath(modelPath)
.setMaxTokens(1024).setTopK(40).setTemperature(0.8f).setRandomSeed(101)
.build()
val llm = LlmInference.createFromOptions(context, options)
Use generateResponseAsync for streaming with a resultListener. Google AI for Developers
iOS (SwiftUI, MediaPipe LLM Inference)
- CocoaPods (
Podfile):
target 'Gemma270MStarter' do
use_frameworks!
pod 'MediaPipeTasksGenAI'
pod 'MediaPipeTasksGenAIC'
end
These are the iOS packages from the official guide. Google AI for Developers
- Files in ios/:
  - ContentView.swift – SwiftUI UI + license sheet + router
  - LocalLlm.swift – wraps LlmInference (blocking & async streaming)
  - ModelManager.swift – downloads gemma3-270m-it-q8.task to Documents
  - CloudClient.swift – trivial GET endpoint client + Router enum/struct
Core init (already in LocalLlm.swift):
let options = LlmInferenceOptions()
options.baseOptions.modelPath = modelPath
options.maxTokens = 1024
options.topk = 40
options.temperature = 0.8
options.randomSeed = 101
let llm = try LlmInference(options: options)
Use generateResponseAsync(inputText:) to stream and iterate the async sequence. Google AI for Developers
Web (Vanilla JS + WebGPU)
Your starter is a single‑page app using the official @mediapipe/tasks-genai package and WebGPU:
- index.html – terms gate, prompt box, route picker
- main.js – loads WASM + LlmInference, streams tokens
- router.js – same simple router pattern
- config.js – set MODEL_URL to your CDN path for gemma3-270m-it-q8-web.task
Key bits (already in main.js):
import { FilesetResolver, LlmInference } from
"https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest";
const genai = await FilesetResolver.forGenAiTasks(WASM_ROOT);
const llm = await LlmInference.createFromOptions(genai, {
baseOptions: { modelAssetPath: MODEL_URL },
maxTokens: 768, topK: 40, temperature: 0.8, randomSeed: 101
});
// Streaming:
llm.generateResponse(inputPrompt, (partial, done) => {
if (partial) append(partial);
});
This follows the official Web quickstart (WebGPU required). Google AI for Developers
Why these choices map cleanly to Gemma 3 270M
- LiteRT/MediaPipe‑ready artifacts are published for 270M IT, including a web‑optimized .task variant, so you don’t need to convert to get started. Hugging Face
- Android/iOS/Web LLM Inference APIs expose the same knobs (maxTokens, topK, temperature, randomSeed) and built‑in streaming methods, which keeps your app logic symmetric across platforms. Google AI for Developers+2
What you still need to plug in
- Host the model on your CDN and update MODEL_URL in each starter. Don’t embed gated HF URLs or tokens in a client app. (The Android docs also note the model is too large to bundle—download at runtime.) Google AI for Developers
- Cloud fallback: replace the simple GET /generate?prompt=... stubs with your real endpoint (or remove cloud entirely if you want on‑device only).
References you can lean on as you extend this
- Android LLM Inference guide (dependency, init, streaming, emulator caveat). Google AI for Developers
- iOS LLM Inference guide (pods, init, streaming). Google AI for Developers
- Web LLM Inference guide (npm/CDN, WASM root, streaming callback, WebGPU). Google AI for Developers
- Gemma 3 270M IT artifacts (including gemma3-270m-it-q8-web.task). Hugging Face
- Gemma Terms of Use (add the gate & pass along restrictions when redistributing). Google AI for Developers