0) What we’re optimizing for
• Footprint & speed: 270M INT4 QAT keeps RAM/VRAM needs and power low (≈240 MB to load at Q4_0; additional memory for tokens/KV cache), making it a great fit for edge devices.
• Battery: Google’s internal test on a Pixel 9 Pro: ~0.75% battery for ~25 short chats with the INT4 variant—useful as a power envelope target.
• Context: Up to 32k tokens on 270M/1B (plan defaults below use 1–4k on mobile for stability).
• Distribution: Official LiteRT + MediaPipe LLM Inference give production‑grade Android/iOS/Web runtimes, with 270M IT builds already published in the LiteRT Community on Hugging Face.
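As a worked number (derived from the Pixel 9 Pro figure above), the per-chat power envelope is:

```python
# Per-chat battery budget derived from the published reference:
# ~0.75% battery across ~25 short chats with the INT4 variant.

def per_turn_budget(total_battery_pct: float, turns: int) -> float:
    return total_battery_pct / turns

budget = per_turn_budget(0.75, 25)  # ~0.03% battery per short chat
```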
⸻
1) Architecture at a glance
On‑device first, cloud when needed
App (Android | iOS | Web)
├─ Local Inference (Gemma 3 270M INT4, default context 2k–4k)
│ ├─ JSON-mode prompts for structured outputs
│ ├─ LiteRT / MediaPipe LLM Inference runtime
│ └─ Local adapters (optional LoRA)
├─ Tooling (optional): Function Calling SDK for actions
└─ Escalation (router):
– Larger local model (1B) on capable devices
– Cloud (e.g., Gemini APIs) for long/complex tasks
- Why this split: Empirically, 270M is excellent for structured extraction, classification, policy checks, routing, templated copy, and smart‑reply, while avoiding network latency and cost. Use a router for "hard" prompts (long context, multi‑hop). Google Developers Blog
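The escalation split can be sketched as a tiny router; the 2048-token local cap and the 4-chars-per-token estimate below are illustrative assumptions to tune, not prescribed values:

```python
# Minimal escalation router: prefer the on-device 270M model, fall back to
# a larger local model or cloud for "hard" prompts. Thresholds are assumptions.

def estimate_tokens(text: str) -> int:
    # Crude proxy: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def route(prompt: str, multi_hop: bool = False, local_cap: int = 2048) -> str:
    if multi_hop:
        return "cloud"        # multi-hop / complex reasoning -> escalate
    if estimate_tokens(prompt) > local_cap:
        return "cloud"        # long context -> escalate
    return "local"            # default: on-device 270M
```

In production, replace the character heuristic with the model's real tokenizer.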
2) Model artifacts & packaging
- Primary:
gemma-3-270m-it QAT/Q4_0 (instruction‑tuned) from Google/LiteRT channels. Prefer official LiteRT/MediaPipe‑ready packages when available; otherwise convert. Hugging Face+1
- Where to fetch:
– LiteRT Community (Hugging Face): published 270M IT artifacts and guidance for Android/iOS/Web. Hugging Face
– Gemma 3 release overview (context windows, sizes, QAT): Google AI for Developers
- Licensing: Weights are open under the Gemma Terms of Use; require users to accept the terms (HF gate) at first download. Wire this into the first‑run flow. Hugging Face
3) Android (Kotlin/Java) — production path
Runtime: MediaPipe LLM Inference API on LiteRT with GPU delegate fallback to CPU. Google AI for Developers+1
Steps
- Add the dependency in build.gradle (versions per docs):
dependencies {
implementation 'com.google.mediapipe:tasks-genai:0.10.24'
}
This provides LlmInference and options for temperature/top‑k, streaming, etc. Google AI for Developers
- Model distribution
- Host .task or .bin (LiteRT/MediaPipe‑compatible) on your CDN; on first run, present the HF license screen, then download. Cache to app‑private storage; keep an on‑disk optimized layout cache (LiteRT does a one‑time optimization at first load). Google Developers Blog
- For MVP, you can side‑load during dev with
adb push, but ship via runtime download for production. Google AI for Developers
- Initialization (simplified)
val opts = LlmInferenceOptions.builder()
.setModelPath(localPath) // downloaded model
.setMaxTokens(1024) // include input+output
.setTopK(40).setTemperature(0.8f).build()
val llm = LlmInference.createFromOptions(context, opts)
val result = llm.generateResponse(prompt)
Use generateResponseAsync for streaming. Google AI for Developers
- Performance defaults
- Context: cap at 2k tokens by default; expose a setting for 4k on high‑end devices (S24U, Pixel 9).
- Backend: allow users to pick GPU or CPU (UI toggle); GPU delegate improves throughput; KV‑cache layout and GPU weight‑sharing reduce latency/memory under the hood. Google Developers Blog
- Battery budget: target ≤0.03%/turn for short prompts; profile with your prompt templates. Pixel 9 Pro reference: 0.75% per 25 short conversations. Google Developers Blog
- Optional features
- RAG SDK (Edge): for simple doc QA with local embeddings. Google Developers Blog
- Function Calling SDK (Edge): map model outputs to local actions (search, alarms). GitHub
- LoRA adapters: if you fine‑tune for vertical tasks; Android supports LoRA (GPU path) with MediaPipe convertor. Google AI for Developers
- Sample app to crib from: AI Edge Gallery and LLM Demo. Google AI for Developers
4) iOS (Swift) — production path
Runtime: MediaPipe LLM Inference API on LiteRT with Metal acceleration; CocoaPods package. Google AI for Developers
Steps
- Pods
target 'MyLlmInferenceApp' do
use_frameworks!
pod 'MediaPipeTasksGenAI'
pod 'MediaPipeTasksGenAIC'
end
- Model distribution
- Bundle tiny stubs only; download the real model on first run after the user accepts the Gemma terms (HF gate). Store in app‑private documents; enable resumable downloads; verify the hash before load. LiteRT will cache optimized tensor layouts to cut future load time. Google Developers Blog
- Init (simplified)
import MediaPipeTasksGenai
let options = LlmInferenceOptions()
options.baseOptions.modelPath = localPath
options.maxTokens = 1024
options.topk = 40
options.temperature = 0.8
let llm = try LlmInference(options: options)
let out = try llm.generateResponse(inputText: prompt)
Use generateResponseAsync for streaming. Google AI for Developers
- Performance defaults
- Context: default 2k; offer 4k on iPad Pro/modern iPhones.
- Memory heads‑up: iOS enforces strict memory limits; keep KV‑cache conservative, and stream outputs.
- LoRA: iOS supports static LoRA at init via converted adapters for Gemma classes; conversion via MediaPipe tools. Google AI for Developers
- Alt path (if you prefer OSS stack):
llama.cpp GGUF with Metal is viable, but MediaPipe/LiteRT will be simpler to maintain across OS updates. GitHub
- Sample: iOS MediaPipe LLM Inference sample app. Google AI for Developers
5) Web — two good options
Option A — MediaPipe LLM Inference (WebGPU)
- Requirements: a modern browser with WebGPU. Google AI for Developers, MDN Web Docs
- Install:
npm i @mediapipe/tasks-genai (or via CDN).
- Init (simplified):
import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';
const genai = await FilesetResolver.forGenAiTasks(
'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm'
);
const llm = await LlmInference.createFromOptions(genai, {
baseOptions: { modelAssetPath: '/assets/gemma-3-270m-it.task' },
maxTokens: 768, topK: 40, temperature: 0.8, randomSeed: 101
});
const text = await llm.generateResponse(prompt);
- Why A: One API family across Android/iOS/Web; supports dynamic LoRA on Web. Google AI for Developers
Option B — WebLLM (MLC)
- Why B: Mature in‑browser LLM engine with an OpenAI‑compatible API surface; broad model zoo (MLC format). Good if you already use MLC builds. GitHub, webllm.mlc.ai
Google’s launch post even highlights a Transformers.js demo using 270M in a browser—handy for very small apps. Google Developers Blog
6) Prompting & outputs (consistent across platforms)
- JSON‑first templates (require strict keys, minimal prose).
- Guardrails: temperature ≤0.4 for extraction/classification; ≤0.8 for copywriting.
- Context mgmt: sliding window (truncate oldest); summarize tails if >2k tokens on mobile; reserve ~30–40% of maxTokens for output.
- Function calling (Android today): use the Edge FC SDK to map JSON to actions (search, reminders). GitHub
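As a sketch, a JSON-first template plus a strict validity check might look like this (the schema keys are illustrative, not prescribed by the source):

```python
import json

# JSON-first template: strict keys, minimal prose. Keys are illustrative.
# str.replace (not str.format) is used so the literal JSON braces survive.
TEMPLATE = (
    "Classify the message. Respond with ONLY a JSON object, no prose, "
    'using exactly these keys: {"category": string, "confidence": number}.\n'
    "Message: <message>"
)

def build_prompt(message: str) -> str:
    return TEMPLATE.replace("<message>", message)

def parse_strict(raw: str, required=("category", "confidence")):
    """Return the parsed dict if the model honored the contract, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or set(obj) != set(required):
        return None
    return obj
```

Pair this with a low temperature (≤0.4) and retry once on a None result before escalating.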
7) CI/CD for models
- Fine‑tune (full or LoRA) → export to LiteRT/MediaPipe (.task/.bin). (AI Edge Torch supports PyTorch→TFLite and integrates with LLM Inference.) GitHub+1
- Quantize: prefer QAT INT4 checkpoints to preserve quality at 4‑bit. Google Developers Blog
- Virus scan & hash artifacts; upload to private bucket + HF gated mirror if desired.
- Release train: semantic version model IDs, A/B via remote config, roll back by ID.
- Client downloads + verifies SHA‑256; keep per‑version caches for instant rollback.
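A minimal sketch of the client-side verify step (file paths and the expected hash come from your release manifest; names here are placeholders):

```python
import hashlib

# Verify a downloaded model artifact against the hash published in the
# release manifest before loading it; keep per-version files for rollback.

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):   # stream in 1 MiB chunks
            h.update(block)
    return h.hexdigest()

def verify_artifact(path: str, expected_hex: str) -> bool:
    return sha256_of(path) == expected_hex
```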
8) Performance budgets & test matrix
- Load memory (model only): ~240 MB (Q4_0). Plan headroom for KV cache (varies with batch × heads × layers × context). Google AI for Developers
- TTFT target: <600 ms (warm) on flagship phones; <250 ms on desktop WebGPU. (Use MediaPipe samples to benchmark.) Google Developers Blog
- Throughput: prioritize prefill speed (tokenizing + initial attention); keep max input <2k on mobile.
- Matrix: Pixel 9 / S24U / mid‑range Android; iPhone 15/14; iPad Pro; Chrome/Safari/Edge (WebGPU on). Google AI for Developers
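For KV-cache headroom, the standard back-of-envelope formula helps; the shape numbers in the example are hypothetical, not Gemma 3 270M's actual config:

```python
# KV cache bytes ~= 2 (K and V) x layers x kv_heads x head_dim x context
#                   x bytes_per_element (2 for fp16/bf16).

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bytes_per_elem: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

# Hypothetical shape: 20 layers, 4 KV heads of dim 64, 2k context, fp16.
mb = kv_cache_bytes(20, 4, 64, 2048) / (1024 * 1024)  # 40.0 MB
```

Substitute the real model config before budgeting; halving the context cap halves this term.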
9) Privacy, safety, and compliance
- On‑device by default: no payload leaves device for 270M path.
- Content policy: enforce Gemma Prohibited Use + your own constraints; show the terms gate when fetching from HF. Google AI for Developers+1
- Telemetry: opt‑in, coarse device class only; never log raw prompts.
- Eval sets: per domain (e.g., classification/extraction for your content sites) and run pre‑/post‑deploy checks (precision/recall, JSON validity rate).
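The JSON-validity-rate check can be this simple (the sample outputs in the test are stand-ins for real model responses):

```python
import json

def json_validity_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that parse as JSON at all."""
    ok = sum(1 for raw in outputs if _parses(raw))
    return ok / len(outputs) if outputs else 0.0

def _parses(raw: str) -> bool:
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False
```

Run it over a frozen eval set before and after each model release and alert on regressions.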
10) Rollout plan (4 sprints)
Sprint 0—Plumbing
- Pick Option A (MediaPipe) for all three platforms to minimize divergence.
- Wire model download + license acceptance.
- Ship a hidden diagnostics screen: device info, backend, TTFT, tokens/s.
Sprint 1—MVP
- Android + Web MVP (shared prompts).
- Tasks: classification / extraction / short rewrites; cap maxTokens to 768–1024.
- Add cloud fallback for >2k input or low confidence.
Sprint 2—iOS + adapters
- iOS client with streaming; polish memory handling.
- Optional LoRA adapter for one high‑value task (e.g., policy/scoring). Google AI for Developers
Sprint 3—Hardening
- A/B model versions; battery soak; offline‑first UX (progress bars, resume).
- Add RAG for on‑device doc snippets where helpful. Google Developers Blog
11) Quick links you’ll actually need
- Gemma 3 270M announcement (battery, QAT, positioning). Google Developers Blog
- Gemma 3 overview (sizes, context, memory table). Google AI for Developers
- LiteRT overview (Android/iOS/Web support). Google AI for Developers
- Android LLM Inference guide (API, Gradle dep, code). Google AI for Developers
- iOS LLM Inference guide (Pods, Swift code). Google AI for Developers
- Web LLM Inference guide (WebGPU, npm/cdn, code). Google AI for Developers
- LiteRT Community 270M IT model (ready to use). Hugging Face
12) Nice‑to‑have extensions
- WebLLM variant (OpenAI‑compatible local endpoint in browser) if you want one codepath that also works in Node/Electron. GitHub
- Edge Function Calling on Android for “agentic” local actions. GitHub
Three minimal starters (Android, iOS, Web)
Below are three minimal starters with the dependencies wired, a license gate, a model downloader, and a router stub (on‑device → cloud), so you can drop in your task prompts.
# Create a zip archive with Android, iOS, and Web starter code for Gemma 3 270M on-device inference.
import os, textwrap, json, zipfile, hashlib, pathlib
root = "/mnt/data/gemma270m_starters"
os.makedirs(root, exist_ok=True)
def write(path, content):
path = os.path.join(root, path)
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w", encoding="utf-8") as f:
f.write(content.lstrip("\n"))
# ---------- TOP-LEVEL README ----------
write("README.md", f"""
# Gemma 3 270M — On‑Device Starters (Android • iOS • Web)
These are **minimal starter skeletons** to run **Gemma 3 270M (INT4/INT8 .task)** on‑device via **MediaPipe LLM Inference**.
They include: a **license gate**, a **model downloader**, and a tiny **router** that prefers local inference and falls back to a cloud endpoint for oversized prompts.
> ⚠️ **Licensing**: Gemma weights are under the **Gemma Terms of Use**. You must ensure users accept the terms before download, and you must host the model artifacts yourself or use gated distribution. Do *not* embed a Hugging Face token in client apps.
## What’s here
- **android/**: Kotlin Activity + downloader + Local LLM wrapper (MediaPipe) + cloud stub
- **ios/**: SwiftUI app + downloader + Local LLM wrapper (MediaPipe) + cloud stub
- **web/**: Vanilla HTML/JS using `@mediapipe/tasks-genai` over WebGPU + router stub
## Quick links
- MediaPipe LLM Inference (Android): https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android
- MediaPipe LLM Inference (iOS): https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/ios
- MediaPipe LLM Inference (Web): https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js
- LiteRT Community (Gemma 3 270M IT): https://huggingface.co/litert-community/gemma-3-270m-it
- Gemma Terms of Use: https://ai.google.dev/gemma/terms
## Model artifacts
Recommended to mirror one of these to **your CDN** and update the URLs in each starter:
- Android/iOS: `gemma3-270m-it-q8.task` (or Q4_0/Q8 variant that matches your performance target)
- Web: `gemma3-270m-it-q8-web.task`
""")
# ---------- ANDROID ----------
write("android/README.md", """
# Android Starter (MediaPipe LLM Inference)
## 1) Create a new Android Studio project
- Template: **Empty Views Activity** (Kotlin), minSdk ≥ 26.
- Add the dependency in `app/build.gradle`:
```gradle
dependencies {
implementation 'com.google.mediapipe:tasks-genai:0.10.24'
implementation 'androidx.appcompat:appcompat:1.6.1'
implementation 'androidx.core:core-ktx:1.12.0'
implementation 'com.google.android.material:material:1.11.0'
}
```
Add permission in app/src/main/AndroidManifest.xml:
<uses-permission android:name="android.permission.INTERNET"/>
## 2) Drop the code in app/src/main/java/com/example/gemma/
MainActivity.kt, LicenseGate.kt, ModelManager.kt, LocalGemma.kt, CloudClient.kt, Router.kt from this folder.
## 3) Set your model URL
In ModelManager.kt, set MODEL_URL to your CDN URL for gemma3-270m-it-q8.task.
Do not ship HF access tokens inside apps.
## 4) Run on a real device (GPU preferred)
The MediaPipe LLM Inference API is optimized for real devices; emulators are not reliable.
""")
write("android/MainActivity.kt", r"""
package com.example.gemma
import android.os.Bundle
import android.text.method.ScrollingMovementMethod
import android.widget.*
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.lifecycleScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext
class MainActivity : AppCompatActivity() {
private lateinit var txtOutput: TextView
private lateinit var edtPrompt: EditText
private lateinit var btnSend: Button
private lateinit var radioRoute: RadioGroup
private lateinit var btnDownload: Button
private val modelManager by lazy { ModelManager(this) }
private var local: LocalGemma? = null
private val cloud = CloudClient(baseUrl = "https://YOUR_CLOUD_ENDPOINT") // TODO: replace
override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)
if (!LicenseGate.hasAccepted(this)) {
LicenseGate.show(this) {
// user accepted
setupUi()
}
} else {
setupUi()
}
}
private fun setupUi() {
setContentView(R.layout.activity_main)
txtOutput = findViewById(R.id.txtOutput)
edtPrompt = findViewById(R.id.edtPrompt)
btnSend = findViewById(R.id.btnSend)
radioRoute = findViewById(R.id.radioRoute)
btnDownload = findViewById(R.id.btnDownload)
txtOutput.movementMethod = ScrollingMovementMethod()
btnDownload.setOnClickListener {
lifecycleScope.launch {
appendLine("Downloading model…")
val path = withContext(Dispatchers.IO) { modelManager.ensureModel() }
appendLine("Model ready at: $path")
loadLocal(path)
}
}
btnSend.setOnClickListener {
val prompt = edtPrompt.text.toString().trim()
if (prompt.isEmpty()) return@setOnClickListener
lifecycleScope.launch { routeAndGenerate(prompt) }
}
// Autoload if already present
lifecycleScope.launch {
modelManager.getLocalModelPath()?.let { loadLocal(it) }
}
}
private suspend fun loadLocal(path: String) = withContext(Dispatchers.IO) {
try {
local?.close()
local = LocalGemma(this@MainActivity, path).also { it.load() }
appendLine("Local LLM loaded.")
} catch (e: Exception) {
appendLine("Failed to init local LLM: ${e.message}")
}
}
private suspend fun routeAndGenerate(prompt: String) = withContext(Dispatchers.IO) {
val routing = when (radioRoute.checkedRadioButtonId) {
R.id.optLocal -> Routing.LOCAL_ONLY
R.id.optCloud -> Routing.CLOUD_ONLY
else -> Routing.AUTO
}
val router = Router(local, cloud, maxLocalInputTokens = 2048)
appendLine("Routing: $routing")
try {
router.generate(prompt, routing, onToken = { token ->
runOnUiThread { txtOutput.append(token) }
}, onDone = { ok, source ->
runOnUiThread { appendLine("\n\n[done: $ok via $source]") }
})
} catch (e: Exception) {
appendLine("Error: ${e.message}")
}
}
private fun appendLine(msg: String) = runOnUiThread {
txtOutput.append("\n$msg")
}
override fun onDestroy() {
super.onDestroy()
local?.close()
}
}
""")
write("android/LicenseGate.kt", r"""
package com.example.gemma
import android.app.AlertDialog
import android.content.Context
import android.content.Intent
import android.net.Uri
import android.preference.PreferenceManager
object LicenseGate {
private const val KEY = "gemma_terms_accepted"
fun hasAccepted(context: Context): Boolean =
PreferenceManager.getDefaultSharedPreferences(context).getBoolean(KEY, false)
fun show(context: Context, onAccepted: () -> Unit) {
val dlg = AlertDialog.Builder(context)
.setTitle("Gemma Terms of Use")
.setMessage("You must accept the Gemma Terms of Use to download and run the model on this device.")
.setPositiveButton("View Terms") { _, _ ->
val i = Intent(Intent.ACTION_VIEW, Uri.parse("https://ai.google.dev/gemma/terms"))
context.startActivity(i)
}
.setNeutralButton("I Accept") { d, _ ->
PreferenceManager.getDefaultSharedPreferences(context)
.edit().putBoolean(KEY, true).apply()
d.dismiss()
onAccepted()
}
.setNegativeButton("Exit", null)
.create()
dlg.show()
}
}
""")
write("android/ModelManager.kt", r"""
package com.example.gemma
import android.content.Context
import java.io.File
import java.io.FileOutputStream
import java.net.HttpURLConnection
import java.net.URL
class ModelManager(private val context: Context) {
companion object {
// TODO: host the model yourself; do not embed gated URLs/tokens in apps.
private const val MODEL_URL = "https://YOUR-CDN/gemma3-270m-it-q8.task"
private const val MODEL_FILE = "gemma3-270m-it-q8.task"
}
fun getLocalModelPath(): String? {
val f = File(context.filesDir, MODEL_FILE)
return if (f.exists()) f.absolutePath else null
}
/** Download model if missing, return absolute path. */
fun ensureModel(): String {
val out = File(context.filesDir, MODEL_FILE)
if (out.exists()) return out.absolutePath
download(MODEL_URL, out)
return out.absolutePath
}
private fun download(url: String, outFile: File) {
val conn = URL(url).openConnection() as HttpURLConnection
conn.connectTimeout = 30000
conn.readTimeout = 30000
conn.inputStream.use { input ->
FileOutputStream(outFile).use { output ->
val buf = ByteArray(1 shl 16)
while (true) {
val n = input.read(buf)
if (n <= 0) break
output.write(buf, 0, n)
}
}
}
}
}
""")
write("android/LocalGemma.kt", r"""
package com.example.gemma
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceOptions
class LocalGemma(private val context: Context, private val modelPath: String) : AutoCloseable {
private var llm: LlmInference? = null
fun load() {
val options = LlmInferenceOptions.builder()
.setModelPath(modelPath)
.setMaxTokens(1024) // input + output
.setTopK(40)
.setTemperature(0.8f)
.setRandomSeed(101)
.build()
llm = LlmInference.createFromOptions(context, options)
}
fun generateStream(prompt: String, onToken: (String) -> Unit, onDone: (Boolean) -> Unit) {
val inst = llm ?: throw IllegalStateException("LLM not loaded")
val opts = LlmInferenceOptions.builder()
.setModelPath(modelPath)
.setMaxTokens(1024)
.setTopK(40).setTemperature(0.8f).setRandomSeed(101)
.setResultListener { part, done ->
onToken(part ?: "")
if (done) onDone(true)
}
.setErrorListener { e ->
onToken("\n[error: $e]"); onDone(false)
}
.build()
val streaming = LlmInference.createFromOptions(context, opts)
streaming.generateResponseAsync(prompt)
}
fun generate(prompt: String): String {
val inst = llm ?: throw IllegalStateException("LLM not loaded")
return inst.generateResponse(prompt)
}
override fun close() {
llm?.close()
llm = null
}
}
""")
write("android/CloudClient.kt", r"""
package com.example.gemma
import java.io.BufferedReader
import java.io.InputStreamReader
import java.net.HttpURLConnection
import java.net.URL
class CloudClient(private val baseUrl: String) {
/** Blocking demo GET endpoint: /generate?prompt=... Replace with your own. */
fun generate(prompt: String): String {
val url = URL("${baseUrl.trimEnd('/')}/generate?prompt=" + java.net.URLEncoder.encode(prompt, "UTF-8"))
val conn = url.openConnection() as HttpURLConnection
conn.connectTimeout = 30000
conn.readTimeout = 30000
return conn.inputStream.use { input ->
BufferedReader(InputStreamReader(input)).readText()
}
}
}
""")
write("android/Router.kt", r"""
package com.example.gemma
enum class Routing { AUTO, LOCAL_ONLY, CLOUD_ONLY }
class Router(
private val local: LocalGemma?,
private val cloud: CloudClient?,
private val maxLocalInputTokens: Int = 2048,
) {
/** Very crude token estimator (~4 characters per token). Replace with a real tokenizer if needed. */
private fun estimateTokens(s: String) = (s.length / 4).coerceAtLeast(1)
fun generate(
prompt: String,
routing: Routing,
onToken: (String) -> Unit,
onDone: (Boolean, String) -> Unit
) {
when (routing) {
Routing.LOCAL_ONLY -> {
local ?: return onDone(false, "local-unavailable")
local.generateStream(prompt, onToken) { ok -> onDone(ok, "local") }
}
Routing.CLOUD_ONLY -> {
val out = cloud?.generate(prompt) ?: "[cloud unavailable]"
onToken(out); onDone(true, "cloud")
}
Routing.AUTO -> {
val tokens = estimateTokens(prompt)
val useLocal = local != null && tokens <= maxLocalInputTokens
if (useLocal) {
local!!.generateStream(prompt, onToken) { ok -> onDone(ok, "local") }
} else {
val out = cloud?.generate(prompt) ?: "[cloud unavailable]"
onToken(out); onDone(true, "cloud")
}
}
}
}
}
""")
write("android/res/layout/activity_main.xml", r"""
<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android" android:layout_width="match_parent" android:layout_height="match_parent" android:orientation="vertical" android:padding="16dp">
<Button
android:id="@+id/btnDownload"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:text="Download / Load Model" />
<EditText
android:id="@+id/edtPrompt"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:hint="Enter prompt"
android:minLines="3"
android:gravity="top|start" />
<RadioGroup
android:id="@+id/radioRoute"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:orientation="horizontal">
<RadioButton android:id="@+id/optAuto" android:checked="true"
android:layout_width="wrap_content" android:layout_height="wrap_content"
android:text="Auto" />
<RadioButton android:id="@+id/optLocal"
android:layout_width="wrap_content" android:layout_height="wrap_content"
android:text="Local" />
<RadioButton android:id="@+id/optCloud"
android:layout_width="wrap_content" android:layout_height="wrap_content"
android:text="Cloud" />
</RadioGroup>
<Button
android:id="@+id/btnSend"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:text="Send" />
<TextView
android:id="@+id/txtOutput"
android:layout_width="match_parent"
android:layout_height="0dp"
android:layout_weight="1"
android:paddingTop="12dp"
android:textIsSelectable="true"
android:scrollbars="vertical"
android:textAppearance="?android:attr/textAppearanceSmall"
android:text="Ready." />
</LinearLayout>
""")
write("android/AndroidManifest.xml", r"""
<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.gemma">
<uses-permission android:name="android.permission.INTERNET"/>
<application android:label="Gemma270M Starter" android:theme="@style/Theme.AppCompat.Light.NoActionBar">
<activity android:name=".MainActivity">
<intent-filter>
<action android:name="android.intent.action.MAIN" />
<category android:name="android.intent.category.LAUNCHER" />
</intent-filter>
</activity>
</application>
</manifest>
""")
# ---------- iOS ----------
write("ios/README.md", """
# iOS Starter (MediaPipe LLM Inference, SwiftUI)
## 1) Create a new App in Xcode (SwiftUI)
- Add CocoaPods to the project and a Podfile like below, then run pod install and open the .xcworkspace:
target 'Gemma270MStarter' do
use_frameworks!
pod 'MediaPipeTasksGenAI'
pod 'MediaPipeTasksGenAIC'
end
## 2) Add these files to your app target
ContentView.swift, LocalLlm.swift, ModelManager.swift, CloudClient.swift
## 3) Set your model URL
In ModelManager.swift, set MODEL_URL to your CDN URL for gemma3-270m-it-q8.task.
## 4) Run on device (Metal)
The LLM Inference API is optimized for real devices.
""")
write("ios/ContentView.swift", r"""
import SwiftUI
struct ContentView: View {
@State private var accepted = UserDefaults.standard.bool(forKey: "gemma_terms_accepted")
@State private var prompt: String = ""
@State private var output: String = "Ready."
@State private var useRoute: Int = 0 // 0 = Auto, 1 = Local, 2 = Cloud
@StateObject private var local = LocalLlm()
private let cloud = CloudClient(baseUrl: "https://YOUR_CLOUD_ENDPOINT") // TODO
var body: some View {
VStack(alignment: .leading, spacing: 12) {
HStack {
Button("Download / Load Model") {
output.append("\nDownloading model…")
Task {
let path = try? await ModelManager.shared.ensureModel()
output.append("\nModel ready at: \(path ?? "n/a")")
if let p = path {
try? await local.load(modelPath: p)
output.append("\nLocal LLM loaded.")
}
}
}
Spacer()
}
TextEditor(text: $prompt).frame(height: 120).border(.secondary)
Picker("", selection: $useRoute) {
Text("Auto").tag(0)
Text("Local").tag(1)
Text("Cloud").tag(2)
}.pickerStyle(.segmented)
Button("Send") {
output.append("\nRouting…")
Task {
await routeAndGenerate()
}
}
ScrollView { Text(output).font(.system(size: 12, design: .monospaced))
.frame(maxWidth: .infinity, alignment: .leading) }
}
.padding()
.sheet(isPresented: .constant(!accepted)) {
TermsSheet(accepted: $accepted)
}
.onChange(of: accepted) { _, v in
UserDefaults.standard.set(v, forKey: "gemma_terms_accepted")
}
}
func routeAndGenerate() async {
let router = Router(local: local, cloud: cloud, maxLocalInputTokens: 2048)
let routing: Routing = [Routing.auto, .localOnly, .cloudOnly][useRoute]
do {
output.append("\n---\n")
try await router.generate(prompt: prompt, routing: routing,
onToken: { token in
output.append(token)
}, onDone: { ok, src in
output.append("\n\n[done: \(ok) via \(src)]")
})
} catch {
output.append("\nError: \(error.localizedDescription)")
}
}
}
struct TermsSheet: View {
@Binding var accepted: Bool
var body: some View {
VStack(spacing: 12) {
Text("Gemma Terms of Use").font(.title3).bold()
Text("You must accept the Gemma Terms of Use to download and run the model.")
HStack {
Link("View Terms", destination: URL(string: "https://ai.google.dev/gemma/terms")!)
Spacer()
Button("I Accept") { accepted = true }
}
}.padding()
}
}
""")
write("ios/LocalLlm.swift", r"""
import Foundation
import MediaPipeTasksGenai
@MainActor
final class LocalLlm: ObservableObject {
private var llm: LlmInference? = nil
private var modelPath: String? = nil
func load(modelPath: String) async throws {
self.modelPath = modelPath
let opts = LlmInferenceOptions()
opts.baseOptions.modelPath = modelPath
opts.maxTokens = 1024
opts.topk = 40
opts.temperature = 0.8
opts.randomSeed = 101
self.llm = try LlmInference(options: opts)
}
func generate(prompt: String) async throws -> String {
guard let llm else { throw NSError(domain: "LocalLlm", code: 1, userInfo: [NSLocalizedDescriptionKey: "LLM not loaded"]) }
return try llm.generateResponse(inputText: prompt)
}
func generateStream(prompt: String,
onToken: @escaping (String) -> Void,
onDone: @escaping (Bool) -> Void) async throws {
guard let modelPath else { throw NSError(domain: "LocalLlm", code: 2, userInfo: [NSLocalizedDescriptionKey: "Model not loaded"]) }
let opts = LlmInferenceOptions()
opts.baseOptions.modelPath = modelPath
opts.maxTokens = 1024
opts.topk = 40
opts.temperature = 0.8
opts.randomSeed = 101
let streaming = try LlmInference(options: opts)
let stream = try streaming.generateResponseAsync(inputText: prompt)
Task {
do {
for try await part in stream {
onToken(part)
}
onDone(true)
} catch {
onToken("\n[error: \(error.localizedDescription)]")
onDone(false)
}
}
}
}
""")
write("ios/ModelManager.swift", r"""
import Foundation
actor ModelManager {
static let shared = ModelManager()
private init() {}
// TODO: host the model yourself; do not embed gated URLs/tokens in apps.
private let MODEL_URL = URL(string: "https://YOUR-CDN/gemma3-270m-it-q8.task")!
private let MODEL_FILE = "gemma3-270m-it-q8.task"
func localModelPath() -> String? {
let url = getDocumentsDir().appendingPathComponent(MODEL_FILE)
return FileManager.default.fileExists(atPath: url.path) ? url.path : nil
}
func ensureModel() async throws -> String {
let url = getDocumentsDir().appendingPathComponent(MODEL_FILE)
if FileManager.default.fileExists(atPath: url.path) {
return url.path
}
let (tmp, _) = try await URLSession.shared.download(from: MODEL_URL)
try FileManager.default.moveItem(at: tmp, to: url)
return url.path
}
private func getDocumentsDir() -> URL {
FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
}
}
enum Routing { case auto, localOnly, cloudOnly }
@MainActor
struct Router {
let local: LocalLlm?
let cloud: CloudClient?
let maxLocalInputTokens: Int
private func estimateTokens(_ s: String) -> Int { max(1, s.count / 4) }
func generate(prompt: String, routing: Routing,
onToken: @escaping (String) -> Void,
onDone: @escaping (Bool, String) -> Void) async throws {
switch routing {
case .localOnly:
guard let local else { onDone(false, "local-unavailable"); return }
try await local.generateStream(prompt: prompt, onToken: onToken) { ok in onDone(ok, "local") }
case .cloudOnly:
let out = try await cloud?.generate(prompt: prompt) ?? "[cloud unavailable]"
onToken(out); onDone(true, "cloud")
case .auto:
let tokens = estimateTokens(prompt)
if let local, tokens <= maxLocalInputTokens {
try await local.generateStream(prompt: prompt, onToken: onToken) { ok in onDone(ok, "local") }
} else {
let out = try await cloud?.generate(prompt: prompt) ?? "[cloud unavailable]"
onToken(out); onDone(true, "cloud")
}
}
}
}
""")
# ---------- WEB ----------
write("web/README.md", """
# Web Starter (MediaPipe LLM Inference + WebGPU)
## 1) Serve this folder locally
Any static server will do (Vite, python -m http.server, etc.).
## 2) Put your model on a CDN
Update config.js with a URL to gemma3-270m-it-q8-web.task on your CDN.
## 3) Open in a WebGPU-capable browser (recent Chrome/Edge).
""")
write("web/index.html", r"""
<!doctype html>
<html>
<head>
  <meta charset="utf-8"/>
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Gemma 3 270M — Web Starter</title>
</head>
<body>
  <div id="gate" style="padding:16px;border-bottom:1px solid #ccc;">
    <strong>Gemma Terms of Use</strong> — You must accept to download and run the model.
    <a target="_blank" href="https://ai.google.dev/gemma/terms">View Terms</a>
    <button id="accept">I Accept</button>
  </div>
  <div style="padding:16px;">
    <button id="load">Download / Load Model</button>
    <textarea id="prompt" rows="6" style="width:100%;margin-top:8px;" placeholder="Enter prompt"></textarea>
    <div style="margin:8px 0;">
      <label><input type="radio" name="route" value="auto" checked> Auto</label>
      <label><input type="radio" name="route" value="local"> Local</label>
      <label><input type="radio" name="route" value="cloud"> Cloud</label>
    </div>
    <button id="send">Send</button>
    <pre id="out" style="min-height:240px;border:1px solid #ddd;padding:8px;white-space:pre-wrap;"></pre>
  </div>
  <!-- Use ESM import from jsDelivr -->
  <script type="module" src="./main.js"></script>
</body>
</html>
""")
write("web/config.js", r"""
export const TERMS_URL = "https://ai.google.dev/gemma/terms";
// Host the model yourself; do not embed gated HF URLs in production apps.
export const MODEL_URL = "https://YOUR-CDN/gemma3-270m-it-q8-web.task";
export const WASM_ROOT = "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm";
export const CLOUD_ENDPOINT = "https://YOUR_CLOUD_ENDPOINT";
""")
write("web/router.js", r"""
export class Router {
constructor(local, cloud, maxLocalInputTokens = 2048) {
this.local = local;
this.cloud = cloud;
this.max = maxLocalInputTokens;
}
estimateTokens(s) { return Math.max(1, Math.floor(s.length / 4)); }
async generate(prompt, routing, onToken, onDone) {
if (routing === "local") {
if (!this.local) return onDone(false, "local-unavailable");
await this.local.generateStream(prompt, onToken); onDone(true, "local"); return;
}
if (routing === "cloud") {
const out = await this.cloud.generate(prompt);
onToken(out); onDone(true, "cloud"); return;
}
// AUTO
if (this.local && this.estimateTokens(prompt) <= this.max) {
await this.local.generateStream(prompt, onToken); onDone(true, "local");
} else {
const out = await this.cloud.generate(prompt);
onToken(out); onDone(true, "cloud");
}
}
}
""")
write("web/main.js", r"""
import { FilesetResolver, LlmInference } from "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest";
import { TERMS_URL, MODEL_URL, WASM_ROOT, CLOUD_ENDPOINT } from "./config.js";
import { Router } from "./router.js";
const out = document.getElementById("out");
const promptBox = document.getElementById("prompt");
const btnLoad = document.getElementById("load");
const btnSend = document.getElementById("send");
const gate = document.getElementById("gate");
document.getElementById("accept").onclick = () => gate.style.display = "none";
let llm = null;
async function initLocal() {
const genai = await FilesetResolver.forGenAiTasks(WASM_ROOT);
llm = await LlmInference.createFromOptions(genai, {
baseOptions: { modelAssetPath: MODEL_URL },
maxTokens: 768, topK: 40, temperature: 0.8, randomSeed: 101
});
}
class Local {
async generateStream(prompt, onToken) {
return new Promise((resolve, reject) => {
llm.generateResponse(prompt, (partial, done) => {
if (partial) onToken(partial);
if (done) resolve(true);
});
});
}
}
class Cloud {
async generate(prompt) {
const u = new URL(CLOUD_ENDPOINT + "/generate");
u.searchParams.set("prompt", prompt);
const res = await fetch(u, { method: "GET" });
return await res.text();
}
}
btnLoad.onclick = async () => {
out.textContent += "\nDownloading / loading model…";
await initLocal();
out.textContent += "\nLocal LLM loaded.";
};
btnSend.onclick = async () => {
const routing = document.querySelector("input[name=route]:checked")?.value ?? "auto";
const router = new Router(llm ? new Local() : null, new Cloud(), 2048);
out.textContent += "\n---\n";
await router.generate(promptBox.value, routing,
(t) => out.textContent += t,
(ok, src) => out.textContent += `\n\n[done: ${ok} via ${src}]`);
};
""")
# Zip it
zip_path = "/mnt/data/gemma270m_starters.zip"
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as z:
for base, _, files in os.walk(root):
for fn in files:
p = os.path.join(base, fn)
z.write(p, arcname=os.path.relpath(p, root))
zip_path
'/mnt/data/gemma270m_starters.zip'
Awesome—here are copy‑pasteable, cross‑platform starters for Gemma 3 270M on‑device with a license gate, model downloader, and a tiny local→cloud router (Android • iOS • Web).
What you’re getting
gemma270m_starters/
README.md
android/ # Kotlin + MediaPipe LLM Inference + simple UI + router
ios/ # SwiftUI + MediaPipe LLM Inference + router
web/ # Vanilla HTML/JS + @mediapipe/tasks-genai (WebGPU) + router
Model artifacts to host on your CDN (update the URLs in each starter):
- Android/iOS: gemma3-270m-it-q8.task
- Web: gemma3-270m-it-q8-web.task (web-optimized) — both visible on the LiteRT Community model page. Hugging Face
Licensing: Add a first‑run “I accept” gate before download (included). Gemma’s Terms of Use require passing along restrictions when you redistribute. Google AI for Developers
Android (Kotlin, MediaPipe LLM Inference)
- Add dependency in your app module:
dependencies {
implementation 'com.google.mediapipe:tasks-genai:0.10.24'
implementation 'androidx.appcompat:appcompat:1.6.1'
implementation 'androidx.core:core-ktx:1.12.0'
implementation 'com.google.android.material:material:1.11.0'
}
This is the official package/version from the Android LLM Inference guide. Google AI for Developers
- Add internet permission in
AndroidManifest.xml:
<uses-permission android:name="android.permission.INTERNET"/>
- Drop in the files from
android/:
MainActivity.kt – minimal chat-like UI; streams tokens
LicenseGate.kt – first-run acceptance dialog linking to the Gemma Terms (opens browser) Google AI for Developers
ModelManager.kt – downloads gemma3-270m-it-q8.task to app storage (set your CDN URL)
LocalGemma.kt – wraps LlmInference (generateResponse / generateResponseAsync)
Import path example: com.google.mediapipe.tasks.genai.llminference.LlmInference (as used by Google samples/issues). GitHub
Router.kt + CloudClient.kt – simple "Auto / Local only / Cloud only" routing
Notes:
- The official guide recommends hosting the model and downloading at runtime, not bundling in the APK. Google AI for Developers
- The API is optimized for real devices (Pixel 8+/S23+) and “does not reliably support emulators.” Google AI for Developers
Core init (already in LocalGemma.kt):
val options = LlmInferenceOptions.builder()
.setModelPath(modelPath)
.setMaxTokens(1024).setTopK(40).setTemperature(0.8f).setRandomSeed(101)
.build()
val llm = LlmInference.createFromOptions(context, options)
Use generateResponseAsync for streaming with a resultListener. Google AI for Developers
iOS (SwiftUI, MediaPipe LLM Inference)
- CocoaPods (
Podfile):
target 'Gemma270MStarter' do
use_frameworks!
pod 'MediaPipeTasksGenAI'
pod 'MediaPipeTasksGenAIC'
end
These are the iOS packages from the official guide. Google AI for Developers
- Files in
ios/:
ContentView.swift – SwiftUI UI + license sheet + router
LocalLlm.swift – wraps LlmInference (blocking & async streaming)
ModelManager.swift – downloads gemma3-270m-it-q8.task to Documents
CloudClient.swift – trivial GET endpoint client + Router enum/struct
Core init (already in LocalLlm.swift):
let options = LlmInferenceOptions()
options.baseOptions.modelPath = modelPath
options.maxTokens = 1024
options.topk = 40
options.temperature = 0.8
options.randomSeed = 101
let llm = try LlmInference(options: options)
Use generateResponseAsync(inputText:) to stream and iterate the async sequence. Google AI for Developers
Web (Vanilla JS + WebGPU)
Your starter is a single‑page app using the official @mediapipe/tasks-genai package and WebGPU:
index.html – terms gate, prompt box, route picker
main.js – loads WASM + LlmInference, streams tokens
router.js – same simple router pattern
config.js – set MODEL_URL to your CDN path for gemma3-270m-it-q8-web.task
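One gap worth noting: Android and iOS get a ModelManager that persists the download, while the web starter fetches MODEL_URL on every page load. You can cache the .task file with the browser's Cache Storage API so repeat visits skip the large download. A minimal sketch — the helper name `cachedModelUrl` and the cache name are mine, not part of the starter:

```javascript
// Cache the model file with the Cache Storage API so repeat visits
// don't re-download it; MediaPipe then loads from a local object URL.
// (Helper name and cache name are illustrative, not from the starter.)
async function cachedModelUrl(modelUrl, cacheName = "gemma-models") {
  const cache = await caches.open(cacheName);
  let res = await cache.match(modelUrl);
  if (!res) {
    res = await fetch(modelUrl);
    if (!res.ok) throw new Error(`model download failed: ${res.status}`);
    await cache.put(modelUrl, res.clone()); // store a copy for next visit
  }
  // Hand MediaPipe a local object URL instead of the remote URL.
  return URL.createObjectURL(await res.blob());
}
```

Then pass it through the existing options, e.g. `baseOptions: { modelAssetPath: await cachedModelUrl(MODEL_URL) }`.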
Key bits (already in main.js):
import { FilesetResolver, LlmInference } from
"https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest";
const genai = await FilesetResolver.forGenAiTasks(WASM_ROOT);
const llm = await LlmInference.createFromOptions(genai, {
baseOptions: { modelAssetPath: MODEL_URL },
maxTokens: 768, topK: 40, temperature: 0.8, randomSeed: 101
});
// Streaming:
llm.generateResponse(inputPrompt, (partial, done) => {
if (partial) append(partial);
});
This follows the official Web quickstart (WebGPU required). Google AI for Developers
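If you use 270M for the structured-extraction jobs it's best suited to, parse the streamed output defensively — small models sometimes wrap JSON in prose or code fences. A sketch of a brace-balanced extractor (the helper name `extractFirstJson` is mine, not part of the starter):

```javascript
// Extract the first balanced JSON object from model output.
// Scans for the first '{' and tracks brace depth, ignoring braces
// inside string literals, then attempts JSON.parse on that slice.
function extractFirstJson(text) {
  const start = text.indexOf("{");
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (ch === "\\") i++;            // skip escaped character
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) {
        try { return JSON.parse(text.slice(start, i + 1)); }
        catch { return null; }         // balanced but not valid JSON
      }
    }
  }
  return null;                         // unbalanced: output was cut off
}
```

For example, `extractFirstJson('Sure! {"intent":"refund"}')` returns `{ intent: "refund" }`, while a truncated stream returns `null` so you can retry or fall back.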
Why these choices map cleanly to Gemma 3 270M
- LiteRT/MediaPipe-ready artifacts are published for 270M IT, including a web-optimized .task variant—so you don't need to convert to get started. Hugging Face
- Android/iOS/Web LLM Inference APIs expose the same knobs (maxTokens, topK, temperature, randomSeed) and built-in streaming methods, which keeps your app logic symmetric across platforms. Google AI for Developers
What you still need to plug in
- Host the model on your CDN and update MODEL_URL in each starter. Don't embed gated HF URLs or tokens in a client app. (The Android docs also note the model is too large to bundle—download at runtime.) Google AI for Developers
- Cloud fallback: replace the simple GET /generate?prompt=... stubs with your real endpoint (or remove cloud entirely if you want on-device-only).
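When you swap in a real endpoint, prefer POST with a JSON body and a timeout: long prompts overflow URL length limits, and GET query strings tend to land in proxy/CDN logs. A sketch under those assumptions — the /generate path and plain-text response mirror the stub; everything else is illustrative:

```javascript
// Cloud fallback via POST with an abortable timeout. Unlike the GET
// stub, the prompt travels in the request body, not the URL.
class CloudClient {
  constructor(endpoint, timeoutMs = 15000) {
    this.endpoint = endpoint;
    this.timeoutMs = timeoutMs;
  }

  async generate(prompt) {
    const ctrl = new AbortController();
    const timer = setTimeout(() => ctrl.abort(), this.timeoutMs);
    try {
      const res = await fetch(`${this.endpoint}/generate`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt }),
        signal: ctrl.signal,
      });
      if (!res.ok) throw new Error(`cloud error: ${res.status}`);
      return await res.text();
    } finally {
      clearTimeout(timer); // don't abort a request that already finished
    }
  }
}
```

The class is drop-in for the starter's Cloud stub since the router only calls `generate(prompt)`.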
References you can lean on as you extend this
- Android LLM Inference guide (dependency, init, streaming, emulator caveat). Google AI for Developers
- iOS LLM Inference guide (pods, init, streaming). Google AI for Developers
- Web LLM Inference guide (npm/CDN, WASM root, streaming callback, WebGPU). Google AI for Developers
- Gemma 3 270M IT artifacts (including gemma3-270m-it-q8-web.task). Hugging Face
- Gemma Terms of Use (add the gate & pass along restrictions when redistributing). Google AI for Developers