• Footprint & speed: 270M INT4 QAT keeps RAM/VRAM needs and power low (≈240 MB to load at Q4_0; additional memory for tokens/KV cache), making it a great fit for edge devices.
• Battery: In Google’s internal test on a Pixel 9 Pro, the INT4 variant used ~0.75% battery for ~25 short chats, which is useful as a power envelope target.
• Context: Up to 32k tokens on 270M/1B (plan defaults below use 1–4k on mobile for stability).
• Distribution: Official LiteRT + MediaPipe LLM Inference give production‑grade Android/iOS/Web runtimes, with 270M IT builds already published in the LiteRT Community on Hugging Face.
⸻
1) Architecture at a glance
On‑device first, cloud when needed
App (Android | iOS | Web)
├─ Local Inference (Gemma 3 270M INT4, default context 2k–4k)
│  ├─ JSON-mode prompts for structured outputs
│  ├─ LiteRT / MediaPipe LLM Inference runtime
│  └─ Local adapters (optional LoRA)
├─ Tooling (optional): Function Calling SDK for actions
└─ Escalation (router):
   – Larger local model (1B) on capable devices
   – Cloud (e.g., Gemini APIs) for long/complex tasks
Why this split: Empirically, 270M is excellent for structured extraction, classification, policy checks, routing, templated copy, smart‑reply—and avoids network and cost. Use a router for “hard” prompts (long context, multi‑hop). Google Developers Blog
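The split above can be sketched as a small routing function. This is only an illustration under stated assumptions: the names Route and choose_route, the multi_hop flag, the 2048-token threshold, and the 4-characters-per-token estimate are all hypothetical and not taken from any SDK.

```python
from enum import Enum


class Route(Enum):
    LOCAL_270M = "local-270m"
    LOCAL_1B = "local-1b"
    CLOUD = "cloud"


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def choose_route(prompt: str, *, max_local_tokens: int = 2048,
                 has_1b_model: bool = False, multi_hop: bool = False) -> Route:
    """Prefer the 270M model; escalate only for long or multi-hop prompts."""
    is_long = estimate_tokens(prompt) > max_local_tokens
    if not is_long and not multi_hop:
        return Route.LOCAL_270M
    # A capable device can try the larger local model for multi-hop prompts
    # that still fit in the local context window.
    if has_1b_model and not is_long:
        return Route.LOCAL_1B
    return Route.CLOUD
```

In a real app the multi_hop signal would have to come from somewhere (a task type chosen by the caller, for example); the sketch leaves that decision to the caller rather than guessing it from the prompt.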
2) Model artifacts & packaging
Primary: gemma-3-270m-it, QAT/Q4_0 (instruction‑tuned), from Google/LiteRT channels. Prefer official LiteRT/MediaPipe‑ready packages when available; otherwise convert. Hugging Face+1
Where to fetch:
– LiteRT Community (Hugging Face): published 270M IT artifacts and guidance for Android/iOS/Web. Hugging Face
– Gemma 3 release overview (context windows, sizes, QAT): Google AI for Developers
Licensing: Weights are open under Gemma Terms of Use; require users to accept terms (HF gate) at first download. Wire this into the first‑run flow. Hugging Face
3) Android (Kotlin/Java) — production path
Runtime: MediaPipe LLM Inference API on LiteRT with GPU delegate fallback to CPU. Google AI for Developers+1
Steps
Add the MediaPipe GenAI tasks dependency (com.google.mediapipe:tasks-genai) in build.gradle, taking the version from the docs.
This provides LlmInference and options for temperature/top‑k, streaming, etc. Google AI for Developers
Model distribution
Host .task or .bin (LiteRT/MediaPipe‑compatible) on your CDN; on first run, present HF license screen, then download. Cache to app‑private storage; keep an on‑disk optimized layout cache (LiteRT does one‑time optimize at first load). Google Developers Blog
For MVP, you can side‑load during dev with adb push, but ship via runtime download for production. Google AI for Developers
Initialization (simplified)
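import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInference.LlmInferenceOptions

val opts = LlmInferenceOptions.builder()
    .setModelPath(localPath)   // downloaded model
    .setMaxTokens(1024)        // includes input + output tokens
    .setTopK(40)
    .setTemperature(0.8f)
    .build()

// Create the engine once and reuse it; loading the model is the expensive step.
val llm = LlmInference.createFromOptions(context, opts)
val result = llm.generateResponse(prompt)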
Context: cap at 2k tokens by default; expose a setting for 4k on high‑end devices (S24U, Pixel 9).
Backend: allow users to pick GPU or CPU (UI toggle); GPU delegate improves throughput; KV‑cache layout and GPU weight‑sharing reduce latency/memory under the hood. Google Developers Blog
Battery budget: target ≤0.03%/turn for short prompts; profile with your prompt templates. Pixel 9 Pro reference: 0.75% per 25 short conversations. Google Developers Blog
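The per-turn target follows directly from the Pixel 9 Pro reference figure. A minimal sketch of the arithmetic, assuming each short conversation is treated as one unit of work (the function names and the 5% daily budget in the usage note are illustrative, not from the source):

```python
def battery_per_conversation(total_percent: float, conversations: int) -> float:
    """Average battery cost of one conversation, in percent of a full charge."""
    if conversations <= 0:
        raise ValueError("conversations must be positive")
    return total_percent / conversations


def conversations_within(budget_percent: float, per_conversation_percent: float) -> int:
    """How many whole conversations fit inside a battery budget."""
    return int(budget_percent / per_conversation_percent)


# Reference figure from the text: 0.75% for 25 short conversations.
per_chat = battery_per_conversation(0.75, 25)
```

With these numbers per_chat works out to 0.03%, so a hypothetical budget of 5% of battery per day would cover conversations_within(5.0, per_chat) short conversations. Your own prompt templates will shift the figure, which is why profiling is still needed.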
Bundle tiny stubs only; download the real model on first run after the user accepts the Gemma terms (HF gate). Store it in app‑private documents; enable resumable downloads; verify the hash before load. LiteRT will cache optimized tensor layouts to cut future load time. Google Developers Blog
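The "verify hash before load" and "resumable downloads" steps are platform-neutral, so here is a sketch of the logic in Python rather than Swift. It assumes you publish a SHA-256 digest next to each model file and that your CDN honors HTTP Range requests; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Stream the file through SHA-256 so large models never sit fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: Path, expected_sha256: str) -> bool:
    """True only if the file exists and matches the published digest."""
    return path.is_file() and sha256_of(path) == expected_sha256.lower()


def resume_header(partial: Path) -> dict[str, str]:
    """HTTP Range header that continues after the bytes already on disk."""
    size = partial.stat().st_size if partial.exists() else 0
    return {"Range": f"bytes={size}-"} if size else {}
```

A downloader would send resume_header(partial_file) with its request, append the response body to the partial file, and only move the file into place once verify_model returns True.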
Init (simplified)
import MediaPipeTasksGenai

let options = LlmInferenceOptions()
options.baseOptions.modelPath = localPath
options.maxTokens = 1024
options.topk = 40
options.temperature = 0.8
let llm = try LlmInference(options: options)
let out = try llm.generateResponse(inputText: prompt)
Why A: One API family across Android/iOS/Web; supports dynamic LoRA on Web. Google AI for Developers
Option B — WebLLM (MLC)
Why B: Mature in‑browser LLM engine with OpenAI‑compatible API surface; broad model zoo (MLC format). Good if you already use MLC builds. GitHub, webllm.mlc.ai
Google’s launch post even highlights a Transformers.js demo using 270M in a browser—handy for very small apps. Google Developers Blog
6) Prompting & outputs (consistent across platforms)
Guardrails: temperature ≤0.4 for extraction/classification; ≤0.8 for copywriting.
Context mgmt: sliding‑window (truncate oldest), summarize tails if >2k tokens on mobile; reserve ~30–40% of maxTokens for output.
Function calling (Android today): use Edge FC SDK to map JSON to actions (search, reminders). GitHub
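The temperature and context-management rules above can be sketched as two small helpers. Assumptions are labeled in the code: the task names, the 35% output reserve (the midpoint of the 30–40% range), and the rough 4-characters-per-token counter are illustrative. Summarizing the dropped tail is not implemented here.

```python
# Temperature ceilings from the guardrails above; task names are illustrative.
MAX_TEMPERATURE = {"extraction": 0.4, "classification": 0.4, "copywriting": 0.8}


def clamp_temperature(task: str, requested: float) -> float:
    """Cap the requested temperature; unknown tasks get the strictest ceiling."""
    return min(requested, MAX_TEMPERATURE.get(task, 0.4))


def rough_token_count(text: str) -> int:
    """Assumption: about 4 characters per token. Swap in a real tokenizer."""
    return max(1, len(text) // 4)


def trim_history(turns, max_tokens, reserve_percent=35,
                 count_tokens=rough_token_count):
    """Sliding window: keep the newest turns that fit the input budget."""
    reserve = max_tokens * reserve_percent // 100  # tokens held back for output
    budget = max_tokens - reserve
    kept = []
    used = 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break  # this turn and everything older is dropped
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Reserving output tokens matters because maxTokens in the LLM Inference options covers input and output together, so an input that fills the whole window leaves no room for a reply.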
7) CI/CD for models
Fine‑tune (full or LoRA) → export to LiteRT/MediaPipe (.task/.bin). (AI Edge Torch supports PyTorch→TFLite and integrates with LLM Inference.) GitHub+1
Quantize: prefer QAT INT4 checkpoints to preserve quality at 4‑bit. Google Developers Blog
Virus scan & hash artifacts; upload to private bucket + HF gated mirror if desired.
Release train: semantic version model IDs, A/B via remote config, roll back by ID.
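For the release-train step, here is one possible way to handle semantic-version model IDs and rollback. The "name@major.minor.patch" ID scheme and the helper names are assumptions made for illustration; the source does not prescribe a format.

```python
from typing import NamedTuple, Optional


class ModelRelease(NamedTuple):
    model_id: str  # hypothetical scheme, e.g. "gemma3-270m-it-q4@1.2.0"
    sha256: str    # digest recorded when the artifact was hashed


def parse_version(model_id: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) so versions compare numerically."""
    _, version = model_id.rsplit("@", 1)
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def latest(releases: list[ModelRelease]) -> ModelRelease:
    """Newest release by semantic version, not by string order."""
    return max(releases, key=lambda r: parse_version(r.model_id))


def rollback_target(releases: list[ModelRelease], bad_id: str) -> Optional[ModelRelease]:
    """Newest release strictly older than the one being rolled back."""
    bad = parse_version(bad_id)
    older = [r for r in releases if parse_version(r.model_id) < bad]
    return max(older, key=lambda r: parse_version(r.model_id), default=None)
```

Comparing parsed tuples avoids the classic string-sort trap where "1.10.0" sorts before "1.9.0". The remote-config value for an A/B arm would then simply be one of these model IDs.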
LiteRT Community 270M IT model (ready to use). Hugging Face
12) Nice‑to‑have extensions
WebLLM variant (OpenAI‑compatible local endpoint in browser) if you want one codepath that also works in Node/Electron. GitHub
Edge Function Calling on Android for “agentic” local actions. GitHub
Three minimal starters (Android, iOS, Web)
The three minimal starters (Android, iOS, Web) come with the dependencies wired, a license gate, a model downloader, and a router stub (on‑device → cloud), so you can drop in your task prompts.
# Create a zip archive with Android, iOS, and Web starter code for Gemma 3 270M on-device inference.
import os, textwrap, json, zipfile, hashlib, pathlib
These are **minimal starter skeletons** to run **Gemma 3 270M (INT4/INT8 .task)** on‑device via **MediaPipe LLM Inference**. They include: a **license gate**, a **model downloader**, and a tiny **router** that prefers local inference and falls back to a cloud endpoint for oversized prompts.
> ⚠️ **Licensing**: Gemma weights are under the **Gemma Terms of Use**. You must ensure users accept the terms before download, and you must host the model artifacts yourself or use gated distribution. Do *not* embed a Hugging Face token in client apps.
## What’s here
- **android/**: Kotlin Activity + downloader + Local LLM wrapper (MediaPipe) + cloud stub
- **ios/**: SwiftUI app + downloader + Local LLM wrapper (MediaPipe) + cloud stub
- **web/**: Vanilla HTML/JS using `@mediapipe/tasks-genai` over WebGPU + router stub

## Model artifacts
Recommended to mirror one of these to **your CDN** and update the URLs in each starter:
- Android/iOS: `gemma3-270m-it-q8.task` (or Q4_0/Q8 variant that matches your performance target)
- Web: `gemma3-270m-it-q8-web.task`
private lateinit var txtOutput: TextView
private lateinit var edtPrompt: EditText
private lateinit var btnSend: Button
private lateinit var radioRoute: RadioGroup
private lateinit var btnDownload: Button

private val modelManager by lazy { ModelManager(this) }
private var local: LocalGemma? = null
private val cloud = CloudClient(baseUrl = "https://YOUR_CLOUD_ENDPOINT") // TODO: replace

override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)

    if (!LicenseGate.hasAccepted(this)) {
        LicenseGate.show(this) {
            // user accepted
            setupUi()
        }
    } else {
        setupUi()
    }
}

private fun setupUi() {
    setContentView(R.layout.activity_main)
object LicenseGate {
    private const val KEY = "gemma_terms_accepted"

    fun hasAccepted(context: Context): Boolean =
        PreferenceManager.getDefaultSharedPreferences(context).getBoolean(KEY, false)

    fun show(context: Context, onAccepted: () -> Unit) {
        val dlg = AlertDialog.Builder(context)
            .setTitle("Gemma Terms of Use")
            .setMessage("You must accept the Gemma Terms of Use to download and run the model on this device.")
            .setPositiveButton("View Terms") { _, _ ->
                val i = Intent(Intent.ACTION_VIEW, Uri.parse("https://ai.google.dev/gemma/terms"))
                context.startActivity(i)
            }
            .setNeutralButton("I Accept") { d, _ ->
                PreferenceManager.getDefaultSharedPreferences(context)
                    .edit().putBoolean(KEY, true).apply()
                d.dismiss()
                onAccepted()
            }
            .setNegativeButton("Exit", null)
            .create()
        dlg.show()
    }
}
""")
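import android.content.Context
import java.io.File
import java.io.FileOutputStream
import java.io.IOException
import java.net.HttpURLConnection
import java.net.URL

class ModelManager(private val context: Context) {
    companion object {
        // TODO: host the model yourself; do not embed gated URLs/tokens in apps.
        private const val MODEL_URL = "https://YOUR-CDN/gemma3-270m-it-q8.task"
        private const val MODEL_FILE = "gemma3-270m-it-q8.task"
    }

    fun getLocalModelPath(): String? {
        val f = File(context.filesDir, MODEL_FILE)
        return if (f.exists()) f.absolutePath else null
    }

    /** Download model if missing, return absolute path. Call from a background thread. */
    fun ensureModel(): String {
        val out = File(context.filesDir, MODEL_FILE)
        if (out.exists()) return out.absolutePath
        download(MODEL_URL, out)
        return out.absolutePath
    }

    private fun download(url: String, outFile: File) {
        val conn = URL(url).openConnection() as HttpURLConnection
        conn.connectTimeout = 30000
        conn.readTimeout = 30000
        try {
            // Fail loudly on 4xx/5xx instead of saving an error page as the model.
            if (conn.responseCode != HttpURLConnection.HTTP_OK) {
                throw IOException("Model download failed: HTTP ${conn.responseCode}")
            }
            // Write to a temp file first so an interrupted download is never
            // mistaken for a complete model by ensureModel().
            val tmp = File(outFile.parentFile, outFile.name + ".part")
            conn.inputStream.use { input ->
                FileOutputStream(tmp).use { output ->
                    val buf = ByteArray(1 shl 16)
                    while (true) {
                        val n = input.read(buf)
                        if (n < 0) break
                        output.write(buf, 0, n)
                    }
                }
            }
            if (!tmp.renameTo(outFile)) throw IOException("Could not move model into place")
        } finally {
            conn.disconnect()
        }
    }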
    func routeAndGenerate() async {
        let router = Router(local: local, cloud: cloud, maxLocalInputTokens: 2048)
        let routing: Routing = [Routing.auto, .localOnly, .cloudOnly][useRoute]
        do {
            output.append("\n---\n")
            try await router.generate(prompt: prompt, routing: routing,
                onToken: { token in output.append(token) },
                onDone: { ok, src in output.append("\n\n[done: \(ok) via \(src)]") })
        } catch {
            output.append("\nError: \(error.localizedDescription)")
        }
    }
}

struct TermsSheet: View {
    @Binding var accepted: Bool
    var body: some View {
        VStack(spacing: 12) {
            Text("Gemma Terms of Use").font(.title3).bold()
            Text("You must accept the Gemma Terms of Use to download and run the model.")
            HStack {
                Link("View Terms", destination: URL(string: "https://ai.google.dev/gemma/terms")!)
                Spacer()
                Button("I Accept") { accepted = true }
            }
        }.padding()
    }
}
""")
write("ios/LocalLlm.swift", r"""
import Foundation
import MediaPipeTasksGenai

    func generateStream(prompt: String,
                        onToken: @escaping (String) -> Void,
                        onDone: @escaping (Bool) -> Void) async throws {
        guard let modelPath else {
            throw NSError(domain: "LocalLlm", code: 2,
                          userInfo: [NSLocalizedDescriptionKey: "Model not loaded"])
        }
        let opts = LlmInferenceOptions()
        opts.baseOptions.modelPath = modelPath
        opts.maxTokens = 1024
        opts.topk = 40
        opts.temperature = 0.8
        opts.randomSeed = 101
        let streaming = try LlmInference(options: opts)
        let stream = try streaming.generateResponseAsync(inputText: prompt)
        Task {
            do {
                for try await part in stream { onToken(part) }
                onDone(true)
            } catch {
                onToken("\n[error: \(error.localizedDescription)]")
                onDone(false)
            }
        }
    }
}
""")
write("ios/ModelManager.swift", r"""
import Foundation

actor ModelManager {
    static let shared = ModelManager()
    private init() {}
    // TODO: host the model yourself; do not embed gated URLs/tokens in apps.
    private let MODEL_URL = URL(string: "https://YOUR-CDN/gemma3-270m-it-q8.task")!
    private let MODEL_FILE = "gemma3-270m-it-q8.task"
zip_path = "/mnt/data/gemma270m_starters.zip"
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as z:
    for base, _, files in os.walk(root):
        for fn in files:
            p = os.path.join(base, fn)
            z.write(p, arcname=os.path.relpath(p, root))
zip_path
'/mnt/data/gemma270m_starters.zip'
Awesome—here are copy‑pasteable, cross‑platform starters for Gemma 3 270M on‑device with a license gate, model downloader, and a tiny local→cloud router (Android • iOS • Web).
Model artifacts to host on your CDN (update the URLs in each starter):
Android/iOS: gemma3-270m-it-q8.task
Web: gemma3-270m-it-q8-web.task (web‑optimized) — both visible on the LiteRT Community model page. Hugging Face
Licensing: Add a first‑run “I accept” gate before download (included). Gemma’s Terms of Use require passing along restrictions when you redistribute. Google AI for Developers
LiteRT/MediaPipe‑ready artifacts are published for 270M IT, including a web‑optimized .task variant—so you don’t need to convert to get started. Hugging Face
Android/iOS/Web LLM Inference APIs expose the same knobs (maxTokens, topK, temperature, randomSeed) and built‑in streaming methods, which keeps your app logic symmetric across platforms. Google AI for Developers
What you still need to plug in
Host the model on your CDN and update MODEL_URL in each starter. Don’t embed gated HF URLs or tokens in a client app. (The Android docs also note the model is too large to bundle—download at runtime.) Google AI for Developers
Cloud fallback: replace the simple GET /generate?prompt=... stubs with your real endpoint (or remove cloud entirely if you want on‑device‑only).
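When you replace the GET stub, one option is to move the prompt out of the query string and into a POST body, since long prompts can exceed URL length limits and query strings tend to end up in access logs. The sketch below only builds the request; the /generate path comes from the stub above, while the JSON field names and the function name are assumptions about your own endpoint.

```python
import json
from urllib.request import Request


def build_generate_request(base_url: str, prompt: str, max_tokens: int = 1024) -> Request:
    """Build a POST request carrying the prompt as JSON instead of a query string."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then a matter of passing the request to urllib.request.urlopen with a timeout, and adding whatever authentication your endpoint requires.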
References you can lean on as you extend this
Android LLM Inference guide (dependency, init, streaming, emulator caveat). Google AI for Developers