0) What we’re optimizing for

• Footprint & speed: 270M INT4 QAT keeps RAM/VRAM needs and power low (≈240 MB to load at Q4_0; additional memory for tokens/KV cache), making it a great fit for edge devices.

• Battery: Google’s internal test on a Pixel 9 Pro: ~0.75% battery for ~25 short chats with the INT4 variant—useful as a power envelope target.

• Context: Up to 32k tokens on 270M/1B (plan defaults below use 1–4k on mobile for stability).

• Distribution: Official LiteRT + MediaPipe LLM Inference give production‑grade Android/iOS/Web runtimes, with 270M IT builds already published in the LiteRT Community on Hugging Face.

⸻

1) Architecture at a glance

On‑device first, cloud when needed

App (Android | iOS | Web)
├─ Local Inference (Gemma 3 270M INT4, default context 2k–4k)
│  ├─ JSON-mode prompts for structured outputs
│  ├─ LiteRT / MediaPipe LLM Inference runtime
│  └─ Local adapters (optional LoRA)
├─ Tooling (optional): Function Calling SDK for actions
└─ Escalation (router):
   ├─ Larger local model (1B) on capable devices
   └─ Cloud (e.g., Gemini APIs) for long/complex tasks

Why this split: 270M handles structured extraction, classification, policy checks, routing, templated copy, and smart reply well, and running it on-device avoids network round trips and per-request cost. Send “hard” prompts (long context, multi-hop reasoning) through the router to a larger model. Google Developers Blog
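
The escalation rule above can be sketched as a small pure function. This is a minimal sketch, not part of any SDK: `DeviceProfile`, `choose_route`, the route names, and both token thresholds are hypothetical, and the token estimate assumes roughly 4 characters per token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceProfile:
    """Hypothetical description of which local models a device can run."""
    has_local_270m: bool
    has_local_1b: bool


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in a real tokenizer if needed."""
    return max(1, len(text) // 4)


def choose_route(prompt: str, device: DeviceProfile,
                 max_270m_tokens: int = 2048, max_1b_tokens: int = 4096) -> str:
    """Prefer the smallest local model that fits the prompt, else fall back to cloud."""
    tokens = estimate_tokens(prompt)
    if device.has_local_270m and tokens <= max_270m_tokens:
        return "local-270m"
    if device.has_local_1b and tokens <= max_1b_tokens:
        return "local-1b"
    return "cloud"
```

Keeping the decision in one side-effect-free function makes it easy to unit test and to port to Kotlin, Swift, and JavaScript with identical behavior.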

2) Model artifacts & packaging

Primary: gemma-3-270m-it QAT/Q4_0 (instruction‑tuned) from Google/LiteRT channels. Prefer official LiteRT/MediaPipe‑ready packages when available; otherwise convert. Hugging Face
Where to fetch:
– LiteRT Community (Hugging Face): published 270M IT artifacts and guidance for Android/iOS/Web. Hugging Face
– Gemma 3 release overview (context windows, sizes, QAT): Google AI for Developers
Licensing: Weights are open under Gemma Terms of Use; require users to accept terms (HF gate) at first download. Wire this into the first‑run flow. Hugging Face

3) Android (Kotlin/Java) — production path

Runtime: MediaPipe LLM Inference API on LiteRT, using the GPU delegate and falling back to CPU. Google AI for Developers

Steps

Add dependency in build.gradle (versions per docs):

dependencies {
  implementation 'com.google.mediapipe:tasks-genai:0.10.24'
}

This provides LlmInference and options for temperature/top‑k, streaming, etc. Google AI for Developers

Model distribution

Host the .task or .bin file (LiteRT/MediaPipe‑compatible) on your CDN; on first run, present the HF license screen, then download. Cache the model in app‑private storage, and keep the on‑disk optimized‑layout cache (LiteRT performs a one‑time optimization at first load). Google Developers Blog
For MVP, you can side‑load during dev with adb push, but ship via runtime download for production. Google AI for Developers

Initialization (simplified)

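// Option names follow the Android LLM Inference guide; they may differ between
// tasks-genai releases, so check the guide for the version you pin.
val opts = LlmInferenceOptions.builder()
  .setModelPath(localPath)        // path of the downloaded model file
  .setMaxTokens(1024)             // budget covers input + output tokens
  .setTopK(40).setTemperature(0.8f).build()

val llm = LlmInference.createFromOptions(context, opts)
val result = llm.generateResponse(prompt)   // blocking call; run it off the main thread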

Use generateResponseAsync for streaming. Google AI for Developers

Performance defaults

Context: cap at 2k tokens by default; expose a setting for 4k on high‑end devices (Galaxy S24 Ultra, Pixel 9).
Backend: allow users to pick GPU or CPU (UI toggle); GPU delegate improves throughput; KV‑cache layout and GPU weight‑sharing reduce latency/memory under the hood. Google Developers Blog
Battery budget: target ≤0.03% per short conversation (0.75% ÷ 25, from the Pixel 9 Pro reference of 0.75% per 25 short conversations); profile with your own prompt templates. Google Developers Blog
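
The battery target is simple arithmetic on the reference figure, so it is worth encoding as a check in your soak tests. A minimal sketch, assuming you can read the battery percentage before and after a scripted run; the function names are hypothetical.

```python
REFERENCE_DRAIN_PERCENT = 0.75   # Pixel 9 Pro reference figure quoted above
REFERENCE_CONVERSATIONS = 25


def per_conversation_budget(drain_percent: float = REFERENCE_DRAIN_PERCENT,
                            conversations: int = REFERENCE_CONVERSATIONS) -> float:
    """Battery percentage allowed per short conversation."""
    return drain_percent / conversations


def within_budget(measured_drain_percent: float, conversations: int) -> bool:
    """True if a measured soak run stays at or under the reference budget."""
    measured = measured_drain_percent / conversations
    # Small tolerance so floating-point rounding does not fail an exact match.
    return measured <= per_conversation_budget() + 1e-12
```

For example, a run that drains 1.2% over 50 scripted conversations averages 0.024% each and passes, while 2.0% over 50 (0.04% each) fails.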

Optional features

RAG SDK (Edge): for simple doc QA with local embeddings. Google Developers Blog
Function Calling SDK (Edge): map model outputs to local actions (search, alarms). GitHub
LoRA adapters: use these if you fine‑tune for vertical tasks; Android supports LoRA (GPU path) via the MediaPipe converter. Google AI for Developers

Sample apps to crib from: AI Edge Gallery and LLM Demo. Google AI for Developers

4) iOS (Swift) — production path

Runtime: MediaPipe LLM Inference API on LiteRT with Metal acceleration; CocoaPods package. Google AI for Developers

Steps

Pods

target 'MyLlmInferenceApp' do
  use_frameworks!
  pod 'MediaPipeTasksGenAI'
  pod 'MediaPipeTasksGenAIC'
end

Google AI for Developers

Model distribution

Bundle tiny stubs only; download the real model on first run after the user accepts the Gemma terms (HF gate). Store it in app‑private documents, enable resumable downloads, and verify the hash before load. LiteRT will cache optimized tensor layouts to cut future load time. Google Developers Blog

Init (simplified)

import MediaPipeTasksGenai

let options = LlmInferenceOptions()
options.baseOptions.modelPath = localPath
options.maxTokens = 1024
options.topk = 40
options.temperature = 0.8
let llm = try LlmInference(options: options)
let out = try llm.generateResponse(inputText: prompt)

Use generateResponseAsync for streaming. Google AI for Developers

Performance defaults

Context: default 2k; offer 4k on iPad Pro/modern iPhones.
Memory heads‑up: iOS enforces strict memory limits; keep KV‑cache conservative, and stream outputs.
LoRA: iOS supports static LoRA at init via converted adapters for Gemma classes; conversion via MediaPipe tools. Google AI for Developers

Alt path (if you prefer OSS stack): llama.cpp GGUF with Metal is viable, but MediaPipe/LiteRT will be simpler to maintain across OS updates. GitHub
Sample: iOS MediaPipe LLM Inference sample app. Google AI for Developers

5) Web — two good options

Option A — MediaPipe LLM Inference (WebGPU)

Requirements: modern browser with WebGPU. Google AI for Developers MDN Web Docs
Install: npm i @mediapipe/tasks-genai (or via CDN).
Init (simplified):

import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

const genai = await FilesetResolver.forGenAiTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm'
);

const llm = await LlmInference.createFromOptions(genai, {
  baseOptions: { modelAssetPath: '/assets/gemma-3-270m-it.task' },
  maxTokens: 768, topK: 40, temperature: 0.8, randomSeed: 101
});
const text = await llm.generateResponse(prompt);

Why A: One API family across Android/iOS/Web; supports dynamic LoRA on Web. Google AI for Developers

Option B — WebLLM (MLC)

Why B: Mature in‑browser LLM engine with OpenAI‑compatible API surface; broad model zoo (MLC format). Good if you already use MLC builds. GitHub webllm.mlc.ai

Google’s launch post even highlights a Transformers.js demo using 270M in a browser—handy for very small apps. Google Developers Blog

6) Prompting & outputs (consistent across platforms)

JSON‑first templates (require strict keys, minimal prose).
Sampling guardrails: temperature ≤0.4 for extraction/classification; ≤0.8 for copywriting.
Context management: use a sliding window (truncate the oldest turns first) and summarize the dropped turns when a conversation exceeds 2k tokens on mobile; reserve ~30–40% of maxTokens for output.
Function calling (Android today): use Edge FC SDK to map JSON to actions (search, reminders). GitHub
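
The context rules above (reserve a share of maxTokens for output, then drop the oldest turns first) can be sketched as follows. This is an illustrative sketch only: `split_budget` and `sliding_window` are hypothetical helpers, the 35% output share is just the midpoint of the 30–40% range, and the token estimate assumes about 4 characters per token.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token)."""
    return max(1, len(text) // 4)


def split_budget(max_tokens: int, output_share: float = 0.35) -> Tuple[int, int]:
    """Split maxTokens into (input_budget, output_budget)."""
    output_budget = int(max_tokens * output_share)
    return max_tokens - output_budget, output_budget


def sliding_window(turns: List[str], input_budget: int) -> List[str]:
    """Keep the most recent turns that fit the input budget, dropping the oldest first."""
    kept: List[str] = []
    used = 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > input_budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order
```

With maxTokens set to 1024, this leaves 666 tokens for input and 358 for output; turns that fall outside the window are the ones to summarize.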

7) CI/CD for models

Fine‑tune (full or LoRA) → export to LiteRT/MediaPipe (.task/.bin). (AI Edge Torch supports PyTorch→TFLite and integrates with LLM Inference.) GitHub
Quantize: prefer QAT INT4 checkpoints to preserve quality at 4‑bit. Google Developers Blog
Virus scan & hash artifacts; upload to private bucket + HF gated mirror if desired.
Release train: semantic version model IDs, A/B via remote config, roll back by ID.
Client downloads + verifies SHA‑256; keep per‑version caches for instant rollback.
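
The verify-then-cache step can be sketched like this. It is a minimal sketch under stated assumptions: `install_model`, the cache layout (one directory per model ID), and the example model ID format are hypothetical, and the expected hash is assumed to arrive with your remote config.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Hash a file in chunks so large model files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_model(downloaded: Path, cache_dir: Path,
                  model_id: str, expected_sha256: str) -> Path:
    """Verify a downloaded artifact, then move it into a per-version cache directory."""
    actual = sha256_of(downloaded)
    if actual != expected_sha256.lower():
        downloaded.unlink()   # never leave an unverified file where a loader could find it
        raise ValueError(f"SHA-256 mismatch for {model_id}: got {actual}")
    target_dir = cache_dir / model_id
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / downloaded.name
    downloaded.replace(target)
    return target
```

Because each model ID gets its own directory, rolling back is a matter of pointing the loader at the previous ID rather than downloading again.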

8) Performance budgets & test matrix

Load memory (model only): ~240 MB (Q4_0). Plan headroom for KV cache (varies with batch × heads × layers × context). Google AI for Developers
TTFT target: <600 ms (warm) on flagship phones; <250 ms on desktop WebGPU. (Use MediaPipe samples to benchmark.) Google Developers Blog
Throughput: prioritize prefill speed (tokenizing + initial attention); keep max input <2k on mobile.
Matrix: Pixel 9 / Galaxy S24 Ultra / mid‑range Android; iPhone 15/14; iPad Pro; Chrome/Safari/Edge (WebGPU on). Google AI for Developers
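
To plan KV-cache headroom, a back-of-the-envelope estimate is usually enough. The sketch below uses the standard formula (keys plus values, per layer, per KV head, per position); the architecture numbers in the usage note are illustrative placeholders, not values from the Gemma model card, and runtimes that use sliding-window attention or quantized caches will need less.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2,
                   batch: int = 1) -> int:
    """Estimate KV-cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value defaults to 2 (fp16); adjust it for your runtime.
    """
    return 2 * batch * layers * kv_heads * head_dim * context_tokens * bytes_per_value


def to_mib(num_bytes: int) -> float:
    """Convert bytes to mebibytes for readable reports."""
    return num_bytes / (1024 * 1024)
```

With placeholder values of 16 layers, 2 KV heads, a head dimension of 128, and a 2,048-token context at fp16, the estimate is 32 MiB, and it scales linearly with context length.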

9) Privacy, safety, and compliance

On‑device by default: no payload leaves device for 270M path.
Content policy: enforce the Gemma Prohibited Use Policy plus your own constraints; show the terms gate when fetching from HF. Google AI for Developers
Telemetry: opt‑in, coarse device class only; never log raw prompts.
Eval sets: build one per domain (e.g., classification/extraction for your content sites) and run pre‑ and post‑deploy checks (precision/recall, JSON validity rate).
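
The two checks named above can be computed with a few lines of standard-library code. This is a sketch with hypothetical helper names; it assumes model outputs are raw strings and that a "valid" output is a JSON object containing every required key.

```python
import json
from typing import Iterable, List, Tuple


def json_validity_rate(outputs: List[str], required_keys: Iterable[str]) -> float:
    """Share of outputs that parse as a JSON object containing all required keys."""
    required = set(required_keys)
    valid = 0
    for raw in outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and required.issubset(obj):
            valid += 1
    return valid / len(outputs) if outputs else 0.0


def precision_recall(predicted: List[str], expected: List[str],
                     positive: str) -> Tuple[float, float]:
    """Precision and recall for one label, treating it as the positive class."""
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p == positive and e == positive)
    fp = sum(1 for p, e in pairs if p == positive and e != positive)
    fn = sum(1 for p, e in pairs if p != positive and e == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Running both before and after a model swap gives a quick regression signal without logging any raw user prompts.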

10) Rollout plan (4 sprints)

Sprint 0—Plumbing

Pick Option A (MediaPipe) for all three platforms to minimize divergence.
Wire model download + license acceptance.
Ship a hidden diagnostics screen: device info, backend, TTFT, tokens/s.
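
The TTFT and throughput numbers for the diagnostics screen can be derived from any streaming callback. A minimal sketch, assuming the runtime exposes the response as an iterable of text chunks; `measure_stream` is a hypothetical helper, and since a streamed chunk is not always exactly one token, the rate is reported per chunk.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(stream: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Consume a streamed response and report time-to-first-chunk and decode rate."""
    start = clock()
    first = None
    count = 0
    for _ in stream:
        now = clock()
        if first is None:
            first = now               # arrival of the first chunk defines TTFT
        count += 1
    end = clock()
    ttft_ms = (first - start) * 1000 if first is not None else float("nan")
    decode_s = (end - first) if first is not None else 0.0
    # Rate over the decode phase only, so prefill time does not dilute it.
    rate = (count - 1) / decode_s if count > 1 and decode_s > 0 else 0.0
    return {"ttft_ms": ttft_ms, "chunks": float(count), "chunks_per_s": rate}
```

Injecting the clock keeps the function deterministic under test; on device you would pass the platform's monotonic timer.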

Sprint 1—MVP

Android + Web MVP (shared prompts).
Tasks: classification / extraction / short rewrites; cap maxTokens to 768–1024.
Add cloud fallback for >2k input or low confidence.

Sprint 2—iOS + adapters

iOS client with streaming; polish memory handling.
Optional LoRA adapter for one high‑value task (e.g., policy/scoring). Google AI for Developers

Sprint 3—Hardening

A/B model versions; battery soak; offline‑first UX (progress bars, resume).
Add RAG for on‑device doc snippets where helpful. Google Developers Blog

11) Quick links you’ll actually need

Gemma 3 270M announcement (battery, QAT, positioning). Google Developers Blog
Gemma 3 overview (sizes, context, memory table). Google AI for Developers
LiteRT overview (Android/iOS/Web support). Google AI for Developers
Android LLM Inference guide (API, Gradle dep, code). Google AI for Developers
iOS LLM Inference guide (Pods, Swift code). Google AI for Developers
Web LLM Inference guide (WebGPU, npm/cdn, code). Google AI for Developers
LiteRT Community 270M IT model (ready to use). Hugging Face

12) Nice‑to‑have extensions

WebLLM variant (OpenAI‑compatible local endpoint in browser) if you want one codepath that also works in Node/Electron. GitHub
Edge Function Calling on Android for “agentic” local actions. GitHub

Three minimal starters (Android, iOS, Web)

The script below generates three minimal starters (Android, iOS, Web), each with the dependencies wired, a license gate, a model downloader, and a router stub (on‑device → cloud), so you can drop in your task prompts.

# Create a zip archive with Android, iOS, and Web starter code for Gemma 3 270M on-device inference.
import os, textwrap, json, zipfile, hashlib, pathlib

root = "/mnt/data/gemma270m_starters"
os.makedirs(root, exist_ok=True)

def write(path, content):
    path = os.path.join(root, path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content.lstrip("\n"))

# ---------- TOP-LEVEL README ----------
write("README.md", f"""
# Gemma 3 270M — On‑Device Starters (Android • iOS • Web)

These are **minimal starter skeletons** to run **Gemma 3 270M (INT4/INT8 .task)** on‑device via **MediaPipe LLM Inference**. 
They include: a **license gate**, a **model downloader**, and a tiny **router** that prefers local inference and falls back to a cloud endpoint for oversized prompts.

> ⚠️ **Licensing**: Gemma weights are under the **Gemma Terms of Use**. You must ensure users accept the terms before download, and you must host the model artifacts yourself or use gated distribution. Do *not* embed a Hugging Face token in client apps.

## What’s here
- **android/**: Kotlin Activity + downloader + Local LLM wrapper (MediaPipe) + cloud stub
- **ios/**: SwiftUI app + downloader + Local LLM wrapper (MediaPipe) + cloud stub
- **web/**: Vanilla HTML/JS using `@mediapipe/tasks-genai` over WebGPU + router stub

## Quick links
- MediaPipe LLM Inference (Android): https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android
- MediaPipe LLM Inference (iOS): https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/ios
- MediaPipe LLM Inference (Web): https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js
- LiteRT Community (Gemma 3 270M IT): https://huggingface.co/litert-community/gemma-3-270m-it
- Gemma Terms of Use: https://ai.google.dev/gemma/terms

## Model artifacts
Recommended to mirror one of these to **your CDN** and update the URLs in each starter:
- Android/iOS: `gemma3-270m-it-q8.task` (or Q4_0/Q8 variant that matches your performance target)
- Web: `gemma3-270m-it-q8-web.task`

""")

# ---------- ANDROID ----------
write("android/README.md", """
# Android Starter (MediaPipe LLM Inference)

## 1) Create a new Android Studio project
- Template: **Empty Views Activity** (Kotlin), minSdk ≥ 26.
- Add the dependency in `app/build.gradle`:
```gradle
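dependencies {
  implementation 'com.google.mediapipe:tasks-genai:0.10.24'
  implementation 'androidx.appcompat:appcompat:1.6.1'
  implementation 'androidx.core:core-ktx:1.12.0'
  implementation 'com.google.android.material:material:1.11.0'
  // MainActivity uses lifecycleScope and coroutines
  implementation 'androidx.lifecycle:lifecycle-runtime-ktx:2.7.0'
  implementation 'org.jetbrains.kotlinx:kotlinx-coroutines-android:1.7.3'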
}
```

Add the `INTERNET` permission (`<uses-permission android:name="android.permission.INTERNET" />`) in `app/src/main/AndroidManifest.xml`; the model downloader needs it.

## 2) Drop the code in `app/src/main/java/com/example/gemma/`
- `MainActivity.kt`, `LicenseGate.kt`, `ModelManager.kt`, `LocalGemma.kt`, `CloudClient.kt`, `Router.kt` from this folder.

## 3) Set your model URL
In `ModelManager.kt`, set `MODEL_URL` to your CDN URL for `gemma3-270m-it-q8.task`.
Do not ship HF access tokens inside apps.

## 4) Run on a real device (GPU preferred)
The MediaPipe LLM Inference API is optimized for real devices; emulators are not reliable.
""")

write("android/MainActivity.kt", r"""
package com.example.gemma

import android.os.Bundle
import android.text.method.ScrollingMovementMethod
import android.widget.*
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.lifecycleScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext

class MainActivity : AppCompatActivity() {

private lateinit var txtOutput: TextView
private lateinit var edtPrompt: EditText
private lateinit var btnSend: Button
private lateinit var radioRoute: RadioGroup
private lateinit var btnDownload: Button

private val modelManager by lazy { ModelManager(this) }
private var local: LocalGemma? = null
private val cloud = CloudClient(baseUrl = "https://YOUR_CLOUD_ENDPOINT") // TODO: replace

override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)

    if (!LicenseGate.hasAccepted(this)) {
        LicenseGate.show(this) {
            // user accepted
            setupUi()
        }
    } else {
        setupUi()
    }
}

private fun setupUi() {
    setContentView(R.layout.activity_main)

    txtOutput = findViewById(R.id.txtOutput)
    edtPrompt = findViewById(R.id.edtPrompt)
    btnSend = findViewById(R.id.btnSend)
    radioRoute = findViewById(R.id.radioRoute)
    btnDownload = findViewById(R.id.btnDownload)

    txtOutput.movementMethod = ScrollingMovementMethod()

    btnDownload.setOnClickListener {
        lifecycleScope.launch {
            appendLine("Downloading model…")
            val path = withContext(Dispatchers.IO) { modelManager.ensureModel() }
            appendLine("Model ready at: $path")
            loadLocal(path)
        }
    }

    btnSend.setOnClickListener {
        val prompt = edtPrompt.text.toString().trim()
        if (prompt.isEmpty()) return@setOnClickListener
        lifecycleScope.launch { routeAndGenerate(prompt) }
    }

    // Autoload if already present
    lifecycleScope.launch {
        modelManager.getLocalModelPath()?.let { loadLocal(it) }
    }
}

private suspend fun loadLocal(path: String) = withContext(Dispatchers.IO) {
    try {
        local?.close()
        local = LocalGemma(this@MainActivity, path).also { it.load() }
        appendLine("Local LLM loaded.")
    } catch (e: Exception) {
        appendLine("Failed to init local LLM: ${e.message}")
    }
}

private suspend fun routeAndGenerate(prompt: String) = withContext(Dispatchers.IO) {
    val routing = when (radioRoute.checkedRadioButtonId) {
        R.id.optLocal -> Routing.LOCAL_ONLY
        R.id.optCloud -> Routing.CLOUD_ONLY
        else -> Routing.AUTO
    }
    val router = Router(local, cloud, maxLocalInputTokens = 2048)
    appendLine("Routing: $routing")
    try {
        router.generate(prompt, routing, onToken = { token ->
            runOnUiThread { txtOutput.append(token) }
        }, onDone = { ok, source ->
            runOnUiThread { appendLine("\n\n[done: $ok via $source]") }
        })
    } catch (e: Exception) {
        appendLine("Error: ${e.message}")
    }
}

private fun appendLine(msg: String) = runOnUiThread {
    txtOutput.append("\n$msg")
}

override fun onDestroy() {
    super.onDestroy()
    local?.close()
}
}
""")

write("android/LicenseGate.kt", r"""
package com.example.gemma

import android.app.AlertDialog
import android.content.Context
import android.content.Intent
import android.net.Uri
import android.preference.PreferenceManager

object LicenseGate {
private const val KEY = "gemma_terms_accepted"

fun hasAccepted(context: Context): Boolean =
    PreferenceManager.getDefaultSharedPreferences(context).getBoolean(KEY, false)

fun show(context: Context, onAccepted: () -> Unit) {
    AlertDialog.Builder(context)
        .setTitle("Gemma Terms of Use")
        .setMessage("You must accept the Gemma Terms of Use to download and run the model on this device.")
        .setCancelable(false)
        .setNeutralButton("View Terms") { _, _ ->
            val i = Intent(Intent.ACTION_VIEW, Uri.parse("https://ai.google.dev/gemma/terms"))
            context.startActivity(i)
            // Any button click dismisses the dialog, so show it again for when the user returns.
            show(context, onAccepted)
        }
        .setPositiveButton("I Accept") { _, _ ->
            PreferenceManager.getDefaultSharedPreferences(context)
                .edit().putBoolean(KEY, true).apply()
            onAccepted()
        }
        // Declining closes the hosting activity instead of leaving a blank screen.
        .setNegativeButton("Exit") { _, _ -> (context as? android.app.Activity)?.finish() }
        .show()
}
}
""")

write("android/ModelManager.kt", r"""
package com.example.gemma

import android.content.Context
import java.io.File
import java.io.FileOutputStream
import java.net.HttpURLConnection
import java.net.URL

class ModelManager(private val context: Context) {
companion object {
// TODO: host the model yourself; do not embed gated URLs/tokens in apps.
private const val MODEL_URL = "https://YOUR-CDN/gemma3-270m-it-q8.task"
private const val MODEL_FILE = "gemma3-270m-it-q8.task"
}
fun getLocalModelPath(): String? {
    val f = File(context.filesDir, MODEL_FILE)
    return if (f.exists()) f.absolutePath else null
}

/** Download model if missing, return absolute path. */
fun ensureModel(): String {
    val out = File(context.filesDir, MODEL_FILE)
    if (out.exists()) return out.absolutePath
    download(MODEL_URL, out)
    return out.absolutePath
}

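private fun download(url: String, outFile: File) {
    val conn = URL(url).openConnection() as HttpURLConnection
    conn.connectTimeout = 30000
    conn.readTimeout = 30000
    // Write to a temp file first so an interrupted download is never mistaken for a complete model.
    val tmp = File(outFile.parentFile, outFile.name + ".part")
    try {
        if (conn.responseCode != HttpURLConnection.HTTP_OK) {
            throw java.io.IOException("Model download failed: HTTP ${conn.responseCode}")
        }
        conn.inputStream.use { input ->
            FileOutputStream(tmp).use { output -> input.copyTo(output, 1 shl 16) }
        }
        if (!tmp.renameTo(outFile)) throw java.io.IOException("Could not move model into place")
    } finally {
        tmp.delete()       // no-op after a successful rename
        conn.disconnect()
    }
}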

}
""")

write("android/LocalGemma.kt", r"""
package com.example.gemma

import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInference.LlmInferenceOptions

class LocalGemma(private val context: Context, private val modelPath: String) : AutoCloseable {
private var llm: LlmInference? = null
fun load() {
    val options = LlmInferenceOptions.builder()
        .setModelPath(modelPath)
        .setMaxTokens(1024)   // input + output
        .setTopK(40)
        .setTemperature(0.8f)
        .setRandomSeed(101)
        .build()
    llm = LlmInference.createFromOptions(context, options)
}

fun generateStream(prompt: String, onToken: (String) -> Unit, onDone: (Boolean) -> Unit) {
    checkNotNull(llm) { "LLM not loaded" }
    // NOTE: this starter builds a second engine per call because the listeners are set on the options.
    // That reloads the model each time and the instance is never closed; for production, register the
    // listeners once in load() and reuse that instance.
    val opts = LlmInferenceOptions.builder()
        .setModelPath(modelPath)
        .setMaxTokens(1024)
        .setTopK(40).setTemperature(0.8f).setRandomSeed(101)
        .setResultListener { part, done ->
            onToken(part ?: "")
            if (done) onDone(true)
        }
        .setErrorListener { e ->
            onToken("\n[error: ${e.message}]"); onDone(false)
        }
        .build()
    val streaming = LlmInference.createFromOptions(context, opts)
    streaming.generateResponseAsync(prompt)
}

fun generate(prompt: String): String {
    val inst = llm ?: throw IllegalStateException("LLM not loaded")
    return inst.generateResponse(prompt)
}

override fun close() {
    llm?.close()
    llm = null
}
}
""")

write("android/CloudClient.kt", r"""
package com.example.gemma

import java.io.BufferedReader
import java.io.InputStreamReader
import java.net.HttpURLConnection
import java.net.URL

class CloudClient(private val baseUrl: String) {
/** Blocking demo GET endpoint: /generate?prompt=... Replace with your own. */
fun generate(prompt: String): String {
val url = URL("${baseUrl.trimEnd('/')}/generate?prompt=" + java.net.URLEncoder.encode(prompt, "UTF-8"))
val conn = url.openConnection() as HttpURLConnection
conn.connectTimeout = 30000
conn.readTimeout = 30000
return conn.inputStream.use { input ->
BufferedReader(InputStreamReader(input)).readText()
}
}
}
""")

write("android/Router.kt", r"""
package com.example.gemma

enum class Routing { AUTO, LOCAL_ONLY, CLOUD_ONLY }

class Router(
private val local: LocalGemma?,
private val cloud: CloudClient?,
private val maxLocalInputTokens: Int = 2048,
) {
/** Very crude token estimator (~4 characters per token). Replace with a real tokenizer if needed. */
private fun estimateTokens(s: String) = (s.length / 4).coerceAtLeast(1)
fun generate(
    prompt: String,
    routing: Routing,
    onToken: (String) -> Unit,
    onDone: (Boolean, String) -> Unit
) {
    when (routing) {
        Routing.LOCAL_ONLY -> {
            local ?: return onDone(false, "local-unavailable")
            local.generateStream(prompt, onToken) { ok -> onDone(ok, "local") }
        }
        Routing.CLOUD_ONLY -> {
            val out = cloud?.generate(prompt) ?: "[cloud unavailable]"
            onToken(out); onDone(true, "cloud")
        }
        Routing.AUTO -> {
            val tokens = estimateTokens(prompt)
            val useLocal = local != null && tokens <= maxLocalInputTokens
            if (useLocal) {
                local!!.generateStream(prompt, onToken) { ok -> onDone(ok, "local") }
            } else {
                val out = cloud?.generate(prompt) ?: "[cloud unavailable]"
                onToken(out); onDone(true, "cloud")
            }
        }
    }
}
}
""")

write("android/res/layout/activity_main.xml", r"""
<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:orientation="vertical">

    <Button
        android:id="@+id/btnDownload"
        android:layout_width="match_parent"
        android:layout_height="wrap_content"
        android:text="Download / Load Model" />

    <EditText
        android:id="@+id/edtPrompt"
        android:layout_width="match_parent"
        android:layout_height="wrap_content"
        android:hint="Enter prompt"
        android:minLines="3"
        android:gravity="top|start" />

    <RadioGroup
        android:id="@+id/radioRoute"
        android:layout_width="match_parent"
        android:layout_height="wrap_content"
        android:orientation="horizontal">
        <RadioButton android:id="@+id/optAuto"
            android:layout_width="wrap_content" android:layout_height="wrap_content"
            android:text="Auto" />
        <RadioButton android:id="@+id/optLocal"
            android:layout_width="wrap_content" android:layout_height="wrap_content"
            android:text="Local" />
        <RadioButton android:id="@+id/optCloud"
            android:layout_width="wrap_content" android:layout_height="wrap_content"
            android:text="Cloud" />
    </RadioGroup>

    <Button
        android:id="@+id/btnSend"
        android:layout_width="match_parent"
        android:layout_height="wrap_content"
        android:text="Send" />

    <TextView
        android:id="@+id/txtOutput"
        android:layout_width="match_parent"
        android:layout_height="0dp"
        android:layout_weight="1"
        android:paddingTop="12dp"
        android:textIsSelectable="true"
        android:scrollbars="vertical"
        android:textAppearance="?android:attr/textAppearanceSmall"
        android:text="Ready." />
</LinearLayout>
""")

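write("android/AndroidManifest.xml", r"""
<manifest xmlns:android="http://schemas.android.com/apk/res/android">
    <uses-permission android:name="android.permission.INTERNET" />
    <application android:label="Gemma 270M Starter" android:theme="@style/Theme.AppCompat.Light.DarkActionBar">
        <activity android:name=".MainActivity" android:exported="true">
            <intent-filter>
                <action android:name="android.intent.action.MAIN" />
                <category android:name="android.intent.category.LAUNCHER" />
            </intent-filter>
        </activity>
    </application>
</manifest>
""")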

# ---------- iOS ----------
write("ios/README.md", """
# iOS Starter (MediaPipe LLM Inference, SwiftUI)

## 1) Create a new App in Xcode (SwiftUI)
Add CocoaPods to the project and a `Podfile` like the one below, then run `pod install` and open the `.xcworkspace`.

    target 'Gemma270MStarter' do
      use_frameworks!
      pod 'MediaPipeTasksGenAI'
      pod 'MediaPipeTasksGenAIC'
    end

## 2) Add these files to your app target
- `ContentView.swift`, `LocalLlm.swift`, `ModelManager.swift`, `CloudClient.swift`

## 3) Set your model URL
In `ModelManager.swift`, set `MODEL_URL` to your CDN URL for `gemma3-270m-it-q8.task`.

## 4) Run on device (Metal)
The LLM Inference API is optimized for real devices.
""")

write("ios/ContentView.swift", r"""
import SwiftUI

struct ContentView: View {
@State private var accepted = UserDefaults.standard.bool(forKey: "gemma_terms_accepted")
@State private var prompt: String = ""
@State private var output: String = "Ready."
@State private var useRoute: Int = 0 // 0 = Auto, 1 = Local, 2 = Cloud
@StateObject private var local = LocalLlm()
private let cloud = CloudClient(baseUrl: "https://YOUR_CLOUD_ENDPOINT") // TODO

var body: some View {
    VStack(alignment: .leading, spacing: 12) {
        HStack {
            Button("Download / Load Model") {
                output.append("\nDownloading model…")
                Task {
                    let path = try? await ModelManager.shared.ensureModel()
                    output.append("\nModel ready at: \(path ?? "n/a")")
                    if let p = path {
                        try? await local.load(modelPath: p)
                        output.append("\nLocal LLM loaded.")
                    }
                }
            }
            Spacer()
        }
        TextEditor(text: $prompt).frame(height: 120).border(.secondary)

        Picker("", selection: $useRoute) {
            Text("Auto").tag(0)
            Text("Local").tag(1)
            Text("Cloud").tag(2)
        }.pickerStyle(.segmented)

        Button("Send") {
            output.append("\nRouting…")
            Task {
                await routeAndGenerate()
            }
        }

        ScrollView { Text(output).font(.system(size: 12, design: .monospaced))
            .frame(maxWidth: .infinity, alignment: .leading) }
    }
    .padding()
    .sheet(isPresented: .constant(!accepted)) {
        TermsSheet(accepted: $accepted)
    }
    .onChange(of: accepted) { _, v in
        UserDefaults.standard.set(v, forKey: "gemma_terms_accepted")
    }
}

func routeAndGenerate() async {
    let router = Router(local: local, cloud: cloud, maxLocalInputTokens: 2048)
    let routing: Routing = [Routing.auto, .localOnly, .cloudOnly][useRoute]
    do {
        output.append("\n---\n")
        try await router.generate(prompt: prompt, routing: routing,
                                  onToken: { token in
                                      output.append(token)
                                  }, onDone: { ok, src in
                                      output.append("\n\n[done: \(ok) via \(src)]")
                                  })
    } catch {
        output.append("\nError: \(error.localizedDescription)")
    }
}
}

struct TermsSheet: View {
@Binding var accepted: Bool
var body: some View {
VStack(spacing: 12) {
Text("Gemma Terms of Use").font(.title3).bold()
Text("You must accept the Gemma Terms of Use to download and run the model.")
HStack {
Link("View Terms", destination: URL(string: "https://ai.google.dev/gemma/terms")!)
Spacer()
Button("I Accept") { accepted = true }
}
}.padding()
}
}
""")

write("ios/LocalLlm.swift", r"""
import Foundation
import MediaPipeTasksGenai

@MainActor
final class LocalLlm: ObservableObject {
private var llm: LlmInference? = nil
private var modelPath: String? = nil
func load(modelPath: String) async throws {
    self.modelPath = modelPath
    let opts = LlmInferenceOptions()
    opts.baseOptions.modelPath = modelPath
    opts.maxTokens = 1024
    opts.topk = 40
    opts.temperature = 0.8
    opts.randomSeed = 101
    self.llm = try LlmInference(options: opts)
}

func generate(prompt: String) async throws -> String {
    guard let llm else { throw NSError(domain: "LocalLlm", code: 1, userInfo: [NSLocalizedDescriptionKey: "LLM not loaded"]) }
    return try llm.generateResponse(inputText: prompt)
}

func generateStream(prompt: String,
                    onToken: @escaping (String) -> Void,
                    onDone: @escaping (Bool) -> Void) async throws {
    // Reuse the engine created in load() instead of loading the model a second time per request.
    guard let llm else { throw NSError(domain: "LocalLlm", code: 2, userInfo: [NSLocalizedDescriptionKey: "Model not loaded"]) }
    let stream = try llm.generateResponseAsync(inputText: prompt)
    Task {
        do {
            for try await part in stream {
                onToken(part)
            }
            onDone(true)
        } catch {
            onToken("\n[error: \(error.localizedDescription)]")
            onDone(false)
        }
    }
}
}
""")

write("ios/ModelManager.swift", r"""
import Foundation

actor ModelManager {
static let shared = ModelManager()
private init() {}
// TODO: host the model yourself; do not embed gated URLs/tokens in apps.
private let MODEL_URL = URL(string: "https://YOUR-CDN/gemma3-270m-it-q8.task")!
private let MODEL_FILE = "gemma3-270m-it-q8.task"

func localModelPath() -> String? {
    let url = getDocumentsDir().appendingPathComponent(MODEL_FILE)
    return FileManager.default.fileExists(atPath: url.path) ? url.path : nil
}

func ensureModel() async throws -> String {
    let url = getDocumentsDir().appendingPathComponent(MODEL_FILE)
    if FileManager.default.fileExists(atPath: url.path) {
        return url.path
    }
    let (tmp, _) = try await URLSession.shared.download(from: MODEL_URL)
    try FileManager.default.moveItem(at: tmp, to: url)
    return url.path
}

private func getDocumentsDir() -> URL {
    FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
}
}

enum Routing { case auto, localOnly, cloudOnly }

@MainActor
struct Router {
let local: LocalLlm?
let cloud: CloudClient?
let maxLocalInputTokens: Int
private func estimateTokens(_ s: String) -> Int { max(1, s.count / 4) }

func generate(prompt: String, routing: Routing,
              onToken: @escaping (String) -> Void,
              onDone: @escaping (Bool, String) -> Void) async throws {
    switch routing {
    case .localOnly:
        guard let local else { onDone(false, "local-unavailable"); return }
        try await local.generateStream(prompt: prompt, onToken: onToken) { ok in onDone(ok, "local") }
    case .cloudOnly:
        let out = try await cloud?.generate(prompt: prompt) ?? "[cloud unavailable]"
        onToken(out); onDone(true, "cloud")
    case .auto:
        let tokens = estimateTokens(prompt)
        if let local, tokens <= maxLocalInputTokens {
            try await local.generateStream(prompt: prompt, onToken: onToken) { ok in onDone(ok, "local") }
        } else {
            let out = try await cloud?.generate(prompt: prompt) ?? "[cloud unavailable]"
            onToken(out); onDone(true, "cloud")
        }
    }
}
}
""")

# ---------- WEB ----------

write("web/README.md", """
# Web Starter (MediaPipe LLM Inference + WebGPU)

## 1) Serve this folder locally

Any static server will do (Vite, python -m http.server, etc.).

## 2) Put your model on a CDN

Update config.js with a URL to gemma3-270m-it-q8-web.task on your CDN.

## 3) Open in a WebGPU-capable browser (Chrome/Edge recent).

""")

write("web/index.html", r"""

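<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Gemma 3 270M — Web Starter</title>
</head>
<body>
  <!-- Minimal markup; element ids and the "route" radio group match the lookups in main.js. -->
  <div id="gate">
    <b>Gemma Terms of Use</b> — You must accept to download and run the model.
    <a href="https://ai.google.dev/gemma/terms" target="_blank" rel="noopener">View Terms</a>
    <button id="accept">I accept</button>
  </div>
  <label><input type="radio" name="route" value="auto" checked> Auto</label>
  <label><input type="radio" name="route" value="local"> Local</label>
  <label><input type="radio" name="route" value="cloud"> Cloud</label>
  <textarea id="prompt"></textarea>
  <button id="load">Load model</button>
  <button id="send">Send</button>
  <pre id="out"></pre>
  <script type="module" src="./main.js"></script>
</body>
</html>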
     """)

write("web/config.js", r"""
export const TERMS_URL = "https://ai.google.dev/gemma/terms";
// Host the model yourself; do not embed gated HF URLs in production apps.
export const MODEL_URL = "https://YOUR-CDN/gemma3-270m-it-q8-web.task";
export const WASM_ROOT = "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm";
export const CLOUD_ENDPOINT = "https://YOUR_CLOUD_ENDPOINT";
""")

write("web/router.js", r"""
export class Router {
  constructor(local, cloud, maxLocalInputTokens = 2048) {
    this.local = local;
    this.cloud = cloud;
    this.max = maxLocalInputTokens;
  }

  // Rough heuristic: about 4 characters per token.
  estimateTokens(s) { return Math.max(1, Math.floor(s.length / 4)); }

  async generate(prompt, routing, onToken, onDone) {
    if (routing === "local") {
      if (!this.local) return onDone(false, "local-unavailable");
      await this.local.generateStream(prompt, onToken);
      onDone(true, "local");
      return;
    }
    if (routing === "cloud") {
      const out = await this.cloud.generate(prompt);
      onToken(out);
      onDone(true, "cloud");
      return;
    }
    // AUTO: stay local while the prompt fits the local budget, else go to cloud.
    if (this.local && this.estimateTokens(prompt) <= this.max) {
      await this.local.generateStream(prompt, onToken);
      onDone(true, "local");
    } else {
      const out = await this.cloud.generate(prompt);
      onToken(out);
      onDone(true, "cloud");
    }
  }
}
""")

write("web/main.js", r"""
import { FilesetResolver, LlmInference } from "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest";
import { TERMS_URL, MODEL_URL, WASM_ROOT, CLOUD_ENDPOINT } from "./config.js";
import { Router } from "./router.js";

const out = document.getElementById("out");
const promptBox = document.getElementById("prompt");
const btnLoad = document.getElementById("load");
const btnSend = document.getElementById("send");
const gate = document.getElementById("gate");
document.getElementById("accept").onclick = () => gate.style.display = "none";

let llm = null;

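async function initLocal() {
  // Load the WASM runtime, then create the LLM task from the hosted model file.
  const genai = await FilesetResolver.forGenAiTasks(WASM_ROOT);
  llm = await LlmInference.createFromOptions(genai, {
    baseOptions: { modelAssetPath: MODEL_URL },
    maxTokens: 768, topK: 40, temperature: 0.8, randomSeed: 101
  });
}

class Local {
  // Streams partial results to onToken; resolves on the final chunk and
  // rejects if generation fails, so the caller is never left waiting.
  generateStream(prompt, onToken) {
    return new Promise((resolve, reject) => {
      Promise.resolve(llm.generateResponse(prompt, (partial, done) => {
        if (partial) onToken(partial);
        if (done) resolve(true);
      })).catch(reject);
    });
  }
}

class Cloud {
  // Stub client: GET <CLOUD_ENDPOINT>/generate?prompt=... returning plain text.
  async generate(prompt) {
    const u = new URL(CLOUD_ENDPOINT + "/generate");
    u.searchParams.set("prompt", prompt);
    const res = await fetch(u, { method: "GET" });
    if (!res.ok) throw new Error("Cloud request failed: HTTP " + res.status);
    return await res.text();
  }
}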

btnLoad.onclick = async () => {
  out.textContent += "\nDownloading / loading model…";
  await initLocal();
  out.textContent += "\nLocal LLM loaded.";
};

btnSend.onclick = async () => {
  const routing = document.querySelector("input[name=route]:checked")?.value ?? "auto";
  const router = new Router(llm ? new Local() : null, new Cloud(), 2048);
  out.textContent += "\n---\n";
  await router.generate(promptBox.value, routing,
    (t) => out.textContent += t,
    (ok, src) => out.textContent += `\n\n[done: ${ok} via ${src}]`);
};
""")

# Zip it

zip_path = "/mnt/data/gemma270m_starters.zip"
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as z:
    for base, _, files in os.walk(root):
        for fn in files:
            p = os.path.join(base, fn)
            z.write(p, arcname=os.path.relpath(p, root))

zip_path

'/mnt/data/gemma270m_starters.zip'

Here are copy‑pasteable, cross‑platform starters for running Gemma 3 270M on‑device. Each one (Android, iOS, Web) includes a license gate, a model downloader, and a tiny local→cloud router.


What you’re getting

gemma270m_starters/
  README.md
  android/   # Kotlin + MediaPipe LLM Inference + simple UI + router
  ios/       # SwiftUI + MediaPipe LLM Inference + router
  web/       # Vanilla HTML/JS + @mediapipe/tasks-genai (WebGPU) + router

Model artifacts to host on your CDN (update the URLs in each starter):

Android/iOS: gemma3-270m-it-q8.task
Web: gemma3-270m-it-q8-web.task (web‑optimized) — both visible on the LiteRT Community model page. Hugging Face

Licensing: Add a first‑run “I accept” gate before download (included). Gemma’s Terms of Use require passing along restrictions when you redistribute. Google AI for Developers
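
To make the gate concrete, here is a minimal sketch of how the web starter could remember acceptance and refuse to start the download until it is given. The key name `TERMS_KEY`, the helper names, and the in-memory storage stand-in are all hypothetical and not part of the starters; in a browser you would pass `window.localStorage` as `storage`.

```javascript
// Hypothetical helpers: `storage` is any object with getItem/setItem
// (window.localStorage in a browser). TERMS_KEY is an assumed key name.
const TERMS_KEY = "gemma-terms-accepted";

function hasAcceptedTerms(storage) {
  return storage.getItem(TERMS_KEY) === "yes";
}

function acceptTerms(storage) {
  storage.setItem(TERMS_KEY, "yes");
}

// Runs `download` only after the terms were accepted; otherwise reports why not.
function downloadIfAccepted(storage, download) {
  if (!hasAcceptedTerms(storage)) {
    return { started: false, reason: "terms-not-accepted" };
  }
  download();
  return { started: true, reason: null };
}

// In-memory stand-in for localStorage so the sketch also runs outside a browser.
function makeMemoryStorage() {
  const m = new Map();
  return {
    getItem: (k) => (m.has(k) ? m.get(k) : null),
    setItem: (k, v) => { m.set(k, String(v)); },
  };
}
```

Persisting the flag means returning users are not asked again, while a fresh install always sees the gate before any model bytes are fetched.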

Android (Kotlin, MediaPipe LLM Inference)

Add dependency in your app module:

dependencies {
  implementation 'com.google.mediapipe:tasks-genai:0.10.24'
  implementation 'androidx.appcompat:appcompat:1.6.1'
  implementation 'androidx.core:core-ktx:1.12.0'
  implementation 'com.google.android.material:material:1.11.0'
}

This is the official package/version from the Android LLM Inference guide. Google AI for Developers

Add internet permission in AndroidManifest.xml:

<uses-permission android:name="android.permission.INTERNET" />

Drop in the files from android/:

MainActivity.kt – minimal chat-like UI; streams tokens
LicenseGate.kt – first‑run acceptance dialog linking to the Gemma Terms (opens browser) Google AI for Developers
ModelManager.kt – downloads gemma3-270m-it-q8.task to app storage (set your CDN URL)
LocalGemma.kt – wraps LlmInference (generateResponse / generateResponseAsync)
Import path example: com.google.mediapipe.tasks.genai.llminference.LlmInference (as used by Google samples/issues). GitHub
Router.kt + CloudClient.kt – simple “Auto / Local only / Cloud only” routing

Notes:

The official guide recommends hosting the model and downloading at runtime, not bundling in the APK. Google AI for Developers

The API is optimized for real devices (Pixel 8+/S23+) and “does not reliably support emulators.” Google AI for Developers

Core init (already in LocalGemma.kt):

val options = LlmInferenceOptions.builder()
  .setModelPath(modelPath)
  .setMaxTokens(1024).setTopK(40).setTemperature(0.8f).setRandomSeed(101)
  .build()
val llm = LlmInference.createFromOptions(context, options)

Use generateResponseAsync for streaming with a resultListener. Google AI for Developers

iOS (SwiftUI, MediaPipe LLM Inference)

CocoaPods (Podfile):

target 'Gemma270MStarter' do
  use_frameworks!
  pod 'MediaPipeTasksGenAI'
  pod 'MediaPipeTasksGenAIC'
end

These are the iOS packages from the official guide. Google AI for Developers

Files in ios/:

ContentView.swift – SwiftUI UI + license sheet + router
LocalLlm.swift – wraps LlmInference (blocking & async streaming)
ModelManager.swift – downloads gemma3-270m-it-q8.task to Documents
CloudClient.swift – trivial GET endpoint client + Router enum/struct

Core init (already in LocalLlm.swift):

let options = LlmInferenceOptions()
options.baseOptions.modelPath = modelPath
options.maxTokens = 1024
options.topk = 40
options.temperature = 0.8
options.randomSeed = 101
let llm = try LlmInference(options: options)

Use generateResponseAsync(inputText:) to stream and iterate the async sequence. Google AI for Developers

Web (Vanilla JS + WebGPU)

Your starter is a single‑page app using the official @mediapipe/tasks-genai package and WebGPU:

index.html – terms gate, prompt box, route picker
main.js – loads WASM + LlmInference, streams tokens
router.js – same simple router pattern
config.js – set MODEL_URL to your CDN path for gemma3-270m-it-q8-web.task
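
The router pattern in all three starters boils down to one decision, which can be sketched as a pure function. This is an illustration under assumptions rather than the code in router.js: `chooseRoute` and its `"unavailable"` result are hypothetical names, and the 4-characters-per-token estimate is only the rough heuristic the starters use (it will likely be off for code or non-English text).

```javascript
// Rough heuristic from the starters: about 4 characters per token.
function estimateTokens(text) {
  return Math.max(1, Math.floor(text.length / 4));
}

// Returns "local", "cloud", or "unavailable".
// `routing` is "auto", "local", or "cloud"; hasLocal/hasCloud are booleans.
function chooseRoute(prompt, routing, hasLocal, hasCloud, maxLocalInputTokens = 2048) {
  if (routing === "local") return hasLocal ? "local" : "unavailable";
  if (routing === "cloud") return hasCloud ? "cloud" : "unavailable";
  // Auto: stay on-device while the prompt fits the local input budget.
  if (hasLocal && estimateTokens(prompt) <= maxLocalInputTokens) return "local";
  return hasCloud ? "cloud" : "unavailable";
}
```

With the default budget of 2048 tokens, a prompt of up to 8,195 characters stays local and anything longer escalates to the cloud. Keeping the decision separate from the I/O makes it easy to unit-test and to tune per device class.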

Key bits (already in main.js):

import { FilesetResolver, LlmInference } from
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest";

const genai = await FilesetResolver.forGenAiTasks(WASM_ROOT);
const llm = await LlmInference.createFromOptions(genai, {
  baseOptions: { modelAssetPath: MODEL_URL },
  maxTokens: 768, topK: 40, temperature: 0.8, randomSeed: 101
});

// Streaming:
llm.generateResponse(inputPrompt, (partial, done) => {
  if (partial) append(partial);
});

This follows the official Web quickstart (WebGPU required). Google AI for Developers
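
If you also want the full text once streaming finishes (for JSON parsing, say), a small wrapper can collect the partial results. The sketch below assumes `generateResponse(prompt, listener)` invokes the listener with `(partial, done)` and returns a Promise that settles when generation ends, as in the snippet above. `generateToString` and `fakeLlm` are hypothetical names; the fake object only stands in for `LlmInference` so the helper can be exercised without a model.

```javascript
// Collects streamed partial results into one string while still forwarding
// each chunk to an optional onPartial callback (for live UI updates).
async function generateToString(llm, prompt, onPartial = () => {}) {
  let text = "";
  await llm.generateResponse(prompt, (partial, done) => {
    if (partial) {
      text += partial;
      onPartial(partial);
    }
  });
  return text;
}

// Hypothetical stand-in for LlmInference, used only to exercise the helper.
const fakeLlm = {
  async generateResponse(prompt, listener) {
    const chunks = ["Hel", "lo", ""];
    chunks.forEach((c, i) => listener(c, i === chunks.length - 1));
    return chunks.join("");
  },
};
```

Usage would look like `const text = await generateToString(llm, prompt, append);`, which keeps the streaming UI and gives you one string to validate afterwards.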

Why these choices map cleanly to Gemma 3 270M

LiteRT/MediaPipe‑ready artifacts are published for 270M IT, including a web‑optimized .task variant—so you don’t need to convert to get started. Hugging Face
Android/iOS/Web LLM Inference APIs expose the same knobs (maxTokens, topK, temperature, randomSeed) and built‑in streaming methods, which keeps your app logic symmetric across platforms. Google AI for Developers

What you still need to plug in

Host the model on your CDN and update MODEL_URL in each starter. Don’t embed gated HF URLs or tokens in a client app. (The Android docs also note the model is too large to bundle—download at runtime.) Google AI for Developers
Cloud fallback: replace the simple GET /generate?prompt=... stubs with your real endpoint (or remove cloud entirely if you want on‑device‑only).
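
When you swap out the GET stub, a POST with a JSON body is usually the safer shape, since prompts in query strings tend to end up in server logs and can hit URL length limits. The sketch below is one way to do that under stated assumptions: the `/generate` path, the `{ prompt }` request body, and the `{ text }` response field are hypothetical and must be matched to your real endpoint. `fetchImpl` is injectable so the client can be tested without a network.

```javascript
// Hypothetical cloud client: POSTs JSON to `${endpoint}/generate` and expects
// a JSON response shaped like { "text": "..." }. Both shapes are assumptions.
function makeCloudClient(endpoint, fetchImpl = fetch) {
  return {
    async generate(prompt) {
      const res = await fetchImpl(`${endpoint}/generate`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt }),
      });
      // Surface HTTP errors instead of streaming an error page to the user.
      if (!res.ok) throw new Error(`cloud request failed: HTTP ${res.status}`);
      const data = await res.json();
      return data.text;
    },
  };
}
```

The returned object has the same `generate(prompt)` method the web router calls, so it could be passed in place of the `Cloud` stub.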

References you can lean on as you extend this

Android LLM Inference guide (dependency, init, streaming, emulator caveat). Google AI for Developers
iOS LLM Inference guide (pods, init, streaming). Google AI for Developers
Web LLM Inference guide (npm/CDN, WASM root, streaming callback, WebGPU). Google AI for Developers
Gemma 3 270M IT artifacts (including gemma3-270m-it-q8-web.task). Hugging Face
Gemma Terms of Use (add the gate & pass along restrictions when redistributing). Google AI for Developers


Author

Thorsten Meyer
