The physical layer of mobile automation

I wanted a phone farm that behaved less like a pile of devices and more like a software system.

That sounds simple until you try to do it with real iPhones. The normal developer answer is Appium, WebDriverAgent, XCTest, simulators, or private app automation. Those tools are useful in the right context, but they were not the shape of the system I wanted. I wanted real phones, real apps, the actual screen, the same touch targets a person sees, and a scriptable control layer.

The result is a stack that combines mirrored visual feedback, ESP32 HID input, Wi-Fi and serial command transport, a local web control plane, and a calibration loop. It is not magic. It is not undetectable. It is not WDA or XCTest. It is a set of physical devices that can be observed, commanded, adjusted, and eventually treated like a small fleet of software-controlled machines.

The product idea

The problem I kept coming back to was feedback.

If a script sends a tap to a phone but cannot see the screen, it is guessing. If it can see a screenshot from five seconds ago, it is still mostly guessing. If it can see the live screen but cannot send input through a reliable path, it is just a monitor. The useful system needs both halves: observation and actuation.

So I built around this loop:

Observe (mirrored screen): ReplayKit frames expose the real phone UI instead of a stale screenshot.
Read (viewer state): The local viewer gives the operator or script a current control surface.
Command (device API): The API records a small command batch: home, move, click, type, or swipe.
Transport (ESP32 session): Wi-Fi, WebSocket, or serial carries the command to the active board session.
Input (USB HID): The phone receives normal external input through the USB HID path.
Measure (visible result): The next frame shows whether the UI actually changed the way the script expected.

The new frame then returns to the viewer, and the loop starts again.

Figure: Phone control loop. The useful loop is observation, command, USB HID input, and measured feedback from the next frame.

That is why I think of it less as "mobile automation" and more as a small cyber-physical system. The software is real, but it operates on glass rectangles, batteries, cables, Wi-Fi connections, mirrors, cursor bubbles, and apps that change their layouts whenever they feel like it.
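One pass through that loop can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: `run_step`, `observe`, `send`, and `changed` are hypothetical names, and a frame is assumed to be an opaque blob of bytes handed over by the mirror.

```python
from typing import Callable, Sequence

Frame = bytes  # assumption: a frame is an opaque blob from the mirror


def run_step(
    observe: Callable[[], Frame],
    send: Callable[[Sequence[str]], None],
    commands: Sequence[str],
    changed: Callable[[Frame, Frame], bool],
) -> bool:
    """Run one observe -> command -> measure cycle.

    Returns True when the frame after the command differs from the frame
    before it, according to the caller-supplied `changed` test.
    """
    before = observe()        # observe: the current mirrored frame
    send(list(commands))      # command, transport, and HID input
    after = observe()         # measure: the next frame
    return changed(before, after)


# Usage with stand-ins for the mirror and the board.
frames = iter([b"feed", b"comment-sheet"])
sent_batches = []
ok = run_step(
    observe=lambda: next(frames),
    send=sent_batches.append,
    commands=["H", "m,18,0", "m,0,25", "c,1"],
    changed=lambda a, b: a != b,
)
```

Passing the mirror and the board in as plain callables keeps the loop testable without any hardware attached.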

Why not just use WDA

The honest answer is that I wanted the stack to operate through the same interaction layer a person uses.

WebDriverAgent and XCTest are developer automation surfaces. They are powerful when you control the target app or when the target allows accessibility tree inspection. But for social app workflows across real logged-in devices, they pull you toward a test-runner model. I wanted an operator model.

This system does not use WDA or XCTest for the iPhone path described here. The input side is HID. In the early version, the ESP32 accepted serial commands from the computer and translated them into mouse-style input. Later, the hardware direction moved toward an ESP32-S3 using native USB HID through a phone-side USB-C connection. In both cases, the phone receives input more like an external mouse or input accessory than a test framework command.

That does not mean the stack is invisible. It only means it has a different footprint. Apps can still observe behavior, timing, network patterns, account history, device state, and all the normal signals that come from using a real service at scale. Saying "not WDA" is accurate. Saying "undetectable" would be careless.

The practical benefit is control at the same layer as the UI.

The first working version

The first version was deliberately crude.

An ESP32 ran mouse firmware. The computer talked to it over serial at 115200 baud. A web interface translated browser movement into simple commands:

m,x,y for relative movement
a,x,y for absolute movement, 0-127
c,b for button clicks
p,b and r,b for press and release
s,v and h,v for scrolling
i for connection status
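As a rough sketch, the browser side of that grammar could build its command strings like this. The helper names are hypothetical, and the newline framing is an assumption, since the list above does not say how the firmware delimits commands.

```python
from typing import Iterable


def move_rel(dx: int, dy: int) -> str:
    """Relative move: m,x,y"""
    return f"m,{dx},{dy}"


def move_abs(x: int, y: int) -> str:
    """Absolute move: a,x,y with both axes limited to 0-127."""
    if not (0 <= x <= 127 and 0 <= y <= 127):
        raise ValueError("absolute coordinates must be in 0-127")
    return f"a,{x},{y}"


def click(button: int = 1) -> str:
    """Button click: c,b"""
    return f"c,{button}"


def encode(commands: Iterable[str]) -> bytes:
    """Join commands into bytes for the serial port.

    Assumption: one ASCII command per line, terminated by a newline.
    """
    return "".join(cmd + "\n" for cmd in commands).encode("ascii")


payload = encode([move_rel(5, -3), click()])
# payload can be written to whatever serial handle is open at 115200 baud.
```

Range-checking the absolute form on the host side means a bad coordinate fails loudly in the browser layer instead of silently on the board.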

At the same time, the phone screen was mirrored back to the computer. The early path used AirPlay through UxPlay. With the right flags, it could be made reasonably responsive:

./UxPlay/uxplay -n "Homebase" -nc -vsync no -fps 30

That gave me the first full loop. See the phone, move the cursor, click the UI, see what happened, adjust.

Operator (browser trackpad): Browser movement starts as a local command, not a test framework event.

Figure: Early prototype. The first version was a closed loop: browser command, board input, phone response, mirrored feedback.

This was useful, but it exposed the real problems quickly. Some apps did not mirror the full interactive UI reliably. The cursor was not always visible in captured frames. Absolute positioning was not reliable on the flashed board I had in hand. The system worked only if calibration became a first-class workflow.

FactoryMirror and visual feedback

The big improvement was changing the visual feedback path.

AirPlay worked for normal mirroring, but TikTok could switch into a video-output mode that hid the full interactive UI. That broke anything involving comments, inputs, sheets, overlays, or action buttons. I needed the actual screen, not a media projection.

So the stack added a ReplayKit-based path through a small iPhone app called FactoryMirror. The phone uploads frames to a local receiver over WebSocket. A local viewer connects to that receiver and shows the live stream. In the handoff docs, the upload path is shaped like:

ws://<local-ip>:8013/upload?stream=device5

and the local viewer connects through:

ws://127.0.0.1:8013/viewer

The important result was simple: the full TikTok screen was visible again. Comment sheets, input fields, send buttons, close buttons, and overlays were visible as UI.

That changed the system from "send commands and hope" to "send commands and inspect the result." Static captures were useful, but live feedback was more reliable because the cursor bubble could disappear, blend into app colors, or fail to show up at all.
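The difference between live feedback and a stale capture can be made explicit in the receiver. The following is only a sketch under assumptions: `FrameStore`, its methods, and the 0.5 second freshness window are hypothetical, not part of FactoryMirror or its receiver.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class LatestFrame:
    data: bytes
    received_at: float


class FrameStore:
    """Keeps only the newest frame per stream and refuses to serve stale ones."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._frames: Dict[str, LatestFrame] = {}

    def put(self, stream: str, data: bytes) -> None:
        """Record a frame uploaded for a stream such as 'device5'."""
        self._frames[stream] = LatestFrame(data, self._clock())

    def fresh(self, stream: str, max_age_s: float = 0.5) -> Optional[bytes]:
        """Return the newest frame, or None if it is missing or too old."""
        frame = self._frames.get(stream)
        if frame is None:
            return None
        if self._clock() - frame.received_at > max_age_s:
            return None
        return frame.data


store = FrameStore()
store.put("device5", b"<encoded frame>")
current = store.fresh("device5")
```

Returning None for an old frame forces a script to treat "no recent picture" as a state of its own instead of acting on a frozen image.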

The command plane

Once the control loop worked on one phone, the next problem was making it fleet-shaped.

The local device API is intentionally boring. It runs on a normal local port, stores state in SQLite, exposes health and device listing endpoints, accepts command batches, and pushes pending commands to connected boards. It tracks whether an ESP32 is live over WebSocket and returns operational truth signals like sent_immediately.
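A minimal sketch of that queueing behavior, using Python's built-in sqlite3 module. The table layout and the function names are assumptions for illustration; only the sent_immediately field name comes from the actual API.

```python
import sqlite3
from typing import Callable, List, Optional, Sequence

Push = Callable[[List[str]], None]  # hypothetical: delivers lines to a live board


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS commands (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               device TEXT NOT NULL,
               line TEXT NOT NULL,
               sent INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return db


def pending(db: sqlite3.Connection, device: str) -> List[str]:
    rows = db.execute(
        "SELECT line FROM commands WHERE device = ? AND sent = 0 ORDER BY id",
        (device,),
    )
    return [row[0] for row in rows]


def flush(db: sqlite3.Connection, device: str, push: Push) -> int:
    """Push every pending line for a device in order, then mark them sent."""
    rows = db.execute(
        "SELECT id, line FROM commands WHERE device = ? AND sent = 0 ORDER BY id",
        (device,),
    ).fetchall()
    if rows:
        push([line for _, line in rows])
        with db:  # commits the updates as one transaction
            db.executemany(
                "UPDATE commands SET sent = 1 WHERE id = ?",
                [(row_id,) for row_id, _ in rows],
            )
    return len(rows)


def submit_batch(
    db: sqlite3.Connection,
    device: str,
    lines: Sequence[str],
    push: Optional[Push] = None,
) -> dict:
    """Record a batch. `push` is None when no board session is live."""
    with db:
        db.executemany(
            "INSERT INTO commands (device, line) VALUES (?, ?)",
            [(device, line) for line in lines],
        )
    live = push is not None
    if live:
        flush(db, device, push)
    return {"device": device, "queued": len(lines), "sent_immediately": live}


db = open_db()
offline = submit_batch(db, "device5", ["H", "m,18,0"])  # no board connected yet
```

Rows are marked as sent only after the push call returns, so a batch submitted while the board is offline stays pending until the next flush.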

The command grammar is small on purpose:

H homes the pointer
m,dx,dy moves relative to the current pointer
c,1 clicks
t,text types text
swipe,x1,y1,x2,y2,d drags
i asks for status

That command set looks too simple until you realize how much it buys. It is readable in logs, easy to copy into calibration scripts, storable as a workflow, and transportable over serial, UDP, or WebSocket-backed device connections. It can also be gated by allowlisted prefixes so the local API is not accepting arbitrary shell-shaped input.
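The allowlist idea can be sketched as follows. The exact rules are an assumption: here the bare commands H and i must match exactly, the parameterized commands are matched by prefix, and anything containing a line break is rejected.

```python
from typing import Iterable, List

EXACT_COMMANDS = frozenset({"H", "i"})          # commands with no arguments
PREFIX_COMMANDS = ("m,", "c,", "t,", "swipe,")  # commands that carry arguments


def is_allowed(line: str) -> bool:
    """Accept only lines that belong to the small command grammar."""
    if "\n" in line or "\r" in line:
        return False  # one command per line; no smuggled extra lines
    return line in EXACT_COMMANDS or line.startswith(PREFIX_COMMANDS)


def check_batch(lines: Iterable[str]) -> List[str]:
    """Return the batch unchanged, or raise if any line is off the list."""
    batch = list(lines)
    rejected = [line for line in batch if not is_allowed(line)]
    if rejected:
        raise ValueError(f"rejected commands: {rejected!r}")
    return batch


batch = check_batch(["H", "m,18,0", "m,0,25", "c,1"])
```

Matching H and i exactly matters: treating "i" as a prefix would also admit unrelated input that merely starts with that letter.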

Caller (operator script): Scripts send the same small command language a person can read in logs.

Figure: Local command architecture. A command becomes local state, hardware input, and visible feedback instead of a one-way script.

This is the part that made the farm start feeling like software. The physical device did not disappear. It became addressable.

Calibration is the real work

The most important command in the whole system is H.

That sounds ridiculous, but it is true. The board I was using did not reliably match the repo's expected absolute mouse behavior. The documented a,x,y command existed, and the software around it expected absolute positioning, but the board in front of me behaved more reliably with a reset plus relative movement.

The working rule became:

Send H to reset the cursor toward the origin.
Send one or more m,dx,dy moves.
Click or drag.
Watch the live viewer.
Adjust by small increments.

Calibration used a visible grid overlay. The phone frame was divided into columns and rows. I would send a command sequence, watch where the cursor or interaction landed, and record the result. A command like this became a normal test probe:

H
m,18,0
m,0,25
c,1

Sequences like that eventually became targets: open TikTok, tap Home, open comments, focus an input, post a comment, close a sheet, swipe to the next post. Each phone model needed its own table because screen size, OS chrome, app layout, and pointer behavior differed.
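The probe and the "adjust by small increments" step can be generated by code. This is a sketch with hypothetical helper names; it only assumes the H, m, and c commands behave as described above.

```python
from typing import List


def tap_sequence(dx: int, dy: int, button: int = 1) -> List[str]:
    """Home the pointer, move right, move down, then click."""
    return ["H", f"m,{dx},0", f"m,0,{dy}", f"c,{button}"]


def probes_around(dx: int, dy: int, step: int = 1) -> List[List[str]]:
    """Nine candidate sequences: the guess plus its eight neighbors.

    Rows are scanned top to bottom and columns left to right, so the
    unmodified guess sits at index 4.
    """
    return [
        tap_sequence(dx + ox, dy + oy)
        for oy in (-step, 0, step)
        for ox in (-step, 0, step)
    ]


probe = tap_sequence(18, 25)
candidates = probes_around(18, 25)
```

Each candidate starts with H, so an error in one probe does not carry over into the next.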

One iPhone could use a next-post swipe like:

H
swipe,15,40,15,15,200

while another needed:

H
swipe,5,45,5,15,200

That is not a bug in the idea. That is the work. A phone farm behaves like software only after the physical layer has been measured enough that software can trust it.
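A per-model table can be as plain as a nested dictionary. In this sketch the model keys and the target name are placeholders; the swipe values are the two examples above.

```python
from typing import Dict, List

# Hypothetical keys: "phone_a" and "phone_b" stand in for real model names.
CALIBRATION: Dict[str, Dict[str, List[str]]] = {
    "phone_a": {"next_post": ["H", "swipe,15,40,15,15,200"]},
    "phone_b": {"next_post": ["H", "swipe,5,45,5,15,200"]},
}


def commands_for(model: str, target: str) -> List[str]:
    """Look up the measured command sequence for a target on a phone model."""
    try:
        return list(CALIBRATION[model][target])
    except KeyError:
        raise LookupError(
            f"no calibrated target {target!r} for model {model!r}"
        ) from None


swipe_a = commands_for("phone_a", "next_post")
```

Failing on an uncalibrated pair is deliberate: borrowing another phone's coordinates is exactly the guess the table exists to prevent.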

There is also a timing side. After a swipe, the mirrored stream can lag the actual phone state, so the controller needs to wait for the stream to catch up before classifying the screen or deciding on the next command.
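One way to wait is to poll until several consecutive reads agree. This sketch assumes a hypothetical `get_frame` that returns something comparable, such as a coarse hash of a UI region, because raw video frames rarely match byte for byte.

```python
import time
from typing import Callable, Optional


def wait_until_settled(
    get_frame: Callable[[], bytes],
    stable_reads: int = 3,
    interval_s: float = 0.1,
    timeout_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[bytes]:
    """Return a frame once `stable_reads` reads in a row are identical.

    Returns None if the stream has not settled before the timeout.
    """
    deadline = clock() + timeout_s
    last = get_frame()
    streak = 1
    while streak < stable_reads:
        if clock() >= deadline:
            return None
        sleep(interval_s)
        frame = get_frame()
        if frame == last:
            streak += 1
        else:
            last, streak = frame, 1  # the screen moved; start counting again
    return last


# Usage with a canned stream: one frame of the old post, then the new one.
reads = iter([b"old-post", b"new-post", b"new-post", b"new-post"])
settled = wait_until_settled(lambda: next(reads), sleep=lambda s: None)
```

The timeout matters as much as the stability check: a feed that never stops moving should produce an explicit "not settled" result, not a hung script.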

The hardware path

The dev-board version proved the loop, but it was not a product-shaped piece of hardware.

Loose ESP32 boards are fine for proving input. They are bad for a farm. Cables move, boards reset, and charging becomes messy. A real setup needs a small inline dongle that can plug into the phone, enumerate as a HID device, and still allow charge-through power.

That led to the ESP32-S3 USB HID charge-through dongle.

The V2 board target is a compact 64 mm by 34 mm four-layer PCB. The phone-side USB-C plug connects the USB 2.0 data lines through ESD protection directly into the ESP32-S3 native USB device path. The ESP32-S3-MINI-1U enumerates as a single USB HID device. There is no USB hub in the current revision because the board only needs one HID device plus a separate charge path.

Board shape: one HID device with separate charge-through power.

Data path: the phone sees one native HID device through the ESP32-S3 USB path.
Phone side (USB-C plug): Mechanical phone-side plug.
Protection (USB2 ESD): Protects D+ and D- before the MCU.
Device (ESP32-S3 USB HID): Enumerates as the single HID device.
Control (Wi-Fi command path): Receives command batches off-phone.

Charge path: power is negotiated separately so the phone can keep charging.
Charger (USB-C receptacle): External charger enters the dongle.
Policy (PD sink controller): Locks the power role and policy.
Rail (protected 5 V): Feeds board power with protection.
Phone side (source switch): Presents source behavior back to the phone.

Figure: USB HID dongle path. The board keeps HID data and charge-through power as separate paths that meet at the phone.

The charging side is where the board stops being a simple hobby adapter. The charge-side USB-C receptacle feeds a PD sink/controller path. The protected 5 V rail powers the system and feeds the phone-side source switch. The phone-side CC path needs a real source/UFP controller policy, not a passive resistor shortcut. The current docs explicitly gate production export on locking the final PD controller SKU and configuration.

That is exactly the kind of detail that matters in hardware. The HID path is conceptually simple. The power-role path is where a small wrong assumption can turn into a board that enumerates but does not charge, charges but loses data role, overheats, or behaves differently across phones and chargers.

The board also had normal manufacturing cleanup work: connector pad geometry, correct ESP32-S3 net assignment, removal of an unnecessary USB2514B hub scaffold, DRC/ERC cleanup, USB-C mechanical fit, PD passives, and DFM review gates. The PCB is where the prototype either becomes repeatable or exposes every shortcut.

The exported KiCad GLB for the V2 USB HID charge-through dongle.

The cost curve

The cost estimate made the same point in numbers.

For the V2 USB HID dongle, the PCBWay snapshot used a 64 mm by 34 mm, 4-layer, 0.8 mm FR-4 board with ENIG, black soldermask, turnkey top-side assembly, about 35 populated parts, and roughly 270 SMD pads.

The estimate before PCBWay manual review looked like this:

Quantity	Estimated total	Estimated unit cost
5	$564.68	$112.94
10	$831.62	$83.16
20	$860.68	$43.03
50	$1,405.38	$28.11

Those numbers are not final production pricing. They exclude customs, tariffs, taxes, procurement markup, attrition, spare components, and whatever PCBWay changes after reviewing the Gerbers, BOM, and CPL.

But the shape is useful. Five assembled boards are expensive because you are paying for setup, not just material. At twenty or fifty, the per-unit cost starts to look like a real accessory. That is the bridge from "I can control one phone with a dev board" to "I can imagine a standardized fleet module."
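The unit-cost column is just the quoted total divided by quantity, which is easy to re-check whenever a new quote arrives. A small sketch using the snapshot totals from the table above; the variable and function names are illustrative.

```python
from typing import Dict

# Estimated totals in USD from the pre-review snapshot, keyed by quantity.
QUOTED_TOTALS: Dict[int, float] = {
    5: 564.68,
    10: 831.62,
    20: 860.68,
    50: 1405.38,
}


def unit_cost(quantity: int, total: float) -> float:
    """Per-board cost, rounded to cents."""
    return round(total / quantity, 2)


unit_costs = {qty: unit_cost(qty, total) for qty, total in QUOTED_TOTALS.items()}

# How much cheaper each board is at 50 units than at 5 units.
saving_ratio = unit_costs[5] / unit_costs[50]

for qty, cost in unit_costs.items():
    print(f"{qty:>3} boards: ${cost:.2f} each")
```

On these figures a board at quantity 50 costs roughly a quarter of what it does at quantity 5, which is the setup-cost effect described above.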

The main priced blocks also tell the story: ESP32-S3 module, Molex phone-side USB-C plug, GCT charge-side receptacle, PD sink, phone-side PD controller, eFuse/current limiter, 3.3 V LDO, USB ESD, EEPROM for policy configuration, and passives.

The constraints

The system works because it accepts its constraints instead of hand-waving them away.

Latency exists. AirPlay and ReplayKit frames are not instantaneous. A viewer can be fresh enough for control but still stale enough to misclassify right after a swipe. Scripts need delays, retries, and visible checkpoints.

Calibration drifts. App layouts change. The right-side action rail in TikTok moves depending on captions, overlays, ads, and post format. A coordinate is a target band, not a permanent truth.

Heat matters. Screen mirroring, recording, charging, and running social apps for long periods warm the devices. Old phones are especially sensitive. A useful farm needs physical spacing, airflow, charging sanity, and pacing.

Charging is not solved by wishing USB-C were simple. Charge-through requires power-role decisions, current limiting, PD policy, cable behavior, connector fit, and validation on real phones.

Physical devices fail in physical ways. Wi-Fi drops. Boards disconnect. Cables get bumped. A phone goes to sleep. A modal appears. The system has to report state honestly so an operator or script can recover.

And again: this is not an undetectable automation layer. It is a hardware-input, real-device control layer. That distinction is important technically and ethically. The work is making the fleet measurable and controllable, not pretending platforms cannot see behavior.
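The delays, retries, and visible checkpoints from the latency constraint can be sketched as one small helper. The names here are hypothetical, and `checkpoint` stands for whatever visual test confirms the expected screen.

```python
import time
from typing import Callable, Sequence


def run_with_checkpoint(
    send: Callable[[Sequence[str]], None],
    commands: Sequence[str],
    checkpoint: Callable[[], bool],
    attempts: int = 3,
    delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send a batch, wait, and verify. Retry until the checkpoint passes.

    Returns the attempt number that succeeded, or 0 if none did, so the
    caller can report the failure honestly instead of carrying on.
    """
    for attempt in range(1, attempts + 1):
        send(list(commands))
        sleep(delay_s)  # give the mirror time to catch up with the phone
        if checkpoint():
            return attempt
    return 0


# Usage with stand-ins: the first check fails, the second one passes.
outcomes = iter([False, True])
sent_log = []
succeeded_on = run_with_checkpoint(
    send=sent_log.append,
    commands=["H", "m,18,0", "m,0,25", "c,1"],
    checkpoint=lambda: next(outcomes),
    sleep=lambda s: None,
)
```

Blind retries only suit sequences that are safe to repeat, such as opening a sheet. Something like posting a comment would need its checkpoint to confirm the post did not go through before it is sent again.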

What made it interesting

The interesting part is that every layer has to be honest with the layer above it.

The phone mirror has to show the real UI. The viewer has to expose enough feedback for calibration. The command API has to say whether a board is connected. The board has to execute a tiny command language consistently. The firmware has to map commands to HID behavior. The hardware has to preserve data and power roles. The script has to wait long enough for the real app, not the app I wish existed.

When those layers line up, the farm stops feeling like a hack. It becomes a control system.

Physical (phone): The real device stays in the loop instead of being abstracted away.

Figure: Legible control loop. Visible state, local commands, firmware input, and board power all stay inspectable.

That is the lesson I care about most from this project. The right abstraction was not "hide the phones." The right abstraction was "make the phones legible."

Once a device is legible, it can be named, addressed, calibrated, scripted, and folded into a larger software loop.

The phone farm still has all the annoying properties of real hardware. It still needs cables, power, heat management, and human judgment when the UI changes. But it also has something much better than a one-off macro: a path from visible state to command to measured result.

That is the real product.

Not Appium wrapped around someone else's interface. Not a claim that hardware input makes detection disappear. Not a pile of phones clicking blindly.

A small fleet of real devices, controlled through mirrored feedback and HID input, with enough local software around it that the physical world starts to behave like an API.