I wanted a phone farm that behaved less like a pile of devices and more like a software system.
That sounds simple until you try to do it with real iPhones. The normal developer answer is Appium, WebDriverAgent, XCTest, simulators, or private app automation. Those tools are useful in the right context, but they were not the shape of the system I wanted. I wanted real phones, real apps, the actual screen, the same touch targets a person sees, and a scriptable control layer.
The result is a stack that combines mirrored visual feedback, ESP32 HID input, Wi-Fi and serial command transport, a local web control plane, and a calibration loop. It is not magic. It is not undetectable. It is not WDA or XCTest. It is a set of physical devices that can be observed, commanded, adjusted, and eventually treated like a small fleet of software-controlled machines.
The product idea
The problem I kept coming back to was feedback.
If a script sends a tap to a phone but cannot see the screen, it is guessing. If it can see a screenshot from five seconds ago, it is still mostly guessing. If it can see the live screen but cannot send input through a reliable path, it is just a monitor. The useful system needs both halves: observation and actuation.
So I built around this loop:
The useful loop is observation, command, USB HID input, and measured feedback from the next frame.
That is why I think of it less as "mobile automation" and more as a small cyber-physical system. The software is real, but it operates on glass rectangles, batteries, cables, Wi-Fi connections, mirrors, cursor bubbles, and apps that change their layouts whenever they feel like it.
Why not just use WDA
The honest answer is that I wanted the stack to operate through the same interaction layer a person uses.
WebDriverAgent and XCTest are developer automation surfaces. They are powerful when you control the target app or when the target allows accessibility tree inspection. But for social app workflows across real logged-in devices, they pull you toward a test-runner model. I wanted an operator model.
This system does not use WDA or XCTest for the iPhone path described here. The input side is HID. In the early version, the ESP32 accepted serial commands from the computer and translated them into mouse-style input. Later, the hardware direction moved toward an ESP32-S3 using native USB HID through a phone-side USB-C connection. In both cases, the phone receives input more like an external mouse or input accessory than a test framework command.
That does not mean the stack is invisible. It only means it has a different footprint. Apps can still observe behavior, timing, network patterns, account history, device state, and all the normal signals that come from using a real service at scale. Saying "not WDA" is accurate. Saying "undetectable" would be careless.
The practical benefit is control at the same layer as the UI.
The first working version
The first version was deliberately crude.
An ESP32 ran mouse firmware. The computer talked to it over serial at 115200 baud. A web interface translated browser movement into simple commands:
- m,x,y for relative movement
- a,x,y for absolute movement, 0-127
- c,b for button clicks
- p,b and r,b for press and release
- s,v and h,v for scrolling
- i for connection status
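A minimal sketch of how the computer side could frame those commands. The command letters and the 0-127 absolute range come from the list above; the newline terminator, the function names, and the abstract `port` object are assumptions, since the firmware's actual framing is not shown here.

```python
import io

# Assumed line terminator. The real firmware framing is not documented here.
TERMINATOR = "\n"


def move_relative(dx, dy):
    """Build a relative-move command: m,x,y."""
    return f"m,{dx},{dy}"


def move_absolute(x, y):
    """Build an absolute-move command: a,x,y with both axes in 0-127."""
    if not (0 <= x <= 127 and 0 <= y <= 127):
        raise ValueError("absolute coordinates must be in 0-127")
    return f"a,{x},{y}"


def click(button):
    """Build a button-click command: c,b."""
    return f"c,{button}"


def send(port, command):
    """Write one command to any object with a write(bytes) method.

    With pyserial this could be a serial.Serial opened at 115200 baud.
    The port is left abstract so the sketch runs without hardware.
    """
    port.write((command + TERMINATOR).encode("ascii"))


# Dry run against an in-memory buffer instead of a real board.
fake_port = io.BytesIO()
for cmd in (move_relative(18, 0), move_relative(0, 25), click(1)):
    send(fake_port, cmd)
print(fake_port.getvalue())
```

Keeping the builders separate from the transport means the same strings can be logged, replayed, or sent over a different link later.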
At the same time, the phone screen was mirrored back to the computer. The early path used AirPlay through UxPlay. With the right flags, it could be made reasonably responsive:
./UxPlay/uxplay -n "Homebase" -nc -vsync no -fps 30
That gave me the first full loop. See the phone, move the cursor, click the UI, see what happened, adjust.
The first version was a closed loop: browser command, board input, phone response, mirrored feedback.
This was useful, but it exposed the real problems quickly. Some apps did not mirror the full interactive UI reliably. The cursor was not always visible in captured frames. Absolute positioning was not reliable on the flashed board I had in hand. The system worked only if calibration became a first-class workflow.
FactoryMirror and visual feedback
The big improvement was changing the visual feedback path.
AirPlay worked for normal mirroring, but TikTok could switch into a video-output mode that hid the full interactive UI. That broke anything involving comments, inputs, sheets, overlays, or action buttons. I needed the actual screen, not a media projection.
So the stack added a ReplayKit-based path through a small iPhone app called FactoryMirror. The phone uploads frames to a local receiver over WebSocket. A local viewer connects to that receiver and shows the live stream. In the handoff docs, the upload path is shaped like:
ws://<local-ip>:8013/upload?stream=device5
and the local viewer connects through:
ws://127.0.0.1:8013/viewer
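For illustration, the two endpoints can be assembled from their parts. This is only a sketch: port 8013, the /upload and /viewer paths, and the stream query parameter come from the URLs above, while the helper names and the example host address are placeholders.

```python
from urllib.parse import urlencode, urlunsplit


def upload_url(host, stream, port=8013):
    """URL the phone-side app uploads frames to (shape taken from the docs)."""
    query = urlencode({"stream": stream})
    return urlunsplit(("ws", f"{host}:{port}", "/upload", query, ""))


def viewer_url(port=8013):
    """URL the local viewer connects to on the same machine."""
    return urlunsplit(("ws", f"127.0.0.1:{port}", "/viewer", "", ""))


# 192.0.2.10 is a documentation-only address standing in for <local-ip>.
print(upload_url("192.0.2.10", "device5"))
print(viewer_url())
```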
The important result was simple: the full TikTok screen was visible again. Comment sheets, input fields, send buttons, close buttons, and overlays were visible as UI.
That changed the system from "send commands and hope" to "send commands and inspect the result." Static captures were useful, but live feedback was more reliable because the cursor bubble could disappear, blend into app colors, or fail to show up at all.
The command plane
Once the control loop worked on one phone, the next problem was making it fleet-shaped.
The local device API is intentionally boring. It runs on a normal local port, stores state in SQLite, exposes health and device listing endpoints, accepts command batches, and pushes pending commands to connected boards. It tracks whether an ESP32 is live over WebSocket and returns operational truth signals like sent_immediately.
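A rough sketch of what "accept a batch, store it, and say honestly whether it went out" could look like. Only SQLite and the sent_immediately signal come from the description above; the table layout, function names, status values, and the set standing in for live WebSocket connections are all hypothetical.

```python
import sqlite3


def open_store():
    """Create an in-memory store. Table and column names are hypothetical."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE commands ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " device TEXT NOT NULL,"
        " command TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return db


def submit_batch(db, live_devices, device, commands):
    """Store a batch and report whether it could be pushed right away.

    live_devices stands in for the set of boards with an open WebSocket.
    """
    sent_immediately = device in live_devices
    status = "sent" if sent_immediately else "pending"
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT INTO commands (device, command, status) VALUES (?, ?, ?)",
            [(device, c, status) for c in commands],
        )
    return {
        "device": device,
        "queued": len(commands),
        "sent_immediately": sent_immediately,
    }


def pending_for(db, device):
    """Return queued commands for a device in submission order."""
    rows = db.execute(
        "SELECT command FROM commands"
        " WHERE device = ? AND status = 'pending' ORDER BY id",
        (device,),
    )
    return [row[0] for row in rows]


db = open_store()
print(submit_batch(db, {"device5"}, "device5", ["H", "m,18,0", "c,1"]))
print(submit_batch(db, {"device5"}, "device7", ["H", "i"]))
print(pending_for(db, "device7"))
```

The point of returning sent_immediately is that the caller never has to guess whether a command reached a board or is still waiting for one to reconnect.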
The command grammar is small on purpose:
- H homes the pointer
- m,dx,dy moves relative to the current pointer
- c,1 clicks
- t,text types text
- swipe,x1,y1,x2,y2,d drags
- i asks for status
That command set looks too simple until you realize how much it buys. It is readable in logs, easy to copy into calibration scripts, storable as a workflow, and transportable over serial, UDP, or WebSocket-backed device connections. It can also be gated by allowlisted prefixes so the local API is not accepting arbitrary shell-shaped input.
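Prefix allowlisting can be sketched in a few lines. The prefixes are the ones in the grammar above; the function names and the choice to reject embedded line breaks are assumptions, and this check deliberately validates the prefix only, not the arguments.

```python
# Prefixes taken from the command grammar described above.
ALLOWED_PREFIXES = ("H", "m", "c", "t", "swipe", "i")


def command_prefix(command):
    """Return everything before the first comma."""
    return command.split(",", 1)[0]


def is_allowed(command):
    """Accept a command only if its prefix is on the allowlist.

    Embedded line breaks are rejected so one string cannot smuggle in a
    second command. Argument validation is out of scope for this sketch.
    """
    if not command or "\n" in command or "\r" in command:
        return False
    return command_prefix(command) in ALLOWED_PREFIXES


def filter_batch(commands):
    """Split a batch into accepted and rejected commands, keeping order."""
    accepted = [c for c in commands if is_allowed(c)]
    rejected = [c for c in commands if not is_allowed(c)]
    return accepted, rejected


batch = ["H", "m,18,0", "rm -rf /", "swipe,15,40,15,15,200", "t,hello, world"]
print(filter_batch(batch))
```

Splitting on the first comma only means typed text that itself contains commas still passes as a single t command.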
A command becomes local state, hardware input, and visible feedback instead of a one-way script.
This is the part that made the farm start feeling like software. The physical device did not disappear. It became addressable.
Calibration is the real work
The most important command in the whole system is H.
That sounds ridiculous, but it is true. The board I was using did not reliably match the repo's expected absolute mouse behavior. The documented a,x,y command existed, and the software around it expected absolute positioning, but the board in front of me behaved more reliably with a reset plus relative movement.
The working rule became:
- Send H to reset the cursor toward the origin.
- Send one or more m,dx,dy moves.
- Click or drag.
- Watch the live viewer.
- Adjust by small increments.
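The first three steps of that rule can be sketched as a small sequence builder. One assumption is doing the work here: that each m command is limited to 127 units per axis, as in a typical 8-bit HID mouse report. The real firmware limit is not stated, so max_step is a parameter, and the function names are hypothetical.

```python
def split_axis(delta, max_step):
    """Split one axis delta into steps no larger than max_step in magnitude."""
    sign = 1 if delta >= 0 else -1
    remaining = abs(delta)
    steps = []
    while remaining > 0:
        step = min(remaining, max_step)
        steps.append(sign * step)
        remaining -= step
    return steps


def home_and_move(dx, dy, max_step=127):
    """Build: H first, then relative moves along x, then along y."""
    commands = ["H"]
    commands += [f"m,{s},0" for s in split_axis(dx, max_step)]
    commands += [f"m,0,{s}" for s in split_axis(dy, max_step)]
    return commands


def tap_at(dx, dy, max_step=127):
    """Home, move to the offset, then click button 1."""
    return home_and_move(dx, dy, max_step) + ["c,1"]


print(tap_at(18, 25))
print(tap_at(300, 0))
```

Starting every sequence with H is what makes the relative offsets repeatable: each run begins from the same corner instead of wherever the last run left the pointer.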
Calibration used a visible grid overlay. The phone frame was divided into columns and rows. I would send a command sequence, watch where the cursor or interaction landed, and record the result. A command like this became a normal test probe:
H
m,18,0
m,0,25
c,1
Sequences like that eventually became named targets: open TikTok, tap Home, open comments, focus an input, post a comment, close a sheet, swipe to the next post. Each phone model needed its own table because screen size, OS chrome, app layout, and pointer behavior differed.
One iPhone could use a next-post swipe like:
H
swipe,15,40,15,15,200
while another needed:
H
swipe,5,45,5,15,200
That is not a bug in the idea. That is the work. A phone farm behaves like software only after the physical layer has been measured enough that software can trust it.
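One way to hold those per-phone differences is a plain lookup table. The two swipe sequences are the ones shown above; the device keys, the target name, and the lookup function are hypothetical labels for illustration.

```python
# Swipe values come from the two examples above. The keys "iphone_a",
# "iphone_b", and "next_post" are made-up labels.
CALIBRATION = {
    "iphone_a": {"next_post": ["H", "swipe,15,40,15,15,200"]},
    "iphone_b": {"next_post": ["H", "swipe,5,45,5,15,200"]},
}


def commands_for(device_model, target):
    """Return a copy of the calibrated sequence, or fail loudly."""
    try:
        return list(CALIBRATION[device_model][target])
    except KeyError:
        raise LookupError(
            f"no calibrated target {target!r} for {device_model!r}"
        ) from None


print(commands_for("iphone_a", "next_post"))
print(commands_for("iphone_b", "next_post"))
```

Failing on a missing entry, rather than falling back to another phone's numbers, keeps an uncalibrated device from clicking somewhere plausible but wrong.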
There is also a timing side. After a swipe, the mirrored stream can lag the actual phone state, so the controller needs to wait before classifying or deciding.
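A sketch of one way to wait for the stream to settle before classifying. It treats the screen as settled once the same frame bytes arrive several times in a row, which is a simplification: a compressed live stream or an animated UI would need a tolerance-based comparison instead. The grab_frame callable and all parameter defaults are assumptions.

```python
import hashlib
import time


def wait_until_settled(grab_frame, stable_reads=3, interval=0.2,
                       timeout=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll frames until identical bytes arrive stable_reads times in a row.

    grab_frame is a hypothetical callable returning the latest mirrored
    frame as bytes. Returns the settled frame, or None on timeout.
    """
    deadline = clock() + timeout
    last_digest = None
    streak = 0
    while clock() < deadline:
        frame = grab_frame()
        digest = hashlib.sha256(frame).digest()
        # Extend the streak on a repeat, restart it on any change.
        streak = streak + 1 if digest == last_digest else 1
        last_digest = digest
        if streak >= stable_reads:
            return frame
        sleep(interval)
    return None


# Dry run with canned frames and no real sleeping.
demo_frames = iter([b"old", b"new", b"new", b"new"])
print(wait_until_settled(lambda: next(demo_frames), sleep=lambda _s: None))
```

Passing sleep and clock in as arguments keeps the timing logic testable without a phone or a real delay.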
The hardware path
The dev-board version proved the loop, but it was not a product-shaped piece of hardware.
Loose ESP32 boards are fine for proving input. They are bad for a farm. Cables move, boards reset, and charging becomes messy. A real setup needs a small inline dongle that can plug into the phone, enumerate as a HID device, and still allow charge-through power.
That led to the ESP32-S3 USB HID charge-through dongle.
The V2 board target is a compact 64 mm by 34 mm four-layer PCB. The phone-side USB-C plug connects the USB 2.0 data lines through ESD protection directly into the ESP32-S3 native USB device path. The ESP32-S3-MINI-1U enumerates as a single USB HID device. There is no USB hub in the current revision because the board only needs one HID device plus a separate charge path.
The board keeps HID data and charge-through power as separate paths that meet at the phone.
The charging side is where the board stops being a simple hobby adapter. The charge-side USB-C receptacle feeds a PD sink/controller path. The protected 5 V rail powers the system and feeds the phone-side source switch. The phone-side CC path needs a real source/UFP controller policy, not a passive resistor shortcut. The current docs explicitly gate production export on locking the final PD controller SKU and configuration.
That is exactly the kind of detail that matters in hardware. The HID path is conceptually simple. The power-role path is where a small wrong assumption can turn into a board that enumerates but does not charge, charges but loses data role, overheats, or behaves differently across phones and chargers.
The board also had normal manufacturing cleanup work: connector pad geometry, correct ESP32-S3 net assignment, removal of an unnecessary USB2514B hub scaffold, DRC/ERC cleanup, USB-C mechanical fit, PD passives, and DFM review gates. The PCB is where the prototype either becomes repeatable or exposes every shortcut.
The cost curve
The cost estimate made the same point in numbers.
For the V2 USB HID dongle, the PCBWay snapshot used a 64 mm by 34 mm, 4-layer, 0.8 mm FR-4 board with ENIG, black soldermask, turnkey top-side assembly, about 35 populated parts, and roughly 270 SMD pads.
The estimate before PCBWay manual review looked like this:
| Quantity | Estimated total | Estimated unit cost |
|---|---|---|
| 5 | $564.68 | $112.94 |
| 10 | $831.62 | $83.16 |
| 20 | $860.68 | $43.03 |
| 50 | $1,405.38 | $28.11 |
Those numbers are not final production pricing. They exclude customs, tariffs, taxes, procurement markup, attrition, spare components, and whatever PCBWay changes after reviewing the Gerbers, BOM, and CPL.
But the shape is useful. Five assembled boards are expensive because you are paying for setup, not just material. At twenty or fifty, the per-unit cost starts to look like a real accessory. That is the bridge from "I can control one phone with a dev board" to "I can imagine a standardized fleet module."
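The unit-cost column is just the estimated total divided by quantity. A short check, using only the totals from the table above (the names are illustrative):

```python
from decimal import Decimal, ROUND_HALF_UP

# Estimated totals from the table above, in USD, before manual review.
QUOTES = {
    5: Decimal("564.68"),
    10: Decimal("831.62"),
    20: Decimal("860.68"),
    50: Decimal("1405.38"),
}


def unit_cost(quantity):
    """Total divided by quantity, rounded to cents."""
    cents = Decimal("0.01")
    return (QUOTES[quantity] / quantity).quantize(cents, rounding=ROUND_HALF_UP)


for quantity in sorted(QUOTES):
    print(quantity, QUOTES[quantity], unit_cost(quantity))
```

Going from 5 to 50 boards multiplies the total by about 2.5 while the quantity goes up tenfold, which is the setup-cost effect in one line.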
The main priced blocks also tell the story: ESP32-S3 module, Molex phone-side USB-C plug, GCT charge-side receptacle, PD sink, phone-side PD controller, eFuse/current limiter, 3.3 V LDO, USB ESD, EEPROM for policy configuration, and passives.
The constraints
The system works because it accepts its constraints instead of hand-waving them away.
Latency exists. AirPlay and ReplayKit frames are not instantaneous. A viewer can be fresh enough for control but still stale enough to misclassify right after a swipe. Scripts need delays, retries, and visible checkpoints.
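"Delays, retries, and visible checkpoints" can be sketched as a single wrapper. Everything here is hypothetical scaffolding: send_step would push commands to a board, and checkpoint_reached would inspect the mirrored screen for the expected state.

```python
import time


def run_with_checkpoint(send_step, checkpoint_reached, retries=3,
                        delay=1.0, sleep=time.sleep):
    """Send a step, wait for the stream to catch up, then verify.

    Returns the number of attempts used. Raises RuntimeError if the
    checkpoint never shows up.
    """
    for attempt in range(1, retries + 1):
        send_step()
        sleep(delay)  # give the mirrored stream time to reflect the change
        if checkpoint_reached():
            return attempt
    raise RuntimeError(f"checkpoint not reached after {retries} attempts")


# Dry run: the fake checkpoint succeeds on the second look.
looks = {"count": 0}


def fake_checkpoint():
    looks["count"] += 1
    return looks["count"] >= 2


print(run_with_checkpoint(lambda: None, fake_checkpoint, sleep=lambda _s: None))
```

Blind retries are only safe for steps that can be repeated without side effects. Resending a swipe, for example, can skip two posts, so a step like that should re-check the screen before sending again.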
Calibration drifts. App layouts change. The right-side action rail in TikTok moves depending on captions, overlays, ads, and post format. A coordinate is better treated as a target band than as a permanent truth.
Heat matters. Screen mirroring, recording, charging, and running social apps for long periods all warm devices. Old phones are especially sensitive. A useful farm needs physical spacing, airflow, charging sanity, and pacing.
Charging is not solved by wishing USB-C were simple. Charge-through requires power-role decisions, current limiting, PD policy, cable behavior, connector fit, and validation on real phones.
Physical devices fail in physical ways. Wi-Fi drops. Boards disconnect. Cables get bumped. A phone goes to sleep. A modal appears. The system has to report state honestly so an operator or script can recover.
And again: this is not an undetectable automation layer. It is a hardware-input, real-device control layer. That distinction is important technically and ethically. The work is making the fleet measurable and controllable, not pretending platforms cannot see behavior.
What made it interesting
The interesting part is that every layer has to be honest with the layer above it.
The phone mirror has to show the real UI. The viewer has to expose enough feedback for calibration. The command API has to say whether a board is connected. The board has to execute a tiny command language consistently. The firmware has to map commands to HID behavior. The hardware has to preserve data and power roles. The script has to wait long enough for the real app, not the app I wish existed.
When those layers line up, the farm stops feeling like a hack. It becomes a control system.
Visible state, local commands, firmware input, and board power all stay inspectable.
That is the lesson I care about most from this project. The right abstraction was not "hide the phones." The right abstraction was "make the phones legible."
Once a device is legible, it can be named, addressed, calibrated, scripted, and folded into a larger software loop.
The phone farm still has all the annoying properties of real hardware. It still needs cables, power, heat management, and human judgment when the UI changes. But it also has something much better than a one-off macro: a path from visible state to command to measured result.
That is the real product.
Not Appium wrapped around someone else's interface. Not a claim that hardware input makes detection disappear. Not a pile of phones clicking blindly.
A small fleet of real devices, controlled through mirrored feedback and HID input, with enough local software around it that the physical world starts to behave like an API.