Rahul Hathwar · March 19, 2026
Grab a tiny NPC, aim, throw it at a mining node. It flies through a parabolic arc and lands exactly where you aimed. It starts mining immediately, collects what drops, and carries it to a delivery point. If the node is destroyed mid-task, it automatically finds the next best target. It fishes. It avoids dangerous zones. It knows where every other worker is heading and doesn't land on top of them.
That's the worker NPC system in a sentence, and it still doesn't do it justice.
This article is a detailed account of how I built it: the architecture decisions and why I made them, the problems I hit, the improvements I took ownership of beyond the original spec as I noticed gaps, and the engineering principles that held all of it together. I was the sole engineer on this NPC system, working on a Roblox multiplayer startup project backed directly by KreekCraft, one of the platform's top global content creators with over 16 million subscribers. In playtesting, the system handled around 100 simultaneous workers across a server (15 to 20 per player) running smoothly on low-end mobile hardware.
This project is under NDA. I'm writing with explicit permission from the studio and covering exclusively the systems I personally owned in design and implementation. I've kept game-specific design details to the minimum needed for context.
Building for an Uncertain Spec
The most important decisions I made for this system had nothing to do with code. They had to do with recognizing early that the design was still evolving: what certain interactions were supposed to feel like, what mechanics might be added or cut, what order of operations made sense for a given feature. The wrong response to that is waiting for a finalized spec. The right response is designing technical architecture that makes change cheap.
My guiding principle was this: a design pivot should cost a reconfiguration, not a rewrite. That meant stateless and functional behavior where possible, modular components that could be reordered or swapped without touching what surrounded them, and explicit, formal state management rather than a spaghetti-structure of ad hoc flags.
This paid off more than I expected. Features added mid-project slotted in with low friction. Some polish phases took 2 to 3 days not because the work was simple, but because the architecture had already made room for it. A good foundation doesn't make development fast. It keeps development from becoming slow.
System Organization
Before getting into design decisions, it's worth describing how the system is organized, because the structure itself was a decision.
On the server, I used a singleton service layer (WorkerService) as the entry point, handling all network communication, player lifecycle events, and cross-worker coordination. Each active worker gets its own manager instance (WorkerManager) responsible for that worker's behavior, state, and lifecycle. A separate factory class (WorkerFactory) handles all worker creation, including validation and component pre-allocation. A renderer singleton (WorkerRenderer) handles all 3D model setup and visual mutation; no other class touches the model directly.
On the client, a parallel organization: a WorkerClient singleton handling network listeners and signals, a WorkerClientRenderer for smooth position tweens, a WorkerGrabInteraction module owning all raycasting and interaction-mode detection, and a WorkerOverheadGUI system managing per-worker billboards.
This follows standard industry OOP with single responsibility in mind. The practical benefit was that debugging was fast: a pathfinding bug lived in LocomotionController, a visual error lived in the renderer, a state issue lived in the FSM. Nothing leaked into anything else. Adding a feature meant identifying the right layer, not combing through an entanglement of dependencies.
State as a First-Class Citizen
Why a Formal State Machine
Workers exhibit complex, branching behavior. Mining, fishing, being grabbed mid-task, navigating, carrying resources, waiting for targets, handling failed transitions. It would have been easy to manage all of that with a set of boolean flags: isCarrying, isFishing, isGrabbed, hasTarget. Well... easy to start, painful to maintain. Conditions like if isCarrying and not isGrabbed and lastTask ~= "fishing" are fragile, opaque, and fail the second a single flag is improperly set.
I introduced a formal FSM early, and not just because it's cleaner. I anticipated two specific problems that a flag-based approach handles badly.
The first is state pollution. When a worker finishes mining and moves to a new task, there should be a clear, explicit moment where the old task's data is cleaned up and the new task begins fresh. With flags, that cleanup is scattered and easy to miss. With an FSM, every transition is an opportunity to answer the question: what do I clear, and what do I keep? The architecture forces you to confront that question. You can't skip it. That reduces an entire class of subtle state-leakage bugs to something you deal with at design time rather than runtime.
The second is cognitive load. As behavior grows more complex, it becomes impossible to reason about the full system from memory. An FSM makes the valid states and their transitions explicit documentation. Anyone reading the code can see exactly what a worker can and cannot do in a given state. This matters for maintenance, for onboarding, and for a fast-moving project where design details keep changing.
These decisions reflect a design philosophy I applied consistently throughout the system; error handling, covered later in the article, follows the same reasoning.
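As a concrete sketch of the idea (the state names here are illustrative, not the actual state list), the valid transitions can be declared as data, so checking a transition is a table lookup rather than a chain of conditionals:

```lua
-- Sketch: valid transitions declared as data (state names illustrative)
local VALID_TRANSITIONS: { [string]: { [string]: true } } = {
    Idle = { WalkingToNode = true, GrabEntry = true },
    WalkingToNode = { MiningMaterialNode = true, GrabEntry = true, Idle = true },
    MiningMaterialNode = { CarryingMaterial = true, GrabEntry = true, Idle = true },
    GrabEntry = { GrabThrowIdle = true },
}

function WorkerFSM:CanTransition(newState: string): boolean
    local allowed = VALID_TRANSITIONS[self.currentState]
    return allowed ~= nil and allowed[newState] == true
end
```

A transition no one declared simply cannot happen, which is what turns the state graph into readable documentation.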
One FSM or Two?
Early on I faced a tough decision: implement a single FSM covering both behavior and animation, or maintain two separate machines where one drives gameplay logic and the other drives animation sequences.
Two FSMs would have been cleaner in theory. In practice, under time pressure, it would have added synchronization complexity between the two machines, more moving parts, and more places for bugs to hide. I chose one FSM and solved the animation problem with a blocking state concept.
Some transitions involve an animation sequence that must run to completion without interruption. The grab entry, for example, plays a multi-phase animation before the worker enters the held state. Rather than scattering guards across the codebase, the FSM can be set to blocking, which gates all transitions during that exclusive window:
function WorkerFSM:Transition(newState, reason)
    -- Reject if current state is locked by an animation sequence
    if self._isBlocked then
        return { success = false, error = { code = "StateBlocked", ... } }
    end

    -- Pass 1: Is this transition structurally valid?
    if not self:CanTransition(newState) then
        return { success = false, error = { code = "InvalidState", ... } }
    end

    -- Pass 2: Does this transition pass its runtime validator?
    local validator = self._validatorRegistry[`{self.currentState}->{newState}`]
    if validator then
        local ok, msg = validator(self.workerId, reason)
        if not ok then
            return { success = false, error = { code = "ValidatorFailed", message = msg } }
        end
    end

    self.currentState = newState
    return { success = true, value = true }
end
One mechanism, applied in one place, covers the full class of animation-exclusivity problems across the entire FSM. This was the balance I was aiming for. It's technically sound, requires no additional FSM to synchronize with, and was fast to implement correctly under pressure.
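A hypothetical usage sketch (SetBlocked and the animation helper are illustrative names): the grab-entry sequence sets the block, runs its animation phases to completion, then releases it before transitioning:

```lua
-- Sketch: wrapping an uninterruptible animation window with the blocking flag
function WorkerManager:_playGrabEntry()
    self._fsm:SetBlocked(true) -- All Transition() calls now fail with StateBlocked
    self:_playAnimationPhases("GrabEntry") -- Multi-phase animation runs to completion
    self._fsm:SetBlocked(false)
    self._fsm:Transition("GrabThrowIdle", "grab entry animation finished")
end
```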
Reflection: If I were building this again with more time, I'd move toward a more declarative, signal-driven pattern rather than the current approach of calling FSM transition functions inline alongside gameplay logic. The current architecture is correct and works well, but the coupling between state transitions and their effects is more imperative than I'd like. That's a known tradeoff I accepted to move forward.
State Granularity
A related design choice: how granular should the states be? Some states I defined exist almost purely to provide safe validation checkpoints, not to represent meaningfully distinct behavior.
FishingIdle and FishingActive are a good example. From a gameplay perspective, both states mean "worker is fishing." But they represent two different server-validated stages: in FishingIdle, the worker is waiting, and the server can reliably check that fishing is still valid (the spot is still available, the worker hasn't been grabbed). In FishingActive, a fish has spawned and the gameplay consequence of catching has been committed. Splitting these into two states means I have a clean checkpoint to perform validation and ensure the server and client agree before progressing.
The grab sequence works similarly. GrabEntry, GrabThrowIdle, and GrabThrowVisual are three states that encode three distinct validation opportunities: confirming the grab animation started correctly, confirming the player is in windback hold, and confirming the throw was committed. Each stage can validate independently. The FSM makes this explicit by design.
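A hedged sketch of what one of those checkpoint validators might look like (RegisterValidator and the availability check are illustrative; the registry key format matches the Transition code earlier):

```lua
-- Sketch: a runtime validator for the FishingIdle -> FishingActive checkpoint
fsm:RegisterValidator("FishingIdle->FishingActive", function(workerId, reason)
    local worker = WorkerService:GetWorker(workerId)
    if worker == nil then
        return false, "worker no longer exists"
    end
    -- The spot must still be available and the worker must not have been grabbed
    if worker.fishing == nil or not FishingService:IsSpotAvailable(worker.fishing.spotId) then
        return false, "fishing spot no longer available"
    end
    return true
end)
```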
A Note on Error Handling
The validation results returned by the FSM take that shape because of a pattern I pushed for early in development: Rust-inspired Result types across the entire codebase. Every operation that can fail returns a typed { success, value } or { success, error } structure rather than raising an exception or returning nil. This applies to the FSM, to inventory operations, to pathfinding calls, and to network responses.
The practical effects are that error states are never implicit, callers are forced to handle them explicitly, and debugging has a consistent shape everywhere. I pitched this to the team at the start of the project and it became a shared standard. Later, when I needed errors to survive the network boundary, I built AppError on top of the same structured data. More on that in the network section.
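A minimal sketch of the pattern (the constructor names are my own; the article's code uses the same { success, value } / { success, error } shape):

```lua
-- Sketch: Rust-inspired Result constructors in Luau
export type Result<T> = { success: true, value: T } | { success: false, error: { code: string, message: string? } }

local Result = {}

function Result.ok<T>(value: T): Result<T>
    return { success = true, value = value }
end

function Result.err(code: string, message: string?): Result<any>
    return { success = false, error = { code = code, message = message } }
end

-- Callers must branch explicitly; there is no nil to forget to check
local result = inventory:AddItem(itemKey)
if not result.success then
    warn(result.error.code, result.error.message)
    return result -- Propagate the structured error upward unchanged
end
```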
Component Architecture
Not all behavior is always active. A worker that's idle doesn't need a live pathfinding state. A worker that isn't mining doesn't need mining progress data hanging around in memory. I designed the WorkerInstance type using an ECS-inspired component model: components are allocated on demand when a behavior starts and set to nil when it stops.
-- WorkerInstance always has core identity fields
export type WorkerInstance = {
    workerId: string,
    itemKey: string, -- References static definition in ItemRegistry
    ownerId: number,
    state: WorkerState,

    -- Components: nil when inactive, allocated when needed
    pathfinding: PathfindingComponent?,
    mining: MiningComponent?,
    fishing: FishingComponent?,

    -- Carrying state: present only while transporting something
    carriedMaterialKey: string?,
    carriedFishKey: string?,
}
The WorkerDefinition in the item registry carries capability flags (hasPathfinding, hasMining) that determine which components a particular worker type supports, so the factory only pre-allocates what a given type actually uses.
The result is clear, readable, and explicit. When worker.pathfinding is nil, you know the worker isn't navigating. When it's populated, you know it is. In keeping with the earlier principles, there are no ambiguous flags, no "magic" states, and no guessing. Strict typing enforces this at compile time.
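A sketch of how the factory might consult those capability flags at creation time (field and helper names beyond those shown in WorkerInstance are assumptions):

```lua
-- Sketch: capability-gated component pre-allocation in WorkerFactory
function WorkerFactory.Create(itemKey: string, ownerId: number): WorkerInstance
    local definition = ItemRegistry.GetWorkerDefinition(itemKey)
    local worker: WorkerInstance = {
        workerId = HttpService:GenerateGUID(false),
        itemKey = itemKey,
        ownerId = ownerId,
        state = "Idle",
        -- Components stay nil unless this worker type supports them;
        -- behavior-scoped ones (mining, fishing) are allocated when the task starts
        pathfinding = if definition.hasPathfinding then PathfindingComponent.new() else nil,
        mining = nil,
        fishing = nil,
    }
    return worker
end
```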
Persistent vs. Ephemeral Data
One principle I enforced throughout was that the WorkerInstance type should contain only non-derivable state: data that uniquely identifies what a worker is doing and cannot be inferred from anything else. Temporary operational data that can be reconstructed, or that is only relevant to the manager's internal execution, lives in WorkerManager's private fields instead.
For example, the 3D model of a material that a worker is visually carrying (the actual Roblox object attached to the worker's hand) lives in WorkerManager._carriedMaterialModel. That's a rendering detail. The fact that a worker is carrying a material, and which material it is, lives in WorkerInstance.carriedMaterialKey. That's state.
This distinction matters for two reasons. One is correctness: WorkerInstance is the data that gets transmitted on the network and reasoned about by external systems. Leaking rendering details into it would be a layering violation. The other is readability: a designer or engineer reading the WorkerInstance type to understand the system should see a clean map of what defines a worker at any moment, not implementation details.
Movement and Precision
The Locomotion Problem
In production playtests, workers exhibited 1 to 2 reproducible movement bugs per minute of gameplay. Workers teleported while being grabbed. Workers still moved toward old targets after being reassigned. Multiple code paths competed over Humanoid:MoveTo() and produced erratic, snapping motion that was nearly impossible to trace to a single cause.
The root issue: movement logic was scattered. Multiple systems issued MoveTo() calls independently with no concept of ownership or priority. There was nothing to arbitrate between them.
I designed LocomotionController as a centralized movement authority. External code sets a target and an optional arrival callback and the controller handles everything from there. When a new target is set mid-movement, the previous movement is cancelled atomically and the old callback fires with reached = false. This means no lingering targets and no competing calls.
The tiered escalation strategy I implemented was informed by observation: the majority of the time, a worker can get from A to B in a straight line. Most cases where that fails involve a small obstacle (something easily jumpable), not a complex navigation problem. Only a small minority of situations genuinely require full pathfinding. So the controller escalates accordingly:
SetTarget(position)
        │
        ▼
Linear MoveTo ── arrived? ──► Done
        │
        │ no progress (stuck)
        ▼
Jump attempt ── cleared? ──► Resume linear
        │
        │ still stuck
        ▼
Backstep + run-up sequence ── cleared? ──► Resume linear
        │
        │ still stuck
        ▼
PathfindingService ── path found? ──► Follow waypoints
        │
        │ no valid path
        ▼
Teleport (always logged; last resort only)
This mimics the intuitive approach a player would take to the same obstacle. It also makes the common case fast: a straight-line move is cheap and immediate, and redundant PathfindingService calls (which are expensive and sometimes fail where a direct MoveTo succeeds) are avoided unless actually necessary.
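A condensed sketch of that escalation loop (tier names and internals are illustrative; the real controller tracks per-heartbeat progress to detect "stuck"):

```lua
-- Sketch: tiered escalation inside LocomotionController
local TIERS = { "Linear", "Jump", "BackstepRunUp", "Pathfind", "Teleport" }

function LocomotionController:_onStuck()
    self._tierIndex += 1
    local tier = TIERS[self._tierIndex]
    if tier == "Jump" then
        self._humanoid.Jump = true -- Small obstacle: try to hop over it
    elseif tier == "BackstepRunUp" then
        self:_backstepAndRetry() -- Reposition, then re-attempt the linear move
    elseif tier == "Pathfind" then
        self:_computeAndFollowPath() -- Full PathfindingService fallback
    elseif tier == "Teleport" then
        warn(`[Locomotion] Teleporting worker {self._workerId}; all tiers exhausted`)
        self:_teleportToTarget()
    end
    -- On any cleared obstacle, _onProgress() resets _tierIndex back to linear
end
```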
Getting Position Right
The visual quality of a worker mining from a node depends almost entirely on where it stands. If the positioning is even slightly off, the mining animation doesn't look like it's connecting with the node. Workers visually float, clip, or stand at odd angles. This was a non-trivial problem.
The positioning logic works in multiple stages. First, the server raycasts the node to determine its surface and shape. It computes the ideal standoff position: close enough that the animation reads as contact, from an approach angle that makes geometric sense given the worker's current position. This calculation accounts for the node's bounding volume, not just its center point, which was essential: using only the center caused workers to position themselves strangely on nodes with irregular shapes or that sat on uneven terrain.
The bigger challenge was timing. Early prototypes computed this position on demand at the moment the worker needed to start mining. That caused a visible snap: the worker would arrive at a rough position, recalculate, and then correct. The FSM actually helped surface this problem clearly: the transition to MiningMaterialNode had an obvious location where positional data needed to be available, and it wasn't.
The solution was pre-computation. Whether a worker reaches a node by walking or by being thrown, the mining standoff position is calculated asynchronously and stored in _pendingMiningPosition before the worker arrives. By the time the mining state begins, the position is already determined. The worker never snaps.
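A sketch of the standoff computation under the bounding-volume approach described above (the helper, gap constant, and raycast distances are my own stand-ins):

```lua
-- Sketch: compute a mining standoff position from the node's bounding volume
local STANDOFF_GAP = 1.5 -- Studs between node surface and worker (illustrative)

local function computeStandoff(nodeModel: Model, workerPosition: Vector3): Vector3
    local nodeCFrame, nodeSize = nodeModel:GetBoundingBox()
    -- Approach from the worker's current side of the node, flattened to XZ
    local toWorker = (workerPosition - nodeCFrame.Position) * Vector3.new(1, 0, 1)
    local approachDir = if toWorker.Magnitude > 0.001 then toWorker.Unit else Vector3.zAxis
    -- Stand off from the volume's horizontal extent, not the center point
    local surfaceOffset = math.max(nodeSize.X, nodeSize.Z) / 2
    local standXZ = nodeCFrame.Position + approachDir * (surfaceOffset + STANDOFF_GAP)
    -- Raycast down to settle the position on the actual terrain height
    local ray = workspace:Raycast(standXZ + Vector3.yAxis * 10, Vector3.yAxis * -50)
    return if ray then ray.Position else standXZ
end
```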
This is something I think about as an engineering principle: a technical architecture decision is never a valid excuse for a substandard experience. The FSM enforced clean state ownership, but that ownership structure also clarified where the pre-computation needed to happen. The answer was to reorganize responsibility, not to lower the visual standard.
No Two Workers Land in the Same Spot
When multiple workers target the same node, they need unique approach positions; they can't land on top of each other. Before I solved this, workers thrown to the same node would intersect visually and perform redundant geometry checks on every tick to try to separate themselves, producing jittery corrections and inconsistent results.
I replaced it with a module-level approach-slot reservation table keyed by node ID:
-- node ID → { worker ID → reserved approach position }
-- Written at throw-commit time. Consumed by MineMaterialNode.
local _nodeReservations: { [string]: { [string]: Vector3 } } = {}
A worker claims its approach slot at the moment a throw is committed, before it starts moving. When getApproachPositions distributes slots around a node, it reads this table in O(1). Other workers see the slot as occupied immediately and offset to the next available one. When a worker is interrupted or reassigned, its slot is released.
The visual result is smooth, self-aware landings: workers consistently space themselves around a node with no runtime corrections, no overlap, and no delay.
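Claim and release are simple table operations against that structure (function names and the tolerance check are illustrative):

```lua
-- Sketch: claim/release around the reservation table above
local function claimSlot(nodeId: string, workerId: string, position: Vector3)
    local slots = _nodeReservations[nodeId]
    if slots == nil then
        slots = {}
        _nodeReservations[nodeId] = slots
    end
    slots[workerId] = position
end

local function releaseSlot(nodeId: string, workerId: string)
    local slots = _nodeReservations[nodeId]
    if slots then
        slots[workerId] = nil
        if next(slots) == nil then
            _nodeReservations[nodeId] = nil -- Drop empty tables so the map stays small
        end
    end
end

local function isSlotTaken(nodeId: string, position: Vector3, tolerance: number): boolean
    for _, reserved in _nodeReservations[nodeId] or {} do
        if (reserved - position).Magnitude < tolerance then
            return true
        end
    end
    return false
end
```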
Terrain-Aware Path Selection
PathfindingService in Roblox offers two agent profiles: flat traversal and jump-capable. Using jump everywhere is unnecessarily expensive. Using flat everywhere silently fails on terrain with meaningful elevation differences.
Before dispatching a path computation, the controller evaluates the elevation delta between start and goal, then additionally raycasts the midpoint of the route to check for hills between the two points. If either check indicates the route needs jump capability, the controller upgrades to the jump-enabled agent. Otherwise it uses the cheaper flat profile. A small up-front calculation that completely eliminates a class of silent path failures.
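A sketch of that decision, assuming two pre-built agent-parameter tables and a threshold constant of my own choosing:

```lua
-- Sketch: choose the cheaper flat agent unless the route needs jumps
local ELEVATION_THRESHOLD = 4 -- Studs of tolerated height difference (illustrative)

local function needsJumpAgent(startPos: Vector3, goalPos: Vector3): boolean
    -- Check 1: meaningful elevation delta between the endpoints
    if math.abs(goalPos.Y - startPos.Y) > ELEVATION_THRESHOLD then
        return true
    end
    -- Check 2: raycast down at the route midpoint to detect a hill in between
    local mid = (startPos + goalPos) / 2
    local hit = workspace:Raycast(mid + Vector3.yAxis * 100, Vector3.yAxis * -200)
    if hit and hit.Position.Y > math.max(startPos.Y, goalPos.Y) + ELEVATION_THRESHOLD then
        return true
    end
    return false
end

-- JUMP_AGENT_PARAMS / FLAT_AGENT_PARAMS are assumed pre-built agent tables
local agentParams = if needsJumpAgent(startPos, goalPos) then JUMP_AGENT_PARAMS else FLAT_AGENT_PARAMS
local path = PathfindingService:CreatePath(agentParams)
```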
Exclusion Zone Avoidance
There is a region of the map that workers must never enter. I implemented avoidance at two independent levels, because neither alone is sufficient.
The first is a PathfindingModifier-tagged invisible cylinder obstacle placed over the zone at startup. PathfindingService reads this tag in its cost table and routes around it natively at path-planning time.
The second is an XZ-geometry validation pass applied to every computed path, rejecting any waypoint that crosses the exclusion radius regardless of what the planner produced. This catches cases where the planner's approximation produces a path that technically avoids the modifier volume but still crosses the boundary at ground level.
Both layers can fail independently. The geometry check catches rare planner edge cases. The modifier prevents the planner from producing bad paths in the first place. Together they are robust.
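The geometry pass reduces to a short loop over the computed waypoints (the zone center and radius constants are stand-ins for the real map data):

```lua
-- Sketch: XZ-plane validation of every computed waypoint
local EXCLUSION_CENTER = Vector3.new(0, 0, 0) -- Stand-in for the real zone center
local EXCLUSION_RADIUS = 60 -- Stand-in radius in studs

local function pathCrossesExclusionZone(waypoints: { PathWaypoint }): boolean
    for _, waypoint in waypoints do
        -- Ignore Y: the zone is a vertical cylinder, so only XZ distance matters
        local delta = (waypoint.Position - EXCLUSION_CENTER) * Vector3.new(1, 0, 1)
        if delta.Magnitude < EXCLUSION_RADIUS then
            return true -- Reject the whole path; the planner's approximation leaked
        end
    end
    return false
end
```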
The Grab and Throw Mechanic
The design brief for this was literally "grab and throw." The implementation was my call entirely.
I decided to make it a physics toy with character. The throw arc scales with distance: a short throw is nearly flat, a longer one rises into a proper parabola, and a very long throw peaks high before descending. This is calculated from a simple distance-to-curve mapping and a clamped tween duration that keeps even long throws snappy. The key constraint was that throw time should feel consistent: players are using this as an efficiency tool, and an unpredictable flight duration would interrupt the rhythm of gameplay.
The arc calculation is a shared module used by both the client arc-beam preview and the server throw tween. Both consume identical parameters from the same function. What you see while aiming is exactly where the worker lands.
-- Distance-to-arc mapping (thresholds and scales are tuning constants)
local curveSize
if distance < SHORT_THRESHOLD then
    curveSize = distance * SHORT_SCALE -- Near-flat
elseif distance < MEDIUM_THRESHOLD then
    curveSize = SHORT_BASE + (distance - SHORT_THRESHOLD) * MEDIUM_SCALE
else
    curveSize = MEDIUM_BASE + math.min((distance - MEDIUM_THRESHOLD) * LONG_SCALE, LONG_CAP)
end
local tweenDuration = math.clamp(distance * SPEED_FACTOR, MIN_DURATION, MAX_DURATION)
Beyond the mechanics, I wanted the full sequence to feel polished. I sourced animations from Mixamo, verified licensing, timed them against the throw tween durations, tuned the VFX, and worked through configuration until the whole thing had a satisfying, energetic feel to it. A lot of that work wasn't in my spec. I did it anyway because I knew investors and stakeholders would see this demo, and something technically correct but visually inert comes off as unfinished to anyone who isn't reading the code. Early polish also built confidence across the team during a period when a lot of design was still unsettled. It was something tangible to react to.
Mobile Controls
Mobile UX hadn't been deeply explored when I started on the grab mechanic. I took it on myself to design and implement it fully.
The first thing I addressed was the tap target: a worker is a tiny character, and precision touch selection on a small screen is unreliable. I made the interaction hitbox significantly larger than the visual model, which alone eliminated most missed-tap frustrations.
For the throw interaction, I designed an outer ring that pulses while the player is holding a worker, communicating readiness without any text. Releasing requires a deliberate double-tap, which reduces accidental throws while also giving the player more time to aim. On PC you have a cursor and on mobile you have your thumb. These aren't the same input, and they deserved different solutions.
I mocked the full mobile interaction up in Affinity Designer first, presented it to the team, got alignment, and then implemented it. I think that sequence matters: design before code, share before building, get feedback while changes are still cheap to implement.
Worker Fishing
Workers can be assigned to fish from rivers. I was asked to implement the mechanic.
What I actually made went beyond that. I wanted it to look alive when you watched it. The worker walks to the bank, equips a fishing rod, and waits. When a catch triggers, a fish model spawns upstream and visually swims toward the worker along the river current before the reeling animation begins. I sketched out the sequence, prototyped it in a day, and showed it to the team. The infrastructure to support it was already in place, the cost was manageable, and the result was something more visually memorable.
The technical details of the fishing system are worth a breakdown of their own. Read more about it here.
The principle I kept returning to was that when a system is visible to players, its behavior is the experience. Getting the mechanics architecturally correct is necessary. Making them feel immersive and interactive is what actually matters to the person playing.
Network Architecture
Server Owns Truth, Client Owns Feel
The worker system is fully server-authoritative. The server manages the FSM, all state transitions, physics, model welds, and animation triggers. The client owns none of that.
What the client does own is smoothness. When the server initiates a throw, it sends a packet with the arc parameters computed by the shared WorkerThrowArc module. The client plays the position tween independently, decoupled from network latency. Because both sides consume the same calculation, the visual arc always matches where the worker actually lands. This means no reconciliation, no correction on arrival, and visually, no snap or stutter.
Error Propagation: Result Types and AppError
Earlier I described the Result type pattern for internal error handling. The counterpart on the network side is AppError: a structured error format I designed to survive the server-to-client boundary intact.
-- Server: throw a structured, typed error
error(AppError.new("CapacityExceeded", "Inventory full; cannot complete worker purchase"))
-- Network layer: pcall catches it, serializes to { errorCode, errorMessage }
-- Client: reads errorCode, maps it to a player-facing message
The underlying principle is data consistency. The error code and message are the same data at every layer: the server produces them, the network transmits them unchanged, and the client reads them directly. There is no string parsing and no opaque failure that disappears mid-propagation. The data stays consistent; only what each layer does with it differs by context.
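A minimal sketch of how the boundary might work (everything beyond the errorCode/errorMessage shape described above, including the remote and handler names, is an assumption):

```lua
-- Sketch: AppError construction and network-boundary serialization
local AppError = {}
AppError.__index = AppError

function AppError.new(code: string, message: string)
    return setmetatable({ errorCode = code, errorMessage = message }, AppError)
end

-- Network layer: pcall the handler, serialize whatever structured error escapes
local ok, err = pcall(handler, player, payload)
if not ok then
    local serialized = if typeof(err) == "table" and err.errorCode
        then { errorCode = err.errorCode, errorMessage = err.errorMessage }
        else { errorCode = "Internal", errorMessage = tostring(err) }
    remote:FireClient(player, { success = false, error = serialized })
end
```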
Event-Driven Grab Confirmation
The grab flow is deliberately validated on the server before the client UI responds. The client sends a grab request, the server validates it against the FSM (a grab is not always valid; certain states block it), the server fires the response, and only on a confirmed success does the client open the grab panel.
This is a deliberate choice. An optimistic UI that opens before server confirmation can easily desync from the actual worker state, which creates a class of subtle bugs that are hard to reproduce and confusing for players. Given that our network packets are small and the event signals are fast, the latency cost of waiting for confirmation is negligible. The correctness benefit is not.
Filtering Network Traffic
Not every FSM state transition needs to reach the client. Internal states like PlayingEmote carry no player-facing meaning. Two predicates gate outbound state packets on the server:
ShouldSendUpdate(state): is this state something the UI needs to display at all?
IsSameDisplayState(oldState, newState): do these two FSM states map to the same display representation?
If either check returns false, no packet is sent. Traffic stays minimal and the client stays reactive to what is actually worth displaying.
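A sketch of the gate at the send site (DISPLAY_STATE_MAP and the remote are assumed; the two predicates match those described above):

```lua
-- Sketch: gating outbound state packets behind the two predicates
local DISPLAY_STATE_MAP = {
    MiningMaterialNode = "Mining",
    FishingIdle = "Fishing",
    FishingActive = "Fishing", -- Same display representation as FishingIdle
    -- PlayingEmote intentionally absent: internal state, never displayed
}

local function shouldSendUpdate(state: string): boolean
    return DISPLAY_STATE_MAP[state] ~= nil
end

local function isSameDisplayState(oldState: string, newState: string): boolean
    return DISPLAY_STATE_MAP[oldState] == DISPLAY_STATE_MAP[newState]
end

local function onStateChanged(workerId: string, oldState: string, newState: string)
    if not shouldSendUpdate(newState) or isSameDisplayState(oldState, newState) then
        return -- No packet: the client has nothing new to display
    end
    stateRemote:FireAllClients(workerId, DISPLAY_STATE_MAP[newState])
end
```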
VFX: Phase-Locked Mining Strikes
When a worker mines, a spark/strike particle effect fires in sync with the swing animation. Getting this right turned out to be non-trivial.
The naive approach is to start a client-side timer when the mining effect begins and toggle the emitter on and off based on elapsed time. This works until the client lags. A lag spike accumulates drift, and the visual effect falls out of sync with the animation permanently for the rest of that mining session.
I solved it with server-timestamp phase locking. When the server starts a mining effect, it sends the current server time as the phase origin. The client's flash heartbeat derives its state entirely from (GetServerTimeNow() - phaseOrigin) % flashPeriod. After any lag spike, the correct phase is restored on the very next tick. There is no accumulated drift because there is no accumulated state to drift.
-- Every Heartbeat tick, derive phase purely from server clock
local elapsed = workspace:GetServerTimeNow() - phaseOrigin
if elapsed < 0 then return end -- Before phase origin; stay dark

local phase = elapsed % flashPeriod
local shouldBeOn = phase < flashOnDuration

-- Transition emitters only on edge changes
if shouldBeOn ~= flashOn then
    flashOn = shouldBeOn
    for _, emitter in emitters do
        emitter.Enabled = shouldBeOn
        if shouldBeOn then emitter:Emit(emitCount) end
    end
end
This approach also means that damage speed (how often the game registers a mining hit) stays server-authoritative, which is ideal: damage speed is a gameplay mechanic that a player should not be able to exploit. The client's visual timing is derived from the server's clock, so the animation and the damage tick stay naturally aligned without the server needing to tell the client when each individual hit lands.
Observability
A system with 18 states, concurrent workers, tiered pathfinding, and multi-phase animations has enough structure to localize where issues occur, but it takes deliberate instrumentation to take advantage of that structure effectively.
I built WorkerLogger, a filtering utility with 15 independent log channels: FSM transitions, state entry and exit, pathfinding details, waypoint progress, spawning, material collection, model initialization, packet tracing, stuck detection, and more. All channels are toggled in a single constants file. In production, every flag is off with zero runtime cost. During development, I narrow to exactly the channel I care about:
local logger = WorkerLogger.ForWorker(workerId)
logger.PathComputation(`Computed {#waypoints} waypoints via {agent} path`)
logger.StateTransition(`{from} → {to}: {reason}`)
logger.Collection(`Claimed approach slot at {position} for node {nodeId}`)
I also initialized the debug UI infrastructure: a set of in-game server and client methods that could quickly construct and reproduce specific test scenarios without relying on natural gameplay to trigger edge cases. The combination of the structured logger and the debug UI reduced the iteration loop on complex multi-state bugs from hours to minutes.
What Held It Together
The honest summary of this system is that every significant decision — the FSM, the component model, the centralized locomotion authority, the reservation table, the confirmation-gated UI, the phase-locked VFX — came from the same place: thinking clearly about what could go wrong before it did.
None of those problems were hypothetical or unlikely. Most of them either showed up in early prototypes or were predictable from the nature of concurrent, event-driven, real-time systems. The architecture didn't prevent all bugs. It prevented the class of bugs that are genuinely hard to fix because they are hard to track. That is, the ones caused by state that belongs to no one, transitions that no one audits, and side effects that happen in places no one expected.
The system can easily run 100 workers at once on low-end mobile. It extended cleanly as design evolved. Bugs were fast to find and fast to fix. That's what good architecture is actually for.
Interested in the inventory platform that powered worker purchasing and item storage? See the Inventory System article. For the river and fishing infrastructure, see the Fishing and River System article. For the UI engineering and interface design, see the UI Systems article.