Node-Based Pathfinding

About this Project
Roblox's PathfindingService
The Algorithm

Inspiration
Basic Algorithm
Connection Cost
One-Way Connections
Scriptable Connections

Optimizations & Benchmarking

Path Ordering
Path Baking
External Pathfinding Server

About this Project

Pathfinding is as complex as it is important. There are lots of algorithms out there that have their own strengths and weaknesses. This project is specifically for a custom pathfinding plugin for Roblox. I explore the reasons why their pathfinding service is inadequate in a lot of cases, what level of control I feel is necessary for most games, and my solution to this problem.

I'm not sure if this pathfinding algorithm has previously been invented by someone else, or if it has a name. I came up with this independently and will walk through the thought processes that led to the final product.

Roblox's PathfindingService

To start off with, it's important to note that "pathfinding" is a broad topic, and that there is no true one-size-fits-all solution. Different applications have different requirements, so every system will have limitations. Roblox designed their PathfindingService with ease of use in mind, and because of this, opted to design a system that automatically detects where NPCs can walk and jump. Their system works reasonably well and doesn't require any setup other than some simple boilerplate code. However, their system also has some significant drawbacks:

Performance: The PathfindingService analyzes the world every time a script tries to find a path. I won't claim to know how their system works under the hood, but it's not well-suited for handling large amounts of NPCs pathfinding at the same time.
Flexibility: The PathfindingService more or less assumes that all NPCs will be basic humanoids, and provides few customization options. There are no provisions for user-defined behaviors, such as NPCs that may need to avoid fall damage or NPCs that have additional capabilities for navigating the world like unlocking doors or using teleporters.
Reliability: Because PathfindingService relies on generating a navmesh automatically it will get things wrong at times. As such, it's possible that it will find no paths where there actually are viable paths, or that it will find paths that are impossible to traverse.

There are ways to work around these limitations, but defining a static navigation mesh would have significant advantages both in performance and in how much control a developer has over the NPCs' behavior. I'll be demonstrating these issues later in the Optimizations & Benchmarking section.

The Algorithm

Inspiration

I had the core idea for the underlying algorithm one Thursday morning in 2014 on my way to university. The inspiration for it, oddly enough, is Minecraft's lighting system. This might sound a little crazy, but bear with me.

Minecraft's lighting system is based on voxels. A light source has some base brightness level (brightness of 15 for most light sources). This illuminates all directly adjacent blocks with a light level of 14, all blocks directly adjacent to those with a light level of 13, and so on, until the light level hits 0:

So if I'm lost, and want to find my way to the torch, I could do this by looking at which of the surrounding blocks has the highest light level and travel to it, and repeat that process until I reach the torch. Of course, if I start in a cell with no light, that is completely surrounded by darkness, then no path can be found:

There's even more we can do with this, too. Say, for example, we have an obstacle that is difficult to traverse, and we only want to find a path across that obstacle if it is significantly shorter than the alternative path. In this example, the obstacle is represented by water (which reduces the light level by 2 per block instead of by 1):

Note how the numbers now preferentially pick a path that doesn't go through the water, unless the path through the water is significantly shorter than the long way around.

Basic Algorithm

So, with these thoughts in mind, we can move to a more general pathfinding algorithm. The two major changes made from this point onwards are that we no longer want to be confined to a grid, and that we don't want to be limited to some maximum brightness at the destination. In order to facilitate the first change, we use nodes in lieu of a grid, and allow each node to have connections to any number of other nodes. To facilitate the second change, we reverse the "brightness", i.e. a lower "brightness" value means that we're closer to the target. I'll be referring to this value as the "cost" of each node from this point forward.

The cost of each node is calculated as the cost of an adjacent node plus its distance to the current one. If there are multiple connections, the cost of the node is simply the lowest cost of all connections. Let's take the following node network as an example (the numbers in blue are the distances):

Calculating the costs for each node in this network gives us the following values (costs written in red). We'll call this step the "cost calculation step":

This then reveals the shortest path from anywhere in the node system to the target. We'll call this step the "path following step":

Connection Cost

If we want a connection to be more difficult to traverse, we can just multiply the distance by some factor to increase the cost of that particular connection. This makes it less likely to be chosen unless that path is significantly shorter than the alternatives. This is functionally equivalent to moving the nodes to make that particular connection longer without changing the lengths of any other connections.

There is a slight pitfall here: because, so far, the path following step simply iterated over all connections of the node to find the one with the lowest-cost node on the other end, we can run into situations where the algorithm crosses the more costly connection despite an easier path around being available:

There are two solutions here that spring to mind. The first is to take the "cost" of a connected node as its cost value plus the cost of the connection it's linked by, in essence repeating the calculation that produced the current node's value to begin with. This would cause the cost of nodes linked by more expensive connections to rise more than those linked by cheap connections, and reliably produce the lowest-cost path:

However, the idea of repeating calculations that have already been done feels inefficient. Sure, a simple addition doesn't take much extra time, but multiply that across thousands of pathfinding operations on systems consisting of hundreds of nodes with several connections each, and the amount of extra time starts to add up. This is particularly problematic when handling large amounts of pathfinding NPCs, or when pre-calculating ("baking") paths, where paths between potentially millions of node pairs are all calculated in one go.

The first instinct here may then be to store the combined cost of the remote node and the connection on each connection during the cost calculation step. This would trade some insignificant amount of memory and avoid the added cost of calculating these numbers again. However, there is a more elegant solution that saves even more processing power and actually uses less memory.

That approach is to simply store a reference to the node that produced the cost of the current node. That node will always be the one with the lowest combined cost. In order to follow a path from any given node to the target then simply requires reading that reference, jumping to that node, and repeating the process until we hit the target node. We no longer have to spend time iterating over all connections, because the reference immediately tells us which node should be next in the path.

As an added benefit, we also use less memory: all numbers in Lua are stored as doubles (which are 8 bytes in size), which is the same size as a memory address to another node (which is also 8 bytes), and we only store one per node rather than one per connection. However, these memory savings are unlikely to become significant compared to the memory usage of the rest of the game.

One-Way Connections

We can also expand this algorithm a bit further. We can, for instance, implement connections that can only be traversed in one direction by not considering them in some cases. Let's say we have the following network. The green path is unidirectional:

We still want the path to be considered when calculating node costs. However, if we consider this connection when stepping from the upstream end to the downstream end (in the "correct") direction when calculating node costs, we would create a situation where the algorithm finds a path to the downstream node and gets stuck there, because it cannot traverse the connection. To mitigate this issue, we only allow the cost calculation step to cross this connection in the "wrong" direction, and only allow the path following step to cross the connection in the "correct" direction. Calculating costs with the same target as before yields the same result:

However, something interesting happens when we reverse the origin and target nodes to find the path in the opposite direction:

Note how the path in this direction is actually longer than previously (the total path cost is 6 instead of 5). This is because the cost calculation step couldn't use the shorter path across the one-way connection, and so the cost on the far end of that connection ended up being higher than that of the node directly above it, instead of lower. The path following step just went to the next cheapest node as usual.

However, the path following step also doesn't consider this connection in one direction. In the example below, notice how the path following step ignores the shorter path and goes up and around the one-way path as intended:

Scriptable Connections

Thus far, the node networks have been completely static. This is fine for most maps, but is no good if you have any part on the map that might change at runtime - such as doors opening or closing, bridges collapsing or being built, and so on. In theory this is simple to handle: we can just give some connections a name, and then check in a dictionary whether that connection is enabled or not. If not, both the cost calculation step and the path following step simply ignore that connection.

However, this is where things get a little more complicated. This is fine if we run a full pathfind every time an NPC needs to go somewhere, but the fact the node map may change at any time means we can't simply cache a path between two nodes. We could just store an additional piece of information that contains the states of all scriptable connections to the dictionary, but this would cause the amount of information to rise exponentially with the number of unique connection names. In terms of caching paths containing scriptable connections at runtime, there isn't a whole lot we can do here. In this case, we just cache paths that only contain non-scriptable sections.

However, in terms of pre-calculating all possible paths, there is actually an optimization we can make that makes this more feasible. All we have to do is find the shortest path, then turn off one of the scriptable connections, and repeat this until we either find a path that doesn't contain a scriptable connection or fail to find a path at all. Let's take the following node map as an example - the red connections are scriptable, and we're finding paths from the leftmost to the rigthmost node:

As you can see, the top path is the shortest, and the paths increase in length. The red linkages are all scriptable, the blue ones are static.

The first instinct here may be to stubbornly iterate over all possible combinations of enabled/disabled paths, find the shortest path for each, and then remove duplicate paths after the fact. In this case, we would run 8 path finding operations, 5 of which would find duplicate results. The issue with this approach is not only that the runtime complexity of this approach rises exponentially (O(2^n)) with the amount of scriptable connections, but that n is the amount of scriptable connections in the entire node map, regardless of whether the paths we found actually cross them or not.

Given a system like this, it may be tempting to start with all connections being open, finding the shortest path, closing the connection it uses, and repeating that process until either no path is found, or the path contains no scriptable connections. This was my original approach as well, but it comes with a serious pitfall when there are several scriptable segments in one path. Let's consider the below node map - disabled connections will be a darker shade of red:

You may have noticed this approach completely glossed over one of the options, which is actually shorter than the last path:

So, we have one slow approach that considers all scriptable connections, and one that only considers those that are actually used but skips some paths. However, we can actually combine the two approaches to get one that only considers path segments that are actually used, but still considers all possibilities.

The approach is simple, in concept: we initially find a path. If that path contains no scriptable segments, then there will be no alternate paths that cross scriptable segments (the path we found is already the shortest one). If one or more scriptable connections were crossed, we add those to a list, and calculate a limit limit = 2^n - 1. We then initialize a counter to 1, and begin counting upwards toward that limit. We use the bits of the counter variable to toggle the connections in our list on and off. The least significant bits represent the connections that were found earlier on. Do note, however, that the state of the scriptable connections is always the inverse of the bit that represents it. The counter being at 0 represents the state where all scriptable connections are enabled - this is also why we initialize the counter at 1: we already checked the case where all connections are enabled.

When new scriptable connections are used, they are added to the list, and the upper limit for the counter is updated with the new scriptable segment count. This repeats until the counter reaches its limit, or a path is found that contains no scriptable connections. This ensures that all possible combinations of the scriptable connections we found are actually considered, but it eliminates all unused scriptable connections from that process.

The process now looks like:

However, this method can and will also generate duplicate paths. There is no telling whether a scriptable connection we've previously found will actually affect the next path we find without actually testing it. For this reason, we generate a hash of each path we find. We then see if we have already found a path with that hash, and if so, discard the result. At the end, we just take all remaining paths, and order them from shortest to longest. Lua has built-in functionality for this.

The hash is generated just from the scripted connections the path crosses. Since all other parts of the node system always behave the same way, two paths that use the exact same scriptable connections will invariably be exactly the same path, assuming the start and end nodes are the same.

Optimizations & Benchmarking

First, let's see how well this algorithm does in practice. For testing, I'll use a test world with 431 nodes, with a total of 1921 links between them. I believe this to be a decent representation of the complexity of an average game world. To test overall performance, I tested three separate paths on both systems and compared the results. The direction is always left to right (or top to bottom in test 2):

Test No.	Time using PathfindingService (ms)	Time using Node System (ms)	Images
1	304.813	0.7254	PathfindingService, Node System
2	200.63	0.7708	PathfindingService, Node System
3	182.059	12.36	PathfindingService, Node System

Particularly the first two results show a significant performance boost when using my own node system - the time taken to find a path is three orders of magnitude faster! The outlier here seems to be the 3rd test, where the node system takes 12 milliseconds to find all paths - but that's all 5 paths, taking into account several dynamic sections of the node mesh, whereas the first two tests are in static areas of the node map.

I'd also like to point out the blunder PathfindingService made in the 3rd test, where it deviates from the bridge and would lead an NPC to jump off a cliff, whereas the node system generates paths that are fully traversable. Additionally, in the 2nd test, PathfindingService sends the NPC jumping off of a sizable cliff before finding a more gradual slope to descend.

Path Ordering

The cost calculation step is significantly more expensive than the path following step. The nice thing is that the cost calculation step populates costs for the entire nav mesh, so we can re-use those costs as long as the target node stays the same. This isn't much of a benefit at runtime unless all NPCs are pathfinding to the same node, but it does provide a significant performance boost when calculating paths in bulk. This would be useful for, say, baking paths into the game before publishing.

Path Baking

Pathfinding is an expensive operation. Reducing the number of nodes helps, and this node mesh system already beats PathfindingService by a huge margin. However, calculating all possible paths ahead of time and storing them in the game allows us to skip pathfinding altogether. The plugin has a tool built-in for baking paths, and at runtime, this essentially reduces pathfinding to a lookup in a 2D array.

External Pathfinding Server

Roblox's Lua performs extremely well these days, and appears to be on-par with other languages like Java. This is good, but we're still limited to a single thread for the time being, so offloading the pathfinding effort to an external program will likely yield better results. In my case, using a quad-core processor, I was able to get my results 4x as quickly: