Introduction

Human mobility is a critical area of research, and understanding patterns in how individuals move is essential for a variety of applications. The Human Mobility Prediction Challenge provided me with the opportunity to explore and derive insights from real-world mobility data. After ranking in the top ten based on evaluation metrics, I presented my findings in Atlanta.

The key to the success of my predictions lies in the fact that many users exhibit similar movement patterns over time. Recognizing this, I leveraged these recurring patterns to develop an algorithm based on a classic machine learning technique—k-nearest neighbors (KNN). The resulting method not only accurately predicted user trajectories but also outperformed certain state-of-the-art large language models (LLMs) used by other competitors.

Method

The algorithm is developed in two main phases, as illustrated in the figure below:

  1. The offline process
  2. The online process

Offline Process

In the offline process, the first step is to filter out incidental or outlier locations, ensuring that the model focuses on the most frequently visited places. Next, the data is segmented by time, based on the observation that human behavior tends to follow consistent patterns at particular times of day. This segmentation lets the model capture subtle variations in movement across the day, which improves prediction accuracy.
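
A minimal sketch of these two preprocessing steps is shown below, assuming the raw trajectories are stored as (user, day, time slot, x, y) records in a pandas DataFrame. The column names, the visit-count threshold, and the 48 half-hour slots per day are illustrative assumptions, not details taken from the challenge data.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, min_visits: int = 5) -> pd.DataFrame:
    """Keep only frequently visited locations and tag each record with a time segment.

    Assumed columns: 'uid', 'day', 'slot' (half-hour index within a day), 'x', 'y'.
    """
    # 1. Filter out incidental/outlier locations: drop (user, cell) pairs
    #    visited fewer than `min_visits` times across the history.
    visit_counts = df.groupby(["uid", "x", "y"])["day"].transform("count")
    frequent = df[visit_counts >= min_visits].copy()

    # 2. Segment by time of day so that, e.g., morning and evening behaviour
    #    are modelled separately (here: 48 slots mapped to 4 coarse segments).
    frequent["segment"] = pd.cut(
        frequent["slot"],
        bins=[0, 12, 24, 36, 48],
        labels=["night", "morning", "afternoon", "evening"],
        include_lowest=True,
    )
    return frequent
```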

Finally, rather than using a universal model, individual models are created for each user, accounting for the uniqueness of their trajectories.
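
The per-user models can then be stored in a hashmap keyed by user and time segment, which is what the online process later looks up. The sketch below fits one scikit-learn KNeighborsRegressor per (user, segment) pair, mapping a location at one time step to the location observed at the next step; the choice of library, of k, and of this particular model representation are assumptions made for illustration rather than the exact challenge pipeline.

```python
from sklearn.neighbors import KNeighborsRegressor

def build_models(frequent, k: int = 3) -> dict:
    """Return a hashmap {(uid, segment): fitted KNN} of per-user, per-segment models."""
    models = {}
    for (uid, segment), group in frequent.groupby(["uid", "segment"], observed=True):
        group = group.sort_values(["day", "slot"])
        # Pair each observation with the next one, keeping only same-day transitions.
        cur = group[["x", "y"]].to_numpy()[:-1]
        nxt = group[["x", "y"]].to_numpy()[1:]
        same_day = group["day"].to_numpy()[:-1] == group["day"].to_numpy()[1:]
        cur, nxt = cur[same_day], nxt[same_day]
        if len(cur) >= k:  # need at least k transitions to fit a KNN model
            models[(uid, segment)] = KNeighborsRegressor(n_neighbors=k).fit(cur, nxt)
    return models
```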

Online Process

In the online process, hashmaps allow quick retrieval of the appropriate model during inference. To predict a user's future trajectory, the query starts from the most recent observation and uses its coordinate features to find the k nearest neighbors for the next time step. This mechanism lets the model track the user's movement efficiently, without excessive computational overhead.
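
Under the same assumptions as above, the online lookup might look like the sketch below: starting from the user's most recent observed coordinates, the model for that user and time segment is fetched from the hashmap and queried repeatedly to roll out the future trajectory. The `segment_of` helper and the rollout length are hypothetical details added for the example.

```python
import numpy as np

def segment_of(slot: int) -> str:
    """Map a half-hour slot index to the coarse segment used as a hashmap key."""
    return ["night", "morning", "afternoon", "evening"][min(slot // 12, 3)]

def predict_trajectory(models: dict, uid, start_xy, start_slot: int, steps: int = 48):
    """Roll the per-user KNN forward `steps` time slots from the last observation."""
    xy = np.asarray(start_xy, dtype=float)
    trajectory = []
    for step in range(1, steps + 1):
        slot = (start_slot + step) % 48
        model = models.get((uid, segment_of(slot)))  # O(1) hashmap lookup
        if model is not None:
            # Predicted next location = average of the next-step locations of the
            # k nearest historical observations for this user and time segment.
            xy = model.predict(xy.reshape(1, -1))[0]
        # If no model exists for this segment, keep the last known position.
        trajectory.append((slot, float(xy[0]), float(xy[1])))
    return trajectory
```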

Although KNN is often considered computationally expensive, the small per-user dataset and the restricted time window used at inference keep the number of distance calculations manageable and the inference time minimal.