UC3 - Passenger Flow Prediction

Overview

Passenger flow prediction addresses forecasting passenger movement throughout airport terminals. This use case focuses on waiting times at security checkpoints—a significant bottleneck in the passenger flow process. However, passenger flow data is typically restricted due to privacy and commercial concerns.

We evaluate synthetic data generation for security checkpoint waiting time prediction using limited real-world data from Rotterdam airport.

Operational Context

Security checkpoints directly impact passenger experience and airport efficiency. Waiting times are key stress factors that influence passenger satisfaction and overall airport operations.

Passenger flow modeling typically requires extensive data collection, but the sensitive nature of this data limits access for research and model development.

Dataset and Methodology

Data Source and Limitations

The evaluation utilized security checkpoint data from Rotterdam The Hague Airport (RTM), collected by Janssen et al. This dataset comprises 2,277 passengers processed through 13 different security checkpoint lanes across 11 time blocks between February and December 2018.

Key dataset characteristics:

Detailed timing information for each security process stage
Passenger type classifications (business, senior, family, young, reduced mobility, regular)
Lane configuration data and processing conditions
Complete journey timestamps from baggage drop-off to reclaim

Limitation: Using data from a single airport limits generalizability. Despite outreach to multiple airports and aviation projects, passenger flow data remains restricted due to privacy and commercial concerns.

The methodology can be applied to other airport datasets when available, but Rotterdam’s unique procedures and layout constrain direct transferability.

Feature Engineering

The prediction task focused on forecasting total waiting time at security checkpoints, defined as the complete duration from baggage drop-off start to baggage reclaim end. Input features included:

Passenger Characteristics:

Passenger type (business, senior, family, young, PRM, regular)
Experience level with security procedures
Group size and number of luggage boxes

Operational Features:

Security lane number
ETD (Explosive Trace Detection) check requirement
Time block and lane configuration

Target Variable

Total waiting time was computed as the complete security checkpoint processing duration, representing the most operationally relevant metric for passenger experience and resource planning.

Synthetic Data Generation

A Tabular Variational Autoencoder (TVAE) was used for synthetic data generation due to the limited dataset size and mixed-type nature of the data. TVAE can capture feature distributions and interdependencies in smaller tabular datasets.

The model was trained using the SDV library for 10,000 epochs on the passenger dataset.

Results and Analysis

Data Fidelity Assessment

TVAE captured the general patterns in the original dataset across categorical and continuous variables.

Passenger Flow Distribution Comparison Distribution comparison between real (blue) and synthetic (orange) data across key passenger flow features

The synthetic data maintained distribution shapes for passenger types, lane configurations, and waiting times.

Predictive Performance Analysis

Waiting times showed uniform patterns across passenger categories and processing conditions, limiting the predictive power of available features.

Performance Results:

Experiment	Dataset	MSE	RMSE	MAE	R²
Equal Volume	Real Only	796,349	892.4	377.9	-0.168
	Synthetic Only	687,258	829.0	324.8	-0.008
	Real + Synthetic	706,264	840.4	342.6	-0.036
Enhanced Volume	Real Only	796,349	892.4	377.9	-0.168
	Synthetic Only (100×)	587,651	766.6	283.4	0.138
	Real + Synthetic (100×)	588,172	766.9	283.9	0.137

Regression performance metrics for security checkpoint waiting time prediction

Key Findings

Limited Baseline Predictability: Real-data models achieved negative R² values, indicating that waiting time patterns in the Rotterdam dataset were not well explained by the available features.
Synthetic Data Advantage: Models trained exclusively on synthetic data outperformed real-data baselines, suggesting that TVAE captured latent relationships not immediately apparent in the original dataset.
Scale Benefits: Leveraging synthetic data generation’s scalability (100× data volume) achieved the first positive R² values (0.138), demonstrating meaningful explanatory power for waiting time variability.
Operational Insight: The relatively uniform waiting times across passenger categories and processing conditions suggest that security checkpoint efficiency depends more on systemic factors than individual passenger characteristics.

Operational Implications

This represents a proof-of-concept rather than a comprehensive solution. The Rotterdam dataset’s limitations—single airport, security checkpoints only—constrain operational applicability.

Synthetic data generation showed value through data augmentation and improved model performance when scaled up 100×, achieving the first positive R² values (0.138).

Conclusion

UC3 shows a limited approach to passenger flow prediction using synthetic data. The Rotterdam dataset constrains operational utility, but TVAE methodology could transfer to more comprehensive datasets when available.

Synthetic data augmentation achieved positive R² values (0.138), representing progress over baseline approaches. The methodology provides a foundation for future development when broader passenger flow datasets become accessible.