Tabular Data Generation
The SynthAIr project developed specialized synthetic data generators to address critical challenges in Air Traffic Management: data scarcity, privacy constraints, and commercial sensitivity. Our tabular data pipeline processes mixed-type operational data, including flight schedules, delays, and turnaround times, using five complementary approaches, each optimized for different data characteristics and use cases.
To learn more, see our research publications: “Pre-Tactical Flight-Delay and Turnaround Forecasting with Synthetic Aviation Data” and “Synthetic Flight Data Generation Using Generative Models”. Open-source implementations are available in our tabular generators repositories. Public deliverables describing the tabular data generators are currently under evaluation by the SESAR Joint Undertaking and will be made publicly available upon approval.
Overview
Tabular synthetic data generation in aviation requires handling complex mixed-type datasets with categorical features (airlines, airports, aircraft types), continuous variables (delays, durations), and temporal information (schedules, timestamps). Our approach addresses three fundamental challenges:
- Data Scarcity: Limited access to comprehensive flight operations data
- Privacy Protection: Commercial sensitivity of operational records
- Operational Diversity: Wide range of flight scenarios and conditions
We evaluate generators using the Train on Synthetic, Test on Real (TSTR) methodology across three critical prediction tasks: departure delays, arrival delays, and aircraft turnaround times.
Figure 1: Tabular Data Generation Pipeline. From EU flight operations data (OAG), our AI models generate synthetic flight records validated across three dimensions: privacy protection (Distance to Closest Record), statistical fidelity (correlation preservation, distribution similarity), and utility (performance in downstream prediction tasks).
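The TSTR protocol described above can be illustrated with a minimal sketch. It assumes a plain least-squares regressor as a stand-in for the downstream delay model (the project's actual predictors are not specified here), and all data, function names, and coefficients are hypothetical. A model is fitted once on synthetic rows and once on real training rows, and both are scored on the same real test set.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear model with intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def rmse(coef, X, y):
    """Root mean squared error of the linear model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

def tstr_vs_trtr(X_syn, y_syn, X_train, y_train, X_test, y_test):
    """Return (TSTR RMSE, TRTR RMSE), both scored on the same real test set."""
    tstr = rmse(fit_linear(X_syn, y_syn), X_test, y_test)
    trtr = rmse(fit_linear(X_train, y_train), X_test, y_test)
    return tstr, trtr

# Toy data (hypothetical): three numeric features and a delay-like target.
rng = np.random.default_rng(0)

def make_flights(n):
    X = rng.normal(size=(n, 3))
    y = X @ np.array([5.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=n)
    return X, y

X_train, y_train = make_flights(500)
X_test, y_test = make_flights(500)
X_syn, y_syn = make_flights(500)  # stand-in for a generator's output

tstr_rmse, trtr_rmse = tstr_vs_trtr(X_syn, y_syn, X_train, y_train, X_test, y_test)
print(f"TSTR RMSE: {tstr_rmse:.3f}  TRTR RMSE: {trtr_rmse:.3f}")
```

The closer the TSTR score is to the real-data baseline, the better the synthetic data substitutes for real data on that task.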
Model Architectures
REaLTabFormer (Transformer-based)
REaLTabFormer leverages transformer architectures to generate synthetic tabular data by treating each flight record as a sequence of tokens. This autoregressive approach captures complex dependencies between features through sequential generation.
Key Features:
- GPT-2 based architecture with 6 transformer decoder layers
- Row-wise, token-by-token generation preserving feature relationships
- Target masking regularization prevents data memorization
- Overfitting detection using Q_δ statistic and Distance to Closest Record (DCR)
- Performance: Achieves 94-97% of real-data predictive performance
Figure 2: REaLTabFormer Architecture. Transformer-based autoregressive model treating tabular data as token sequences with specialized regularization and overfitting detection mechanisms.
TabSyn (Diffusion-based)
TabSyn employs a two-stage pipeline combining Variational Autoencoders with diffusion models in latent space, enabling efficient high-quality synthesis of mixed-type tabular data.
Key Features:
- “Encode → Diffuse → Decode” architecture operating in continuous latent space
- VAE encoder transforms mixed-type data into continuous representations
- Score-based diffusion with linear noise schedule (20-50 sampling steps)
- Adaptive β-VAE training with dynamic KL-divergence scheduling
- Performance: Strong fidelity with efficient sampling
Figure 3: TabSyn Architecture. Two-stage pipeline with VAE encoding to continuous latent space followed by efficient diffusion-based generation.
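As a rough illustration of the diffusion stage, the sketch below shows forward noising of latent codes under a linear noise schedule. It assumes that "linear" means the noise level is sigma(t) = t, and it uses random vectors as a stand-in for VAE latent codes; the function name is hypothetical, and the learned denoiser and sampler are not shown.

```python
import numpy as np

def add_noise(z0, t, rng):
    """Forward noising z_t = z_0 + sigma(t) * eps, with sigma(t) = t (assumed)."""
    eps = rng.standard_normal(z0.shape)
    return z0 + t * eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((1000, 4))   # stand-in for VAE latent codes
z_small = add_noise(z0, 0.1, rng)     # lightly perturbed latents
z_large = add_noise(z0, 10.0, rng)    # latents dominated by noise

print(z_small.std(), z_large.std())
```

Generation runs this process in reverse: starting from noise at a large t, a trained score network removes noise step by step until a clean latent is reached, which the VAE decoder then maps back to a table row.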
CTGAN (Conditional Adversarial)
Conditional Tabular GAN (CTGAN) addresses aviation data’s mixed-type nature through specialized preprocessing and conditional generation capabilities, particularly effective for handling imbalanced categorical distributions.
Key Features:
- Mode-specific normalization using Bayesian Gaussian Mixture Models
- Conditional generation with training-by-sampling for rare categories
- WGAN-GP loss with gradient penalty for stable adversarial training
- PacGAN framework processing 10 samples jointly to prevent mode collapse
- Performance: Effective for targeted scenario generation
Figure 4: CTGAN Architecture. Adversarial framework with specialized preprocessing for mixed-type data and conditional generation capabilities.
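The training-by-sampling idea listed above can be sketched as follows: the category used as the condition is drawn with probability proportional to the logarithm of its frequency, so rare categories are seen far more often than under plain frequency sampling. The `log(count + 1)` weighting, the function names, and the airline codes are assumptions made for illustration.

```python
import numpy as np

def log_frequency_probs(values):
    """Sampling probabilities proportional to log(count + 1) for each category."""
    cats, counts = np.unique(values, return_counts=True)
    weights = np.log(counts + 1.0)
    return cats, weights / weights.sum()

def sample_conditions(values, n, rng):
    """Draw n condition categories using log-frequency probabilities."""
    cats, probs = log_frequency_probs(values)
    return rng.choice(cats, size=n, p=probs)

# Hypothetical imbalanced column: 98% of flights from one airline, 2% from another.
airlines = np.array(["AAA"] * 980 + ["ZZZ"] * 20)
cats, probs = log_frequency_probs(airlines)

rng = np.random.default_rng(0)
conditions = sample_conditions(airlines, 5000, rng)
rare_share = float(np.mean(conditions == "ZZZ"))
print(dict(zip(cats, probs.round(3))), rare_share)
```

In this toy column the rare airline makes up 2% of the rows but is selected as the condition roughly 30% of the time, which is what gives the generator enough exposure to learn rare categories.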
TVAE (Variational Autoencoder)
Tabular Variational Autoencoder (TVAE) provides a probabilistic approach to synthetic data generation with lightweight computational requirements and stable training characteristics.
Key Features:
- Evidence Lower Bound (ELBO) optimization
- Reparameterization trick for differentiable latent sampling
- Batch normalization and L2 regularization
- Minimal GPU requirements with fast training
- Performance: Good utility-to-compute ratio
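The ELBO and the reparameterization trick listed above can be written out in a few lines. This is a NumPy sketch of the arithmetic only, assuming a diagonal Gaussian posterior and a standard normal prior; a real implementation would use an autodiff framework so that gradients flow through `mu` and `log_var`. All function names are hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives only in eps, so z is a deterministic function of
    mu and log_var given eps, and therefore differentiable with respect to them.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)

def negative_elbo(recon_nll, mu, log_var):
    """Negative ELBO per row: reconstruction negative log-likelihood plus KL."""
    return recon_nll + kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
mu = np.ones((3, 2))
log_var = np.zeros((3, 2))
z = reparameterize(mu, log_var, rng)
print(kl_to_standard_normal(mu, log_var))
```

Minimizing the negative ELBO trades reconstruction quality against keeping the latent posterior close to the prior, which is what lets new rows be generated by decoding samples from a standard normal.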
Gaussian Copula (Statistical)
The Gaussian Copula is a classical statistical approach that models dependencies by separating the marginal distributions from their correlation structure.
Key Features:
- Kernel Density Estimation for marginal distribution modeling
- Multivariate Gaussian copula for dependency structure
- No neural network training required
- Strongest privacy protection among the five generators
- Performance: Best for privacy-critical applications
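The separation of marginals from dependency structure can be sketched with NumPy and the standard library. As a simplification, this sketch uses empirical ranks and quantiles for the marginals instead of Kernel Density Estimation, handles numeric columns only, and uses hypothetical function names and toy data.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def fit_gaussian_copula(data):
    """Estimate the copula correlation matrix from normal scores of the ranks."""
    n = data.shape[0]
    ranks = data.argsort(axis=0).argsort(axis=0) + 1    # ranks 1..n per column
    u = ranks / (n + 1)                                 # uniforms strictly in (0, 1)
    z = np.vectorize(_STD_NORMAL.inv_cdf)(u)            # normal scores
    return np.corrcoef(z, rowvar=False)

def sample_gaussian_copula(data, corr, n_samples, rng):
    """Draw correlated normals, then map them back through empirical quantiles."""
    d = data.shape[1]
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u = np.vectorize(_STD_NORMAL.cdf)(z)
    cols = [np.quantile(data[:, j], u[:, j]) for j in range(d)]
    return np.column_stack(cols)

# Toy data (hypothetical): a skewed departure delay and a correlated arrival delay.
rng = np.random.default_rng(0)
dep = rng.exponential(scale=20.0, size=2000)
arr = 0.5 * dep + rng.normal(scale=5.0, size=2000)
real = np.column_stack([dep, arr])

corr = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(real, corr, 2000, rng)
corr_syn = fit_gaussian_copula(synthetic)
print(corr[0, 1], corr_syn[0, 1])
```

Because fitting reduces to computing ranks and one correlation matrix, no iterative training is involved, which matches the "no neural network training required" point above.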
Evaluation Framework
Our evaluation spans three dimensions (fidelity, privacy, and utility) to ensure that synthetic data retains the properties essential for aviation operations:
Fidelity Assessment
- Statistical Similarity: Kolmogorov-Smirnov tests for continuous distributions, chi-squared tests for categorical ones
- Correlation Preservation: Pearson, Spearman, Kendall correlations plus mixed-type measures
- Joint Distribution Fidelity: KL-divergence for multivariate distributional alignment
- Likelihood-based Assessment: Bayesian Network and Gaussian Mixture Model likelihood
- Detection Difficulty: Logistic regression classifier performance (synthetic vs. real)
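As an example of the per-column similarity checks listed above, the two-sample Kolmogorov-Smirnov statistic can be computed directly: it is the largest gap between the empirical CDFs of the real and synthetic columns. The sketch covers only the statistic (not the p-value), and the delay data and variable names are hypothetical.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: max absolute difference between empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])                        # evaluate at every observed value
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
real_delays = rng.exponential(scale=15.0, size=1000)
good_synth = rng.exponential(scale=15.0, size=1000)     # same distribution
bad_synth = rng.exponential(scale=40.0, size=1000)      # delays far too long

ks_good = ks_statistic(real_delays, good_synth)
ks_bad = ks_statistic(real_delays, bad_synth)
print(ks_good, ks_bad)
```

The statistic lies in [0, 1], with 0 meaning identical empirical distributions, so lower values indicate higher fidelity for that column.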
Privacy Protection
Privacy evaluation uses Distance to Closest Record (DCR) metrics measuring how closely synthetic records resemble real training data:
- Baseline Protection: Compares synthetic-to-real distances against a random-data baseline
- Overfitting Protection: Tests memorization by comparing proximity to training versus holdout sets
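Both DCR checks rest on the same nearest-neighbour computation. The sketch below assumes numeric, already scaled features and Euclidean distance (the distance actually used for mixed-type records is not specified here); function names and data are hypothetical. It shows the overfitting check: if synthetic records are systematically closer to the training set than to a holdout set, the generator has likely memorized training rows.

```python
import numpy as np

def distance_to_closest_record(synthetic, reference):
    """Euclidean distance from each synthetic row to its nearest reference row."""
    diffs = synthetic[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def share_closer_to_train(synthetic, train, holdout):
    """Fraction of synthetic rows whose nearest neighbour is in train, not holdout."""
    d_train = distance_to_closest_record(synthetic, train)
    d_holdout = distance_to_closest_record(synthetic, holdout)
    return float(np.mean(d_train < d_holdout))

rng = np.random.default_rng(0)
train = rng.normal(size=(300, 4))
holdout = rng.normal(size=(300, 4))
fresh = rng.normal(size=(200, 4))                                   # independent samples
memorized = train[:200] + rng.normal(scale=1e-3, size=(200, 4))     # near-copies of train

share_fresh = share_closer_to_train(fresh, train, holdout)
share_memorized = share_closer_to_train(memorized, train, holdout)
print(share_fresh, share_memorized)
```

With equally sized training and holdout sets, a share near 0.5 suggests no memorization, while a share near 1.0 indicates that synthetic rows reproduce training records.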
Utility Evaluation
- Predictive Performance: RMSE, MAE, R² metrics across prediction tasks
- Feature Importance Alignment: Cosine similarity of feature importance vectors
- Utility Scores: Normalized performance ratios quantifying synthetic-to-real substitutability
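Two of the utility measures above reduce to short formulas. The sketch below computes the cosine similarity between feature importance vectors and a simple performance ratio. The ratio shown (real-data error divided by synthetic-data error) is one plausible normalization and is an assumption, as are the function names and the numbers.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature importance vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def utility_ratio(real_error, synthetic_error):
    """Hypothetical utility score for error metrics: 1.0 means parity with real data."""
    return real_error / synthetic_error

# Hypothetical importances from a model trained on real vs. synthetic data.
importance_real = [0.50, 0.30, 0.15, 0.05]
importance_synth = [0.45, 0.35, 0.12, 0.08]

alignment = cosine_similarity(importance_real, importance_synth)
score = utility_ratio(real_error=10.0, synthetic_error=12.5)
print(alignment, score)
```

A cosine similarity near 1 means the model trained on synthetic data relies on the same features as the one trained on real data, and a ratio near 1 means it reaches comparable error.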