Tabular Data Generation
The SynthAIr project developed specialized synthetic data generators to address critical challenges in Air Traffic Management: data scarcity, privacy constraints, and commercial sensitivity. Our tabular data pipeline processes mixed-type operational data, including flight schedules, delays, and turnaround times, using five complementary approaches, each optimized for different data characteristics and use cases. To learn more, see our research publications: "Pre-Tactical Flight-Delay and Turnaround Forecasting with Synthetic Aviation Data" and "Synthetic Flight Data Generation Using Generative Models". Open-source implementations are available in our tabular generators repositories. Public deliverables are available on Zenodo.
Overview
Tabular synthetic data generation in aviation requires handling complex mixed-type datasets with categorical features (airlines, airports, aircraft types), continuous variables (delays, durations), and temporal information (schedules, timestamps). Our approach addresses three fundamental challenges:
- Data Scarcity: Limited access to comprehensive flight operations data
- Privacy Protection: Commercial sensitivity of operational records
- Operational Diversity: Wide range of flight scenarios and conditions
We evaluate generators using the Train on Synthetic, Test on Real (TSTR) methodology across three critical prediction tasks: departure delays, arrival delays, and aircraft turnaround times.
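The TSTR protocol can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the toy delay values, the single input feature, and the one-variable least-squares model stand in for the project's actual datasets and predictors, which this page does not specify.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def train_and_score(train_x, train_y, test_x, test_y):
    """Fit on the given training set, then score on the real test set."""
    slope, intercept = fit_line(train_x, train_y)
    return mae(test_y, [slope * x + intercept for x in test_x])

# Hypothetical toy data: x = inbound arrival delay (min), y = departure delay (min).
real_train_x, real_train_y = [0, 5, 10, 20, 30, 45], [2, 6, 9, 18, 27, 40]
synth_train_x, synth_train_y = [0, 8, 15, 25, 35, 50], [3, 8, 13, 24, 30, 46]
real_test_x, real_test_y = [3, 12, 28, 40], [4, 11, 25, 36]

# Baseline: Train on Real, Test on Real.
trtr_mae = train_and_score(real_train_x, real_train_y, real_test_x, real_test_y)
# TSTR: Train on Synthetic, Test on the same Real hold-out.
tstr_mae = train_and_score(synth_train_x, synth_train_y, real_test_x, real_test_y)

print(f"TRTR MAE={trtr_mae:.2f}  TSTR MAE={tstr_mae:.2f}")
```

The key design point is that both models are scored on the same real hold-out set, so any gap between the two errors can be attributed to the training data rather than to the test split.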
Figure 1: Tabular Data Generation Pipeline. From EU flight operations data (OAG), our AI models generate synthetic flight records validated across three dimensions: privacy protection (Distance to Closest Record), statistical fidelity (correlation preservation, distribution similarity), and utility (performance in downstream prediction tasks).
Model Architectures
REaLTabFormer (Transformer-based)
REaLTabFormer leverages transformer architectures to generate synthetic tabular data by treating each flight record as a sequence of tokens. This autoregressive approach captures complex dependencies between features through sequential generation.
Key Features:
- GPT-2 based architecture with 6 transformer decoder layers
- Row-wise, token-by-token generation preserving feature relationships
- Target masking regularization prevents data memorization
- Overfitting detection using Q_δ statistic and Distance to Closest Record (DCR)
- Performance: Achieves 94-97% of real-data predictive performance
Figure 2: REaLTabFormer Architecture. Transformer-based autoregressive model treating tabular data as token sequences with specialized regularization and overfitting detection mechanisms.
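To make the "record as a token sequence" idea concrete, the sketch below serializes rows into column-tagged tokens and builds the next-token training pairs an autoregressive model would learn from. It is a simplified illustration, not REaLTabFormer's actual tokenizer: the column names, the example values, the `col=value` token format, and the helper functions are all hypothetical.

```python
def row_to_tokens(row, columns):
    """Serialize one record as column-tagged tokens in a fixed column order."""
    return ["<BOS>"] + [f"{col}={row[col]}" for col in columns] + ["<EOS>"]

def build_vocab(token_rows):
    """Assign an integer id to every distinct token."""
    vocab = {}
    for tokens in token_rows:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def next_token_pairs(ids):
    """Autoregressive training pairs: (prefix seen so far, next token to predict)."""
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]

# Hypothetical flight records with a pre-binned delay column.
columns = ["airline", "origin", "destination", "delay_bin"]
rows = [
    {"airline": "XX", "origin": "OSL", "destination": "AMS", "delay_bin": "0-15"},
    {"airline": "YY", "origin": "AMS", "destination": "MAD", "delay_bin": "15-30"},
]

token_rows = [row_to_tokens(r, columns) for r in rows]
vocab = build_vocab(token_rows)
encoded = [[vocab[t] for t in toks] for toks in token_rows]
pairs = next_token_pairs(encoded[0])

for prefix, target in pairs:
    print(prefix, "->", target)
```

Because each token is predicted from all tokens before it in the row, later columns are generated conditionally on earlier ones, which is how dependencies between features are carried through generation.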
TabSyn (Diffusion-based)
TabSyn employs a two-stage pipeline combining Variational Autoencoders with diffusion models in latent space, enabling efficient high-quality synthesis of mixed-type tabular data.
Key Features:
- “Encode → Diffuse → Decode” architecture operating in continuous latent space
- VAE encoder transforms mixed-type data into continuous representations
- Score-based diffusion with linear noise schedule (20-50 sampling steps)
- Adaptive β-VAE training with dynamic KL-divergence scheduling
- Performance: Strong fidelity with efficient sampling
Figure 3: TabSyn Architecture. Two-stage pipeline with VAE encoding to continuous latent space followed by efficient diffusion-based generation.
CTGAN (Conditional Adversarial)
Conditional Tabular GAN (CTGAN) addresses aviation data’s mixed-type nature through specialized preprocessing and conditional generation capabilities, particularly effective for handling imbalanced categorical distributions.
Key Features:
- Mode-specific normalization using Bayesian Gaussian Mixture Models
- Conditional generation with training-by-sampling for rare categories
- WGAN-GP loss with gradient penalty for stable adversarial training
- PacGAN framework processing 10 samples jointly to prevent mode collapse
- Performance: Effective for targeted scenario generation
Figure 4: CTGAN Architecture. Adversarial framework with specialized preprocessing for mixed-type data and conditional generation capabilities.
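The effect of training-by-sampling on rare categories can be sketched as follows. The example follows the log-frequency weighting described for CTGAN's condition sampling; the `+ 1` inside the logarithm is an assumption added here so that categories seen only once keep a non-zero weight, and the aircraft-type counts and function names are hypothetical.

```python
import math
import random
from collections import Counter

def log_frequency_probs(values):
    """Condition-sampling probabilities proportional to log(count + 1)."""
    counts = Counter(values)
    weights = {cat: math.log(n + 1) for cat, n in counts.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}

def sample_condition(probs, rng):
    """Draw the category that the generator will be conditioned on."""
    cats = list(probs)
    return rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]

# Hypothetical, heavily imbalanced aircraft-type column.
aircraft = ["A320"] * 900 + ["B738"] * 90 + ["AT72"] * 10
raw_freq = {cat: n / len(aircraft) for cat, n in Counter(aircraft).items()}
probs = log_frequency_probs(aircraft)

rng = random.Random(0)
conditions = [sample_condition(probs, rng) for _ in range(5)]

for cat in raw_freq:
    print(f"{cat}: raw={raw_freq[cat]:.3f}  log-frequency={probs[cat]:.3f}")
```

Compressing the counts with a logarithm means the rarest category is selected as a training condition far more often than its raw share of the data, while the dominant category still remains the most likely choice.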
TVAE (Variational Autoencoder)
Tabular Variational Autoencoder (TVAE) provides a probabilistic approach to synthetic data generation with lightweight computational requirements and stable training characteristics. Sharing its preprocessing pipeline with CTGAN, TVAE replaces adversarial training with a VAE objective, offering more stable convergence and the ability to generate large synthetic datasets efficiently.
Key Features:
- Evidence Lower Bound (ELBO) optimization combining reconstruction accuracy with KL regularization
- Reparameterization trick enabling differentiable latent sampling and end-to-end backpropagation
- Batch normalization and L2 regularization throughout encoder and decoder layers
- Shared mode-based preprocessing with CTGAN: Bayesian GMM normalization for continuous features, one-hot encoding for categorical
- Minimal GPU requirements and fast training; the most compute-efficient of the five tabular generators
- Performance: Good utility-to-compute ratio; scales well to large datasets (>1 million records)
Figure 5: TVAE Architecture. The encoder maps preprocessed flight records through stacked MLP layers to a variational bottleneck producing μ and log σ² parameters. The reparameterization trick samples latent code z = μ + σ·ε, which the decoder maps back to flight record space. Training minimizes the negative ELBO: reconstruction loss plus KL divergence from the standard normal prior.
Architecture Details
Preprocessing: Continuous features (delays, durations) are normalized using a Bayesian Gaussian Mixture Model to handle multi-modal distributions; categorical features (airline, airport, aircraft type) are one-hot encoded. This is the same preprocessing pipeline used by CTGAN.
Encoder: A stack of fully connected linear layers with batch normalization and ReLU activations maps the preprocessed input to a compact intermediate representation. The final encoder layer splits into two heads, one for the mean vector μ and one for log σ², which together define a Gaussian posterior over the latent space.
Latent Sampling: The reparameterization trick samples z = μ + σ·ε where ε ~ N(0, I), making the sampling step differentiable and enabling gradient-based optimization through the stochastic node.
Decoder: A symmetric MLP stack maps latent code z back to the original feature space, applying activation functions appropriate to each output type (continuous or categorical). During generation, z is sampled directly from the prior N(0, I) and passed through the decoder.
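Training Objective: TVAE maximizes the ELBO or, equivalently, minimizes its negative:
L = −E_q(z|x)[log p(x|z)] + β · KL(q(z|x) || p(z))
where the first term is the reconstruction loss, which encourages fidelity to input distributions, and the KL term regularizes the latent space towards the standard normal prior p(z) = N(0, I). The weight β controls the strength of the KL term; β = 1 recovers the standard ELBO. L2 weight decay provides additional regularization against overfitting on small datasets.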
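The latent sampling step and the loss can be sketched in plain Python. This is a minimal numerical illustration rather than the project's training code: the encoder outputs and the reconstruction loss value are hypothetical, and the KL term uses the standard closed form for a diagonal Gaussian posterior against N(0, I). The function returns the negative ELBO, that is, the quantity a training loop would minimize.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) and sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def negative_elbo(recon_loss, mu, log_var, beta=1.0):
    """Loss to minimize: reconstruction loss plus beta-weighted KL divergence."""
    return recon_loss + beta * kl_to_standard_normal(mu, log_var)

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -0.5]  # hypothetical encoder outputs for one record
z = reparameterize(mu, log_var, rng)    # latent code that would be fed to the decoder
kl = kl_to_standard_normal(mu, log_var)
loss = negative_elbo(recon_loss=1.2, mu=mu, log_var=log_var, beta=1.0)

print(f"z={z}  KL={kl:.4f}  loss={loss:.4f}")
```

Because the noise ε is drawn independently of μ and log σ², an automatic differentiation framework can propagate gradients of the loss through z back to the encoder parameters, which is the purpose of the reparameterization.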
Gaussian Copula (Statistical)
Gaussian Copula represents a classical statistical approach that models dependencies by separating marginal distributions from their correlation structure.
Key Features:
- Kernel Density Estimation for marginal distribution modeling
- Multivariate Gaussian copula for dependency structure
- No neural network training required
- Strongest privacy protection among all models
- Performance: Best for privacy-critical applications
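The separation of marginals from the dependency structure can be sketched for two numeric columns. Note one deliberate simplification: the marginals here are modeled with the empirical CDF and its inverse instead of Kernel Density Estimation, to keep the example within the standard library. The column values and function names are hypothetical.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def normal_scores(values):
    """Map a column to standard-normal scores through its empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5

def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: the observed value at quantile u in (0, 1)."""
    return sorted_values[min(int(u * len(sorted_values)), len(sorted_values) - 1)]

def sample_copula(col_a, col_b, n_samples, seed=0):
    """Fit a bivariate Gaussian copula and draw synthetic (a, b) pairs."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    samples = []
    for _ in range(n_samples):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = e1
        z2 = rho * e1 + max(0.0, 1 - rho ** 2) ** 0.5 * e2  # correlated normals
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)     # back to uniforms
        samples.append((empirical_quantile(sorted_a, u1), empirical_quantile(sorted_b, u2)))
    return rho, samples

# Hypothetical columns: departure delay (min) and turnaround time (min).
dep_delay = [0, 3, 5, 8, 12, 20, 35, 60]
turnaround = [40, 42, 45, 41, 50, 55, 62, 80]
rho, synthetic = sample_copula(dep_delay, turnaround, n_samples=5)
print(f"rho={rho:.3f}", synthetic)
```

With an empirical inverse CDF the synthetic values are always drawn from observed values of each column; a smoothed marginal such as KDE would instead produce new values between and around the observations.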
Evaluation Framework
Our evaluation spans three dimensions ensuring synthetic data maintains essential properties for aviation operations:
Fidelity Assessment
- Statistical Similarity: Kolmogorov-Smirnov tests for continuous distributions, Chi-squared for categorical
- Correlation Preservation: Pearson, Spearman, Kendall correlations plus mixed-type measures
- Joint Distribution Fidelity: KL-divergence for multivariate distributional alignment
- Likelihood-based Assessment: Bayesian Network and Gaussian Mixture Model likelihood
- Detection Difficulty: Logistic regression classifier performance (synthetic vs. real)
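The two per-column similarity statistics above can be sketched without external dependencies. The example computes only the test statistics, not p-values; in practice a statistics library (for example `scipy.stats.ks_2samp` and `scipy.stats.chisquare`) would typically be used to obtain both. The sample values and function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    d = 0.0
    for x in real_sorted + synth_sorted:
        f_real = bisect_right(real_sorted, x) / len(real_sorted)
        f_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        d = max(d, abs(f_real - f_synth))
    return d

def chi_squared_statistic(real, synth):
    """Pearson chi-squared of synthetic counts against real-data proportions.

    Assumes every synthetic category also occurs in the real data;
    categories seen only in the synthetic column are ignored here.
    """
    real_counts, synth_counts = Counter(real), Counter(synth)
    stat = 0.0
    for cat, n_real in real_counts.items():
        expected = n_real / len(real) * len(synth)
        stat += (synth_counts.get(cat, 0) - expected) ** 2 / expected
    return stat

# Hypothetical continuous column (delay in minutes) and categorical column (airline).
real_delay = [0, 2, 5, 7, 10, 15, 22, 40]
synth_delay = [1, 3, 4, 8, 12, 14, 25, 38]
real_airline = ["XX"] * 6 + ["YY"] * 3 + ["ZZ"] * 1
synth_airline = ["XX"] * 5 + ["YY"] * 4 + ["ZZ"] * 1

ks = ks_statistic(real_delay, synth_delay)
chi2 = chi_squared_statistic(real_airline, synth_airline)
print(f"KS={ks:.3f}  chi2={chi2:.3f}")
```

For both statistics, values close to zero indicate that the synthetic column's distribution is close to the real one.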
Privacy Protection
Privacy evaluation uses Distance to Closest Record (DCR) metrics measuring how closely synthetic records resemble real training data:
- Baseline Protection: Compares synthetic-to-real distances against random data baseline
- Overfitting Protection: Tests memorization by comparing proximity to training versus holdout sets
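A minimal sketch of the DCR computation and the train-versus-holdout comparison is given below. It assumes purely numeric features already scaled to a common range and uses Euclidean distance; the handling of categorical features, the distance metric, and the thresholds used in the project are not stated on this page, and all rows and function names here are hypothetical.

```python
import math

def dcr(record, reference):
    """Distance to Closest Record: smallest Euclidean distance to any reference row."""
    return min(math.dist(record, ref) for ref in reference)

def share_closer_to_train(synthetic, train, holdout):
    """Fraction of synthetic rows whose nearest real row is in the training set."""
    closer = sum(dcr(s, train) < dcr(s, holdout) for s in synthetic)
    return closer / len(synthetic)

# Hypothetical rows, already scaled to [0, 1]: (delay, duration).
train = [(0.10, 0.20), (0.40, 0.50), (0.80, 0.90)]
holdout = [(0.15, 0.25), (0.45, 0.40), (0.70, 0.95)]
synthetic = [(0.12, 0.22), (0.44, 0.42), (0.72, 0.94), (0.80, 0.90)]

distances = [dcr(s, train) for s in synthetic]
share = share_closer_to_train(synthetic, train, holdout)
print("DCR to train:", [round(d, 3) for d in distances])
print("share closer to train:", share)
```

If the training and holdout sets are of similar size and the generator has not memorized its training data, synthetic rows should be about equally likely to land nearer either set, so a share close to 0.5 is reassuring while a share close to 1 points to memorization. A DCR of exactly zero, as for the last synthetic row here, flags a verbatim copy of a training record.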
Utility Evaluation
- Predictive Performance: RMSE, MAE, R² metrics across prediction tasks
- Feature Importance Alignment: Cosine similarity of feature importance vectors
- Utility Scores: Normalized performance ratios quantifying synthetic-to-real substitutability
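The last two utility measures can be sketched as follows. The exact normalization used for the project's utility scores is not given on this page, so the ratio below is an assumption: it is oriented so that 1.0 means a model trained on synthetic data matches one trained on real data. The importance vectors and metric values are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature-importance vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def utility_score(real_metric, synth_metric, lower_is_better=True):
    """Performance ratio oriented so that 1.0 means parity with real-data training."""
    if lower_is_better:                   # error metrics such as RMSE or MAE
        return real_metric / synth_metric
    return synth_metric / real_metric     # score metrics such as R²

# Hypothetical feature importances from models trained on real vs. synthetic data.
importance_real = [0.50, 0.30, 0.15, 0.05]
importance_synth = [0.45, 0.35, 0.12, 0.08]
alignment = cosine_similarity(importance_real, importance_synth)

# Hypothetical test metrics, both measured on the same real hold-out set.
rmse_utility = utility_score(real_metric=12.0, synth_metric=12.5)
r2_utility = utility_score(real_metric=0.80, synth_metric=0.76, lower_is_better=False)

print(f"alignment={alignment:.3f}  RMSE utility={rmse_utility:.2f}  R2 utility={r2_utility:.2f}")
```

Orienting the ratio by metric direction keeps the interpretation uniform: a score below 1.0 always means the synthetic-trained model underperforms the real-trained one.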