Using Large Language Models to Generate Synthetic Data for Proactive Pedestrian Safety Prediction: Overcoming Data Collection Barriers in Surrogate Safety Analysis
Pedestrian fatalities remain a persistent and growing safety crisis, with over 7,500 pedestrians killed on U.S. roads annually. Effective countermeasure deployment requires identifying high-risk locations before crashes occur, yet traditional crash-based analyses are insufficient because pedestrian crashes are rare at individual intersections. Surrogate safety analysis using pedestrian–vehicle close calls offers a proactive alternative, but comprehensive observational data collection is prohibitively expensive and time-intensive. Video monitoring requires specialized equipment, extended deployment periods, and substantial manual processing. These practical constraints severely limit the geographic coverage, temporal scope, and contextual diversity of available datasets, ultimately hindering agencies' ability to develop reliable predictive tools that generalize across diverse intersection types and support evidence-based safety interventions statewide. This project addresses these challenges by introducing Large Language Models (LLMs) as a novel tool for generating high-quality synthetic pedestrian–vehicle interaction data. LLMs possess extensive pre-trained knowledge spanning transportation systems, human behavior, and urban environments, and they have been used successfully for data augmentation in healthcare and climate science. Building upon the Minnesota Traffic Observatory (MTO) dataset, which covers 18 intersections with 3,314 interactions involving 4,941 pedestrians, the research team will develop a validated methodology to generate contextually realistic scenarios incorporating roadway geometry, traffic control, land use, pedestrian demographics, and temporal patterns. This approach directly tackles the data scarcity problem that prevents agencies from conducting comprehensive pedestrian safety analyses across their jurisdictions.
The project has three objectives: (1) develop a transparent, LLM-based methodology for generating synthetic transportation data, with validation protocols that ensure realism and quality; (2) evaluate whether models augmented with synthetic data improve prediction accuracy and transferability across intersections compared to models trained on observational data alone, using precision-recall AUC, calibration diagnostics, leave-one-site-out validation, and other appropriate approaches; and (3) determine whether performance improvements stem from introducing realistic scenario diversity or from addressing rare-event limitations, to guide best practices. The framework will incorporate probability calibration, split-conformal risk control, and decision-curve analysis to deliver deployment-ready tools with quantified uncertainty for operational use.
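As an illustration of the evaluation named in objective (2), the following is a minimal sketch of leave-one-site-out validation scored with a step-wise precision-recall AUC (average precision). It is not the project's implementation: the function names, the toy site labels, the close-call indicators, and the scores are all hypothetical, and tied scores are not treated specially.

```python
from collections import defaultdict


def leave_one_site_out(site_ids):
    """Yield (held_out_site, train_idx, test_idx), holding out each site once."""
    by_site = defaultdict(list)
    for i, site in enumerate(site_ids):
        by_site[site].append(i)
    for held_out in sorted(by_site):
        test_idx = by_site[held_out]
        train_idx = sorted(
            i for site, idx in by_site.items() if site != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (ties broken by input order)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        return float("nan")  # undefined when the held-out site has no positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


# Toy, made-up data: intersection label, close-call indicator, model score.
sites = ["A", "A", "B", "B", "C", "C"]
labels = [1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.3, 0.1, 0.5]

for site, train_idx, test_idx in leave_one_site_out(sites):
    # A real pipeline would fit a model on train_idx here (observational data,
    # optionally plus synthetic records); scores are fixed for illustration.
    ap = average_precision(
        [labels[i] for i in test_idx], [scores[i] for i in test_idx]
    )
    print(f"held-out site {site}: {len(train_idx)} training rows, AP = {ap}")
```

Holding out whole intersections, rather than random rows, is what makes the score a test of transferability to unseen sites. A site with no observed close calls yields an undefined score, which is one way the rare-event limitation shows up in practice.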
Language
- English
Project
- Status: Active
- Funding: $90,000.00
- Contract Numbers: 69A3552348323
- Sponsor Organizations:
  Office of the Assistant Secretary for Research and Technology
  University Transportation Centers Program
  Department of Transportation
  Washington, DC United States 20590
- Managing Organizations:
  2400 6th Street, NW
  Washington, DC United States 20059
- Project Managers: Bruner, Britain
- Performing Organizations:
  2400 6th Street, NW
  Washington, DC United States 20059
- Principal Investigators: Marin, Claudia
- Start Date: 20260202
- Expected Completion Date: 20260930
- Actual Completion Date: 0
- USDOT Program: University Transportation Centers Program
Subject/Index Terms
- TRT Terms: Data analysis; High risk locations; Machine learning; Pedestrian safety; Pedestrian vehicle crashes; Predictive models; Validation
- Subject Areas: Data and Information Technology; Pedestrians and Bicyclists; Planning and Forecasting; Safety and Human Factors
Filing Info
- Accession Number: 01978545
- Record Type: Research project
- Source Agency: Research and Education for Promoting Safety (REPS) University Transportation Center
- Contract Numbers: 69A3552348323
- Files: UTC, RIP
- Created Date: Feb 3 2026 3:34PM