Lightweight Temporal Transformer Decomposition for Federated Autonomous Driving

1University of Liverpool, UK
2AIOZ, Singapore
3National Tsing Hua University, Taiwan

Abstract

Traditional vision-based autonomous driving systems often face difficulties in navigating complex environments when relying solely on single-image inputs. To overcome this limitation, incorporating temporal data, such as past image frames or steering sequences, has proven effective in enhancing robustness and adaptability in challenging scenarios. While previous high-performance methods exist, they often rely on resource-intensive fusion networks, making them impractical for training and unsuitable for federated learning. To address these challenges, we propose Lightweight Temporal Transformer Decomposition, a method that processes sequential image frames and temporal steering data by breaking down large attention maps into smaller matrices. This approach reduces model complexity, enabling efficient weight updates for convergence and real-time predictions while leveraging temporal information to enhance autonomous driving performance. Intensive experiments on three datasets demonstrate that our method outperforms recent approaches by a clear margin while achieving real-time performance. Additionally, real robot experiments further confirm the effectiveness of our method.

Problem Statement

To address data privacy, several autonomous driving approaches utilize federated learning to train decentralized models across multiple vehicles. However, most autonomous driving models still rely on single-frame inputs and use relatively simple networks to keep training feasible in a federated learning setup. This single-frame approach overlooks the temporal data each vehicle collects over time, which can provide essential context for understanding motion patterns, tracking objects, and anticipating potential hazards. As a result, these models do not fully leverage the sequential information needed to predict and respond to dynamic driving scenarios, ultimately limiting their performance and adaptability.

In this paper, our goal is to develop a federated autonomous driving framework that incorporates temporal information as input. To address the complexity and learning challenges of the fusion model when training on temporal information in a federated scenario, we propose a Lightweight Temporal Transformer, a new approach designed to reduce the complexity of the network in each silo by efficiently approximating the information from the inputs. Our method applies a decomposition under unitary attention to break learnable attention maps down into low-rank ones, ensuring that the resulting models remain lightweight and trainable. By reducing model complexity, our approach enables the network to use temporal data while ensuring convergence. Intensive experiments demonstrate that our approach significantly improves performance over state-of-the-art methods in federated autonomous driving. The figure below compares traditional single-frame solutions for federated autonomous driving (a) with our lightweight temporal transformer network, which improves training feasibility in a federated learning setup (b).
[Figure: (a) traditional single-frame federated autonomous driving vs. (b) our lightweight temporal transformer approach.]
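To make the low-rank idea concrete, here is a minimal PyTorch sketch of factorizing a learnable temporal attention map into two small matrices. The rank value, class name, and tensor shapes are illustrative assumptions for exposition, not the authors' actual implementation.

import torch
import torch.nn as nn

class LowRankTemporalAttention(nn.Module):
    """Illustrative sketch: replace a full n x m learnable attention map
    with two factor matrices U (n x r) and V (r x m), shrinking the
    parameter count from n*m to r*(n + m) when r << min(n, m)."""
    def __init__(self, n, m, rank=4):
        super().__init__()
        self.U = nn.Parameter(torch.randn(n, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(rank, m) * 0.02)

    def forward(self, x):
        # x: (batch, m, d) temporal features.
        attn = torch.softmax(self.U @ self.V, dim=-1)  # (n, m) low-rank attention map
        return torch.matmul(attn, x)                   # (batch, n, d)

For example, with n = m = 32 and rank 4, the attention map requires 256 parameters instead of 1,024.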

Methodology

We propose a Lightweight Temporal Transformer Decomposition method for federated autonomous driving, in which a network of vehicles collaboratively trains a global driving policy by aggregating the local weights of each vehicle. Each vehicle minimizes a local regression loss, using mean squared error, to predict steering angles from joint features extracted from temporally ordered RGB images and steering series. We employ unitary attention decoupling to reduce large tensors into smaller ones, followed by tensor factorization that decomposes attention maps into factor matrices. This keeps the model lightweight enough for real-time prediction on edge devices while preserving critical temporal information; evaluations across three datasets confirm its effectiveness. The figure below shows an overview of our lightweight temporal transformer decomposition method for federated autonomous driving.
[Figure: overview of our lightweight temporal transformer decomposition method for federated autonomous driving.]
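The federated training itself follows the familiar local-update-then-aggregate pattern. Below is a minimal sketch, assuming FedAvg-style averaging of silo weights and a hypothetical model that maps (frames, steering series) to a steering angle; the epoch count, optimizer, and all function names are assumptions, not the authors' exact training recipe.

import copy
import torch
import torch.nn as nn

def local_update(model, loader, epochs=1, lr=1e-3):
    """Train one vehicle's local copy on its (frames, steering) sequences."""
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for frames, steering_series, target_angle in loader:
            pred = model(frames, steering_series)  # joint temporal features -> angle
            loss = mse(pred, target_angle)         # local regression loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model.state_dict()

def aggregate(global_model, local_states):
    """Average the local weights into the global driving policy."""
    avg = {}
    for key in local_states[0]:
        avg[key] = torch.stack([s[key].float() for s in local_states]).mean(dim=0)
    global_model.load_state_dict(avg)
    return global_model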

Experiments

The table below compares our approach with state-of-the-art methods, both with and without temporal information. The results demonstrate a clear performance advantage: our method achieves notably lower RMSE and MAE across all three datasets, Udacity+, Gazebo, and Carla.
Performance comparison between different methods. The Gaia topology is used; RMSE is the benchmarking metric (lower is better).
Method       Udacity+  Gazebo  Carla   #Params (M)  Avg. Cycle Time (ms)
MobileNet    0.193     0.083   0.286   2.22         –
DroNet       0.183     0.082   0.333   0.31         –
ST-P3        0.092     0.071   0.132   1247.87      –
ADD          0.097     0.049   0.166   3234.22      –
HPO          0.088     0.044   0.157   5990.19      –
FedAvg       0.212     0.094   0.269   0.31         152.4
FedProx      0.152     0.077   0.226   0.31         111.5
STAR         0.179     0.062   0.208   0.31         299.9
FedTSE       0.144     0.063   0.079   89.11        172
TGCN         0.137     0.069   0.193   78.33        224
Fed-STGRU    0.129     0.059   0.151   91.01        370
BFRT         0.113     0.054   0.111   427.26       1256
MFL          0.108     0.052   0.133   173.87       781
CDL          0.141     0.062   0.183   0.63         72.7
MATCHA       0.182     0.069   0.208   0.31         171.3
MBST         0.183     0.072   0.214   0.31         82.1
FADNet       0.162     0.069   0.203   0.32         62.6
PriRec       0.137     0.066   0.196   325.57       272
PEPPER       0.124     0.055   0.115   89.13        438
Ours (CLL)   0.088     0.045   0.091   5.01         –
Ours (SFL)   0.107     0.049   0.072   5.01         180
Ours (DFL)   0.091     0.043   0.076   5.01         121
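For reference, the RMSE reported above and the MAE mentioned earlier are standard regression measures over predicted steering angles. A generic NumPy sketch, not tied to the authors' evaluation code:

import numpy as np

def rmse(pred, target):
    # Root mean squared error between predicted and ground-truth angles.
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def mae(pred, target):
    # Mean absolute error; less sensitive to outliers than RMSE.
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.mean(np.abs(pred - target)))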

BibTeX

Soon