Lightweight Temporal Transformer Decomposition for Federated Autonomous Driving

1University of Liverpool, UK
2AIOZ, Singapore
3National Tsing Hua University, Taiwan

Abstract

Traditional vision-based autonomous driving systems often face difficulties in navigating complex environments when relying solely on single-image inputs. To overcome this limitation, incorporating temporal data, such as past image frames or steering sequences, has proven effective in enhancing robustness and adaptability in challenging scenarios. While previous high-performance methods exist, they often rely on resource-intensive fusion networks, making them impractical for training and unsuitable for federated learning. To address these challenges, we propose Lightweight Temporal Transformer Decomposition, a method that processes sequential image frames and temporal steering data by breaking down large attention maps into smaller matrices. This approach reduces model complexity, enabling efficient weight updates for convergence and real-time predictions while leveraging temporal information to enhance autonomous driving performance. Intensive experiments on three datasets demonstrate that our method outperforms recent approaches by a clear margin while achieving real-time performance. Additionally, real robot experiments further confirm the effectiveness of our method.

Problem Statement

To address data privacy, several autonomous driving approaches utilize federated learning to train decentralized models across multiple vehicles. However, most autonomous driving models still rely on single-frame inputs and use relatively simple networks to keep training feasible in a federated learning setup. This single-frame approach overlooks the temporal data each vehicle collects over time, which can provide essential context for understanding motion patterns, tracking objects, and anticipating potential hazards. As a result, these models do not fully leverage the sequential information needed to predict and respond to dynamic driving scenarios, ultimately limiting their performance and adaptability.

In this paper, our goal is to develop a federated autonomous driving framework that incorporates temporal information as input. To address the complexity and learning challenges of the fusion model when training on temporal information in a federated scenario, we propose a Lightweight Temporal Transformer, a new approach designed to reduce the complexity of the network in each silo by efficiently approximating the information from the inputs. Our method applies a decomposition under unitary attention to break learnable attention maps down into low-rank ones, ensuring that the resulting models remain lightweight and trainable. By reducing model complexity, our approach enables the network to use temporal data while ensuring convergence. Intensive experiments demonstrate that our approach significantly improves performance over state-of-the-art methods in federated autonomous driving. The figure below compares traditional single-frame solutions for federated autonomous driving (a) with our lightweight temporal transformer network, which improves training feasibility in a federated learning setup (b).
[Figure: (a) traditional single-frame federated autonomous driving vs. (b) our lightweight temporal transformer approach.]
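To make the low-rank idea concrete, here is a minimal PyTorch sketch of factorizing a learnable temporal attention map into two small matrices. The rank value, class name, and tensor shapes are illustrative assumptions for exposition, not the authors' actual implementation.

import torch
import torch.nn as nn

class LowRankTemporalAttention(nn.Module):
    """Illustrative sketch: replace a full n x m learnable attention map
    with two factor matrices U (n x r) and V (r x m), shrinking the
    parameter count from n*m to r*(n + m) when r << min(n, m)."""
    def __init__(self, n, m, rank=4):
        super().__init__()
        self.U = nn.Parameter(torch.randn(n, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(rank, m) * 0.02)

    def forward(self, x):
        # x: (batch, m, d) temporal features.
        attn = torch.softmax(self.U @ self.V, dim=-1)  # (n, m) low-rank attention map
        return torch.matmul(attn, x)                   # (batch, n, d)

For example, with n = m = 32 and rank 4, the attention map requires 256 parameters instead of 1,024.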

Methodology

We propose a Lightweight Temporal Transformer Decomposition method for federated autonomous driving, in which a network of vehicles collaboratively trains a global driving policy by aggregating the local weights of each vehicle. Each vehicle minimizes a local regression loss, using mean squared error, to predict steering angles from joint features extracted from temporally ordered RGB images and steering series. We employ unitary attention decoupling to reduce large tensors into smaller ones, followed by tensor factorization that decomposes attention maps into factor matrices. This keeps the model lightweight enough for real-time prediction on edge devices while preserving critical temporal information; evaluations across three datasets confirm its effectiveness. The figure below shows an overview of our lightweight temporal transformer decomposition method for federated autonomous driving.
[Figure: overview of our lightweight temporal transformer decomposition method for federated autonomous driving.]
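The federated training itself follows the familiar local-update-then-aggregate pattern. Below is a minimal sketch, assuming FedAvg-style averaging of silo weights and a hypothetical model that maps (frames, steering series) to a steering angle; the epoch count, optimizer, and all function names are assumptions, not the authors' exact training recipe.

import copy
import torch
import torch.nn as nn

def local_update(model, loader, epochs=1, lr=1e-3):
    """Train one vehicle's local copy on its (frames, steering) sequences."""
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for frames, steering_series, target_angle in loader:
            pred = model(frames, steering_series)  # joint temporal features -> angle
            loss = mse(pred, target_angle)         # local regression loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model.state_dict()

def aggregate(global_model, local_states):
    """Average the local weights into the global driving policy."""
    avg = {}
    for key in local_states[0]:
        avg[key] = torch.stack([s[key].float() for s in local_states]).mean(dim=0)
    global_model.load_state_dict(avg)
    return global_model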

Experiments

The table below compares our approach with state-of-the-art methods, both with and without temporal information. The results demonstrate a clear performance advantage: our method achieves notably lower RMSE and MAE across all three datasets, Udacity+, Gazebo, and Carla.
Performance comparison between different methods. The Gaia topology is used; RMSE is the benchmarking metric (lower is better).
Method       Udacity+  Gazebo  Carla   #Params (M)  Avg. Cycle Time (ms)
MobileNet    0.193     0.083   0.286   2.22         –
DroNet       0.183     0.082   0.333   0.31         –
ST-P3        0.092     0.071   0.132   1247.87      –
ADD          0.097     0.049   0.166   3234.22      –
HPO          0.088     0.044   0.157   5990.19      –
FedAvg       0.212     0.094   0.269   0.31         152.4
FedProx      0.152     0.077   0.226   0.31         111.5
STAR         0.179     0.062   0.208   0.31         299.9
FedTSE       0.144     0.063   0.079   89.11        172
TGCN         0.137     0.069   0.193   78.33        224
Fed-STGRU    0.129     0.059   0.151   91.01        370
BFRT         0.113     0.054   0.111   427.26       1256
MFL          0.108     0.052   0.133   173.87       781
CDL          0.141     0.062   0.183   0.63         72.7
MATCHA       0.182     0.069   0.208   0.31         171.3
MBST         0.183     0.072   0.214   0.31         82.1
FADNet       0.162     0.069   0.203   0.32         62.6
PriRec       0.137     0.066   0.196   325.57       272
PEPPER       0.124     0.055   0.115   89.13        438
Ours (CLL)   0.088     0.045   0.091   5.01         –
Ours (SFL)   0.107     0.049   0.072   5.01         180
Ours (DFL)   0.091     0.043   0.076   5.01         121
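For reference, the RMSE reported above and the MAE mentioned earlier are standard regression measures over predicted steering angles. A generic NumPy sketch, not tied to the authors' evaluation code:

import numpy as np

def rmse(pred, target):
    # Root mean squared error between predicted and ground-truth angles.
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def mae(pred, target):
    # Mean absolute error; less sensitive to outliers than RMSE.
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.mean(np.abs(pred - target)))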

BibTeX

Soon