


Review: A Comprehensive Summary of the Important Role of Foundation Models in Advancing Autonomous Driving
Jun 11, 2024 05:29 PM

Preface & the Author's Perspective
Recently, with the development of and breakthroughs in deep learning, large-scale foundation models have achieved remarkable results in natural language processing and computer vision. Foundation models also hold great promise in autonomous driving, where they can improve scene understanding and reasoning.
- Pre-trained on rich language and visual data, foundation models can understand, interpret, and reason about the various elements of autonomous driving scenes, providing language and action commands for driving decision-making and planning.
- Foundation models can perform data augmentation based on their understanding of driving scenes, supplying rare but plausible long-tail scenarios that are unlikely to be encountered during routine driving and data collection, thereby improving the accuracy and reliability of autonomous driving systems.
- Another application scenario for foundation models is the world model, which demonstrates the ability to understand physical laws and dynamics. Learned from massive data under a self-supervised paradigm, world models can generate unseen yet plausible driving scenarios, enhancing the prediction of dynamic-object behavior and enabling offline training of driving policies.
This article surveys the applications of foundation models in autonomous driving, organized along three threads: foundation models as autonomous driving models, foundation models for data augmentation, and world models for autonomous driving. As driving models, foundation models can implement functions such as perception, decision-making, and control: through them the vehicle acquires information about its surroundings and derives the corresponding decisions and control actions. For data augmentation, foundation models can be used to enrich driving data, as detailed in the sections below.
Paper link: https://arxiv.org/pdf/2405.02288
Autonomous driving models
Human-like driving with language and vision foundation models
Language and vision foundation models show great application potential in autonomous driving: by strengthening a driving model's understanding of and reasoning about driving scenes, they enable human-like driving. The figure below illustrates how language- and vision-based foundation models understand the driving scene and infer language guidance instructions and driving behavior.
Paradigm of foundation-model enhancement for autonomous driving models
Many works have already shown that language and visual features can effectively enhance a model's understanding of driving scenes. After obtaining a holistic perception of the current environment, the foundation model issues a series of language commands, such as "red light ahead, slow down" or "intersection ahead, watch for pedestrians", so that the autonomous vehicle can execute the final driving behavior according to these instructions.
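To make this language-command loop concrete, here is a minimal Python sketch that turns a symbolic scene summary into a prompt for a large language model and expects a short driving command back. The `SceneDescription` fields, the prompt format, and the `llm.complete` call are illustrative assumptions, not an interface from any surveyed work.

```python
# Minimal sketch of the language-command paradigm; everything here is
# illustrative (hypothetical `llm` client, made-up command vocabulary).
from dataclasses import dataclass

@dataclass
class SceneDescription:
    """Symbolic scene summary produced by an upstream perception stack."""
    traffic_light: str      # e.g. "red", "green", "none"
    objects: list[str]      # e.g. ["pedestrian near crosswalk"]
    ego_speed_mps: float

def build_prompt(scene: SceneDescription) -> str:
    return (
        "You are a driving assistant. Given the scene, reply with one short "
        "driving command.\n"
        f"Traffic light: {scene.traffic_light}\n"
        f"Objects: {', '.join(scene.objects) or 'none'}\n"
        f"Ego speed: {scene.ego_speed_mps:.1f} m/s\n"
        "Command:"
    )

scene = SceneDescription(traffic_light="red",
                         objects=["pedestrian near crosswalk"],
                         ego_speed_mps=8.3)
prompt = build_prompt(scene)
# command = llm.complete(prompt)  # hypothetical call, e.g. "Red light ahead, slow down"
print(prompt)
```

A downstream planner would then map the returned command text onto concrete planning or control targets, which is exactly the action-level grounding discussed next.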
In recent years, academia and industry have embedded GPT-style language knowledge into the decision-making process of autonomous driving, improving driving performance through language commands and promoting the application of large models in autonomous driving. For a large model to actually be deployed on the vehicle, it must ultimately produce planning or control commands, so foundation models should eventually empower autonomous driving at the action/state level. Some researchers have made initial explorations here, but much room for development remains. More importantly, some works explore building driving models with GPT-like approaches that directly output trajectories from a large language model, which are then realized through control commands; this line of work is summarized in the table below.
End-to-end autonomous driving with pre-trained backbones
The core idea of the work above is to improve the interpretability of driving decisions, strengthen scene understanding and parsing, and thereby guide the planning or control of the autonomous driving system. Meanwhile, many works have optimized pre-trained backbone networks in various ways and achieved strong results. To summarize the applications of foundation models in autonomous driving more completely, we therefore also review this research on pre-trained backbones. The figure below shows the overall pipeline of end-to-end autonomous driving.
Pipeline of an end-to-end autonomous driving system based on a pre-trained backbone
In the overall end-to-end pipeline, the quality of the low-level information extracted from raw data bounds, to a large extent, the potential of downstream model performance, and an excellent pre-trained backbone gives the model stronger feature-learning capability. Pre-trained convolutional networks such as ResNet and VGG are the most widely used backbones for visual feature extraction in end-to-end models. These networks are usually pre-trained on object detection or segmentation to extract generalizable features, and their effectiveness has been verified in many works.
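Below is a minimal PyTorch sketch of this pattern: a pre-trained ResNet backbone extracts visual features and a small regression head maps them to future waypoints. The head architecture and the number of waypoints are illustrative assumptions, not a specific surveyed model.

```python
# Pre-trained CNN backbone + waypoint regression head (illustrative sizes).
import torch
import torch.nn as nn
import torchvision.models as models

class WaypointDriver(nn.Module):
    def __init__(self, num_waypoints: int = 4):
        super().__init__()
        backbone = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
        # Drop the ImageNet classifier; keep the convolutional feature extractor.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_waypoints * 2),   # (x, y) offset per waypoint
        )
        self.num_waypoints = num_waypoints

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(image)             # (B, 512, 1, 1) for ResNet-34
        return self.head(feats).view(-1, self.num_waypoints, 2)

model = WaypointDriver()
frame = torch.randn(1, 3, 224, 224)              # one front-camera frame
print(model(frame).shape)                        # torch.Size([1, 4, 2])
```

Fine-tuning such a backbone end-to-end, e.g. by imitating expert trajectories, is one typical training recipe for models of this kind.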
In addition, early end-to-end autonomous driving models were mostly built on various convolutional neural networks and trained via imitation learning or reinforcement learning. More recent work has built end-to-end systems on Transformer architectures and achieved good results as well, e.g., Transfuser, FusionAD, and UniAD.
Data augmentation
With the further development of deep learning and the continued improvement of underlying network architectures, pretrain-then-finetune foundation models have shown increasingly powerful performance. Foundation models, represented by GPT, have shifted large models from rule-based learning paradigms to data-driven ones, which makes data an irreplaceable key link in model learning. During the training and testing of autonomous driving models, large amounts of scene data give the model good understanding and decision-making capabilities across road and traffic scenarios. The long-tail problem of autonomous driving, however, is precisely that unknown edge scenarios are endless, so the model's generalization ability never seems sufficient and performance degrades on rare cases.
Data augmentation is crucial to improving the generalization ability of autonomous driving models. Implementing it requires considering two aspects:
- On the one hand, how to obtain large-scale data, so that the data provided to the autonomous driving model is sufficiently diverse and extensive
- On the other hand, how to obtain data of the highest possible quality, so that the data used to train and test autonomous driving models is accurate and reliable
Accordingly, related research proceeds along these two directions: first, enriching the content of existing datasets and enhancing the data characteristics of driving scenes; second, generating multi-level driving scenarios through simulation. A generic sketch of the first direction follows.
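The torchvision snippet below enriches existing camera frames with photometric and occlusion perturbations; the specific transforms and their magnitudes are illustrative choices, not those of any surveyed work.

```python
# Illustrative photometric/occlusion augmentation for driving images.
import torchvision.transforms as T

augment = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),  # lighting variation
    T.RandomHorizontalFlip(p=0.5),   # mirror the scene (steering labels must flip too)
    T.ToTensor(),
    T.RandomErasing(p=0.25),         # crude occlusion of scene elements
])

# usage inside a dataset: augmented = augment(pil_image)
```

Such low-level augmentation diversifies existing data; the foundation-model approaches below go further by adding semantic content.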
Extending autonomous driving datasets
Existing autonomous driving datasets are mainly built by recording sensor data and then annotating it. The features obtained this way are usually low-level, and the dataset scale is limited, which falls far short of covering the visual feature space of driving scenarios. The advanced semantic understanding, reasoning, and interpretation capabilities of foundation models, represented by language models, provide new ideas and technical routes for enriching and extending autonomous driving datasets. Expanding datasets with these capabilities also helps evaluate the explainability and controllability of autonomous driving systems, thereby improving their safety and reliability.
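As one hedged sketch of this idea, an off-the-shelf captioning model can attach natural-language descriptions to recorded frames; the model choice and the metadata schema below are illustrative, not the method of a specific surveyed work.

```python
# Enrich raw frames with high-level semantic descriptions (illustrative schema).
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")

def enrich_frame(image_path: str) -> dict:
    """Attach a natural-language description to a recorded camera frame."""
    caption = captioner(image_path)[0]["generated_text"]
    return {"image": image_path, "semantic_description": caption}

# record = enrich_frame("frames/000123.jpg")
# e.g. {"image": "...", "semantic_description": "a car driving down a wet street"}
```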
Generating driving scenes
Driving scenes are of great significance to autonomous driving. Collecting diverse scene data solely with on-vehicle sensors in real time is hugely expensive, and for edge scenarios it is difficult to gather enough data at all. Generating realistic driving scenes through simulation has therefore attracted many researchers. Traffic simulation research falls mainly into two categories, rule-based and data-driven:
- Rule-based approaches: use predefined rules, which are often insufficient to describe complex driving scenarios, so the simulated scenes tend to be simple and generic (a toy sketch follows this list)
- Data-driven approaches: train a model on driving data so that it can continuously learn and adapt; however, data-driven methods usually require large amounts of labeled data for training, which hinders the further development of traffic simulation
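To make the rule-based category concrete (the toy sketch promised in the first bullet), the snippet below scripts a lead-vehicle cut-in with constant-velocity kinematics and emits trajectory samples. Every parameter is illustrative, and real simulators are far richer.

```python
# Toy rule-based scenario: a vehicle in the adjacent lane cuts in ahead of ego.
DT = 0.1  # simulation step, seconds

def cut_in_scenario(duration_s: float = 5.0):
    """Yield (t, ego_x, other_x, other_y) samples for a scripted cut-in."""
    ego_x, ego_v = 0.0, 15.0                 # ego: straight ahead at 15 m/s
    oth_x, oth_y, oth_v = 20.0, 3.5, 12.0    # other car: 3.5 m lateral offset
    for i in range(int(duration_s / DT)):
        t = i * DT
        ego_x += ego_v * DT
        oth_x += oth_v * DT
        if 1.0 <= t <= 3.0:                  # rule: merge laterally over 2 s
            oth_y = max(0.0, oth_y - 1.75 * DT)
        yield t, ego_x, oth_x, oth_y

for t, ego_x, oth_x, oth_y in cut_in_scenario():
    pass  # write samples to a dataset, or replay them in a simulator
```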
As technology has developed, data generation has gradually shifted from rule-based to data-driven approaches. By simulating driving scenarios efficiently and accurately, including complex and dangerous situations, such methods provide large amounts of training data for model learning and can effectively improve the generalization ability of the autonomous driving system. The generated scenarios can also be used to evaluate different autonomous driving systems and algorithms, testing and verifying their performance. The following table summarizes different data augmentation strategies.
Summary of different data augmentation strategies
World Model
A world model is an artificial intelligence model that holds an overall understanding or representation of the environment in which it operates and can simulate that environment to make predictions or decisions. In recent literature the term "world model" appears mainly in the context of reinforcement learning, and the concept is gaining traction in autonomous driving because of its ability to understand and explain the dynamics of the driving environment. World models are closely related to reinforcement learning, imitation learning, and deep generative models. Using world models in reinforcement and imitation learning usually requires well-labeled data: methods such as SEM2 and MILE operate in the supervised paradigm, while other attempts combine reinforcement learning with unsupervised learning to work around the limitations of labeled data. Owing to their close ties to self-supervised learning, deep generative models have become increasingly popular, and a great deal of work has been proposed. The figure below shows the overall flow of using a world model to enhance an autonomous driving model.
Overall flow chart of world-model enhancement for an autonomous driving model
Deep generative models
Deep generative models typically include variational autoencoders, generative adversarial networks, flow models, and autoregressive models.
- The variational autoencoder combines the ideas of autoencoders and probabilistic graphical models to learn the underlying structure of the data and generate new samples (a minimal sketch follows this list)
- The generative adversarial network consists of two neural networks, a generator and a discriminator, which compete with and strengthen each other through adversarial training, ultimately learning to generate realistic samples
- The flow model transforms a simple prior distribution into a complex data distribution through a series of invertible transformations, from which realistic data samples can be generated
- The autoregressive model is a sequence-modeling approach that captures the relationship between current and past observations through the autocorrelation in sequential data; its parameters are usually estimated via least squares or maximum likelihood. The diffusion model, which learns a step-wise denoising process starting from pure noise, is a representative example in this family, and thanks to its powerful generative performance it is the current SOTA among deep generative models
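As a minimal sketch of the first item above (the one promised in that bullet), the PyTorch VAE below encodes an input into a latent Gaussian, samples it with the reparameterization trick, and decodes a reconstruction; all layer sizes are illustrative.

```python
# Minimal VAE: encoder -> latent Gaussian -> reparameterized sample -> decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim: int = 784, z_dim: int = 16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    rec = F.mse_loss(recon, x, reduction="sum")                    # reconstruction
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    return rec + kld

model = VAE()
x = torch.rand(8, 784)
recon, mu, logvar = model(x)
print(vae_loss(recon, x, mu, logvar).item())
```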
Generative methods
Given the powerful capabilities of deep generative models, using them as world models that learn driving scenarios has gradually become a research hotspot for enhancing autonomous driving. Below we review the use of deep generative models as world models in autonomous driving. Vision is one of the most direct and effective ways for humans to obtain information about the world, since image data carries extremely rich feature information. Many previous works accomplish image generation with world models, showing that world models can understand and reason about image data well. Overall, researchers hope to learn the intrinsic evolution of the world from image data and then predict future states. Combined with self-supervised learning, world models learned from image data fully unlock the model's reasoning capability and offer a feasible path toward a generalized foundation model for the visual domain. The figure below summarizes related work that uses world models for prediction.
Summary of work using world models for prediction
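As a hedged sketch of the "learn the evolution, predict the future" idea described above: encode a frame to a latent state, roll the latent forward with a recurrent dynamics model, and decode the predicted next frame, trained self-supervised on consecutive video frames. The architecture is illustrative, not any specific surveyed world model.

```python
# Latent world model: encoder -> recurrent dynamics -> next-frame decoder.
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    def __init__(self, z_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(            # image -> latent vector
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, z_dim),
        )
        self.dynamics = nn.GRUCell(z_dim, z_dim) # latent transition model
        self.decoder = nn.Sequential(            # latent -> next-frame pixels
            nn.Linear(z_dim, 64 * 16 * 16), nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, frame, h):
        z = self.encoder(frame)
        h = self.dynamics(z, h)                  # predict the next latent state
        return self.decoder(h), h

model = LatentWorldModel()
frames = torch.randn(2, 3, 64, 64)
h = torch.zeros(2, 128)
pred_next, h = model(frames, h)
print(pred_next.shape)                           # torch.Size([2, 3, 64, 64])
# training: minimize e.g. MSE(pred_next, true next frame) over video clips
```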
Non-generative methods
In contrast to generative world models, LeCun has articulated a different conception of the world model by proposing the Joint Embedding Predictive Architecture (JEPA). This is a non-generative, self-supervised architecture: instead of predicting the output directly from the input data, it encodes the input into an abstract representation space and completes the prediction there. The advantage of predicting this way is that the model does not need to predict every detail of the output and can discard irrelevant ones.
JEPA is a self-supervised learning architecture based on energy models; by observation it learns how the world works and acquires highly generalizable regularities. JEPA also holds great potential in autonomous driving, where learning how driving works is expected to yield high-quality driving scenarios and driving policies.
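A hedged JEPA-style sketch: a context encoder and a target encoder map two views (say, past and future frames) into an embedding space, a predictor maps the context embedding toward the target embedding, and the loss is the distance in that abstract space. The EMA target encoder is a common anti-collapse device from self-supervised practice, and all sizes here are illustrative.

```python
# JEPA-style training step: predict the target's embedding, not its pixels.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(z_dim: int = 128) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, z_dim),
    )

context_enc = make_encoder()
target_enc = copy.deepcopy(context_enc)          # EMA copy, never backpropagated
for p in target_enc.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

def jepa_loss(context_frame, target_frame):
    s_ctx = context_enc(context_frame)           # abstract representation
    with torch.no_grad():
        s_tgt = target_enc(target_frame)
    return F.mse_loss(predictor(s_ctx), s_tgt)   # distance in embedding space

@torch.no_grad()
def ema_update(m: float = 0.99):
    for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
        p_t.mul_(m).add_(p_c, alpha=1 - m)       # slow-moving target weights

loss = jepa_loss(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
print(loss.item())
```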
Conclusion
This article has provided a comprehensive overview of the important role of foundation models in autonomous driving applications. Judging from the surveyed research, one direction worth further exploration is how to design effective network architectures for self-supervised learning. Self-supervised learning can break through the limitations of data annotation, let the model learn from data at scale, and fully unleash its reasoning capability. If a foundation model for autonomous driving can be trained on driving-scene data of different scales under a self-supervised learning paradigm, its generalization ability is expected to improve greatly, and such advances may enable a more general foundation model.
In short, although applying foundation models to autonomous driving still faces many challenges, the application space and development prospects are very broad. We will continue to follow the progress of foundation models applied to autonomous driving.