Glossary: Vision-Language-Action (VLA) Systems
A
Action Planning: The process of determining a sequence of actions to achieve a goal.
Automatic Speech Recognition (ASR): Technology that converts spoken language into text.
C
Cognitive Planning: High-level planning that uses reasoning and knowledge to determine appropriate actions.
L
Large Language Model (LLM): A type of artificial intelligence model trained on vast amounts of text to understand and generate human-like language.
LLM-Based Planning: Using large language models to generate action plans from natural language commands.
M
Multimodal Fusion: The process of combining information from multiple sensory modalities (e.g., vision and language) to form a coherent understanding.
N
Natural Language Understanding (NLU): The ability of a system to comprehend and interpret human language.
S
Speech Recognition: The technology that enables computers to identify and process spoken words. Also called Automatic Speech Recognition (ASR).
Speech-to-Action Pipeline: The sequence of processing steps that converts spoken commands into robot actions.
V
Vision-Guided Action: Robot actions that are informed and controlled by visual perception.
Vision-Guided Manipulation: Precise manipulation tasks guided by visual feedback.
Vision-Language-Action (VLA) System: An integrated system that combines visual perception, language understanding, and action execution.
Visual Grounding: Linking words and phrases in a command to the specific objects or regions they refer to in the visual scene.
Visual Servoing: A closed-loop control technique that uses visual feedback (e.g., image features from a camera) to drive robot motion toward a target.
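
The Visual Servoing entry above can be illustrated with a minimal sketch of a proportional control loop. This is a toy simulation, not a real controller: it assumes the tracked image feature moves exactly as commanded, whereas a real system maps image error to robot velocities through an interaction matrix (image Jacobian). The function names, the gain value, and the pixel coordinates are all hypothetical.

```python
def servo_step(target_px, current_px, gain=0.5):
    """Return a velocity command proportional to the image-space error."""
    error_x = target_px[0] - current_px[0]
    error_y = target_px[1] - current_px[1]
    return (gain * error_x, gain * error_y)


def simulate(target_px, start_px, gain=0.5, steps=20):
    """Apply the proportional law repeatedly.

    Assumption: the feature position changes by exactly the commanded
    velocity each step, which stands in for the camera and robot.
    """
    position = start_px
    for _ in range(steps):
        vx, vy = servo_step(target_px, position, gain)
        position = (position[0] + vx, position[1] + vy)
    return position


# Hypothetical example: drive a feature from (100, 50) to the image centre.
final_position = simulate(target_px=(320, 240), start_px=(100, 50))
```

With a gain between 0 and 1 the error shrinks by a constant factor each step, which is why the feature converges on the target rather than overshooting.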
Additional Terms
Vision Component: The part of a VLA system that handles visual perception and scene understanding.
Language Component: The part of a VLA system that processes natural language commands and extracts intent.
Action Component: The part of a VLA system that executes physical behaviors and manipulations.
Integration Layer: The component that coordinates the flow of information between vision, language, and action components.
Object Reference Resolution: Identifying which specific objects are being referred to in language commands.
Spatial Reference Understanding: Comprehending spatial relationships described in language.
Task Decomposition: Breaking down complex commands into sequences of simpler actions.
Hierarchical Planning: Creating multi-level plans from high-level goals to specific actions.
Dependency Management: Tracking which tasks must be completed before others can begin, so that actions execute in a valid order.
Contextual Disambiguation: Using environmental and conversational context to resolve ambiguous references.
Grasp Planning: Determining the best points and orientations for grasping objects.
Manipulation Trajectory: The planned path a robot's arm or end effector follows during a manipulation task, chosen to be safe and efficient while avoiding obstacles.
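
The Task Decomposition and Dependency Management entries above can be sketched together: a command is broken into subtasks, each subtask lists the subtasks it depends on, and a topological sort yields a valid execution order. The command, the subtask names, and the `execution_order` helper are hypothetical; the sketch uses Python's standard `graphlib` module (Python 3.9+) and is one possible approach, not a prescribed one.

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition of "put the cup on the shelf".
# Each key is a subtask; its value is the set of subtasks that must finish first.
dependencies = {
    "locate_cup": set(),
    "locate_shelf": set(),
    "grasp_cup": {"locate_cup"},
    "move_to_shelf": {"grasp_cup", "locate_shelf"},
    "place_cup": {"move_to_shelf"},
}


def execution_order(deps):
    """Return one ordering in which every subtask follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

Several orderings can be valid here (the two "locate" subtasks are independent), so the sketch guarantees only that prerequisites come first, not one specific sequence.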