Glossary: Vision-Language-Action (VLA) Systems

A

Automatic Speech Recognition (ASR): Technology that converts spoken language into text.

Action Planning: The process of determining a sequence of actions to achieve a goal.

C

Cognitive Planning: High-level planning that uses reasoning and knowledge to determine appropriate actions.

L

Large Language Model (LLM): A type of artificial intelligence model trained on vast amounts of text to understand and generate human-like language.

LLM-Based Planning: Using large language models to generate action plans from natural language commands.

M

Multimodal Fusion: The process of combining information from multiple sensory modalities (e.g., vision and language) to form a coherent understanding.

N

Natural Language Understanding (NLU): The ability of a system to comprehend and interpret human language.

S

Speech Recognition: The technology that enables computers to identify and process spoken words; often used interchangeably with Automatic Speech Recognition (ASR).

Speech-to-Action Pipeline: The sequence of processing steps that converts spoken commands into robot actions.

V

Vision-Guided Action: Robot actions that are informed and controlled by visual perception.

Vision-Language-Action (VLA) System: An integrated system that combines visual perception, language understanding, and action execution.

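Visual Servoing: Closed-loop control of robot motion based on visual feedback, typically the error between where a visual feature currently appears and where it should appear.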
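
As an illustration of the feedback loop behind visual servoing, here is a minimal sketch of a proportional controller that drives a tracked image feature toward a target pixel position. The function names, the gain, and the tolerance are hypothetical, and the sketch assumes the feature moves exactly as commanded, which real camera and robot geometry would not guarantee.

```python
def servo_step(current, target, gain=0.5):
    """Return the commanded image-plane motion for one control cycle."""
    # Proportional law: move a fixed fraction of the remaining pixel error.
    return tuple(gain * (t - c) for c, t in zip(current, target))


def servo_to_target(current, target, gain=0.5, tolerance=1.0, max_steps=100):
    """Repeat servo steps until the largest pixel error is within tolerance."""
    steps = 0
    while steps < max_steps:
        error = max(abs(t - c) for c, t in zip(current, target))
        if error <= tolerance:
            break
        move = servo_step(current, target, gain)
        # Assumption: the observed feature moves exactly as commanded.
        current = tuple(c + m for c, m in zip(current, move))
        steps += 1
    return current, steps


# Hypothetical feature at pixel (0, 0) that should end up at (100, 50).
final_position, steps_taken = servo_to_target((0.0, 0.0), (100.0, 50.0))
```

In a real system the current position would come from a fresh camera measurement each cycle rather than from the commanded motion.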

Visual Grounding: Linking words or phrases in language to the specific objects or regions they refer to in the visual scene.

Vision-Guided Manipulation: Precise manipulation tasks guided by visual feedback.

Related Terms

Vision Component: The part of a VLA system that handles visual perception and scene understanding.

Language Component: The part of a VLA system that processes natural language commands and extracts intent.

Action Component: The part of a VLA system that executes physical behaviors and manipulations.

Integration Layer: The component that coordinates the flow of information between vision, language, and action components.

Object Reference Resolution: Identifying which specific objects are being referred to in language commands.
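
A minimal sketch of object reference resolution, assuming the vision component has already produced a list of detections with a label and a color. The `DetectedObject` type, its fields, and the matching rule are hypothetical and only illustrate the idea; a practical system would use far richer matching.

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    object_id: int
    label: str
    color: str


def resolve_reference(phrase, detections):
    """Return the detections that a phrase such as 'the red cup' may refer to."""
    words = set(phrase.lower().split())
    # First keep objects whose label is mentioned in the phrase.
    candidates = [d for d in detections if d.label in words]
    # Then narrow by color if the phrase names a color that a candidate has.
    color_matches = [d for d in candidates if d.color in words]
    return color_matches or candidates


# Hypothetical scene with two cups and a plate.
scene = [
    DetectedObject(1, "cup", "red"),
    DetectedObject(2, "cup", "blue"),
    DetectedObject(3, "plate", "red"),
]

red_cup = resolve_reference("pick up the red cup", scene)
any_cup = resolve_reference("pick up the cup", scene)
```

When more than one candidate remains, as with `any_cup` here, the reference is ambiguous and further context is needed to resolve it.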

Spatial Reference Understanding: Comprehending spatial relationships described in language.

Task Decomposition: Breaking down complex commands into sequences of simpler actions.

Hierarchical Planning: Creating multi-level plans from high-level goals to specific actions.

Dependency Management: Tracking which tasks must be completed before others can begin, so that the steps of a plan are executed in a valid order.
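
One common way to handle task dependencies is a topological sort of the task graph. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+); the task names and the dependency graph are hypothetical examples, not part of any particular VLA system.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "locate_cup": set(),
    "open_gripper": set(),
    "grasp_cup": {"locate_cup", "open_gripper"},
    "place_cup": {"grasp_cup"},
}


def execution_order(deps):
    """Return one ordering in which every task follows its prerequisites."""
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

Several valid orderings may exist (for example, `locate_cup` and `open_gripper` can run in either order), so only the relative order of dependent tasks is guaranteed.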

Contextual Disambiguation: Using environmental and conversational context to resolve ambiguous references.

Grasp Planning: Determining the best points and orientations for grasping objects.

Manipulation Trajectory: The path a manipulator follows to carry out a manipulation task, planned to be safe and efficient while avoiding obstacles.
