Multimodal AI

We can’t even imagine the full power of Multimodal AI.

When we bite into a crunchy, crackling slice of toasted bread, several senses, including vision, hearing, touch, and smell, are engaged at once. This might seem a very ordinary and trivial experience. However, it actually hides a fascinating series of complex events taking place behind the scenes! Vision deals with frequency bands of the electromagnetic spectrum, hearing with changes in air pressure at the ears, and smell with volatile chemical compounds, yet our brain elegantly builds a unified and rich representation of the food from these multiple and diverse sources of information.

Such an enriched representation allows humans to perform extraordinarily well at identifying, recognizing, and memorizing an object or its context. Human cognition evolved in a multifaceted world. Almost every human experience consists of processing a vast array of sensory inputs coming from different sources in the environment. Our brain is able to draw useful connections between these different sensory modalities, enabling efficient representations of the world.

Where we are now

One of the most exciting challenges for modern AI research has been to draw on the human ability for multisensory integration, enabling machines to build complex and rich representations of multimodal data sources. The main turning point was the introduction of AI technologies capable of capturing abstract representations of different stimuli (e.g. image, audio, text) and relating them in meaningful ways. This enabled the learning of common features from such multimodal sources, with the advantage of efficiently mining their correlations to make better decisions based on their fusion. Today, multimodal AI has reached results that were unimaginable a few years ago, such as retrieving and creating images from natural language instructions, achieving superior performance in recognizing and classifying objects, and describing scenes or visual contexts in plain language as a human would.

Why it matters

AI is now present in almost all expert human activities, from manufacturing and finance to medical diagnostics and human resources management. As technology advances, an ever-increasing amount of information is gathered as structured and unstructured data from social, natural, and behavioral phenomena, in a variety of formats and modalities. Leveraging the information carried by these multiple sources through universal multimodal modeling can reveal relationships between phenomena that we could never have imagined, enhancing both human and machine decision-making.

Our research at Neuraptic AI

Multimodal AI was conceived to bridge the gap between human and machine capabilities in processing multisensory information. This posed a great challenge for research, but it also led AI to restrict its focus to human-friendly sensory inputs (e.g. language and images). On the one hand, this provided a crucial incentive toward understanding intelligence in order to endow machines with efficient cognitive-perceptual systems. On the other hand, it took us away from the goal of using machines to sustain and enhance human experience in those activities where human cognition fails, such as processing physical or physiological signals, huge tabular data, or long sequences of historical data (non-human-friendly data). With Hypercontext, we aim to take advantage of the most recent advances in these two fundamental but still unrelated fields, so that both human-friendly and non-human-friendly data can co-exist in a unified AI framework. Hypercontext is designed to open the doors to a new era of human cognitive enhancement.

Our research is focused on building universal architectures in which each data source can be seen as a unique complex percept, a piece of the world carrying a particular semantics, or meaning. Although this is a trivial task when dealing with data that can be naturally partitioned into uni-modal components (e.g. a video clip can be decomposed into its visual and audio constituents), things become challenging when batches of sequential data points or a set of variables in a table row are taken into account. To solve the problem, we design original Transformer models that operate in two steps: they first extract abstract but semantically rich representations of non-human-friendly data sources, and then fuse these representations with those of human-friendly data.

Want to know more?

Many diverse architectures and data representation strategies have been introduced in the last few years to endow Deep Learning models with multimodal capabilities. Most of the scientific challenges concern how to represent data from different modalities, where and when to fuse these representations, and how to generalize knowledge across multiple related tasks.


Learning representations of input data is the defining talent of every Deep Learning model. However, the representation problem has never been as crucial as it is for Multimodal Learning, since the goal here is to learn a joint representation space suitable for all the modalities under consideration. Strategies to unify representations of visual and auditory data at the very early stages of information processing have mostly been inspired by Vision Transformers: patches (tokens) of images and audio spectrograms are projected to embedding spaces suitable for encoding. Language models, in turn, played a crucial role in obtaining latent representations of the inputs at later stages of information processing, thanks to auxiliary learnable tokens, such as the Class (CLS) Token, able to capture the semantics of whole sentences. Special tokens like this proved able to capture the semantics of diverse modalities, making it possible to relate even heterogeneous multimodal inputs.
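As a rough illustration of the patch-projection idea described above, the sketch below splits an image and an audio spectrogram into patches, projects them linearly, and prepends a CLS token to each sequence. It is a minimal NumPy sketch under stated assumptions, not the implementation of any specific model: the function names, the patch size of 8, the embedding width of 16, and the random weights (which stand in for learned parameters) are all hypothetical.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) array into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "size must be divisible by patch"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)        # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)  # (num_patches, patch*patch*C)

def embed_with_cls(patches, weight, cls_token):
    """Linearly project patches and prepend a CLS token as row 0."""
    tokens = patches @ weight                        # (num_patches, d_model)
    return np.vstack([cls_token[None, :], tokens])   # (1 + num_patches, d_model)

rng = np.random.default_rng(0)
d_model = 16                                   # assumed embedding width

image = rng.normal(size=(32, 32, 3))           # stand-in for an RGB image
spectrogram = rng.normal(size=(64, 16, 1))     # stand-in for (frequency, time, 1)

# Random matrices stand in for learned projection weights.
w_image = rng.normal(size=(8 * 8 * 3, d_model)) * 0.02
w_audio = rng.normal(size=(8 * 8 * 1, d_model)) * 0.02

# CLS tokens are learned in practice; zeros are used here for simplicity.
cls_image = np.zeros(d_model)
cls_audio = np.zeros(d_model)

image_tokens = embed_with_cls(patchify(image, 8), w_image, cls_image)
audio_tokens = embed_with_cls(patchify(spectrogram, 8), w_audio, cls_audio)
```

Both modalities end up as token sequences of the same width, which is what allows a single encoder to process them together. Positional embeddings are omitted to keep the sketch short.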

Recently, the same semantic token proved suitable for representing features in Tabular Models, as well as for embedding discrete concepts like actions and environmental states.
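One common way to bring a table row into this token-based view is to turn every column into its own token and prepend a CLS token. The sketch below is a hedged NumPy illustration of that idea, not the method of any model mentioned in this article: the per-feature scaling scheme, the function name tokenize_row, the column counts, and the random weights are all assumptions made for the example.

```python
import numpy as np

def tokenize_row(numeric, categorical, num_weight, num_bias, cat_tables, cls_token):
    """Turn one table row into a token sequence with a CLS token at row 0.

    numeric:     (n_num,) float values
    categorical: list of integer category indices, one per categorical column
    num_weight:  (n_num, d) one direction per numeric column
    num_bias:    (n_num, d) one offset per numeric column
    cat_tables:  list of (cardinality, d) embedding tables
    """
    # Each numeric value scales its column's direction and adds the column bias.
    num_tokens = numeric[:, None] * num_weight + num_bias
    # Each categorical value looks up a row in its column's embedding table.
    cat_tokens = np.stack([table[idx] for table, idx in zip(cat_tables, categorical)])
    return np.concatenate([cls_token[None, :], num_tokens, cat_tokens], axis=0)

rng = np.random.default_rng(0)
d_model = 8                                      # assumed embedding width

numeric = np.array([0.5, -1.2, 3.0])             # three numeric columns
categorical = [1, 0]                             # two categorical columns

# Random arrays stand in for learned parameters.
num_weight = rng.normal(size=(3, d_model))
num_bias = rng.normal(size=(3, d_model))
cat_tables = [rng.normal(size=(4, d_model)), rng.normal(size=(2, d_model))]
cls_token = rng.normal(size=d_model)

row_tokens = tokenize_row(numeric, categorical, num_weight, num_bias,
                          cat_tables, cls_token)
```

After encoding, the CLS row can serve as a summary of the whole table row, in the same way a CLS token summarizes a sentence.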


Data representation is certainly a crucial step toward multimodal integration, but deciding where, or when, to fuse modality-specific representations largely affects model performance. Several strategies have been proposed to maximize the efficiency of learning shared semantics between modalities. In standard approaches, attention mechanisms naturally permit a lazy fusion by processing a joint array of modality-specific data tokens. More sophisticated approaches share Fusion Tokens between modality-specific encoders; these tokens are designed to explicitly encapsulate relationships between sensory inputs, resembling the functioning of multisensory neurons in the cerebral cortex. This approach gave rise to an entire taxonomy describing the stage at which such multisensory units come into play: early, middle, or late in modality-specific information processing. Late fusion has also been successfully explored in encoder-decoder architectures, where modality-specific inputs are first processed entirely by modality-specific encoders, and the encoded modalities are then represented as a whole in the decoder.

A simple representation fusion strategy

A concatenated array of multimodal input tokens is fed into a Transformer Encoder, and the attention mechanism is used as a lazy fusion mechanism. A special token representing the aggregation of all the inputs is then projected to produce an output.
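The strategy described in this caption can be sketched with a single self-attention layer. This is a simplified NumPy illustration, assuming one attention head and no residual connections, layer normalization, or feed-forward block (a real Transformer Encoder has all of these). The token counts, the embedding width, and the random weights are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model, n_classes = 8, 3                     # assumed sizes

image_tokens = rng.normal(size=(4, d_model))  # stand-in for embedded image patches
text_tokens = rng.normal(size=(6, d_model))   # stand-in for embedded words
cls_token = rng.normal(size=(1, d_model))     # learned in practice

# Fusion is "lazy": all tokens simply share one sequence, CLS at row 0.
sequence = np.concatenate([cls_token, image_tokens, text_tokens], axis=0)

w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
fused, weights = self_attention(sequence, w_q, w_k, w_v)

# The CLS row aggregates information from both modalities and is projected.
w_out = rng.normal(size=(d_model, n_classes)) * 0.1
logits = fused[0] @ w_out
```

Because every token can attend to every other token, no dedicated fusion module is needed; the price is that attention cost grows with the square of the total sequence length.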

An explicit bottleneck fusion strategy

Each modality-specific input is fed into its own encoder, which also attends to a set of fusion tokens shared among the inputs. The CLS Token of each modality learns to represent both modality-specific and shared features. These tokens are then projected and averaged to produce a single output.
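A hedged sketch of this bottleneck idea follows. Each modality runs its own attention layer over its tokens plus the shared fusion tokens, so the modalities never attend to each other directly. Averaging the fusion-token updates across modalities after each layer is an assumption of this sketch, as are the modality names, sizes, and random weights. Layer normalization and feed-forward blocks are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(tokens, params):
    """Single-head self-attention with a residual connection."""
    w_q, w_k, w_v = params
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + weights @ v

def bottleneck_layer(streams, fusion, layer_params):
    """streams maps a modality name to its tokens, CLS token at row 0."""
    n_fusion = fusion.shape[0]
    new_streams, fusion_updates = {}, []
    for name, tokens in streams.items():
        joint = np.concatenate([tokens, fusion], axis=0)
        out = attention_layer(joint, layer_params[name])
        new_streams[name] = out[:-n_fusion]        # modality-specific tokens
        fusion_updates.append(out[-n_fusion:])     # this modality's fusion update
    # Assumption: the shared fusion tokens are the mean of the per-modality updates.
    return new_streams, np.mean(fusion_updates, axis=0)

rng = np.random.default_rng(0)
d_model, n_fusion, n_layers, n_classes = 8, 2, 2, 3

streams = {
    "video": rng.normal(size=(1 + 5, d_model)),    # CLS + 5 tokens
    "audio": rng.normal(size=(1 + 7, d_model)),    # CLS + 7 tokens
}
fusion = rng.normal(size=(n_fusion, d_model))

def init_params():
    return tuple(rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

for _ in range(n_layers):
    layer_params = {name: init_params() for name in streams}
    streams, fusion = bottleneck_layer(streams, fusion, layer_params)

# Project each modality's CLS token and average the results.
heads = {name: rng.normal(size=(d_model, n_classes)) * 0.1 for name in streams}
logits = np.mean([streams[name][0] @ heads[name] for name in streams], axis=0)
```

Keeping the number of fusion tokens small forces each modality to compress what it shares with the others, which is the sense in which these tokens act as a bottleneck.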


One of the greatest scientific challenges of this century is certainly to find the key to an Artificial General Intelligence that is sufficiently independent from human supervision and able to conceive solutions for diverse problems by selecting the information suitable for the purpose. Broadly speaking, efficient multi-tasking is achieved when models learn to approximate the joint probability between a combination of sensory inputs and the output, given a particular task, that is, a specific problem configuration or set of instructions. In general, massive encoder-decoder architectures were adopted to ensure cross-attention between encoded multisensory inputs and task instructions, generalizing sequence-to-sequence modeling. However, the real focus in multi-task learning has been on making AI models aware of the task context. Rather heterogeneous solutions proposed in recent years produced impressive results, allowing multimodal multi-task models to succeed in dozens, or even hundreds, of different tasks. In particular, one of the most cost-efficient and flexible approaches consisted in engineering the output architecture to ensure different behavioral outcomes for each task, via multiple output networks. Other approaches leveraged pre-trained sensory priors and trained the model to discriminate between multimodal task prompts made of images, text, or both. A less popular but effective approach used, instead, an encoder-only architecture to build a universal perceiver that represents both inputs and outputs as tokenized embeddings and is capable of solving several tasks without explicit task-instruction supervision.

An Encoder-Decoder architecture to handle multi-tasking

Different modality-specific inputs are processed by their own Encoders, and the modality-specific encoded representations are then combined. The Decoder learns to attend to suitable input features given the task requirements, thanks to cross-attention mechanisms between task and input latent representations. Several task-specific networks are used to handle multi-tasking.
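The decoder side of this architecture can be sketched as one cross-attention step followed by task-specific output networks. This is a minimal NumPy illustration under stated assumptions: the encoder outputs are replaced by random arrays, each task is represented by a single query embedding, each output network is a single linear layer, and the task names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, memory, w_q, w_k, w_v):
    """Queries come from the task; keys and values come from the encoded inputs."""
    q = queries @ w_q
    k, v = memory @ w_k, memory @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8

# Stand-ins for the outputs of two modality-specific encoders.
encoded_image = rng.normal(size=(4, d_model))
encoded_text = rng.normal(size=(6, d_model))
memory = np.concatenate([encoded_image, encoded_text], axis=0)   # combined inputs

# Hypothetical tasks: one query embedding and one linear output network each.
task_embeddings = {
    "classification": rng.normal(size=(1, d_model)),
    "regression": rng.normal(size=(1, d_model)),
}
task_heads = {
    "classification": rng.normal(size=(d_model, 5)) * 0.1,   # 5 classes
    "regression": rng.normal(size=(d_model, 1)) * 0.1,       # 1 scalar
}
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

def run_task(task):
    decoded, weights = cross_attention(task_embeddings[task], memory, w_q, w_k, w_v)
    return decoded[0] @ task_heads[task], weights

class_logits, class_weights = run_task("classification")
reg_output, reg_weights = run_task("regression")
```

The attention weights differ per task because the query differs, which is how the same encoded inputs can be read in a task-dependent way.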

An Encoder-only model that learns the joint probability between inputs and targets

This model is applicable to many different tasks. Both modality-specific inputs and targets are concatenated and projected to their representations. The geometric similarity between these representations is then computed, so that the model learns to maximize the likelihood of the right target given the input features.
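One way to turn geometric similarity into a likelihood is to score an input representation against every candidate target with cosine similarity and apply a softmax. The sketch below illustrates that idea only; the use of cosine similarity, the temperature of 0.1, and the toy representations (which stand in for encoder outputs) are assumptions, not details taken from a specific model.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def target_log_probs(input_repr, target_reprs, temperature=0.1):
    """Log-probability of each candidate target given the input representation."""
    sims = l2_normalize(target_reprs) @ l2_normalize(input_repr)  # cosine similarities
    logits = sims / temperature
    logits = logits - logits.max()                 # numerical stability
    return logits - np.log(np.exp(logits).sum())   # log-softmax

rng = np.random.default_rng(0)
d_model = 8

# Toy stand-ins for encoded targets: four orthogonal directions.
target_reprs = np.eye(4, d_model)

# Toy encoded input that lies close to target 2.
input_repr = target_reprs[2] + 0.05 * rng.normal(size=d_model)

log_probs = target_log_probs(input_repr, target_reprs)
predicted = int(np.argmax(log_probs))

# Training would minimize this negative log-likelihood of the right target.
loss = -log_probs[2]
```

Minimizing the loss pulls the input representation toward its correct target and away from the others, so the same scoring rule can be reused for any task whose targets can be embedded.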