Google AI introduces “LIMoE”: one of the first large-scale architectures that processes both images and text using a sparse mixture of experts

This article is written as a summary by Marktechpost staff based on the paper 'Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts'. All credit for this research goes to the researchers of this project. Check out the paper and blog post.

Google Research has a long-standing interest in sparsity research. Pathways encapsulates the research goal of building a single large model capable of handling thousands of tasks and data types. Sparse unimodal models for language (Switch, Task-MoE, GLaM) and computer vision (Vision MoE) have made significant progress so far. Today, the Google AI team is investigating large, sparse models that handle images and text simultaneously with modality-agnostic routing, another major step towards Pathways’ goal. Multimodal contrastive learning is a relevant setting for this, as it requires a solid understanding of both images and text to match images to their correct descriptions. To date, the strongest models for this task rely on separate networks for each modality (a “two-tower” approach).

Sparse models stand out as one of the most promising directions for the future of deep learning. Rather than having every part of a model process every input, sparse models using conditional computation learn to route specific inputs to different “experts” in a potentially very large network. This has many advantages. First, model size can increase while computational cost stays constant, which is a more efficient and environmentally friendly way to scale models, and scale is usually necessary for good performance. Sparsity also compartmentalizes neural networks in a natural way. Dense models that learn several different tasks simultaneously (multitask learning) or sequentially (continual learning) commonly suffer from harmful interference or catastrophic forgetting, where the model gets worse at earlier tasks as new ones are added. Sparse models help avoid both problems: because the entire model is not applied to every input, “experts” in the model can focus on distinct tasks or data types while still benefiting from shared components of the model.

The Google AI team presents the first large-scale multimodal architecture using a sparse mixture of experts in “Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts”. It processes images and text simultaneously, but uses sparsely activated experts that specialize organically. LIMoE outperforms comparable dense multimodal models and two-tower approaches in zero-shot image classification. Thanks to sparsity, LIMoE can scale up smoothly and learn to handle a wide range of inputs, easing the tension between being a jack-of-all-trades generalist and a master-of-one specialist.

Sparse Mixture-of-Experts Models

Transformers represent data as a sequence of vectors (or tokens). Although they were developed for text, they can describe almost anything that can be expressed as a sequence of tokens, such as images, videos, and sounds. Recent large-scale MoE models add expert layers to the Transformer architecture.

A typical Transformer comprises several “blocks”, each containing several distinct layers. One of these layers is a feed-forward network (FFN). In LIMoE and the works cited above, this single FFN is replaced by an expert layer containing many parallel FFNs, each of which is an expert. Given a sequence of tokens to process, a simple router learns to predict which experts should handle which tokens. Only a small number of experts are activated per token, which means that while the capacity of the model is greatly increased by having so many experts, the actual computational cost is kept low by using them sparsely. If only one expert is activated, the cost of the model is roughly equivalent to that of the standard Transformer model.
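The routing step described above can be illustrated with a minimal sketch. This is not the LIMoE implementation: the names `route_top1` and `moe_layer`, the hand-picked router weights, and the toy "experts" (simple functions standing in for FFNs) are all assumptions made for illustration. The sketch only shows the general idea of top-1 routing, where each token is sent to the single expert with the highest router probability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(token, router_weights):
    """Return (expert_index, gate) for one token under top-1 routing."""
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(t * w for t, w in zip(token, row)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def moe_layer(tokens, router_weights, experts):
    """Apply exactly one expert per token and scale its output by the gate."""
    outputs = []
    for token in tokens:
        idx, gate = route_top1(token, router_weights)
        expert_out = experts[idx](token)
        outputs.append([gate * v for v in expert_out])
    return outputs

# Hypothetical toy experts standing in for the parallel FFNs.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical values, one row per expert
tokens = [[3.0, 0.0], [0.0, 3.0]]
outputs = moe_layer(tokens, router_weights, experts)
```

Because each token passes through only one expert, adding more experts increases the number of parameters without increasing the work done per token, apart from the small cost of the router itself.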

LIMoE does just that, activating one expert per example and thereby matching the computational cost of the dense baselines. The difference is that the LIMoE router may see tokens of either image or text data.

A failure mode unique to MoE models occurs when they try to send all tokens to the same expert. Auxiliary losses, that is, additional training objectives, are commonly used to encourage balanced use of the experts. The Google AI team discovered that handling several modalities in combination with sparsity led to new failure modes that conventional auxiliary losses could not solve. To remedy this, they developed new auxiliary losses and used batch priority routing (BPR) during training, two innovations that resulted in stable and high-performing multimodal models.
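To make the overflow problem concrete, here is a small sketch of capacity-limited routing. It assumes that each expert accepts at most `capacity` tokens per batch and that any further tokens sent to it are dropped. The priority rule shown (process tokens in descending order of router confidence) follows the general idea of batch priority routing as we understand it from earlier Vision MoE work; the function names and the toy numbers are hypothetical, and the article itself does not spell out these details.

```python
def assign_with_capacity(choices, capacity, order):
    """Assign tokens to their chosen expert in the given order; drop on overflow.

    choices: list of (expert_index, gate) pairs, one per token.
    Returns a list holding the expert index per token, or None if dropped.
    """
    load = {}
    assignment = [None] * len(choices)
    for i in order:
        expert, _ = choices[i]
        if load.get(expert, 0) < capacity:
            load[expert] = load.get(expert, 0) + 1
            assignment[i] = expert
    return assignment

def batch_priority_order(choices):
    """Process the most confident routing decisions first."""
    return sorted(range(len(choices)), key=lambda i: -choices[i][1])

def success_rate(assignment):
    """Fraction of tokens that were actually processed by an expert."""
    return sum(a is not None for a in assignment) / len(assignment)

# Hypothetical router decisions: four tokens want expert 0, one wants expert 1.
choices = [(0, 0.55), (0, 0.60), (0, 0.95), (1, 0.90), (0, 0.99)]
arrival = assign_with_capacity(choices, capacity=2, order=range(len(choices)))
bpr = assign_with_capacity(choices, capacity=2, order=batch_priority_order(choices))
```

In this toy batch both orderings drop two tokens, so the success rate is the same. What changes is which tokens survive: with priority ordering, the tokens the router is most confident about are kept. Raising the success rate itself requires the router to spread tokens more evenly across experts, which is the role of the auxiliary losses.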

The new auxiliary losses (LIMoE aux) and batch priority routing (BPR) improved overall performance (left) and increased the success rate of routing behavior (middle and right). A low success rate means that the router is not using all available experts and that many tokens are being dropped because individual experts overflow, which generally suggests that the sparse model is not learning well. The combination introduced for LIMoE ensures high routing success rates for both images and text and consequently leads to significantly better performance.

Contrastive Learning with LIMoE

In multimodal contrastive learning, models are trained on paired image-text data (e.g., a photo and its caption). Typically, an image model extracts an image representation, while a text model extracts a text representation. The contrastive learning objective encourages the image and text representations of the same image-text pair to be close together and those of different pairs to be far apart. Models with such aligned representations can be adapted to new tasks without additional training data (“zero-shot”).
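The objective described above can be sketched as a symmetric cross-entropy over an image-text similarity matrix, in the style popularized by CLIP. This is a minimal illustration, not the training code used for LIMoE: the function name `contrastive_loss`, the fixed `temperature=0.1`, and the two-dimensional toy embeddings are assumptions chosen for readability.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Pair k is (image_embs[k], text_embs[k]); every other combination
    in the batch is treated as a negative.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    sims = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]
    # Image-to-text: each row should put its mass on the diagonal entry.
    image_to_text = -sum(log_softmax(sims[k])[k] for k in range(n)) / n
    # Text-to-image: the same, over columns of the similarity matrix.
    columns = [[sims[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.1], [0.1, 1.0]])
shuffled = contrastive_loss(images, [[0.1, 1.0], [1.0, 0.1]])
```

Here `aligned` is much smaller than `shuffled`, because in the first call each image embedding points in nearly the same direction as its own caption embedding, while in the second call the captions are swapped.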

On the popular ImageNet dataset, CLIP and ALIGN (both two-tower models) scaled this technique to reach zero-shot classification accuracies of 76.2% and 76.4%, respectively. The Google AI team instead studied one-tower models that compute both image and text representations. They found that this hurts the performance of dense models, possibly because of harmful interference or a lack of capacity. A compute-matched LIMoE, on the other hand, improves not only on the dense one-tower model but also on the dense two-tower models. The team trained a series of models with a training scheme similar to CLIP. LIMoE’s use of sparsity provides a significant performance improvement over dense models of comparable cost.


The LiT and BASIC methods combined scaling with specialized pre-training procedures and the reuse of image models that were already of exceptionally high quality. Despite having no pre-training or modality-specific components, LIMoE-H/14 achieved 84.1% zero-shot accuracy when trained from scratch. It is also interesting to compare the scale of these models: LiT and BASIC have 2.1B and 3B parameters, respectively. LIMoE-H/14 has 5.6B parameters in total, but thanks to sparsity it applies only 675M parameters per token, which makes it considerably more lightweight.
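As a quick check of what these figures imply, the share of LIMoE-H/14's parameters that is applied to any single token can be computed directly from the two numbers quoted above. The variable names are illustrative only.

```python
# Figures quoted in the article for LIMoE-H/14.
total_params = 5.6e9               # parameters in the full sparse model
active_params_per_token = 675e6    # parameters applied to a single token

active_fraction = active_params_per_token / total_params
print(f"{active_fraction:.1%} of parameters are applied per token")
```

In other words, roughly one eighth of the model is active for any given token.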

Understanding LIMoE Behavior

LIMoE was motivated by the idea that sparse conditional computation allows a generalist multimodal model to develop the specialization required to excel at understanding each modality while remaining generic.

Distributions for an eight-expert LIMoE; the percentages indicate the share of image tokens processed by each expert. There are one or two clearly specialized text experts (shown as the mostly blue experts), usually two to four image specialists (mostly red), and the others fall somewhere in the middle.


First, they observe the emergence of experts specializing in specific modalities. Because there are many more image tokens than text tokens in their training setup, all experts process at least some images. However, some experts process mostly images, some mostly text, and some a mix of both.


For each token, LIMoE selects an expert. The team also sees the emergence of semantic experts that specialize in specific concepts such as plants or wheels, even though the model was not explicitly trained to do so.

Conclusion

Multimodal models that handle many tasks are one promising route forward. Two key ingredients for success are scale and the ability to avoid interference between distinct tasks and modalities while exploiting synergies. Sparse conditional computation is an excellent way of achieving both. LIMoE’s strong performance with less compute points towards powerful and efficient general-purpose models that still have the capacity and flexibility for the specialization required to excel at individual tasks.
