Apple's AI Research Team Unveils Key Insights for Top-Notch AI that Understands Both Text and Images

Apple's artificial intelligence gurus have been working on a super-smart AI that's a whiz with words and pictures, called a Multimodal Large Language Model (MLLM for short). They've been tinkering with all the nuts and bolts that make it tick, trying to figure out the secret sauce that allows the AI to learn and perform better. After lots of testing, they've come up with some pretty important tips and tricks.

First off, they've been experimenting with something called an "image encoder" – that's the part of the AI that deals with visuals. They found that feeding the model higher-resolution images, and pre-training the encoder with "contrastive" methods on huge numbers of paired images and captions, makes the AI noticeably better at understanding what's going on in the pictures.
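To make the "contrastive" idea concrete, here's a minimal NumPy sketch of a CLIP-style contrastive loss over a batch of image/caption embedding pairs: matching pairs are pushed to score higher than mismatched ones, in both directions. This is a generic illustration of the technique, not Apple's training code – the batch size, embedding dimension, and temperature are all made up.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss over a batch of paired
    image/text embeddings (rows assumed L2-normalized)."""
    # Cosine-similarity logits between every image and every caption.
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def xent(l):
        # Row-wise softmax cross-entropy where the matching pair
        # (i, i) is the correct "class" for row i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return (xent(logits) + xent(logits.T)) / 2

# Toy batch: 4 random unit vectors per modality.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 32)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(4, 32)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(contrastive_loss(img, txt))  # positive scalar; lower means better aligned
```

Training the encoder to drive this loss down is what forces image features and caption features into a shared space.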

But they didn't just stop there. They also looked at the "VL connector", which is basically a bridge between the picture-dealing part and the word-dealing part of the AI. Turns out, what really matters is how many visual tokens the connector passes along to the language model, and the resolution of the input image. The actual design of the bridge, though? Not so important.
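As a rough illustration of why the token count is the knob that matters, here's a toy connector that simply average-pools the encoder's patch features down to a chosen number of visual tokens. The function name and all sizes are hypothetical, not the paper's implementation:

```python
import numpy as np

def pool_to_n_tokens(patch_features, n_tokens):
    """Reduce a (num_patches, dim) grid of image-encoder outputs to
    n_tokens visual tokens by grouping patches and averaging each group.
    The point: n_tokens controls how much visual information reaches
    the language model, regardless of the connector's exact design."""
    groups = np.array_split(patch_features, n_tokens, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])

# e.g. a 24x24 patch grid (576 patches) squeezed into 144 visual tokens.
feats = np.random.default_rng(1).normal(size=(576, 64))
tokens = pool_to_n_tokens(feats, 144)
print(tokens.shape)  # (144, 64)
```

Raising the image resolution gives the encoder more patches to start from; raising `n_tokens` lets more of that detail survive the bridge.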

Now, when it comes to teaching the AI all this, the data it learns from is super crucial. The team at Apple used a mixture of different types of data. They had some that was just images with captions (like "a dog playing with a frisbee"), some that was more like mixed documents with images and text all over the place, and even some that was just plain old text.

Here's the lowdown on what they found:

  • To help the AI get the hang of things without any additional examples (zero-shot learning), captioned images were the champions.
  • When they wanted the AI to learn fast with just a few examples (few-shot learning), those mixed documents were a big deal.
  • Text-only data made sure the AI didn't forget how to handle words without pictures.
  • Using specially created synthetic data gave the AI an extra boost in few-shot performance.

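The mixing itself can be pictured as weighted sampling over the three source types. The weights below are illustrative placeholders, not the exact ratios Apple used:

```python
import random

# Hypothetical mixture weights over the three data source types.
MIXTURE = {"captioned": 0.45, "interleaved": 0.45, "text_only": 0.10}

def sample_source(rng):
    """Pick which source the next training example is drawn from,
    in proportion to the mixture weights."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly proportional to the weights
```

Tuning those three weights is exactly the trade-off the bullet points describe: more captions help zero-shot, more interleaved documents help few-shot, and a slice of text-only data keeps the language side sharp.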
Then came the ultimate test - making the AI even smarter by throwing more data at it and building bigger versions (scaling up, as they call it). The team used everything they learned from their experiments to build “MM1”, a family of these AIs that come in different sizes, the largest being a mega 30 billion parameters! They even used a trick called Mixture-of-Experts (MoE) to crank up the AI's capacity without slowing it down too much.
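The MoE trick can be sketched in a few lines: a small gating network routes each token to one "expert" network, so the parameter count grows with the number of experts while each token still only pays the compute cost of a single expert. This top-1 routing toy (in NumPy, with made-up sizes) is a generic illustration of the idea, not MM1's implementation:

```python
import numpy as np

def moe_layer(x, expert_weights, gate_weights):
    """Minimal top-1 Mixture-of-Experts feed-forward sketch.
    x: (n_tokens, dim), gate_weights: (dim, n_experts),
    expert_weights: (n_experts, dim, dim)."""
    scores = x @ gate_weights          # (n_tokens, n_experts) gating scores
    chosen = scores.argmax(axis=1)     # top-1: one expert per token
    out = np.empty_like(x)
    for i, e in enumerate(chosen):
        out[i] = x[i] @ expert_weights[e]  # only the chosen expert runs
    return out, chosen

rng = np.random.default_rng(2)
x = rng.normal(size=(8, 16))               # 8 tokens, hidden dim 16
experts = rng.normal(size=(4, 16, 16))     # 4 experts' weight matrices
gate = rng.normal(size=(16, 4))
y, routing = moe_layer(x, experts, gate)
print(y.shape, routing.shape)  # (8, 16) (8,)
```

Four experts means roughly four times the parameters of one feed-forward layer, but each token still multiplies through just one of them.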

The MM1 AIs aced loads of tough few-shot tasks straight out of pre-training, beating most other published models of comparable size. But that's not all. When the team gave MM1 a bit of extra teaching focused on specific tasks (supervised fine-tuning), it showed off even more of its brainpower, holding its own across a wide range of different benchmarks.

What's really cool is that MM1 can juggle thinking about multiple images at once and can even learn quickly with just a little bit of guidance (few-shot prompting). The team made sure to share all these juicy details because they think these lessons are like gold for anyone trying to build their own smart AIs, even when new methods and data come along.

The world of AI is changing fast, and this research from Apple is a step forward in making AIs that understand our world – both what we see and what we say!

Read the research paper: MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training