Apple researchers are developing MM1, a family of multimodal AI models with up to 30 billion parameters

The MM1 AI models are still in the pre-training stage, according to the Apple researchers

In a preprint paper published online on March 14, Apple researchers presented their work on a multimodal large language model (LLM) for artificial intelligence (AI). The study describes how the team achieved multimodal capabilities by training the foundation model on both text-only and image data. The Cupertino-based tech giant’s latest AI developments coincide with CEO Tim Cook’s statement during the company’s earnings call that AI features might be released later this year.

The pre-print version of the research paper is available on arXiv, an open-access repository for academic work; papers posted there, however, are not peer reviewed. Although Apple is not named in the paper itself, most of the cited researchers are affiliated with the company’s machine learning (ML) division, which has fueled speculation that the project is tied to the iPhone maker.

The researchers say they are working on MM1, a family of multimodal models with up to 30 billion parameters. The authors describe it as a “performant multimodal LLM (MLLM)” and note that building an AI model able to understand both text and image inputs required experimenting with different architecture components, data selections, and image encoders.

For instance, the paper states: “We show that, in comparison to other published pre-training results, achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks requires a careful mix of image-caption, interleaved image-text, and text-only data for large-scale multimodal pre-training.”
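
To make the idea of a data mix concrete, here is a minimal sketch of how a pre-training batch might be drawn from several corpora at once. The corpus names, example documents, and mixing weights below are purely illustrative assumptions and are not the ratios reported in the paper.

```python
import random

# Hypothetical pre-training corpora; contents are placeholders, not real data.
corpora = {
    "image_caption": ["<img> a dog on a beach", "<img> a red bicycle"],
    "interleaved_image_text": ["Intro paragraph <img> followed by more text ..."],
    "text_only": ["A plain text document with no images."],
}

# Assumed mixing weights: the paper's point is that the blend matters,
# not these specific numbers.
weights = {"image_caption": 0.45, "interleaved_image_text": 0.45, "text_only": 0.10}

def sample_batch(batch_size: int, seed: int = 0) -> list[str]:
    """Draw a batch by first sampling a corpus by weight, then a document from it."""
    rng = random.Random(seed)
    names = list(corpora)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(batch_size):
        corpus = rng.choices(names, weights=probs, k=1)[0]
        batch.append(rng.choice(corpora[corpus]))
    return batch

print(sample_batch(4))
```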

Put simply, the AI model is still in the pre-training phase, meaning it is not yet trained enough to produce the desired outputs. At this stage, the algorithm and AI architecture define the model’s workflow and how data will ultimately be processed. Using image encoders and a vision-language connector, the Apple researchers incorporated computer vision into the model. When testing with combinations of image-only, image-and-text, and text-only datasets, the team found the results to be competitive with previous models at the same stage.
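
For readers unfamiliar with the terminology, the sketch below illustrates the general “image encoder → vision-language connector → language model” layout described above, as a toy PyTorch model. All module choices and sizes are assumptions for illustration and are not taken from the MM1 paper.

```python
import torch
import torch.nn as nn

class TinyMultimodalLM(nn.Module):
    """Illustrative stand-in for a multimodal LLM; not Apple's architecture."""

    def __init__(self, vocab_size=1000, d_model=64, d_image=32):
        super().__init__()
        # Stand-in image encoder: in practice a pretrained vision model
        # producing per-patch features.
        self.image_encoder = nn.Linear(d_image, d_image)
        # Vision-language connector: projects image features into the
        # language model's token-embedding space.
        self.connector = nn.Linear(d_image, d_model)
        # Text embedding plus a tiny transformer stack as the "LLM".
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_patches, text_tokens):
        # Encode image patches, then map them into the LLM's embedding space.
        visual = self.connector(self.image_encoder(image_patches))
        textual = self.token_embed(text_tokens)
        # Combine modalities by concatenating image tokens before text tokens.
        sequence = torch.cat([visual, textual], dim=1)
        return self.lm_head(self.llm(sequence))

# Usage: a batch of 2 "images" (16 patches of 32-dim features) and 8 text tokens each.
model = TinyMultimodalLM()
patches = torch.randn(2, 16, 32)
tokens = torch.randint(0, 1000, (2, 8))
logits = model(patches, tokens)
print(logits.shape)  # torch.Size([2, 24, 1000])
```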

Although this is a notable breakthrough, the research paper does not offer enough evidence to conclude that Apple’s operating system will include a multimodal AI chatbot. It is also hard to tell at this point whether the model is multimodal only in the inputs it accepts or in its outputs as well (that is, whether it can generate AI images). However, if peer review confirms that the results hold up, the tech giant will have made significant progress toward developing a native generative AI foundation model.