Phi-4-Multimodal: The compact beast in the making
Microsoft open-sources Phi 4 Mini & Multimodal Small Language Models
Overview
Phi is a series of open-source AI models developed by Microsoft. Phi is currently one of the most powerful and cost-effective small language model (SLM) families, with strong benchmark results across multilingual, reasoning, text/chat generation, coding, image, and audio scenarios. You can deploy Phi to the cloud or to edge devices, and you can easily build generative AI applications with limited computing power. This week Microsoft officially released Phi-4-mini-instruct (3.8B) and Phi-4-multimodal (5.6B) under the MIT license.
- On several benchmarks, it beats Gemini 2.0 Flash, GPT-4o, Whisper, and SeamlessM4T v2
- The models are on the Hugging Face Hub and integrated with Transformers (see the sketch below)!
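To try that Transformers integration right away, below is a minimal chat sketch with Phi-4-mini-instruct. It assumes a recent transformers release and a CUDA-capable GPU; the model ID is the published Hub ID, while the prompt and generation settings are purely illustrative.

```python
# Minimal sketch: chat with Phi-4-mini-instruct via Hugging Face Transformers.
# Assumes a recent transformers version; prompt and decoding settings are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "If 3x + 7 = 22, what is x? Explain briefly."},
]

# The chat template turns the message list into the model's expected prompt format.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```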
Phi-4-Mini:
- Parameters: 3.8 billion
- Architecture: 32 Transformer layers, 3,072 hidden state size, Group Query Attention (GQA) with 24 query heads and 8 key/value heads (see the GQA sketch after this list)
- Vocabulary: 200K tokens for multilingual support.
- Training Data: High-quality web and synthetic data, emphasizing math and coding
- Performance: Outperforms similar-sized models and matches larger models (e.g., DeepSeek-R1-Distill-Qwen-7B) on math and coding tasks
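To make the attention numbers above concrete, here is a small, self-contained PyTorch sketch of Group Query Attention with 24 query heads sharing 8 key/value heads. The head counts and hidden size match the spec list; the batch and sequence sizes are made-up values for illustration, and this is not the model's actual implementation.

```python
# Illustrative Group Query Attention (GQA): 24 query heads share 8 key/value heads,
# so each group of 3 query heads attends over the same K/V projections.
import torch

batch, seq_len, hidden = 1, 16, 3072          # hidden size matches the spec; batch/seq are arbitrary
n_q_heads, n_kv_heads = 24, 8
head_dim = hidden // n_q_heads                # 128

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V so every group of 24 / 8 = 3 query heads reuses the same K/V head.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)         # (1, 24, 16, 128)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
attn = torch.softmax(scores, dim=-1) @ v      # (1, 24, 16, 128)
print(attn.shape)
```

Sharing key/value heads this way shrinks the KV cache roughly 3x compared with full multi-head attention, which helps the model fit comfortably on memory-constrained edge hardware.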
Phi-4-Multimodal:
- Parameters: 5.6 billion
- Architecture: Uses “Mixture of LoRAs” to add modality-specific adapters without fine-tuning the base model
- Modalities: Integrates text, vision, and speech/audio
- Vision Modality: SigLIP-400M image encoder, 2-layer MLP projector, dynamic multi-crop strategy (see the image-input sketch after this list)
- Speech/Audio Modality: 3-layer convolution, 24 conformer blocks, 80ms token rate
- Performance: Ranks first on OpenASR leaderboard, supports vision+language, vision+speech, and speech/audio tasks, outperforming larger models
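To exercise the vision path locally, here is a minimal sketch (not the official sample) that sends one image plus a text prompt through Phi-4-multimodal-instruct. The <|image_1|> placeholder and the AutoProcessor usage follow the model card at the time of writing, and the image URL is a stand-in; verify the exact prompt format against the card before relying on it.

```python
# Minimal sketch: image + text inference with Phi-4-multimodal-instruct.
# Prompt placeholders follow the model card at the time of writing; the image URL is a stand-in.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# The frozen base model routes the image through the SigLIP encoder and the vision LoRA adapter.
prompt = "<|user|><|image_1|>Describe this chart and extract the key numbers.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)  # placeholder URL

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
new_tokens = output[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```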
Training Pipeline:
- Language Training: Pre-training on 5 trillion tokens, post-training with function calling, summarization, and instruction-following data
- Multimodal Training: Vision training (4 stages), speech/audio training (2 stages), and joint vision-speech training
- Reasoning Training: Pre-trained on 60B CoT tokens, fine-tuned on 200K high-quality CoT samples, and DPO-trained on 300K preference samples
A Few Key Use Cases
The Phi-4-multimodal model is intended for broad multilingual and multimodal commercial and research use. It is built for general-purpose AI systems and applications that require:
- Memory/compute constrained environments
- Latency bound scenarios
- Strong reasoning (especially math and logic)
- Function and tool calling (see the sketch after this list)
- General image understanding
- Optical character recognition
- Chart and table understanding
- Multiple image comparison
- Multi-image or video clip summarization
- Speech recognition
- Speech translation
- Speech QA
- Speech summarization
- Audio understanding
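Since function and tool calling is on the list above, here is a hedged sketch of what a tool-enabled prompt for Phi-4-mini-instruct can look like. The <|tool|> ... <|/tool|> wrapping in the system turn follows the function-calling format described on the model card at the time of writing, and the get_weather tool is hypothetical; double-check the exact special tokens against the card.

```python
# Hedged sketch of Phi-4-mini function calling. The <|tool|>...<|/tool|> system-prompt format is
# taken from the model card at the time of writing; the get_weather tool is hypothetical.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

tools = [{
    "name": "get_weather",  # hypothetical tool for illustration
    "description": "Fetch the current weather for a city.",
    "parameters": {"city": {"description": "Name of the city", "type": "str"}},
}]

prompt = (
    "<|system|>You are a helpful assistant with some tools."
    f"<|tool|>{json.dumps(tools)}<|/tool|><|end|>"
    "<|user|>What is the weather like in Paris today?<|end|>"
    "<|assistant|>"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# The model is expected to emit a structured tool call, e.g.
# [{"name": "get_weather", "arguments": {"city": "Paris"}}], which your own code then executes.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```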
Evaluation Conclusions — Takeaways
The tabular comparison below shows the model's results across the different modalities. As shown, on reasoning it outperforms DeepSeek-R1-Distill-Llama-8B and matches its Qwen-7B counterpart across several datasets. On language abilities, it outperforms well-established open-source models on math, reasoning, and coding tasks. Similarly, on the speech-related, vision-related, and combined vision-speech evaluations, it performs well against its counterparts.
For more details, please check their technical report here and here.
Quick Testing of Multimodal SLM: Phi-4-multimodal
Phi-4-multimodal is a fully multimodal model that supports text, visual, and voice inputs. Given visual context, the model can even generate code directly from images, streamlining the development process. Let's test it out. Navigate to this URL: https://huggingface.co/microsoft/Phi-4-multimodal-instruct
You will find the following as highlighted below.
Next, click on the NVIDIA Playground option, which will take you to the page seen below.
Note: You don't need to create an account to test the demo examples, but you can always register and proceed.
Next, let’s click the option as highlighted below — “Experiment with some audio and images we have for you.”
It will load the example pane as shown below. Let's try the audio experimentation as highlighted. The task is to transcribe the spoken audio content.
Once you provide the input to the model, you receive the transcribed text for the audio input, as seen below.
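For readers who prefer code over the playground, the same transcription task can be reproduced locally with a sketch like the one below. It is not the official sample: the <|audio_1|> placeholder and the audios=[(array, sampling_rate)] argument follow the model card at the time of writing, and sample.wav is a hypothetical local file.

```python
# Rough local equivalent of the playground transcription demo (a sketch, not the official sample).
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

audio, sampling_rate = sf.read("sample.wav")  # hypothetical local file
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"

inputs = processor(
    text=prompt, audios=[(audio, sampling_rate)], return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
new_tokens = output[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```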
Next, let's evaluate image inputs and chat with the model, as shown below. We provided two images and asked it to identify what both have in common. The response, as seen, is "Animal".
Conclusion
Phi-4-Multimodal is a cutting-edge model excelling at high-quality reasoning over image and audio inputs. It is a fully multimodal model capable of vision, audio, text, multilingual understanding, strong reasoning, coding, and more. Phi-4-mini brings significant enhancements in multilingual support, reasoning, and mathematics, and the long-awaited function calling feature is now finally supported.
If you liked this blog post, please encourage me to publish more content with your love & support with a clap 👏
Democratizing Learning… Cheers :)
Let’s get connected
You can reach me at ajay.arunachalam08@gmail.com; connect via LinkedIn
Check out my other works from my GitHub repo
About Me
I am an AWS Certified Cloud Solution Architect & AWS Certified Machine Learning Specialist. In the past, I have worked in the Telecom, Retail, Banking and Finance, Healthcare, Media, Marketing, Education, Agriculture, and Manufacturing sectors. I have 7+ years of experience delivering Data Science & Analytics solutions, of which 6+ years have been client facing. I have led and managed a large team of data engineers, ML engineers, data scientists, data analysts, and business analysts. I also have technical and management experience in business intelligence, data warehousing, reporting, and analytics, and hold the Microsoft Certified Power BI Associate certification. I have worked on several key strategic and data-monetization initiatives in the past. As a certified Scrum Master, I practice agile principles with a focus on collaboration, the customer, continuous improvement, and sustainable development.