In this post, I will briefly define Large Multimodal Models (LMMs), explain their data modalities, and demonstrate Google Gemini’s multimodal capabilities using text and images. This is purely a no-code blog post: I will demonstrate everything directly from the Google Cloud Console, using Vertex AI and Vertex AI Studio, to showcase a simple example of using multimodal inputs (a text file and an image).

Introduction

A Large Multimodal Model (LMM) is an Artificial Intelligence (AI) model capable of processing and understanding multiple types of input data modalities simultaneously and providing outputs based on this information. The input data can include a mix of audio, video, text, images, and even code.

Data modalities of LMM

Audio

The model can understand sound, music, spoken language, and other forms of auditory input.

Images

The model can understand objects, scenes, text, and even emotions from an image.

Video

The model can understand visual and auditory elements from a video, which is a combination of audio and visual data.

Text

Text includes any form of content such as books, PDF files, text files, online news, articles, research papers, social media posts, and so on. The model understands the textual context, processes it, and generates outputs.

Code

The model also takes programming language code as input.

Multimodal examples

Multimodal means one or more of the following:

Input and output are of different modalities

Can take one type of input modality and generate a different type of output modality (for example, text-to-image, image-to-text, or audio-to-text). Here is the example diagram:

Input and output are of different modalities

Inputs are multimodal

Can take more than one type of input and generate one type of output. Here is the example diagram:

Inputs are multimodal

In the diagram above, we can see that the LMM takes three input modalities (audio, text, and image) and generates one output modality (video).

Outputs are multimodal

Can take one type of input and generate more than one type of output. Here is the example diagram:

Outputs are multimodal

In the diagram above, we can see that the LMM takes one input modality (video) and generates three output modalities (audio, text, and video).

Inputs and Outputs are multimodal

Can take more than one type of input and generate more than one type of output. Here is the example diagram:

Inputs and Outputs are multimodal

In the diagram above, we can see that the LMM takes more than one input modality (audio, video, image, text, and code) and generates more than one output modality (audio, video, image, text, and code).

Large Multimodal Model (LMM) Prompting With Google Gemini

I will show you the steps to use Google Gemini’s multimodal capabilities.

1. Login to Google Cloud

Since I am demonstrating from the Google Cloud Console, you will need to log in to your Google Cloud account, or sign up at cloud.google.com if you don’t have one. For this demonstration, I have set up a brand new account.

2. Setup Billing

If you haven’t set up billing, you’ll need to add your payment details so Google can charge you accordingly. If you’re a new user on the Google Cloud Platform, you’ll get a 90-day free trial with $300 USD in credit, which will be more than enough to try out Gemini for our demonstration.

3. Create Project

Go to the Google Cloud Console at https://console.cloud.google.com and create a new Google project: click red box 1, which opens a popup, and then click red box 2.

Create Google Project

Give the Google project a name in red box 3. You can also see the project ID just below the project name text box; this ID cannot be changed later. You can leave the location set to the default.

Create Google Project

After you hit the save button, it will take a few seconds to create the project.
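As an optional aside for readers who prefer scripting (the demonstration itself stays no-code), here is a minimal sketch of how a project could be created with the google-cloud-resource-manager Python client instead of the console. The project ID and display name below are placeholders, not values from this demo.

```python
# Optional, hedged sketch: create a Google Cloud project programmatically.
# Assumes `pip install google-cloud-resource-manager` and authenticated
# credentials (for example via `gcloud auth application-default login`).
from google.cloud import resourcemanager_v3

client = resourcemanager_v3.ProjectsClient()

# "lmm-gemini-demo" is a placeholder project ID; it must be globally unique
# and, as in the console, it cannot be changed after creation.
project = resourcemanager_v3.Project(
    project_id="lmm-gemini-demo",
    display_name="LMM Gemini Demo",
)

operation = client.create_project(project=project)
result = operation.result()  # blocks until the project is created
print(result.name)
```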

4. Enable APIs & Services

To enable APIs & Services, first go to red box 4 and click on red box 5.

Enable APIs & Services

You are now inside the project dashboard. Search for “Enable APIs & Services” in the search text box (red box 6) and click on Enable APIs & Services (red box 7).

Enable APIs & Services

Click on + ENABLE APIS AND SERVICES (red box 8).

Enable APIs & Services

Search for Vertex AI API and click on Vertex AI API (red box 9).

Vertex AI API

Click on the Vertex AI API text (red box 10).

Vertex AI API

Click on the Enable button (red box 11). You may be asked for billing information if you haven’t submitted it already.

Vertex AI API

It will take a couple of seconds to enable the Vertex AI API. Once the API is enabled, your browser window will display the screen shown below.

Vertex AI API
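For completeness, the same API can also be enabled programmatically. The sketch below uses the Service Usage Python client; it is optional, and the project ID in the resource name is a placeholder you would replace with your own.

```python
# Optional, hedged sketch: enable the Vertex AI API without the console.
# Assumes `pip install google-cloud-service-usage` and authenticated credentials.
from google.cloud import service_usage_v1

client = service_usage_v1.ServiceUsageClient()

# Replace "lmm-gemini-demo" with your own project ID or project number.
request = service_usage_v1.EnableServiceRequest(
    name="projects/lmm-gemini-demo/services/aiplatform.googleapis.com"
)

operation = client.enable_service(request=request)
response = operation.result()  # blocks until the API is enabled
print(response.service.state)  # expected: State.ENABLED
```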

5. Prompting With Multimodal

Now let’s go to Vertex AI Studio.

Search for Vertex AI and click on Vertex AI (red box 12).

Vertex AI Studio

The next step is to click on the Freeform menu (red box 13).

Vertex AI Studio

From the screen above, you can choose the LMM model (red box 14), upload files (red box 15), and input prompts.

Example

I have the text prompt below and multimodal inputs: a text file (knowledge-base.txt) and an image (mount-everest.png).

The text prompt is:

Read the information from both the text file and the image. Tell us the old and new heights

The image (mount-everest.png) has text on it.

Mount Everest

The text file (knowledge-base.txt) has the content below:

The height of Mount Everest was 8,848 meters

Let me try with only the text prompt. What will the output be?

Vertex AI Studio

Look at the response. The model did not find the information that we asked for.

Output is:

Please provide me with the text file and the image so I can read the information and tell you the old and new heights.

Let’s try with the files (the knowledge-base.txt text file and the mount-everest.png image file) and supply the same prompt used before.

Prompting with Freeform

Now, look at the output. The model reads both the knowledge-base.txt text file and the mount-everest.png image file, and responds with the correct result.

Output is:

The old height of Mount Everest was 8,848 meters. The new height of Mount Everest is 8,849 meters.

The old height of Mount Everest was read from the knowledge-base.txt text file and the new height from the mount-everest.png image file, which is the expected output.
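Although this post is deliberately no-code, here is a minimal, hedged sketch of how the same multimodal request could be sent with the Vertex AI Python SDK, assuming the SDK is installed and the two files sit next to the script. The project ID, region, and model name are assumptions you would adjust to your own setup.

```python
# Minimal sketch (optional): the same multimodal prompt via the Vertex AI Python SDK.
# Assumes authenticated credentials and `pip install google-cloud-aiplatform`.
import vertexai
from vertexai.generative_models import GenerativeModel, Image, Part

# Placeholders: use your own project ID and a supported region.
vertexai.init(project="lmm-gemini-demo", location="us-central1")

# The model name is an assumption; pick whichever Gemini model Vertex AI Studio offers you.
model = GenerativeModel("gemini-1.5-flash")

# The same two inputs used in the console demo.
with open("knowledge-base.txt", "rb") as f:
    knowledge_base = Part.from_data(data=f.read(), mime_type="text/plain")
everest_image = Part.from_image(Image.load_from_file("mount-everest.png"))

prompt = (
    "Read the information from both the text file and the image. "
    "Tell us the old and new heights"
)

response = model.generate_content([prompt, knowledge_base, everest_image])
print(response.text)
```

If everything is wired up correctly, the printed answer should match the console output shown above.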

Congratulations! You have successfully learned how to use Large Multimodal Model (LMM) prompting with Google Gemini and Vertex AI.

One more thing: after completing the lab, don’t forget to delete your project; otherwise, Google may charge you.

In this post, we explored the capabilities of Large Multimodal Models (LMMs) and demonstrated how Google Gemini processes different data modalities to generate meaningful outputs. By using text and image inputs, we saw how the model effectively interprets and combines multimodal data to provide accurate results.

Recorded YouTube video for the demonstration: