Gemini Multimodality: Analyzing Images, Audio, Video, and Documents with the API

[2026-06-14] Author: Ing. Calogero Bono

Do you have a 50-page PDF to turn into JSON? A training video to transcribe and summarize? A gallery of product photos to auto-tag? Until now, you needed different tools, separate API keys, and plenty of patience. With the Gemini API and its native multimodality, a single model handles all of it.

We at Meteora Web work with clients handling real volumes: scanned invoices, image catalogs, meeting voice recordings. We stopped chaining services and switched directly to Gemini. When a model understands text, images, audio, and video natively, your stack simplifies and ROI climbs.

What multimodality means in Gemini

Traditional language models read only text. Gemini, from version 1.5 Pro onward, accepts images, audio, video, and documents (PDF, DOCX, XLSX, PPTX) and processes them alongside text in the same prompt. No separate preprocessing: you pass the file and the model interprets it.

For us, coming from ERP management and accounting, it's like moving from a paper double-entry ledger to an interconnected spreadsheet. Data no longer lives in silos: an invoice PDF becomes a JSON object ready for the database, a customer support video becomes a knowledge base.

Supported formats

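  • Images: JPEG, PNG, WEBP, HEIC, HEIF. Up to 20 MB when sent inline, a cap that applies to the whole request; larger files go through the File API (optimization recommended).
  • Audio: MP3, WAV, FLAC, OGG. You can request a transcript with optional timestamps.
  • Video: MP4, MOV, AVI, WebM. Gemini samples frames and extracts audio tracks automatically.
  • Documents: PDF, DOCX, XLSX, PPTX, TXT. Files up to 50 MB. They are read as sequences of pages, sheets, or slides. PDF is the safest choice: if an Office file is rejected by the API version you are using, convert it to PDF or CSV first.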

No manual preprocessing needed. No frame-by-frame video splitting or PDF text extraction. Send it and it processes it.
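
In a pipeline it helps to decide up front how each file will be sent. The sketch below maps the extensions listed above to a media kind and picks a transport. The function names, the 20 MB inline threshold, and the rule "small images inline, everything else uploaded" are our assumptions, modeled on the examples in this article rather than on anything the API requires.

```python
from pathlib import Path

# Extension -> media kind, taken from the format list above.
MEDIA_KINDS = {
    "image": {".jpg", ".jpeg", ".png", ".webp", ".heic", ".heif"},
    "audio": {".mp3", ".wav", ".flac", ".ogg"},
    "video": {".mp4", ".mov", ".avi", ".webm"},
    "document": {".pdf", ".docx", ".xlsx", ".pptx", ".txt"},
}

# Assumed inline threshold; adjust it to the limits of the API version you use.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def media_kind(path):
    """Return 'image', 'audio', 'video', 'document', or None if unsupported."""
    suffix = Path(path).suffix.lower()
    for kind, suffixes in MEDIA_KINDS.items():
        if suffix in suffixes:
            return kind
    return None


def plan_request(path, size_bytes, inline_limit=INLINE_LIMIT_BYTES):
    """Decide whether a file is sent inline, uploaded first, or skipped."""
    kind = media_kind(path)
    if kind is None:
        return {"kind": None, "transport": "skip"}
    # Assumption: only small images travel inline; the rest is uploaded first.
    if kind == "image" and size_bytes <= inline_limit:
        return {"kind": kind, "transport": "inline"}
    return {"kind": kind, "transport": "upload"}


print(plan_request("catalog/red_sweater.jpg", 850_000))
print(plan_request("training_software.mp4", 120_000_000))
```

Keeping this decision in one function means a change in limits touches a single constant instead of every script.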

Images: descriptions, classifications, and native OCR

The most common use case is image analysis. With Gemini you can:

  • Get a textual description of a photo (useful for e-commerce SEO or accessibility).
  • Extract text from screenshots, signs, scanned documents (OCR without extra libraries).
  • Detect objects, colors, brands, emotions, all in a single prompt.

Practical example: describing a product image for a fashion catalog

Suppose you have a product photo. With one API call you get description, dominant colors, and perceived materials. Here's how in Python:

import google.generativeai as genai
import PIL.Image

# Configure the SDK (use your API key)
genai.configure(api_key='YOUR_API_KEY')
model = genai.GenerativeModel('gemini-1.5-pro')

# Load the image
img = PIL.Image.open('catalog/red_sweater.jpg')

# Prompt for analysis
prompt = """Describe this image for an e-commerce catalog.
Include: type of garment, predominant color, apparent material,
style (casual, elegant, sporty). Output in JSON."""

# Ask for a JSON response so the text is not wrapped in Markdown fences
response = model.generate_content(
    [prompt, img],
    generation_config={"response_mime_type": "application/json"},
)
print(response.text)

Result:

{
  "garment": "turtleneck sweater",
  "predominant_color": "burgundy red",
  "apparent_material": "wool or wool blend",
  "style": "elegant casual"
}

You insert this output directly into your e-commerce database. No manual copy-paste. We tested it on a 300-item catalog: labeling time dropped from 3 days to 2 hours.
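
Before the reply reaches the database, it is worth parsing it defensively. The sketch below uses only the standard library; `parse_json_reply` is a hypothetical helper of ours, not part of the Gemini SDK. It strips a Markdown fence if the model added one and checks that the keys from the example result are present.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable

# Keys taken from the example result above.
EXPECTED_KEYS = {"garment", "predominant_color", "apparent_material", "style"}


def parse_json_reply(text, expected_keys=EXPECTED_KEYS):
    """Parse a model reply that should contain exactly one JSON object."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with its optional language tag)...
        lines = cleaned.splitlines()[1:]
        # ...and the closing fence line, if there is one.
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        cleaned = "\n".join(lines)
    data = json.loads(cleaned)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = set(expected_keys) - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


# Simulated reply wrapped in a fence, reusing the values from the result above.
reply = FENCE + "json\n" + json.dumps({
    "garment": "turtleneck sweater",
    "predominant_color": "burgundy red",
    "apparent_material": "wool or wool blend",
    "style": "elegant casual",
}) + "\n" + FENCE

product = parse_json_reply(reply)
print(product["garment"])
```

A reply that fails this check can be logged and retried instead of ending up as a broken catalog row.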

OCR on invoices and documents

Another use close to our hearts: extracting data from scanned invoices. With a single prompt:

# Reuses PIL.Image and model from the previous example
invoice = PIL.Image.open('invoice_1234.jpg')
prompt = "Extract from this invoice: document number, date, total amount, supplier VAT number. Output JSON."
response = model.generate_content([prompt, invoice])
print(response.text)

Using response_schema you can enforce the exact JSON structure, eliminating surprises.
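
A schema constrains the shape of the reply, not whether the values are plausible, so for accounting data we would still validate before saving. The sketch below is one way to do it with the standard library. The field names, the ISO date format, and the sample values are assumptions for illustration; the article only names the four values to extract.

```python
import json
from datetime import date
from decimal import Decimal, InvalidOperation

# Hypothetical field names for the four values requested in the prompt.
INVOICE_FIELDS = ("document_number", "date", "total_amount", "supplier_vat_number")


def validate_invoice(raw_json):
    """Parse an invoice reply and convert each field to a typed value."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [field for field in INVOICE_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    try:
        # Decimal avoids float rounding on monetary amounts.
        total = Decimal(str(data["total_amount"]))
    except InvalidOperation as exc:
        raise ValueError("total_amount is not a number") from exc
    return {
        "document_number": str(data["document_number"]),
        # Assumes the prompt asked for ISO dates (YYYY-MM-DD).
        "date": date.fromisoformat(str(data["date"])),
        "total_amount": total,
        "supplier_vat_number": str(data["supplier_vat_number"]).replace(" ", ""),
    }


# Made-up sample reply, for illustration only.
sample = json.dumps({
    "document_number": "1234",
    "date": "2025-03-31",
    "total_amount": "1234.50",
    "supplier_vat_number": "IT 01234567890",
})

invoice_row = validate_invoice(sample)
print(invoice_row)
```

Rows that raise `ValueError` can go to a manual review queue rather than straight into the ledger.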

Audio: transcription, summary, and sentiment analysis

Gemini is not just speech-to-text. It can listen to an entire recording and answer questions about its content, identify different speakers, and extract action items from memos. Audio is processed natively as a token sequence, with no external codecs; the documentation we worked from describes it as being downsampled to a 16 Kbps data resolution.

Transcribe and summarize a meeting

audio_file = genai.upload_file(path='team_meeting.mp3')
model = genai.GenerativeModel('gemini-1.5-pro')
response = model.generate_content([
    "Transcribe this audio and then summarize the main points in bullet list. "
    "Highlight assigned actions.",
    audio_file
])
print(response.text)

Result:

Transcription
[00:00] Marco: We need to update the contact form...
...
Main points
- Update contact form by Friday
- Prepare SEO report by Tuesday
- Assign document review to Laura

Assigned actions
- Marco: form update
- Laura: SEO report

You can also ask to extract the sentiment of participants, or skip irrelevant parts. We use it for client calls: once the call is recorded, in 30 seconds we have minutes and action items.
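
To feed minutes into another system, the transcript lines can be turned into structured records. This is a minimal sketch that assumes the model keeps the "[MM:SS] Speaker: text" layout shown in the result above; that layout is not guaranteed, so lines that do not match are skipped. `parse_transcript` is a hypothetical helper, and the second sample line is invented for the demonstration.

```python
import re

# Matches lines such as "[00:00] Marco: We need to update the contact form..."
LINE_RE = re.compile(r"^\[(\d{1,2}):(\d{2})\]\s*([^:]+):\s*(.*)$")


def parse_transcript(text):
    """Return a list of {start_seconds, speaker, text} dicts."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # headings, bullets, and blank lines are ignored
        minutes, seconds, speaker, utterance = match.groups()
        entries.append({
            "start_seconds": int(minutes) * 60 + int(seconds),
            "speaker": speaker.strip(),
            "text": utterance,
        })
    return entries


sample = """Transcription
[00:00] Marco: We need to update the contact form...
[01:15] Laura: I can take the SEO report.
Main points
- Update contact form by Friday
"""

for entry in parse_transcript(sample):
    print(entry)
```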

Video: extracting frames and tracks in one go

Video is the most complex format, but Gemini treats it as a sequence of frames (about 1 per second) plus the audio track. You can ask for descriptions, object counts, narrative summaries, or specific questions about what happens at a given timestamp.

Analyzing a training video

import time

video_file = genai.upload_file(path='training_software.mp4')

# Wait until the uploaded video has finished processing
while video_file.state.name == 'PROCESSING':
    time.sleep(5)
    video_file = genai.get_file(video_file.name)
if video_file.state.name == 'FAILED':
    raise RuntimeError('Video processing failed')

model = genai.GenerativeModel('gemini-1.5-pro')
response = model.generate_content([
    "Watch this training video and answer: "
    "1. What steps are shown? "
    "2. Are there common mistakes highlighted? "
    "3. Provide a minute index for each section.",
    video_file
])
print(response.text)

Result:

1. Steps shown:
- (0:00-1:30) Account setup
- (1:30-3:00) Data import...
2. Common mistakes:
- At 2:15 shows password in plaintext → not recommended
3. Index:
- 0:00 Introduction
- 1:30 Setup
- 3:00 Import

If you only want the spoken text, add "Transcribe only the audio component." If you want only frame analysis, use the video file but prompt to ignore audio.

Documents: PDF, Excel, Word, PowerPoint

Documents are read as text. Gemini extracts content from each page (PDF), sheet (Excel) or slide (PowerPoint). You can ask questions, summarize, compare versions, or extract tables.

Extract data from a warehouse Excel

excel_file = genai.upload_file(path='inventory_2025.xlsx')
model = genai.GenerativeModel('gemini-1.5-pro')
response = model.generate_content([
    "From this Excel file, list products with quantity less than 10 "
    "and calculate the total value of at-risk stock. "
    "Use Price and Quantity columns.",
    excel_file
])
print(response.text)

Result:

Products with quantity < 10:
- Red sweater: 7 pcs x €45 = €315
- Black shoes size 42: 3 pcs x €89 = €267
...
Total at-risk stock value: €1,234

Note: Gemini does not execute formulas. If your workbook contains calculations, the values must already be resolved, or you must supply the derived values yourself. We recommend exporting to CSV or pre-calculating the derived columns. It's a known limitation, but minimal preprocessing works around it.
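
Pre-calculating a derived column can be as small as the sketch below, which works on a CSV export and uses only the standard library. The Price and Quantity column names come from the prompt above; the Product column, the helper name, and the third sample row are our assumptions.

```python
import csv
import io
from decimal import Decimal


def at_risk_stock(csv_text, threshold=10):
    """Return (rows below the threshold with a Value column, total value)."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        quantity = int(row["Quantity"])
        price = Decimal(row["Price"])
        if quantity < threshold:
            rows.append({
                "Product": row["Product"],
                "Quantity": quantity,
                "Price": price,
                "Value": quantity * price,  # the derived column
            })
    total = sum((r["Value"] for r in rows), Decimal("0"))
    return rows, total


# Two rows reuse the figures from the result above; the third is illustrative.
SAMPLE_CSV = """Product,Quantity,Price
Red sweater,7,45
Black shoes size 42,3,89
Blue scarf,25,19
"""

rows, total = at_risk_stock(SAMPLE_CSV)
for r in rows:
    print(f"{r['Product']}: {r['Quantity']} pcs x {r['Price']} = {r['Value']}")
print("Total:", total)
```

With the arithmetic done locally, the model only has to explain or categorize the numbers, which removes one source of silent errors.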

Limits and practical tips

  • File size: The File API accepts files up to 2 GB each, but in practice processing times increase with very large files (over 50 MB). We recommend downscaling images to 1200px and encoding audio at 128kbps.
  • Long videos: Gemini 1.5 Pro supports up to 1 hour of video. Beyond that, segment.
  • Complex documents: PDFs with multi-column layouts or nested tables may lose structure. Always test on a sample.
  • Cost: Multimodal input costs more because every image, video frame, and second of audio counts as tokens. Trim or downscale media where you can, and keep prompts concise.
  • Privacy: Uploaded files are processed by Google. If you have sensitive data, check your contract's policy or use Vertex AI.
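
A rough budget before you upload helps with both the cost and the long-video limits above. The sketch below estimates media tokens and splits a long recording into segments. The per-unit rates are assumptions based on figures we have seen in Google's token documentation and may differ for your model version, so treat them as parameters to verify; both function names are hypothetical.

```python
import math

# Assumed token rates; check the current token documentation for your model.
TOKENS_PER_IMAGE = 258
TOKENS_PER_VIDEO_SECOND = 263
TOKENS_PER_AUDIO_SECOND = 32


def estimate_media_tokens(images=0, video_seconds=0, audio_seconds=0):
    """Rough token estimate for the media part of a request (prompt excluded)."""
    return (
        images * TOKENS_PER_IMAGE
        + video_seconds * TOKENS_PER_VIDEO_SECOND
        + audio_seconds * TOKENS_PER_AUDIO_SECOND
    )


def split_into_segments(duration_seconds, max_segment_seconds=3600):
    """Return (start, end) pairs in seconds that cover the whole duration."""
    if duration_seconds <= 0:
        return []
    count = math.ceil(duration_seconds / max_segment_seconds)
    segments = []
    for index in range(count):
        start = index * max_segment_seconds
        end = min(start + max_segment_seconds, duration_seconds)
        segments.append((start, end))
    return segments


print(estimate_media_tokens(video_seconds=3600))  # one hour of video
print(split_into_segments(5400))                  # a 90-minute recording
```

The segment boundaries can then be passed to whatever tool you use to cut the file before uploading each part.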

Real use cases we've implemented

With a client managing an invoice archive, we automated accounting categorization: PDF invoices → supplier, amount, VAT extraction → populating a double-entry sheet. Time saved: 4 hours per week.

A communication agency asked us to analyze client interview videos: automatic transcription, extraction of key quotes, draft case study. With Gemini, the copywriter receives 70% of the material ready.

A clothing store (a client of ours since their ERP project) uses image analysis to auto-tag new arrivals: color, type, style. Insertion into WooCommerce takes 2 clicks.

In summary: what to do now

  1. Get a Gemini API key from Google AI Studio.
  2. Install the library: pip install google-generativeai.
  3. Try the image example with a product photo. Verify the JSON output.
  4. Move to a real document (invoice or Excel) and ask for field extraction.
  5. Automate: write a script that scans a folder of files and produces a report.
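
Step 5 can start from a skeleton like the one below. It is a sketch, not our production script: `scan_folder` and the suffix list are our own choices, and the real Gemini call is replaced by a stand-in function so the example runs offline. In practice you would pass a function that wraps `model.generate_content`.

```python
import json
import tempfile
from pathlib import Path

# Assumed subset of extensions to process; extend it to match your files.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".pdf", ".mp3", ".mp4"}


def scan_folder(folder, analyze):
    """Run analyze(path) on each supported file and collect a report."""
    report = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        try:
            result = analyze(path)
            report.append({"file": path.name, "status": "ok", "result": result})
        except Exception as exc:  # one bad file should not stop the batch
            report.append({"file": path.name, "status": "error", "error": str(exc)})
    return report


def fake_analyze(path):
    """Stand-in for the real API call so the sketch runs offline."""
    return f"would analyze a {path.suffix.lower()} file"


# Demo on a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("invoice_1234.jpg", "team_meeting.mp3", "notes.bak"):
        (Path(tmp) / name).write_bytes(b"")
    report = scan_folder(tmp, fake_analyze)

print(json.dumps(report, indent=2))
```

Passing the analysis function as a parameter keeps the folder logic testable without an API key and lets you swap models later.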

Gemini's multimodality is not a future feature: it works today, costs less than multi-model solutions, and simplifies your stack. We've integrated it into our platforms and clients see the difference. For the full API deep dive, read the Gemini API pillar guide. For security and tracking, check our articles on INP and Core Web Vitals.

Official documentation: Gemini API Vision (images and video) and Gemini API Audio.

Ing. Calogero Bono

Computer engineer, co-founder of Meteora Web. Expert in software architectures, IT security, and scalable systems development.
