This example demonstrates how to use the video recording tools to capture and analyze video content playing on a mobile device screen.
This example is available on GitHub: video_transcription_example.py

What This Example Does

  1. Opens a video app: navigates to YouTube or another video platform.
  2. Starts screen recording: begins capturing the device screen in the background.
  3. Plays video content: lets the video play for a specified duration.
  4. Stops and analyzes: stops recording and uses Gemini to analyze the video content.

Prerequisites

Video recording tools require additional setup:
  1. ffmpeg must be installed on your system (for video compression)
  2. A video-capable Gemini model must be configured in utils.video_analyzer

Install ffmpeg

brew install ffmpeg
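
After installing, confirm ffmpeg is on your PATH:

ffmpeg -version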

Supported Video Analyzer Models

The video_analyzer utility requires a video-capable Gemini model:
Model                    Provider   Notes
gemini-3-flash-preview   google     Recommended: fast and capable
gemini-3-pro-preview     google     Higher quality, slower
gemini-2.5-flash         google     Good balance
gemini-2.5-pro           google     Premium quality
gemini-2.0-flash         google     Fast, reliable

Video Recording Tools

When video recording is enabled, the agent has access to two tools:

start_video_recording

Starts a background screen recording on the mobile device.
  • Recording continues until stop_video_recording is called
  • No duration limit - recording runs as long as needed
  • Audio: Not captured (video only)
On Android, the native screenrecord command has a 3-minute limit, but mobile-use automatically handles this by segmenting and concatenating recordings seamlessly. You don’t need to worry about this limit.

stop_video_recording

Stops the current screen recording and analyzes the video content.
prompt (str, default: "Describe what happened in the video.")
Specifies what to extract from the video. Examples:
  • "Describe what actions are shown on screen"
  • "What happens after each 10 seconds of the video?"
  • "List all UI elements and buttons that appear"

Complete Code

video_transcription_example.py
import asyncio

from minitap.mobile_use.config import LLM, LLMConfig, LLMConfigUtils, LLMWithFallback
from minitap.mobile_use.sdk.agent import Agent
from minitap.mobile_use.sdk.builders.agent_config_builder import AgentConfigBuilder
from minitap.mobile_use.sdk.types.agent import AgentConfig
from minitap.mobile_use.sdk.types.task import AgentProfile, TaskRequest


def get_video_capable_llm_config() -> LLMConfig:
    """
    Returns an LLM config with video_analyzer configured.

    The video_analyzer must use a video-capable Gemini model:
    - gemini-3-flash-preview (recommended - fast and capable)
    - gemini-3-pro-preview
    - gemini-2.5-flash
    - gemini-2.5-pro
    - gemini-2.0-flash
    """
    return LLMConfig(
        planner=LLMWithFallback(
            provider="openai",
            model="gpt-5-nano",
            fallback=LLM(provider="openai", model="gpt-5-mini"),
        ),
        orchestrator=LLMWithFallback(
            provider="openai",
            model="gpt-5-nano",
            fallback=LLM(provider="openai", model="gpt-5-mini"),
        ),
        contextor=LLMWithFallback(
            provider="openai",
            model="gpt-5-nano",
            fallback=LLM(provider="openai", model="gpt-5-mini"),
        ),
        cortex=LLMWithFallback(
            provider="openai",
            model="gpt-5",
            fallback=LLM(provider="openai", model="o4-mini"),
        ),
        executor=LLMWithFallback(
            provider="openai",
            model="gpt-5-nano",
            fallback=LLM(provider="openai", model="gpt-5-mini"),
        ),
        utils=LLMConfigUtils(
            outputter=LLMWithFallback(
                provider="openai",
                model="gpt-5-nano",
                fallback=LLM(provider="openai", model="gpt-5-mini"),
            ),
            hopper=LLMWithFallback(
                provider="openai",
                model="gpt-5-nano",
                fallback=LLM(provider="openai", model="gpt-5-mini"),
            ),
            # IMPORTANT: video_analyzer must use a video-capable Gemini model
            video_analyzer=LLMWithFallback(
                provider="google",
                model="gemini-3-flash-preview",
                fallback=LLM(provider="google", model="gemini-2.5-flash"),
            ),
        ),
    )


async def main():
    config: AgentConfig = (
        AgentConfigBuilder()
        .add_profile(
            AgentProfile(
                name="VideoCapable",
                llm_config=get_video_capable_llm_config(),
            )
        )
        .with_video_recording_tools()  # Enable video recording tools
        .build()
    )

    agent = Agent(config=config)
    try:
        await agent.init()

        result = await agent.run_task(
            request=TaskRequest(
                goal="""
                1. Open YouTube app
                2. Search for "Python tutorial"
                3. Start recording the screen
                4. Play the first video
                5. Wait for the first 30 seconds of the video to play
                6. Stop recording and describe what was shown in the video
                """,
                profile="VideoCapable",
            )
        )
        print(f"Task result: {result}")
    finally:
        await agent.clean()


if __name__ == "__main__":
    asyncio.run(main())

Code Breakdown

1. Configure video_analyzer in LLMConfigUtils

The key configuration is adding video_analyzer to your LLMConfigUtils:
utils=LLMConfigUtils(
    outputter=LLMWithFallback(...),
    hopper=LLMWithFallback(...),
    # Add video_analyzer with a Gemini model
    video_analyzer=LLMWithFallback(
        provider="google",
        model="gemini-3-flash-preview",
        fallback=LLM(provider="google", model="gemini-2.5-flash"),
    ),
)
The video_analyzer is optional in LLMConfigUtils. It’s only required when using video recording tools.

2. Enable Video Recording Tools

Use the builder method to enable video tools:
config = (
    AgentConfigBuilder()
    .add_profile(profile)
    .with_video_recording_tools()  # Enables start/stop recording tools
    .build()
)
Calling with_video_recording_tools() will raise FFmpegNotInstalledError if ffmpeg is not installed.
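If you want your application to degrade gracefully instead of crashing, you can catch this at build time. A minimal sketch, assuming the exception is importable from the SDK (the import path below is an assumption; check where your installed version actually exposes FFmpegNotInstalledError):

# NOTE: the import path for FFmpegNotInstalledError is an assumption;
# locate the real one in your installed minitap.mobile_use version.
from minitap.mobile_use.sdk.exceptions import FFmpegNotInstalledError
from minitap.mobile_use.sdk.builders.agent_config_builder import AgentConfigBuilder


def build_config(profile):  # profile: an AgentProfile, as defined above
    try:
        return (
            AgentConfigBuilder()
            .add_profile(profile)
            .with_video_recording_tools()
            .build()
        )
    except FFmpegNotInstalledError:
        # ffmpeg is missing: build without video tools rather than crash.
        return AgentConfigBuilder().add_profile(profile).build()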

3. Use Recording in Task Goals

The agent can now use recording tools in natural language goals:
goal = """
1. Start recording the screen
2. Navigate to the video
3. Wait for 30 seconds
4. Stop recording and describe what happened on screen
"""

Configuration File Approach

You can also configure video_analyzer in a JSONC config file:
llm-config.video.jsonc
{
  "planner": {
    "provider": "openai",
    "model": "gpt-5-nano"
  },
  "orchestrator": {
    "provider": "openai",
    "model": "gpt-5-nano"
  },
  "contextor": {
    "provider": "openai",
    "model": "gpt-5-nano",
    "fallback": {
      "provider": "openai",
      "model": "gpt-5-mini"
    }
  },
  "cortex": {
    "provider": "openai",
    "model": "gpt-5",
    "fallback": {
      "provider": "openai",
      "model": "o4-mini"
    }
  },
  "executor": {
    "provider": "openai",
    "model": "gpt-5-nano"
  },
  "utils": {
    "hopper": {
      "provider": "openai",
      "model": "gpt-5-nano"
    },
    "outputter": {
      "provider": "openai",
      "model": "gpt-5-nano"
    },
    // Video analyzer for transcription
    "video_analyzer": {
      "provider": "google",
      "model": "gemini-3-flash-preview",
      "fallback": {
        "provider": "google",
        "model": "gemini-2.5-flash"
      }
    }
  }
}
Then use it:
profile = AgentProfile(name="video", from_file="llm-config.video.jsonc")

config = (
    AgentConfigBuilder()
    .add_profile(profile)
    .with_video_recording_tools()
    .build()
)

How It Works

Android’s native screenrecord command has a 3-minute hard limit. To work around this, mobile-use:
  1. Automatically restarts recording before each 3-minute segment ends
  2. Saves each segment locally
  3. Concatenates all segments using ffmpeg when you stop recording
Result: You can record for as long as you need - the segmentation is handled transparently.
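
To make the segmentation concrete, here is a minimal sketch of the concatenation idea using ffmpeg's concat demuxer. This illustrates the technique, not mobile-use's actual implementation; segment file names are hypothetical:

import subprocess
import tempfile
from pathlib import Path


def concat_segments(segments: list[Path], output: Path) -> None:
    """Concatenate MP4 segments losslessly with ffmpeg's concat demuxer."""
    # The concat demuxer reads a text file listing each input segment.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for seg in segments:
            f.write(f"file '{seg.resolve()}'\n")
        list_file = f.name
    # -c copy avoids re-encoding, so joining is fast and lossless.
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", list_file, "-c", "copy", str(output)],
        check=True,
    )


# Example: join three 3-minute segments pulled from the device.
# concat_segments(
#     [Path("seg_0.mp4"), Path("seg_1.mp4"), Path("seg_2.mp4")],
#     Path("recording.mp4"),
# )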

Use Cases

Example: record part of a playing video and summarize its visual content:
goal = """
1. Open YouTube and play "TED Talk on AI"
2. Start recording
3. Wait 2 minutes
4. Stop recording and describe the key visual content shown
"""

Custom Analysis Prompts

The stop_video_recording tool accepts a prompt parameter for custom analysis:
# The agent will use this prompt when analyzing the video
goal = """
Start recording, play the video for 1 minute, then stop recording 
and answer: "What are the 3 main takeaways from this video?"
"""

Troubleshooting

Error: ffmpeg is required for video recording but is not installed
Solution: Install ffmpeg using your system’s package manager:
  • macOS: brew install ffmpeg
  • Linux: apt install ffmpeg or dnf install ffmpeg
  • Windows: Download from ffmpeg.org
Error: with_video_recording_tools() requires 'video_analyzer' in utils
Solution: Add video_analyzer to your profile’s LLMConfigUtils:
utils=LLMConfigUtils(
    outputter=...,
    hopper=...,
    video_analyzer=LLMWithFallback(
        provider="google",
        model="gemini-3-flash-preview",
        fallback=LLM(provider="google", model="gemini-2.5-flash"),
    ),
)
Warning: Concatenation failed, using last segment only
Cause: ffmpeg failed to merge video segments (Android only).
Solution: Ensure ffmpeg is properly installed and working. The recording will still succeed but may only contain the last 3-minute segment.
Error: Recording stopped but analysis failed
Possible causes:
  • Video file too large (>14MB after compression)
  • Gemini API rate limits
  • Invalid video format
Solution: Try shorter recordings or check your Google API quota.

Best Practices

Keep Recordings Short

Shorter recordings (under 2 minutes) process faster and more reliably.

Use Specific Prompts

Tell the agent exactly what to extract: "list all buttons shown" is more actionable than "describe the workflow".

Configure Fallbacks

Always set a fallback model for video_analyzer in case of API issues.

Test ffmpeg First

Verify ffmpeg works before running your task: ffmpeg -version
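
A small standard-library sketch for checking this programmatically before building the agent config (the check itself is generic, not part of the SDK):

import shutil
import subprocess


def ffmpeg_available() -> bool:
    """Return True if ffmpeg is on PATH and runs successfully."""
    if shutil.which("ffmpeg") is None:
        return False
    result = subprocess.run(
        ["ffmpeg", "-version"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if not ffmpeg_available():
    raise SystemExit("ffmpeg not found; install it before enabling video tools")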

Next Steps