Captioning

GeminiCaption

Generates captions for images using Gemini 2.0 Flash Vision.

You must set either GEMINI_API_KEY or GOOGLE_API_KEY in your environment to use this node.
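
For example, a minimal sketch that sets the key from Python before the pipeline runs (the key value is a placeholder, and this assumes the node reads the variable at runtime):

import os
os.environ['GEMINI_API_KEY'] = 'your-api-key'  # placeholder value; set before the pipeline executes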

Parameters

  • target_prop (default: 'caption'): The property to store the caption in
  • prompt (default: None): The prompt to use for the Gemini model
  • instructions (default: None): Additional instructions to append to the prompt
  • parallel (default: 8): The number of images to process in parallel. Adjust based on API rate limits.

Output properties

  • image.{target_prop}: The caption generated for the image

Example

dataset >> GeminiCaption()
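
A sketch using the other parameters (the values here are illustrative, not recommendations):

# Store the caption under a custom property, with extra guidance and lower concurrency
dataset >> GeminiCaption(
    target_prop='gemini_caption',
    instructions='Mention the dominant colors in the image.',
    parallel=4
)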

GPT4oCaption

Generates captions for images using GPT-4o.

You must set OPENAI_API_KEY in your environment to use this node.

Parameters

  • target_prop (default: 'caption'): The property to store the caption in
  • caption_type (default: 'descriptive'): The type of caption to generate ('descriptive' or 'booru')
  • prompt (default: None): The prompt to use for the GPT-4o model (read the code if you're curious about customizing this)
  • instructions (default: None): Additional instructions to append to the prompt
  • parallel (default: 8): The number of images to process in parallel. Adjust based on OpenAI rate limits.

Output properties

  • image.{target_prop}: The caption generated for the image

Example

dataset >> GPT4oCaption(caption_type='descriptive')

JoyCaptionAlphaOne

Generates captions for images using JoyCaption Alpha One.

Parameters

  • target_prop (default: 'caption'): The property to store the caption in
  • caption_type (default: 'descriptive'): The type of caption to generate ('descriptive', 'stable_diffusion', or 'booru')
  • caption_tone (default: 'formal'): The tone of the caption ('formal' or 'informal')
  • caption_length (default: 'any'): The length of the caption ('any' or an integer)
  • batch_size (default: 4): The number of images to process in parallel

Output properties

  • image.{target_prop}: The caption generated for the image

Example

dataset >> JoyCaptionAlphaOne
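
A sketch combining the documented options (values illustrative):

# Informal booru-style tags with an integer length target
dataset >> JoyCaptionAlphaOne(
    caption_type='booru',
    caption_tone='informal',
    caption_length=50
)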

JoyCaptionAlphaTwo

Generates captions for images using JoyCaption Alpha Two with additional caption types and options.

Parameters

  • target_prop (default: 'caption'): The property to store the caption in
  • caption_type (default: 'descriptive'): The type of caption to generate ('descriptive', 'descriptive_informal', 'training_prompt', 'midjourney', 'booru_tag_list', 'booru_like_tag_list', 'art_critic', 'product_listing', or 'social_media_post')
  • caption_length (default: 'long'): The length of the caption ('any', 'very short', 'short', 'medium-length', 'long', 'very long', or an integer)
  • extra_options (default: []): List of extra options to include in the caption
  • name_input (default: ''): Name to use when referring to people/characters in the image
  • batch_size (default: 4): The number of images to process in parallel

Output properties

  • image.{target_prop}: The caption generated for the image

Example

dataset >> JoyCaptionAlphaTwo(
    caption_type='art_critic',
    caption_length='very long',
    extra_options=['Include information about lighting'],
    name_input='Alice'
)

LlamaCaption

Generates captions for images using meta-llama/Llama-3.2-11B-Vision-Instruct.

To download this model, you must be logged in to Hugging Face.

Parameters

  • target_prop (default: 'caption'): The property to store the caption in
  • prompt (default: None): The prompt to use for the Llama model (read the code)
  • instructions (default: None): Additional instructions to include in the prompt
  • batch_size (default: 1): The number of images to process in parallel. If you are running out of memory, try reducing this value.

Output properties

  • image.{target_prop}: The caption generated for the image

Example

dataset >> LlamaCaption
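
A parameterized sketch (values illustrative):

# Extra instructions under a custom property; keep batch_size at 1 if memory is tight
dataset >> LlamaCaption(
    target_prop='llamacaption',
    instructions='Keep the caption under two sentences.',
    batch_size=1
)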

LLMCaptionVariations

Generates variations of existing image captions using Llama 3.1 8B Instruct. The original caption is preserved, and numbered variations are added as new properties.

Parameters

  • target_prop (default: 'caption'): Base property name to store variations in. Will append _1, _2 etc.
  • variations (default: 2): Number of variations to generate per image
  • parallel (default: 20): The number of images to process in parallel
  • temperature (default: 0.7): The temperature to use when sampling from the model
  • model (default: 'together_ai/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo'): The LLM model to use for generating variations

Output properties

  • image.{target_prop}: The original caption (preserved)
  • image.{target_prop}_1: First variation
  • image.{target_prop}_2: Second variation (if variations > 1)
  • etc.

Example

# Generate 3 variations of each caption
dataset >> LLMCaptionVariations(variations=3)
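
A sketch that also overrides the model and temperature (the model string is illustrative and follows the litellm conventions described under LLMCaptionTransform below):

# Two variations per image, sampled more aggressively via a different model
dataset >> LLMCaptionVariations(
    variations=2,
    temperature=0.9,
    model='openai/gpt-4o-mini'
)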

LLMCaptionTransform

Computes image captions using a Large Language Model. This can be used for a number of purposes, including:

  • Cleaning up captions from VLMs
  • Computing better captions by combining results from multiple VLMs
  • Computing captions based on tags or other metadata
  • Combining tags or other metadata with VLM caption results
  • Transforming captions into different styles (e.g. fluid prose vs. booru tags)

And so on. This is commonly one of the last nodes in a workflow, determining the final caption.

Language models are accessed via API using litellm. Model strings should follow litellm's conventions. For example:

  • together_ai/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
  • together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
  • openai/gpt-4o
  • openai/gpt-4o-mini

Depending on the platform used, you will need to set environment variables such as OPENAI_API_KEY, TOGETHER_API_KEY, or ANTHROPIC_API_KEY to access the models.

Be aware that OpenAI and Anthropic models are relatively censored and may refuse certain tasks based on the content. We have had good experiences with Llama and Qwen models.

LLM results are cached with the model, prompt, and parameters as the key, so beprepared will never execute the same request twice.

We plan to support local LLMs in the future, and would welcome pull requests that implement this efficiently.

Parameters

  • model: The name of the language model to use
  • prompt: A function that takes an image and returns a prompt for the language model
  • target_prop (default: caption): The property to store the transformed caption in
  • parallel (default: 20): The number of images to process in parallel. Adjust based on rate limits
  • temperature (default: 0.5): The temperature to use when sampling from the language model. In general, lower temperatures give more consistent and "safe" results and reduce the language model's tendency to hallucinate.

Output properties

  • image.{target_prop}: The transformed caption

Example


(dataset
    >> JoyCaptionAlphaOne(target_prop='joycaption')
    >> GPT4oCaption(target_prop='gpt4ocaption')
    >> XGenMMCaption(target_prop='xgenmmcaption')
    >> QwenVLCaption(target_prop='qwenvlcaption')
    >> LlamaCaption(target_prop='llamacaption')
    >> Gemma3Caption(target_prop='gemma3caption')
    >> LLMCaptionTransform(
        'together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo',
        lambda image: f"""
Multiple VLMs have captioned this image. These are their results:

- JoyCaption: {image.joycaption.value}
- GPT4oCaption: {image.gpt4ocaption.value}
- XGenMMCaption: {image.xgenmmcaption.value}
- QwenVLCaption: {image.qwenvlcaption.value}
- LlamaCaption: {image.llamacaption.value}
- Gemma3Caption: {image.gemma3caption.value}

Please generate a final caption for this image based on the above information. Your response should be the caption, with no extra text or boilerplate.
""".strip(),
        target_prop='caption'))
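
For a simpler use case from the list above, cleaning up a single VLM caption in place, a minimal sketch (the prompt wording is illustrative):

# Rewrite the existing caption, storing the result back into image.caption
dataset >> LLMCaptionTransform(
    'openai/gpt-4o-mini',
    lambda image: f"Fix the grammar in this caption and return only the corrected caption: {image.caption.value}",
    target_prop='caption')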

QwenVLCaption

Generates captions for images using Qwen/Qwen2-VL-7B-Instruct.

Parameters

  • target_prop (default: 'caption'): The property to store the caption in
  • prompt (default: None): The prompt to use for the Qwen 2 VL 7B model (read the code)
  • instructions (default: None): Additional instructions to include in the prompt
  • batch_size (default: 1): The number of images to process in parallel. If you are running out of memory, try reducing this value.

Output properties

  • image.{target_prop}: The caption generated for the image

Example

dataset >> QwenVLCaption
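
A sketch with a custom prompt (prompt text is illustrative):

# Focus the model on the main subject instead of using the default prompt
dataset >> QwenVLCaption(
    prompt='Describe the main subject of this image in one paragraph.',
    batch_size=1
)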

Gemma3Caption

Generates captions for images using Google's Gemma 3 12B Instruction-Tuned Vision-Language Model.

Gemma 3 is a powerful multimodal model that can process both text and images, generating high-quality text outputs. It supports a 128K token context window and is multilingual, supporting over 140 languages. The model excels at detailed image descriptions, visual question answering, and image analysis.

Requirements:

  • Transformers version: requires transformers >= 4.46.0. Run pip install --upgrade transformers if you encounter model loading errors.
  • VRAM: Gemma 3 12B requires significant VRAM (24GB+ recommended). Consider using batch_size=1 to manage memory usage.

Parameters

  • target_prop (default: 'caption'): The property to store the caption in
  • prompt (default: a built-in detailed description prompt): The prompt to use for image captioning
  • system_prompt (default: None): System prompt to set the assistant's behavior
  • instructions (default: None): Additional instructions to append to the prompt
  • batch_size (default: 1): The number of images to process in parallel. Keep at 1 for 12B model.

Output properties

  • image.{target_prop}: The caption generated for the image

Example

# Basic usage with default detailed captioning
dataset >> Gemma3Caption()

# Custom prompt for specific focus
dataset >> Gemma3Caption(
    prompt='What objects are visible in this image?'
)

# With system prompt and instructions
dataset >> Gemma3Caption(
    system_prompt='You are an art critic analyzing paintings.',
    prompt='Analyze this artwork',
    instructions='Focus on style, technique, and emotional impact'
)

Passthrough

A node that does nothing and returns the dataset unchanged. This can be useful as a no-op placeholder in conditional pipeline branches.

Example

# Use Passthrough as a no-op alternative in a conditional
dataset >> (ProcessNode() if condition else Passthrough())

MapCaption

Maps the current caption to a new caption using a function.

Parameters

  • func: Function that takes the current caption string and returns a new caption string.

Output properties

  • image.caption: The transformed caption.

Example

# Add an exclamation mark to all captions
dataset >> MapCaption(lambda caption: caption + "!")

# Prepend a prefix to all captions
dataset >> MapCaption(lambda caption: f"A photo showing {caption}")

SetCaption

Sets the caption property on an image.

Parameters

  • caption: The caption to set.

Output properties

  • image.caption: The caption of the image.

Example

dataset >> SetCaption("ohwx person")

XGenMMCaption

Generates captions for images using Salesforce/xgen-mm-phi3-mini-instruct-r-v1.

Parameters

  • target_prop (default: 'caption'): The property to store the caption in
  • prompt (default: None): The prompt to use for the xGen-mm model (read the code)
  • instructions (default: None): Additional instructions to include in the prompt
  • batch_size (default: 4): The number of images to process in parallel. If you are running out of memory, try reducing this value.

Output properties

  • image.{target_prop}: The caption generated for the image

Example

dataset >> XGenMMCaption

Florence2Caption

Generates captions for images using the Florence-2-large model.

Parameters

  • target_prop (default: 'caption'): The property to store the caption in
  • task (default: Florence2Task.MORE_DETAILED_CAPTION): The captioning task to perform. One of:
      • Florence2Task.CAPTION: Basic caption
      • Florence2Task.DETAILED_CAPTION: Detailed caption
      • Florence2Task.MORE_DETAILED_CAPTION: More detailed caption
  • batch_size (default: 8): The number of images to process in parallel. If you are running out of memory, try reducing this value.

Output properties

  • image.{target_prop}: The caption generated for the image

Example

dataset >> Florence2Caption(task=Florence2Task.DETAILED_CAPTION)