Captioning
GeminiCaption
Generates captions for images using Gemini 2.0 Flash Vision.
You must set either GEMINI_API_KEY or GOOGLE_API_KEY in your environment to use this node.
Parameters
- target_prop (default: 'caption'): The property to store the caption in
- prompt (default: None): The prompt to use for the Gemini model
- instructions (default: None): Additional instructions to append to the prompt
- parallel (default: 8): The number of images to process in parallel. Adjust based on API rate limits.
Output properties
- image.{target_prop}: The caption generated for the image
Example
dataset >> GeminiCaption()
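A hedged sketch of non-default options, using only the parameters documented above (the property name and instruction text are illustrative):
dataset >> GeminiCaption(
    target_prop='gemini_caption',
    instructions='Focus on the main subject and the overall mood.',
    parallel=4
)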
GPT4oCaption
Generates captions for images using GPT-4o.
You must set OPENAI_API_KEY in your environment to use this node.
Parameters
- target_prop (default: 'caption'): The property to store the caption in
- caption_type (default: 'descriptive'): The type of caption to generate ('descriptive' or 'booru')
- prompt (default: None): The prompt to use for the GPT-4o model (read the code if you're curious about customizing this)
- instructions (default: None): Additional instructions to append to the prompt
- parallel (default: 8): The number of images to process in parallel. Adjust based on OpenAI rate limits.
Output properties
- image.{target_prop}: The caption generated for the image
Example
dataset >> GPT4oCaption(caption_type='descriptive')
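A hedged sketch combining the documented caption_type, instructions, and parallel parameters (the instruction text is illustrative):
dataset >> GPT4oCaption(
    caption_type='booru',
    instructions='Prefer concrete, visible details over speculation.',
    parallel=4
)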
JoyCaptionAlphaOne
Generates captions for images using JoyCaption Alpha One.
Parameters
- target_prop (default: 'caption'): The property to store the caption in
- caption_type (default: 'descriptive'): The type of caption to generate ('descriptive', 'stable_diffusion', or 'booru')
- caption_tone (default: 'formal'): The tone of the caption ('formal' or 'informal')
- caption_length (default: 'any'): The length of the caption ('any' or an integer)
- batch_size (default: 4): The number of images to process in parallel
Output properties
- image.{target_prop}: The caption generated for the image
Example
dataset >> JoyCaptionAlphaOne
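A hedged sketch using the documented non-default options (the specific values are illustrative):
dataset >> JoyCaptionAlphaOne(
    caption_type='descriptive',
    caption_tone='informal',
    caption_length=40
)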
JoyCaptionAlphaTwo
Generates captions for images using JoyCaption Alpha Two with additional caption types and options.
Parameters
- target_prop (default: 'caption'): The property to store the caption in
- caption_type (default: 'descriptive'): The type of caption to generate. Options:
  - 'descriptive'
  - 'descriptive_informal'
  - 'training_prompt'
  - 'midjourney'
  - 'booru_tag_list'
  - 'booru_like_tag_list'
  - 'art_critic'
  - 'product_listing'
  - 'social_media_post'
- caption_length (default: 'long'): The length of the caption ('any', 'very short', 'short', 'medium-length', 'long', 'very long', or an integer)
- extra_options (default: []): List of extra options to include in the caption
- name_input (default: ''): Name to use when referring to people/characters in the image
- batch_size (default: 4): The number of images to process in parallel
Output properties
- image.{target_prop}: The caption generated for the image
Example
dataset >> JoyCaptionAlphaTwo(
    caption_type='art_critic',
    caption_length='very long',
    extra_options=['Include information about lighting'],
    name_input='Alice'
)
LlamaCaption
Generates captions for images using meta-llama/llama-3.2-11B-Vision-Instruct.
In order to download this model, you need to be logged in to Hugging Face.
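If you are not logged in yet, one common approach (not specific to this library) is to authenticate with the huggingface_hub client before running the pipeline; the token below is a placeholder:
from huggingface_hub import login

# Authenticate with a Hugging Face access token (placeholder value shown).
# Running `huggingface-cli login` in a terminal works as well.
login(token="hf_your_token_here")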
Parameters
- target_prop (default: 'caption'): The property to store the caption in
- prompt (default: None): The prompt to use for the Llama model (read the code)
- instructions (default: None): Additional instructions to include in the prompt
- batch_size (default: 1): The number of images to process in parallel. If you are running out of memory, try reducing this value.
Output properties
- image.{target_prop}: The caption generated for the image
Example
dataset >> LlamaCaption
LLMCaptionVariations
Generates variations of existing image captions using LLaMA 3.1 8B Instruct. This preserves the original caption and adds numbered variations as new properties.
Parameters
- target_prop (default: 'caption'): Base property name to store variations in. Will append _1, _2, etc.
- variations (default: 2): Number of variations to generate per image
- parallel (default: 20): The number of images to process in parallel
- temperature (default: 0.7): The temperature to use when sampling from the model
- model (default: 'together_ai/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo'): The LLM model to use for generating variations
Output properties
- image.{target_prop}: The original caption (preserved)
- image.{target_prop}_1: First variation
- image.{target_prop}_2: Second variation (if variations > 1)
- etc.
Example
# Generate 3 variations of each caption
dataset >> LLMCaptionVariations(variations=3)
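A hedged sketch (not from the library docs) of consuming the variations downstream with LLMCaptionTransform, assuming the dataset already carries an image.caption property:
(dataset
    >> LLMCaptionVariations(variations=2)   # keeps image.caption, adds image.caption_1 and image.caption_2
    >> LLMCaptionTransform(
        'together_ai/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo',
        lambda image: f"""
Here are three candidate captions for one image:
1. {image.caption.value}
2. {image.caption_1.value}
3. {image.caption_2.value}
Return the single clearest caption, with no extra text or boilerplate.
""".strip(),
        target_prop='caption'))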
LLMCaptionTransform
Computes image captions using a Large Language Model. This can be used for a number of purposes, including:
- Cleaning up captions from VLMs
- Computing better captions by combining results from multiple VLMs
- Computing captions based on tags or other metadata
- Combining tags or other metadata with VLM caption results
- Transforming captions into different styles (e.g. fluid vs booru).
And so on. This is commonly one of the last nodes in a workflow, determining the final caption.
Language models are accessed via API using litellm. Model strings should follow their conventions. For example:
- together_ai/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
- together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
- openai/gpt-4o
- openai/gpt-4o-mini
Depending on the platform used, you will need to set environment variables such as OPENAI_API_KEY, TOGETHER_API_KEY, or ANTHROPIC_API_KEY to access the models.
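As an illustrative sketch (with a placeholder key value), the provider prefix in the model string determines which environment variable is needed; a together_ai/... model, for instance, requires TOGETHER_API_KEY:
import os

# Placeholder value; setting the variable in your shell works just as well.
os.environ['TOGETHER_API_KEY'] = 'your-api-key'
model = 'together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo'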
Be aware that OpenAI and Anthropic models are relatively censored and may refuse certain tasks based on the content. We have had good experiences with Llama and Qwen models.
LLM results are cached with the model, prompt, and parameters as the key, so beprepared will never execute the same request twice.
We plan to support local LLMs in the future, and would welcome pull requests that implement this efficiently.
Parameters
- model: The name of the language model to use
- prompt: A function that takes an image and returns a prompt for the language model
- target_prop (default: 'caption'): The property to store the transformed caption in
- parallel (default: 20): The number of images to process in parallel. Adjust based on rate limits
- temperature (default: 0.5): The temperature to use when sampling from the language model. In general, lower temperatures give more consistent and "safe" results and reduce the language model's tendency to hallucinate.
Output properties
- image.{target_prop}: The transformed caption
Example
(dataset
    >> JoyCaptionAlphaOne(target_prop='joycaption')
    >> GPT4oCaption(target_prop='gpt4ocaption')
    >> XGenMMCaption(target_prop='xgenmmcaption')
    >> QwenVLCaption(target_prop='qwenvlcaption')
    >> LlamaCaption(target_prop='llamacaption')
    >> Gemma3Caption(target_prop='gemma3caption')
    >> LLMCaptionTransform(
        'together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo',
        lambda image: f"""
Multiple VLMs have captioned this image. These are their results:
- JoyCaption: {image.joycaption.value}
- GPT4oCaption: {image.gpt4ocaption.value}
- XGenMMCaption: {image.xgenmmcaption.value}
- QwenVLCaption: {image.qwenvlcaption.value}
- LlamaCaption: {image.llamacaption.value}
- Gemma3: {image.gemma3caption.value}
Please generate a final caption for this image based on the above information. Your response should be the caption, with no extra text or boilerplate.
""".strip(),
        target_prop='caption'))
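A smaller, hedged sketch of the caption-cleanup use case listed above (the model choice and prompt wording are illustrative):
(dataset
    >> QwenVLCaption()
    >> LLMCaptionTransform(
        'openai/gpt-4o-mini',
        lambda image: f"Rewrite the following caption as one concise, natural sentence, "
                      f"with no extra text: {image.caption.value}",
        target_prop='caption'))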
QwenVLCaption
Generates captions for images using Qwen/Qwen2-VL-7B-Instruct.
Parameters
- target_prop (default: 'caption'): The property to store the caption in
- prompt (default: None): The prompt to use for the Qwen 2 VL 7B model (read the code)
- instructions (default: None): Additional instructions to include in the prompt
- batch_size (default: 1): The number of images to process in parallel. If you are running out of memory, try reducing this value.
Output properties
- image.{target_prop}: The caption generated for the image
Example
dataset >> QwenVLCaption
Gemma3Caption
Generates captions for images using Google's Gemma 3 12B Instruction-Tuned Vision-Language Model.
Gemma 3 is a powerful multimodal model that can process both text and images, generating high-quality text outputs. It supports a 128K token context window and is multilingual, supporting over 140 languages. The model excels at detailed image descriptions, visual question answering, and image analysis.
Requirements:
- Transformers version: Requires transformers >= 4.46.0. Run pip install --upgrade transformers if you encounter model loading errors (a quick version check is sketched below).
- VRAM: Gemma 3 12B requires significant VRAM (24GB+ recommended). Consider using batch_size=1 to manage memory usage.
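A quick way to confirm the installed version meets the requirement above (plain Python, not part of the library):
import transformers

# Should print 4.46.0 or newer; otherwise run: pip install --upgrade transformers
print(transformers.__version__)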
Parameters
- target_prop (default: 'caption'): The property to store the caption in
- prompt (default: a detailed description prompt): The prompt to use for image captioning
- system_prompt (default: None): System prompt to set the assistant's behavior
- instructions (default: None): Additional instructions to append to the prompt
- batch_size (default: 1): The number of images to process in parallel. Keep at 1 for the 12B model.
Output properties
- image.{target_prop}: The caption generated for the image
Example
# Basic usage with default detailed captioning
dataset >> Gemma3Caption()

# Custom prompt for specific focus
dataset >> Gemma3Caption(
    prompt='What objects are visible in this image?'
)

# With system prompt and instructions
dataset >> Gemma3Caption(
    system_prompt='You are an art critic analyzing paintings.',
    prompt='Analyze this artwork',
    instructions='Focus on style, technique, and emotional impact'
)
Passthrough
A node that does nothing and returns the dataset unchanged. This can be useful as a no-op placeholder in conditional pipeline branches.
Example
# Use Passthrough as a no-op alternative in a conditional
dataset >> (ProcessNode() if condition else Passthrough())
MapCaption
Maps the current caption to a new caption using a function.
Parameters
- func: Function that takes the current caption string and returns a new caption string.
Output properties
- image.caption: The transformed caption.
Example
# Add an exclamation mark to all captions
dataset >> MapCaption(lambda caption: caption + "!")
# Prepend a prefix to all captions
dataset >> MapCaption(lambda caption: f"A photo showing {caption}")
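MapCaption also accepts a named function, which can be handier for longer transformations; a small hedged sketch (the helper below is hypothetical, not part of the library):
def normalize_caption(caption: str) -> str:
    # Collapse runs of whitespace and make sure the caption ends with a period.
    caption = ' '.join(caption.split())
    return caption if caption.endswith('.') else caption + '.'

dataset >> MapCaption(normalize_caption)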
SetCaption
Sets the caption property on an image.
Parameters
- caption: The caption to set.
Output properties
- image.caption: The caption of the image.
Example
dataset >> SetCaption("ohwx person")
XGenMMCaption
Generates captions for images using Salesforce/xgen-mm-phi3-mini-instruct-r-v1.
Parameters
- target_prop (default: 'caption'): The property to store the caption in
- prompt (default: None): The prompt to use for the xGen-mm model (read the code)
- instructions (default: None): Additional instructions to include in the prompt
- batch_size (default: 4): The number of images to process in parallel. If you are running out of memory, try reducing this value.
Output properties
- image.{target_prop}: The caption generated for the image
Example
dataset >> XGenMMCaption
Florence2Caption
Generates captions for images using the Florence-2-large model.
Parameters
- target_prop (default: 'caption'): The property to store the caption in
- task (default: Florence2Task.MORE_DETAILED_CAPTION): The captioning task to perform. One of:
  - Florence2Task.CAPTION: Basic caption
  - Florence2Task.DETAILED_CAPTION: Detailed caption
  - Florence2Task.MORE_DETAILED_CAPTION: More detailed caption
- batch_size (default: 8): The number of images to process in parallel. If you are running out of memory, try reducing this value.
Output properties
- image.{target_prop}: The caption generated for the image
Example
dataset >> Florence2Caption(task=Florence2Task.DETAILED_CAPTION)
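A hedged sketch that writes a short basic caption to a separate property (the property name is illustrative):
dataset >> Florence2Caption(task=Florence2Task.CAPTION, target_prop='short_caption')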