Ensure that the OpenAI Tokens-Per-Minute (TPM) limits meet your organization’s specific requirements based on the model and SKU being used. For example, ‘gpt-4:Standard:10’ means that GPT-4 deployments on the Standard SKU should be limited to 10K TPM or less.
Managing TPM limits appropriately helps control costs, prevent unexpected billing spikes, and ensure consistent availability of AI resources for your applications.
Tokens-Per-Minute (TPM) limits determine how many tokens your application can send to OpenAI’s API within a minute. Think of tokens as pieces of words – most words in English are 1-2 tokens, with roughly 4 characters per token on average. These limits serve multiple purposes:
- Cost control – By setting appropriate TPM limits, you prevent unintended overuse that could lead to excessive charges
- Resource allocation – Ensures fair distribution of AI computing resources across your organization
- Performance planning – Helps predict and manage application performance based on AI response needs
Without proper TPM limits, applications can potentially consume more tokens than anticipated, leading to significant unexpected costs, especially during traffic spikes or in case of application issues like infinite loops.
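As a rough illustration of the sizing described above, the sketch below estimates token counts with the ~4-characters-per-token heuristic and checks whether an expected workload fits under a given TPM limit. The numbers are approximations only; a tokenizer such as tiktoken gives exact counts, and the prompt and request volumes shown are made up for the example.

```python
# Rough token estimate using the ~4 characters-per-token heuristic described
# above. Exact counts require the model's tokenizer; treat this as a sketch.

def estimate_tokens(text: str) -> int:
    """Approximate token count for English text (~4 characters per token)."""
    return max(1, len(text) // 4)

def fits_tpm_limit(requests_per_minute: int, avg_prompt: str,
                   avg_completion_tokens: int, tpm_limit: int) -> bool:
    """Check whether an estimated workload stays under a TPM limit."""
    tokens_per_request = estimate_tokens(avg_prompt) + avg_completion_tokens
    return requests_per_minute * tokens_per_request <= tpm_limit

if __name__ == "__main__":
    prompt = "Summarize the customer's last three support tickets. " * 20
    print(estimate_tokens(prompt), "estimated prompt tokens")
    print("Fits 10K TPM:", fits_tpm_limit(30, prompt, 300, 10_000))
```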
Cost Impact
TPM limits directly affect your OpenAI API costs in several ways:
- Overprovisioning – Setting TPM limits much higher than needed means paying for unused capacity
- Underprovisioning – Setting limits too low may impact application performance and user experience
- Cost predictability – Appropriate limits make budgeting and forecasting more accurate
Potential Savings
Consider a scenario where you’re using GPT-4 with Standard SKU pricing and the deployment runs at its full TPM limit around the clock (30-day month):
| TPM Setting | Monthly Token Usage | Approximate Monthly Cost |
|---|---|---|
| 100K TPM | 4.32B tokens | $86,400 |
| 10K TPM | 432M tokens | $8,640 |
| 1K TPM | 43.2M tokens | $864 |
Reducing an unnecessarily high TPM limit from 100K to 10K could save approximately $77,760 per month if your actual usage requirements are closer to the lower limit.
Even a more modest reduction from 50K to 40K TPM could save $8,640 monthly – significant savings that directly impact your bottom line.
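The figures in the table follow from simple arithmetic: monthly tokens at full utilization = TPM × 60 minutes × 24 hours × 30 days, and the dollar amounts imply a blended rate of roughly $0.02 per 1K tokens. The sketch below reproduces that calculation; the per-1K price is an assumption inferred from the table, so substitute your actual GPT-4 pricing.

```python
# Reproduce the table above: monthly cost at sustained full utilization.
# PRICE_PER_1K_TOKENS is an assumption inferred from the table; replace it
# with your actual blended GPT-4 price (prompt + completion).
PRICE_PER_1K_TOKENS = 0.02  # USD, assumed

def monthly_cost(tpm: int, minutes_per_month: int = 60 * 24 * 30) -> float:
    monthly_tokens = tpm * minutes_per_month
    return monthly_tokens / 1000 * PRICE_PER_1K_TOKENS

for tpm in (100_000, 10_000, 1_000):
    print(f"{tpm:>7} TPM -> ${monthly_cost(tpm):>9,.0f} per month")
# 100,000 TPM -> $86,400; 10,000 TPM -> $8,640; 1,000 TPM -> $864
```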
Implementation Guide
Infrastructure-as-Code Example (Terraform)
Problematic configuration:
resource "azurerm_cognitive_account" "openai" {
name = "my-openai-service"
location = azurerm_resource_group.example.location
resource_group_name = azurerm_resource_group.example.name
kind = "OpenAI"
sku_name = "S0"
deployment {
name = "gpt-4-deployment"
model {
format = "OpenAI"
name = "gpt-4"
version = "0613"
}
scale {
type = "Standard"
capacity = 120 # 120K TPM - potentially excessive
}
}
}
Improved configuration:
resource "azurerm_cognitive_account" "openai" {
name = "my-openai-service"
location = azurerm_resource_group.example.location
resource_group_name = azurerm_resource_group.example.name
kind = "OpenAI"
sku_name = "S0"
deployment {
name = "gpt-4-deployment"
model {
format = "OpenAI"
name = "gpt-4"
version = "0613"
}
scale {
type = "Standard"
capacity = 10 # 10K TPM - more reasonable limit
}
}
}
Step-by-Step Instructions
For Terraform Configurations
- Identify current TPM settings: Use Infracost to scan your infrastructure code for OpenAI deployments with potentially excessive TPM limits.
- Evaluate actual usage requirements: Review application logs and OpenAI usage metrics to determine actual token consumption patterns (a sizing sketch follows these steps).
- Update configuration files: Modify the capacity value in your Terraform configuration to reflect appropriate limits.
- Validate changes: Run terraform plan to verify the changes will be applied correctly.
- Apply the changes: Use terraform apply to update your infrastructure.
- Monitor impact: After implementation, track both performance and cost metrics to confirm the new limits are appropriate.
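To support the usage-evaluation step above, the sketch below shows one way to derive a capacity value from historical data: aggregate per-minute token totals from an exported usage log and size the limit at the 99th percentile plus headroom. The CSV layout (timestamp and total_tokens columns) is hypothetical; adapt it to however you export your Azure OpenAI usage data.

```python
# Derive a Terraform 'capacity' suggestion from an exported usage log.
# Assumes a hypothetical CSV with 'timestamp' and 'total_tokens' columns;
# adjust the column names to match your actual export.
import csv
import math
from collections import defaultdict
from datetime import datetime

def suggest_capacity(csv_path: str, headroom: float = 1.5) -> int:
    per_minute = defaultdict(int)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            ts = datetime.fromisoformat(row["timestamp"])
            per_minute[ts.replace(second=0, microsecond=0)] += int(row["total_tokens"])
    if not per_minute:
        raise ValueError("no usage data found")
    values = sorted(per_minute.values())
    p99 = values[min(len(values) - 1, math.ceil(0.99 * len(values)) - 1)]
    # Terraform 'capacity' is expressed in thousands of TPM.
    return math.ceil(p99 * headroom / 1000)

# Example: suggest_capacity("openai_usage.csv") -> 10 means capacity = 10 (10K TPM)
```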
For Azure Portal (Manual Configuration)
- Navigate to your Azure OpenAI resource in the Azure portal
- Select “Model deployments” from the left menu
- Click on the specific model deployment you want to adjust
- Under “Capacity (TPM)”, select the appropriate value
- Save your changes
Best Practices
- Start conservatively: Begin with lower TPM limits and increase only as needed
- Monitor usage patterns: Regularly review token usage to identify trends and adjust limits accordingly
- Implement tiered limits: Consider different TPM limits for development, testing, and production environments (one approach is sketched after this list)
- Create alerts: Set up monitoring alerts for when your usage approaches configured limits
- Document decisions: Keep records of TPM limit decisions and their rationale for future reference
- Regular reviews: Schedule quarterly reviews of TPM limits as part of your FinOps practice
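One lightweight way to implement tiered limits is to keep a single map of environment-to-capacity values and feed it into your deployment tooling. The sketch below writes per-environment .tfvars files for a Terraform input variable; the tpm_capacity variable name, file layout, and capacity values are illustrative assumptions, not part of any existing configuration.

```python
# Generate per-environment Terraform variable files from one capacity map.
# The 'tpm_capacity' variable name and file layout are illustrative; wire
# them to whatever input variables your Terraform configuration defines.
from pathlib import Path

# Capacity is in thousands of TPM, matching the azurerm 'capacity' argument.
TPM_BY_ENVIRONMENT = {"dev": 5, "staging": 20, "prod": 60}

def write_tfvars(output_dir: str = ".") -> None:
    for env, capacity in TPM_BY_ENVIRONMENT.items():
        path = Path(output_dir) / f"{env}.tfvars"
        path.write_text(f"tpm_capacity = {capacity}  # {capacity}K TPM for {env}\n")
        print(f"wrote {path}")

if __name__ == "__main__":
    write_tfvars()
```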
Tools and Scripts
- Infracost: Scan your infrastructure-as-code to identify potential cost optimization opportunities, including OpenAI TPM limits.
- Azure Monitor: Create custom dashboards to track OpenAI usage against configured limits (a metrics query sketch follows this list)
- Terraform state analysis: Use terraform state show commands to audit current configuration
- PowerShell scripts: Automate the collection of usage statistics across multiple OpenAI deployments
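As a starting point for the Azure Monitor and scripting items above, the sketch below pulls recent per-minute token metrics for an Azure OpenAI resource using the azure-monitor-query SDK. The metric name "TokenTransaction" and the resource ID are assumptions; check which token metrics your Cognitive Services account actually exposes and substitute accordingly.

```python
# Pull recent per-minute token metrics for an Azure OpenAI resource.
# Requires: pip install azure-identity azure-monitor-query
# The metric name and resource ID below are assumptions; verify them
# against the metrics your Cognitive Services account exposes.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.CognitiveServices/accounts/my-openai-service"
)

client = MetricsQueryClient(DefaultAzureCredential())
response = client.query_resource(
    RESOURCE_ID,
    metric_names=["TokenTransaction"],  # assumed metric name
    timespan=timedelta(days=1),
    granularity=timedelta(minutes=1),
    aggregations=[MetricAggregationType.TOTAL],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            if point.total:
                print(point.timestamp, int(point.total), "tokens")
```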
Examples
Scenario 1: Development Environment Optimization
A development team initially set up a GPT-4 deployment with 50K TPM for their test environment. By analyzing actual usage patterns, they discovered peak usage never exceeded 2K TPM. By reducing the limit to 5K TPM (still providing headroom), they reduced their potential maximum monthly costs from $43,200 to $4,320.
Scenario 2: Production Scaling Strategy
An e-commerce company uses GPT-4 for customer service automation. Initially, they set a 100K TPM limit to handle potential traffic spikes. After implementing a more sophisticated scaling strategy with automated TPM adjustments based on time-of-day patterns, they reduced average TPM to 30K, saving approximately $50,400 monthly while maintaining service levels.
Scenario 3: Multi-Environment Management
A financial services organization maintained identical 80K TPM limits across development, staging, and production environments. By implementing environment-appropriate limits (5K for development, 20K for staging, 60K for production), they reduced overall OpenAI costs by 42% without affecting application performance.
Considerations and Caveats
When Higher TPM Limits May Be Justified
- Mission-critical applications: Systems where AI response time is directly tied to user experience or business outcomes
- High-traffic consumer applications: Services with unpredictable usage spikes that must maintain responsiveness
- Batch processing workloads: Applications that need to process large volumes of data in short time windows
Implementation Challenges
- Accurate forecasting: Predicting appropriate TPM limits requires good historical data
- Application design: Some applications may need refactoring to handle TPM limits gracefully
- Regional variations: Different regions may have different TPM limit availability
- Model changes: Upgrading to newer model versions may require TPM limit reconsideration
Risk Mitigation Strategies
- Implement circuit breakers: Design applications to degrade gracefully when approaching TPM limits (a minimal sketch follows this list)
- Queue-based architecture: Buffer requests during usage spikes for processing when capacity is available
- Hybrid approaches: Consider using different models with different cost structures for various use cases
- Auto-scaling policies: Some environments support dynamic TPM scaling based on usage patterns
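As a minimal sketch of the circuit-breaker and queueing ideas above (the class, thresholds, and behaviour are illustrative, not a specific library), the example below tracks tokens spent in the current minute and tells the caller to defer work as the configured TPM limit approaches:

```python
# Minimal client-side guard that tracks token spend per minute and defers
# work as the configured TPM limit approaches. Illustrative only; the soft
# threshold and fallback behaviour should match your application's needs.
import time
from collections import deque

class TpmGuard:
    def __init__(self, tpm_limit: int, soft_ratio: float = 0.8):
        self.tpm_limit = tpm_limit
        self.soft_limit = int(tpm_limit * soft_ratio)
        self.events = deque()  # (timestamp, tokens) within the last 60 seconds

    def _used_last_minute(self) -> int:
        cutoff = time.monotonic() - 60
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        return sum(tokens for _, tokens in self.events)

    def try_acquire(self, estimated_tokens: int) -> bool:
        """Return True if the request may proceed now, False to queue or retry."""
        if self._used_last_minute() + estimated_tokens > self.soft_limit:
            return False  # degrade gracefully: queue, retry later, or fall back
        self.events.append((time.monotonic(), estimated_tokens))
        return True

guard = TpmGuard(tpm_limit=10_000)
if guard.try_acquire(1_200):
    pass  # call the OpenAI API here
else:
    pass  # buffer the request and retry when capacity frees up
```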