Ensure that the OpenAI Tokens-Per-Minute (TPM) limits meet your organization’s specific requirements based on the model and SKU being used. For example, ‘gpt-4:Standard:10’ means that GPT-4 deployments using the Standard SKU should be limited to 10K TPM or less.

Managing TPM limits appropriately helps control costs, prevent unexpected billing spikes, and ensure consistent availability of AI resources for your applications.

Tokens-Per-Minute (TPM) limits determine how many tokens your application can send to OpenAI’s API within a minute. Think of tokens as pieces of words – most words in English are 1-2 tokens, with roughly 4 characters per token on average. These limits serve multiple purposes:

  1. Cost control – By setting appropriate TPM limits, you prevent unintended overuse that could lead to excessive charges
  2. Resource allocation – Ensures fair distribution of AI computing resources across your organization
  3. Performance planning – Helps predict and manage application performance based on AI response needs

Without proper TPM limits, applications can potentially consume more tokens than anticipated, leading to significant unexpected costs, especially during traffic spikes or in case of application issues like infinite loops.
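
To translate this into a concrete capacity value, you can estimate required TPM from expected request volume and average tokens per request. The sketch below uses Terraform locals purely as a calculator; the traffic figures are illustrative assumptions, not measurements:

locals {
  # Illustrative assumptions - replace with figures from your own logs
  peak_requests_per_minute = 10   # busiest observed minute
  avg_tokens_per_request   = 1000 # prompt + completion tokens combined

  # Estimated peak demand plus ~50% headroom for spikes
  estimated_peak_tpm = local.peak_requests_per_minute * local.avg_tokens_per_request
  recommended_tpm    = local.estimated_peak_tpm * 1.5

  # Azure OpenAI capacity values are expressed in thousands of TPM
  recommended_capacity = ceil(local.recommended_tpm / 1000) # 15 -> a 15K TPM limit
}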

Cost Impact

TPM limits directly affect your OpenAI API costs in several ways:

  • Overprovisioning – Setting TPM limits much higher than needed means paying for unused capacity
  • Underprovisioning – Setting limits too low may impact application performance and user experience
  • Cost predictability – Appropriate limits make budgeting and forecasting more accurate

Potential Savings

Consider a scenario where you’re using GPT-4 with Standard SKU pricing, and assume traffic could consume the full configured limit around the clock:

TPM Setting | Monthly Token Usage | Approximate Monthly Cost
100K TPM    | 4.32B tokens        | $86,400
10K TPM     | 432M tokens         | $8,640
1K TPM      | 43.2M tokens        | $864

Reducing an unnecessarily high TPM limit from 100K to 10K could save approximately $77,760 per month if your actual usage requirements are closer to the lower limit.

Even a more modest reduction from 50K to 40K TPM could save $8,640 monthly – significant savings that directly impact your bottom line.
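
The arithmetic behind these figures is simple: a limit of N thousand TPM, consumed continuously, produces N × 1,000 × 60 × 24 × 30 tokens over a 30-day month, priced here at the roughly $20 per million tokens implied by the table. A minimal locals block makes the worst-case exposure explicit (the rate is an assumption; check current pricing for your model and region):

locals {
  capacity_thousands    = 10 # the "capacity" value on the deployment
  price_per_million_usd = 20 # assumed blended prompt/completion rate

  # Worst case: the full limit is consumed every minute of a 30-day month
  max_monthly_tokens = local.capacity_thousands * 1000 * 60 * 24 * 30                   # 432M tokens
  max_monthly_cost   = local.max_monthly_tokens / 1000000 * local.price_per_million_usd # ~$8,640
}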

Implementation Guide

Infrastructure-as-Code Example (Terraform)

Problematic configuration:

resource "azurerm_cognitive_account" "openai" {
  name                = "my-openai-service"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  kind                = "OpenAI"
  sku_name            = "S0"
 
  deployment {
    name = "gpt-4-deployment"
    model {
      format  = "OpenAI"
      name    = "gpt-4"
      version = "0613"
    }
    scale {
      type     = "Standard"
      capacity = 120 # 120K TPM - potentially excessive
    }
  }
}

Improved configuration:

resource "azurerm_cognitive_account" "openai" {
  name                = "my-openai-service"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  kind                = "OpenAI"
  sku_name            = "S0"
 
  deployment {
    name = "gpt-4-deployment"
    model {
      format  = "OpenAI"
      name    = "gpt-4"
      version = "0613"
    }
    scale {
      type     = "Standard"
      capacity = 10 # 10K TPM - more reasonable limit
    }
  }
}

Step-by-Step Instructions

For Terraform Configurations

  1. Identify current TPM settings: Use Infracost to scan your infrastructure code for OpenAI deployments with potentially excessive TPM limits.
  2. Evaluate actual usage requirements: Review application logs and OpenAI usage metrics to determine actual token consumption patterns.
  3. Update configuration files: Modify the capacity value in your Terraform configuration to reflect appropriate limits.
  4. Validate changes: Run terraform plan to verify the changes will be applied correctly.
  5. Apply the changes: Use terraform apply to update your infrastructure.
  6. Monitor impact: After implementation, track both performance and cost metrics to confirm the new limits are appropriate.

For Azure Portal (Manual Configuration)

  1. Navigate to your Azure OpenAI resource in the Azure portal
  2. Select “Model deployments” from the left menu
  3. Click on the specific model deployment you want to adjust
  4. Under “Capacity (TPM)”, select the appropriate value
  5. Save your changes

Best Practices

  • Start conservatively: Begin with lower TPM limits and increase only as needed
  • Monitor usage patterns: Regularly review token usage to identify trends and adjust limits accordingly
  • Implement tiered limits: Consider different TPM limits for development, testing, and production environments
  • Create alerts: Set up monitoring alerts for when your usage approaches configured limits (see the sketch after this list)
  • Document decisions: Keep records of TPM limit decisions and their rationale for future reference
  • Regular reviews: Schedule quarterly reviews of TPM limits as part of your FinOps practice
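
For the alerting bullet above, one option is a metric alert on the Azure OpenAI account via the azurerm provider. Treat the metric name, threshold, and action group below as assumptions to verify against the metrics actually exposed on your resource:

resource "azurerm_monitor_metric_alert" "openai_token_usage" {
  name                = "openai-token-usage-alert"
  resource_group_name = azurerm_resource_group.example.name
  scopes              = [azurerm_cognitive_account.openai.id]
  description         = "Token consumption is approaching the configured TPM limit"

  criteria {
    metric_namespace = "Microsoft.CognitiveServices/accounts"
    metric_name      = "TokenTransaction" # assumed metric name - confirm in Azure Monitor
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 40000 # ~80% of a 10K TPM limit over the default 5-minute window
  }

  action {
    action_group_id = azurerm_monitor_action_group.ops.id # assumes an existing action group
  }
}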

Tools and Scripts

  • Infracost: Scan your infrastructure-as-code to identify potential cost optimization opportunities, including OpenAI TPM limits.
  • Azure Monitor: Create custom dashboards to track OpenAI usage against configured limits
  • Terraform state analysis: Use terraform state show commands to audit current configuration
  • PowerShell scripts: Automate the collection of usage statistics across multiple OpenAI deployments

Examples

Scenario 1: Development Environment Optimization

A development team initially set up a GPT-4 deployment with 50K TPM for their test environment. By analyzing actual usage patterns, they discovered peak usage never exceeded 2K TPM. By reducing the limit to 5K TPM (still providing headroom), they reduced their potential maximum monthly costs from $43,200 to $4,320.

Scenario 2: Production Scaling Strategy

An e-commerce company uses GPT-4 for customer service automation. Initially, they set a 100K TPM limit to handle potential traffic spikes. After implementing a more sophisticated scaling strategy with automated TPM adjustments based on time-of-day patterns, they reduced average TPM to 30K, saving approximately $50,400 monthly while maintaining service levels.

Scenario 3: Multi-Environment Management

A financial services organization maintained identical 80K TPM limits across development, staging, and production environments. By implementing environment-appropriate limits (5K for development, 20K for staging, 60K for production), they reduced overall OpenAI costs by 42% without affecting application performance.
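
In Terraform, one way to express such environment-appropriate limits is to key the capacity off an environment variable. The sketch below mirrors the figures in this scenario and assumes the account resource shown earlier:

variable "environment" {
  type    = string
  default = "development"
}

locals {
  # TPM capacity (in thousands) per environment - adjust to your own usage data
  tpm_capacity = {
    development = 5
    staging     = 20
    production  = 60
  }
}

resource "azurerm_cognitive_deployment" "gpt4" {
  name                 = "gpt-4-deployment"
  cognitive_account_id = azurerm_cognitive_account.openai.id

  model {
    format  = "OpenAI"
    name    = "gpt-4"
    version = "0613"
  }

  scale {
    type     = "Standard"
    capacity = local.tpm_capacity[var.environment]
  }
}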

Considerations and Caveats

When Higher TPM Limits May Be Justified

  • Mission-critical applications: Systems where AI response time is directly tied to user experience or business outcomes
  • High-traffic consumer applications: Services with unpredictable usage spikes that must maintain responsiveness
  • Batch processing workloads: Applications that need to process large volumes of data in short time windows

Implementation Challenges

  • Accurate forecasting: Predicting appropriate TPM limits requires good historical data
  • Application design: Some applications may need refactoring to handle TPM limits gracefully
  • Regional variations: Different regions may have different TPM limit availability
  • Model changes: Upgrading to newer model versions may require TPM limit reconsideration

Risk Mitigation Strategies

  • Implement circuit breakers: Design applications to degrade gracefully when approaching TPM limits
  • Queue-based architecture: Buffer requests during usage spikes for processing when capacity is available
  • Hybrid approaches: Consider using different models with different cost structures for various use cases
  • Auto-scaling policies: Some environments support dynamic TPM scaling based on usage patterns

Frequently Asked Questions (FAQs)

What is the difference between TPM and RPM?

TPM (Tokens Per Minute) measures the number of tokens your application can process within a minute, while RPM (Requests Per Minute) refers to the number of API calls. TPM directly affects costs as you pay for token usage, while RPM is more related to the API’s technical limitations. Most cost optimization strategies focus on TPM, as it’s the primary billing metric.

How do I determine the right TPM limit for my application?

Start by analyzing your current token usage patterns. Look at average usage, peak usage, and usage variation. A general rule is to set your TPM limit at approximately 1.5-2x your average peak usage to allow for traffic spikes while not grossly overprovisioning. Infracost can help identify potential optimization opportunities in your infrastructure code.

Can setting TPM limits too low affect application performance?

Yes. Setting TPM limits too low may result in request throttling during high-traffic periods, causing increased latency or failed requests. Always test performance after adjusting TPM limits, especially for production environments. Consider implementing a queuing system for applications that may experience variable load.

Are TPM limits and pricing the same across all models?

No. Different models have different TPM limit options and pricing structures. More capable models like GPT-4 generally have higher costs per token, making TPM limit optimization even more important for these models. Always check the latest pricing documentation for your specific model.

How often should I review my TPM limits?

For most organizations, quarterly reviews are sufficient. However, consider more frequent reviews in these scenarios:

  • After significant application changes
  • When launching new features that may affect AI usage
  • During periods of rapid user growth
  • After observing any unexpected cost increases

Can I set different TPM limits for different types of operations?

Yes, this is possible through architectural decisions. For instance, you can create multiple OpenAI deployments with different TPM limits and direct different types of operations to the appropriate deployment. This approach allows for more granular cost control and resource allocation.
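
As a sketch of that pattern, a single account can host multiple deployments with different capacities, and the application routes high-priority traffic to one and background work to the other. Deployment names and capacity values below are illustrative:

# Interactive, user-facing workloads - larger limit
resource "azurerm_cognitive_deployment" "gpt4_interactive" {
  name                 = "gpt-4-interactive"
  cognitive_account_id = azurerm_cognitive_account.openai.id

  model {
    format  = "OpenAI"
    name    = "gpt-4"
    version = "0613"
  }

  scale {
    type     = "Standard"
    capacity = 20 # 20K TPM
  }
}

# Background and batch workloads - tighter limit
resource "azurerm_cognitive_deployment" "gpt4_batch" {
  name                 = "gpt-4-batch"
  cognitive_account_id = azurerm_cognitive_account.openai.id

  model {
    format  = "OpenAI"
    name    = "gpt-4"
    version = "0613"
  }

  scale {
    type     = "Standard"
    capacity = 5 # 5K TPM
  }
}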

Can Infracost help with ongoing TPM usage monitoring?

Infracost excels at identifying potential optimization opportunities in your infrastructure-as-code before deployment. For ongoing monitoring, you should use cloud-native monitoring tools like Azure Monitor or custom dashboards. Infracost allows you to scan your code for this policy and many others to prevent cost issues before they occur.