Resource Limits

Overview

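The limits configuration sets the default limits enforced by Cortex services such as the distributor and the ingester. These defaults apply to every tenant; for example, you can set a default ingestion rate (ingestion_rate) or a default maximum number of label names per series (max_label_names_per_series).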

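You can also set limits for individual tenants. A tenant-specific limit overrides the default and applies only to that tenant, so each tenant can receive limits that match its needs while resource allocation stays fair.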

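By tuning limits_config you can manage and adjust the limits of each Cortex service to match your requirements and constraints. These limits protect the Cortex cluster from abuse and overload while keeping resource allocation and performance predictable.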
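The relationship between default limits and tenant-specific limits can be sketched in Go as follows. This is a minimal illustration, not Cortex's implementation: the Overrides type, the LimitsFor method, and the tenant IDs are hypothetical, and the sketch assumes that a tenant entry replaces the defaults as a whole.

```go
package main

import "fmt"

// Limits is a trimmed-down, hypothetical stand-in for the real Limits struct.
type Limits struct {
	IngestionRate      float64
	IngestionBurstSize int
}

// Overrides holds the default limits plus optional per-tenant limits.
// Assumption: a tenant entry replaces the defaults entirely.
type Overrides struct {
	defaults  Limits
	perTenant map[string]Limits
}

// LimitsFor returns the tenant-specific limits if present,
// otherwise the defaults that apply to all tenants.
func (o Overrides) LimitsFor(tenantID string) Limits {
	if l, ok := o.perTenant[tenantID]; ok {
		return l
	}
	return o.defaults
}

func main() {
	o := Overrides{
		defaults: Limits{IngestionRate: 25000, IngestionBurstSize: 50000},
		perTenant: map[string]Limits{
			"tenant-a": {IngestionRate: 100000, IngestionBurstSize: 200000},
		},
	}
	fmt.Println(o.LimitsFor("tenant-a").IngestionRate) // tenant-specific value
	fmt.Println(o.LimitsFor("tenant-b").IngestionRate) // falls back to the default
}
```

The lookup keeps the default path cheap: a tenant without an entry costs only a single map miss.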

Configuration Example

Official documentation

limits:
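  # Per-tenant ingestion rate limit, in samples per second.
  # When exceeded: HTTP 429 with the response "ingestion rate limit (25000) exceeded"
  ingestion_rate: 25000
  # Whether the ingestion rate limit is applied to each distributor instance (local)
  # or divided evenly across the whole cluster (global)
  ingestion_rate_strategy: local
  # Per-tenant allowed burst size, in number of samples
  ingestion_burst_size: 50000
  # Flag that enables, for all users, handling of samples carrying the external labels
  # that identify replicas in a high-availability Prometheus setup
  accept_ha_samples: false
  # Label whose value is extracted from the metric as the cluster identifier
  ha_cluster_label: cluster
  # Label whose value is extracted from the metric as the replica identifier
  ha_replica_label: __replica__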
  ha_max_clusters: 0
  drop_labels: []
  max_label_name_length: 1024
  max_label_value_length: 2048
  max_label_names_per_series: 30
  max_labels_size_bytes: 0
  max_metadata_length: 1024
  reject_old_samples: false
  reject_old_samples_max_age: 2w
  creation_grace_period: 10m
  enforce_metadata_metric_name: true
  enforce_metric_name: true
  ingestion_tenant_shard_size: 0
  max_series_per_query: 100000
  max_series_per_user: 5000000
  max_series_per_metric: 50000
  max_global_series_per_user: 0
  max_global_series_per_metric: 0
  max_metadata_per_user: 8000
  max_metadata_per_metric: 10
  max_global_metadata_per_user: 0
  max_global_metadata_per_metric: 0
  max_fetched_chunks_per_query: 2000000
  max_fetched_series_per_query: 0
  max_fetched_chunk_bytes_per_query: 0
  max_fetched_data_bytes_per_query: 0
  max_query_lookback: 0s
  max_query_length: 0s
  max_query_parallelism: 14
  max_cache_freshness: 1m
  max_queriers_per_tenant: 0
  query_vertical_shard_size: 0
  ruler_evaluation_delay_duration: 0s
  ruler_tenant_shard_size: 0
  ruler_max_rules_per_rule_group: 0
  ruler_max_rule_groups_per_tenant: 0
  store_gateway_tenant_shard_size: 0
  compactor_blocks_retention_period: 0s
  compactor_tenant_shard_size: 0
  s3_sse_type: ""
  s3_sse_kms_key_id: ""
  s3_sse_kms_encryption_context: ""
  alertmanager_receivers_firewall_block_cidr_networks: ""
  alertmanager_receivers_firewall_block_private_addresses: false
  alertmanager_notification_rate_limit: 0
  alertmanager_notification_rate_limit_per_integration: {}
  alertmanager_max_config_size_bytes: 0
  alertmanager_max_templates_count: 0
  alertmanager_max_template_size_bytes: 0
  alertmanager_max_dispatcher_aggregation_groups: 0
  alertmanager_max_alerts_count: 0
  alertmanager_max_alerts_size_bytes: 0

Data Structures

github.com/cortexproject/cortex/pkg/util/validation/limits.go

Limits

// Limits describe all the limits for users; can be used to describe global default
// limits via flags, or per-user limits via yaml config.
type Limits struct {
    // Distributor enforced limits.
    IngestionRate             float64             `yaml:"ingestion_rate" json:"ingestion_rate"`
    IngestionRateStrategy     string              `yaml:"ingestion_rate_strategy" json:"ingestion_rate_strategy"`
    IngestionBurstSize        int                 `yaml:"ingestion_burst_size" json:"ingestion_burst_size"`
    AcceptHASamples           bool                `yaml:"accept_ha_samples" json:"accept_ha_samples"`
    HAClusterLabel            string              `yaml:"ha_cluster_label" json:"ha_cluster_label"`
    HAReplicaLabel            string              `yaml:"ha_replica_label" json:"ha_replica_label"`
    HAMaxClusters             int                 `yaml:"ha_max_clusters" json:"ha_max_clusters"`
    DropLabels                flagext.StringSlice `yaml:"drop_labels" json:"drop_labels"`
    MaxLabelNameLength        int                 `yaml:"max_label_name_length" json:"max_label_name_length"`
    MaxLabelValueLength       int                 `yaml:"max_label_value_length" json:"max_label_value_length"`
    MaxLabelNamesPerSeries    int                 `yaml:"max_label_names_per_series" json:"max_label_names_per_series"`
    MaxLabelsSizeBytes        int                 `yaml:"max_labels_size_bytes" json:"max_labels_size_bytes"`
    MaxMetadataLength         int                 `yaml:"max_metadata_length" json:"max_metadata_length"`
    RejectOldSamples          bool                `yaml:"reject_old_samples" json:"reject_old_samples"`
    RejectOldSamplesMaxAge    model.Duration      `yaml:"reject_old_samples_max_age" json:"reject_old_samples_max_age"`
    CreationGracePeriod       model.Duration      `yaml:"creation_grace_period" json:"creation_grace_period"`
    EnforceMetadataMetricName bool                `yaml:"enforce_metadata_metric_name" json:"enforce_metadata_metric_name"`
    EnforceMetricName         bool                `yaml:"enforce_metric_name" json:"enforce_metric_name"`
    IngestionTenantShardSize  int                 `yaml:"ingestion_tenant_shard_size" json:"ingestion_tenant_shard_size"`
    MetricRelabelConfigs      []*relabel.Config   `yaml:"metric_relabel_configs,omitempty" json:"metric_relabel_configs,omitempty" doc:"nocli|description=List of metric relabel configurations. Note that in most situations, it is more effective to use metrics relabeling directly in the Prometheus server, e.g. remote_write.write_relabel_configs."`
    MaxExemplars              int                 `yaml:"max_exemplars" json:"max_exemplars"`

    // Ingester enforced limits.
    // Series
    MaxSeriesPerQuery        int `yaml:"max_series_per_query" json:"max_series_per_query"`
    MaxLocalSeriesPerUser    int `yaml:"max_series_per_user" json:"max_series_per_user"`
    MaxLocalSeriesPerMetric  int `yaml:"max_series_per_metric" json:"max_series_per_metric"`
    MaxGlobalSeriesPerUser   int `yaml:"max_global_series_per_user" json:"max_global_series_per_user"`
    MaxGlobalSeriesPerMetric int `yaml:"max_global_series_per_metric" json:"max_global_series_per_metric"`

    // Metadata
    MaxLocalMetricsWithMetadataPerUser  int `yaml:"max_metadata_per_user" json:"max_metadata_per_user"`
    MaxLocalMetadataPerMetric           int `yaml:"max_metadata_per_metric" json:"max_metadata_per_metric"`
    MaxGlobalMetricsWithMetadataPerUser int `yaml:"max_global_metadata_per_user" json:"max_global_metadata_per_user"`
    MaxGlobalMetadataPerMetric          int `yaml:"max_global_metadata_per_metric" json:"max_global_metadata_per_metric"`
    // Out-of-order
    OutOfOrderTimeWindow model.Duration `yaml:"out_of_order_time_window" json:"out_of_order_time_window"`

    // Querier enforced limits.
    MaxChunksPerQuery            int            `yaml:"max_fetched_chunks_per_query" json:"max_fetched_chunks_per_query"`
    MaxFetchedSeriesPerQuery     int            `yaml:"max_fetched_series_per_query" json:"max_fetched_series_per_query"`
    MaxFetchedChunkBytesPerQuery int            `yaml:"max_fetched_chunk_bytes_per_query" json:"max_fetched_chunk_bytes_per_query"`
    MaxFetchedDataBytesPerQuery  int            `yaml:"max_fetched_data_bytes_per_query" json:"max_fetched_data_bytes_per_query"`
    MaxQueryLookback             model.Duration `yaml:"max_query_lookback" json:"max_query_lookback"`
    MaxQueryLength               model.Duration `yaml:"max_query_length" json:"max_query_length"`
    MaxQueryParallelism          int            `yaml:"max_query_parallelism" json:"max_query_parallelism"`
    MaxCacheFreshness            model.Duration `yaml:"max_cache_freshness" json:"max_cache_freshness"`
    MaxQueriersPerTenant         int            `yaml:"max_queriers_per_tenant" json:"max_queriers_per_tenant"`
    QueryVerticalShardSize       int            `yaml:"query_vertical_shard_size" json:"query_vertical_shard_size" doc:"hidden"`

    // Query Frontend / Scheduler enforced limits.
    MaxOutstandingPerTenant int `yaml:"max_outstanding_requests_per_tenant" json:"max_outstanding_requests_per_tenant"`

    // Ruler defaults and limits.
    RulerEvaluationDelay        model.Duration `yaml:"ruler_evaluation_delay_duration" json:"ruler_evaluation_delay_duration"`
    RulerTenantShardSize        int            `yaml:"ruler_tenant_shard_size" json:"ruler_tenant_shard_size"`
    RulerMaxRulesPerRuleGroup   int            `yaml:"ruler_max_rules_per_rule_group" json:"ruler_max_rules_per_rule_group"`
    RulerMaxRuleGroupsPerTenant int            `yaml:"ruler_max_rule_groups_per_tenant" json:"ruler_max_rule_groups_per_tenant"`

    // Store-gateway.
    StoreGatewayTenantShardSize int `yaml:"store_gateway_tenant_shard_size" json:"store_gateway_tenant_shard_size"`

    // Compactor.
    CompactorBlocksRetentionPeriod model.Duration `yaml:"compactor_blocks_retention_period" json:"compactor_blocks_retention_period"`
    CompactorTenantShardSize       int            `yaml:"compactor_tenant_shard_size" json:"compactor_tenant_shard_size"`

    // This config doesn't have a CLI flag registered here because they're registered in
    // their own original config struct.
    S3SSEType                 string `yaml:"s3_sse_type" json:"s3_sse_type" doc:"nocli|description=S3 server-side encryption type. Required to enable server-side encryption overrides for a specific tenant. If not set, the default S3 client settings are used."`
    S3SSEKMSKeyID             string `yaml:"s3_sse_kms_key_id" json:"s3_sse_kms_key_id" doc:"nocli|description=S3 server-side encryption KMS Key ID. Ignored if the SSE type override is not set."`
    S3SSEKMSEncryptionContext string `yaml:"s3_sse_kms_encryption_context" json:"s3_sse_kms_encryption_context" doc:"nocli|description=S3 server-side encryption KMS encryption context. If unset and the key ID override is set, the encryption context will not be provided to S3. Ignored if the SSE type override is not set."`

    // Alertmanager.
    AlertmanagerReceiversBlockCIDRNetworks     flagext.CIDRSliceCSV `yaml:"alertmanager_receivers_firewall_block_cidr_networks" json:"alertmanager_receivers_firewall_block_cidr_networks"`
    AlertmanagerReceiversBlockPrivateAddresses bool                 `yaml:"alertmanager_receivers_firewall_block_private_addresses" json:"alertmanager_receivers_firewall_block_private_addresses"`

    NotificationRateLimit               float64                  `yaml:"alertmanager_notification_rate_limit" json:"alertmanager_notification_rate_limit"`
    NotificationRateLimitPerIntegration NotificationRateLimitMap `yaml:"alertmanager_notification_rate_limit_per_integration" json:"alertmanager_notification_rate_limit_per_integration"`

    AlertmanagerMaxConfigSizeBytes             int `yaml:"alertmanager_max_config_size_bytes" json:"alertmanager_max_config_size_bytes"`
    AlertmanagerMaxTemplatesCount              int `yaml:"alertmanager_max_templates_count" json:"alertmanager_max_templates_count"`
    AlertmanagerMaxTemplateSizeBytes           int `yaml:"alertmanager_max_template_size_bytes" json:"alertmanager_max_template_size_bytes"`
    AlertmanagerMaxDispatcherAggregationGroups int `yaml:"alertmanager_max_dispatcher_aggregation_groups" json:"alertmanager_max_dispatcher_aggregation_groups"`
    AlertmanagerMaxAlertsCount                 int `yaml:"alertmanager_max_alerts_count" json:"alertmanager_max_alerts_count"`
    AlertmanagerMaxAlertsSizeBytes             int `yaml:"alertmanager_max_alerts_size_bytes" json:"alertmanager_max_alerts_size_bytes"`
}

flagext.StringSlice

github.com/cortexproject/cortex/pkg/util/flagext/

// StringSlice is a slice of strings that implements flag.Value
type StringSlice []string

relabel.Config

github.com/prometheus/prometheus/model/relabel/

flagext.CIDRSliceCSV

// CIDRSliceCSV is a slice of CIDRs that is parsed from a comma-separated string.
// It implements flag.Value and yaml Marshalers.
type CIDRSliceCSV []CIDR

// CIDR is a network CIDR.
type CIDR struct {
    Value *net.IPNet
}

NotificationRateLimitMap

type NotificationRateLimitMap map[string]float64

Use Cases

High-availability client writes

Server-side Cortex limits configuration:

distributor:
  ha_tracker:
    enable_ha_tracker: true
    ha_tracker_update_timeout: 15s
    ha_tracker_update_timeout_jitter_max: 5s
    ha_tracker_failover_timeout: 30s
    kvstore:
      store: etcd
      prefix: /cortex/ha-tracker/
      etcd:
        endpoints:
        - 192.168.31.201:2379
        dial_timeout: 10s
        max_retries: 10

limits:
  accept_ha_samples: true 
  ha_cluster_label: cluster
  ha_replica_label: __replica__
  • The clients report the following data
pod1 -> metric_name{cluster="cluster1", __replica__="test1"}
pod2 -> metric_name{cluster="cluster1", __replica__="test2"}

Because pod1 and pod2 belong to the same cluster "cluster1" but report different replicas, "test1" and "test2", samples from only one replica are accepted at any given time.

Example rejection log:

replicas did not mach, rejecting sample: replica=test2, elected=test1

Accepting late (out-of-order) samples

  • Server-side Cortex limits configuration
limits:
  out_of_order_time_window: 30m
  • The clients report the following data

Current time on the Cortex server: "2023-09-08T11:40:00"

The following samples are accepted:

metric_name1{} 2023-09-08T11:20:00
metric_name1{} 2023-09-08T11:30:00
metric_name1{} 2023-09-08T11:40:00

The following samples are rejected:

metric_name1{} 2023-09-08T10:50:00
metric_name1{} 2023-09-08T11:10:00



Last modified 2023.09.24: refactor: update cortex (ba4ddf9)