# Statistical Aggregations

Mathematical aggregations are intended to work over sets of data rather than individual points. For example, standard deviation is an "aggregate" because it is returning an "aggregate" value on a set of data.

### argMax

`argMax` returns the value for `column` in the row that contains the maximum value for `metric`. For example, if there were a table with teams and wins, the team that had won the most games would be returned for `argMax(team, wins)`.

Parameters:

• `column` - the column value you want returned.
• `metric` - the column that has the values you are trying to find the maximum for.

Usage:

• `argMax(column, metric)`

Returns:

• The value in the column for which the metric is the highest. If several different columns have the same metric value, the first one will be returned.

### argMin

argMin returns the value for `column` in the row that contains the mimimum value for `metric`. For example, if there were a table with teams and wins, the team that had won the fewest games would be returned for `argMin(team, wins)`.

Parameters:

• `column` - the column value you want returned.
• `metric` - the column that has the values you are trying to find the maximum for.

Usage:

• `argMin(column, metric)`

Returns:

• The value in the column for which the metric is the lowest. If several different columns have the same metric value, the first one will be returned.

### avg

avg returns the average on the data that is part of the set passed into the function.

Parameters:

• `set` - the data set to find the average of. This can be a column name in a table.

Usage:

• `avg(set)`

Returns:

• A double that is the average.

### categoricalInformationValue

`categoricalInformationValue` Clickhouse docs  ### corr

corr calculates the correlation of two sets of data using `Σ((x - x̅)(y - y̅)) / sqrt(Σ((x - x̅)^2) * Σ((y - y̅)^2))`

Parameters:

• `x` - the first set of data
• `y` - the second set of data

Usage:

• `corr(x,y)`

Returns:

• A double between -1 and +1.
• -1 indicates negative correlation
• 0 indicates no correlation
• +1 indicates positive correlation

### count

count calculates the number of rows in a set that have non-null values.

Parameters:

• `expression` - the set of rows to count. Expression can be the result of another function or column name.
• `DISTINCT` is used to only count the number of unique values rather than every value. Using `DISTINCT` is optional.

Usage:

• `count([DISTINCT] expression)`

Returns:

• The number of rows in the collection.

• If `DISTINCT` is used, only the number of unique values are counted. This behavior is the same as uniqExact

``````    **see also [Cardinality functions](/docs/cardinality#uniqexact)**
``````

### covarPop

covarPop calculates the joint probability of two parameters across a population. Specifically, it calculates `Σ((x - x̅)(y - y̅)) / n` where `n` is the size of the population.

Parameters:

• `x,y` - the two parameters to calculate covariance over.

Usage:

• `covarPop(x,y)`

Returns:

• The covariance between x and y of the population. If the value is positive, the parameters are likely related. If the covariance is negative, they are unlikely related. The value itself does not carry any significance.

### covarSamp

covarSamp calculates the joint probability of two parameters across a sample. Specifically, it calculates `Σ((x - x̅)(y - y̅)) / n-1` where `n` is the size of the sample.

Parameters:

• `x,y` - the two parameters to calculate covariance over.

Usage:

• `covarSamp(x,y)`

Returns:

• The covariance between x and y in the sample. If the value is positive, the parameters are likely related. If the covariance is negative, they are unlikely related. The value itself does not carry any significance.

### histogram

histogram separates data into a number of bins specified in the arguments that represent the distribution of data in the set.

Parameters:

• `number_of_bins` - how many groups the function should divide the data into
• `data` - the values to separate into the bins

Usage:

• `histogram(number_of_bins)(data)`

Returns:

• An array of tuples representing the histogram. Each tuple contains `lower_bound, upper_bound, height`. The upper bound of one tuple should be the lower bound of the tuple directly after it.

### kurtPop, kurtSamp

`kurtPop` and `kurtSamp` calculate the kurtosis of the population and sample respectively. Kurtosis is a measurement of outliers. The higher the kurtosis, the more outliers exists, and should be investigated.

Excess kurtosis means values <>3.

Parameters:

• `expression` - the set to operate on, typically a column name.

Usage:

• `kurtPop(expression)`

Returns:

• `Double` representing the kurtosis. Positive values indicate a more narrow distribution than a normal distribution. Negative values indicate a wider distribution, meaning more outliers exist.

### max

`max` returns the maximum value in a set, typically a column.

Parameters:

• `set` - the set to operate on, typically a column name.

Usage:

• `max(set)`

Returns:

• a single value, the same type as the types in `set`.

### median

`median` returns the median, which is the value that separates the upper half and lower half of the data set.

Parameters:

• `set` - the set to operate on, typically a column name.

Usage:

• `median(set)`

Returns:

• a single value, the same type as the types in `set`.

### min

`min` returns the minimum value in a set, typically a column.

Parameters:

• `set` - the set to operate on, typically a column name.

Usage:

• `min(set)`

Returns:

• a single value, the same type as the types in `set`.

### quantile

`quantile` approximates the value in the set for which `percent`% of the values in the set fall below. Quantiles help determine the probability of the distribution of a set.

Uses reservoir sampling for sampling.

Parameters:

• `percent` - A `Double` between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile. `0` is equivalent to `min` and `1` is equivalent to `max`
• `set` - the set to operate on, typically a column name.

Usage:

• `quantile(percent)(set)`

Returns:

• a single value, the same type as the types in `set`.

### quantiles

`quantiles` approximates more than one quantile value in a single call, for which `percent`% of the values in the set fall below. Quantiles help determine the probability of the distribution of a set.

Uses reservoir sampling for sampling.

Parameters:

• `percent1` - A `Double` between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile. `0` is equivalent to `min` and `1` is equivalent to `max`
• `percentN` - A `Double` between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile. `0` is equivalent to `min` and `1` is equivalent to `max`
• `set` - the set to operate on, typically a column name.

Usage:

• `quantiles(percent1, percent2, percentN)(set)`

Returns:

• an array containing the quantile values in the order they were requested in the function call. The type is the same type as the data set.

### quantileDeterministic

`quantileDeterministic` approximates the value in the set for which `percent`% of the values in the set fall below. Quantiles help determine the probability of the distribution of a set.

Reservoir sampling uses a random number to sample. `quantileDeterministic` takes an extra argument, which replaces the random number. This gives the same result each time on the data set.

Parameters:

• `percent` - A `Double` between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile. `0` is equivalent to `min` and `1` is equivalent to `max`
• `set` - the set to operate on, typically a column name.
• `seed` - the number to use instead of a random number in sampline

Usage:

• `quantileDeterministic(percent)(set, seed)`

Returns:

• a single value, the same type as the types in `set`.

### quantileExact

`quantileExact` calculates the exact quantile value in the set for which `percent`% of the values in the set fall below. Quantiles help determine the probability of the distribution of a set.

No sampling is done. The quantile is calculated on all values.

Parameters:

• `percent` - A `Double` between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile. `0` is equivalent to `min` and `1` is equivalent to `max`
• `set` - the set to operate on, typically a column name.

Usage:

• `quantileExact(percent)(set)`

Returns:

• a single value, the same type as the types in `set`.

### skewPop, skewSamp

`skew` calculates the shape, or asymmetry, of a set of values. Skew helps determine the relationship between the average, max, and min of a set. If most of the values are above the average, then the skew is negative, or `left skewed`, meaning the left tail is longer. If most values fall below the average, the set is said to have positive skew, or 'right tailed' meaning that the righ tail is longer.

`skewPop` calculates the skew of the entire data set, or population. `skewSamp` calculates the skew of a sample of the data.

Parameters:

• `distribution` - the set for which the skew is needed.

Usage:

• `skewSamp(distribution)`
• `skewPop(distribution)`

Returns:

• `Float64` that is the skewness of the population or sample.

### stddevPop, stddevSamp

`stddev` calculates the standard deviation of a set of values, which is the width of the distribution, or the range of the values, of a data set. Standard deviation helps calculate 'confidence' in a statistical conclusion.

`stddevSamp` calculates the standard deviation on a sample, where `stddevPop` calculates the standard deviation on the entire data set, or population.

Parameters:

• `distribution` - the set for which the skew is needed.

Usage:

• `stddevPop(distribution)`
• `stddevSamp(distribution)`

Returns:

• `Float64` that is the standard deviation of the population or sample.

### sum

`sum` calculates the sum of all vaules in a set.

Parameters:

• `set` - the data set to sum. Usually a column. Must be a numeric type.

Usage:

• `sum(set)`

Returns:

• A number that is the sum of the set, the same data type as the set.

### sumWithOverflow

`sumWithOverflow` calculates the sum of all vaules in a set, returning the same data type as the set. If the sum is greater than the maximum value for the data type, an error is returned.

Parameters:

• `set` - the data set to sum. Usually a column. Must be a numeric type.

Usage:

• `sumWithOverflow(set)`

Returns:

• A number that is the sum of the set, the same data type as the set, or an error if the sum is great than the maximum value for the data type.

### topK, topKWeighted

`topK` returns the `N` most frequently occurring values in a column.

Parameters:

• `N` - number of values to return.
• `column` - the column to operate on
• `weight` (Optional) - weight to use for each column. See topKWeighted for more detail.

Usage:

• `topK(N)(column)`
• `topKWeighted(N)(column, weight)`

Returns:

• An array of the top N most occurring values, in descending order of frequency.

### varPop, varSamp

`var` calculates the variance, or how far the distribution of the set of values are spread out from the average. Variance is the same as standard deviation squared.

`varSamp` calculates the variance on a sample, where `varPop` calculates the variance on the entire data set, or population.

Parameters:

• `distribution` - the set for which the skew is needed.

Usage:

• `varPop(distribution)`
• `varSamp(distribution)`

Returns:

• `Float64` that is the variance of the population or sample.