Statistical Aggregations

Mathematical aggregations are intended to work over sets of data rather than individual points. For example, standard deviation is an "aggregate" because it is returning an "aggregate" value on a set of data.

argMax

argMax returns the value for column in the row that contains the maximum value for metric. For example, if there were a table with teams and wins, the team that had won the most games would be returned for argMax(team, wins).

Parameters:

  • column - the column value you want returned.
  • metric - the column that has the values you are trying to find the maximum for.

Usage:

  • argMax(column, metric)

Returns:

  • The value in the column for which the metric is the highest. If several different columns have the same metric value, the first one will be returned.

argMin

argMin returns the value for column in the row that contains the mimimum value for metric. For example, if there were a table with teams and wins, the team that had won the fewest games would be returned for argMin(team, wins).

Parameters:

  • column - the column value you want returned.
  • metric - the column that has the values you are trying to find the maximum for.

Usage:

  • argMin(column, metric)

Returns:

  • The value in the column for which the metric is the lowest. If several different columns have the same metric value, the first one will be returned.

avg

avg returns the average on the data that is part of the set passed into the function.

Parameters:

  • set - the data set to find the average of. This can be a column name in a table.

Usage:

  • avg(set)

Returns:

  • A double that is the average.

categoricalInformationValue

categoricalInformationValue Clickhouse docs

Clickhouse DocsClickhouse Docs

corr

corr calculates the correlation of two sets of data using Σ((x - x̅)(y - y̅)) / sqrt(Σ((x - x̅)^2) * Σ((y - y̅)^2))

Parameters:

  • x - the first set of data
  • y - the second set of data

Usage:

  • corr(x,y)

Returns:

  • A double between -1 and +1.
  • -1 indicates negative correlation
  • 0 indicates no correlation
  • +1 indicates positive correlation

More information about correlation can be found on Wikipedia.

count

count calculates the number of rows in a set that have non-null values.

Parameters:

  • expression - the set of rows to count. Expression can be the result of another function or column name.
  • DISTINCT is used to only count the number of unique values rather than every value. Using DISTINCT is optional.

Usage:

  • count([DISTINCT] expression)

Returns:

  • The number of rows in the collection.

  • If DISTINCT is used, only the number of unique values are counted. This behavior is the same as uniqExact

        **see also [Cardinality functions](/docs/cardinality#uniqexact)**
    

covarPop

covarPop calculates the joint probability of two parameters across a population. Specifically, it calculates Σ((x - x̅)(y - y̅)) / n where n is the size of the population.

Parameters:

  • x,y - the two parameters to calculate covariance over.

Usage:

  • covarPop(x,y)

Returns:

  • The covariance between x and y of the population. If the value is positive, the parameters are likely related. If the covariance is negative, they are unlikely related. The value itself does not carry any significance.

More information about covariance can be found on Wikipedia.

covarSamp

covarSamp calculates the joint probability of two parameters across a sample. Specifically, it calculates Σ((x - x̅)(y - y̅)) / n-1 where n is the size of the sample.

Parameters:

  • x,y - the two parameters to calculate covariance over.

Usage:

  • covarSamp(x,y)

Returns:

  • The covariance between x and y in the sample. If the value is positive, the parameters are likely related. If the covariance is negative, they are unlikely related. The value itself does not carry any significance.

More information about covariance can be found on Wikipedia.

histogram

histogram separates data into a number of bins specified in the arguments that represent the distribution of data in the set.

Parameters:

  • number_of_bins - how many groups the function should divide the data into
  • data - the values to separate into the bins

Usage:

  • histogram(number_of_bins)(data)

Returns:

  • An array of tuples representing the histogram. Each tuple contains lower_bound, upper_bound, height. The upper bound of one tuple should be the lower bound of the tuple directly after it.

More information about histogram can be found on Wikipedia.

kurtPop, kurtSamp

kurtPop and kurtSamp calculate the kurtosis of the population and sample respectively. Kurtosis is a measurement of outliers. The higher the kurtosis, the more outliers exists, and should be investigated.

Excess kurtosis means values <>3.

Parameters:

  • expression - the set to operate on, typically a column name.

Usage:

  • kurtPop(expression)

Returns:

  • Double representing the kurtosis. Positive values indicate a more narrow distribution than a normal distribution. Negative values indicate a wider distribution, meaning more outliers exist.

max

max returns the maximum value in a set, typically a column.

Parameters:

  • set - the set to operate on, typically a column name.

Usage:

  • max(set)

Returns:

  • a single value, the same type as the types in set.

median

median returns the median, which is the value that separates the upper half and lower half of the data set.

Parameters:

  • set - the set to operate on, typically a column name.

Usage:

  • median(set)

Returns:

  • a single value, the same type as the types in set.

min

min returns the minimum value in a set, typically a column.

Parameters:

  • set - the set to operate on, typically a column name.

Usage:

  • min(set)

Returns:

  • a single value, the same type as the types in set.

quantile

quantile approximates the value in the set for which percent% of the values in the set fall below. Quantiles help determine the probability of the distribution of a set.

Uses reservoir sampling for sampling.

Parameters:

  • percent - A Double between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile. 0 is equivalent to min and 1 is equivalent to max
  • set - the set to operate on, typically a column name.

Usage:

  • quantile(percent)(set)

Returns:

  • a single value, the same type as the types in set.

quantiles

quantiles approximates more than one quantile value in a single call, for which percent% of the values in the set fall below. Quantiles help determine the probability of the distribution of a set.

Uses reservoir sampling for sampling.

Parameters:

  • percent1 - A Double between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile. 0 is equivalent to min and 1 is equivalent to max
  • percentN - A Double between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile. 0 is equivalent to min and 1 is equivalent to max
  • set - the set to operate on, typically a column name.

Usage:

  • quantiles(percent1, percent2, percentN)(set)

Returns:

  • an array containing the quantile values in the order they were requested in the function call. The type is the same type as the data set.

quantileDeterministic

quantileDeterministic approximates the value in the set for which percent% of the values in the set fall below. Quantiles help determine the probability of the distribution of a set.

Reservoir sampling uses a random number to sample. quantileDeterministic takes an extra argument, which replaces the random number. This gives the same result each time on the data set.

Parameters:

  • percent - A Double between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile. 0 is equivalent to min and 1 is equivalent to max
  • set - the set to operate on, typically a column name.
  • seed - the number to use instead of a random number in sampline

Usage:

  • quantileDeterministic(percent)(set, seed)

Returns:

  • a single value, the same type as the types in set.

quantileExact

quantileExact calculates the exact quantile value in the set for which percent% of the values in the set fall below. Quantiles help determine the probability of the distribution of a set.

No sampling is done. The quantile is calculated on all values.

Parameters:

  • percent - A Double between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile. 0 is equivalent to min and 1 is equivalent to max
  • set - the set to operate on, typically a column name.

Usage:

  • quantileExact(percent)(set)

Returns:

  • a single value, the same type as the types in set.

skewPop, skewSamp

skew calculates the shape, or asymmetry, of a set of values. Skew helps determine the relationship between the average, max, and min of a set. If most of the values are above the average, then the skew is negative, or left skewed, meaning the left tail is longer. If most values fall below the average, the set is said to have positive skew, or 'right tailed' meaning that the righ tail is longer.

skewPop calculates the skew of the entire data set, or population. skewSamp calculates the skew of a sample of the data.

Parameters:

  • distribution - the set for which the skew is needed.

Usage:

  • skewSamp(distribution)
  • skewPop(distribution)

Returns:

  • Float64 that is the skewness of the population or sample.

stddevPop, stddevSamp

stddev calculates the standard deviation of a set of values, which is the width of the distribution, or the range of the values, of a data set. Standard deviation helps calculate 'confidence' in a statistical conclusion.

stddevSamp calculates the standard deviation on a sample, where stddevPop calculates the standard deviation on the entire data set, or population.

Parameters:

  • distribution - the set for which the skew is needed.

Usage:

  • stddevPop(distribution)
  • stddevSamp(distribution)

Returns:

  • Float64 that is the standard deviation of the population or sample.

sum

sum calculates the sum of all vaules in a set.

Parameters:

  • set - the data set to sum. Usually a column. Must be a numeric type.

Usage:

  • sum(set)

Returns:

  • A number that is the sum of the set, the same data type as the set.

sumMap

Clickhouse sumMap docs Clickhouse DocsClickhouse Docs

sumWithOverflow

sumWithOverflow calculates the sum of all vaules in a set, returning the same data type as the set. If the sum is greater than the maximum value for the data type, an error is returned.

Parameters:

  • set - the data set to sum. Usually a column. Must be a numeric type.

Usage:

  • sumWithOverflow(set)

Returns:

  • A number that is the sum of the set, the same data type as the set, or an error if the sum is great than the maximum value for the data type.

topK, topKWeighted

topK returns the N most frequently occurring values in a column.

Clickhouse documentation Clickhouse DocsClickhouse Docs

Parameters:

  • N - number of values to return.
  • column - the column to operate on
  • weight (Optional) - weight to use for each column. See topKWeighted for more detail.

Usage:

  • topK(N)(column)
  • topKWeighted(N)(column, weight)

Returns:

  • An array of the top N most occurring values, in descending order of frequency.

varPop, varSamp

var calculates the variance, or how far the distribution of the set of values are spread out from the average. Variance is the same as standard deviation squared.

varSamp calculates the variance on a sample, where varPop calculates the variance on the entire data set, or population.

Parameters:

  • distribution - the set for which the skew is needed.

Usage:

  • varPop(distribution)
  • varSamp(distribution)

Returns:

  • Float64 that is the variance of the population or sample.

Did this page help you?