Statistical Aggregations
Mathematical aggregations are intended to work over sets of data rather than individual points. For example, standard deviation is an "aggregate" because it is returning an "aggregate" value on a set of data.
argMax
argMax
returns the value for column
in the row that contains the maximum value for metric
. For example, if there were a table with teams and wins, the team that had won the most games would be returned for argMax(team, wins)
.
Parameters:
column
- the column value you want returned.metric
- the column that has the values you are trying to find the maximum for.
Usage:
argMax(column, metric)
Returns:
- The value in the column for which the metric is the highest. If several different columns have the same metric value, the first one will be returned.
argMin
argMin returns the value for column
in the row that contains the mimimum value for metric
. For example, if there were a table with teams and wins, the team that had won the fewest games would be returned for argMin(team, wins)
.
Parameters:
column
- the column value you want returned.metric
- the column that has the values you are trying to find the maximum for.
Usage:
argMin(column, metric)
Returns:
- The value in the column for which the metric is the lowest. If several different columns have the same metric value, the first one will be returned.
avg
avg returns the average on the data that is part of the set passed into the function.
Parameters:
set
- the data set to find the average of. This can be a column name in a table.
Usage:
avg(set)
Returns:
- A double that is the average.
categoricalInformationValue
categoricalInformationValue
Clickhouse docs
corr
corr calculates the correlation of two sets of data using Σ((x - x̅)(y - y̅)) / sqrt(Σ((x - x̅)^2) * Σ((y - y̅)^2))
Parameters:
x
- the first set of datay
- the second set of data
Usage:
corr(x,y)
Returns:
- A double between -1 and +1.
- -1 indicates negative correlation
- 0 indicates no correlation
- +1 indicates positive correlation
More information about correlation can be found on Wikipedia.
count
count calculates the number of rows in a set that have non-null values.
Parameters:
expression
- the set of rows to count. Expression can be the result of another function or column name.DISTINCT
is used to only count the number of unique values rather than every value. UsingDISTINCT
is optional.
Usage:
count([DISTINCT] expression)
Returns:
-
The number of rows in the collection.
-
If
DISTINCT
is used, only the number of unique values are counted. This behavior is the same as uniqExact**see also [Cardinality functions](/docs/cardinality#uniqexact)**
covarPop
covarPop calculates the joint probability of two parameters across a population. Specifically, it calculates Σ((x - x̅)(y - y̅)) / n
where n
is the size of the population.
Parameters:
x,y
- the two parameters to calculate covariance over.
Usage:
covarPop(x,y)
Returns:
- The covariance between x and y of the population. If the value is positive, the parameters are likely related. If the covariance is negative, they are unlikely related. The value itself does not carry any significance.
More information about covariance can be found on Wikipedia.
covarSamp
covarSamp calculates the joint probability of two parameters across a sample. Specifically, it calculates Σ((x - x̅)(y - y̅)) / n-1
where n
is the size of the sample.
Parameters:
x,y
- the two parameters to calculate covariance over.
Usage:
covarSamp(x,y)
Returns:
- The covariance between x and y in the sample. If the value is positive, the parameters are likely related. If the covariance is negative, they are unlikely related. The value itself does not carry any significance.
More information about covariance can be found on Wikipedia.
histogram
histogram separates data into a number of bins specified in the arguments that represent the distribution of data in the set.
Parameters:
number_of_bins
- how many groups the function should divide the data intodata
- the values to separate into the bins
Usage:
histogram(number_of_bins)(data)
Returns:
- An array of tuples representing the histogram. Each tuple contains
lower_bound, upper_bound, height
. The upper bound of one tuple should be the lower bound of the tuple directly after it.
More information about histogram can be found on Wikipedia.
kurtPop, kurtSamp
kurtPop
and kurtSamp
calculate the kurtosis of the population and sample respectively. Kurtosis is a measurement of outliers. The higher the kurtosis, the more outliers exists, and should be investigated.
Excess kurtosis means values <>3.
Parameters:
expression
- the set to operate on, typically a column name.
Usage:
kurtPop(expression)
Returns:
Double
representing the kurtosis. Positive values indicate a more narrow distribution than a normal distribution. Negative values indicate a wider distribution, meaning more outliers exist.
max
max
returns the maximum value in a set, typically a column.
Parameters:
set
- the set to operate on, typically a column name.
Usage:
max(set)
Returns:
- a single value, the same type as the types in
set
.
median
median
returns the median, which is the value that separates the upper half and lower half of the data set.
Parameters:
set
- the set to operate on, typically a column name.
Usage:
median(set)
Returns:
- a single value, the same type as the types in
set
.
min
min
returns the minimum value in a set, typically a column.
Parameters:
set
- the set to operate on, typically a column name.
Usage:
min(set)
Returns:
- a single value, the same type as the types in
set
.
quantile
quantile
approximates the value in the set for which percent
% of the values in the set fall below. Quantiles help determine the probability of the distribution of a set.
Uses reservoir sampling for sampling.
Parameters:
percent
- ADouble
between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile.0
is equivalent tomin
and1
is equivalent tomax
set
- the set to operate on, typically a column name.
Usage:
quantile(percent)(set)
Returns:
- a single value, the same type as the types in
set
.
quantiles
quantiles
approximates more than one quantile value in a single call, for which percent
% of the values in the set fall below. Quantiles help determine the probability of the distribution of a set.
Uses reservoir sampling for sampling.
Parameters:
percent1
- ADouble
between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile.0
is equivalent tomin
and1
is equivalent tomax
percentN
- ADouble
between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile.0
is equivalent tomin
and1
is equivalent tomax
set
- the set to operate on, typically a column name.
Usage:
quantiles(percent1, percent2, percentN)(set)
Returns:
- an array containing the quantile values in the order they were requested in the function call. The type is the same type as the data set.
quantileDeterministic
quantileDeterministic
approximates the value in the set for which percent
% of the values in the set fall below. Quantiles help determine the probability of the distribution of a set.
Reservoir sampling uses a random number to sample. quantileDeterministic
takes an extra argument, which replaces the random number. This gives the same result each time on the data set.
Parameters:
percent
- ADouble
between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile.0
is equivalent tomin
and1
is equivalent tomax
set
- the set to operate on, typically a column name.seed
- the number to use instead of a random number in sampline
Usage:
quantileDeterministic(percent)(set, seed)
Returns:
- a single value, the same type as the types in
set
.
quantileExact
quantileExact
calculates the exact quantile value in the set for which percent
% of the values in the set fall below. Quantiles help determine the probability of the distribution of a set.
No sampling is done. The quantile is calculated on all values.
Parameters:
percent
- ADouble
between 0 and 1 representing he quantile to calculate. For example, .95 means 95th quantile.0
is equivalent tomin
and1
is equivalent tomax
set
- the set to operate on, typically a column name.
Usage:
quantileExact(percent)(set)
Returns:
- a single value, the same type as the types in
set
.
skewPop, skewSamp
skew
calculates the shape, or asymmetry, of a set of values. Skew helps determine the relationship between the average, max, and min of a set. If most of the values are above the average, then the skew is negative, or left skewed
, meaning the left tail is longer. If most values fall below the average, the set is said to have positive skew, or 'right tailed' meaning that the righ tail is longer.
skewPop
calculates the skew of the entire data set, or population. skewSamp
calculates the skew of a sample of the data.
Parameters:
distribution
- the set for which the skew is needed.
Usage:
skewSamp(distribution)
skewPop(distribution)
Returns:
Float64
that is the skewness of the population or sample.
stddevPop, stddevSamp
stddev
calculates the standard deviation of a set of values, which is the width of the distribution, or the range of the values, of a data set. Standard deviation helps calculate 'confidence' in a statistical conclusion.
stddevSamp
calculates the standard deviation on a sample, where stddevPop
calculates the standard deviation on the entire data set, or population.
Parameters:
distribution
- the set for which the skew is needed.
Usage:
stddevPop(distribution)
stddevSamp(distribution)
Returns:
Float64
that is the standard deviation of the population or sample.
sum
sum
calculates the sum of all vaules in a set.
Parameters:
set
- the data set to sum. Usually a column. Must be a numeric type.
Usage:
sum(set)
Returns:
- A number that is the sum of the set, the same data type as the set.
sumMap
sumWithOverflow
sumWithOverflow
calculates the sum of all vaules in a set, returning the same data type as the set. If the sum is greater than the maximum value for the data type, an error is returned.
Parameters:
set
- the data set to sum. Usually a column. Must be a numeric type.
Usage:
sumWithOverflow(set)
Returns:
- A number that is the sum of the set, the same data type as the set, or an error if the sum is great than the maximum value for the data type.
topK, topKWeighted
topK
returns the N
most frequently occurring values in a column.
Parameters:
N
- number of values to return.column
- the column to operate onweight
(Optional) - weight to use for each column. See topKWeighted for more detail.
Usage:
topK(N)(column)
topKWeighted(N)(column, weight)
Returns:
- An array of the top N most occurring values, in descending order of frequency.
varPop, varSamp
var
calculates the variance, or how far the distribution of the set of values are spread out from the average. Variance is the same as standard deviation squared.
varSamp
calculates the variance on a sample, where varPop
calculates the variance on the entire data set, or population.
Parameters:
distribution
- the set for which the skew is needed.
Usage:
varPop(distribution)
varSamp(distribution)
Returns:
Float64
that is the variance of the population or sample.
Updated 9 months ago