Difference between revisions of "Ranking vectors of data"

From Lazarus wiki
Jump to navigationJump to search
(→‎See also: Adding references)
(→‎Simple algorithm not caring for ties: fixing typo in line 14 of the code (originally reported on Twitter by Jesús Gómez))
 
(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
'''Ranking''' is an important method in statistics and data science. It is a preparative step that replaces each value by a rank in the order of magnitude of the original data. This operation is necessary for many non-parametric methods, e.g. the U test by Wilcoxon-Mann-Whitney or Spearman's rank correlation.
 
'''Ranking''' is an important method in statistics and data science. It is a preparative step that replaces each value by a rank in the order of magnitude of the original data. This operation is necessary for many non-parametric methods, e.g. the U test by Wilcoxon-Mann-Whitney or Spearman's rank correlation.
  
== Simple algorithms not caring for ties ==
+
== Simple algorithm not caring for ties ==
 
In the simplest case, where every value is distinct without identical data, the following frugal algorithm may be sufficient:
 
In the simplest case, where every value is distinct without identical data, the following frugal algorithm may be sufficient:
  
Line 16: Line 16:
 
     lessCount := 0;
 
     lessCount := 0;
 
     for j := 0 to n - 1 do
 
     for j := 0 to n - 1 do
       if x.data[j] < x.data[i] then
+
       if x[j] < x[i] then
 
         inc(lessCount);
 
         inc(lessCount);
       result[i] := lessCount[i] + 1;
+
       result[i] := lessCount + 1;
 
   end;
 
   end;
 
end;
 
end;
Line 34: Line 34:
 
   TiesMethod: TTiesMethod = average): array of real;
 
   TiesMethod: TTiesMethod = average): array of real;
 
var
 
var
   i, j, k, n: integer;
+
   i, j, n: integer;
 
   equalCount, lessCount: array of integer;
 
   equalCount, lessCount: array of integer;
 
begin
 
begin
Line 95: Line 95:
 
[[Category:Mathematics]]
 
[[Category:Mathematics]]
 
[[Category:Statistical algorithms]]
 
[[Category:Statistical algorithms]]
[[Category:Modelling and Simulation]]
 
 
[[Category: Code]]
 
[[Category: Code]]

Latest revision as of 09:48, 7 March 2021

Ranking is an important method in statistics and data science. It is a preparative step that replaces each value by a rank in the order of magnitude of the original data. This operation is necessary for many non-parametric methods, e.g. the U test by Wilcoxon-Mann-Whitney or Spearman's rank correlation.

Simple algorithm not caring for ties

In the simplest case, where every value is distinct without identical data, the following frugal algorithm may be sufficient:

function rank(x: array of longint): array of real;
var
  i, j, n: integer;
  lessCount: integer; { number of observations less than a given observation }
begin
  n := length(x);
  SetLength(result, n);
  for i := 0 to n - 1 do
  begin
    lessCount := 0;
    for j := 0 to n - 1 do
      if x[j] < x[i] then
        inc(lessCount);
      result[i] := lessCount + 1;
  end;
end;

This function delivers a list of ranks that replace the original data. It doesn't deliver satisfactory results if the data contain identical values, e.g. in the vector 3, 4, 4, 9, 0, 1. In statistical terminology, repeated observation-values are referred to as ties. The potential presence of ties requires a more complex algorithm:

Advanced algorithm addressing potential ties

type
  TTiesMethod = (ascend, descend, average, min, max);
  
function rank(x: array of longint; 
  TiesMethod: TTiesMethod = average): array of real;
var
  i, j, n: integer;
  equalCount, lessCount: array of integer;
begin
  n := length(x);
  SetLength(lessCount, n);
  SetLength(equalCount, n);
  SetLength(Result, n);
  for i := 0 to n - 1 do
  begin
    equalCount[i] := 0;
    lessCount[i] := 0;
    for j := 0 to n - 1 do
      case TiesMethod of
        average, min, max:
          if x[j] = x[i] then
            Inc(equalCount[i])
          else if x[j] < x[i] then
            Inc(lessCount[i]);
        ascend:
          if (j <= i) and (x[j] = x[i]) then
            Inc(equalCount[i])
          else if x[j] < x[i] then
            Inc(lessCount[i]);
        descend:
          if (j >= i) and (x[j] = x[i]) then
            Inc(equalCount[i])
          else if x[j] < x[i] then
            Inc(lessCount[i]);
      end;
  end;
  for i := 0 to n - 1 do
    case TiesMethod of
      average:
        Result[i] := lessCount[i] + (equalCount[i] + 1) / 2;
      min:
        Result[i] := lessCount[i] + 1;
      max, ascend, descend:
        Result[i] := lessCount[i] + equalCount[i];
    end;
end;

This function delivers a vector of ranks that is corrected for ties. Similar to the function rank of the statistical programming language S (e.g. in the environment R) it supports different methods to address potential ties. They include average (default method, the rank for each occurrence of a repeated number is given as rank[i] = lessCount + (equalCount + 1) / 2), ascend (permutation with increasing values at each index set of ties), descend (decreasing values), min (replaces values with ties with their minimum) and max (replacement by maximum).

See also

References

  1. Cooke D, Craven AH, Clarke GM. Statistical Computing in Pascal. Edward Arnold (Publishers), London 1985
  2. Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
  3. R Documentation: Sample Ranks.