12 September 2012

Part 6: T-SQL Implementation of NORMDIST / NORM.S.DIST

This is Part 6 of a set of posts. The other posts are:
Part 1: Introduction
Part 2: Probability Density Function Implementation
Part 3: Cumulative Distribution Function Implementation 1
Part 4: Cumulative Distribution Function Implementation 2
Part 5: Cumulative Distribution Function Implementation 3
Part 6: References and Notes

References and Notes

As stated in Part 1 the implementations presented in this series of posts are my own, however I conducted no original research in their development.
In this part I list the sources I used while preparing this series of posts as well as notes and additional sources of interest.

Main Sources

Wikipedia article on the Normal Distribution - the Wikipedia article was my first stop when looking for information for these series of posts. It contains the basic information and formulas for the PDF and CDF functions, but more important the section about numerical approximations contains a reference to algorithm 26.2.17 from Abramowitz & Stegun (1964) used in Part 4
Milton Abramowitz and Irene Stegun (1964) - Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables - referred in these posts as "Abramowitz & Stegun (1964)"; this book contains a wide array of formulas and approximation in several fields. The Wikipedia article has more information about this work.
The approximations used in Part 3 and Part 4 are taken from this source.
W. J. Cody (1969) - Rational Chebyshev Approximations for the Error Function, Mathematics of Computation, Vol. 23, No. 107 (Jul., 1969), pp. 631-637 - Referred in these posts as "W. J. Cody (1969)", I came accross this paper from one of the pioneers in numerical analysis while looking for Hart (1968) (see Addtional References below) and decided to use his approximation for the "ultimate" implementation in Part 5.

Improvents in Excel 2007 and 2010 are discussed in the following sources:

The white paper in the last reference in particular is very interesting as it contains a primer on common approximation methods.

Additional References

Cecil Hastings, Jr. (1955) - Approximations for Digital Computers - this is the source for the algoritms presented in Abramowitz & Stegun (1964), thought the actual approximations in this book are for the erf function.
Part 1 of this book describes the methods used to develop the approximations.

John F. Hart, et al (1968) - Computer Approximations - this oft sited book contains a wide array of approximations for several functions including highly accurate ones for the erf function.

Part 5: T-SQL Implementation of NORMDIST / NORM.S.DIST

This is Part 5 of a set of posts. The other posts are:
Part 1: Introduction
Part 2: Probability Density Function Implementation
Part 3: Cumulative Distribution Function Implementation 1
Part 4: Cumulative Distribution Function Implementation 2
Part 5: Cumulative Distribution Function Implementation 3
Part 6: References and Notes

Cumulative Distribution Function Implementation 3

The cumulative distribution function has no closed form expression and can only be approximated (see Part 1.)
The following implementation uses a set of rational approximations from W. J. Cody (1969) (see Part 6 for full references.)

-- =============================================
-- Author:    Eli Algranti
-- Description:  Standard Normal Distribution 
--        Cumulative Distribution Function
--        using a rational polynomial 
--        approximation to erf() from
--        W. J. Cody 1969
-- Copyright:  Eli Algranti (c) 2012
--
-- This code is licensed under the Microsoft Public 
-- License (Ms-Pl) (http://tsqlnormdist.codeplex.com/license)
-- =============================================
CREATE FUNCTION [dbo].[StdNormalDistributionCDF_3] ( @x FLOAT)
RETURNS FLOAT
AS
    BEGIN
    DECLARE @Z FLOAT = ABS(@x)/SQRT(2.0);
    DECLARE @Z2 FLOAT = @Z*@Z; -- optimization
    
    IF (@Z >=11.0) -- value is too large no need to compute
    BEGIN
      IF @x > 0.0
        RETURN 1.0;
      RETURN 0.0;
    END

    -- Compute ERF using W. J. Cody 1969
    
    DECLARE @ERF FLOAT;
    
    IF (@Z <= 0.46786)
    BEGIN
      DECLARE @pA0 FLOAT = 3.209377589138469472562E03;
      DECLARE @pA1 FLOAT = 3.774852376853020208137E02;
      DECLARE @pA2 FLOAT = 1.138641541510501556495E02;
      DECLARE @pA3 FLOAT = 3.161123743870565596947E00;
      DECLARE @pA4 FLOAT = 1.857777061846031526730E-01;
      
      DECLARE @qA0 FLOAT = 2.844236833439170622273E03;
      DECLARE @qA1 FLOAT = 1.282616526077372275645E03;
      DECLARE @qA2 FLOAT = 2.440246379344441733056E02;
      DECLARE @qA3 FLOAT = 2.360129095234412093499E01;
      DECLARE @qA4 FLOAT = 1.000000000000000000000E00;

      -- For efficiency compute sequence of powers of @Z 
      -- (instead of calling POWER(@Z,2), POWER(@Z,4), etc.)
      DECLARE @ZA4 FLOAT = @Z2*@Z2;
      DECLARE @ZA6 FLOAT = @ZA4*@Z2;
      DECLARE @ZA8 FLOAT = @ZA6*@Z2;


      SELECT @ERF = @Z *
        (@pA0 + @pA1*@Z2 + @pA2*@ZA4 + @pA3*@ZA6 + @pA4*@ZA8) /
        (@qA0 + @qA1*@Z2 + @qA2*@ZA4 + @qA3*@ZA6 + @qA4*@ZA8);
    END
    ELSE IF (@Z <= 4.0)
    BEGIN
      DECLARE @pB0 FLOAT = 1.23033935479799725272E03;
      DECLARE @pB1 FLOAT = 2.05107837782607146532E03;
      DECLARE @pB2 FLOAT = 1.71204761263407058314E03;
      DECLARE @pB3 FLOAT = 8.81952221241769090411E02;
      DECLARE @pB4 FLOAT = 2.98635138197400131132E02;
      DECLARE @pB5 FLOAT = 6.61191906371416294775E01;
      DECLARE @pB6 FLOAT = 8.88314979438837594118E00;
      DECLARE @pB7 FLOAT = 5.64188496988670089180E-01;
      DECLARE @pB8 FLOAT = 2.15311535474403846343E-08;
      
      DECLARE @qB0 FLOAT = 1.23033935480374942043E03;
      DECLARE @qB1 FLOAT = 3.43936767414372163696E03;
      DECLARE @qB2 FLOAT = 4.36261909014324715820E03;
      DECLARE @qB3 FLOAT = 3.29079923573345962678E03;
      DECLARE @qB4 FLOAT = 1.62138957456669018874E03;
      DECLARE @qB5 FLOAT = 5.37181101862009857509E02;
      DECLARE @qB6 FLOAT = 1.17693950891312499305E02;
      DECLARE @qB7 FLOAT = 1.57449261107098347253E01;
      DECLARE @qB8 FLOAT = 1.00000000000000000000E00;

      -- For efficiency compute sequence of powers of @Z 
      -- (instead of calling POWER(@Z,2), POWER(@Z,3), etc.)
      DECLARE @ZB3 FLOAT = @Z2*@Z;
      DECLARE @ZB4 FLOAT = @ZB3*@Z;
      DECLARE @ZB5 FLOAT = @ZB4*@Z;
      DECLARE @ZB6 FLOAT = @ZB5*@Z;
      DECLARE @ZB7 FLOAT = @ZB6*@Z;
      DECLARE @ZB8 FLOAT = @ZB7*@Z;

      SELECT @ERF = 1.0 - EXP(-@Z2) *
              (@pB0 + @pB1*@Z + @pB2*@Z2 + @pB3*@ZB3 + @pB4*@ZB4
               + @pB5*@ZB5 + @pB6*@ZB6 + @pB7*@ZB7 + @pB8*@ZB8) /
              (@qB0 + @qB1*@Z + @qB2*@Z2 + @qB3*@ZB3 + @qB4*@ZB4
               + @qB5*@ZB5 + @qB6*@ZB6 + @qB7*@ZB7 + @qB8*@ZB8);
    END
    ELSE
    BEGIN
      DECLARE @pC0 FLOAT = -6.58749161529837803157E-04;
      DECLARE @pC1 FLOAT = -1.60837851487422766278E-02;
      DECLARE @pC2 FLOAT = -1.25781726111229246204E-01;
      DECLARE @pC3 FLOAT = -3.60344899949804439429E-01;
      DECLARE @pC4 FLOAT = -3.05326634961232344035E-01;
      DECLARE @pC5 FLOAT = -1.63153871373020978498E-02;
      
      DECLARE @qC0 FLOAT = 2.33520497626869185443E-03;
      DECLARE @qC1 FLOAT = 6.05183413124413191178E-02;
      DECLARE @qC2 FLOAT = 5.27905102951428412248E-01;
      DECLARE @qC3 FLOAT = 1.87295284992346047209E00;
      DECLARE @qC4 FLOAT = 2.56852019228982242072E00;
      DECLARE @qC5 FLOAT = 1.00000000000000000000E00;
      
      DECLARE @pi FLOAT = 3.141592653589793238462643383;
      
      -- For efficiency compute sequence of powers of @Z 
      -- (instead of calling POWER(@Z,-2), POWER(@Z,-3), etc.)
      DECLARE @ZC2 FLOAT = (1/@Z)/@Z;
      DECLARE @ZC4 FLOAT = @ZC2*@ZC2;
      DECLARE @ZC6 FLOAT = @ZC4*@ZC2;
      DECLARE @ZC8 FLOAT = @ZC6*@ZC2;
      DECLARE @ZC10 FLOAT = @ZC8*@ZC2;

      SELECT @ERF = 1 - EXP(-@Z2)/@Z * (1/SQRT(@pi) + 1/(@Z2)*
             ((@pC0 + @pC1*@ZC2 + @pC2*@ZC4 + @pC3*@ZC6 + @pC4*@ZC8 + @pC5*@ZC10) /
              (@qC0 + @qC1*@ZC2 + @qC2*@ZC4 + @qC3*@ZC6 + @qC4*@ZC8 + @qC5*@ZC10)));
    END

    DECLARE @cd FLOAT = 0.5*(1+@ERF);

    IF @x > 0
      RETURN @cd;

    RETURN 1.0-@cd;
    END

Verification Using Excel 2010 as Reference


Maximum difference 5.55111512312578E-16 at value 1.00999999999983



Verification Using Excel 2007 as Reference


Maximum difference 5.09148279093097E-13 at value 4.97999999999994



Discussion

This more complex and slower implementation uses three different rational approximations to provide maximum accuracy over a wide range of values.
Here at last we see the differences between Excel 2010 and Excel 2007.
Using Excel 2010 as reference we note the approximation provides the maximum precision possible with a double precision float (15.5 decimal places.)
Using Excel 2007 the precision drops down to 12 decimal places. The graph shows this happens for only two small bands at around 5 and -5 which is where the Excel 2007 implementation is lacking and where Excel 2010 shows improvements.

Part 4: T-SQL Implementation of NORMDIST / NORM.S.DIST

This is Part 4 of a set of posts. The other posts are:
Part 1: Introduction
Part 2: Probability Density Function Implementation
Part 3: Cumulative Distribution Function Implementation 1
Part 4: Cumulative Distribution Function Implementation 2
Part 5: Cumulative Distribution Function Implementation 3
Part 6: References and Notes

Cumulative Distribution Function Implementation 2

The cumulative distribution function has no closed form expression and can only be approximated (see Part 1.)
The following implementation uses approximation 26.2.17 from Abramowitz & Stegun (1964) (see Part 6) for full references.

-- =============================================
-- Author:    Eli Algranti
-- Description:  Standard Normal Distribution 
--        Cummulative Distribution Function
--        using polinomial approximation 26.2.17 
--        from Abramowitz & Stegun (1964)
-- Remarks:    This function depends on 
--        [dbo].[StdNormalDistributionPDF]
--        to calculate the Standard Normal
--        Distribution PDF for the value
-- Copyright:  Eli Algranti (c) 2012
--
-- This code is licensed under the Microsoft Public 
-- License (Ms-Pl) (https://tsqlnormdist.codeplex.com/license)
-- =============================================
CREATE FUNCTION [dbo].[Stdnormaldistributioncdf_2] (@x FLOAT)
returns FLOAT
AS
  BEGIN
      DECLARE @Z FLOAT = Abs(@x)

      IF ( @Z >= 15 ) -- value is too large no need to compute
        BEGIN
            IF @x > 0
              RETURN 1;

            RETURN 0;
        END

      -- Compute the Standard Normal Cummulative Distribution using 
      -- polinomial approximation 26.2.17 from Abramowitz & Stegun (1964)
      DECLARE @p FLOAT = 0.2316419;
      DECLARE @b1 FLOAT = 0.319381530;
      DECLARE @b2 FLOAT = -0.356563782;
      DECLARE @b3 FLOAT = 1.781477937;
      DECLARE @b4 FLOAT = -1.821255978;
      DECLARE @b5 FLOAT = 1.330274429;
      DECLARE @t FLOAT = 1.0 / ( 1.0 + @p * @Z );
      -- For efficiency compute sequence of powers of @t (instead of calling POWER(@t,2), POWER(@t,3), etc.)
      DECLARE @t2 FLOAT = @t * @t;
      DECLARE @t3 FLOAT = @t2 * @t;
      DECLARE @t4 FLOAT = @t3 * @t;
      DECLARE @cd FLOAT = 1.0 - [dbo].[Stdnormaldistributionpdf](@Z) * (
                                  @b1 * @t + @b2 * @t2 + @b3 * @t3 + @b4 * @t4 +
                                  @b5 *
                                  @t4 * @t )

      IF @x > 0
        RETURN @cd;

      RETURN 1.0 - @cd;
  END 

Verification Using Excel 2010 as Reference


Maximum difference 7.4506132929919E-08 at value 0.719999999999832



Verification Using Excel 2007 as Reference


Maximum difference 7.4506132929919E-08 at value 0.719999999999832



Discussion

This approximation provides a precision of eight decimal places and surprisingly the same maximum difference for both versions of Excel. I had to check twice.
Unlike the previous implementation this one is not a pure polynomial approximation and requires the PDF to be computed.
The performance of this implementation can be vastly improved if the PDF needs to be computed anyway by passing the pre-computed PDF value as a parameter instead of computing it in the function.

Part 3: T-SQL Implementation of NORMDIST / NORM.S.DIST

This is Part 3 of a set of posts. The other posts are:
Part 1: Introduction
Part 2: Probability Density Function Implementation
Part 3: Cumulative Distribution Function Implementation 1
Part 4: Cumulative Distribution Function Implementation 2
Part 5: Cumulative Distribution Function Implementation 3
Part 6: References and Notes

Cumulative Distribution Function Implementation 1

The cumulative distribution function has no closed form expression and can only be approximated (see Part 1.)
The following implementation uses polynomial approximation 26.2.18 from Abramowitz & Stegun (1964) (see Part 6) for full references.

-- =============================================
-- Author:    Eli Algranti
-- Description:  Standard Normal Distribution 
--        Cummulative Distribution Function
--        using polinomial approximation 26.2.18 
--        from Abramowitz & Stegun (1964)
-- Copyright:  Eli Algranti (c) 2012
--
-- This code is licensed under the Microsoft Public 
-- License (Ms-Pl) (https://tsqlnormdist.codeplex.com/license)
-- =============================================
CREATE FUNCTION [dbo].[Stdnormaldistributioncdf_1] (@x FLOAT)
returns FLOAT
AS
  BEGIN
      DECLARE @Z FLOAT = Abs(@x)

      IF ( @Z >= 15 ) -- value is too large no need to compute
        BEGIN
            IF @x > 0
              RETURN 1;

            RETURN 0;
        END

      -- Compute the Standard Normal Cummulative Distribution using 
      -- polinomial approximation 26.2.18 from Abramowitz & Stegun (1964)
      DECLARE @c1 FLOAT = 0.196854;
      DECLARE @c2 FLOAT = 0.115194;
      DECLARE @c3 FLOAT = 0.000344;
      DECLARE @c4 FLOAT = 0.019527;
      -- For efficiency compute sequence of powers of @Z (instead of calling POWER(@Z,2), POWER(@Z,3), etc.)
      DECLARE @Z2 FLOAT = @Z * @Z;
      DECLARE @Z3 FLOAT = @Z2 * @Z;
      DECLARE @cd FLOAT = 1.0 - 0.5 * Power(1 + @c1 * @Z + @c2 * @Z2 + @c3 * @Z3
                                            +
                                            @c4 * @Z3 * @Z, -4)

      IF @x > 0
        RETURN @cd;

      RETURN 1.0 - @cd;
  END 

Verification Using Excel 2010 as Reference


Maximum difference 0.000232983371206426 at value -1.82000000000017



Verification Using Excel 2007 as Reference


Maximum difference 0.000232983371206363 at value -1.82000000000017



Discussion

This approximation provides a precision of three decimal places, which is enough for many applications.
The main advantage of this implementation is its simplicity, requiring just a handful of arithmetic operations.

Part 2: T-SQL Implementation of NORMDIST / NORM.S.DIST

This is Part 2 of a set of posts. The other posts are:
Part 1: Introduction
Part 2: Probability Density Function Implementation
Part 3: Cumulative Distribution Function Implementation 1
Part 4: Cumulative Distribution Function Implementation 2
Part 5: Cumulative Distribution Function Implementation 3
Part 6: References and Notes

Probability Density Function Implementation

The probability density function can be computed directly from the definition (see Part 1.)

-- =============================================
-- Author:    Eli Algranti
-- Description:  Standard Normal Distribution 
--        Probability Density Function
-- Copyright:  Eli Algranti (c) 2012
--
-- This code is licensed under the Microsoft Public 
-- License (Ms-Pl) (https://tsqlnormdist.codeplex.com/license)
-- =============================================
CREATE FUNCTION [dbo].[StdNormalDistributionPDF]
(
  @x FLOAT
)
RETURNS FLOAT
AS
BEGIN
  DECLARE @pi FLOAT = 3.141592653589793238462643383;
  
  -- The standard normal probability corresponding to @x
  RETURN EXP(-0.5*@x*@x)/SQRT(2.0*@pi);
END

Verification Using Excel 2010 as Reference


Maximum difference 4.71844785465692E-16 at value 1.02999999999983



Verification Using Excel 2007 as Reference


Maximum difference 4.9960036108132E-16 at value 1.02999999999983

Note: The graphs do not display a value for the Y axis because the values are too small.

Discussion

No surprises here. The T-SQL implementation closely matches both Excel's implementations and is within the error margin for double computations. The function may be optimized by precomputing SQRT(2*@pi).

Part 1: T-SQL Implementation of NORMDIST / NORM.S.DIST

This is Part 1 of a set of posts. The other posts are:
Part 1: Introduction
Part 2: Probability Density Function Implementation
Part 3: Cumulative Distribution Function Implementation 1
Part 4: Cumulative Distribution Function Implementation 2
Part 5: Cumulative Distribution Function Implementation 3
Part 6: References and Notes

Introduction and Methods

I’m not the first to use Excel as a tool for developing an algorithm; nor the first, when implementing said algorithm, to be surprised that SQL Server does not ship with a library of statistical functions.
In this series of posts I’ll develop a set of T-SQL functions which implement the very useful Normal Distribution’s ‘Probability Density Function’ (PDF) and ‘Cumulative Distribution Function’ (CDF.)

The Very Basics

The Standard Normal Distribution is the special case of a Normal Distribution with mean zero and standard deviation one. The function for the Standard Normal Distribution for variable x is:

The Normal Distribution PDF (for a variable x with mean ยต and standard deviation s) can be expressed as a function of the Standard Normal Distribution PDF:

The function for the Standard Normal Distribution CDF is:

Like the Normal Distribution PDF the Normal Distribution CDF can be expressed as a function of the Standard Normal Distribution CDF:

Another way to express it is by means of the special function erf (the error function):


The Implementation

The Standard Normal Distribution PDF (and by extension the Normal Distribution PDF) can be implemented as a user defined function directly from the definition. Such an implementation is presented in Part 2 of this series. The Standard Normal Distribution CDF has no closed form equivalent and can only be approximated:
Part 3 presents an implementation of a simple (and fast) polynomial approximation with a precision slightly better than three decimal points.
Part 4 presents a more complex polynomial approximation implementation with a precision of seven decimal points.
Finally Part 5 presents an implementation using a set of rational approximation. This more complex and slower) approximation yields a precision of fifteen decimal digits, which is also the precision of a double precision floating point value.
The approximations in Part 3 & Part 4 are courtesy of Abramowitz & Stegun (1964) while the implementation in Part 5 uses an approximation of the erf function from W. J. Cody (1969.)
See Part 6 for a list of references and notes.

Ensuring Accuracy

One of the problems when implementing mathematical functions is how to ensure the implementation is correct and the values returned accurate.
In order to test the accuracy of my implementations I've used to compute the values of doubles from -10 to 10 with 0.01 intervals and compared them to the results of the Excel 2010 implementation of the same function. Statistical functions in Excel versions prior to Excel 2003 where criticized for their accuracy. Microsoft greatly increased the accuracy of Excel 2003 from previous versions and made more improvements in Excel 2010.
Since the trigger for the work in these series of posts was my desire to implement an algorithm prototyped in Excel, and given that the latest Excel version addresses accuracy concerns; Excel 2010 is a good candidate to test the implementations.
I've also included the results using Excel 2007 as the reference implementation. In order to perform the comparison I wrote a small console utility called CheckError. CheckError gets as parameter the range, interval and functions to test and then:
  • Writes on the console the maximum difference between Excel and Sql Server implementations and at which input value it occurrs.
  • Writes files each containing the double precision floating point binary representation of:
    • The input values.
    • The Excel results.
    • The Sql Server results.
    • The differences between the Excel and Sql Server results.
  • Writes a tab delimited file containing decimal string representations of the value above.
The binary files can be used to test the precision of the solution using other reference implementations (say MatLab.) The tab delimited files should not be used for this purpose as there is a loss of precision involved in converting from the binary representation to the decimal string representation.
I used the tab delimited file to create the graphs presented along side the implementations since the loss of precision is not relevant in this case.
The source code for CheckError the T-SQL implementations, output files and Excel file with graphs can be found at codeplex and is released under the Microsoft Public License (Ms-PL.)

Caveat Emptor

The license for the code discussed here contains the full “as is” disclaimer, however since statistical functions may be used in all sorts of critical applications consider the following warnings if you decide to use these implementations:
While the tests described above are good enough for my purposes they are far from a complete analysis of the implementations.
In particular:
  1. I have not performed a review of available literature to discover whether the sources I’ve based my implementations on have been disputed or corrected.
  2. My implementation (or Excel 2010 which I use as the reference implementation or both) may have bugs which my testing does not reveal and may result in an error greater than the one stated.

Note Regarding References

The implementations and testing tools presented here are my own, but no original research was undertaken in their development.
Part 6 contains a full list of references and notes on where I found what information.