An Average Column: I Mean, What Mode is Your Median

This month, Doug Steele follows up on an earlier article by Chris Weber to look at different ways to calculate “measures of central tendency”: mean, median, and mode.

How can I calculate the median for my data?

Before launching into a solution for this question, let’s go over some terminology (if you read Chris Weber’s Computing the Median Again article in the Apr-2005 issue of Smart Access, you’ll be familiar with some of these terms). You may have noticed the phrase “measures of central tendency” used in my introduction. This is a term used in statistics and covers more than just “calculating an average.” The fundamental idea is that one of the best ways to summarize a set of data and still retain part of the information is to represent the set with a single value. Measures of central tendency are ways of calculating that single number that’s representative of an entire list of numbers. There are three commonly used measures of central tendency:

  • Mean–The arithmetic average of a set of numbers (the most common measure of central tendency).
  • Median–The value of the middle item when the data is arranged from lowest to highest (assuming an odd number of observations) or the average value of the two middle items (when there’s an even number of observations).
  • Mode–The observation that occurs most frequently in a data set.

Most of you are probably familiar with the concept of mean (which is often referred to as “arithmetic average”), but the other two measures might not be as familiar to you, so an example might be in order.

Let’s assume you throw three dice a total of 12 times, and get the results shown in Table 1.

Table 1. A sample set of dice rolls.

Roll Total
1 14
2 13
3 8
4 8
5 12
6 8
7 10
8 9
9 5
10 3
11 17
12 10

It’s easy to calculate the mean: It’s (14+13+8+8+12+8+10+9+5+3+17+10)/12 = 9.75. This is the number to use when the values are evenly distributed throughout the range.

How do I find the median?

To find the median, you must arrange the 12 totals in order (it doesn’t really matter whether you arrange them in ascending or descending order, although it’s more common to use ascending order). In this case, that’s 3, 5, 8, 8, 8, 9, 10, 10, 12, 13, 14, 17. Since there’s an even number of samples, the median will be the average value of the sixth and seventh elements. The sixth element is 8 and the seventh element is 10; therefore, the median is (9+10)/2 = 9.5. This is the number to use when the values aren’t evenly distributed–when the results are skewed to one end of the range or clustered in the middle.

Finally, the mode is the value that occurs most often. With the elements arranged in order in the preceding paragraph, it’s fairly straightforward to see that 8 occurs more times than any other value, so the mode is 8. This is the number to use when you’re interested in the “most likely” number.

So the next question is, “How can I calculate these values in Access?”

The average can be easy: Access has a function, DAvg, that will compute the mean of a set of values in a specified set of records (a domain). To calculate the mean of our DiceRolls table, the code would be:

?DAvg("Total", "DiceRolls")

 9.75

Unfortunately, Access doesn’t have similar functions to compute median and mode, so you’ll have to create your own. To be consistent with DAvg (and other Domain functions), I called the function to calculate the median DMedian and named the arguments Expr, Domain, and Criteria:

Function DMedian( _

    Expr As String, _

    Domain As String, _

    Optional Criteria As String = "" _

) As Variant

The parameters are as follows:

  • Expr is an expression that identifies the field containing the numeric data for which you want to calculate the median. It can be a string expression identifying a field in a table or query, or it can be an expression that performs a calculation on data in that field.
  • Domain is a string expression identifying the set of records that constitutes the domain (a table name or a query name).
  • Criteria is an optional string expression used to restrict the range of data on which the DMedian function is performed. Criteria is equivalent to the WHERE clause in a SQL expression, without the word WHERE. If Criteria is omitted, the DMedian function evaluates Expr against the entire domain. Note that any field that’s included in Criteria must also be a field in Domain.

The next part of the function declares the variables that I’ll need later in the function:

Dim dbMedian As DAO.Database

Dim rsMedian As DAO.Recordset

Dim dblTemp1 As Double

Dim dblTemp2 As Double

Dim lngOffset As Long

Dim lngRecCount As Long

Dim strSQL As String

Dim varMedian As Variant

I begin processing by creating a SQL string that will return Expr in a sorted order. If a value was supplied for Criteria, include it in the SQL statement, ignoring any Null values (if for no other reason than to be consistent with DAvg):

 strSQL = "SELECT " & Expr & " AS Data " & _

    "FROM " & Domain & " "

  strSQL = strSQL & _

    "WHERE " & Expr & " IS NOT NULL "

  If Len(Criteria) > 0 Then

    strSQL = strSQL & "AND (" & Criteria & ") "

  End If

  strSQL = strSQL & "ORDER BY " & Expr

Next I instantiate a recordset, using the SQL statement created earlier, to return all of the relevant data from the domain and make sure that records are actually returned:

Set dbMedian = CurrentDb()

  Set rsMedian = dbMedian.OpenRecordset(strSQL)

  If rsMedian.BOF = False And _

    rsMedian.EOF = False Then

Assuming records were returned, I have to determine how many. To do this, I move to the end of the recordset so that RecordCount will return an accurate count. I then check to see if an odd number of records is returned by dividing the RecordCount by 2 using the Mod operator, which returns the remainder left by the division (if the remainder is 0 then I know that I have an even number of records). Dividing the RecordCount by zero also lets me know how many elements to move backwards from the end of the recordset to reach the midpoint of the records. The median will be that element.

rsMedian.MoveLast

    lngRecCount = rsMedian.RecordCount

    If lngRecCount Mod 2 <> 0 Then

      lngOffset = ((lngRecCount + 1) / 2) - 2

      If lngOffset >= 0 Then

        rsMedian.Move -lngOffset - 1

      End If

      varMedian = rsMedian("DataValue")

    Else

If there’s an even number of records, I have to move backwards to the element after the midpoint of the recordset and retrieve that value. I then move backwards once more to retrieve the element before the midpoint and retrieve that value. Computing the mean of the two values retrieved will give the median:

 lngOffset = (lngRecCount / 2) - 2

      If lngOffset >= 0 Then

        rsMedian.Move -lngOffset - 1

        dblTemp1 = rsMedian("DataValue")

        rsMedian.MovePrevious

        dblTemp2 = rsMedian("DataValue")

        varMedian = (dblTemp1 + dblTemp2) / 2

      End If

    End If

  Else

If no records were returned, the median will be Null:

   varMedian = Null

  End If

With the median calculated, I clean up after myself and the function is complete:

rsMedian.Close

  Set rsMedian = Nothing

  Set dbMedian = Nothing

  DMedian = varMedian

End Function

You’d use this function in the same way as DAvg:

?DMedian("Total", "DiceRolls")

 9.5

Okay, can I determine the mode values of my data as well?

Determining the mode is complicated by the fact that it’s possible for more than one value to be the mode. For instance, it’s not hard to imagine a set of dice rolls that returned three 8s and three 5s. Therefore, the DMode function needs to be able to return an array. As you can see, the definitions for Expr, Domain, and Criteria are the same as with my DMedian function:

Function DMode( _

  Expr As String, _

  Domain As String, _

  Optional Criteria As String = "" _

) As Variant

Dim dbMode As DAO.Database

Dim rsMode As DAO.Recordset

Dim lngLoop As Long

Dim lngMaxFreq As Long

Dim strSQL As String

Dim varMode As Variant

As before, a SQL statement must be created. This time, though, the SQL statement doesn’t simply return all of the qualifying values in the domain. Instead, I’m going to create an Aggregate query that returns each unique value in the domain, plus how many times that value occurs. Further, it’s going to return the values in descending order of occurrence. In other words, it will return the value that occurs the most times, followed by the value that occurs the second most times, and so on until the values that occur the fewest times appear at the end of the recordset:

strSQL = "SELECT [" & FieldName & "], " & _

    "Count(*) AS Frequency " & _

    "FROM [" & TableName & "] "

  If Len(WhereClause) > 0 Then

    strSQL = strSQL & _

      "WHERE " & WhereClause & " "

  End If

  strSQL = strSQL & _

    "GROUP BY [" & FieldName & "] "

  strSQL = strSQL & "ORDER BY 2 DESC, 1 ASC"

As before, I instantiate a recordset, using the SQL statement I just created, to return all of the relevant data from the domain and check that records were actually returned:

 Set dbMode = CurrentDb()

  Set rsMode = dbMode.OpenRecordset(strSQL)

  If rsMode.BOF = False And _

    rsMode.EOF = False Then

Assuming that records were returned, I determine how many occurrences there were for the value that occurred the most number of times (which is right at the top of the recordset) and save that value in a variable called lngMaxFreq. I then loop through the recordset until a value with fewer occurrences is encountered. For each value that occurs the same number of times as what’s stored in lngMaxFreq, I add that value to an array. The contents of that array will represent the mode value(s) for the domain.

Here's the code:

    varMode = Array()

    lngLoop = 0

    lngMaxFreq = rsMode("Frequency")

    Do While rsMode("Frequency") = lngMaxFreq

      ReDim Preserve varMode(0 To lngLoop)

      varMode(lngLoop) = rsMode(FieldName)

      lngLoop = lngLoop + 1

      rsMode.MoveNext

    Loop

  Else

    varMode = Null

  End If

Finally, as before, I clean up after myself, and the function’s done:

rsMode.Close

  Set rsMode = Nothing

  Set dbMode = Nothing

  DMode = varMode

End Function

Unfortunately, it’s not quite as easy to use this DMode function as it is to use the other Domain functions. If you were to type the following code into the Immediate window, you’d get a runtime error 13 (Type Mismatch):

?DMode("Total", "DiceRolls")

Instead, you must use code like this:

Sub DetermineMode( _

  Expr As String, _

  Domain As String, _

  Optional Criteria As String = "" _

)

On Error GoTo Err_DetermineMode

Dim lngCount As Long

Dim lngLoop As Long

Dim strField As String

Dim strTable As String

Dim varMode As Variant

  varMode = DMode(Expr, Domain, Criteria)

  If IsNull(varMode) Then

    Debug.Print "No Mode found"

  Else

    lngCount = UBound(varMode) - _

      LBound(varMode) + 1

    If lngCount = 1 Then

      Debug.Print "One Mode value found:"

    Else

      Debug.Print lngCount & _

        " Mode values found:"

    End If

    For lngLoop = LBound(varMode) To _

      UBound(varMode)

      Debug.Print _

        (lngLoop - LBound(varMode) + 1) & _

        ": " & varMode(lngLoop)

    Next lngLoop

  End If

End Sub

Calling the DetermineMode routine, you’d see something like this:

Call DetermineMode("Total", "DiceRolls")

One Mode value found:

1: 8

What happens, though, if my sample data does have two modes (as in Table 2)? It turns out that the mean and median for this data are exactly the same as before, as demonstrated by these examples:

?DAvg("Total", "AlternateDiceRolls")

 9.75

DMedian("Total", "AlternateDiceRolls")

 9.5

Table 2. AlternateDiceRolls.

Roll Total
1 14
2 14
3 8
4 8
5 11
6 8
7 10
8 9
9 6
10 5
11 14
12 10

However, now I have two different values that are both modes for my data:

?DetermineMode("Total", "AlternateDiceRolls")

2 Mode values found:

1: 8

2: 14

What if my numbers aren’t in a table and I want to compute the median?

VBA allows you to use the keyword ParamArray as the last argument in the list of arguments for a function or subroutine to indicate that the final argument is an Optional array of Variant elements. This feature allows you to pass an arbitrary number of values to the routine, and you can treat that list of values as a single array.

This means that it’s possible to declare a function like this:

Function Median( _

  ParamArray DataPoints() As Variant _

) As Variant

You can then call the function like this and have all of the values gathered into the ParamArray DataPoints:

Median(3, 6, 1, 2, 4)

Within the function, you must sort the array DataPoints into ascending order before you can find the middle position and determine the median.

There are several restrictions to using the ParamArray keyword. The important one here is that you can’t pass the ParamArray array to another routine, so you can’t sort the array by passing to a sort routine that will arrange the numbers into ascending order.

A second issue, though, means that passing the array to a sort routine probably wouldn’t be a good idea in any case. Since you can pass anything to the array, you must validate all of the values before you try to compute the median. For instance, how would you determine the median for red, blue, yellow, white? Since you have an even number of arguments, you’d have to calculate the average of the middle two terms: yellow and blue (presumably green if you’re mixing colors additively).

One way to deal with both issues is to create a new array within the routine and transfer only valid (that is, numeric) values into the new array. Assuming that at least one numeric value ends up in this new array, you can sort the new array and compute the median.

While I don’t intend to discuss the routine I used for doing the sort (there are plenty of comments in the code in the download database), something like the following will let you compute the median for an arbitrary number of values:

Function Median( _

  ParamArray DataPoints() As Variant _

) As Variant

Dim lngArraySize As Long

Dim lngCurrPos As Long

Dim lngLoop As Long

Dim lngPos1 As Long

Dim lngPos2 As Long

Dim varValues() As Variant

  If IsMissing(DataPoints) = True Then

    Median = Null

  Else

    ReDim varValues(LBound(DataPoints) To _

      UBound(DataPoints))

    lngCurrPos = LBound(DataPoints) - 1

    For lngLoop = LBound(DataPoints) To _

      UBound(DataPoints)

      If IsNull(DataPoints(lngLoop)) = False Then

        If IsNumeric(DataPoints(lngLoop)) Then

          lngCurrPos = lngCurrPos + 1

          varValues(lngCurrPos) = _

            DataPoints(lngLoop)

        End If

      End If

    Next lngLoop

    If lngCurrPos >= 0 Then

      ReDim Preserve varValues( _

        LBound(DataPoints) To lngCurrPos)

      Call QuickSortVariants(varValues, _

        LBound(varValues), UBound(varValues))

      lngArraySize = UBound(varValues) - _

        LBound(varValues) + 1

      If lngArraySize Mod 2 = 0 Then

        lngPos1 = lngArraySize / 2 - 1

        lngPos2 = lngPos1 + 1

        Median = (varValues(lngPos1) + _

          varValues(lngPos2)) / 2

      Else

        lngPos1 = (lngArraySize - 1) / 2

        Median = varValues(lngPos1)

      End If

    Else

      Median = Null

    End If

  End If

End Function

Passing my first set of dice roll values to this function would give this result:

?Median(14, 13, 8, 8, 12, 8, 10, 9, 5, 3, 17, 10)

 9.5

I’m not sure there’s really any reason for such a function: As far as I’m concerned, requiring the ability to calculate the median like this is likely an indication that your tables haven’t been properly normalized. However, now you have that function if you need it.

Your download file is called Steele_AA200510.accdb

 

Other Related Articles Include 
 
Computing the Median (Again)

 External Links

 Allan Browne discusses Min and Max Of Ranges and ParamArray

About Doug Steele

Doug Steele has worked with databases-both mainframe and PC-for many years with a major international oil company. In a previous life, he taught introductory computer programming at the University of Waterloo. Microsoft has recognized him as an Access MVP for his contributions to the user community over the years.

Doug is the author of Access Solutions with Arvin Meyer.

This entry was posted in VBA. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *


*


This site uses Akismet to reduce spam. Learn how your comment data is processed.