This month, Doug Steele follows up on an earlier article by Chris Weber to look at different ways to calculate “measures of central tendency”: mean, median, and mode.
How can I calculate the median for my data?
Before launching into a solution for this question, let’s go over some terminology (if you read Chris Weber’s Computing the Median Again article in the Apr-2005 issue of Smart Access, you’ll be familiar with some of these terms). You may have noticed the phrase “measures of central tendency” used in my introduction. This is a term used in statistics and covers more than just “calculating an average.” The fundamental idea is that one of the best ways to summarize a set of data and still retain part of the information is to represent the set with a single value. Measures of central tendency are ways of calculating that single number that’s representative of an entire list of numbers. There are three commonly used measures of central tendency:
- Mean–The arithmetic average of a set of numbers (the most common measure of central tendency).
- Median–The value of the middle item when the data is arranged from lowest to highest (assuming an odd number of observations) or the average value of the two middle items (when there’s an even number of observations).
- Mode–The observation that occurs most frequently in a data set.
Most of you are probably familiar with the concept of mean (which is often referred to as “arithmetic average”), but the other two measures might not be as familiar to you, so an example might be in order.
Let’s assume you throw three dice a total of 12 times, and get the results shown in Table 1.
Table 1. A sample set of dice rolls.
It’s easy to calculate the mean: It’s (14+13+8+8+12+8+10+9+5+3+17+10)/12 = 9.75. This is the number to use when the values are evenly distributed throughout the range.
How do I find the median?
To find the median, you must arrange the 12 totals in order (it doesn’t really matter whether you arrange them in ascending or descending order, although it’s more common to use ascending order). In this case, that’s 3, 5, 8, 8, 8, 9, 10, 10, 12, 13, 14, 17. Since there’s an even number of samples, the median will be the average value of the sixth and seventh elements. The sixth element is 8 and the seventh element is 10; therefore, the median is (9+10)/2 = 9.5. This is the number to use when the values aren’t evenly distributed–when the results are skewed to one end of the range or clustered in the middle.
Finally, the mode is the value that occurs most often. With the elements arranged in order in the preceding paragraph, it’s fairly straightforward to see that 8 occurs more times than any other value, so the mode is 8. This is the number to use when you’re interested in the “most likely” number.
So the next question is, “How can I calculate these values in Access?”
The average can be easy: Access has a function, DAvg, that will compute the mean of a set of values in a specified set of records (a domain). To calculate the mean of our DiceRolls table, the code would be:
?DAvg("Total", "DiceRolls") 9.75
Unfortunately, Access doesn’t have similar functions to compute median and mode, so you’ll have to create your own. To be consistent with DAvg (and other Domain functions), I called the function to calculate the median DMedian and named the arguments Expr, Domain, and Criteria:
Function DMedian( _ Expr As String, _ Domain As String, _ Optional Criteria As String = "" _ ) As Variant
The parameters are as follows:
- Expr is an expression that identifies the field containing the numeric data for which you want to calculate the median. It can be a string expression identifying a field in a table or query, or it can be an expression that performs a calculation on data in that field.
- Domain is a string expression identifying the set of records that constitutes the domain (a table name or a query name).
- Criteria is an optional string expression used to restrict the range of data on which the DMedian function is performed. Criteria is equivalent to the WHERE clause in a SQL expression, without the word WHERE. If Criteria is omitted, the DMedian function evaluates Expr against the entire domain. Note that any field that’s included in Criteria must also be a field in Domain.
The next part of the function declares the variables that I’ll need later in the function:
Dim dbMedian As DAO.Database Dim rsMedian As DAO.Recordset Dim dblTemp1 As Double Dim dblTemp2 As Double Dim lngOffset As Long Dim lngRecCount As Long Dim strSQL As String Dim varMedian As Variant
I begin processing by creating a SQL string that will return Expr in a sorted order. If a value was supplied for Criteria, include it in the SQL statement, ignoring any Null values (if for no other reason than to be consistent with DAvg):
strSQL = "SELECT " & Expr & " AS Data " & _ "FROM " & Domain & " " strSQL = strSQL & _ "WHERE " & Expr & " IS NOT NULL " If Len(Criteria) > 0 Then strSQL = strSQL & "AND (" & Criteria & ") " End If strSQL = strSQL & "ORDER BY " & Expr
Next I instantiate a recordset, using the SQL statement created earlier, to return all of the relevant data from the domain and make sure that records are actually returned:
Set dbMedian = CurrentDb() Set rsMedian = dbMedian.OpenRecordset(strSQL) If rsMedian.BOF = False And _ rsMedian.EOF = False Then
Assuming records were returned, I have to determine how many. To do this, I move to the end of the recordset so that RecordCount will return an accurate count. I then check to see if an odd number of records is returned by dividing the RecordCount by 2 using the Mod operator, which returns the remainder left by the division (if the remainder is 0 then I know that I have an even number of records). Dividing the RecordCount by zero also lets me know how many elements to move backwards from the end of the recordset to reach the midpoint of the records. The median will be that element.
rsMedian.MoveLast lngRecCount = rsMedian.RecordCount If lngRecCount Mod 2 <> 0 Then lngOffset = ((lngRecCount + 1) / 2) - 2 If lngOffset >= 0 Then rsMedian.Move -lngOffset - 1 End If varMedian = rsMedian("DataValue") Else
If there’s an even number of records, I have to move backwards to the element after the midpoint of the recordset and retrieve that value. I then move backwards once more to retrieve the element before the midpoint and retrieve that value. Computing the mean of the two values retrieved will give the median:
lngOffset = (lngRecCount / 2) - 2 If lngOffset >= 0 Then rsMedian.Move -lngOffset - 1 dblTemp1 = rsMedian("DataValue") rsMedian.MovePrevious dblTemp2 = rsMedian("DataValue") varMedian = (dblTemp1 + dblTemp2) / 2 End If End If Else
If no records were returned, the median will be Null:
varMedian = Null End If
With the median calculated, I clean up after myself and the function is complete:
rsMedian.Close Set rsMedian = Nothing Set dbMedian = Nothing DMedian = varMedian End Function
You’d use this function in the same way as DAvg:
?DMedian("Total", "DiceRolls") 9.5
Okay, can I determine the mode values of my data as well?
Determining the mode is complicated by the fact that it’s possible for more than one value to be the mode. For instance, it’s not hard to imagine a set of dice rolls that returned three 8s and three 5s. Therefore, the DMode function needs to be able to return an array. As you can see, the definitions for Expr, Domain, and Criteria are the same as with my DMedian function:
Function DMode( _ Expr As String, _ Domain As String, _ Optional Criteria As String = "" _ ) As Variant Dim dbMode As DAO.Database Dim rsMode As DAO.Recordset Dim lngLoop As Long Dim lngMaxFreq As Long Dim strSQL As String Dim varMode As Variant
As before, a SQL statement must be created. This time, though, the SQL statement doesn’t simply return all of the qualifying values in the domain. Instead, I’m going to create an Aggregate query that returns each unique value in the domain, plus how many times that value occurs. Further, it’s going to return the values in descending order of occurrence. In other words, it will return the value that occurs the most times, followed by the value that occurs the second most times, and so on until the values that occur the fewest times appear at the end of the recordset:
strSQL = "SELECT [" & FieldName & "], " & _ "Count(*) AS Frequency " & _ "FROM [" & TableName & "] " If Len(WhereClause) > 0 Then strSQL = strSQL & _ "WHERE " & WhereClause & " " End If strSQL = strSQL & _ "GROUP BY [" & FieldName & "] " strSQL = strSQL & "ORDER BY 2 DESC, 1 ASC"
As before, I instantiate a recordset, using the SQL statement I just created, to return all of the relevant data from the domain and check that records were actually returned:
Set dbMode = CurrentDb() Set rsMode = dbMode.OpenRecordset(strSQL) If rsMode.BOF = False And _ rsMode.EOF = False Then
Assuming that records were returned, I determine how many occurrences there were for the value that occurred the most number of times (which is right at the top of the recordset) and save that value in a variable called lngMaxFreq. I then loop through the recordset until a value with fewer occurrences is encountered. For each value that occurs the same number of times as what’s stored in lngMaxFreq, I add that value to an array. The contents of that array will represent the mode value(s) for the domain.
Here's the code: varMode = Array() lngLoop = 0 lngMaxFreq = rsMode("Frequency") Do While rsMode("Frequency") = lngMaxFreq ReDim Preserve varMode(0 To lngLoop) varMode(lngLoop) = rsMode(FieldName) lngLoop = lngLoop + 1 rsMode.MoveNext Loop Else varMode = Null End If
Finally, as before, I clean up after myself, and the function’s done:
rsMode.Close Set rsMode = Nothing Set dbMode = Nothing DMode = varMode End Function
Unfortunately, it’s not quite as easy to use this DMode function as it is to use the other Domain functions. If you were to type the following code into the Immediate window, you’d get a runtime error 13 (Type Mismatch):
?DMode("Total", "DiceRolls") Instead, you must use code like this: Sub DetermineMode( _ Expr As String, _ Domain As String, _ Optional Criteria As String = "" _ ) On Error GoTo Err_DetermineMode Dim lngCount As Long Dim lngLoop As Long Dim strField As String Dim strTable As String Dim varMode As Variant varMode = DMode(Expr, Domain, Criteria) If IsNull(varMode) Then Debug.Print "No Mode found" Else lngCount = UBound(varMode) - _ LBound(varMode) + 1 If lngCount = 1 Then Debug.Print "One Mode value found:" Else Debug.Print lngCount & _ " Mode values found:" End If For lngLoop = LBound(varMode) To _ UBound(varMode) Debug.Print _ (lngLoop - LBound(varMode) + 1) & _ ": " & varMode(lngLoop) Next lngLoop End If End Sub
Calling the DetermineMode routine, you’d see something like this:
Call DetermineMode("Total", "DiceRolls") One Mode value found: 1: 8
What happens, though, if my sample data does have two modes (as in Table 2)? It turns out that the mean and median for this data are exactly the same as before, as demonstrated by these examples:
?DAvg("Total", "AlternateDiceRolls") 9.75 DMedian("Total", "AlternateDiceRolls") 9.5
Table 2. AlternateDiceRolls.
However, now I have two different values that are both modes for my data:
?DetermineMode("Total", "AlternateDiceRolls") 2 Mode values found: 1: 8 2: 14
What if my numbers aren’t in a table and I want to compute the median?
VBA allows you to use the keyword ParamArray as the last argument in the list of arguments for a function or subroutine to indicate that the final argument is an Optional array of Variant elements. This feature allows you to pass an arbitrary number of values to the routine, and you can treat that list of values as a single array.
This means that it’s possible to declare a function like this:
Function Median( _ ParamArray DataPoints() As Variant _ ) As Variant
You can then call the function like this and have all of the values gathered into the ParamArray DataPoints:
Median(3, 6, 1, 2, 4)
Within the function, you must sort the array DataPoints into ascending order before you can find the middle position and determine the median.
There are several restrictions to using the ParamArray keyword. The important one here is that you can’t pass the ParamArray array to another routine, so you can’t sort the array by passing to a sort routine that will arrange the numbers into ascending order.
A second issue, though, means that passing the array to a sort routine probably wouldn’t be a good idea in any case. Since you can pass anything to the array, you must validate all of the values before you try to compute the median. For instance, how would you determine the median for red, blue, yellow, white? Since you have an even number of arguments, you’d have to calculate the average of the middle two terms: yellow and blue (presumably green if you’re mixing colors additively).
One way to deal with both issues is to create a new array within the routine and transfer only valid (that is, numeric) values into the new array. Assuming that at least one numeric value ends up in this new array, you can sort the new array and compute the median.
While I don’t intend to discuss the routine I used for doing the sort (there are plenty of comments in the code in the download database), something like the following will let you compute the median for an arbitrary number of values:
Function Median( _ ParamArray DataPoints() As Variant _ ) As Variant Dim lngArraySize As Long Dim lngCurrPos As Long Dim lngLoop As Long Dim lngPos1 As Long Dim lngPos2 As Long Dim varValues() As Variant If IsMissing(DataPoints) = True Then Median = Null Else ReDim varValues(LBound(DataPoints) To _ UBound(DataPoints)) lngCurrPos = LBound(DataPoints) - 1 For lngLoop = LBound(DataPoints) To _ UBound(DataPoints) If IsNull(DataPoints(lngLoop)) = False Then If IsNumeric(DataPoints(lngLoop)) Then lngCurrPos = lngCurrPos + 1 varValues(lngCurrPos) = _ DataPoints(lngLoop) End If End If Next lngLoop If lngCurrPos >= 0 Then ReDim Preserve varValues( _ LBound(DataPoints) To lngCurrPos) Call QuickSortVariants(varValues, _ LBound(varValues), UBound(varValues)) lngArraySize = UBound(varValues) - _ LBound(varValues) + 1 If lngArraySize Mod 2 = 0 Then lngPos1 = lngArraySize / 2 - 1 lngPos2 = lngPos1 + 1 Median = (varValues(lngPos1) + _ varValues(lngPos2)) / 2 Else lngPos1 = (lngArraySize - 1) / 2 Median = varValues(lngPos1) End If Else Median = Null End If End If End Function
Passing my first set of dice roll values to this function would give this result:
?Median(14, 13, 8, 8, 12, 8, 10, 9, 5, 3, 17, 10) 9.5
I’m not sure there’s really any reason for such a function: As far as I’m concerned, requiring the ability to calculate the median like this is likely an indication that your tables haven’t been properly normalized. However, now you have that function if you need it.
Other Related Articles Include
Computing the Median (Again)