Soundex: “Close” Only Counts in Horseshoes

This month, Doug Steele looks at a couple of techniques to help determine when text entries are “close enough” to be considered the same.

My users have problems with the accuracy of their typing. Is there some way that I can check whether something close to the name they typed already exists in the database?

My first impulse was to tell you not to let your users type the name but rather have them use a ComboBox to let them select the correct name. However, if your users would have to select from a large list, picking from a list may not be the best interface to use.

Fortunately, there are other possibilities. If you’re working with English names, there’s something called the Soundex algorithm, patented in 1918 by Margaret O’Dell and Robert C. Russell and used to help the U.S. Census Bureau with its filing. As it states at www.archives.gov/research_room/genealogy/census/soundex.html, “The Soundex is a coded surname (last name) index based on the way a surname sounds rather than the way it is spelled. Surnames that sound the same, but are spelled differently, like SMITH and SMYTH, have the same code and are filed together. The Soundex coding system was developed so that you can find a surname even though it may have been recorded under various spellings.”

How the Soundex algorithm does this is by converting a word into a four-character representation based on the word’s phonetic pronunciation (see Table 1). The first character is the first letter in the word, while the next three characters are digits representing the phonetic sounds of the consonants within the word. If the word is too short to make a four-character code, it pads the end of the code with zeros. In general, vowels (as well as the letters H, W, and Y) are ignored, unless they’re the first letter of the string.

The detailed rules are:

  • The initial letter of the word is always kept.
  • If the word has any double letters, they should be treated as one letter. For example, Gutierrez is coded G-362 (G, 3 for the T, 6 for the first R, the second R is ignored, 2 for the Z).
  • If the word has different letters side-by-side that have the same number in the Soundex coding guide, they should be treated as one letter. For example, Pfister is coded as P-236 (P, F ignored because it has the same value [1] that P has, then the digit 2 for the S, 3 for the T, and 6 for the R).

Table 1. Soundex coding guide.

Number Phonetic classification Letters represented by the number
1 labials and labio-dentals B, F, P, V
2 gutterals and sibilants C, G, J, K, Q, S, X, Z
3 dental-mutes D, T
4 palatal-fricatives L
5 labio-nasals (M) and lingual-nasals (N) M, N
6 dental fricatives R

If a vowel (A, E, I, O, U) separates two consonants that have the same Soundex code, the consonant to the right of the vowel is coded. For example, Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored because it has the same value [2] as C, 2 for the K). Even though C, Z, and K all have the same value (2), because the vowel “A” separates the Z and the K, the K is coded.

If H or W separates two consonants that have the same Soundex code, the consonant to the right of the vowel isn’t coded. For example, Ashcraft is coded A-261 (A, 2 for the S, C ignored because it has the same value [2] as S, 6 for the R, 1 for the F). It’s not coded A-226 even though there’s the letter H between the S and C. (You may find some implementations of Soundex that don’t use this rule.)

It's pretty straightforward, and pretty easy to code:

Function GetSoundex(ValueIn As String) As String

Dim lngIndex As Long

Dim strCurrChar As String

Dim strCurrVal As String

Dim strInput As String

Dim strPrevVal As String

Dim strSoundex As String

strInput = UCase$(ValueIn)

strSoundex = Left(strInput, 1)

lngIndex = 1

Do While Not Len(strSoundex) = 4

If lngIndex > Len(strInput) Then

strSoundex = strSoundex & "0"

Else

strCurrChar = Mid$(strInput, lngIndex, 1)

Select Case strCurrChar

Case "B", "F", "P", "V"

strCurrVal = "1"

Case "C", "G", "J", "K", "Q", _

"S", "X", "Z"

strCurrVal = "2"

Case "D", "T"

strCurrVal = "3"

Case "L"

strCurrVal = "4"

Case "M", "N"

strCurrVal = "5"

Case "R"

strCurrVal = "6"

Case Else ' vowel, H, W, or Y

strCurrVal = "0"

End Select

End If

If (strCurrVal <> "0") Then

If (strCurrVal <> strPrevVal) Then

If lngIndex <> 1 Then

strSoundex = strSoundex & strCurrVal

End If

End If

End If

If strCurrChar <> "H" And _

strCurrChar <> "W" Then

strPrevVal = strCurrVal

End If

lngIndex = lngIndex + 1

Loop

End Function

As you can see, this bit of code ensures that the value returned always has four characters (Rule 1):

If lngIndex > Len(strInput) Then

strSoundex = strSoundex & "0"

Else

This code is how I enforce the remaining rules:

If (strCurrVal <> "0") Then

If (strCurrVal <> strPrevVal) Then

If lngIndex <> 1 Then

strSoundex = strSoundex & strCurrVal

End If

End If

End If

If strCurrChar <> "H" And _

strCurrChar <> "W" Then

strPrevVal = strCurrVal

End If

The vowels and the letters H and W have been assigned a value of 0, so they’re ignored by the first If statement. For any other letter, I check whether the current character’s value is the same as the previous character’s value. If it is, I ignore it. That handles both Rules 2 and 3. Since the first character isn’t encoded, because we actually use the character itself instead of its value, I’ve included the If lngIndex <> 1 check.

Since Rule 5 states that we don’t encode a consonant to the right of an H or W if it has the same value as the consonant to the left of the H or W, I don’t change the value stored in strPrevVal when I encounter one of those letters. Because I do change strPrevVal in every other case, then whenever I encounter a vowel, strPrevVal will get set to 0. As a result, strPrevVal will be different from the next consonant after the vowel, thus satisfying Rule 4.

Now that you have some way of encoding the text that your users are entering, how can you use it?

In a query, you can use the GetSoundex function to encode each name both in the database and in the input parameters to your query. You can then compare the Soundex values. The SQL would look something like this:

SELECT Customers.Id, Customers.FirstNm,

Customers.LastNm, Customers.AddressLine1,

Customers.AddressLine2, Customers.City,

Customers.Province

FROM Customers

WHERE GetSoundex([LastNm])=

GetSoundex([Enter Last Name])

ORDER BY Customers.LastNm, Customers.FirstNm

This is actually one case where I might recommend violating database normalization rules and storing the Soundex value in the database, rather than calculating it each time. Since the Soundex value won’t change over time, the query will be a lot more efficient if you don’t need to run the function against each row in the table every time you run the query!

If you look in the accompanying downloadable database, you can see how I use the GetSoundex function to filter the recordset displayed:

If Len(Me.txtSearchName & vbNullString) > 0 Then

Me.Filter = "GetSoundex([LastNm]) = '" & _

GetSoundex(Me.txtSearchName) & "'"

Me.FilterOn = True

Else

Me.Filter = vbNullString

Me.FilterOn = False

End If

There are a number of variations on Soundex (some database engines–SQL Server, for instance–have a Soundex function built in). Some versions return more digits, while some have different rules for translating from letters to numbers. Since the point of the Soundex value is to return the same value for names that sound the same, it makes sense that if you’re using a language other than English (the language for which the original Soundex algorithm was developed), the values you’d use might be different. I don’t intend to go into details about the variations, but, as an example, the Daitch-Mokotoff Soundex System was developed to help when working with Eastern European Jewish names. The Daitch-Mokotoff Soundex values for Moskowitz and Moskovitz, for instance, will be the same, while the American Soundex values will be different. Do a Google on Soundex for more details. As a result, counting on a built-in Soundex function can be risky unless it allows you to pick different rules for different kinds of names.

I should point out that Soundex will generate large numbers of both false positives and false negatives. This limitation is true of even the best Soundex improvement techniques, since Soundex is intended to bring some of the fuzzy and inexact nature of human vocal interaction to computing and is therefore always going to be inexact. In the downloadable database that accompanies this article, I have an Employee table containing almost 22,000 names. Those names compute to fewer than 3,000 different Soundex values. However, when you take a look at what names are associated with the same Soundex value, you should get an idea of just how “fuzzy” the matching is. For instance, the following 16 names all have a Soundex value of S530:

  • SAINT
  • SAMHAT
  • SAMMUT
  • SANDAU
  • SANDE
  • SANDHU
  • SANDY
  • SCHMIDT
  • SCHWINDT
  • SHAND
  • SHIMODA
  • SMIDT
  • SMITH
  • SMYTH
  • SMYTHE
  • SZMIDT

As you can see, not all of these names have their roots in England, so, while they’d be pronounced differently, the English-based Soundex function treats them the same.

Are there other options to Soundex (or Soundex-based alternatives)?

Do you ever do those Word Ladder games where you have to change one word to another word by changing one letter at a time? For instance, can you change TEARS to SMILE in six steps, changing only one letter at a time and always having a real word at each step? Here’s one solution (there may be others–don’t e-mail me!):

  1. T E A R S
  2. S E A R S
  3. S T A R S
  4. S T A R E
  5. S T A L E
  6. S T I L E
  7. S M I L E

In the mid-1960s, a Russian scientist named Vladimir Levenshtein devised an algorithm that essentially figures out the “edit distance” between two words. Edit distance is the minimum number of deletions, insertions, or substitutions required to transform one string to another. Since then, the Levenshtein Distance (LD) algorithm has been used for such purposes as spell checking, speech recognition, DNA analysis, and plagiarism detection.

As an example, the words “computer” and “commuter” are very similar: You only need to change one letter (the p to an m) to change the first word into the second. That means that the LD between the two words is 1. Similarly, you can change the word sport to spot by deleting the letter r (or, conversely, you can change the word spot to sport by adding the letter r). That means that the LD between those two words is also 1.

I’ve seen complicated mathematical explanations of how to calculate the Levenshtein Distance, but I find the following explanation easiest to follow:

  1. Given two strings, S1 and S2, if either of the strings is zero-length, then the LD will be the length of the other string. (In other words, if the length of S1 is 0, LD(s1, s2) will equal the length of S2, or if the length of S2 is 0, LD(s1, s2) will equal the length of S1.)
  2. Assuming both S1 and S2 are non-zero-length strings, set L1 to the length of S1, and L2 to the length of S2. Construct a matrix containing L2+1 rows (numbered 0 to L2) and L1+1 columns (numbered 0 to L1).
  3. Initialize the first row of the matrix to the values 0 to L1. Initialize the first column of the matrix to the values 0 to L2.
  4. Examine each character in S1 (i going from 1 to L1), and each character in S2 (j going from 1 to L2).

If the ith character in S1 equals the jth character in S2 (that is, S1(i) = S2(j)), then the “cost” is 0. Otherwise, it’s 1.

This gives you the input values to the cost calculation. You now work through the matrix sequentially (left to right, top to bottom), calculating on a cumulative value based on three other cells adjacent to the current cell. The minimum results of these three calculations are saved in the current cell. The three cells and the calculations used are:

  • The cell immediately above, plus 1: a(i-1, j) + 1
  • The cell immediately to the left, plus 1: a(i, j-1) + 1
  • The cell diagonally above and to the left, plus the cost: a(i-1, j-1) + cost

Once you’ve worked through all of the cells in the matrix, the LD will be found bottom left cell: a(L1, L2). Whew.

Here’s what that looks like in code:

Function LD( _

String1 As String, _

String2 As String _

) As Integer

Dim intCost As Integer

Dim intDist() As Integer

Dim intLen1 As Integer

Dim intLen2 As Integer

Dim intL1 As Integer

Dim intL2 As Integer

Dim intVal1 As Integer

Dim intVal2 As Integer

Dim intVal3 As Integer

Dim strCurrChar1 As String

Dim strCurrChar2 As String

intLen1 = Len(String1)

intLen2 = Len(String2)

If intLen1 = 0 Then

LD = intLen2

ElseIf intLen2 = 0 Then

LD = intLen1

Else

ReDim intDist(0 To intLen1, 0 To intLen2)

For intL1 = 0 To intLen1

intDist(intL1, 0) = intL1

Next intL1

For intL2 = 0 To intLenString2

intDist(0, intL2) = intL2

Next intLoop2

For intL1 = 1 To intLen1

strCurrChar1 = Mid$(String1, intL1, 1)

For intL2 = 1 To intLen2

strCurrChar2 = Mid$(String2, intL2, 1)

If (strCurrChar1 = strCurrChar2) Then

intCost = 0

Else

intCost = 1

End If

intVal1 = intDist(intL1 - 1, intL2) + 1

intVal2 = intDist(intL1, intL2 - 1) + 1

intVal3 = intDist(intL1 - 1, intL2 - 1) _

+ intCost

intDist(intL1, intL2) = _

Minimum(intVal1, intVal2, intVal3)

Next intLoop2

Next intLoop1

End If

LD = intDist(intLen1, intLen2)

End Function

I created a helper function, called Minimum, that simply returns the smallest of three integers passed to it and used it in the previous code. Here’s that function:

Private Function Minimum( _

ByVal I As Integer, _

ByVal J As Integer, _

ByVal K As Integer _

) As Integer

Dim intMin As Integer

intMin = I

If J < intMin Then

intMin = J

End If

If K < intMin Then

intMin = K

End If

Minimum = intMin

End Function

I should point out that if you apply the algorithm to our Word Ladder example, the Levenshtein Distance between TEARS and SMILE is only 5, while it took me 6 words to convert from one to the other. That’s because the Levenshtein algorithm isn’t restricted to using real words when computing the distance.

You can use the Levenshtein Distance between the string in the database and your search string as a means of selecting rows. In this example, I’ll find all rows where the two fields LastNm and LastName have an LD of 2 or less:

SELECT Customers.Id, Customers.FirstNm,

Customers.LastNm, Customers.AddressLine1,

Customers.AddressLine2, Customers.City,

Customers.Province

FROM Customers

WHERE LD([LastNm], [LastName]) < 2

ORDER BY Customers.LastNm, Customers.FirstNm

The Soundex method returned identical values for words that Soundex regards as identical. In other words, the version of Soundex that you use decides what counts as “close enough” by assigning identical values. Since two words will only have an LD of 0 if they’re identical, to use the LD function to find “close enough” words you must check for some difference that’s greater than 0. In my example, I used a difference of 2. You’ll have to experiment to determine what’s “close enough” for your needs. Unfortunately, I can’t give you any silver bullet for doing this; you’ll have to use trial and error to determine what works for you.

Download the accompanying database to play with various values of Levenshtein Distance. Good luck!

Your download file is called Steele_AA200504.accdb

About Doug Steele

Doug Steele has worked with databases-both mainframe and PC-for many years with a major international oil company. In a previous life, he taught introductory computer programming at the University of Waterloo. Microsoft has recognized him as an Access MVP for his contributions to the user community over the years.

Doug is the author of Access Solutions with Arvin Meyer.

This entry was posted in VBA. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *


*


This site uses Akismet to reduce spam. Learn how your comment data is processed.