Computing Text Similarity

A part of the implementation of the NDifference project is finding similarities between pieces of code, specifically finding the closest match between several candidates for an overloaded method.

Recently I discovered an article on Wikipedia describing the Levenshtein distance algorithm, a measure of how similar two pieces of text are, or how many changes would be required to turn one into the other. Identical texts have a score of zero, changing "cat" into "dog" would score 3. Changing "cog" into "dog" would score 1.

Taking the reference implementation, it seemed that it could benefit from losing the arrays and some of the more manual parts of the algorithm and using some built-in .Net goodness.


public static class TextSimilarity
{
	// Using Levenshtein Distance technique.
	// For algorithm see : http://en.wikipedia.org/wiki/Levenshtein_distance
	// adapted from http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm
	public static int DifferenceBetween(object s, object t)
	{
		string first = s.ToString();
		string second = t.ToString();

		// degenerate cases
		if (first == second) return 0;
		if (first.Length == 0) return second.Length;
		if (second.Length == 0) return first.Length;

		// create two work vectors of integer distances
		// initialize the previous row of distance)
		// this row is A[0][i]: edit distance for an empty s
		// the distance is just the number of characters to delete from t
		List<int> previousRow = new List<int>(Enumerable.Range(0, second.Length + 1));
		List<int> currentRow = new List<int>(Enumerable.Repeat(0, second.Length + 1));

		int indexInFirst = 0;

		foreach(var characterFromFirst in first)
		{
			// calculate current row distances from the previous row
			// edit distance is delete (i+1) chars from first to match empty second
			currentRow[0] = indexInFirst + 1;

			int indexInSecond = 0;

			// use formula to fill in the rest of the row
			foreach(var characterFromSecond in second)
			{
				var cost = (characterFromFirst == characterFromSecond) ? 0 : 1;

				int[] candidatesForLeastCost = 
				{ 
					currentRow[indexInSecond] + 1, 
					previousRow[indexInSecond + 1] + 1, 
					previousRow[indexInSecond] + cost 
				};

				currentRow[indexInSecond + 1] = candidatesForLeastCost.Min(); 

				++indexInSecond;
			}

			// copy current row to previous row for next iteration
			previousRow = new List<int>(currentRow);

			++indexInFirst;
		}

		return currentRow[second.Length];
	}
}

For example, the first part of the algorithm is concerned with initialising data and in the original implementation this is 5-6 lines of very C-like code.

Here, I have replaced it with simple initialisations of two lists, one from Enumerable.Range the other from Enumerable.Repeat. So, the first list is filled with an increasing sequence of numbers and the second is filled with all zeroes. It takes longer to say it than to write or read it.

Next, towards the end of the first loop, there is a copy operation which has been replaced by another simple assignment.

Finally, the inner loop calculation tries to find the lowest value of three alternatives. Originally, this used repeated, nested calls to Math.Min but here has been replaced with a variant from the IEnumerable extension methods which picks the lowest value from a sequence.

I'm still not completely happy with the implementation, particularly the index values required to access elements of the previous and current row values. I will update this as I discover better techniques :)