Nearest Neighbors in C#

.NET Implementation of the Essential Classification Algorithm

Brian Ross
4 min read · Aug 31, 2020

Introduction

The task for data science students' first “build week” as part of Lambda School’s computer science curriculum is to implement one of a selection of commonly used algorithms from scratch. As you can guess from the title of this post, I selected k-nearest neighbors, but I also pushed myself out of my comfort zone and elected to implement the algorithm in C#.

Why C#?

For those mainly familiar with Python, let us first explore what C# (pronounced C-Sharp) is. C# is a language developed by Anders Hejlsberg for Microsoft as part of the .NET initiative in 2000. It is a general-purpose, multi-paradigm programming language: strongly typed and lexically scoped, it supports imperative, declarative, functional, generic, object-oriented, and component-oriented styles. The language has wide adoption by companies large and small.

Design Goals for C#

In designing C#, Microsoft outlined the following goals:

  1. The language should be a simple and modern object-oriented programming language.
  2. The language should provide strong type checking, array bounds checking, detection of attempts to use uninitialized variables, and automatic garbage collection.
  3. It should be able to support developing components in distributed environments.
  4. Portability is highly important, especially for developers already familiar with C and C++.
  5. The language should support internationalization.
  6. It should be suitable for writing applications on both hosted and embedded systems.
  7. The language should be efficient and economical; however, it is not meant to be a direct competitor to C or assembly.

C# vs Python

Python is my first and true love, but of late I have come to the conclusion that while it performs exceptionally well at those tasks for which it is appropriate, it is far from a “hammer for every nail.” To that end I’ve sought to explore some other programming languages, and one that I came back to again and again was C#. Let us explore some main differences between the two languages.

Static Typing vs Dynamic Typing

C# is statically typed; Python is dynamically typed. What this means, essentially, is that in C# type checking is performed when the program is compiled, whereas in Python, an interpreted language, type checking is performed at run time. Which one is better is up for debate, but there is a line of thinking that suggests dynamically typed languages are not necessarily appropriate for what are referred to as “mission-critical applications.” While I think both have their places, I think the argument comes down to: what is the risk tolerance for the program?
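To make the difference concrete, here is a minimal sketch (the variable names and values are made up for illustration). In C#, a type mismatch is caught before the program ever runs:

```csharp
using System;

class TypingDemo
{
    static void Main()
    {
        int count = 5;          // the compiler knows 'count' is an int
        string label = "red";   // and that 'label' is a string

        // count = "five";      // uncommenting this fails at COMPILE time:
                                // error CS0029: Cannot implicitly convert
                                // type 'string' to 'int'

        Console.WriteLine($"{label}: {count}"); // prints "red: 5"
    }
}
```

The equivalent mistake in Python would only surface when that line of code actually executes, which may be deep into a production run.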

Consider an application which provides some sort of functionality for life support systems in a hospital. The risk tolerance here is significantly lower than for, say, an application that serves blog posts. If the former program fails due to some unforeseen type conversion at runtime, it could potentially result in a death. While we all love our blog posts, we can hopefully agree that Medium going down for a few minutes is not as bad as a life support system in a critical care unit going offline for the same amount of time.

The Algorithm

So let’s get into the algorithm. The k-nearest neighbors algorithm is pretty simple. It is considered a supervised algorithm, which means it requires labeled classes. It’s like trying to teach a child their colors. You first need to show them and point out an example of a color, for example red. Then, once you have shown them enough examples of the color, they can abstract that information to other sources. Supervised algorithms work the same way: they require a training set from which to learn the specific attributes of the labeled classes before they can abstract those characteristics to new information.

In ELI5 terms, the k-nearest neighbors algorithm works like this: you have a set of data with which to train the algorithm. The algorithm takes the various characteristics of the different observations and maps them to points in space. You then have the observation whose label you are trying to predict. Its k closest points in that space form a voting pool, and each member of the pool casts a vote. Say you have a voting pool of 5 members (k = 5): if you have three 1’s, a 2, and a 3, then the algorithm predicts the value of the unlabeled observation to be 1.
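The voting step can be sketched in a few lines of LINQ (this is an illustrative sketch, not the exact code from my project):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class VotingDemo
{
    // Returns the label that appears most often among the k nearest neighbors.
    static int MajorityVote(IEnumerable<int> neighborLabels)
    {
        return neighborLabels
            .GroupBy(label => label)            // tally the votes per label
            .OrderByDescending(g => g.Count())  // most votes first
            .First().Key;
    }

    static void Main()
    {
        // k = 5: three votes for 1, one for 2, one for 3
        var votes = new[] { 1, 1, 1, 2, 3 };
        Console.WriteLine(MajorityVote(votes)); // prints 1
    }
}
```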

The Implementation

My implementation was similarly simple. The first hurdle was not having pandas to lean on in loading the data, so the first task was to write a CSV parser which could iterate over the CSV file and store the values in a doubly-nested array.
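A simplified version of what such a parser might look like is below. The sample values are made up, it assumes purely numeric columns, and it skips complications like quoted fields and header rows, so treat it as a sketch rather than my project's actual parser:

```csharp
using System;
using System.Globalization;
using System.Linq;

class CsvDemo
{
    // Parses CSV text into a jagged (doubly-nested) array of doubles.
    // Assumes numeric columns and no quoted fields -- a simplification.
    static double[][] ParseCsv(string text)
    {
        return text
            .Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Trim()
                .Split(',')
                .Select(field => double.Parse(field, CultureInfo.InvariantCulture))
                .ToArray())
            .ToArray();
    }

    static void Main()
    {
        // In the real project the text would come from a file,
        // e.g. File.ReadAllText(path); this string is a stand-in.
        string csv = "5.1,3.5,0\n4.9,3.0,0\n6.2,2.9,1\n";
        double[][] rows = ParseCsv(csv);
        Console.WriteLine(rows.Length);  // prints 3
        Console.WriteLine(rows[2][2]);   // prints 1
    }
}
```

Parsing with `CultureInfo.InvariantCulture` avoids surprises on machines whose locale uses a comma as the decimal separator.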

From there, each observation in the test set is compared against the members of the training set: the Euclidean distance is calculated between the two points, and the voting pool is formed based on the value of the k argument supplied.
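The distance calculation itself is a one-liner once the feature vectors are in arrays (again, a sketch with made-up points rather than the project code):

```csharp
using System;
using System.Linq;

class DistanceDemo
{
    // Euclidean distance between two feature vectors of equal length:
    // the square root of the sum of squared per-feature differences.
    static double Euclidean(double[] a, double[] b)
    {
        return Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
    }

    static void Main()
    {
        var p = new double[] { 0.0, 0.0 };
        var q = new double[] { 3.0, 4.0 };
        Console.WriteLine(Euclidean(p, q)); // prints 5 (the classic 3-4-5 triangle)
    }
}
```

Sorting the training set by this distance and taking the first k rows yields the voting pool.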

Then each member of the voting pool casts a vote, and the label with the most votes wins!



Primarily interested in the intersection of advancements in data science and public good. linkedin.com/in/brianthomasross