Designing Effective ML Systems

Asking the right questions makes all the difference

Brian Ross
6 min read · Jul 10, 2021

We are now firmly in the era of mass adoption for machine learning. The number of businesses seeking to incorporate machine learning solutions into their products and applications continues to grow daily. With that growth come many assumptions from decision-makers and stakeholders. One that can be almost pernicious at times is the idea that a machine learning engineer is a mathematician first and everything else second. No one doubts the need for a strong understanding of the mathematical fields on which these models are built. Doubting it would be ridiculed, and justifiably so, as it would be akin to giving a toddler a paintball gun and asking them to paint “Starry Night”.

However, at the end of the day, a notebook full of the best models can only add value to a business post-hoc. The most powerful algorithms and the most accurate models add the majority of their value not when tucked away in a notebook somewhere, but when they are put into production. This is where machine learning stops aiding only post-hoc analysis of what has happened and begins to inform what is happening and what is going to happen.

When we begin to think of machine learning solutions in this context one thing becomes very clear:

The model is only one part of the larger solution.

It is in this realization that we as machine learning engineers lean not on math, but on another discipline that has also had a large hand in shaping machine learning: computer science. When we lean into the wealth of work done in computer science, and more specifically software engineering, we pick up many invaluable tools that help take our work out of the notebook and into the wild.

Laying the Foundation

Foundations matter, whether you’re building a house or building a solution. Stakeholders aren’t interested in something shiny if it won’t last or doesn’t fit their long-term needs. Before we get carried away with anything else, we need to understand a few things:

Domain Knowledge

If you don’t understand the problem, then you won’t deliver the right solution. Take time to research the domain, understand the problem, and speak with subject matter experts. If you are trying to design a system and haven’t encountered at least a little resistance or pushback from someone whom the system will affect, then you need to dive deeper. Many times when we create these solutions, we are augmenting processes that people have done manually for quite some time. Their experience and input are far more valuable to the success of your project than any single component, and you should leverage them to the fullest extent.


What are the goals of the project or solution? How do you measure success? Our projects need to have clearly defined goals and deliverables that we can work towards and use to inform our design.


What are the performance needs of the solution? Are inferences needed in real-time or can there be a delay? How many concurrent users does this system need to support?
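One way to make a performance need concrete is to express it as a latency budget and check it against measurements. The sketch below is only an illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for a real model call, and the 50 ms budget is an invented number, not a recommendation.

```python
import statistics
import time


def measure_latencies_ms(predict, inputs):
    """Time each call to `predict` and return the latencies in milliseconds."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def p95(values):
    """Return the 95th percentile of `values`."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(values, n=20)[-1]


def meets_budget(latencies_ms, budget_ms):
    """True when the 95th percentile latency fits within the budget."""
    return p95(latencies_ms) <= budget_ms


def toy_predict(x):
    # Hypothetical stand-in for a real model call.
    return x * 2


if __name__ == "__main__":
    observed = measure_latencies_ms(toy_predict, range(1000))
    # The 50 ms budget is an assumed example value.
    print(f"p95 = {p95(observed):.4f} ms, within budget: {meets_budget(observed, 50.0)}")
```

Looking at a high percentile rather than the mean is a deliberate choice here, since a real-time requirement is usually broken by the slow tail of requests and not by the typical one.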


Every project has known limitations such as budget, available resources, and timelines, but be sure to understand any other constraints that need to be enforced. Common things to watch out for are contractual, statutory, and regulatory requirements.

Getting the Goods

Once you have a good understanding of the foundational aspects of your project, the next thing to worry about is getting at some actual data. To ensure your solution will have adequate data pipelines, you need to know the following:

  1. Does the data you need already exist, and if so, in what state?
  2. If you have only unlabelled data available, what steps need to be taken to generate annotated sets?
  3. How can you incorporate a feedback loop to ensure improved data collection post-deployment?
  4. What are the storage requirements and what technologies are available?
  5. What privacy constraints exist regarding the collection of user data?
  6. What are the security requirements of data moving through your system?
  7. What can be done in batch and what can be done in stream?
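The batch versus stream question in the list above can be illustrated with a single statistic computed both ways. This is a minimal sketch with invented readings: the batch version needs the whole dataset at once, while the streaming version updates its answer one record at a time and never stores the history.

```python
def batch_mean(values):
    """Batch style: requires all values to be available up front."""
    values = list(values)
    return sum(values) / len(values)


class StreamingMean:
    """Stream style: keeps only a count and a running mean."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: move the mean toward the new value by 1/count.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


if __name__ == "__main__":
    readings = [2.0, 4.0, 6.0, 8.0]  # invented example data
    running = StreamingMean()
    for r in readings:
        running.update(r)
    print(batch_mean(readings), running.mean)
```

Both paths arrive at the same number here. The design question is whether the answer is needed as each record arrives or whether it can wait for a scheduled job.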

Once you have the answers to these questions you can pick the right tools and tech to suit the needs of the project.

The Model

As we all know, there is no one-size-fits-all solution, not yet anyway. Building on the foundation you laid earlier, you should be able to break the problem into one or more specific tasks. Once you’ve identified the specific tasks, you can begin the model selection process. It’s important to remember that simpler will usually be better, and not to tie yourself to any one model, since any specific model’s performance will still need to be proven. Always, always remember that we are building solutions and not dogmas. Statistical models can often outperform deep learning techniques and are often easier to implement and maintain. When thinking about baselines, try to seek out more than just random baselines.
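To illustrate why a random baseline alone can mislead, here is a small sketch comparing it with a majority-class baseline. The labels are invented toy data, and the function names are hypothetical helpers written for this example.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def random_baseline(train_labels, n, seed=0):
    """Predict a uniformly random class for each of n samples."""
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    return [rng.choice(classes) for _ in range(n)]


def majority_baseline(train_labels, n):
    """Always predict the most frequent class seen in training."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


if __name__ == "__main__":
    # Invented, imbalanced toy labels.
    train = ["ok"] * 8 + ["fraud"] * 2
    test = ["ok"] * 4 + ["fraud"] * 1
    print("random  :", accuracy(test, random_baseline(train, len(test))))
    print("majority:", accuracy(test, majority_baseline(train, len(test))))
```

On imbalanced data like this, always predicting the majority class already scores 0.8 accuracy on the toy test set, so a model that reports 80% accuracy has not yet shown it learned anything.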


New data will always flow into your system, so when thinking about model training it’s important to consider your refresh needs. Will the model need to be continually refreshed, or can scheduled refreshes suffice? Deficits in this area often go unnoticed. What processes can we implement to avoid this common pitfall? Will additional pipelines be needed to incorporate insights from the data that the system itself is generating? Once a model is in production, another important thing to consider is extensibility. If we seek to incorporate new features, will this break the implementation? What tools can be used to enhance this workflow? How can we guarantee that changes to models are validated before entering production?
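One possible answer to the validation question is a promotion gate: a candidate model only replaces the production model if it beats it on the same held-out data by some margin. The sketch below assumes models are plain callables and uses invented data; `should_promote` and the `min_gain` margin are hypothetical names and values chosen for illustration.

```python
def holdout_accuracy(model, holdout):
    """Accuracy of a callable `model` on a list of (input, label) pairs."""
    correct = sum(model(x) == label for x, label in holdout)
    return correct / len(holdout)


def should_promote(candidate, production, holdout, min_gain=0.01):
    """Return (decision, candidate_score, production_score).

    The candidate is promoted only if it beats production by at least
    `min_gain` on the same held-out set. The margin is an assumed value.
    """
    cand = holdout_accuracy(candidate, holdout)
    prod = holdout_accuracy(production, holdout)
    return cand >= prod + min_gain, cand, prod


if __name__ == "__main__":
    # Invented toy models and data: each model is a simple threshold rule.
    production_model = lambda x: x > 5
    candidate_model = lambda x: x > 3
    holdout = [(1, False), (4, True), (6, True), (2, False)]
    print(should_promote(candidate_model, production_model, holdout))
```

Running a check like this in the deployment pipeline rather than by hand is what turns it from a habit into a guarantee.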


As the platform or system grows, how can we ensure it continues to meet the needs of the project? Monolithic architectures should generally be avoided. To what extent, though, do the various components of your project need to be factored out? In many instances even training needs to be distributed. In the worst-case scenario, even a single sample might be too large for one machine to hold. It’s important to understand these considerations when designing a solution, since turning back the clock is a very painful process in most cases.


How will your model be used by others? In my experience, ML systems are best built as micro-platforms. That is, we have gated access to the inner workings of the system in which we ingest data, access by other teams or components is restricted, and a service API exists to pass on the actionable data. If you’re interacting with other teams, it’s important to consider their needs and limitations in this step. There can be many benefits to providing familiar interfaces for others to interact with. Other use cases might call for performing inference directly on user devices, especially when privacy is a top concern. It’s important to consider the numerous trade-offs here and ensure your model or models are served in the way that makes the most sense for all parties involved.
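The micro-platform idea can be sketched as a thin service class: callers see one validated entry point and a small result type, while the model and its settings stay internal. Everything named here (`ScoringService`, `Prediction`, the `amount` field, the toy scoring rule) is a hypothetical illustration, not a description of any particular system.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("amount",)  # assumed input schema for this example


@dataclass(frozen=True)
class Prediction:
    """The only thing callers receive: an actionable label and a score."""
    label: str
    score: float


class ScoringService:
    """Public entry point; the model and threshold are kept internal."""

    def __init__(self, model, threshold=0.5):
        self._model = model
        self._threshold = threshold

    def predict(self, payload):
        features = self._validate(payload)
        score = self._model(features)
        label = "flag" if score >= self._threshold else "pass"
        return Prediction(label=label, score=score)

    def _validate(self, payload):
        # Gate the input before it reaches the model.
        missing = [k for k in REQUIRED_FIELDS if k not in payload]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        return [float(payload[k]) for k in REQUIRED_FIELDS]


def toy_model(features):
    # Hypothetical stand-in: scales the amount into a score between 0 and 1.
    return min(1.0, features[0] / 1000.0)


if __name__ == "__main__":
    service = ScoringService(toy_model)
    print(service.predict({"amount": 900}))
```

Because other teams depend only on `predict` and `Prediction`, the model behind the service can be retrained or replaced without changing their code.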


Communicating value is commonly overlooked when designing systems, but in some ways it is one of the most important pieces. Non-technical stakeholders will often determine the funding and timelines for your project, so it is of supreme importance that your system includes mechanisms for communicating its value to those stakeholders. We should seek to ensure that our work can be appreciated and understood by business interests who may not have a formal background in machine learning.


Of course, there is always more to learn when it comes to developing at the cutting edge, but I hope you found this post informative and helpful. Thank you for taking the time to read.


