Machine learning at Lifion

Mano Vikash
Lifion Engineering
Jun 2, 2021


This article is about how we approach building ML models for features at Lifion, by ADP. When building ML models in practice, theoretical underpinnings may be overlooked for a variety of reasons. The discrepancy between theory and practice in machine learning is a fiercely debated topic. We will discuss how we build cutting-edge ML models in practice while staying conscious of the theoretical aspects behind our work.

Here is the opening line of the preface of Tom Mitchell’s 1997 book on machine learning:

The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.

This statement about machine learning still holds to this day and resonates with our work on a variety of ML-powered services and features. We will start by making it more precise by looking at a concrete problem we worked on for a particular feature, which required predicting the items in a sequence and which we shall refer to here as next word prediction. We wanted to use ML to help our Application Developers, who use our low-code Lifion Development Platform, be more productive when building the Next-Gen HCM Solution that we provide to our clients. Similar to the “Smart Reply” and “Autocomplete” features we encounter when writing emails and searching online, our feature provides suggestions in our development tooling so developers can build with greater velocity and with best practices in mind.

In the quote above, the end result of constructing a computer program is an artifact usually called an ML model or, for brevity, a model. The experience is data from the past, and the improvement is quantified with respect to some task at hand. The model improves during the training phase by learning patterns from the data. We later use the model during the inference phase to make predictions.

In-depth ML knowledge is not assumed for this article. We will motivate technical concepts via intuition, give informal definitions, and point the interested reader to other sources for more details. The target audience is anyone curious about the approach we take for building models for a new feature, as well as ML practitioners who want to learn more about the theoretical aspects.

(Comic: https://xkcd.com/1838/)

Problem statement

The problem of predicting the next word in a sequence is encountered in many scenarios. While typing text, our phones and laptops are able to predict the next word we are going to type. These features have become more common in recent times as models have come a long way in learning patterns in human language. There are many standard libraries that can be used to develop these models, and these language models are usually trained on a large corpus such as Wikipedia. There are also nice visual representations of some of the patterns a language model can learn. In the notebook-style example below, we show how to load a previously trained model, feed in “The weather is getting” as input to the model, and see that it predicts “better”. All this takes less than 10 lines of code. Cool! Example code for the training phase is more involved, so we do not reproduce it here.
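Here is a minimal sketch of those steps, assuming the Hugging Face transformers library and the pretrained GPT-2 model (one plausible choice of previously trained model; the specific model and outputs may differ):

    # A minimal sketch: load a pretrained model, feed in the prompt,
    # and read off the most likely next token.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The weather is getting", return_tensors="pt")
    print(inputs["input_ids"])  # each word maps to a unique token id

    with torch.no_grad():
        logits = model(**inputs).logits
    next_id = logits[0, -1].argmax().item()
    print(tokenizer.decode(next_id))  # e.g. " better"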

In the output above, we see that each word is mapped to a unique token. The total number of tokens the model knows is the size of its vocabulary.
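Assuming the same GPT-2 tokenizer as above, the vocabulary size can be checked directly:

    # Checking the vocabulary size of the GPT-2 tokenizer.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    print(tokenizer.vocab_size)  # 50257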

The size of the vocabulary for the model used in the example above is about 50,000. For the actual feature we were building, we were not working with human language: our language consisted of metadata, and the vocabulary size was much smaller. For illustrative purposes, let us take the vocabulary size to be 50.

Putting all this together formally, our problem looked like the following:

1. During the training phase we observed a set of m sequences S₁, S₂, …, Sₘ.

2. Each Sᵢ was an ordered sequence of symbols of length f(i), drawn from a vocabulary V and generated by some underlying probabilistic process.

3. We used the m sequences to construct the model M. During the inference stage, we were given a partial sequence of symbols a₁, a₂, …, aₖ and the model predicted the next symbol like the following:

M(a₁, a₂, …, aₖ) = aₖ₊₁

where the sequence of aᵢ’s seen during the inference phase was generated by the same underlying probabilistic process that generated the data during the training phase. This is called the i.i.d. (independent and identically distributed) assumption, and it is necessary to make sure that the same patterns are present during the training and inference stages.
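As a toy illustration of this setup (a hypothetical bigram model over a tiny vocabulary, not the model we actually used), the training phase counts which symbol follows which, and the inference phase predicts the most frequent continuation:

    # A toy bigram model: estimate next-symbol patterns from the m
    # training sequences, then predict the most frequent continuation.
    from collections import Counter, defaultdict

    sequences = [["a", "b", "c"], ["a", "b", "d"], ["b", "c", "a"]]  # hypothetical S_1..S_m

    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1

    def predict_next(partial):
        # Predict the next symbol given a partial sequence.
        return counts[partial[-1]].most_common(1)[0][0]

    print(predict_next(["a", "b"]))  # -> "c" ("b" is followed by "c" twice, "d" once)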

Feature

For our feature, the words were not typed on a keyboard. They were selected from a list of options in a web-based graphical user interface, similar to how we select items from a drop-down list. Model performance was crucial to the value of the feature, and there were various metrics to quantify it. One potential metric could have been whether the prediction made by the model was what the user actually ended up selecting during the inference stage; this is a measure of accuracy. When we did some data exploration, we noticed that some words were much more frequent than others, i.e. there was class imbalance.

As the most frequent classes were already easily accessible to the end user in our user interface, predicting them would not provide much value. Hence, a model that keeps predicting the most frequent class all the time would not add much value even though its accuracy may be high. Let us take a closer look at the metric we chose to optimize, which we felt would provide better value for our use case, and at how we estimate it.

Metric

The standard practice for estimating any metric is to split the m sequences during the training stage into training data, validation data, and testing data.

The model is trained and fine-tuned using the training and validation data respectively. We then use the testing data and the trained model to construct the test set, which is composed of test samples of the form (partial sequence, actual next symbol, predicted next symbol).

We then use the test set to estimate the metrics. Suppose we are interested in the accuracy of a model: we can measure this by comparing the predicted value and actual value for each test sample. If the predicted value and actual value agree for a given sample, we say that the sample is correctly predicted, and the accuracy on the test set is simply the fraction of test samples that are correctly predicted. The accuracy ø of a model M on a test set is defined as

ø = (number of correctly predicted samples) / (total number of samples)

Because of the class imbalance in our dataset and the user interface considerations discussed earlier, accuracy did not seem to be the right metric for us to optimize. The main problem with accuracy is that it places less weight on the classes with low frequency. If, on the other hand, we focus on making the accuracy of each class high, we make sure that we are not ignoring the low-frequency classes. Let us quantify this metric more carefully. For a fixed class c, we first take the subset of the test set whose actual value is c. The fraction of correctly predicted samples in this subset is the accuracy of the model on the test set for the class c, defined as

ø_c = (number of correctly predicted samples with actual value c) / (number of samples with actual value c)

The average of this quantity over all the classes is a metric which gives equal weight to each class. This gives us the metric called macro-averaged accuracy, denoted by µ:

µ = (1/|C|) Σ_{c ∈ C} ø_c

where C is the set of all classes.
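As a quick sketch, µ corresponds to what scikit-learn calls balanced accuracy (the mean of per-class recall), and an imbalanced toy test set shows how it differs from plain accuracy:

    # Accuracy vs. macro-averaged accuracy on a toy, imbalanced test set.
    # Macro-averaged accuracy corresponds to scikit-learn's
    # balanced_accuracy_score (the mean of per-class recall).
    from sklearn.metrics import accuracy_score, balanced_accuracy_score

    actual    = ["a", "a", "a", "a", "b"]  # "a" dominates
    predicted = ["a", "a", "a", "a", "a"]  # always predicting the majority class

    print(accuracy_score(actual, predicted))           # 0.8 -- looks good
    print(balanced_accuracy_score(actual, predicted))  # 0.5 -- exposes the imbalance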

We can construct a validation set and a train set similar to how we constructed the test set. We denote the values of the metric on the validation set and train set as µ_val and µ_train respectively.

Finally, we denote the expected value of the metric during the inference stage as E[µ].

As mentioned earlier, we will focus on µ for the rest of the article, but most results can be rephrased for other metrics. Remember that µ can be computed exactly for a given model on a given test set, and this serves as an estimate of how well the model will perform during the inference stage under the i.i.d. assumption. What happens during the inference stage cannot be foreseen and we can only talk probabilistically: the expected value of the metric in the inference stage is equal to its expected value computed on the test set.

We next jump into the training stage, where we discuss the procedure for constructing a model by optimizing the metric on the data.

Training stage

There are many libraries (such as scikit-learn, TensorFlow, and more) which give us the ability to build models. We have to decide which library or set of libraries to try for solving our problem. To make a careful decision, we must first understand the problem we are trying to solve and the methods these libraries employ under the hood.

Getting into the details of all the libraries and their underlying methods is beyond the scope of this article; see [1] and [3] for a theoretical and a practical treatment respectively. There are many intricacies involved, but at the same time there are general principles common across these methods. They abstract the model training problem into some mathematical framework such as support vector machines, Bayesian models, and so on. Within the mathematical framework, the problem translates into optimizing a mathematical function (referred to as the loss), which in turn depends on the metric we want to optimize. We then use appropriate optimization techniques such as linear programming, gradient descent, and so on to optimize the loss using the data and thereby construct a model. For a fixed mathematical framework, the set of all potential models it could represent is denoted by ∏, and the end result of the optimization is one of the models in this set, denoted by Mₒ.

Model selection

The level of complexity that we expect from a given feature gives us a sense of the complexity of the mathematical framework that is necessary and that in turn affects which library we pick for building our models. There are many variables at play here and it is very difficult to foresee which model is going to work for our feature and datasets.

Practice

The general approach is to try models of various levels of complexity depending on the data and evaluate them using our metric. Every time we try a model and evaluate it, we gain some knowledge and use it to make a more informed choice of what to try next. We ran experiments computing µ for various models on sample data; a sketch of this kind of comparison is shown below.
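Here is a hypothetical sketch of such a comparison, with illustrative model choices and synthetic data standing in for our actual experiments:

    # Comparing models of increasing complexity by µ on a held-out
    # validation set; the data and model choices here are illustrative.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_classes=3, n_informative=5,
                               weights=[0.7, 0.2, 0.1], random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    for model in [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)]:
        model.fit(X_train, y_train)
        mu = balanced_accuracy_score(y_val, model.predict(X_val))
        print(type(model).__name__, round(mu, 3))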

In practice, the model that gives the best value of µ on our validation set is picked and further hyperparameter optimization is carried out; this validation procedure is a simple form of cross-validation. This leads us to the following folklore question:

That works very well in practice, but how does it work out in theory?

Theory

On a more serious note, let us look at the theoretical justification behind what happens in practice. We trained a bunch of models using the training data, computed µ for these trained models on the validation data and picked out the model with the highest value. How do we relate this to the model performance during the inference stage? There are many ways to think about this relationship and one of them is using generalization bounds. At the end of the day, these bounds are a formulation of Occam’s razor and they serve to make sure we do not overfit on data during the training phase. In other words, we want a model that does well for our feature during the inference phase and anything that we do during the training phase (splitting data, computing metrics, selecting models, etc.) should serve to optimize for it.

Some common types of generalization bounds are VC dimension bounds and Rademacher complexity bounds. The Rademacher complexity bound states that, with high probability over the data generation process, the following bound holds:

E[µ] ≥ µ_train − R(∏) − √(log(1/∆) / (2n))

Let us go over the terms from left to right:

  1. The first term is the expected value of µ during inference. [inference µ]
  2. The second term is the exact value of µ evaluated on the train set. [training µ]
  3. The third term, R(∏), is the Rademacher complexity. It is a measure of the representational capacity of the underlying mathematical framework. It depends on the set ∏ of all potential models the framework could represent, on the data generating process, and on the number of samples in the train set. [complexity term]
  4. The last term contains ∆, which quantifies the high probability with which the bound holds, and n, the number of train samples. We usually think of n as large enough that we can ignore this term completely. [what we can ignore under reasonable assumptions]
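For a sense of scale in item 4 (with assumed values): taking ∆ = 0.01 and n = 10,000 train samples gives √(log(1/∆)/(2n)) = √(log(100)/20,000) ≈ 0.015, which is indeed small enough to ignore in most model comparisons.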

Some things to note:

  1. The bound holds for any model M, and in particular for the optimized model Mₒ.
  2. We do not give the bound in full generality. For instance, we can replace µ with other metrics, the set we evaluate on need not be the train set, etc.

This bound motivates us to optimize the entire right hand side (RHS) instead of just the training µ term. For some mathematical frameworks, this can be incorporated into the loss function during training. This gives us a way to build principled algorithms that generalize to new data and provides theoretical justification for the concept of regularization. Libraries such as sklearn contain many algorithms (like linear support vector classification) that do this in the loss function and hence use the train set to optimize the entire RHS. When comparing models built with different frameworks, picking the model that gives the best value of µ on the validation set can also be motivated as optimizing the entire RHS of the bound. In general, these bounds give us a way to motivate model choices made in practice.
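As a sketch of this idea with scikit-learn's LinearSVC: its loss includes a regularization (complexity) term, and the C parameter controls the tradeoff, with smaller C meaning stronger regularization:

    # LinearSVC incorporates a regularization (complexity) term in its
    # loss; smaller C means stronger regularization.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    for C in [0.01, 1.0, 100.0]:
        model = LinearSVC(C=C, max_iter=10000).fit(X_train, y_train)
        print(C, model.score(X_train, y_train), model.score(X_val, y_val))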

Finally, we have not leaked any information from the test set into our model selection. Hence, we can get an unbiased estimate of the performance of the model during inference using the following:

E[µ] ≈ µ_test

The complexity bound in this section paraphrases Theorem 3.5 in Chapter 3 of Foundations of Machine Learning [2]. That bound is for binary classification, but a similar bound for the multi-class regime can be found in the later chapters.

Our experiences

Oftentimes, when developing ML models for features in industry, practitioners take shortcuts such as splitting the data into two parts (train and test) instead of the three (train, validation, and test) mentioned earlier. Taking such shortcuts can mean using the same dataset for both validation and testing. This is an obvious case of information leakage, and there are methods such as nested cross-validation to mitigate its effects.
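Here is a sketch of nested cross-validation with scikit-learn, reusing the balanced-accuracy metric from earlier (the data here is synthetic):

    # Nested cross-validation: the inner loop tunes hyperparameters,
    # the outer loop estimates performance, so no fold is used for both.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)

    inner = GridSearchCV(LogisticRegression(max_iter=1000),
                         param_grid={"C": [0.1, 1.0, 10.0]},
                         scoring="balanced_accuracy", cv=3)
    outer_scores = cross_val_score(inner, X, y, scoring="balanced_accuracy", cv=5)
    print(outer_scores.mean())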

When iterating on models over longer time periods, data is collected and models are updated continuously. Every time we update a model, we care about how it is going to perform during the inference stage, and all our experiments should be structured towards optimizing and estimating this performance. However, model updates affect the way users interact with the system, which in turn affects the data being generated (i.e. the data distribution evolves and is no longer i.i.d.), and all of this happens repeatedly. In this case, how do we partition the data and compute the metrics? The solution is not as simple as making a new split of all the available data, as the data distribution may be changing over time and there are other costs involved. For instance, re-running all the previous experiments on a new data split may require a lot of effort (time, compute power, etc.). Such situations lead to complex cost-benefit tradeoffs where practitioners may be forced to make decisions that lead to more nuanced forms of information leakage. Even in these situations, empirically computing some of the terms of the bound in the previous section over time helps us get a better sense of what is changing and guides us towards improving the model.

Conclusion

Working in our fast-paced environment at Lifion requires us to be agile and developing ML models is no different. In such an environment, delivering in an iterative way and with high velocity can be challenging, at times making it necessary to make pragmatic choices. Rigorous results from the theoretical world guide us to build models in a principled way that deliver better experiences for our clients. We use our agile approach as an opportunity to continually iterate towards striking the right balance between velocity and rigor when we develop models for our features.

Additional reading

If you’re interested in reading further on these topics, I recommend the following resources. The chapter on machine learning basics in [1] gives a concise introduction and does an excellent job of providing intuitive explanations while setting things up on theoretically solid footing. [2] gives a thorough introduction to the theory behind ML, and [3] gives a quick practical guide to ML. Enjoy!

[1] Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville

[2] Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar

[3] Machine Learning Crash Course by Google
