Our machine-learning models are trained to predict how likely an applicant is to be hired, and to then perform well in a given role, using predictive patterns found in thousands of candidates hired via our platform. These patterns are derived from the free-text answers and the multiple-choice personality answers provided by the candidates. A typical PredictiveHire assessment contains 5-10 free-text questions and 30-40 multiple-choice questions, and typically takes a candidate 20-30 minutes to complete.

The data science methodology we employ is driven by a culture of continuous research: experimenting with and selecting algorithms that reduce bias and improve the accuracy of the generated models. We currently select from among 5 different classification algorithms based on the input data, and use 10-fold cross validation to test the models.

How it works

We use a rigorous process to extract useful and predictive properties (called "features" in machine-learning lingo) from the text responses given by candidates, which are then used to build predictive models. These features are the outcome of our work in feature engineering, a continuous discovery process that unearths predictive properties in the input data. We use various natural language processing (NLP) methods to extract these properties; the following list describes the categories of properties we currently use (a simple introduction to NLP can be found here).

Personality: we use an NLP model, built using a deep neural network, to map a given piece of text to personality metrics recognised in workforce science.

Readability: we calculate readability scores using a number of standard measures, including the Flesch-Kincaid Grade, amongst others.

Semantic alignment: another NLP-based deep neural network model, trained using past hires and their responses. 

Sentiment: measuring the polarity and subjectivity of the text response.

Text structure and quality: a plethora of measures including sentence count, word count, ratio of various parts of speech, etc.
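As a simple illustration of the last two categories, structure and readability features can be computed in a few lines of Python. This is a minimal sketch, not our production implementation: the syllable counter is a rough vowel-group heuristic, and real pipelines use more careful tokenisation.

```python
import re

def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def text_features(text):
    """Extract simple structure and readability features from one response."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_sent, n_words = max(1, len(sentences)), max(1, len(words))
    # Flesch-Kincaid Grade Level formula.
    fk_grade = 0.39 * (n_words / n_sent) + 11.8 * (syllables / n_words) - 15.59
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "avg_sentence_length": n_words / n_sent,
        "fk_grade": round(fk_grade, 2),
    }

feats = text_features("I enjoy solving problems. I work well in teams.")
```

Each candidate response yields one such dictionary; stacked together, these become columns of the feature matrix used for training.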

These feature values, extracted from thousands of candidates (a data matrix of tens of thousands of data points), are then used to train a machine-learning classifier that identifies the latent patterns in the data and assigns a propensity score indicating how suitable a candidate is for a given role. We typically use 90% of the data for training, with 10% kept for testing. In training we use 10-fold cross validation to measure the accuracy of the model (see here for an overview of k-fold cross validation). The 10% test set is then used to validate the model by calculating its accuracy (concurrent validity) in predicting the outcomes. Our current models show test accuracies in the range of 70%-76%.
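The split-and-validate workflow described above can be sketched with scikit-learn. The data here is synthetic, standing in for the real feature matrix and hiring outcomes, and the random forest is just one of the algorithms in play:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                         # candidates x features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # synthetic hiring outcomes

# Hold out 10% of the data for independent testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

model = RandomForestClassifier(random_state=0)

# 10-fold cross validation on the 90% training portion.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)

# Final validation (concurrent validity) on the held-out 10% test set.
model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, model.predict(X_test))
```

The mean of `cv_scores` guides model selection during training; `test_acc` on the untouched 10% is the figure reported as test accuracy.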

Rationale Behind Feature Selection

The discovery and selection of features that go into model building is done with careful consideration of each feature's predictive ability and validity. We consider the following aspects when selecting features:

Prior peer-reviewed research that indicates wider research and scrutiny of the features. For example, the research and applications listed below in the reference section show the existence of latent patterns in text related to personality, and the applicability of deep learning models to extract them. This research helps form the basis for the feature categories of Personality, Semantic Alignment and Sentiment listed above.

Concurrent validity of the features in explaining the outcome. Features are tested individually to measure the strength of their correlation with the outcomes; we use Spearman and Kendall correlation tests for this.
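A per-feature validity check of this kind can be run with scipy. The values below are synthetic, and the 0.1 correlation threshold is an illustrative cut-off, not our production rule:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(42)
feature = rng.normal(size=500)                               # one candidate feature
outcome = (feature + rng.normal(size=500) > 0).astype(int)   # hiring outcome labels

# Rank-based correlation tests: robust to non-linear but monotonic relationships.
rho, rho_p = spearmanr(feature, outcome)
tau, tau_p = kendalltau(feature, outcome)

# Retain the feature only if the correlation is meaningful and significant.
keep = abs(rho) > 0.1 and rho_p < 0.05
```

Running this test per feature column filters out properties that carry no signal about the outcome before any model is trained.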

Model Training Process

The model training process is iterative: multiple machine learning models are applied to the training dataset while key performance indicators are monitored, to identify which model provides the best accuracy measures.

We currently use tree-based ensemble models and deep neural network algorithms to build our models. Ensemble methods combine multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent algorithms alone, and have been applied successfully in advanced classification tasks such as face recognition, emotion recognition, fraud detection and medicine. For example, one algorithm we have used successfully is the random forest algorithm, which combines random decision trees with bagging to achieve very high classification accuracy. All training sessions use random forest, extra trees, XGBoost and logistic regression (for comparison) by default, and the best model is selected. Depending on performance we also use the K-Nearest Neighbour, Support Vector Machine and Naive Bayes classification algorithms.

With large text corpora, deep neural network algorithms such as Convolutional Neural Networks (CNNs) have been applied successfully to many natural language processing (NLP) tasks, including text-based classification. We use CNN-based models to develop our text-to-personality classifiers and the text-to-hiring-outcome classifiers used to measure the semantic alignment of candidate responses.
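The core operation of a Kim-style text CNN, convolution filters sliding over a word-embedding matrix followed by max-over-time pooling, can be illustrated in a few lines of numpy. The weights here are random; a trained model learns them, and real implementations use a deep learning framework rather than raw numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, n_filters, filter_width = 12, 50, 8, 3

# One response: a sequence of word-embedding vectors (seq_len x embed_dim).
embeddings = rng.normal(size=(seq_len, embed_dim))
# Convolution filters, each spanning filter_width consecutive words.
filters = rng.normal(size=(n_filters, filter_width, embed_dim))

# Slide each filter over the word sequence (valid convolution).
conv = np.array([
    [np.sum(embeddings[i:i + filter_width] * f)
     for i in range(seq_len - filter_width + 1)]
    for f in filters
])
feature_map = np.maximum(conv, 0)   # ReLU activation
pooled = feature_map.max(axis=1)    # max-over-time pooling: one value per filter
```

The pooled vector (one value per filter) is what a final dense layer would map to personality scores or a hiring-outcome probability.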

Method and KPIs

Our training is based on standard machine learning practices that include:

Leaving 10% of the data aside for independent testing of the model

Using the remaining 90%, with 10-fold cross validation, to build the model

Testing the model on the held-out 10% test data

We currently monitor the following model performance indicators:

Accuracy: Ratio of correct predictions over all predictions

Precision: Ratio of true positive predictions over all positive predictions (true positive + false positive)

Recall: Ratio of true positive predictions over all positive candidates in the dataset (true positive + false negative)

F1 measure: The harmonic mean of precision and recall

AUC: Area under the curve of the Receiver Operating Characteristic (ROC) graph
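Given the entries of a confusion matrix, the first four indicators reduce to simple ratios. The counts below are purely illustrative:

```python
# Illustrative confusion-matrix counts: true/false positives and negatives.
tp, fp, fn, tn = 70, 20, 30, 80

accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct over all predictions
precision = tp / (tp + fp)                          # correct positives over predicted positives
recall = tp / (tp + fn)                             # correct positives over actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```

Tracking precision and recall alongside accuracy matters because hiring datasets are often imbalanced: a model that predicts "hire" for everyone can still score a deceptively high accuracy.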


See below for the ROC curve from which the AUC was calculated. AUC measures the probability that a randomly chosen positive candidate is scored higher by the model than a randomly chosen negative candidate (a true positive ranked above a false positive). In psychology and behavioural science an AUC > 0.7 is considered a good outcome. We have seen the AUC of our models increase with the volume of candidate data and outcome data, such as hiring decisions and post-hire performance.
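This probabilistic reading of AUC can be checked directly: compare every (positive, negative) pair of model scores and count how often the positive is ranked higher, with ties counting half. The scores below are made up for illustration:

```python
import numpy as np

scores_pos = np.array([0.9, 0.8, 0.6, 0.55])  # model scores for hired candidates
scores_neg = np.array([0.7, 0.5, 0.3, 0.2])   # model scores for non-hired candidates

# AUC = P(score of a random positive > score of a random negative).
wins = (scores_pos[:, None] > scores_neg[None, :]).sum()
ties = (scores_pos[:, None] == scores_neg[None, :]).sum()
auc = (wins + 0.5 * ties) / (len(scores_pos) * len(scores_neg))
```

This pairwise count agrees with the area under the ROC curve computed from the same scores, which is why AUC is read as a ranking probability.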


References

N. Majumder, S. Poria, A. Gelbukh and E. Cambria, "Deep Learning-Based Document Modeling for Personality Detection from Text," IEEE Intelligent Systems, vol. 32, no. 2, pp. 74-79, Mar.-Apr. 2017. DOI: 10.1109/MIS.2017.23

Ong, Veronica & Rahmanto, A.D.S. & Williem, Williem & Suhartono, Derwin. (2017). Exploring personality prediction from text on social media: A literature review. 9. 65-70.

Ramos dos Santos, Wesley & Paraboni, Ivandre. (2018). Personality facets recognition from text.

Carducci, Giulio & Rizzo, Giuseppe & Monti, Diego & Palumbo, Enrico & Morisio, Maurizio. (2018). TwitPersonality: Computing Personality Traits from Tweets Using Word Embeddings and Supervised Learning. Information (Switzerland). 9. 10.3390/info9050127.

Golbeck, Jennifer Ann (2016). "Predicting Personality from Social Media Text," AIS Transactions on Replication Research: Vol. 2, Article 2. DOI: 10.17705/1atrr.00009

Boyd, R. L., & Pennebaker, J. W. (2017). Language-based personality: A new approach to personality in a digital world. Current Opinion in Behavioral Sciences, 18, 63-68. http://dx.doi.org/10.1016/j.cobeha.2017.07.017

McCrae, Robert R. & Costa, Paul. (1989). Reinterpreting the Myers-Briggs Type Indicator From the Perspective of the Five-Factor Model of Personality. Journal of Personality, 57, 17-40. DOI: 10.1111/j.1467-6494.1989.tb00759.x

Kim, Yoon. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 10.3115/v1/D14-1181. 
