At Deck, we build predictions using an approach we call “contextual inference.” This approach uses hard data on individuals’ past behaviors, and the context around those behaviors, to predict what new people in new contexts might do. In this documentation, we’ll discuss how we are using that approach to determine the likelihood a given person will make a contribution to a specific political campaign.
OUR APPROACH
If you’ve received emails or text messages asking you to donate to a campaign you’ve never heard of, you’ve probably been identified as a likely Democratic campaign contributor by a fundraising data vendor. These vendors create lists of people from across the country who have contributed to other campaigns in the past and sell those lists to campaigns eager to get their fundraising operation off the ground. Often, these vendors will charge campaigns some percentage of the donations raised from prospects on these lists.
While these lists have been helpful, they also have drawbacks — such as exacerbating donor fatigue, shifting a campaign’s focus to out-of-district donors, and leaving prospective contributors who aren’t existing Democratic Party donors on the table.
For these reasons, we’ve built a model that generates person-level, campaign-specific probability scores. This approach (1) takes into account what makes each campaign unique, (2) learns from past donors without limiting future outreach to them, and (3) makes it easier to identify potential donors in your own district.
THE DATA WE USE
This model uses traits of voters, traits of candidates, election-related media coverage, campaign finance reports, and election results.
Voter traits — We use historic snapshots of each state’s voter file from TargetSmart (for soft side customers) and the DNC (for hard side customers) to determine a set of demographic and socioeconomic traits for American adults registered to vote at the time of various past elections.
Candidate and election traits — We rely on Ballotpedia, VoteSmart, Open States, Reflective Democracy, and state election agencies to collect information on the candidates who ran (or are running) for office — including their incumbency status, endorsements from issue advocacy organizations, demographics, history of holding elected office, and more. We also use these sources to identify when elections take place and what rules govern voting eligibility and access.
Media — We license historic and current online, print, and TV news content from Critical Mention, Aylien, and the Google Programmable Search API. We then identify articles and clips related to specific candidates and elections, and match them to the appropriate campaigns. We also use natural language processing tools (both licensed and built by our team) to parse the topics discussed in coverage and the sentiment of candidate mentions (a simplified sentiment-scoring sketch appears after this list).
Finance — We gather itemized and summary campaign finance data from state and local campaign finance portals and the National Institute on Money in Politics. We then match contribution records to individual registered voters, allowing us to track how the traits of a campaign’s contributors change over time; a simplified matching sketch also appears below. (Note: due to recent FEC advisory opinions, this model is only trained on state and local campaign contribution records.)
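To make the media step concrete, here is a minimal sketch of sentiment scoring for candidate mentions. It stands in the open-source transformers library for our licensed and in-house NLP tools, and the article snippets are hypothetical:

```python
# Illustrative only: a stand-in for the licensed and in-house NLP tools
# described above, using the open-source transformers library.
from transformers import pipeline

# Hypothetical article snippets already matched to a single campaign.
mentions = [
    "Rep. Smith drew praise for her work on the water infrastructure bill.",
    "Critics say Smith has been absent from key committee votes.",
]

# Loads a default English sentiment model on first use.
sentiment = pipeline("sentiment-analysis")

for text in mentions:
    result = sentiment(text)[0]  # e.g. {"label": "POSITIVE", "score": 0.99}
    print(f'{result["label"]:>8} ({result["score"]:.2f}) {text}')
```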
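And here is a simplified sketch of the contribution-to-voter matching step. Production matching is more involved; this version joins on a normalized name plus ZIP code, with made-up records:

```python
import pandas as pd

# Hypothetical itemized contributions and a voter file extract.
contributions = pd.DataFrame({
    "contributor_name": ["JANE Q. DOE", "John Smith"],
    "zip": ["30305", "30306"],
    "amount": [50.0, 250.0],
})
voters = pd.DataFrame({
    "voter_id": ["GA-000123", "GA-000456"],
    "name": ["Jane Q. Doe", "John Smith"],
    "zip": ["30305", "30306"],
})

def match_key(names: pd.Series, zips: pd.Series) -> pd.Series:
    # Case-fold and collapse whitespace so trivial variants still join.
    return names.str.upper().str.split().str.join(" ") + "|" + zips

contributions["key"] = match_key(contributions["contributor_name"], contributions["zip"])
voters["key"] = match_key(voters["name"], voters["zip"])

matched = contributions.merge(voters[["key", "voter_id"]], on="key", how="left")
print(matched[["contributor_name", "voter_id", "amount"]])
```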
HOW THE MODELS ARE BUILT
The first step in preparing this model is to assemble its training data. To do this, we combine a registered voter’s traits (including their demographics and socioeconomic details) with the details of each election (including whether it’s a primary or special election, how much media coverage the race was getting, and how financially competitive the race was), and the traits of the individual campaigns on the ballot. Campaign traits we use include the candidate’s demographics, incumbency status, and history in office; the volume and type of media coverage the campaign is getting; and the demographic and socioeconomic traits of the campaign’s existing donors.
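As a rough illustration (not our production pipeline), assembling these rows amounts to joining voter, election, and campaign tables into voter-by-campaign pairs and labeling each pair with whether a contribution occurred. All names and values below are hypothetical:

```python
import pandas as pd

# Hypothetical, heavily simplified inputs; the real feature set is far wider.
voters = pd.DataFrame({
    "voter_id": [1, 2],
    "age": [34, 61],
    "household_income": [54_000, 88_000],
})
elections = pd.DataFrame({
    "election_id": ["ga-2020-gen"],
    "is_primary": [False],
    "media_volume": [120],
})
campaigns = pd.DataFrame({
    "campaign_id": ["c1", "c2"],
    "election_id": ["ga-2020-gen", "ga-2020-gen"],
    "incumbent": [True, False],
})

# One row per voter-campaign pair. (A real pipeline would restrict each
# voter to the campaigns actually on their ballot, not a full cross join.)
pairs = voters.merge(campaigns, how="cross").merge(elections, on="election_id")

# Label: did this voter give to this campaign? (hypothetical matched records)
gave = {(1, "c1")}
pairs["contributed"] = [
    (v, c) in gave for v, c in zip(pairs["voter_id"], pairs["campaign_id"])
]
```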
Next, we use our training data to identify the features most likely to have high predictive power, either alone or in combination with others, and those most likely to push the model toward overfitting or diminish the impact of other features. At this stage, we (1) prune highly correlated features and features without meaningful variation, (2) use a technique called VSURF (variable selection using random forests) to better understand how features will interact with each other, and (3) impute missing data.
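Here is a minimal sketch of steps (1) and (3). VSURF itself is an R package, so step (2) isn’t shown; the 0.95 correlation threshold and median imputation are illustrative assumptions rather than our documented settings:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

def prune_features(X: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
    # Drop features without meaningful variation.
    X = X.loc[:, X.nunique(dropna=True) > 1]
    # Drop one member of each highly correlated pair.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return X.drop(columns=redundant)

# Toy frame: "b" is a scaled copy of "a" and "c" has no variation.
X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [7.0, 7.0, 7.0, 7.0],
    "d": [0.5, np.nan, 0.1, 0.9],
})

X_pruned = prune_features(X)  # keeps "a" and "d"
# Impute missing values; median imputation shown as one possibility.
X_ready = SimpleImputer(strategy="median").fit_transform(X_pruned)
```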
Finally, we design a deep learning architecture to predict our outcome. In this case, we’ve built a six-layer neural network. The model uses a sigmoid activation function and the Adam optimization algorithm, and it is trained to minimize binary cross-entropy loss.
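For a concrete picture, here is a minimal Keras sketch consistent with that description: six layers ending in a sigmoid output, compiled with Adam and binary cross-entropy. The layer widths and the ReLU hidden activations are illustrative guesses; only the depth, sigmoid activation, optimizer, and loss come from the description above:

```python
import tensorflow as tf

def build_model(n_features: int) -> tf.keras.Model:
    # Six layers; widths and hidden activations are illustrative guesses.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        # Sigmoid output: contribution probability for one voter-campaign pair.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer="adam",            # Adam optimization algorithm
        loss="binary_crossentropy",  # minimized during training
        metrics=["accuracy", tf.keras.metrics.AUC(name="roc_auc")],
    )
    return model
```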
EVALUATING ACCURACY
To validate this model, we held out a randomly selected batch of campaigns from our training data. We then used the data associated with those campaigns to test the model’s accuracy.
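In scikit-learn terms this is a grouped split, where the grouping key is the campaign, so no campaign appears on both sides. A sketch with hypothetical arrays:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical arrays: one row per voter-campaign pair.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                # feature matrix
y = rng.integers(0, 2, size=1000)              # 1 = contributed
campaign_ids = rng.integers(0, 50, size=1000)  # which campaign each row belongs to

# Hold out ~20% of campaigns (not 20% of rows) for testing.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=campaign_ids))

# No campaign should appear in both training and testing data.
assert set(campaign_ids[train_idx]).isdisjoint(campaign_ids[test_idx])
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```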
| ROC AUC | PR AUC | Binary accuracy | Top decile lift | Bottom decile lift |
| 0.97 | 0.94 | 0.77 | 8.05 | 0.00 |
When validated on over 45,000 testing samples, we found that the model’s area under the ROC curve was 0.97, meaning that 97% of the time the model scored a randomly selected actual contributor higher than a randomly selected non-contributor. The overall binary accuracy of our testing predictions was 77%. And in a lift chart organized by decile, the top decile had a lift of 805 (8.05 times the baseline contribution rate) while the bottom had a lift of 0. In other words, voters with scores in the top decile were about 8 times more likely than a randomly selected voter to make a contribution to a specific campaign, while those in the bottom decile were less than 1% as likely to do so.
Lift by decile (on 48,600 records held out from training)

| Decile | Actual contributors | Predicted contributors | Lift |
| 1 (top) | 0.731 | 0.713 | 805 |
| 2 | 0.112 | 0.098 | 129 |
| 3 | 0.031 | 0.026 | 35 |
| 4 | 0.013 | 0.010 | 14 |
| 5 | 0.009 | 0.005 | 10 |
| 6 | 0.004 | 0.002 | 5 |
| 7 | 0.002 | 0.001 | 2 |
| 8 | 0.001 | 0.000 | 1 |
| 9 | 0.000 | 0.000 | 0 |
| 10 (bottom) | 0.000 | 0.000 | 0 |
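For reference, decile lift as reported above can be computed like this. Inputs are hypothetical; lift is a decile’s contribution rate divided by the overall rate, expressed in percent, so 805 means 8.05 times the baseline:

```python
import numpy as np
import pandas as pd

def lift_by_decile(y_true: np.ndarray, y_score: np.ndarray) -> pd.Series:
    df = pd.DataFrame({"y": y_true, "score": y_score})
    # Decile 1 = the 10% of records with the highest predicted scores.
    ranks = df["score"].rank(method="first", ascending=False)
    df["decile"] = pd.qcut(ranks, 10, labels=list(range(1, 11)))
    baseline = df["y"].mean()  # overall contribution rate
    # Lift in percent: 100 = baseline, 805 = 8.05x the baseline rate.
    return (df.groupby("decile", observed=True)["y"].mean() / baseline * 100).round()

# Hypothetical usage with a trained model's held-out scores:
# print(lift_by_decile(y_test, model.predict(X_test).ravel()))
```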
MOST VALUABLE PREDICTORS
While it’s difficult to measure variable importance in deep learning models, we can use the output of our VSURF runs to estimate which variables have the most predictive power. The most significant variables are described below, grouped by category; a rough Python analogue to the VSURF ranking follows the lists.
Voter traits
Turnout in past elections
History of political giving
History of charitable giving
Household income
Party affiliation
Neighborhood population density
Race and ethnicity
Educational attainment
Issue stances
Campaign traits
Recent fundraising trendline
Volume of media coverage
Election stage
Other offices on the ballot
Demographics of contributors
Demographics of candidates
Candidate issue stances
Generic congressional ballot polling
Distance from Election Day
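As noted above, VSURF runs in R, so we don’t reproduce it here. As a rough, swapped-in Python analogue, one could rank features by random forest importance; the data below is synthetic:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; in practice this would run on the assembled training set.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)
features = [f"feature_{i}" for i in range(X.shape[1])]

# Impurity-based importance is a crude analogue to VSURF's ranking step.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=features)
print(importance.sort_values(ascending=False).head())
```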