Deck - Contributor model documentation

Written by CG Kelly

At Deck, we build predictions using an approach we call “contextual inference.” This approach uses hard data on individuals’ past behaviors, and the context around those behaviors, to predict what new people in new contexts might do. In this documentation, we’ll discuss how we are using that approach to determine the likelihood a given person will make a contribution to a specific political campaign.

OUR APPROACH

If you’ve received emails or text messages asking you to donate to a campaign you’ve never heard of, you’ve probably been identified as a likely Democratic campaign contributor by a fundraising data vendor. These vendors create lists of people from across the country who have contributed to other campaigns in the past and sell those lists to campaigns eager to get their fundraising operation off the ground. Often, these vendors will charge campaigns some percentage of the donations raised from prospects on these lists.

While these lists have been helpful, they also have drawbacks — such as exacerbating donor fatigue, shifting a campaign’s focus to out-of-district donors, and leaving prospective contributors who aren’t existing Democratic Party donors on the table.

For these reasons, we’ve built a model that generates person-level, campaign-specific probability scores. This approach (1) takes into account what makes each campaign unique, (2) learns from past donors without limiting future outreach to them, and (3) makes it easier to identify potential donors in your own district.

THE DATA WE USE

This model uses traits of voters, traits of candidates, election-related media coverage, campaign finance reports, and election results.

  • Voter traits — We use historic snapshots of each state’s voter file from TargetSmart (for soft side customers) and the DNC (for hard side customers) to determine a set of demographic and socioeconomic traits for American adults registered to vote at the time of various past elections.

  • Candidate and election traits — We rely on Ballotpedia, VoteSmart, Open States, Reflective Democracy, and state election agencies to collect information on the candidates who ran (or are running) for office — including their incumbency status, endorsements from issue advocacy organizations, demographics, history of holding elected office, and more. We also use these sources to identify when elections take place and what rules govern voting eligibility and access.

  • Media — We license historic and current online, print, and TV news content from Critical Mention, Aylien, and the Google Programmable Search API. We then identify articles and clips related to specific candidates and elections, and match them to the appropriate campaigns. We also use natural language processing tools (both licensed and built by our team) to parse the topics discussed in coverage and the sentiment of candidate mentions.

  • Finance — We gather itemized and summary campaign finance data from state and local campaign finance portals and the National Institute on Money in Politics. We then match contribution records to individual registered voters, allowing us to track how the traits of a campaign’s contributors change over time; a simplified sketch of this matching step follows this list. (Note: due to recent FEC advisory opinions, this model is only trained on state and local campaign contribution records.)
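
To make the matching step concrete, here is a minimal sketch of how itemized contribution records might be joined to a voter file on a normalized name-plus-ZIP key. The DataFrames and column names (contributor_name, full_name, zip_code, voter_id) are hypothetical stand-ins, and Deck’s production record linkage is certainly more sophisticated than this.

```python
import pandas as pd

def normalize_name(name: str) -> str:
    # Lowercase, drop punctuation, and sort tokens so that
    # "SMITH, JANE" and "Jane Smith" produce the same key.
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(sorted(cleaned.split()))

def match_contributions(contributions: pd.DataFrame, voters: pd.DataFrame) -> pd.DataFrame:
    # Build a blocking key from normalized name + ZIP code.
    contributions = contributions.assign(
        match_key=contributions["contributor_name"].map(normalize_name) + "|" + contributions["zip_code"]
    )
    voters = voters.assign(
        match_key=voters["full_name"].map(normalize_name) + "|" + voters["zip_code"]
    )
    # An inner join keeps only contributions we can tie to a registered voter.
    return contributions.merge(voters[["match_key", "voter_id"]], on="match_key", how="inner")
```

An exact-key join like this trades recall for precision; real matching pipelines typically layer fuzzier comparisons (addresses, nicknames, employers) on top.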

HOW THE MODELS ARE BUILT

The first step in preparing this model is to assemble its training data. To do this, we combine each registered voter’s traits (including their demographics and socioeconomic details) with the details of each election (including whether it’s a primary or special election, how much media coverage the race received, and how financially competitive it was) and the traits of the individual campaigns on the ballot. The campaign traits we use include the candidates’ demographics, incumbency status, and history in office; the volume and type of media coverage the campaign is receiving; and the demographic and socioeconomic traits of the campaign’s existing donors.
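
As a rough illustration, the assembly step might look like the following pandas sketch. The file names, join keys, and columns are hypothetical stand-ins for Deck’s actual schema, and pairing voters with every campaign in their state is a deliberate simplification.

```python
import pandas as pd

# Hypothetical inputs; one row per voter, election, campaign, or contribution.
voters = pd.read_parquet("voter_traits.parquet")          # demographics, socioeconomics, voter_id, state
elections = pd.read_parquet("election_traits.parquet")    # primary/special flags, coverage volume, competitiveness
campaigns = pd.read_parquet("campaign_traits.parquet")    # candidate traits, donor traits, campaign_id, election_id
contributions = pd.read_parquet("contributions.parquet")  # (voter_id, campaign_id) pairs that gave

# Attach election context to each campaign, then pair campaigns with the
# voters eligible to see them (here, naively, everyone in the same state).
campaign_context = campaigns.merge(elections, on="election_id")
training = voters.merge(campaign_context, on="state")

# Label each voter-campaign pair: 1 if that voter gave to that campaign, else 0.
contributions = contributions.assign(contributed=1)
training = training.merge(
    contributions[["voter_id", "campaign_id", "contributed"]],
    on=["voter_id", "campaign_id"], how="left",
)
training["contributed"] = training["contributed"].fillna(0).astype(int)
```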

Next, we use our training data to identify the features most likely to have high predictive power, either alone or in combination with others, and those most likely to push the model toward overfitting or diminish the impact of other features. At this stage, we (1) prune highly correlated features and features without meaningful variation, (2) use a technique called VSURF (Variable Selection Using Random Forests) to better understand how features will interact with one another, and (3) impute missing data.
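
In Python terms, those steps might look like the sketch below. Note that VSURF itself is an R package built on random forests; the random-forest ranking here is only a rough stand-in for it, and the thresholds and ordering (imputing first so the selectors see a complete matrix) are our assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer

def prepare_features(X: pd.DataFrame, y: pd.Series, corr_cutoff: float = 0.95) -> pd.DataFrame:
    # (3) Impute missing values (here with medians) so the selectors
    # below see a complete, numeric matrix.
    X = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(X), columns=X.columns)

    # (1a) Drop features with effectively no variation.
    keep = VarianceThreshold(threshold=1e-4).fit(X).get_support()
    X = X.loc[:, keep]

    # (1b) From each highly correlated pair of features, drop one.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    X = X.drop(columns=[c for c in upper.columns if (upper[c] > corr_cutoff).any()])

    # (2) A random forest's impurity-based importances serve as a rough
    # Python stand-in for VSURF's ranking step.
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    ranked = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
    return X[ranked.index]
```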

Finally, we design a deep learning architecture to predict our outcome. In this case, we’ve built a six-layer neural network. The model uses a sigmoid activation function and the Adam optimization algorithm, and is trained to minimize binary cross-entropy loss.
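
A minimal Keras sketch of that architecture follows. The source specifies the depth, the sigmoid activation, Adam, and the binary cross-entropy objective; the layer widths and the ReLU hidden activations are assumptions of ours.

```python
from tensorflow import keras

def build_model(n_features: int) -> keras.Model:
    # Six Dense layers; widths are illustrative, not Deck's actual sizes.
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(16, activation="relu"),
        # Sigmoid squashes the final logit into a 0-1 contribution probability.
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss="binary_crossentropy",
        metrics=["binary_accuracy", keras.metrics.AUC(name="auc")],
    )
    return model
```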

EVALUATING ACCURACY

To validate this model, we held out a randomly selected batch of campaigns from our training data. We then used the records associated with those campaigns to test the model’s accuracy.

When validated on over 45,000 testing samples, we found that the model’s area under the ROC curve was 0.97, meaning that 97% of the time the model scored a randomly selected actual contributor higher than a randomly selected non-contributor. The overall binary accuracy of our testing predictions was 77%. And in a lift chart organized by decile (with lift indexed so that 100 represents the baseline contribution rate), the top decile had a lift of 805 while the bottom had a lift of 0. In other words, voters with scores in the top decile were roughly 8 times more likely than a random voter to make a contribution to a specific campaign, while those in the bottom decile were less than 1% as likely to make a contribution.
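
The holdout-and-scoring procedure might look like this sketch, which reuses the hypothetical training frame and build_model from the earlier examples; the split fraction and training settings are assumptions.

```python
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GroupShuffleSplit

feature_cols = [c for c in training.columns
                if c not in ("contributed", "voter_id", "campaign_id", "election_id")]
X, y = training[feature_cols], training["contributed"]

# Hold out whole campaigns, not random rows, so the test set mimics
# scoring a campaign the model has never seen before.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=training["campaign_id"]))

model = build_model(len(feature_cols))
model.fit(X.iloc[train_idx], y.iloc[train_idx], epochs=10, batch_size=1024)

probs = model.predict(X.iloc[test_idx]).ravel()
print("AUC:", roc_auc_score(y.iloc[test_idx], probs))              # reported above: 0.97
print("Accuracy:", accuracy_score(y.iloc[test_idx], probs > 0.5))  # reported above: 0.77
```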

Lift by decile (on 48,600 records held out from training)

  Decile   Actual contributors   Predicted contributors   Lift
  1        0.731                 0.713                     805
  2        0.112                 0.098                     129
  3        0.031                 0.026                      35
  4        0.013                 0.010                      14
  5        0.009                 0.005                      10
  6        0.004                 0.002                       5
  7        0.002                 0.001                       2
  8        0.001                 0.000                       1
  9        0.000                 0.000                       0
  10       0.000                 0.000                       0
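
For reference, the lift column above can be reproduced with a generic decile-lift calculation like the one below, where lift is the decile’s contribution rate divided by the overall rate, indexed to 100. The probs and labels reuse the previous sketch.

```python
import pandas as pd

def decile_lift(scores, actual) -> pd.Series:
    df = pd.DataFrame({"score": scores, "actual": actual})
    # Rank first so ties don't collapse decile boundaries; label 9 = highest scores.
    df["decile"] = pd.qcut(df["score"].rank(method="first"), 10, labels=False)
    # Lift = decile contribution rate / overall contribution rate, indexed to 100.
    lift = df.groupby("decile")["actual"].mean() / df["actual"].mean() * 100
    return lift.sort_index(ascending=False).round()  # top decile first (~805 above)

print(decile_lift(probs, y.iloc[test_idx].to_numpy()))
```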

MOST VALUABLE PREDICTORS

While it’s difficult to measure variable importance in deep learning models, we can use the output of our VSURF runs to estimate which variables have the most predictive power. The most significant variables are described below, grouped by category; a sketch of one generic way to cross-check these rankings follows the two lists.

Voter traits

  • Turnout in past elections

  • History of political giving

  • History of charitable giving

  • Household income

  • Party affiliation

  • Neighborhood population density

  • Race and ethnicity

  • Educational attainment

  • Issue stances

Campaign traits

  • Recent fundraising trendline

  • Volume of media coverage

  • Election stage

  • Other offices on the ballot

  • Demographics of contributors

  • Demographics of candidates

  • Candidate issue stances

  • Generic congressional ballot polling

  • Distance from Election Day
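
As a rough cross-check on those VSURF-based rankings (VSURF itself runs in R), a generic permutation test on the held-out data can estimate each feature’s influence on the trained network: shuffle one column at a time and measure how far AUC falls. This sketch reuses the hypothetical objects from the earlier examples.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_importance(model, X, y, seed=0):
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y, model.predict(X).ravel())
    drops = {}
    for col in X.columns:
        shuffled = X.copy()
        # Shuffling one feature breaks its relationship with the outcome;
        # the resulting AUC drop approximates that feature's importance.
        shuffled[col] = rng.permutation(shuffled[col].to_numpy())
        drops[col] = baseline - roc_auc_score(y, model.predict(shuffled).ravel())
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
```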
