As crowdsourced cybersecurity programs operate over time, one of the most common effects of ‘program aging’ is a general reduction in the volume of submissions. Roughly analogous to the aging of anything (bodies, cars, etc.), a crowdsourced cybersecurity program requires adjusting expectations over time (e.g. one cannot eat as much pizza with no ill effects at 40 as one could at 17), as well as engaged maintenance to ensure the health and longevity of the program. Continued output requires continued input; it should come as no surprise that programs where the owner isn’t engaged quickly lose the engagement of researchers testing on that program – and with it, the number of findings. Today we’ll be looking to address the perennial question of (a) understanding why a program may not be getting submissions, and (b) what to do about it, in a methodical, data-driven manner.
It’s important to call out early that it’s untenable and unreasonable to expect the same persistent output over time as when a program first starts. Without a doubt (and as anyone who has run such a program can attest), the first few weeks-to-months of a bug bounty program can be chaotic, with the number of findings often close to overwhelming. Up front, the entire attack surface is laid bare, and researchers often readily identify scores of vulnerabilities during this period (depending on the size of the attack surface). After a period of time, this unrelenting storm calms and reverts to a slower trickle of findings – for two dominant reasons: (1) the low-hanging fruit has largely been picked, making it harder to identify net-new issues (whether the result of remediation, better coding practices, or simply the rest of the crowd identifying the available low-hanging issues), and consequently (2) many of the researchers from that initial storm have moved on to the next program. The specifics behind these two points will be discussed in more detail as we go forward, and will be recurring themes throughout this series, particularly in terms of understanding researchers and their motivations. If the goal is to get researchers to participate on a given program (i.e. yours), then understanding the motivations of that group and appealing to them is core to creating the desired level of participation.
The fundamentally important point from the two reasons above is that researchers optimize for ROI. This is not rocket science, or even social science – it’s more or less just a function of being human. As an individual, you’re most likely to put time and effort into whatever you believe will lead to the best possible outcome. If I believe that spending one hour on a certain endeavor will earn me X dollars, and spending that same hour on another endeavor will earn 2-3X dollars, then where that hour goes can be predicted accurately 98% of the time by a well-trained parrot. We’ll discuss later in this series how to swing the ROI equation back in our favor (particularly since there are many different ways to influence the notion of relative value – which can sometimes even be driven by altruism, relationships, and so on), but this should serve as a basic and fundamental working principle in everything we do going forward: Pp (participation probability, comparative to any other opportunity) = rV (relative value) / Ti (time invested). We’ll expand on this as we go, but it’s a good starting point for now.
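As a quick illustration, the working principle above can be sketched in a few lines of Python. The dollar amounts, hours, and program names below are invented for the example; the only thing taken from the text is the Pp = rV / Ti relationship itself.

```python
# A toy sketch of Pp = rV / Ti (participation probability as relative value
# per unit of time invested). All figures below are hypothetical.

def participation_score(relative_value: float, time_invested_hours: float) -> float:
    """Pp = rV / Ti: expected relative value per hour of effort."""
    return relative_value / time_invested_hours

# Two competing opportunities for the same block of a researcher's time.
program_a = participation_score(relative_value=750, time_invested_hours=3)   # 250.0 $/hr
program_b = participation_score(relative_value=1500, time_invested_hours=3)  # 500.0 $/hr

# All else being equal, effort flows to the higher-Pp opportunity.
better = "program_b" if program_b > program_a else "program_a"
print(better)  # program_b
```

Everything that follows in this series amounts to moving one of those two inputs – raising rV, or lowering Ti – so that your program wins this comparison more often.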
In the interest of not making this a monolithic ten-page treatise, we’ll be breaking this up into a two-part series that addresses the top reasons why a program experiences diminishing output, and what a program owner can do to impact that outcome. Notably, we’ve already touched on the first reason: the number of findable issues across the available attack surface has been substantially reduced, thereby altering the ROI appeal of the given program.
For the same reason that new elements and solar-system planets aren’t being discovered at the rate they were during their respective heydays, once the initial findings have been identified on a program, subsequent findings face a much steeper ROI equation. Where before one could invest three hours to find $750 worth of findings (say, one reflected XSS), it may now take 3-5x as long to make the same amount of money on a given program – if such issues even still exist (maybe they’ve all been picked clean, or better still, remediated by the program owner). Compared to other opportunities, this may no longer be the best (most Pp-optimized) bet, and so the tester(s) move on.
All is not lost. There are some extremely easy ways to address this predictable change in motivation. First, it’s important to note that motivation isn’t constant across all testers: userA’s Pp threshold may be much lower than userB’s. For instance, if userB has access to more programs, they may have opportunities that userA doesn’t – in which case userA may stay engaged for longer while userB moves on more quickly. Or perhaps userA is a full-time hunter, meaning they simply have more hours to spare and thereby don’t require as aggressive a Pp threshold. Ultimately, a program’s Pp may fall below the threshold a given tester requires to keep participating – a threshold that varies by demographics, general interests, and other factors.
With this in mind, the first thing to do once participation dies down is to increase the sample size. Variance increases drastically with a smaller sample size – a single violent crime in an otherwise sleepy small town could cause a double-digit spike in its crime statistics that wouldn’t register in a larger metropolitan area. In the same way, if only five people are invited to participate in a program, the odds are low of getting an accurate, average distribution of Pp thresholds that reflects the totality of the crowd. Unless we’re absolutely limited to that small a sample, we don’t want to make reward adjustments (the most common form of rV) until we’re confident that we have a more thorough distribution of Pp thresholds. For instance, with only five participants, if the rV appears to need a 5x multiplier in order to maintain the desired Pp, it could be naturally assumed that increasing rewards by 5x will re-engage those testers. However, the rewards may not actually need to change at all: those five testers may simply happen to have high Pp thresholds, while the broader population (especially those with low Pp thresholds) may find the program perfectly desirable as-is.
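To make the sample-size point concrete, here’s a small simulation – all distributions and numbers are invented for illustration. The average Pp threshold observed across five invited testers can land well away from the population average, while a crowd in the low hundreds tracks it closely.

```python
# Toy simulation of sampling Pp thresholds. The lognormal shape and its
# parameters are arbitrary illustrative choices, not measured data.
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Pretend each member of the crowd has a personal Pp threshold ($/hr).
population = [random.lognormvariate(5.0, 0.6) for _ in range(100_000)]
true_mean = statistics.mean(population)

small_sample = random.sample(population, 5)    # a five-invite private program
large_sample = random.sample(population, 300)  # a crowd in the low hundreds

print(f"population mean threshold: {true_mean:.0f}")
print(f"mean of 5-tester sample:   {statistics.mean(small_sample):.0f}")
print(f"mean of 300-tester sample: {statistics.mean(large_sample):.0f}")
```

Re-running with different seeds shows the five-tester estimate swinging wildly while the 300-tester estimate stays close to the population mean – which is why reward adjustments based on a handful of quiet testers are premature.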
The exact number of testers needed to validate against the average can vary, but as a general rule, going public is the only surefire way to validate against the whole population. Going public means exposing your program to the entire security community by displaying it on bugcrowd.com/programs – here, anyone (not just a private crowd) can see your program and engage on the scope; this is the truest and most effective way of engaging the crowd as a whole. It’s also worth emphasizing that going public is the most realistic simulation of how nefarious parties would approach the targets in the wild – one of the most valuable and irreplicable value drivers of running a crowdsourced security program. If going public isn’t possible in the near term, a crowd size in the low hundreds is usually sufficient to get a perspective on the private-crowd Pp threshold (which is different from the general-population Pp threshold, which is different from the high-value Pp threshold, and so on).
This takes us to our next important point: high-value testers have high Pp thresholds. It can be tempting (and often misleading) to see general participation (notably by those with low Pp thresholds) and believe the program appeals to all, or at least to the average. What’s lost here is that high-value testers command a much higher price than average or low-value testers – much like a master builder commands more than a handyman found on Craigslist. We won’t spend too much time on this matter, except to make sure we remember that not all participation is equal. Depending on the desired output (attracting top talent, medium talent, or some talent), the incentivization will predictably need to adapt; accordingly, we also need to extend our equation as follows: Pp = T (talent) * (rV / Ti).
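Plugging some numbers into the extended equation makes the point visible. The talent weights and per-tier Pp thresholds below are invented purely for illustration; only the Pp = T * (rV / Ti) relationship comes from the text.

```python
# A sketch of the extended equation Pp = T * (rV / Ti), with the assumption
# (from the text) that high-value testers demand a higher Pp before engaging.
# All weights and thresholds are hypothetical.

def pp(talent: float, rv: float, ti_hours: float) -> float:
    """Pp = T * (rV / Ti)."""
    return talent * (rv / ti_hours)

TIERS = {
    # tier: (talent weight T, minimum Pp required to participate)
    "entry":      (0.5, 50.0),
    "average":    (1.0, 150.0),
    "high-value": (2.0, 600.0),
}

def who_engages(rv: float, ti_hours: float) -> list[str]:
    """Which tiers clear their own participation bar at this reward level."""
    return [tier for tier, (talent, threshold) in TIERS.items()
            if pp(talent, rv, ti_hours) >= threshold]

print(who_engages(rv=750, ti_hours=5))   # entry and average clear their bars
print(who_engages(rv=2500, ti_hours=5))  # high-value testers now engage too
```

Notice that a reward level which comfortably engages the entry and average tiers can still leave the high-value tier entirely on the sidelines – general participation is not proof that the program appeals to top talent.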
Provided we’ve taken care of getting at least some quality private-crowd exposure (thereby ensuring a wide array of possible Pp thresholds in the invited cohort – which is to say, the Pp distribution is stable), the next step is to work the other side of the equation: rV or Ti. The recommended approach is to start with rV.
The simplest way to approach rV is to set a desired output, and then continuously adjust rV to make sure that output is achieved. Keep in mind that rV, while most commonly dollars, is not always dollars – there are many ways a researcher can extract value from participation. The value is also relative for the same reason: a piece of swag may be of high value to one researcher and mean nothing to another. If our goal is to get X P3 vulnerabilities per month (once more, recognizing that the output achieved at the start will likely never be re-achieved, and targeting that rate of continuous output is unrealistic – akin to running a vehicle at redline nonstop), then all we need to do is keep increasing rV until we hit that output (again, provided Pp is already stable with a large enough sample size). And when output drops again, we increase rV once more until we hit the desired point. As you can see, this is a highly engaged and iterative process for which there is no shortcut – being actively involved in the success of one’s program is absolutely critical to, well, having a successful program. Failing to adjust rV is a common and fatal mistake we often see in the wild, one that invariably results in a self-fulfilling prophecy of lower and lower engagement – an outcome akin to the common meme “we’ve tried nothing, and we’re all out of ideas”.
In general, the rule is to measure outcomes across 30-day spreads. If the goal is to get at least one P1 every 30 days, and we’ve got a satisfactory Pp distribution, then at the 30-day mark from the last P1 it becomes time to increase the rV for P1 findings by at least 25%. If, after another 30 days, that hasn’t moved the needle, then it’s time for another iteration, and so on until the desired effect has been reached – or until the organization hits the maximum it would pay to prevent an issue being exploited in the wild. To arrive at that number, the organization simply needs to ask itself “how much would I pay to prevent a breach?” Short of seven-figure payouts, this number should almost never be something as menial as $5k; stopping at such a low ceiling means that as soon as the ROI for a security researcher drops below that threshold, they’ll stop hunting. Meanwhile, an attacker’s upside remains far larger than $5k in most situations, allowing them to keep hunting well beyond the point where the ethical hunters stop – which is why incentivization should be an always-growing function.
It’s worth calling out that the above rule of increasing every 30 days quickly becomes exponential. For instance, $2,500 would become $3,125, then roughly $3,900, then roughly $4,900; six months from the first adjustment it would sit near $9,500, and by the 12-month mark it would balloon past $30k. To prevent this, after the first three adjustments at the 30-day mark (which, done in 25% increments, represent a rough doubling of rewards), adjustments should happen every 60 days going forward. Further, not to go too far into the weeds, but if a program goes a full year without any findings at a given priority level, it’s worth critically evaluating the scope to make sure it’s sane, and then making adjustments every 90 days.
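The cadence above can be sketched as a simple schedule. The function below follows the 30/30/30-then-every-60-days rule with 25% raises; the one-year horizon, the rounding, and the assumption that the target output is never hit along the way are mine.

```python
# Sketch of the tempered adjustment cadence: 25% raises at the 30-day mark for
# the first three adjustments, then every 60 days. Assumes (pessimistically)
# that the desired output is never reached, so every scheduled raise fires.

def reward_schedule(start: float, days: int = 365, raise_pct: float = 0.25) -> list[tuple[int, float]]:
    """Return (day, reward) pairs over the horizon."""
    schedule = [(0, start)]
    reward, day, raises = start, 0, 0
    while True:
        day += 30 if raises < 3 else 60  # slow down after the third raise
        if day > days:
            break
        reward *= 1 + raise_pct
        raises += 1
        schedule.append((day, round(reward)))
    return schedule

for day, reward in reward_schedule(2500):
    print(f"day {day:3d}: ${reward:,}")
```

Under this tempered schedule the $2,500 reward lands near $12k at the one-year mark, rather than the $30k+ that uninterrupted monthly compounding would produce.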
This should be adjusted at each priority level independently of the others – e.g. if there are plenty of P4s rolling in, there’s no need to increase the rV for P4s until they dip below their desired threshold. And while thresholds can be (and often are) higher than one, as a rule, if there have been no findings of a given nature over 30 consecutive days, it’s time to raise the rV for that type of finding, regardless of any other thresholds. An important corollary: focusing exclusively on critical findings is typically detrimental to the whole.
It can be tempting to only care about critical findings and discard the rest. However, this may be yet another contributing factor in why a program sees middling engagement. Going back to our equation for Pp: if one removes P2/P3/P4 from the menu, they’re implicitly increasing the time it takes to find a rewardable issue (Ti) while also removing three quarters (or more) of the potential rV. Say, for the sake of argument, it takes twenty hours on average to find a P1, and five hours to find a P3. If the payout for the P1 is $5k, the total (potential) rV for those 20 hours is $5k. Now, if a P3 is rewarded at $750 and four are also identified during the course of testing for the P1, the total rV becomes $8k – a much more enticing Pp value when divided by the time invested. And all of this ignores the fact that the tester may never find a P1 in the first place! So, when creating programs, while it may be tempting not to care about high/medium/low vulnerabilities, (a) it’s important to remember that they are still critically important to be aware of and remediate; and (b) removing rewards for non-critical findings will most certainly reduce overall engagement – including, paradoxically, the number of critical findings identified.
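Running the paragraph’s own numbers through the earlier Pp = rV / Ti lens makes the gap obvious. The dollar figures and hours come from the text; the helper name is mine.

```python
# Per-hour ROI for a criticals-only program versus one that also rewards
# lower-severity findings, using the example figures from the text above.

def hourly_roi(total_rv: float, hours: float) -> float:
    """rV / Ti, expressed as dollars of potential reward per hour."""
    return total_rv / hours

# 20 hours of hunting, one P1 at $5,000, nothing else rewarded.
p1_only = hourly_roi(total_rv=5_000, hours=20)

# Same 20 hours, but four P3s at $750 each are also rewarded along the way.
p1_plus_p3s = hourly_roi(total_rv=5_000 + 4 * 750, hours=20)

print(p1_only)      # 250.0
print(p1_plus_p3s)  # 400.0
```

That’s a 60% better effective hourly rate simply from keeping the lower severities on the menu – before even accounting for the risk that the P1 is never found at all.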
To recap, the most common reason as to why a program may not be getting submissions can be summarized as:
The result of diminished ROI for researchers, as a function of fewer findable, net-new issues across the attack surface relative to alternative opportunities (either because the target has genuinely become more secure through knowing about and fixing the issues, or, more simply and more likely, because the majority of easier-to-find issues have already been identified, though not necessarily fixed).
For which, the following are recommended first steps in addressing:
- Review the sample size of testers, as it may not yet be large enough to engage an even distribution of participating researchers. The ideal actualization of this is to bring the program public, which will allow for the entire distribution of the crowd to become engaged (provided the program meets their particular Pp threshold).
- Review the relative value (rV) provided by the program, as it may not be high enough to encourage researchers to participate (or, more importantly, the right researchers). This needs to be revisited regularly – every 30 days – to ensure that the target output is attained (moving to every 60 days after three consecutive raises of at least 25%).
- Ensure that the program is not artificially lowering Pp by creating a deflated potential rV (such as only accepting findings of certain types). Limiting to critical issues can further diminish ROI, and paradoxically result in fewer total (and critical) issues.
In future blogs we’ll touch on topics such as matching incentives to the target (type, access, etc), increasing exposure, creative ways to increase relative value (rV), and owner engagement/relationships as they relate to their program… looking forward to it!