In this session from DevRelCon Earth 2020, Melissa and Adam discuss concrete examples of how to measure: 1) evangelist/advocate engagement with developers 2) composite measures of Open Source project health and 3) program impact estimations using synthetic control groups. The material is based on real world examples from developer relations programs at HashiCorp and Amazon Web Services.
Speaker 1: Thanks for having us. I really appreciate it. Looking forward to this session. Lots for us to talk about. So let me just start with a little bit of what we're gonna talk about: a little bit about who we are, then Melissa's gonna talk about developer advocate engagement scores.
And Melissa's also gonna give a little bit of perspective about HashiCorp and do a little bit of thinking about open source project health. And then I'm gonna talk about the subject dear to my heart, which is how to measure your program impact in DevRel. So first of all, I'm Adam Fitzgerald. I've been doing developer relations for most of the last fifteen years at various companies: BEA Systems, SpringSource, VMware, Pivotal, AWS for six years, and then the last year at HashiCorp.
So I've been spending a little bit of time thinking about these things, and, hopefully, there's a few things I've learned along the way. Still lots more to learn, of course. Melissa, do you want to give everybody an introduction to yourself?
Speaker 2: Sure. Hello. I'm Melissa Gurney Green. I am the director of community development here at HashiCorp, and I run the developer advocate team as well as community programs.
Speaker 1: Super. And for those of you that... hopefully, everybody knows HashiCorp. But if you don't, we're the company behind some of the most useful cloud tools for practitioners, operators, and developers, whether you're using Vagrant or Packer, using Terraform to provision and operate your cloud across multiple providers, using Vault to secure your cloud, Consul to connect and build a service mesh across multiple providers, or Nomad to deploy, run, and operate your applications across multiple environments.
We build great tools to help practitioners do great things with their cloud environments. We believe the future is all about multi-cloud, and that's what our tools are focused on: helping people conquer those challenges. Alright. Melissa, why don't you talk to the folks about Developer Advocate engagement scores?
Speaker 2: Happy to. I'd like to kinda start off with a bit of a question. We can maybe do a virtual hand raise here. How many of you have been in the situation where there was something you were super passionate about, but perhaps it was cut because someone somewhere in the decision-making process didn't fully understand it? Right?
That's why I'm here. That happened to me when I was back in the infrastructure space. But it's something that became increasingly important over my career, so it's something I'm super excited to get into with all of you. This year, we launched a brand new measurement for our developer advocate team. Where we started was really this notion of here are all the things we're doing.
Look at this list. Isn't it awesome? But it was a whole lot of data and not a lot of context around that data, not a lot of information. So we switched to this notion of impact hours. What impact hours do is take into account the number of attendees or viewers of a specific piece of content, and then apply a weight based on the potential impact of that content.
So you have direct meetings, you have live talks, you've got digital content, blogs, tools, different things that the developer advocates create. And with each of those, there's an impact associated with it, based on the user's ability to act and engage and really get in deep with our tools. So for things like direct meetings, you're able to really focus in on that individual's workflow.
It's not really easy to scale, but it's worth a lot in terms of weight because of your ability to tailor the conversation to that person's actual needs and their ability to act. Live talks, by contrast, are a little more of a general conversation; there's an ability to engage there and follow up and talk to people afterwards, so they get a little less of a weight. And then things like blogs have a much wider reach than most live talks, but they're also a little more general, a bigger audience. So these are the kinds of things that we think about when it comes to impact hours.
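A minimal sketch of what that impact-hours arithmetic might look like in code. The channel weights, column names, and data here are illustrative assumptions: the talk only describes the relative ordering (direct meetings weigh more than live talks, which weigh more than blogs), not the actual values HashiCorp uses.

```r
# Illustrative impact-hours calculation. Channel weights are hypothetical:
# the talk describes their relative ordering, not the real values.
library(tibble)
library(dplyr)

channel_weights <- c(direct_meeting = 5, live_talk = 3, livestream = 2, blog = 1)

activities <- tribble(
  ~advocate, ~channel,          ~audience,
  "alice",   "direct_meeting",         12,
  "alice",   "blog",                 3400,
  "bob",     "live_talk",             250,
  "bob",     "livestream",            800
)

impact <- activities %>%
  mutate(impact_hours = audience * channel_weights[channel]) %>%
  group_by(advocate) %>%
  summarise(total_impact_hours = sum(impact_hours), .groups = "drop")

print(impact)
```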
And with this, there are different ways we track that. One of them is engagement by channel. Really, this is focusing on what we're doing, how we're doing it, and the amount of response or impact to the things that we're doing. As you can see here, this is our last five months of impact scoring. There are a couple of things in here that started off kind of steady, for example, our HUGs, the HashiCorp User Groups, which are in yellow.
And really with that, it dropped off in March and April due to the pandemic. So there are things that we engaged steadily with and then had to refactor around. And the same is true for that green section, which is livestreams. Right? We started in February experimenting with this notion of livestreams: will people really join us if we do all this live and kind of hack around, and what does that look like?
And as we've noticed through this time of us all being off travel and more focused on creating digital content, it's picked up quite a bit, both in the amount and frequency of what we're doing and in the impact of those livestreams. So that's been something we can talk about, look at, and measure. We can also look at things like that bottom line for blogs.
Speaker 1: Sorry.
Speaker 2: And oh, you're fine. The bottom line for blogs, where we can say, oh, in April, our blog score went to zero. Why is that? As it turns out, there's a lot of effort that goes into writing blogs, and it wasn't really weighted as well as it should have been. So we adjusted the weight, and now blogs are more frequent.
Another way we like to look at this is engagement by product, and this does an amazing thing for us: we can see the impact we're having per tool and have conversations with the product and engineering teams and product marketing about our strategy, and about what the impact of that strategy actually is in driving influence around adoption of a workflow related to one of our tools. As you can see here, there are dips in February and June. This is intentional. We have two digital events, one in each month. The first one is HashiTalks in February, where we cover all the tools.
So instead of trying to pick apart each tool, we tend to classify that as a stack event. And the same is true for our conference HashiConf, which just happened this June, and that measure is shown there. Another thing this will show us is, okay, what's the impact of adding a team member? We added a team member for Vault in May. They were getting ramped up and helping with HashiConf in June.
So in July and August, we expect that Vault number to increase in volume in this kind of picture. So what does this really mean as far as results? A lot of what we do around impact is really about inspiring people to take a chance and try something new, to change the way they think about things, to adopt a new workflow. And that's a lot of feeling and emotion. Right?
But we really needed a way to take that kind of value and translate it into something that we could have a fact-based conversation around. This kind of measurement system really gives us the ability to do that, and it happens in a couple of different ways. I touched on content strategy when we were talking about products. We can go in and talk to the tools team about our content for each tool and then break it down even further and say things like, well, we've been doing a lot of blogs and videos. What if we added more of this content?
What impact would that have? And have those discussions, as well as talking about our overall approach. But another thing it does is give us the ability to define lanes for each level of developer advocate. So we can take a junior advocate all the way up to a principal advocate and get an idea of what the range of impact scores is at each level. And then say, okay.
Well, you're performing at this level. What's needed to get to the next level? And have those discussions and talk about that. And what we see over time is that each developer advocate's impact score grows into the level they should be at. So we look at that.
Another thing that's super important for this discussion, and one thing that's helped me monumentally, is head count discussions: how to talk about the capability of the team and what the impact of adding new team members will be on the things we care about the most. It's easy for me to go in and say, hey, we need two or three new team members, the team is swamped. But it's a lot harder when you don't have the data to back that up and the ability to say, hey:
I've got this challenge I need to solve. We wanna make a bigger impact with this tool or with this set of tools. How do we do that? Okay. Well, I know if I had a developer advocate here, I'm gonna get this much impact after they ramp up.
And that's measurable and calculable, and it's much easier to have that discussion with leadership than to have the "I just need more people because we're overwhelmed" conversation. Right? So then we have measuring developer advocates down. We're obviously still learning a lot, but does it matter if our open source projects are not healthy? And the answer is no.
So we also have to look at that, because without a community, there's no one to talk to. With this, we've got different health indicators that we think about. Of course, you have your traditional hero numbers: downloads and stars and forks, and those are all great and fine. They're hard to gather, though, especially downloads, because there are so many places people can download our tools, and that's by design. We want it to be easy for people to access our tools and that information.
But, for us, the real heroes have been issues and pull requests. And why is that? So issues and pull requests really tell a story of engagement. They can tell us who's using the tool. They can tell us how they're using the tool.
They can tell us what else they're using with it, so we can drive integration features with different tools and other things in the ecosystem. It also tells us a little bit about what their experience is like. It's easy to go out into the Twitterverse and find comments on both sides of your argument as far as what people think about what you do. But how much does that matter if they're not actually using the tools themselves? And the truth is it's hard to measure that from Twitter.
So how do we do it? We really look at these issues and pull requests and upvotes and things like that to determine what the experience is like and how we can make it better for the people that actually use our tools, which is super exciting to me. And one thing this has really given us is the ability to give the product and engineering teams a measurable roadmap from the community perspective. We can say: these are the top 10 things that the community really wants, that would make a difference, that would impact our tools in a way that is measurable, because we can say it has this many upvotes.
It's got this many comments. This many people have approached us at conferences about it. And it's much easier to have that conversation than "this is our top three list because we talked to a bunch of people." So it really helps those conversations move along. And with that, back to you, Adam.
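As an aside on that last point, here is a minimal sketch of how such a community-driven ranking might be assembled, assuming you've already exported issue data (for example from the GitHub API) into a data frame. The column names, the sample rows, and the demand-score weighting are all hypothetical.

```r
# Rank open issues into a "top 10 the community wants" list.
# The issues data frame is assumed to be an export of issue data;
# the 0.5 weight on comments is arbitrary.
library(tibble)
library(dplyr)

issues <- tribble(
  ~repo,        ~issue_id, ~upvotes, ~comments,
  "terraform",      10123,      412,        58,
  "vault",           8891,      267,        31,
  "consul",          7450,       98,        12
)

roadmap <- issues %>%
  mutate(demand_score = upvotes + 0.5 * comments) %>%
  arrange(desc(demand_score)) %>%
  slice_head(n = 10)

print(roadmap)
```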
Speaker 1: Alright. Super. Thanks, Melissa. Alright. So I wanna talk a little bit about something that's dear to my heart, and that is not just metrics in general, but specifically measuring things that you do and what kind of impact they have on the business.
You know, it's a common question that gets asked of DevRel. DevRel typically sits in between a whole bunch of different areas inside the organization, and everybody's like, what have you done for me, and what have you actually driven for the business? So let's talk about a couple of the different experiences I've had and a technique that I think is pretty useful and could be widely applicable for a lot of people, but isn't well known. Let me go ahead and share the obligatory Dilbert slide about metrics and remind you that, while this is great to laugh at, in my experience, when you get to any kind of serious organization, there are gonna be inspections on why you're running your team or organizing your responsibilities the way you are, and you need a defensible way to go make arguments.
And although there might be politics involved, you're much better prepared if you can come with some real, accurate data about how your programs are working. As an inspiration for this, I wanted to look at some other areas. When we think about what this looks like from a medical setting or a political science setting, the gold standard for understanding whether my program or product or drug or treatment made a difference is the randomized controlled trial, and in the best of situations, the double-blinded randomized controlled trial. And what does this mean? Well, it means that you take an individual that's gonna receive some kind of treatment and you place them at random into one of two groups: either the treatment group, which we have on the bottom here, or a control group on the top.
The randomization is important because it helps identify and remove selection bias, and it means the assignment is blind. The control part is super important because it identifies whether the treatment is successful or not. In this situation, we're giving the people in the treatment group at the bottom something; it might look like a database icon, but maybe it's a treatment to turn them into cats or something. The top group is receiving some kind of placebo or non-action: they're aware they're receiving a treatment, but the treatment actually has no effect associated with it.
It's a saline solution or a placebo pill, for example. Then what you do is take a look and say, look at the treatment group, let's work out which things actually had an effect. Here we can see there's a whole slew of the population in the treatment group that winds up turning into cats, and in the general population there's a much smaller rate of cats being produced through the placebo: one in 36 in the top part of the chart, and eight in 18 in the bottom chart.
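The effect ratio behind that comparison works out as follows:

```latex
\frac{\text{treated rate}}{\text{control rate}}
  = \frac{8/18}{1/36}
  = \frac{8 \times 36}{18}
  = 16
```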
Then I can make a comparison between my treatment group and my control group, and I can understand that we got 16 times as many cats in the treatment group as we would have got ordinarily in the control group. So my assumption here, and a pretty reasonable assumption, is that my intervention, my treatment, the thing that I'm testing, this database symbol down here at the bottom, is actually producing a 16 times more effective rate at getting the outcome we're interested in, which is maybe turning people into cats. But what does this mean for developer relations? What does a randomized controlled trial look like for developer relations? Well, you could think about randomized assignment of your individuals into a treatment group and a control group.
That's what we do when we A/B test web experiences, for example. But more generally, it's very, very hard to do. Usually, you're putting people into programs because (a) they're your most active community members, (b) they're your most important customers, or (c) they're in a market that you really care about. So the selection bias in program engagement is actually pretty real, and it's hard to measure against that selection bias. Additionally, when you provide a treatment in DevRel, what are you going to do?
Are you going to give one person a piece of software that works and another person a piece of software that doesn't work? Are you going to return null results from your API calls for some of your customers and not for other customers? It's really hard to make that kind of assessment of what a treatment looks like. But if we abstract away from that and think about it less like a drug trial and more like, I've got a collection of people that we performed some collection of actions for and another collection of people that we didn't perform those actions for, then it becomes a little bit clearer. So maybe the people in the treatment group are people that went to your developer conference, or they're part of your developer advisory panel, or they're people that you wound up doing site visits with to identify and solve their problems. Then you can start thinking about these groups as different.
One is the treatment group, and the other is the collection of people who didn't receive that treatment. Not necessarily a control group per se; they didn't receive a placebo, but they didn't receive the treatment either. And then you can go ahead and measure and say, okay, the people in the treatment group, did they turn into uber-developers? Did they become Octocats instead of just cats?
And did those people wind up doing so at a rate that was higher than the rate of the people that didn't go through that program? Then you can measure the difference between those two, and that can give you some sense of whether there's an impact there. Now, the truth is, this is a little bit hard to do in DevRel, and it's a little bit hard to do outside of clinical drug trials, but there have been techniques developed to allow people to do this. They largely come from political science. There's a technique called the synthetic control method that was developed in comparative politics by a collection of economists and political scientists.
These were used initially in 2014 to do comparisons about the effects of policy interventions in places in America. The canonical example is the effect of California's tobacco control program. But the technique is actually generalizable and very useful, so I'm gonna walk through it in very simple terms. The idea here is that you're not looking to compare each individual in your treatment group to the total population in the control group.
Instead, for each person in your treatment group, you're looking for a collection of individuals in your non-treated population whose linear combination makes them look like the treated individual prior to the treatment. This is usually most effective when you're using some kind of time series data: for smoking, it's some measure like the rate of smoking, but in developer relations it's usually use of the platform, so platform utilization or, more commonly, billing, like how much they bill on the platform weekly over a particular period of time. What you're looking to do is build a collection of non-treated individuals who largely map to the same characteristics, prior to the treatment happening, as the treated individual you're interested in. And you can do that for one individual, and then you do it for the next individual.
They might have a different collection of non-treated people that can be built into a linear combination to produce a close match to their pretreatment behavior. And then you can apply the treatment and find out what the results are. Now, one piece of mathematics here I feel like I have to describe, and that is: how do you find this linear combination of non-treated individuals that matches up with the treated individual? What you're actually looking for is this vector w, the w star in this summation notation. You're looking for the vector that minimizes this sum, over the pretreatment time period, of the difference between the treated individual's behavior across that time series and the linear combination, with those weights, of the group in the non-treated population.
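In the standard synthetic control notation (a reconstruction, since the slide itself isn't reproduced here), with Y_{1t} the treated unit's outcome at time t, Y_{jt} the outcomes of the J untreated units, and T_0 the last pre-treatment period, the weights being described are:

```latex
W^{*} = \arg\min_{W}\;
  \sum_{t=1}^{T_0} \left( Y_{1t} - \sum_{j=2}^{J+1} w_j\, Y_{jt} \right)^{2}
\quad\text{subject to}\quad
  w_j \ge 0, \qquad \sum_{j=2}^{J+1} w_j = 1 .
```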
And under certain regularity conditions, which aren't too hard to meet, this set of weights exists and can be used. So you can find weights that minimize this summed difference, and that means you've got a closely matched linear combination. Now, I've simplified it greatly in the example I'm showing you. There are usually many, many more control points than just three or four, but the weights quickly drop off when there's a high degree of regularity to the way this information is distributed. There are some statistical caveats here about finding matching vectors inside the convex hull of the non-treated individuals and about the covariate factors, which you can ignore for now or read about in the research if you're really interested.
So what does this actually mean? It means that I can take these treated individuals and look at their synthetic controls, so both the blue treated individual and the red treated individual, and compare them to their synthetic controls. We see the blue individual didn't become an uber-developer, but the red individual did. And so we can measure some kind of difference in impact on the thing we're measuring after the treatment period has happened. Is it that they're using the platform more, or that they're billing more on the platform?
Some kind of time series data that identifies that there's a difference there. Now you might be like, well, Adam, that's kinda complicated. Not only is there a bunch of math in there, but you're only doing it across two people. You had to do a whole bunch of work, and I gotta do it against a population of a thousand people that came to my conference. Well, that's great.
That's what programming and packages are really good for. Fortunately, there's been some work done by a political scientist at UC San Diego who generalized this into a generalized synthetic control method that takes into account some of the difficulties of using it in the original cases. And best of all for us, he created an R package that you can go ahead and use to do exactly those computations. And this is the really useful part: somebody's already done all the hard work for you.
All you gotta do is munge your data into the format that's necessary in order to get these results. So I built a sample dataset to show what this winds up actually looking like. Why don't we walk through that really quickly? In this fictitious example, you've got 50 customers, five of which were in a treatment group. Maybe you invited them to a developer advisory panel, and they came and you gave them the inside track on where your product's going, and it wound up producing more use.
You're expecting it to produce more usage out of them in the post-treatment period. You have telemetry data for the 20 time steps pretreatment, and we're gonna only look at the 10 time steps afterwards. Now that could be weekly billing data, weekly platform usage, or daily platform usage, for example. The idea is that you have data that lines up for that.
So let's take a look at what that data looks like. On the left-hand side of this chart, you can see the first 20 time periods, from zero to 20, and the utilization or billing rates for each one of the customers. Then at time 20, the intervention happens, so the developer advisory panel happens. And for the five people that went to those developer advisory panels, the ones marked in red, you can see that there's some kind of change or delta in the way their spend or platform engagement increases.
And across those 10 time periods post-intervention, you can say, well, what is the difference? It looks like we're getting more out of those customers than out of the average, but how can I actually tell? How can I actually know what's happening? Well, what you do is you find the synthetic controls for all of the individuals in the treated group, you take the average of those, and then you can graph them against one another. And this is what the gsynth package actually does.
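A minimal sketch of what that might look like with the gsynth R package, assuming the fictitious 50-customer panel has been arranged in long format with hypothetical column names (one row per customer per week, an outcome column, and a 0/1 treatment indicator that switches on at week 20 for the five treated customers). See the package documentation for the authoritative interface; the settings below mirror common usage rather than anything specific from the talk.

```r
# Fit a generalized synthetic control over the fictitious panel.
# `panel` is assumed to be a data frame with columns: customer, week,
# usage (weekly billing or utilization), and treated (0/1 indicator).
library(gsynth)

out <- gsynth(usage ~ treated,
              data      = panel,
              index     = c("customer", "week"),
              force     = "two-way",        # unit and time fixed effects
              CV        = TRUE,             # cross-validate the number of factors
              r         = c(0, 5),
              se        = TRUE,             # bootstrapped uncertainty estimates
              inference = "parametric",
              nboots    = 1000,
              parallel  = FALSE)

out$att.avg                          # ATT averaged over the post-treatment weeks
plot(out, type = "gap")              # treated minus synthetic control, with bands
plot(out, type = "counterfactual")   # treated average vs. its synthetic control
```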
You can see that in the pretreatment time period, before time period 20, these things tightly match against one another. There's very, very little variation, and that's by design: you are picking those weights w to minimize the difference between those two things. So your synthetic control and your treatment group are very, very closely matched there.
But post-treatment, in the period from 20 to 30 here, you can see there's a big divergence between what's happening for the treated group, which is in black, and the synthetic control that matches up against that treatment group, which is in blue. The difference between those two is the difference attributable to the intervention that was provided. And you can do this with a test of statistical significance and confidence. I've banded here around the blue line the 90 to 95% quantiles that identify what the significance of this is. And if you look at this treatment group in the post-intervention period on the right-hand side, you can see these black lines are breaking way out of the 95% confidence band around the blue line.
So we can be sure that the intervention was actually causal, as opposed to the difference happening by chance. It's a very, very high statistical standard to meet, and you can have very high confidence about what these results actually mean. And if we chart this back against all of the time series data for all of the treated and untreated individuals, it actually becomes even clearer. You can see the information jumping out: the average black line there is really describing the separation from the synthetic control, which is what would have happened if they were just like the rest of the normal untreated population, which looks much more like the average across the whole population. And so this gives you some kind of average treatment measure.
In this sample dataset, if you add up this difference across those ten weeks, you wind up with an additional 55 units per individual in the treatment group. So whether that's $55 or 55 additional units of operation on your platform, this gives you a concrete measure, with a very high degree of certainty, that the intervention was the cause of that change. This technique also removes a lot of selection bias, because you're finding matching synthetic controls by minimizing the difference in the pretreatment period rather than prejudicing the selection in any way. And it gives you a mechanism to measure the size of the impact relative to the overall operation. So it's actually a fantastic tool for doing the types of tests that help define impact.
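Continuing the sketch above, the cumulative per-customer figure can be backed out from the averaged treatment effect; the 55-unit number is the talk's sample result, not something this snippet guarantees.

```r
# Cumulative lift per treated customer over the 10 post-intervention weeks.
# out$att.avg is the ATT averaged across the post-treatment periods.
n_post_weeks <- 10
cumulative_lift <- out$att.avg * n_post_weeks
cumulative_lift   # ~55 units per treated customer in the talk's sample data
```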
This is a technique that my team used pretty extensively when I was at AWS. We used it to identify the business impact of the Startup Lofts in San Francisco, New York, Tokyo, Tel Aviv, London, and other locations, and we used it to build the funding model for that program. We also used it to evaluate the impact of the developer engagement programs we had. The developer days program showed more than a 50% increase in the likelihood that people become substantial users of the AWS platform. We also used it to show that the CIO engagement program had a multi-hundred-million-dollar positive impact on the AWS business.
We used it in enterprise strategy programs to show that annual customer spend increased dramatically when there was an enterprise strategy engagement at the executive level. So there are lots of different ways this can be used, and it's a great technique, using very common, openly available tools, to measure program impact. So just to recap, since we're at time, I'll give you the fast version. You gotta identify your treatment group.
You gotta identify what thing you're trying to measure, your impact measure. You gotta select criteria for your comparable control group members and identify the time windows over which your pretreatment and post-treatment periods will be compared. You compute the synthetic control for each treatment group member, and then you check the confidence parameters for those synthetic controls. Then you average the controls and the treatment populations, and you can measure the impact as the cumulative delta between the treatment group and the control group. That together gives you the total impact over that period.
There's a bunch of gotchas, which I can talk about or we can follow up on afterwards, that you've gotta watch out for. In particular, you have to pay attention to the confidence parameters and significance. Making sure you've got the right comparables is super important, as is making sure you've got the right ratios for the time windows and the right ratio of treatment group to non-treatment group. Those things are all nuances you gotta work through.
But it's hugely applicable to anything that has time series data, so I encourage you to think about using it as a way to measure impact for the programs that you're executing. With that, I'll finish. Thank you guys for listening to me rant and talk a little bit about mathematics and run through a little bit of science there. Hopefully, it wasn't too fast.
And happy to answer any questions, and thanks for paying attention.
Speaker 3: Thanks so much. I feel like it's just a requirement now that an equation has to get flashed up.
Speaker 2: Well, you know...
Speaker 1: Last time I did it, that was the most tweeted part of my talk. So I figured I gotta feed the tweet machine; I gotta put some kind of nice little mathematical formula
Speaker 2: on that. Yes. Yes.
Speaker 3: Alright. Let's go through the questions. I'll go chronologically. So, Melissa, starting with you. So some of the questions are around the weights.
So how do you take the bias out of that process? Like, how did you get to that? And you talked about making some adjustments later on. So my question to add to that was also: did you engage with users to decide on those weights as well?
So those are the questions.
Speaker 2: To start with, we kinda hashed it out together, Adam and I did. So we went through, we had a few meetings, and we were like, this is what I think this weight should be, based on the type of content being delivered, whether it's an in-person talk or a keynote or something more like a blog. Right? And we kind of stack ranked the content just based off of the type of delivery mechanism and the overall impact, because we obviously didn't want to rate something like a blog at the same level we would rate an in-person talk, since a blog has the ability to get a lot more reach.
We also broke out video content and other things, but we just took a stab at it and got something going. We sent it off to the team to look at and got their feedback. So the developer advocates dove in, put on their working hats, gave us a round of initial feedback on it, and then said, okay, I know how I'm gonna break this system, and then tried. So we had them trying to break the system, and we watched their behavior and how they were doing.
And we knew that we wanted to have a content strategy that was pretty diverse. Right? We wanna meet the practitioners where they are, and to do that, there are different styles of learning. So we knew we needed base levels for content and a minimum kind of content submission.
So when we noticed that some content started to fall off, I started asking the team questions like, why is this happening? And talking to Adam and adjusting those weights based on that discussion, as was the case with the blogs. The user community feedback has been mostly in the form of comments or posts on our forum or in other places, and the open source adoption signals we looked at too, where people are opening issues or pull requests related to something that we may have talked about.
Speaker 1: I'll say as well that it is possible to do a regression across historical information and use it to produce the weights. We did that for certain weights when we used a program similar to this at AWS. But the truth is, it's usually not that far from what your actual best guess is. And usually what happens is, as you build the weights or fine-tune the weights, you settle down to this consensus model of what the weights really wind up being, and they don't change that much after you've gone through a couple of iterations. So, really, you can do a whole bunch of numerical analysis to work it out.
And, yeah, we tried that when we did it at AWS, and in the end it wound up just matching what your gut was anyway. So a little bit of conscious thought about it is worth a lot more than the cycles of your data scientists trying to work out what the weights are.
Speaker 3: And I'll assume it goes without saying that those can change depending on your company, your product, your community, your maturity, or Yeah.
Speaker 1: I mean, I would expect each company to think about their weights kinda differently. Right? There are things that you value differently in different organizations, and there are parts that you want to encourage people to follow that are different depending on what the product is you're trying to put in front of them. So, you know, what my learning is is not the same as what your learning is gonna be, because I know my business.
I don't know your business.
Speaker 3: Yeah. Yeah. And a sort of follow-up again, Melissa, on that question. I don't know if you guys engage with Slack or similar where someone asks a question and you say, oh, check out this blog post.
It has the answer. Like, did you incorporate at all the follow-up of, hey, did you end up reading it? Was it the right post? Did you incorporate any of that?
Speaker 2: Our developer advocates sort of do that naturally. We focus them on building relationships with the community and doing things like follow-up. So we've done some of that. Most of it's via Twitter, because we don't have a Slack for the community, but it's either via Twitter or the forum for them.
Speaker 3: I guess I was asking, did that play back into your weights? I guess. Sorry.
Speaker 2: Not yet.
Speaker 1: I will also say, intentionally, we added a time decay to content production, so that you only get the value out of a piece of content over a ninety-day period from the date it's published. That kind of stops the, hey, I wrote the seminal article on something three years ago, and I'm still getting credit for it today. We want to make sure that the way you think about this has a time decay associated with it, so you're constantly using it to focus on what the impact of your next action is.
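A minimal sketch of that ninety-day credit window, assuming a hypothetical engagement_events data frame with publish dates, event dates, audience counts, and channel weights already attached.

```r
# Only credit engagement that happens within 90 days of the content's
# publish date. Column names on engagement_events are hypothetical.
library(dplyr)

scored <- engagement_events %>%
  mutate(days_since_publish = as.numeric(event_date - publish_date)) %>%
  filter(days_since_publish >= 0, days_since_publish <= 90) %>%
  group_by(advocate, content_id) %>%
  summarise(impact_hours = sum(audience * weight), .groups = "drop")
```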
Speaker 2: It also helps us keep our sanity from a tracking perspective. Yeah. If you have to track things from four years ago, you're gonna lose your mind even if it's automated. Yeah.
Speaker 3: So that actually relates to one of my questions. Was docs one of the content categories? I'm not sure if I caught that.
Speaker 2: Docs isn't formally listed, but, yes, we look at it.
Speaker 1: Yeah. And at HashiCorp, we actually have a docs and education group that's separate from the DAs. So the DAs aren't directly responsible for producing those things. They can do it, and it's treated similarly to how a blog post would be treated, but we have a different group that's responsible for docs.
And they've got a separate way the documentation is being measured. It's really web-metric based, as you can imagine.
Speaker 3: Mhmm. Interesting. Alright. So, Adam, for your part, we had some questions about... let me double-check here. Sorry.
How big does n need to be when you're doing... sorry. Maybe you can double...
Speaker 1: Is this in the Slack channel or where
Speaker 3: Yeah. AWS scale feels a lot larger than, say, HashiCorp scale. So but I assume that you're using the same system.
Speaker 1: So so
Speaker 3: I guess yeah. What does n need to be?
Speaker 2: Your question.
Speaker 1: So that's a really good question. You can do this with a range of population sizes. Like I said, it's the ratio between the treated population and the untreated population that matters, and also the ratio between the pre-treatment time period and the post-treatment time period. But the populations don't actually have to be big.
So, for example, when we applied this at AWS, some of these populations we're talking about, like the CIO briefings, were on the order of a hundred briefings in a year. And so you needed only several hundred comparable non-briefed CIOs in order to make a comparison. You didn't need tens of thousands in order to make statistically significant measures of impact. Okay? What you do need, however, is time series data that measures the activity or the output you're trying to measure.
And that's one of the things we're not at yet from the HashiCorp side. Right? We're still an open source company delivering open source solutions for lots and lots of people. Our platform-based versions of our tools are coming online right now. We have Terraform Cloud, for example, and we just announced last week HashiCorp cloud products for Consul that are in private beta right now.
We'll be adding Vault later this year. So the time series trajectory information for us is a little bit out from a company perspective, but at AWS, it was super obvious. We had utilization metrics and billing metrics as our yin and yang there. So it's very easy to do. In fact, the biggest problem at AWS wasn't the size of the n.
It was actually controlling the n. I mentioned that it's really important you find an appropriate set of comparables for your treatment group, so let me give you a really concrete example. The Startup Loft was fantastic: a free space where startups could go share Wi-Fi, get talks, and get help from AWS experts.
Really great experience. Totally free. But if you got people from Stripe or Uber that showed up and utilized something, their spending on AWS totally dwarfed everybody else's. What that meant in a statistical sense was that if you considered them part of the treatment group, you were unable to find a synthetic control for them within the convex hull of the untreated group. There were not enough comparables to find a match for them.
So, basically, what we did was cut off anybody that was in, like, a unicorn or a high-spending account, got rid of all those, and then measured the impact for everybody else. So it's really important that you understand what you're really comparing against and don't let extreme outliers influence what the measure is. Otherwise, you can get a false sense of impact or improvement.
Speaker 3: Thanks. I know we're a bit over, but there's a burning question that's been on a lot of our minds. Melissa, you shared about how your measurements here help to justify ROI for head count. And we've been in some conversations where sometimes you have to justify head count for, you know, one person. Like, what is the impact of this person over the next six months or what have you?
Adam, I know you have tons of experience. The question is, is it linear? Like, how do you explain that using this model, maybe even?
Speaker 2: It's level based. So if we hire a senior person, they're gonna have a bigger impact more immediately than if we hire a less senior person. Right? So it's really taking that impact-hour breakdown by level and applying a bit of an onboarding period to that. The onboarding period varies based on the level as well.
And then saying, okay, in six months, this is where we're going to be at with this focus area based on this hire, if we're able to hire somebody at this level. And kind of talking through that with management and leadership.
Speaker 1: Yeah. And I'd say one of the other things that's worth thinking about here, and it shows up when it comes to multi-year planning, is that there should be an expectation that your impact actually increases in the aggregate across the group year over year. That shows you've got improved efficiency in the team and in the way you're delivering and producing content. It's not just about ordinarily summing individuals; you wanna set targets that actually have an improvement piece in them.
And if you're building programs that increase operational efficiency, whether that's a livestreaming program where you're slowly building an audience over time, or a content production program where you slowly get more efficient at identifying topics where you can provide interesting content, you should be looking for that kind of efficiency as a leader. So I wouldn't say it's linear in any strict sense; it's linear with a percentage improvement fudge factor that you put on top, which winds up working as a goal to increase everybody's engagement.
Speaker 3: Definitely. Definitely. Well, thank you so much. Thanks for closing out this day of metrics, and thanks to all of you viewing and asking your questions on Slack. As I mentioned, this runs on Tuesdays and Thursdays, so I will see you this Thursday with more talks.
So thank you, Adam and Melissa. It's great seeing you again.
Speaker 1: Thank you for having us. This was super fun, and thank you for organizing. It's lovely to see you, and I'm happy to take questions from anybody via other channels. I'm @devrelchap on Twitter. Melissa?
Speaker 2: @solutiongeek.
Speaker 1: Super.