What is the job to be done of your job?

A few days ago three of us were walking back from the Devonshire Square market after having enjoyed the warm, sunny weather by eating our food outside. The conversation had wandered but eventually rested on careers. Specifically, we had hit upon the question of how to decide on what one’s next position might be. How can one decide between staying at an existing job and leaving for something else? And if the decision is to leave, how do you choose between the different positions on offer?

The question seems easy to answer by just listing out specific aspects one could be looking for: a higher salary, better working conditions, a place with less overtime, the one where you’ll learn the most. But that hasn’t answered the question at all! It has actually made the question even harder to answer by obscuring the issue with too many details while, at the same time, not providing any way of deciding which of these specific aspects the decider cares about.

To get to the real answer, you have to first ask yourself a very different question: what are you hiring this new (or your existing) job to do for you?

Are you looking to have it pay off your debts? Move you to another country? Teach you a new skill? Allow you to have time to fish?

If you look at the job as a product to be bought or a service to be hired, what are you hoping to get out of it?

Once the job to be done is clear, then the choice between the different aspects above will be, if not easy, at least guided toward a goal.

Posted in Software, Stories

Reading Notes from “Johnny Bunko: the last career guide you’ll ever need”

There is no plan

The world is too unpredictable to plan out your career or your life in advance. You can make choices for instrumental reasons: the choice is based on its perceived utility for something else down the road, and the object of the choice is seen as an instrument of a plan. Or you can make choices for fundamental reasons: the choice is based on the inherent value of its object, which is seen as good in and of itself (this reminds me of the Meditations and the Stoicism a friend has been talking about). Because things are so unpredictable, instrumental reasons often lead to the wrong choices for the plan anyway. Most successful people, most of the time, use fundamental reasons.

My interpretation: using fundamental reasons actually keeps your options open and keeps you happy at the same time! Instrumental reasons shut down options and are likely not what you actually wanted to do, thereby making you less happy.

Think strengths, not weaknesses

Capitalize on what you’re good at. Don’t focus on what you are bad at.

My interpretation: this seems to align well with the first rule. Focussing on weaknesses is an instrumental reason: you think that by getting better at something you currently suck at, you will be better at something else; it is just a step toward some planned-out goal. Focussing on your strengths is a fundamental reason for doing something. Your strengths are things you already inherently value. They are the things that give you energy.

It’s not about you

“You are here to serve, not to self actualize.” You are doing your best if your actions bring out the best in others.

My interpretation: I read this as something like pay it forward, but that doesn’t seem right. Pay it forward is an instrumental reason: you do it because you think that someone will help you out down the line. The message here is that you need the mindset that there is inherent value in serving, at least some of the time if not most of the time.

Persistence trumps excellence

Small accomplishments compound. Keep doing them, persistently, and you’ll get ahead of anyone who does a few big accomplishments. “What do musicians and athletes do that others don’t?…They show up. They practice and practice and practice some more.”

My interpretation: I learned at least two valuable things from kendo: “practice doesn’t make perfect, perfect practice makes perfect” and you won’t improve if you only practice once a week.

Make excellent mistakes

Don’t try to avoid mistakes; you’ll end up being too concerned about being wrong and won’t want to try anything. Successful people make huge mistakes, but they aren’t stupid mistakes (“oops, I forgot to send the mail to tell them this is happening”); they make big mistakes because they are trying something new. Each mistake moves them a little bit closer to excellence. They learn from their mistakes. It is riskier to avoid mistakes than to make mistakes and learn from them.

My interpretation: screwing up isn’t a problem; in fact, it is the best way to learn. Screwing up in avoidable, repeated ways is a problem. Don’t back off. Keep trying. Persistence is better than excellence, and persistence in the face of excellent mistakes lets you combine the two.

Leave an imprint

Ask yourself now what your legacy is. Your life isn’t infinite and so you should use your limited time to do something that matters. Apply yourself to something that is larger than yourself.

My interpretation: strive to use your strengths persistently to make excellent mistakes on things that aren’t about you and don’t get tunnel vision on a plan that is out of your control anyway. That will lead to a fulfilling life that you can look back on and say was a good life.

Posted in Uncategorized

Math for Programmers

Given that software development has so many ties to mathematics, I’m always surprised by how seldom it seems to show up in my day-to-day work. Today, however, was a little bit different.

Finding success through failure

One of the items I dealt with, where some math was useful, was a set of test failures that we’ve been seeing at work. The software is run through its automated tests in a number of variations. There are two axes of variation: the style of installation and the operating system it is running on. The total number of variations that get run is simply the product of the sizes of the two axes; in this case the axes have 4 and 14 entries, which gives 4 × 14 = 56 variations.

Each time the entire suite of tests ran, some seemingly random number of them would fail for a reason that we couldn’t, and still can’t, explain. It was completely random from everything that we could see. However, the number that failed was fairly consistent: around 11 of the variations failed each time. That meant 11/56 ≈ 0.196, or roughly 20%, of the variations failed every time.

[Screenshot: results of the full test run]

Now, it is really tempting to take just those failed variations, and none of the others, and run them again. For some reason it feels like you are learning something about the failures by doing that (there are cases in which you might, but let’s not get into that right now). However, the key point is that which variations fail has been random in every occurrence that we have seen. If 20% of the original set failed, then the expectation is that 20% of the rerun set will fail.

In my real life situation, those 11 failed variations were rerun. What happened? Well, what should I expect to happen? If this really is a random occurrence, then 20% of the 11 should fail again. 11 * 0.20 = 2.2. So I should expect about 2 of the variations to fail again. What actually happened? One of them failed. In such a small sample set, that seems to fall pretty well inside the realm of probability.

[Screenshot: results of rerunning the failed variations]
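Just how surprising is a single failure? Here is a rough check, assuming each rerun variation fails independently at the observed ~20% rate. The numbers come from the runs above; the little binomial helper is only for illustration.

# A minimal sketch: probability of k failures out of n reruns,
# assuming each variation fails independently with probability p.
def binomial_pmf(n, k, p)
  choose = (1..k).reduce(1) { |acc, i| acc * (n - k + i) / i.to_f }
  choose * p**k * (1 - p)**(n - k)
end

p_fail = 11.0 / 56   # ~0.196, the failure rate seen in the full run
n      = 11          # number of variations that were rerun
p_zero = binomial_pmf(n, 0, p_fail)
p_one  = binomial_pmf(n, 1, p_fail)
printf("P(0 failures)  = %.2f\n", p_zero)          # ~0.09
printf("P(1 failure)   = %.2f\n", p_one)           # ~0.24
printf("P(at most one) = %.2f\n", p_zero + p_one)  # ~0.33

Roughly a one-in-three chance of seeing at most one failure out of 11 reruns, so a single failure is well within the realm of chance for one observation.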

Let’s step back for a bit. In this kind of situation, how many reruns would I need before I get a “fully passing” run, one in which, by rerunning only the failures, every variation has eventually passed? The answer comes from repeated application of the 20%. In the first run 20% fail. In the second run 20% of that 20% fail, which is 4% of the total. On a third run 20% of those fail, which is 0.8% of the total. At that point each individual variation has a 99.2% chance of having passed at least once, so after two reruns I should expect, on average, less than half a failing variation to remain across the whole suite.
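The decay is easy to sketch. The 20% rate and the 56 variations come from above; the loop itself is just an illustration of the arithmetic.

# Expected number of failures on each successive run, assuming a constant,
# independent ~20% failure rate and rerunning only the failures.
failure_rate = 0.2
remaining    = 56.0   # all variations run the first time
3.times do |run|
  expected_failures = remaining * failure_rate
  puts "run #{run + 1}: expect ~#{expected_failures.round(1)} failures"
  remaining = expected_failures   # only the failures are rerun
end
# run 1: expect ~11.2 failures
# run 2: expect ~2.2 failures
# run 3: expect ~0.4 failures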

This is exactly what we saw when the failed variations were re-run. On the third run (the second rerun) everything had magically passed! Everything was good! Unicorns and rainbows!

[Screenshot: fully passing run after the second rerun]

Except not. This was just the inevitable outcome of rerunning the failed variations, when the failures strike randomly. The 20% chance of failure is still there. It hasn’t gone away.

Mathematics destroys my illusions.

Something was a little fishy in there. Let’s look at some of those numbers again. The first run had a 20% failure rate (11/56 = 0.19642857 … not quite 20% but close enough for our purposes). However, a 20% failure rate predicts that the subsequent run would see 2.2 failures. Since a variation can only pass or fail, I interpret that as either two or three failures. But there was only one failure! What happened?

Let me explain a little about the failure that shows up in these tests. Every once in a while the system times out: the client makes a request to the server and the connection appears to just hang, time out, and then fail. The only hunch we have is that this is somehow related to load on the system. When the tests are run, there are about 30 of the test variations running at once, in addition to any other test jobs that might be running at the same time. The hypothesis is that all of the running tests are somehow causing things to slow down and, at times, time out. This is bolstered by the fact that when we run any one of the failing variations on its own, independent of the full run of all of the variations, we don’t see the errors.

So let’s go back to those numbers. If this really is some sort of load problem on the testing infrastructure, then rerunning just the smaller number of variations (11) might have changed the probability of encountering the problem. Instead of the 20% probability, which predicts seeing about 2 failures, the rerun showed a rate closer to 1 in 11, or about 9%. That is another data point suggesting this is something related to the load on the test infrastructure.

Posted in Uncategorized

Understanding Environment Caching in Puppet

Recently, puppet got a new way of defining environments. Along with the new environments came a new way of handling the caching of environments inside the puppet master. I’ve found that what puppet is doing now, and what it was doing in the past, isn’t well understood by users, so this post is here to get rid of some of the mystery.

I’m going to cover what an environment is internal to puppet, what the caching of an environment is (and isn’t), and what it used to be. In another post I’ll cover how to evaluate your situation and tune the environment cache and your workflow.

What is an environment?

Seems like such a simple question… but it isn’t. What an environment is in your puppet installation really depends on your workflow. In some workflows an environment represents a stage in a build and deploy pipeline (dev, test, stage, production). In other workflows, I’ve heard of environments used to represent different clusters of servers or data centers. Internal to puppet, the concept of an environment is much simpler (though still not trivial). At its absolute simplest, an environment inside puppet is an instance of the Puppet::Node::Environment class. A Puppet::Node::Environment has a name, a set of initial manifests, and a set of modules, each with its own manifests.

In addition to the pieces that are directly part of an environment, there are numerous others that are indirectly related. Custom functions, which are written in ruby, are indirectly associated with the environment, and custom types are in a similar boat. Neither is completely under puppet’s control, since both are extra ruby code loaded into the runtime.

Why does any of this matter? Any time a puppet master needs to construct a catalog for an agent node, it needs all of the information about that environment on hand. Scanning the filesystem for the relevant directories and files, as well as parsing and validating puppet code, takes time and puts extra load on the server. If you’ve ever investigated a computer that is performing too much IO, then you know how detrimental unnecessary disk operations can be to overall performance.

Improving Performance of Catalog Creation

Given that a puppet environment is needed for creating a catalog, and given that building one can be expensive, both in terms of IO operations and in terms of CPU time, what can be done? The answer is to reduce unnecessary disk and CPU operations. But now we are faced with another question: which operations are unnecessary?

A somewhat simplistic algorithm for what puppet does when creating a catalog is (ignoring lots of variations and other complications):

# collect and parse the environment's own manifests
program_code = []
environment.manifests.each do |manifest_file|
  program_code << parse(File.read(manifest_file))
end

# collect and parse the manifests from every module in the environment
environment.modules.each do |mod|
  mod.manifests.each do |manifest_file|
    program_code << parse(File.read(manifest_file))
  end
end

# evaluate the parsed code against this node and its facts to build the catalog
catalog = execute(program_code, node_name, facts)
return catalog

Every one of these operations is necessary, but the results of some change more often than others. Listing, reading, and parsing the manifest files only needs to happen when a file’s contents change, or when a file is added or removed. All of the other steps need to happen every time, either because the facts change every time, or because evaluating the program code has side effects which could affect the outcome. For example, custom functions that read from and write to databases could return different results on every invocation.

The environment caching in puppet takes the approach of preserving the parsed manifest files between uses of an environment. This allows the master to skip reading, parsing, and validating the manifests every time it needs to build a catalog. The last piece of the caching puzzle is when puppet will evict an environment from the cache and start again.

Environment Cache Eviction

A cache eviction strategy defines when to remove an entry from a cache. One of the most common approaches is a fixed-size cache with an LRU (Least Recently Used) algorithm to remove entries when more space is needed. Puppet uses a much simpler eviction strategy: time-to-live. When an environment is placed into the cache, the time is recorded; after a configured amount of time the entry is removed and the environment’s manifests will need to be parsed again.

Right now puppet only decides to evict a specific environment cache entry when that specific environment is requested. If an environment is requested once and never again, it will remain in the cache indefinitely. This doesn’t lead to memory leaks in most puppet master configurations, however, because passenger, and other Rack servers, terminate a worker process after a certain number of requests. If this weren’t the case, then over time many puppet master setups would have an ever-growing environment cache, since it is common to use short-lived development or testing environments. Those environments would never be freed from memory and the master process would continually grow in size (this is yet another reason why the WEBrick puppet master should not be used in production).
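A toy model of that strategy might look something like the following. This is not puppet’s actual implementation (the class, the names, and the build_environment helper are made up for illustration); it just shows the shape of a time-to-live cache keyed by environment name, where expiry is only checked when an environment is requested.

# Illustrative only -- not puppet's real code. A bare-bones time-to-live
# cache keyed by environment name; expiry is only checked on request.
class EnvironmentCache
  Entry = Struct.new(:environment, :expires_at)

  def initialize(ttl_seconds)
    @ttl     = ttl_seconds
    @entries = {}
  end

  def get(name)
    entry = @entries[name]
    if entry.nil? || Time.now > entry.expires_at
      # expired or never seen: rebuild, which re-reads and re-parses manifests
      entry = Entry.new(build_environment(name), Time.now + @ttl)
      @entries[name] = entry
    end
    entry.environment
  end

  private

  # stand-in for the expensive work of scanning and parsing an environment
  def build_environment(name)
    "parsed environment #{name}"
  end
end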

Right now the master only supports three eviction strategies:

  1. Always (environment_timeout = 0) – any time the environment is requested the cached information is evicted and recreated.
  2. Timeout (environment_timeout = <duration>) – if the cached information is older than the configured timeout it is evicted and recreated, otherwise the cached information is used as-is.
  3. Never (environment_timeout = unlimited) – the cached information will never be evicted.

Each individual environment can have its own eviction strategy configured and so it can be tuned to how that particular environment will be used.
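Concretely, the strategy for a directory environment is chosen by setting environment_timeout in that environment’s environment.conf (or as a default in puppet.conf). The environment names and values below are only examples:

# environments/production/environment.conf: a stable environment that rarely changes
environment_timeout = unlimited

# environments/dev/environment.conf: re-read and re-parse on every request
environment_timeout = 0

# a middle ground: keep the cached environment for three minutes, then re-parse
environment_timeout = 3m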

How did this used to work?

Before directory environments, puppet used a slightly different mechanism for refreshing the parsed manifests. Whenever it encountered a new environment it would create a new Puppet::Node::Environment instance, and that instance would never be thrown away. Instead, the Puppet::Node::Environment would track which manifest files it had parsed, the mtime of each file, and the last time that mtime was checked. Whenever a catalog was requested, it checked whether any mtime had last been checked more than filetimeout seconds ago (a setting that can be specified in puppet.conf). For any of the recorded mtimes that had expired, it would stat the corresponding file. If any of the new mtimes were different from the old ones, it would throw away all of the parsed manifests and begin parsing them all again.

That system led to several problems. First, it was much harder to explain and understand what would happen when (not all of the files time out at the same moment, so changing one file might cause an immediate reparse while changing another won’t). Second, it relied heavily on stat calls, which we had found were often a bottleneck and something we want to avoid as much as possible in production systems. Third, the internal design relied heavily on tracking and changing a lot of state, which made it harder for us to understand the code and ensure that it was correct.

The new environment caching system is a lot easier for us to work with. There have been some bugs and difficulties as we have tried to refactor the existing code to use the new system, but in the end it will be much more controlled and understandable to everyone.

In a follow-up post I’m going to cover how to evaluate the puppet master’s workload for tuning the environment cache.

Posted in Software

You’re done when

Every team that I’ve worked with has eventually come to ask the question: when is it done? And they don’t just mean “done”, they mean: when is it “done-done”? Every time, for whatever reason, this seemingly innocuous question becomes a sticking point in a lot of conversations. Sometimes everyone can agree on what it takes, but there are concerns about the feasibility of the definition in practice. Other times there is dissent about which points of done are actually within the team’s control. All of these are perfectly valid concerns. Some of them point to larger organizational problems, and some point to fear of tackling what at first looks like a daunting mountain of change just to know when you are “done-done”.

I’m not going to try to solve all of the issues that come up during those conversations, but I’m going to do my part to coax them along a little and put out what I’ve started using as my baseline definition of “done-done”.

You are done when:

  1. The code has been merged into all relevant branches.
  2. All automated tests pass on all affected branches.
  3. End user documentation about the change has been written.
  4. The feature or bug fix has been demoed to another member of the team, and that person has had the chance to explore the bug fix/feature and find any potential problems.
  5. Packages for all new releases of the product have been made.
  6. It has been confirmed by the customer to be what they wanted.

In essence it comes down to: the change is made (points 1 and 2), it has been communicated to others (points 3 and 5), it has been shown to be working (point 4), and it is what was actually wanted (point 6).

If you want to take it to the next level you can add another point to being “done-done”:

  1. The customer is using the change and is happy with it.
Posted in Software

You’re Wrong and Can Never Be Convinced Otherwise

Are you a rational, logical person? When you are given the data about a situation, do you change your mind? I would be willing to bet that you like to think about yourself in that light. I know that I do.

However, I know that I still hold beliefs in spite of the data. If you give me some data about a situation, I’m not very likely to believe it. I might accept it, I might even agree with it, but I’m unlikely to change the thought process that you, as the presenter of that data, think I should be changing. Why is that?

While sitting around, drinking some tea, and eating some lovely, lovely bread with Jeffrey Fredrick, we had covered topics ranging from ease at work, the results of the Strengths Finder test, beliefs that people hold, estimation and tracking in software development, and the problems people have understanding estimates, when Jeff blurted out (paraphrasing), “I’d love to put together something about naive pitfalls. For instance in selling, a naive way of going about it is believing that people want to hear about your product. They don’t, they want to hear how you’ll solve their problem.”

This brought out a whole list of naive fallacies that we’ve both found that people hold.

Fallacy 1: Scheduling

Belief: The way to get more done is to schedule more stuff.

Reality: The way to get more done is to work on a single thing until it is finished.

Fallacy 2: Persuasion

Belief: The way to persuade people is to give them data.

Reality: The way to persuade people is to agree that they have a certain understanding of something, and then provide a story that leads from their understanding to your understanding.

Fallacy 3: Feedback

Belief: Feedback is something that is given to someone.

Reality: Feedback is something that the person has to seek out.

Fallacy 4: Experience

Belief: The longer you do your job the better you get at it.

Reality: The more you deliberately practice your job, the better you get at your job.

The commonality that I see in all of these is the belief that the data alone will change what belief a person holds. In all cases the person who holds the belief needs to be the one taking themselves down a path that leads them from one belief about the world to another belief.

Take the fallacy of experience as an example. Purely gathering experience is not what produces change. Instead, the change occurs when the person experiencing the situation reflects on what has been occurring, inspects their own belief structures, and finds ways to change themselves to produce different outcomes. The data pointed to the existence of something around them, but the person had to find their own path to changing their surroundings.

It may be true that I can’t convince you that you are wrong (oh, the irony of linking to an article about confirmation bias here). I can, however, provide you with a story that will maybe lead you down the path of understanding why I think you are wrong.

Posted in Software, Stories

CITCON 2012 in Portland

CITCON is a conference that I’ve been attending since 2006, when it was held in London. I showed up because it was “near” where I was living in Germany and it sounded like it might be interesting. Boy, was it ever interesting! The only European one I’ve missed was because I had a final exam on the same day as the Brussels conference (damn you, formal methods!).

Well, now that I don’t live in Europe I probably won’t be making it to those CITCONs very much anymore, but there are still the North America ones. And what do you know? The next North America CITCON is right in my back yard here in Portland.

Having just moved to the city, I didn’t know much about much of anything. Knowing that Portland has a lot of beer and breweries doesn’t help much when you are trying to find a space that fits 100 people for an open-spaces-style conference, but after some searching (and drinking a few McMenamins beers) we found a place. Next was to get the word out.

The AgilePDX people gave me a lot of advice. The RubyPDX group also got behind it. The various meetups that I crashed put up with my pitches to get people to show up, as well. I used my now-tenuous connections to some groups in Seattle to get the word out up there a little.

In doing this I met a lot of great people and groups that I’m planning on keeping in touch with. One is AgilePDX here in Portland, which is going to be hosting another conference that meshes very well with CITCON: Agile Testing Open NW. If you can’t make CITCON, or if you find CITCON great, then I think you’ll probably want to go to ATONW, too.

Another group that I got in contact with, for those of you up in Seattle, is The Center of the Agile Universe (aka the Fremont Agile Meetup). They are also putting on a very interesting-sounding conference, MAQCon; once again, if you like CITCON or have to miss it, I think you should attend.

I’m really looking forward to this CITCON and hope to meet yet more people in the area.

Posted in Uncategorized