Software Interface Designer Manifesto

  1. A program without an interface is just machine code. Machines understand anything that follows the language rules. They don’t care about interfaces, but humans do. Interfaces are for humans.

  2. Interfaces are read many more times than they are written. The weaker an interface is, the more difficult it is to understand its intention. Respect other humans.

  3. Various tools, patterns and techniques may be used to create an interface, but only a human can blend them together in the right proportions. That is what makes a good interface.

  4. An interface designer doesn’t start at 9:00 am and stop at 5:00 pm. Creative work is not a machine with an on/off button. When conditions are good, a good interface will appear in 2 hours. When conditions are bad, 2 days may not be enough to create a good one. Don’t force it.

  5. Creating interfaces, despite its scientific nature, is an art. It may therefore be described as beautiful or awful, good or bad, strong or weak, or with whatever adjective fits. Judging interfaces requires both wisdom and experience.

  6. SOLID, TDD, DRY, CLEAN: they exist for a reason. Maintaining good interfaces is a pleasure. Otherwise there’s always WTF.

Stats Whisper, the stats gatherer

A few months ago I needed a simple tool that would gather certain app stats and integrate with our Rails apps easily. The underlying requirements were:

  • collect a visit counter and/or the response time of a given part of the app;
  • measure only certain (the most interesting) parts of the app, e.g. a concrete component or path, because an overall stats view is easily available with Google Analytics, so an additional toolset (collector, storage and visualization) sounds like overhead.

Meet Stats Whisper

So I’ve created Stats Whisper, a simple data gatherer for StatsD. Why StatsD? Because of its counter and timer data types, support for UDP packets and Graphite integration – we use Graphite internally as data storage.

From the Rails perspective, Stats Whisper is a middleware which inspects each request and gathers data according to the whisper_config.yml config file. Currently it only provides a whitelist of which requests – or parts of the app – are to be measured (execution time in ms and a counter for each route). The whitelist consists of regular expressions, e.g. ^/dashboard, matching only the interesting requests. The message is sent to StatsD (via a UDP port) as soon as the request completes.
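To give a feel for the format, a whitelist might look roughly like this (the key name and patterns are illustrative – check the gem’s README for the exact format):

# whisper_config.yml – illustrative only
whitelist:
  - '^/dashboard'
  - '^/products/\d+'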

It is essential to understand that the purpose of this library is to focus only on requests defined in the whitelist. All remaining requests are skipped, because the library aims to measure only the most interesting parts of the app, e.g. a concrete component – let’s say the user dashboard, a product, a set of products or whatever is important to unleash the business value. Generally speaking, it’s up to the end user what to measure and why.

The Stats Whisper library is not the only one on the market. I’m familiar with:

  • statsd-instrument, which can measure the execution time of any app method or count method invocations, so it works even closer to the app than Stats Whisper;
  • scout_statsd_rack, which measures the execution time and count of requests for every app path – it’s not possible to specify only certain paths.

A word about stats gathering

The aim of such measurements is to find anomalies that prevent the business from working normally. It is important to understand what to measure and why. Start with the critical components of your app and consider which parts might be the most important for the end user. The Stats Whisper library will help you gather appropriate statistics and identify bottlenecks. As an example, consider the chart shown below:

Understand the noise

“On average” (the quotes are on purpose) the response time is about 100 ms per request; however, sometimes it’s an order of magnitude higher than the average. I’ve looked around and found that these peaks occur when a user performs a search action, which was the bottleneck in this case.

Regarding the quoted “average”, note how StatsD computes its statistics, especially for the timing data type. Be careful with these parameters, because they may give you a misleading picture. The values themselves are solid, but consider what mean or max actually tell you and how they may change your point of view.

See interactions at peak performance

Another useful part of app statistics analysis is the ability to reveal peak periods and how crucial components of the system behave while such events occur. See the chart shown below:

This is real data gathered during student enrollment for elective courses. Enrollment started at 8 a.m., where the highest peak can be observed. The response time of each student request was measured and sent to StatsD counter and timer objects; the results are shown in the first and second rows. It’s worth noting that despite the peak load, the upper (max) value of the StatsD timer didn’t grow much for the main page and the dashboard. I’ve also attached the CPU load average to the chart to show that it’s a rather useless measurement here – it barely reflects the peak traffic and tells you nothing about what is actually happening.

DevOps in small companies – part II – entering automation

A few months ago I wrote the first post in this series and it seems it’s time to continue the discussion, because things didn’t stop. Not at all.

The investment in configuration, or to be more specific, in automating things, isn’t free. It depends on many factors, obviously, and here it was a compromise between what needed to be done and what could be done. In our case automation, configuration management (CM) or anything in between fell into the second category. The world wouldn’t end without CM solutions on board – especially here, where we don’t manage a farm of VMs in a cloud environment and where, to be honest, any action could be done manually.

Even if you manage just one simple VM, I’d automate it

Over the last months we’ve done a lot in terms of automation. We’ve also learned a lot – not only new tools, but what I’d call ‘good practices’ for overall environment management. We manage about 10 VMs, so it’s not much, and they live in a private university cluster. We’re not in the cloud, with all of its pros and cons, but we try (and eventually apply) some cloud solutions, e.g. we really value the cattle vs. kittens paradigm (covered in the first blog post of this series).

Although we don’t manage big clusters or clouds, we arrived at some good practices that apply in any environment. We believe that:

  1. Any step towards automation makes your environments less error-prone. This is insanely important in any environment, whether you have a huge cluster or a single VM, because tools work fine until someone touches them, right? If so, don’t let anyone touch anything directly – automate it.
  2. Any part of automated configuration is recreatable and repeatable, and so it’s testable! You can test whatever you want, however you want, before putting it into the production environment.
  3. Any part of automated configuration can be reused and applied in any other environment. These are the so-called roles, and you can reuse them for any environment you’d like to provision.
  4. Automation standardizes your environment, be it a huge cluster or a single VM. It encourages you and everyone else in the team to do things in a specific way, so that anyone, after any amount of time, can take over. Whether you edit the config of an important tool or just add another package to the system, it all lives in one place.

Automating things isn’t free

Daily work still needs to be done, because automation isn’t the top priority. Having said that, we did most of the CM-related work in our spare time. Week after week another component joined the “automated WALL·E family”. We used Ansible as the CM tool and I personally believe it was a good choice, because it simply let us do the job. We also introduced a few tools to achieve simple CI: we added Jenkins, which integrates with our Gerrit code review, so each Ansible change was tested against the staging environment before being merged into the master branch. Furthermore, on every merge to master, Gerrit triggered an event and Jenkins would run the production build. The complete process is shown below:

However, running automated things is – so don’t keep dinosaurs

Once you’ve built automated configuration, your environments are no longer pets or dinosaurs. They’re easily recreatable and, if needed, configurable at scale. However, the word ‘scale’ is not even necessary here. Even with just a single VM, e.g. a company developer-tools VM, it’s good practice to clean it up and automate it, because such VMs become dinosaurs fast. Once the toolset has been installed, nobody wants to touch it at all, because who remembers why things exist the way they do?

To give concrete examples, we’ve entered the automated configuration world and benefit from:

  • Standardization, where the old dinosaur-like VMs became manageable again.
  • Testability of changes, where each change can be tested before it goes into the production environment.
  • Recreatable environments, so we can forget about major VM system upgrades and instead create exactly the same VM with a newer system version – the so-called zero-downtime migration.
  • Monitoring things. It’s a shame to admit, but we weren’t monitoring our services until then. It’s quite interesting what metrics can tell you about a particular service or the whole system – among other things, counting requests or measuring response times for certain views (actually, that’s a topic for another blog post).
  • …each other, because all the configs, packages and other managed things live in one place, so anyone can enter the repository and see exactly how something was set up or installed. It’s all way more transparent.

Don’t feel ashamed and start automating things today.

First solution isn't always the smartest – a few thoughts about using Ansible

Basically, this post is a continuation of Why we don’t focus on testing Ansible roles extensively; it touches on Ansible and expands on, among other things, a few thoughts about using this tool within a CI environment.

The problem: the Ansible playbook takes too long to execute

The context

Having a set of VMs and several roles to execute, I started to think about how to shorten the execution time within the cluster.

First solution – extract and execute only the code that’s been changed

As we use CI for Ansible here, the first idea was to execute only the role that has changed. It sounds quite reasonable, because only a concrete piece of the playbook is executed, without touching all the rest. However, it works smoothly only as long as internal roles are concerned. Let me explain the current solution for the staging environment. What gets executed after a change is pushed to the repository is determined by a piece of Bash script:

# Collect the names of the roles touched by this commit (the second path
# segment under roles/), skip the requirements file and join them with commas.
tags=`git show --pretty="format:" --name-only $GIT_COMMIT | grep -v 'roles/requirements.yml' | grep -e 'roles\/' | awk -F "/" '{print $2}' | paste -sd "," -`
if [ -n "$tags" ]; then
  echo "Running for tags: $tags"
  ansible-playbook --tags="$tags" -i staging_inv site.yml
else
  # Nothing matched under roles/ – execute the whole playbook
  ansible-playbook -i staging_inv site.yml
fi

In particular, it extracts what has changed from the Git tree and forces the build to run only for concrete tags. These tags match role names, e.g. if any file of the common role has changed, the build executes only the common role. Unfortunately, it shines only until you add an external role. Given that, let’s say the main playbook directory structure looks like this:

$ tree ./ -L 1
├── ansible.cfg
├── files
├── group_vars
├── host_vars
├── roles
│   ├── ...
│   ├── requirements.yml
├── site.yml
└── staging_inv

When you add an external role, what you do – in most cases – is extend *_vars with some configuration variables related to the role, and that’s all. This provides great flexibility for including additional roles; however, it also defeats the extraction of only certain roles to execute (based on the piece of code shown above). For an external nginx role, for example, you’d only need to add some variables related to the role, so the above extraction script wouldn’t match any code within the roles directory and would hence perform all tasks defined in the playbook, as illustrated below.
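To illustrate, adding such an external role typically touches only the requirements file and some vars – nothing under roles/<role_name>/ that the script could match (the Galaxy role name and variables below are just an example):

# roles/requirements.yml – pull the external role from Ansible Galaxy
- src: jdauphant.nginx
  name: ansible-role-nginx

# group_vars/web.yml – configure it; no file under roles/ changes
nginx_sites:
  default:
    - listen 80
    - server_name example.com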

Second solution – build a wrapper role

Any Ansible role may depend on any other role, and the roles it depends on are executed first. Role dependencies are declared in the host role’s meta/main.yml:

---
dependencies:
  - ansible-role-nginx

The host role (the one having the dependencies) provides all essential variables for the roles it depends on, and it plays nicely. Basically, the nginx wrapper role looks like this:

$ tree ./roles/nginx/ -L 1
├── defaults
├── meta
├── tasks
└── vars

where vars provides common variables for the ansible-role-nginx role. The word common is on purpose, because what if you’d like to deliver configuration for several nginx instances, each differing slightly (e.g. having a different SSL cert)? The whole wrapper-role plan falls apart, because it must somehow be distinguished what plays where, so the solution would likely be to use either group_vars or host_vars – and the extraction script doesn’t know anything about these directories (they reside in the playbook’s main directory).

However, there’s still some light in this approach of using wrapper roles:

  1. The nginx case is quite unusual. In most cases it’s sufficient to use the wrapper role’s vars and define the essential variables there.
  2. The external role’s common code gets its own isolated environment, which can be tested using the Bash script above.
  3. A wrapper role may include additional tasks, and these are applied right after all dependencies are applied. However, applying pre-role tasks needs a different approach.

The problem – applying pre-role tasks for a certain role

The context

The current design for applying pre or post tasks of certain roles is limited to concrete pre_tasks/post_tasks defined within a playbook. Such an approach, however, implies that the playbook becomes both the declaration and the definition of roles and tasks, which sounds like a straight road to spaghetti code.

Everything should be roleized

Because it keeps your code clean and readable, no matter whether it’s a bunch of tasks or just one that creates a directory. Be consistent in what you do and it will pay off. Instead of adding pre_tasks to your playbook, create another role, e.g. pre-nginx, that simply creates a cache directory or whatever is needed before the role is executed – a sketch follows below.
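A minimal sketch of such a role (the role name, paths, owner and modes are illustrative) could be a single task, listed before the external role in the wrapper’s dependencies so it runs first:

# roles/pre-nginx/tasks/main.yml – illustrative helper role
---
- name: Ensure nginx cache directories exist
  file: >
    path={{ item }}
    state=directory
    owner=www-data
    group=www-data
    mode=0755
  with_items:
    - /var/www/nginx-cache
    - /var/www/nginx-tmp

# roles/nginx/meta/main.yml – dependencies are executed in the order listed
---
dependencies:
  - pre-nginx
  - ansible-role-nginx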

The problem – complex role configuration and staying DRY

The context

Let’s say you have an nginx role on board and it manages many Nginx instances. Some of them need different SSL certs or work with different application servers. How do you manage that and stay DRY?

Cheat with Jinja2 features

Ansible uses YAML for task definitions and, despite its simplicity, YAML has some limitations (e.g. no config inheritance). This is where the Jinja2 template language helps. Let me explain with an example, using this nginx role. The role follows the wrapper-role pattern described above and contains:

# meta/main.yml
---
dependencies:
  - ansible-role-nginx

# vars/main.yml
---

common_conf: |
  index index.html;

  location /favicon.ico {
    return 204;
    access_log     off;
    log_not_found  off;
  }

  location /robots.txt {
    alias ;
  }

  ...

nginx_configs:
  ssl:
    - ssl_certificate_key /cert.key
    - ssl_certificate     /cert.pem
  upstream:
    - upstream 

nginx_http_params:
  - proxy_cache_path  /var/www/nginx-cache/  levels=1:2 keys_zone=one:10m inactive=7d  max_size=200m
  - proxy_temp_path   /var/www/nginx-tmp/

Then, for a concrete host’s or group’s vars in your inventory, specify the final configuration. Let’s say you have a foo app and you’d like to provide config for the bar host that resides in your inventory file. Given that:

# host_vars/bar/nginx.yml
---
root_dir: /var/www/foo/public/
location_app: |
  proxy_pass http://some_cluster;
  proxy_set_header X-Accel-Buffering no;
  ...


location_app_https:
  - "{{ location_app }}"
  - proxy_set_header X-Forwarded-Proto https;

app_common_conf: |
  server_name bar.example.com;
  root {{ root_dir }};

  location / {
    try_files $uri $uri/index.html $uri.html @app;
  }
nginx_sites:
  status:
    - listen 80
    - server_name 127.0.0.1
    - location /status { allow 127.0.0.1; deny all; stub_status on; }
  app:
    - listen 80
    - "{{ common_conf }}"
    - "{{app_common_conf}}"
    - |
      location @app {
        {{ location_app }}
      }
  app_ssl:
    - listen 443 ssl
    - "{{common_conf}}"
    - "{{app_common_conf}}"
    - |
      location @app {
        {{ location_app_https | join(" ") }}
      }


upstream:
  some_cluster { server unix:/var/www/foo/tmp/sockets/unicorn.sock fail_timeout=0; }

And the certs file, encrypted with ansible-vault, is given as:

# host_vars/bar/cert.yml
---
ssl_certs_privkey: |
  -----BEGIN PRIVATE KEY-----
  ...
  -----END PRIVATE KEY-----

ssl_certs_cert: |
  -----BEGIN CERTIFICATE-----
  ...
  -----END CERTIFICATE-----

The nginx role doesn’t install the SSL certs itself, so it’s up to you how and where you put them. However, it can simply be achieved with these tasks, applied before the nginx role:

- name: Ensure SSL folder exists
  file: >
    path={{ssl_certs_path}}
    state=directory
    owner="{{ssl_certs_path_owner}}"
    group="{{ssl_certs_path_group}}"
    mode=700

- name: Provide nginx SSL cert.pem
  copy: >
    content="{{ ssl_certs_cert }}"
    dest={{ssl_certs_path}}/cert.pem
    owner="{{ssl_certs_path_owner}}"
    group="{{ssl_certs_path_group}}"
    mode=700

- name: Provide nginx SSL cert.key
  copy: >
    content="{{ ssl_certs_privkey }}"
    dest={{ssl_certs_path}}/cert.key
    owner="{{ssl_certs_path_owner}}"
    group="{{ssl_certs_path_group}}"
    mode=700

Note the difference between > and | in YAML. The former is the folded style, where newlines are replaced with spaces, whereas the latter (the literal style) preserves newline characters.
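A tiny illustration of both styles (the resulting strings are shown in the comments):

folded: >
  line one
  line two
# => "line one line two\n"

literal: |
  line one
  line two
# => "line one\nline two\n"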

Jinja2 templates, in conjunction with YAML features, provide great flexibility in config definition. However, as of Ansible 2.0 this will likely change slightly, because it will be possible to use the Jinja2 combine filter for merging hashes.
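For example, a host could then override just one key of a default hash instead of redefining the whole thing (the variable names are made up):

# group_vars/all.yml
nginx_defaults:
  worker_processes: 2
  gzip: "on"

# host_vars/bar/nginx.yml
nginx_overrides:
  worker_processes: 8

# used in a template or task (Ansible >= 2.0):
#   {{ nginx_defaults | combine(nginx_overrides) }}
#   => {'worker_processes': 8, 'gzip': 'on'}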

Why we don't focus on testing Ansible roles extensively

We provision our environments with Ansible and we want them to be super-reliable. However, with sometimes several deployments a day, how do we ensure that a change won’t ruin the production environment? Some whisper to move to the container world and get rid of the traditional way of provisioning and maintaining environments. Here, in the middle of major Ops changes, we use a private cluster running on bare metal, so we have slightly different requirements than the cloud world. We don’t use containers everywhere and we don’t plan to, at least in the app-related context. As we provision with Ansible, we want to be sure that no change will cause an environment outage.

Testing any CM tool is not trivial, because it essentially needs an isolated environment in which to fire tests. It’s not just a matter of RAM or CPU cycles, but primarily of having the dedicated environment the services need to operate. Moreover, as we use a private cluster that we don’t manage ourselves, we just have a bunch of VMs we can use however we need, but without any easy way to drop or spin up a new VM.

Testing Ansible changes

Ansible marvelously implements the roles-profiles pattern, which gives us the ability to test any particular service in isolation – let’s call it a service unit test. In Ansible terms, any service is simply a role that delivers a set of tasks to ensure that the service is up and running. Here, we can distinguish certain test-level criteria:

  1. Service is up and running on localhost.
  2. Service talks to authorized clients.
  3. Service delivers appropriate content.

The first level is often covered by the role itself, and since you’d use something out of the box, you get it included. Ansible has a bunch of predefined modules and tons more roles within Ansible Galaxy, maintained by a vast community. Actually, it’s very likely that any tool you could imagine using already has a well-prepared role ready for deployment.
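For example, a level-one check can be expressed as a simple task at the end of the role, using the stock wait_for module (the port and timeout are just an example):

- name: Verify the service answers on localhost
  wait_for: >
    host=127.0.0.1
    port=80
    timeout=10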

The next levels of tests are completely up to you, but you’ll probably find that it gets complicated fast, even for a small change, e.g. adding another web-VM instance to the hba.conf file to get access to the PostgreSQL database. So we started to consider having CI for the infrastructure provisioner, where:

  1. The cost of environment preparation is relatively small.
  2. Execution time is kept as short as possible.

Having these assumptions defined, consider the schema below:

In short, when a developer commits a new change to Gerrit, Jenkins triggers a new job for the test-kitchen gem, which internally spawns Docker container(s) to test the change. test-kitchen is able to spin up several containers at once and run tests concurrently. To distinguish which roles have changed per commit:

git diff-tree --no-commit-id --name-only -r COMMIT_ID | grep -e 'roles\/'

I’ve built an example of how to use test-kitchen with a predefined Docker image, where tests run in a matter of seconds. It really works great, but in the context of a single role, not the whole system. The awesomeness disappears when you realize it’s not what you wanted to achieve, because for Ops – in my opinion – it’s more important to focus on integration tests that give a more customer-oriented view of the environment, e.g. at least testing whether a given service is running or responding instead of whether a directory exists or a config has changed.
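For reference, a minimal .kitchen.yml in that spirit might look roughly like this (the Docker image, playbook path and platform are illustrative; the kitchen-docker and kitchen-ansible plugins are assumed):

---
driver:
  name: docker
  image: internal/ansible-ready-debian   # a prebuilt image keeps runs fast

provisioner:
  name: ansible_playbook
  playbook: test.yml                     # applies only the role under test

platforms:
  - name: debian-8

suites:
  - name: default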

Indeed, if tests run per role, it’s easy to spin up test environments and run tests fast thanks to containers. Such tests, however, have the drawback that they don’t give the value you’d expect – each role provides some service, but testing a single service without interaction with the other services is rather meaningless. Nginx won’t serve appropriate content without interacting with some app server, the app server won’t serve appropriate content without some database, and so on.

On the other hand, blending Docker, Jenkins and whatever other tools into a CI pipeline just to test Nginx availability on port 80 is like using a sledgehammer to crack a nut. So we decided to discontinue this process, because of the overhead of preparing test environments relative to the value of the results.

The good, the bad and the ugly

Nonetheless, the idea of role-oriented tests is definitely worth looking at. With some investment in scripting and perhaps Docker Compose on board, it could spin up an environment with services talking to each other, but that’s still an overhead to deal with. Besides, Docker containers also have limitations regarding changes to container networking or the firewall (they need the extra --privileged mode), which should also be considered before going with containers.

As for our CI environment, so far we’ve ended up testing Ansible changes by running the appropriate playbook with the --syntax-check and --check flags from within a Jenkins job, plus peer review.