{"id":1621,"date":"2014-02-03T13:00:06","date_gmt":"2014-02-03T13:00:06","guid":{"rendered":"http:\/\/writeasync.net\/?p=1621"},"modified":"2014-02-02T23:33:06","modified_gmt":"2014-02-02T23:33:06","slug":"testing-at-the-right-level","status":"publish","type":"post","link":"https:\/\/writeasync.net\/?p=1621","title":{"rendered":"Testing at the right level"},"content":{"rendered":"<p>A large-scale distributed service is deployed to a datacenter across hundreds of machines. The basic topology is as follows:<br \/>\n<a href=\"http:\/\/writeasync.net\/wp-content\/uploads\/2014\/02\/ServiceTopology.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/writeasync.net\/wp-content\/uploads\/2014\/02\/ServiceTopology-300x162.png\" alt=\"ServiceTopology\" width=\"300\" height=\"162\" class=\"alignnone size-medium wp-image-1671\" srcset=\"https:\/\/writeasync.net\/wp-content\/uploads\/2014\/02\/ServiceTopology-300x162.png 300w, https:\/\/writeasync.net\/wp-content\/uploads\/2014\/02\/ServiceTopology.png 684w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><br \/>\nConsider the following scenarios and requirements:<\/p>\n<ul>\n<li>The service should respond with an error if a client requests a nonexistent resource.<\/li>\n<li>The load balancing algorithm should promote fairness and spread load across workers.<\/li>\n<li>The workers should be fault-tolerant and recover from transient failures.<\/li>\n<\/ul>\n<p>What would be an appropriate set of tests to evaluate the service against these criteria? As with anything in the realm of testing, there are virtually infinite possible &#8220;correct&#8221; answers to this depending on risks, costs, schedule concerns, and so on. Using this example, let&#8217;s drill down into the requirements and explore a few different possibilities. 
Hopefully this can be a jumping-off point for evaluating benefits and tradeoffs while planning a large-scale testing effort.<\/p>\n<p><strong>The service should respond with an error if a client requests a nonexistent resource.<\/strong> Yes, this is rather vague, so let&#8217;s fill in some details. Assume that &#8220;resources&#8221; are <a href=\"http:\/\/en.wikipedia.org\/wiki\/Persistence_(computer_science)\">persistent<\/a> and allow typical <a href=\"http:\/\/en.wikipedia.org\/wiki\/Create,_read,_update_and_delete\">CRUD operations<\/a>. The <a href=\"http:\/\/en.wikipedia.org\/wiki\/Thin_client\">client is thin<\/a> and uses a simple <a href=\"http:\/\/en.wikipedia.org\/wiki\/Representational_state_transfer\">REST-like pattern<\/a>. Armed with this information, we might reasonably conclude the following:<\/p>\n<ul>\n<li>The client has loose coupling to the server and very little processing logic.<\/li>\n<li>Given the persistence model, a resource that <em>currently<\/em> does not exist may have <em>previously<\/em> existed.<\/li>\n<\/ul>\n<p><strong>The load balancing algorithm should promote fairness and spread load across workers.<\/strong> Those pesky product managers strike again with their vague terminology! Let&#8217;s say that, after further clarification, we find that &#8220;load&#8221; is intended to be measured by &#8220;active request count&#8221; on a worker. Requests are routed to less loaded workers first. In case of a tie (say, all workers have exactly 10 active requests), the request is routed to a randomly chosen worker using a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Uniform_distribution_(discrete)\">uniform distribution<\/a>. 
Based on this information, we might make these observations:<\/p>\n<ul>\n<li>The balancing algorithm itself can be easily described, modeled, and verified in isolation.<\/li>\n<li>Despite theoretical &#8220;perfection&#8221; of the algorithm, this model is likely to produce imperfect results in several real-life situations.<\/li>\n<\/ul>\n<p><strong>The workers should be fault-tolerant and recover from transient failures.<\/strong> Where to start? We could probably nitpick every word in this sentence. Let&#8217;s say that we come to learn that a worker is hosted in its own process and there is some external watcher that checks if the worker process is up and restarts it if not. (By contrast, a <em>persistent<\/em> failure might be the total shutdown of the machine where the worker is running &#8212; no recovery guarantee is made in this case.) With this bit of clarity, we could perhaps say this:<\/p>\n<ul>\n<li>The system must first experience a fault in order for us to even evaluate fault-tolerance and recovery.<\/li>\n<li>The ability to recover is likely to be impacted by the state of the worker prior to the fault.<\/li>\n<li>Though there is apparently no <em>automatic<\/em> recovery guarantee for persistent failure, someone will need to do <em>something<\/em> about it should this arise in practice.<\/li>\n<\/ul>\n<p>So now that we have some halfway-clear requirements and a few initial observations, we might begin sketching out some tests. But first, let&#8217;s consider some heuristics of testing large systems. The bigger the system under test gets, the less detailed we can be in usefully measuring any single component&#8217;s behavior. Put another way, surface area (breadth) and precision (depth) tend to be inversely correlated. 
By the same token, the cost of testing also becomes greater &#8212; think of the required resources, the setup time, the development and execution time, and the analysis effort needed for a test running across hundreds of machines.<\/p>\n<p>This is not to say that such large-scale tests are not useful. Rather, it means that we must be very intentional about the <strong>level of testing required<\/strong> to meet the goals of the project while adequately accounting for risks. We should <strong>be wary of one-size-fits-all approaches<\/strong> (e.g. &#8220;every test will require a production-scale environment&#8221;). It is more likely that we would have to employ a <strong>diverse set of tests at many different levels<\/strong> to achieve business goals, knowing that we can&#8217;t expect to simultaneously maximize <a href=\"http:\/\/en.wikipedia.org\/wiki\/Project_management_triangle\">fast, good, and cheap<\/a>.<\/p>\n<p>In the next few posts, I&#8217;ll explore these ideas in more detail, discussing some common tradeoffs in light of the above examples.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A large-scale distributed service is deployed to a datacenter across hundreds of machines. The basic topology is as follows: Consider the following scenarios and requirements: The service should respond with an error if a client requests a nonexistent resource. 
The&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[81,51],"tags":[],"class_list":["post-1621","post","type-post","status-publish","format-standard","hentry","category-distributed","category-testing"],"_links":{"self":[{"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/1621","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1621"}],"version-history":[{"count":0,"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/1621\/revisions"}],"wp:attachment":[{"href":"https:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1621"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1621"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1621"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}