Reducing complexity using higher-order functions

Introduction

In my last post, I talked about the correlation between overall cyclomatic complexity and the defect count in your software.

I listed a few ways that you can reduce the overall complexity. One of those ways was to use the Map/Reduce/Filter patterns extensively in your software. In this article, I'm going to show you how to use these routines to dramatically lower your overall cyclomatic complexity.

Map, Reduce and Filter are just routines that describe common behavioural patterns in programs. They all share a crucial similarity: each takes as one of its arguments a function, which is executed within the routine itself.

As we shall see, this technique will allow you to reduce the cyclomatic complexity of your application. Sometimes the reduction can be very dramatic.

This concept of passing routines as arguments to other routines is not new; it's been around in languages such as Haskell, Lisp and Erlang for decades. The lack of side-effects in these programming languages forces you to think about functions in a more enlightened way. This sort of thinking is only now becoming mainstream in common programming languages such as Python, Java and C#.

Anonymous Routines

When we write a routine during our normal programming lives, we usually name it explicitly, much like the snippet below:

	def add(a,b):
		return a+b

	print add(6, 10)

The vast majority of program code written in modern software is of this form: variables are passed into routines, these routines do computation on that data and return a result. These routines have a specific name and access rules, so they can be reused within the framework of your program.

But what if, rather than merely passing data in the arguments of a function, we could pass routines themselves? Take a look at the following snippet:

	def construct_adder(valueToAdd):
		return lambda y: y + valueToAdd
  
	adder = construct_adder(10)
	print adder(6)

What's going on here? Well, the routine called "construct_adder" is a special sort of routine. It is a routine that returns another routine. We call any routine that takes another function as a parameter or returns a function a "higher-order function." In this case, construct_adder returns a routine that adds the number specified in the "valueToAdd" argument to whatever is passed to it. This routine is assigned to the variable "adder", which is executed in the print statement on the next line.

The key point here is that the routine is dynamically constructed at run-time. The routine that we've created doesn't have a name in the traditional sense. The only way it can be referenced is through the variable it is assigned to, in this case "adder." Once "adder" goes out of scope, the routine no longer exists and can no longer be executed.

The magic occurs in the "lambda" expression in the construct_adder routine. I don't want to get too hung up on the syntax, but essentially the lambda says, "construct an in-line function here, with y as an argument." The result of the function is the evaluation of the expression to the right of the colon.
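If the lambda syntax seems opaque, here's a sketch of the same routine written with a named inner function instead. The behaviour is identical; the inner name "add_to" is just mine for illustration:

	def construct_adder(valueToAdd):
		# A named inner routine doing the same job as the lambda.
		def add_to(y):
			return y + valueToAdd
		return add_to

	adder = construct_adder(10)
	print adder(6)	# prints 16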

While the example given above is pretty superficial, the ability to construct custom routines at run-time using user-supplied data is extremely powerful. It is a massive cyclomatic complexity slayer. In the rest of this article, I'm going to try and give you a flavour of how you can use anonymous functions to reduce the cyclomatic complexity of your programs.

Reducing complexity by using anonymous routines with higher-order functions

Using the Filter routine

One of the first things you learn when you start writing software is how to write a loop. In the vast majority of places where you see a loop, it is iterating over a collection of things and performing some operation. These routines often have many decision points, and they are decision points that can be culled if higher-order functions are properly deployed.

Consider this Python snippet:

	def get_matching_words(startswith, wordlist):
		result = []
		for word in wordlist:
			if word.startswith(startswith):
				result.append(word)
		return result

The aim of this routine is to return a list of all the words in "wordlist" that start with the value passed in the "startswith" argument.

This routine has a cyclomatic complexity of three. There's a code path for the case where wordlist is empty, a code path for where there are items in the word list but no matches, and finally a code path for when there are items in the word list that do match and the result gets appended.

You can do the same job in a single line using the Filter routine:

	def get_matching_words(startswith, wordlist):
		return filter(lambda word: word.startswith(startswith), wordlist)

The filter routine takes two arguments.

The first is a routine that takes a single argument. That routine must return True or False. Filter then executes that routine, passing each member of "wordlist" to it in turn. If the routine returns True for an entry, the item is added to the list returned by filter. If it returns False, the item is omitted from the list.

The second argument is the collection of items to run the routine on.

In the case above, I defined a single anonymous routine that checks whether the argument passed to it starts with the value specified by "startswith."
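For instance, a quick usage sketch (the word list is invented for illustration):

	words = ["apple", "apricot", "banana", "avocado"]
	print get_matching_words("ap", words)	# prints ['apple', 'apricot']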

It's pretty clear that this program is equivalent to the longer snippet above, except that this routine has a cyclomatic complexity of one. Your immediate thought here might be that I've just moved the cyclomatic complexity into the built-in "filter" routine. Yes, you're correct, but bear in mind that there are a whole host of problems that can be solved using the filter routine. The complexity increase of your program when using filter is constant, regardless of how many times you use it. Each time you use a for loop, you're going to increase the complexity by at least two: once for the code path represented by the body of the loop and once for the case where there are no items to loop through.
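To make that accounting concrete, here's a sketch of two more filters over the same word list. Each adds a complexity of one, where two hand-rolled loops would have added at least four between them (the predicates are made up):

	short_words = filter(lambda word: len(word) <= 5, wordlist)
	long_words = filter(lambda word: len(word) > 5, wordlist)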

Another objection that might be raised is that loops are easy, and that removing complexity in this way is not focusing on the real cause of bugs. I disagree: people do screw up loops, and they're easier to get right if you use higher-order functions.

Using Map and Reduce

When I was writing this article, I decided to go and hunt through the code base at work and find some real-life loops that we could apply these ideas to. For copyright reasons, I can't post the exact code. However, I did come across a loop that is easy to screw up and looks much neater using Map and Reduce. I've loosely translated it into Python.

Take a look at this code snippet:

	def construct_list(items):
		result = ""
		for counter in range(0, len(items)):
			result += str(items[counter])
			if counter < len(items) - 1:
				result += ","
		return result

The purpose of this routine is to concatenate a list of integers into a string separated by commas. In our C# code, this routine was used to build a string for use inside an "in" clause in a SQL select statement.

It's very easy to screw up this routine. You can easily get an off-by-one error on both the "range" call at the top of the loop and the condition that decides whether to add a comma in the middle of the loop.

The snippet above has a cyclomatic complexity of three.

It might seem hard to get the complexity below this value, but let's compare this with what we can do with map and reduce:

	def construct_list(items):
		if len(items) == 0:
			return ""

		string_list = map(lambda item: str(item), items)
		return reduce(lambda firstItem, secondItem: firstItem + "," + secondItem, string_list)

This routine has two phases. First, map is called: for each element in the list, it runs the function passed to it and collects the results into a new list. That list is returned when map completes and is assigned to string_list in my program. I've simply used map to convert all the items in the list to strings. In the "reduce" phase, we reduce the list down to a single value: reduce repeatedly applies the two-argument function to a running result and the next item in the list.
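As a concrete trace (the input values are my own), the two phases look like this:

	# map phase:    [1, 7, 42] becomes ["1", "7", "42"]
	# reduce phase: "1" + "," + "7" gives "1,7", then "1,7" + "," + "42" gives "1,7,42"
	print construct_list([1, 7, 42])	# prints 1,7,42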

This routine has a cyclomatic complexity of two, so we've made a saving of one over the first snippet. We could remove the if check at the top of the snippet, but doing so would mean that passing an empty list to the routine raises an error, because reduce can't handle an empty list when no initial value is supplied. One possible approach around this is to write a wrapper routine around reduce to handle empty lists. You could then handle all list loops using this new routine, with no further complexity cost.
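Here is one possible sketch of such a wrapper; the name "safe_reduce" and its default value argument are my own invention:

	def safe_reduce(function, items, empty_value=""):
		# Return a default rather than letting reduce raise on an empty list.
		if len(items) == 0:
			return empty_value
		return reduce(function, items)

	def construct_list(items):
		string_list = map(lambda item: str(item), items)
		return safe_reduce(lambda first, second: first + "," + second, string_list)

With the wrapper in place, construct_list itself drops to a cyclomatic complexity of one; the remaining branch lives in safe_reduce and is paid for only once.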

The clearest way to write this routine is probably to use recursion, but even this has a cyclomatic complexity of three:

	def construct_list_recursive(items):
		if len(items) == 0:
			return ""

		if len(items) == 1:
			return str(items[0])

		return construct_list_recursive(items[:-1]) + "," + str(items[-1])

It's clear that when the goal is to reduce cyclomatic complexity, it's hard to beat map/reduce.

Passing routines as arguments to other routines

So far, we've looked at reducing cyclometric complexity by focusing on Map, Reduce and Filter. However, these routines are just higher-order functions provided by the Python standard library. Even bigger savings can be made if you write your own higher-order functions.

Take, for example, code that looks like this:

	def insert_meta_data(data):
		title_field = "Title"
		if data.has_key(title_field) and len(data[title_field]) <= 30:
			insert_rowdata(title_field, data[title_field])

		artist_field = "Artist"
		if data.has_key(artist_field) and len(data[artist_field]) <= 50:
			insert_rowdata(artist_field, data[artist_field])

		release_date_field = "Release Date"
		if data.has_key(release_date_field) and valid_release_data(data[release_date_field]):
			insert_rowdata(release_date_field, data[release_date_field])

	..... The code continues for twenty such fields .....

The point I want to illustrate with the snippet above is that there may be different validation rules required for each field. When you have code like this, it precludes looping through some list of parameters and validating each one in turn, because each one has slightly different constraints.

Now imagine that we want to implement an update routine. There are two traditional ways to do this. The most obvious is to simply duplicate the checks in both routines. This clearly carries a massive cyclomatic complexity penalty and duplicates code, so that approach should be avoided at all costs.

The next is to rewrite the routine so that validation occurs before insertion. The validation step would produce a list of fields that are valid for insert/update, and both routines could share a call to this validation logic. Each body of code would then loop through the fields, adding or updating them as it goes. Each of these routines must have a minimum cyclomatic complexity of two, so you have a minimum complexity spend of four to implement it this way. It's also quite a bit more complicated, not in the cyclomatic sense, but in the mental sense: there's more to fit in your head with this design.
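For comparison, a rough sketch of that validate-then-loop design (every name here is invented for illustration, and only two fields are shown):

	def valid_fields(data):
		# One branch per field rule; each caller then adds its own loop.
		result = []
		if data.has_key("Title") and len(data["Title"]) <= 30:
			result.append("Title")
		if data.has_key("Artist") and len(data["Artist"]) <= 50:
			result.append("Artist")
		# ... and so on for the remaining fields ...
		return result

	def insert_meta_data(data):
		for field in valid_fields(data):
			insert_rowdata(field, data[field])

	def update_meta_data(data):
		for field in valid_fields(data):
			update_rowdata(field, data[field])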

However, there is a dramatically more powerful way to solve this problem. You could do this:

	
	def change_meta_data(data, update_routine):
		title_field = "Title"
		if data.has_key(title_field) and len(data[title_field]) <= 30:
			update_routine(title_field, data[title_field])

		artist_field = "Artist"
		if data.has_key(artist_field) and len(data[artist_field]) <= 50:
			update_routine(artist_field, data[artist_field])

		release_date_field = "Release Date"
		if data.has_key(release_date_field) and valid_release_data(data[release_date_field]):
			update_routine(release_date_field, data[release_date_field])

The second argument is a routine that will be executed within the body of the code. You now pass in whichever routine you want to run within change_meta_data, and it will either update or insert a row. Notice how the cyclomatic complexity of the change_meta_data routine has not gone up, even though the routine can now operate in two (or more!) distinct modes. The most obvious choice of update_routine would be something that inserts or updates a table row, but this doesn't have to be the case. For example, the update_routine could be inserting into or updating an in-memory cache.
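A usage sketch, where update_rowdata is a hypothetical counterpart to the insert_rowdata routine from earlier:

	change_meta_data(data, insert_rowdata)	# insert new rows
	change_meta_data(data, update_rowdata)	# update existing rows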

The point is, we can decide at run time what gets executed within this routine and we can get this ability to choose without having to introduce any additional complexity.

Conclusion

In this article, I wanted to give you a taste of how higher-order functions can help you control complexity. With a small amount of careful thought, it is possible to chop out a considerable amount of the cyclomatic complexity contained within your application.

I hope I've convinced you that higher-order functions are worth investigating in your project. They're a powerful tool that helps you write less code, and the less code you write, the less you have to debug and maintain.

2008-05-03 16:43:17 GMT | #Programming | Permalink