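Quick, what does the following code do?

    for i in range(n):
        for j in range(m):
            for k in range(l):
                temp_value = X[i][j][k] * 12.5
                new_array[i][j][k] = temp_value + 150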
It’s impossible to tell, right? If you were trying to modify or debug this code, you’d be at a loss unless you could read the author’s mind. Even if you were the author, a few days after writing this code you might not remember what it does because of the unhelpful variable names and use of magic numbers.

Working with data science code, I often see examples like the one above (or worse): code with variable names such as X, y, xs, x1, x2, tp, tn, clf, reg, xi, yi, ii and numerous unnamed constant values. To put it frankly, data scientists (myself included) are terrible at naming variables.
As I’ve grown from writing research-oriented data science code for one-off analyses to production-level code (at Cortex Building Intelligence), I’ve had to improve my programming by unlearning practices from data science books, courses and the lab. There are significant differences between deployable machine learning code and how data scientists learn to program, but we’ll start here by focusing on two common and easily fixable problems: unhelpful variable names and unnamed “magic” constant values.
Both of these problems contribute to the disconnect between data science research (or Kaggle projects) and production machine learning systems. Yes, you can get away with them in a Jupyter Notebook that runs once, but when you have mission-critical machine learning pipelines running hundreds of times per day with no errors, you have to write readable and understandable code. Fortunately, there are best practices from software engineering we data scientists can adopt, including the ones we’ll cover in this article.

Note: I’m focusing on Python since it’s by far the most widely used language in industry data science. Some Python-specific naming rules (PEP 8, the Python style guide, has more details) include lowercase names separated by underscores for variables and functions, and uppercase names separated by underscores for named constants.
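Naming Variables

There are three basic ideas to keep in mind when naming variables: the name should describe the information the variable represents; you should prioritize how easy the code is to read over how quickly you can write it; and you should adopt standard naming conventions so you can make one global decision instead of many local ones.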
What does this look like in practice? Let’s go through some improvements to variable names.

X and Y

If you’ve seen these several hundred times, you know they commonly refer to features and targets in a data science context, but that may not be obvious to other developers reading your code. Instead, use names that describe what these variables represent such as house_features and house_prices.

Value

What does the value represent? It could stand for velocity_mph, customers_served, efficiency or revenue_total. A name such as value tells you nothing about the purpose of the variable and just creates confusion.

Temp

Even if you are only using a variable as a temporary value store, still give it a meaningful name. Perhaps it is a value where you need to convert the units, so in that case, make it explicit:

    # Don't do this
    temp = get_house_price_in_usd(house_sqft, house_room_count)
    final_value = temp * usd_to_aud_conversion_rate

    # Do this instead
    house_price_in_usd = get_house_price_in_usd(house_sqft, house_room_count)
    house_price_in_aud = house_price_in_usd * usd_to_aud_conversion_rate

usd, aud, mph, kwh, sqft

If you’re using abbreviations like these, make sure you establish them ahead of time. Agree with the rest of your team on common abbreviations and write them down. Then, in code review, make sure to enforce these written standards.

tp, tn, fp, fn

Avoid machine learning-specific abbreviations. These values represent true_positives, true_negatives, false_positives and false_negatives, so make it explicit. Besides being hard to understand, the shorter variable names can be mistyped. It’s too easy to use tp when you meant tn, so write out the whole description.

The above are examples of prioritizing ease of reading code instead of how quickly you can write it. Reading, understanding, testing, modifying and debugging poorly written code takes far longer than well-written code.
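As a small illustration of writing out the confusion-matrix names in full, here is a minimal sketch. The function name, the example counts and the choice to return 0.0 when a denominator is zero are all assumptions made for this example, not something taken from a particular library.

```python
def compute_precision_and_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from confusion-matrix counts."""
    predicted_positive_count = true_positives + false_positives
    actual_positive_count = true_positives + false_negatives

    # Assumption for this sketch: report 0.0 when there is nothing to divide by.
    precision = (
        true_positives / predicted_positive_count if predicted_positive_count else 0.0
    )
    recall = true_positives / actual_positive_count if actual_positive_count else 0.0
    return precision, recall


# Made-up counts, used only to show the call with explicit keyword names.
precision, recall = compute_precision_and_recall(
    true_positives=40, false_positives=10, false_negatives=20
)
```

Calling the function with keyword arguments makes it much harder to swap two counts by accident than with positional values named tp, fp and fn.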
Overall, trying to write code faster by using shorter variable names will actually increase your program’s development and debugging time! If you don’t believe me, go back to some code you wrote six months ago and try to modify it. If you find yourself having to decipher your own past code, that’s an indication you should be concentrating on better naming conventions.

xs and ys

These are often used for plotting, in which case the values represent x_coordinates and y_coordinates. However, I’ve seen these names used for many other tasks, so avoid the confusion by using specific names that describe the purpose of the variables such as times and distances or temperatures and energy_in_kwh.

What Makes a Bad Variable Name?

Most problems with naming variables stem from a desire to keep variable names short and from a direct translation of formulas into code.
On the first point, while languages like Fortran did limit the length of variable names (to six characters), modern programming languages have no such practical restriction, so don’t feel forced to use contrived abbreviations. Don’t use overly long variable names either, but if you have to favor one side, aim for readability.

With regard to the second point, when you write an equation or use a model (and this is a point schools forget to emphasize), remember the letters or inputs represent real-world values!
Let’s see an example that makes both mistakes. Say we have a polynomial equation for finding the price of a house from a model. You may be tempted to write the mathematical formula directly in code:

    temp = m1 * x1 + m2 * (x2 ** 2)
    final = temp + b

This is code that looks like it was written by a machine for a machine. While a computer will ultimately run your code, it’ll be read by humans, so write code intended for humans!

To do this, we need to think not about the formula itself (the how) but about the real-world objects being modeled (the what). Let’s write out the complete equation (this is a good test to see if you understand the model):

    house_price = price_per_room * rooms + \
                  price_per_floor_squared * (floors ** 2)
    house_price = house_price + expected_mean_house_price

If you are having trouble naming your variables, it means you don’t know the model or your code well enough. We write code to solve real-world problems, and we need to understand the problem our model represents.
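To make the house price equation concrete, here is a minimal runnable sketch of the same model wrapped in a function. The coefficient values and the function name are hypothetical, chosen only so the example executes; they do not come from a fitted model.

```python
# Hypothetical coefficients, made up for illustration only.
PRICE_PER_ROOM = 20_000
PRICE_PER_FLOOR_SQUARED = 5_000
EXPECTED_MEAN_HOUSE_PRICE = 150_000


def estimate_house_price(room_count, floor_count):
    """Estimate a house price from room and floor counts using the polynomial model."""
    room_contribution = PRICE_PER_ROOM * room_count
    floor_contribution = PRICE_PER_FLOOR_SQUARED * (floor_count ** 2)
    return room_contribution + floor_contribution + EXPECTED_MEAN_HOUSE_PRICE


house_price = estimate_house_price(room_count=3, floor_count=2)
```

Each term now says which real-world quantity it contributes, so a reader can check the model against the problem without decoding m1, x1 or b.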
Descriptive variable names let you work at a higher level of abstraction than a formula, helping you focus on the problem domain.

Other Variable Naming Considerations

One of the important points to remember when naming variables is: consistency counts. Staying consistent with variable names means you spend less time worrying about naming and more time solving the problem. This point is relevant when you add aggregations to variable names.
Aggregations in Variable Names

So you’ve got the basic idea of using descriptive names, changing xs to distances, e to efficiency and v to velocity. Now, what happens when you take the average of velocity? Should this be average_velocity, velocity_mean, or velocity_average? Following these two rules will resolve this situation: first, decide on common abbreviations for aggregations (such as avg, min and max), and second, put the aggregation at the end of the name.
Following these rules, your set of aggregated variables might be velocity_avg, distance_avg, velocity_min, and distance_max. Rule two is a matter of personal choice, and if you disagree, that’s fine. Just make sure you consistently apply the rule you choose. A tricky point comes up when you have a variable representing the number of an item. You might be tempted to use building_num, but does that refer to the total number of buildings, or the specific index of a particular building?
To avoid ambiguity, use building_count to refer to the total number of buildings and building_index to refer to a specific building. You can adapt this to other problems such as item_count and item_index. If you don’t like count, then item_total is also a better choice than num. This approach resolves ambiguity and maintains the consistency of placing aggregations at the end of names.

Loop Indexes

For some unfortunate reason, typical loop variables have become i, j, and k. This may be the cause of more errors and frustration than any other practice in data science. Combine uninformative variable names with nested loops (I’ve seen nested loops that use ii, jj and even iii) and you have the perfect recipe for unreadable, error-prone code.

This may be controversial, but I never use i or any other single letter for loop variables, opting instead for describing what I’m iterating over, such as

    for building_index in range(building_count):
        ...

or

    for row_index in range(row_count):
        for column_index in range(column_count):
            ...

This is especially useful when you have nested loops so you don’t have to remember if i stands for row or column or if that was j or k. You want to spend your mental resources figuring out how to create the best model, not trying to figure out the specific order of array indexes.

(In Python, if you aren’t using a loop variable, then use _ as a placeholder. This way, you won’t get confused about whether or not the variable is used for indexing.)

Variable Names: Conventions to Avoid
All of these rules stick to the principle of prioritizing read-time understandability instead of write-time convenience. Coding is primarily a method for communicating with other programmers, so give your team members some help in making sense of your computer programs.

Never Use Magic Numbers

A magic number is a constant value without a variable name. I see these used for tasks like converting units, changing time intervals or adding an offset:

    final_value = unconverted_value * 1.61
    final_quantity = quantity / 60
    value_with_offset = value + 150

(These variable names are all bad, by the way!)

Magic numbers are a large source of errors and confusion because only the author knows what they represent, and changing the value means finding and editing every place it is used.
Instead of using magic numbers in this situation, we can define a function for conversions that accepts the unconverted value and the conversion rate as parameters:

    def convert_usd_to_aud(price_in_usd, usd_to_aud_conversion_rate):
        price_in_aud = price_in_usd * usd_to_aud_conversion_rate
        return price_in_aud

If we use the conversion rate throughout a program in many functions, we could define a named constant in a single location:

    USD_TO_AUD_CONVERSION_RATE = 1.61

    def convert_usd_to_aud(price_in_usd):
        price_in_aud = price_in_usd * USD_TO_AUD_CONVERSION_RATE
        return price_in_aud

(Remember, before we start the project, we should establish with our team that usd = US dollars and aud = Australian dollars. Standards matter!)

Here’s another example:

    # Conversion function approach
    def get_revolution_count(minutes_elapsed, revolutions_per_minute):
        revolution_count = minutes_elapsed * revolutions_per_minute
        return revolution_count

    # Named constant approach
    REVOLUTIONS_PER_MINUTE = 60

    def get_revolution_count(minutes_elapsed):
        revolution_count = minutes_elapsed * REVOLUTIONS_PER_MINUTE
        return revolution_count

Using a NAMED_CONSTANT defined in a single place makes changing the value easier and more consistent. If the conversion rate changes, you don’t need to hunt through your entire codebase to change all the occurrences, because you’ve defined it in only one location. It also tells anyone reading your code exactly what the constant represents. A function parameter is also an acceptable solution if the name describes what the parameter represents.

As a real-world example of the perils of magic numbers, in college, I worked on a research project with building energy data that initially came in 15-minute intervals. No one gave much thought to the possibility this could change, and we wrote hundreds of functions with the magic number 15 (or 96 for the number of daily observations). This worked fine until we started getting data in five and one-minute intervals.
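One way to avoid hard-coding 15 or 96 is to derive the daily observation count from a named interval parameter. The sketch below only illustrates that idea: the function name, the constant and the validation rule are hypothetical and are not taken from the project described here.

```python
MINUTES_PER_DAY = 24 * 60


def get_daily_observation_count(interval_minutes):
    """Return how many observations arrive per day for a given sampling interval."""
    # Assumption for this sketch: only intervals that evenly divide a day are valid.
    if interval_minutes <= 0 or MINUTES_PER_DAY % interval_minutes != 0:
        raise ValueError("interval_minutes must be positive and evenly divide one day")
    return MINUTES_PER_DAY // interval_minutes


# The same function covers every interval mentioned in the anecdote.
daily_observation_counts = {
    interval_minutes: get_daily_observation_count(interval_minutes)
    for interval_minutes in (15, 5, 1)
}
```

With the interval passed in as a parameter, a change in data frequency means changing one argument rather than hunting for every 15 and 96 in the codebase.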
We spent weeks changing all our functions to accept a parameter for the interval, but even so, we were still fighting errors caused by the use of magic numbers for months.

Real-world data has a habit of changing on you. Conversion rates between currencies fluctuate every minute and hard-coding in specific values means you’ll have to spend significant time re-writing your code and fixing errors. There is no place for magic in programming, even in data science.

The Importance of Standards and Conventions

The benefit of adopting standards is that they let you make a single global decision instead of many local ones. Instead of choosing where to put the aggregation every time you name a variable, make one decision at the start of the project, and apply it consistently throughout. The objective is to spend less time on concerns only peripherally related to data science (naming, formatting, style) and more time solving important problems (like using machine learning to address climate change).

If you are used to working by yourself, it might be hard to see the benefits of adopting standards. However, even when working alone, you can practice defining your own conventions and using them consistently. You’ll still get the benefits of fewer small decisions and it’s good practice for when you inevitably have to develop on a team. Anytime you have more than one programmer on a project, standards become a must!

You might disagree with some of the choices I’ve made in this article, and that’s fine! It’s more important to adopt a consistent set of standards than the exact choice of how many spaces to use or the maximum length of a variable name. The key point is to stop spending so much time on accidental difficulties and instead concentrate on the essential difficulties.
(Fred Brooks, author of the software engineering classic The Mythical Man-Month, has an excellent essay on how we’ve gone from addressing accidental problems in software engineering to concentrating on essential problems.)

Now let’s go back to the initial code we started with and fix it up.

    for i in range(n):
        for j in range(m):
            for k in range(l):
                temp_value = X[i][j][k] * 12.5
                new_array[i][j][k] = temp_value + 150

We’ll use descriptive variable names and named constants.

    PIXEL_NORMALIZATION_FACTOR = 12.5
    PIXEL_OFFSET_FACTOR = 150

    for row_index in range(row_count):
        for column_index in range(column_count):
            for color_channel_index in range(color_channel_count):
                normalized_pixel_value = (
                    original_pixel_array[row_index][column_index][color_channel_index]
                    * PIXEL_NORMALIZATION_FACTOR
                )
                transformed_pixel_array[row_index][column_index][color_channel_index] = (
                    normalized_pixel_value + PIXEL_OFFSET_FACTOR
                )

Now we can see that this code is normalizing the pixel values in an array and adding a constant offset to create a new array (ignore the inefficiency of the implementation!). When we give this code to our colleagues, they will be able to understand and modify it. Moreover, when we come back to the code to test it and fix our errors, we’ll know precisely what we were doing.

Clarifying your variable names may seem like a dry activity, but if you spend time reading about software engineering, you realize what differentiates the best programmers is the repeated practice of mundane techniques such as using good variable names, keeping routines short, testing every line of code, refactoring, etc. These are the techniques you need to take your code from research or exploration to production-ready and, once there, you’ll see how exciting it is for your data science models to influence real-life decisions.