Wednesday, March 07, 2007

Dummies for dummies

“Let us remember the unfortunate econometrician who, in one of the major functions of his system, had to use a proxy for risk and a dummy for sex.” Fritz Machlup (1974)

As a student I was really fascinated when I first came across dummy variables. Here was a way of incorporating qualitative effects into regression equations. For example, in examining the factors affecting the hourly earnings of individuals, as well as specifying potential influences that can be quantified (e.g. age, number of years of education, number of years of experience), you could include dummy variables to distinguish between dichotomous categories (that is, situations that fall into two groups). So, for example, you could have a gender dummy variable to look for differences between the earnings of male and female workers, all other factors having been accounted for. All you must do is assign the value 0 to the dummy for female workers and the value 1 for male workers, and then include the dummy variable in the regression along with all the other potential influences on earnings. The estimated coefficient of the dummy variable will measure the differential in earnings purely due to the worker being male. As well as measuring the difference, you can also use a standard t-test on the estimated coefficient to see if the measured differential is statistically significant.

We can see how this works algebraically and graphically if we focus on just one other regressor – years of experience.

If the relationship is assumed to be a linear one then we can write the model algebraically as Yi = b1 + b2 Xi + b3 Di + ui

The dummy variable Di is assigned the value 0 for all women in the sample and 1 for all men. The effect of this is to make the equation showing how women’s earnings are believed to be generated just Yi = b1 + b2 Xi + ui, while for men the equation becomes Yi = (b1 + b3) + b2 Xi + ui.

When you put in the value 0 for D the b3 Di term disappears, but when you put in the value 1 for D the term is just b3, which can be grouped together with the b1 term. Effectively the intercept becomes b1 + b3 for men. To put it another way, if b3 is positive there is a parallel upward shift in the regression line of b3 for men as compared with the base line that defines the relationship between earnings and experience for women.
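The intercept shift can be tried out on made-up data. The following is only a sketch: the sample is simulated, the parameter values (8.0, 0.25 and 1.5) are invented for illustration, and the regression is fitted with NumPy's least-squares routine rather than a dedicated econometrics package.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n = 200
experience = rng.uniform(0, 30, size=n)   # X: years of experience
male = rng.integers(0, 2, size=n)         # D: 0 = female (base group), 1 = male

# Hypothetical "true" parameters, chosen only for illustration.
b1, b2, b3 = 8.0, 0.25, 1.5
earnings = b1 + b2 * experience + b3 * male + rng.normal(0, 1, size=n)

# Design matrix: a column of ones (intercept), X and the dummy D.
X = np.column_stack([np.ones(n), experience, male])
coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)

# Conventional OLS standard errors and the t-ratio on the dummy's coefficient.
resid = earnings - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_dummy = coef[2] / se[2]

print("intercept for women (b1):   ", coef[0])
print("intercept for men (b1 + b3):", coef[0] + coef[2])
print("t-ratio on the dummy:       ", t_dummy)
```

The estimated coefficient on the dummy should land close to the value of b3 used to generate the data, and both groups share the same estimated slope on experience.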

Figure 1. The effect of a dummy variable shown graphically

As well as incorporating dummy variables to look for shifts in a regression line we can also use interactive dummies to test for differences in the slope parameters.

Figure 2. Testing for differences in intercepts and slopes with interactive dummies

The expanded model, Yi = b1 + b2 Xi + b3 Di + b4 Di Xi + ui, includes a regressor that is the product of the dummy variable and the X variable. So its value will be zero when D = 0 and X when D = 1. Effectively its parameter b4 will pick up the differential in slope between the regression lines for men and women. Figure 2 corresponds to a situation where initially men earn less per hour than women (the graph implies b3 is negative) but the additional payment received by men for every extra year of experience exceeds that given to women (so b4 is positive).
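Here is a small sketch of the interactive dummy at work, again on simulated data. The parameter values are hypothetical and were picked to mimic the situation just described: men start lower (a negative b3) but gain more per year of experience (a positive b4).

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility

n = 400
x = rng.uniform(0, 30, size=n)    # years of experience
d = rng.integers(0, 2, size=n)    # 0 = female (base group), 1 = male

# Hypothetical parameters, for illustration only.
b1, b2, b3, b4 = 10.0, 0.20, -2.0, 0.15
y = b1 + b2 * x + b3 * d + b4 * d * x + rng.normal(0, 1, size=n)

# The interactive dummy is simply the product D * X.
X = np.column_stack([np.ones(n), x, d, d * x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_women = coef[1]             # estimate of b2
slope_men = coef[1] + coef[3]     # estimate of b2 + b4
crossover = -coef[2] / coef[3]    # experience at which the two lines cross

print("slope for women:", slope_women)
print("slope for men:  ", slope_men)
print("lines cross at about", crossover, "years of experience")
```

With these made-up values the two lines cross at roughly 13 years of experience (2.0 / 0.15), after which men's predicted earnings are the higher of the two.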

Dummies in models based on time series data
Dummy variables can be very useful for picking up the effects of circumstances that apply to some time periods and not others when working with time series data. For example, suppose we are looking at the relationship between the sales of ice cream in a particular supermarket and the average temperature over a succession of days. Here we have a t subscript for our observations, Yt = b1 + b2 Xt + ut, and we would want to estimate the value of b2 so that, based on weather forecasts for the days ahead, we could predict the extra demand that might be expected as temperatures rise.

However it may be worth testing to see if demand is higher at weekends or on public holidays. If for the moment we just treat all such days as the same we could define just one dummy variable that takes the value 0 on ordinary week days but is assigned the value 1 for days that are at the weekend or correspond to public holidays. So the extended model becomes
Yt = b1 + b2 Xt + b3 Dt + ut and b3 would be used to measure the additional demand to be expected on high days and holidays.

If there are enough data points, separate dummy variables could be defined for Saturdays, Sundays and holidays. Within this less restrictive model each dummy would have a separately estimated parameter. It would also be possible to test whether the separate day effects are actually needed, by testing the restriction implicit in the model given above (that the three effects are equal) against the alternative of the more general model.
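One common way to carry out such a test is an F-test that compares the residual sums of squares of the restricted and general models. The sketch below assumes simulated daily data, invented holiday dates and invented parameter values; the helper `rss` is a hypothetical name introduced here, not something from a library.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

rng = np.random.default_rng(2)
n = 364                                    # 52 weeks of daily observations
day = np.arange(n) % 7                     # 0..4 = weekdays, 5 = Saturday, 6 = Sunday
sat = (day == 5).astype(float)
sun = (day == 6).astype(float)
hol = np.zeros(n)
hol[[0, 100, 150, 240, 301, 357]] = 1.0    # hypothetical public holidays (all on weekdays)
special = sat + sun + hol                  # the single combined dummy

temp = rng.uniform(10, 30, size=n)         # average daily temperature
# Simulated sales: here all three kinds of special day have the same effect.
sales = 50 + 4 * temp + 20 * special + rng.normal(0, 5, size=n)

ones = np.ones(n)
X_restricted = np.column_stack([ones, temp, special])
X_general = np.column_stack([ones, temp, sat, sun, hol])

rss_r = rss(X_restricted, sales)
rss_g = rss(X_general, sales)
q = 2                                      # restrictions: equal Saturday, Sunday and holiday effects
k = X_general.shape[1]                     # parameters in the general model
f_stat = ((rss_r - rss_g) / q) / (rss_g / (n - k))

print("F statistic:", f_stat)
```

The statistic would be compared with the critical value of an F distribution with q and n - k degrees of freedom (roughly 3.0 at the 5% level here). A value below that suggests the single combined dummy is adequate.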

Seasonal dummies
One common use of dummy variables with time series data is to allow for seasonal shifts in a relationship.

Suppose for example that you have quarterly data on energy consumption, the price of energy and consumers’ income for a number of years. You might specify a simple model, perhaps in log-linear rather than linear form, relating energy consumption to the price of energy and consumers’ income. (Perhaps both the price and income variables would be better expressed in real terms i.e. after adjustment for inflation rather than in current nominal prices – but we shall ignore that point here).

For a variety of reasons energy consumption may be higher in some quarters than others even after we have taken note of the values of price and consumers’ income (for example, it is colder and darker in the winter, leading to higher energy use). We could allow for these differences by incorporating three quarterly dummy variables, say for quarters 1, 2 and 3, leaving quarter 4 as the base period.

Notice that we don’t have a dummy for all four quarters. If we included all four alongside the intercept, the dummies would sum to the constant term and the parameters could not be separately estimated; this problem is sometimes referred to as the dummy variable trap. One quarter has to be kept as the base, just as with the gender dummy we didn’t have two dummy variables but chose one gender category (female) to be the base group. Incidentally, it doesn’t really matter which category is assigned to be the base group, although it may be convenient to specify the dummy as we did so that its coefficient takes a positive value.
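The reason one quarter has to be left out can be checked numerically. In this sketch (five years of made-up quarterly observations) a design matrix with an intercept and all four quarterly dummies has five columns but only rank four, because the four dummies add up to the column of ones.

```python
import numpy as np

quarters = np.tile([1, 2, 3, 4], 5)    # five years of quarterly observations
const = np.ones(len(quarters))         # the intercept column
dummies = {q: (quarters == q).astype(float) for q in (1, 2, 3, 4)}

# Intercept plus all four dummies: the trap.
X_trap = np.column_stack([const, dummies[1], dummies[2], dummies[3], dummies[4]])
# Intercept plus three dummies, quarter 4 left out as the base period.
X_ok = np.column_stack([const, dummies[1], dummies[2], dummies[3]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print("columns:", X_trap.shape[1], "rank:", rank_trap)   # 5 columns, rank 4
print("columns:", X_ok.shape[1], "rank:", rank_ok)       # 4 columns, rank 4
```

With a rank-deficient matrix the least-squares parameters are not uniquely determined, which is why regression software will either refuse to estimate the model or silently drop one of the dummies.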

So going back to our energy demand equation, with the dummies the model becomes

Log(Energy)t = b1 + b2 Log(Price)t + b3 Log(Income)t + d1 D1t + d2 D2t + d3 D3t + ut

Here D1 takes the value 1 for all first quarter observations, zero otherwise; D2 takes the value 1 for all second quarter observations, zero otherwise; D3 takes the value 1 for all third quarter observations, zero otherwise.

I have used the symbol d for the coefficients of the dummy variables to distinguish them from those of the measured variables.
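To see the seasonal model in action, here is a sketch that builds the three quarterly dummies and estimates the equation on simulated data. Everything numerical in it is an assumption made for illustration: the ten years of data, the price and income series, and the parameter values.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed for reproducibility

years = 10
n = 4 * years
quarter = np.tile([1, 2, 3, 4], years)

# Simulated regressors, already in logs.
log_price = rng.normal(0.0, 0.2, size=n)
log_income = rng.normal(1.0, 0.2, size=n)

# Quarterly dummies; quarter 4 is the base period and gets no dummy.
D1 = (quarter == 1).astype(float)
D2 = (quarter == 2).astype(float)
D3 = (quarter == 3).astype(float)

# Hypothetical parameters, chosen only for illustration.
b1, b2, b3 = 2.0, -0.5, 0.8
d1, d2, d3 = 0.10, -0.15, -0.20
log_energy = (b1 + b2 * log_price + b3 * log_income
              + d1 * D1 + d2 * D2 + d3 * D3
              + rng.normal(0, 0.02, size=n))

X = np.column_stack([np.ones(n), log_price, log_income, D1, D2, D3])
coef, *_ = np.linalg.lstsq(X, log_energy, rcond=None)
estimates = dict(zip(["b1", "b2", "b3", "d1", "d2", "d3"], coef))

for name, value in estimates.items():
    print(name, round(float(value), 3))
```

Each d coefficient measures the shift in the intercept for that quarter relative to quarter 4, the base period.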

Interpreting the coefficients of dummy variables in log-linear models
Some care is required in interpreting dummy variable coefficients in log-linear regression models. In the example we have just been looking at, suppose that the estimated value of d1 comes out as 0.01. This means that the intercept in the log-linear equation will be increased by 0.01 for all first quarter observations. But what does that mean for energy consumption itself? If Log(Energy) is up by 0.01 then Energy will be up by a factor of exp(0.01) = 1.01005 (assuming we have used logs to the base e, i.e. natural logarithms), which is an increase of about 1.005%. Because log-linear models imply that the underlying variables interact with each other in a multiplicative way, the shift in the log-linear equation implies a multiplicative effect in the un-logged version of the equation.
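The conversion from a dummy coefficient to a percentage effect is easy to compute. In this small sketch the function name `dummy_effect_percent` is made up for the illustration, and natural logarithms are assumed throughout.

```python
import math

def dummy_effect_percent(d):
    """Percentage change in the un-logged variable implied by a dummy coefficient d."""
    return (math.exp(d) - 1.0) * 100.0

for d in (0.01, 0.10, 0.50):
    rough = d * 100.0                 # reading the coefficient directly as a percentage
    exact = dummy_effect_percent(d)   # the exact multiplicative effect
    print(f"d = {d:.2f}: rough {rough:.1f}%, exact {exact:.3f}%")
```

For a small coefficient such as 0.01 the rough reading (1%) is almost the same as the exact figure, but the gap widens as the coefficient grows: a coefficient of 0.5 implies an increase of about 65%, not 50%.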

Impulse and step dummies
Dummy variables can be included in regression models based on time series data to account for special circumstances or events that affect an individual observation. An example could be a model looking at quarterly sales of poultry products. In the light of the avian flu scare at the Bernard Matthews factory in Norfolk last month, we might want to include an impulse dummy for the first quarter of 2007. An impulse dummy takes the value 1 for that one period and 0 for all others. Its use would be based on the assumption that the (negative) effect would disappear and things return to normal in the following quarter. If that isn’t the case a step dummy might be more suitable. A step dummy takes the value 0 for all periods before a particular event and then the value 1 for periods after that time. Effectively the intercept steps up (or down) after the event. An example might be a dummy variable to measure the effect of a ban on smoking in pubs on their revenue.
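Constructing the two kinds of dummy is straightforward. The sketch below uses a hypothetical run of quarters from 2006 to 2008 with the event placed in the first quarter of 2007. One assumption to flag: the step dummy is switched on from the event quarter itself onwards, whereas you might prefer to start it in the following period.

```python
# Quarterly periods as (year, quarter) pairs, 2006 Q1 to 2008 Q4.
periods = [(year, q) for year in (2006, 2007, 2008) for q in (1, 2, 3, 4)]
event = (2007, 1)   # the quarter in which the event occurs

# Impulse dummy: 1 in the event quarter only, 0 everywhere else.
impulse = [1 if p == event else 0 for p in periods]

# Step dummy: 0 before the event, 1 from the event quarter onwards
# (tuples compare year first, then quarter).
step = [1 if p >= event else 0 for p in periods]

for p, i, s in zip(periods, impulse, step):
    print(f"{p[0]} Q{p[1]}: impulse = {i}, step = {s}")
```

Either list can then be added as an extra column in the regression, exactly like the dummies discussed earlier.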

Dummy dependent variables (i.e. dummies on the left-hand side of regression equations)
Dummy variables can also appear on the left-hand side of regression equations. Limited dependent variable models of this sort can help to explain things like why some households have Internet access at home while others don’t or why some students succeed in passing a course while others don’t. This is interesting stuff but a whole new topic that we will look at another time.

Why don’t you look for interesting examples of the use of dummy variables in published work? Here are a few to be going on with: (1) the effect of computer ownership on college grade point average (Wooldridge, Introductory Econometrics, p 235); (2) the effect of physical attractiveness on wages (Hamermesh and Biddle, 1994), reported in Wooldridge, p 242; (3) the effect of satellite TV on football attendances (Allan, Applied Economics Letters, 2004).

Online material on dummy variables
[1] Introductory Econometrics, a textbook written by Humberto Barreto and Frank Howland, was published by Cambridge University Press in 2005. The authors have put the introductory section of each chapter online, together with Excel spreadsheets to illustrate their material. Take a look at their material for chapter 8 on dummy variables at http://caleb.wabash.edu/econometrics/EconometricsBook/chap8.htm
[2] Kelly Rathje and Christopher Bruce have a short contribution to the Expert Witness newsletter (Winter 2000) that illustrates the use of dummy variables. Go to http://www.economica.ca/ew54p3.htm.

Becon said...

Hiya
Given a model Y = B0 + B1X where Y is saving and X is income, we are asked to rewrite it in a general form to show that the savings behaviour of low income households is different from the savings behaviour of high income households (in other words, show that the marginal propensity to save depends on the level of income). Can one use a dummy variable in this case? Apparently using a dummy variable is not good enough because we have used it elsewhere to show the effect of gender for the same model.

8:05 AM
Guy said...

You need a non-linear model to allow the marginal propensity to save to vary over incomes. It is the derivative of the function, dY/dX, which is constant for a linear function.

You could try a quadratic function of the form
Y = B0+B1X+B2X^2 + u

Here dY/dX = B1+2B2X

7:47 AM
Mina Das said...

Sir now i think that i have need a teacher like you for training your are great.

11:28 AM
Clotilde Noémie Mahé said...

Hello, I am having some problems with the regressions i am running... first, i am assessing the potential impact of international migration, in stocks, on child mortality in developing country for the year 2000, controlling for several variables, such as income pc, literacy rate, etc. without logging the variables, both independent and dependent, the results do not mean anything, but logging them, because they are significantly skewed (none of the variables have passed the normality tests), i obtain some results a bit more meaningful (especially, the instruments i use become valid instruments). However, i do not understand how i should interpret the various coefficients, since the dependent variable is a log transformation of a variable in per thousand and some of the independent variables are log transformation of variables in percentage, or dummies that i haven't transformed in logarithm. I am a bit lost... thank you very much if you have any idea about it

5:16 AM
Alistair Zhao said...

Hi, I am thinking that if I have multiple dummies with multiple categories, such as age bands rather than a straight forward variable, am I still able to explore the interactions between say gender and age?
Thank you

10:31 AM
Bala Yusuf said...

Hi, I am a novice in regression using SPSS. I am doing a study to determine the January effect on stock returns. I have monthly data from 1990 to 2012. This data is in Excel. How do I run this data in SPSS and convert each month to dummies? Finally, how do I run the regression and interpret the results? I am indeed a dummy on this. I do need a step by step guide.

Thanks Guy

12:42 AM