Thursday, March 29, 2007

Get Real!

Many applied econometric models based on aggregate time series data make use of expenditure, income or GDP series: consumption functions, demand functions and even production functions, for example. It is important to recognise the need to use series that have been expressed at constant prices (relative to a suitable base year) rather than at the current prices ruling in each individual year. That is, the series should be in real rather than nominal terms.

For example if you are modelling the total use of energy in the UK you may want to include real GDP as one of your explanatory variables, to indicate the overall level of economic activity in the economy. Using current price GDP figures would overstate the growth of real economic activity as it would include increases due to inflation as well as those due to actual economic activity. You will need a measure of GDP that has been deflated by dividing through by an appropriate measure of inflation – in this case the GDP deflator.
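
If you are working in something like Python, the deflation step is just a division. Here is a minimal sketch using pandas, with made-up figures purely for illustration (the series names and values are not real data):

```python
import pandas as pd

# Illustrative (made-up) figures: nominal GDP in £ million and a GDP deflator
# index with 2003 = 100. Real GDP at constant 2003 prices is the nominal
# series divided by the deflator, re-scaled by 100.
data = pd.DataFrame(
    {"gdp_nominal": [1_139_746, 1_202_956, 1_254_058],
     "gdp_deflator": [100.0, 102.5, 104.6]},
    index=[2003, 2004, 2005],
)
data["gdp_real_2003"] = data["gdp_nominal"] / data["gdp_deflator"] * 100
print(data)
```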

Often suitably deflated series are readily available in published form. For example the series with the code ABMI - UK Gross Domestic Product in £ million at constant 2003 prices - can now be downloaded directly from the UK National Statistics website. Go to http://www.statistics.gov.uk/statbase/tsdtimezone.asp?vlnk=pn2, select Blue Book, select UK national and domestic product, and then pick out the ABMI series.

Sometimes you will have to find an appropriate price deflator for yourself to adjust a current price series for inflation. Here you will have to make sure that you are using the appropriate price deflator as there are many series that track inflation. The CPI (consumer price index) would be appropriate if you were looking at overall consumer expenditure but there will be occasions when some other price series such as the GDP deflator would be appropriate.

Implicit price series
The availability of expenditure series in both current and constant price form means that you have the possibility of recovering an implicit price series for the expenditure category. For example, if you have the two series FOODEXP and FOODEXP2005, where FOODEXP measures total expenditure on food each year at current prices while FOODEXP2005 is total expenditure on food at constant (2005) prices, then an index of food prices would be PFOOD = FOODEXP/FOODEXP2005 (or you might like to multiply this by 100 so that the base year value of the series is 100 rather than 1). This of course gives you a measure of the nominal price of food rather than an index of the price of food in real terms. To get that you will have to divide your nominal price series by an overall price index for the economy.
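
As a sketch of the calculation (the series names follow the example above; the numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical food expenditure series (£ billion): current prices and
# constant (2005) prices.
food = pd.DataFrame(
    {"FOODEXP": [78.2, 82.9, 88.4],
     "FOODEXP2005": [80.1, 82.9, 84.7]},
    index=[2004, 2005, 2006],
)

# Implicit (nominal) food price index, scaled so that the 2005 base year = 100.
food["PFOOD"] = food["FOODEXP"] / food["FOODEXP2005"] * 100

# Real (relative) food price: the nominal index deflated by an overall price
# index for the economy (here an invented GDP deflator with 2005 = 100).
gdp_deflator = pd.Series([97.5, 100.0, 102.8], index=food.index)
food["PFOOD_REAL"] = food["PFOOD"] / gdp_deflator * 100
print(food)
```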

Real interest rates
Another area where you might need to think about real rather than nominal values is with interest rates. Real interest rates are nominal (or market) interest rates minus the rate of inflation. If the rate of inflation in a country is quite high, then nominal interest rates would also have to be high in order to provide a real return for lenders.
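
A quick sketch of the calculation. The text uses the simple approximation (nominal minus inflation); the exact version based on the Fisher relation is included for comparison, and the function names are just illustrative:

```python
# Real interest rates from a nominal rate and an inflation rate, both given as
# decimals (e.g. 0.05 for 5%).
def real_rate_approx(nominal, inflation):
    # the approximation used in the text: real = nominal - inflation
    return nominal - inflation

def real_rate_exact(nominal, inflation):
    # exact Fisher relation: (1 + real) = (1 + nominal) / (1 + inflation)
    return (1 + nominal) / (1 + inflation) - 1

print(real_rate_approx(0.05, 0.03))  # 0.02
print(real_rate_exact(0.05, 0.03))   # roughly 0.0194
```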

You will also have to be careful to compute other series in real terms. For example you might need real wages, the real money supply or even a measure of real exchange rates. Make sure that you know how to do this!

Monday, March 26, 2007

How do you spell "heteroskedasticity"?

Econometrics is full of long and difficult words: stochastic, kurtosis, multicollinearity, autocorrelation and - perhaps worst of all - heteroskedasticity.

You may like to know that the correct spelling of heteroskedasticity was actually the subject of a one-page article in the journal Econometrica. In the March 1985 issue J. Huston McCulloch argued that there should be a k in the middle of the word and not a c. His argument was that the word had been constructed in English directly from Greek roots rather than coming into the English language indirectly via the French.

The earliest use that McCulloch could find for either heteroskedasticity or heteroscedasticity was in a 1923 statistics text by Truman L. Kelley. However John Aldrich, contributing to a website on the earliest known uses of some words in mathematics, states that 'The terms heteroscedasticity and homoscedasticity were introduced in 1905 by Karl Pearson in "On the general theory of skew correlation and non-linear regression," Drapers' Company Res. Mem. (Biometric Ser.) II. Pearson wrote, "If ... all arrays are equally scattered about their means, I shall speak of the system as a homoscedastic system, otherwise it is a heteroscedastic system." The words derive from the Greek skedastos (capable of being scattered).'

One might ask the question as to why is it that econometricians, statisticians and other scientists use Greek words such as this as labels for these concepts when ordinary English words might do just as well. In part the answer might be to avoid confusion with the ordinary English usage of a word. For example the word "investment" when used by an economist carries a very specific meaning that might not be apparent even to a well-educated non-economist. Perhaps at least having a special word for the concept makes a reader check exactly what it means rather than making an assumption that it must mean what he thinks it does.

Of course what can happen is that these special words can then make it into ordinary language. In case you think that would be impossible for the word "heteroskedasticity", I invite you to read the paragraph about the distribution of rabbits around the UK in the Smallweed column of the Guardian for 23rd July 2005. "Have these brainboxes never heard of the concept of heteroscedasticity?" Yes, we have, but you spell it with a k and not a c!

References
[1] J. Huston McCulloch. On Heteros*edasticity. Econometrica, Vol. 53, No. 2 (March 1985), p. 483.
[2] John Aldrich. Earliest Known Uses of Some of the Words of Mathematics (H). Accessed March 2007.
[3] David McKie. Armageddon isn't upon us. The Guardian, 31st August 2006.
[4] Smallweed column. The Guardian, 23rd July 2005.

The Power of Logs

When I was at school (in the 1960s!) we were always using logarithms (or “logs” for short). We used our “log tables” to help us multiply together unpleasant looking numbers that we couldn’t very easily work with using just pen, paper and brain. If we had two nasty looking numbers to multiply together we would look up their logarithms (to the base 10), add together these logarithms and then finally take the anti-logarithm of this sum of logs to give us our answer. Once you know what logarithms are you will quickly see how this method works. You can use a similar method to divide one number by another only in this case you subtract the logs instead of adding them.

Today's generation of school children have no need to use log tables for this kind of calculation. They can use a calculator (perhaps even one built into a mobile phone) or maybe a computer for unpleasant calculations. But students of economics would still be advised to find out something about logarithms, as they might come across them in a variety of situations: when using a logarithmic transformation of a power function equation, for example, or even when meeting a logarithmic equation itself. In the first situation you are basically making use of the properties of logarithms to turn a multiplicative function in the original variables into a linear (additive) function of the logarithms. Back to that later when we have had a look at the basic concept of a logarithm.

Put bluntly, the logarithm of a number is the power that a base must be raised to in order to get the number. You can see that we first have to think about what we mean by the base. In theory you could use any positive number as the base but in practice we usually work with one of two numbers: either the number 10 or the exponential constant known as e (it has a value approximately = 2.71828). Let’s come back to e later and stick with the base 10 for now.

From the definition we can see that the logarithm of the number 100, to the base 10, is 2. Why? Because if we raise the base 10 to the power 2, we get the number 100. Think of another simple example. What is the logarithm of the number 0.1 to the base 10? Answer? -1, because 10 raised to the power minus 1 = 1/10 = 0.1

Notice that the log of 1 to the base 10 is 0. Because 10 to the power 0 = 1.
In fact the log of 1 is zero whatever base we work with. Any positive number raised to the power zero is 1. But we are getting ahead of ourselves. Let’s stay with the base 10 for now.

Let's see how logs to the base 10 were used in log tables. First someone has to produce a set of tables solving the equation 10^x = number for x, for lots of different numbers. For example, the logarithm of 2 to the base 10 is (to four significant figures) 0.3010. [I looked up this logarithm so many times in my youth that it is imprinted in my brain – I don't even need to refer to a set of log tables to get the result!] Similarly the log of 3 to the base 10 is 0.4771. So if I add these two logarithms together I should get the logarithm of the number 6 – because the log of a product is the sum of the logarithms. Now 0.3010 + 0.4771 = 0.7781. Actually when I look up the log of 6 to the base 10 I get 0.7782. (Using only four significant figure approximations has caused a small approximation error.) Rather than searching through the log tables to find the number that has a log of 0.7781, we were able to make use of the anti-logarithm table in which the results were set out in the other direction. That is, the tables were constructed to give you the number corresponding to any particular logarithm that you had calculated. Doing it this way round I would find that the anti-log of 0.7781 comes out as 5.999 – another approximation error due to rounding. Before you start smiling at this too much, remember that even using a calculator or a computer there may be some rounding involved – although you will get much more than four significant figures.
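
You can replay the log-table calculation in a few lines of Python – a small sketch, rounding the logs to four figures just as the old tables did:

```python
import math

# Multiply 2 by 3 "the log-table way": add the base-10 logs, then take the
# anti-log (10 raised to that sum).
log2 = round(math.log10(2), 4)    # 0.3010, the value burnt into my memory
log3 = round(math.log10(3), 4)    # 0.4771
total = log2 + log3               # 0.7781
print(total, 10 ** total)         # about 5.999 - the rounding error noted above
print(round(math.log10(6), 4))    # 0.7782
```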

The Scottish mathematician John Napier (1550-1617) is generally credited with the invention or discovery of logarithms – although apparently they were known about in eighth century India (see Smoller (2001) or Alfeld (1997) for more details).

Let's think for a minute about how this all works. Take any base b (> 0).
If we know that b^u = x and b^v = y then simple algebra tells us that xy = b^(u+v). When we multiply two separate powers of b together we just add these powers. The insight in developing logarithms was to see that we could turn hard calculations (multiplication and division) into easier calculations (addition and subtraction) by providing a set of u and v values to go with the set of x and y values that could then be reused time and time again.

In analytical (as opposed to computational) work there may be advantages in working with logs to the base e. This is because the exponential function y = e^x has the special property that its derivative at any point is equal to the function itself – that is, dy/dx = y for all values of x. If you plot the graph of the function, the slope of the curve is always the same as the value of the function itself. This means that the derivative of the inverse function, the logarithmic function y = ln x, will be 1/x. [Logs to the base e are written as ln x – the ln is short for "natural" logarithm.] This is very convenient. Logarithmic functions can be useful in themselves in economics as they have the property that as x goes up y goes up, but at a declining rate – something that we expect to find in a whole range of economic relationships such as production functions and utility functions.
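
If you want to convince yourself of the 1/x result numerically, here is a tiny sketch using a central difference approximation to the derivative:

```python
import math

# Numerical check that the derivative of ln(x) is 1/x: compare a central
# difference approximation with 1/x at a few points.
h = 1e-6
for x in [0.5, 1.0, 2.0, 10.0]:
    numeric = (math.log(x + h) - math.log(x - h)) / (2 * h)
    print(x, numeric, 1 / x)
```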

But for us today it is the logarithmic transformation that is of most interest.

Suppose that we think that two variables are related by a power function equation – say Q = AP^b (maybe here Q is quantity demanded, P is price, A is just a constant of proportionality whose value will depend on the units of measurement for P and Q, and b is negative so as to ensure an inverse relation between the variables). If this is true then the graph showing the relationship between P and Q is a downward sloping (non-linear) curve. In the special case where b = -1 we have a rectangular hyperbola, with the graph totally symmetrical around the 45 degree line, but with other values of b the graph will approach one of the axes more steeply than the other.

But if we plot the logarithm of Q against the logarithm of P we will see a downward sloping straight line with gradient b (remember b is negative). This is because the first rule of logarithms is that log(AB) = log A + log B (sticking to the same base throughout). So log Q = log A + log(P^b). From the second rule of logarithms, log(P^b) = b log P. And since log A is just another constant if A is a constant, we have found that log Q = a constant plus b times log P. The power function is linear in the logarithms – or as we sometimes say for short – it is log-linear.

In regression analysis of course we don't expect all our observations to fit exactly on a straight line (or a curve). But if, when we plot the logs of one variable against the logs of another, we get points clustered around a straight line, then it suggests that the underlying variables are linked by a power function equation – and the power in that equation can be estimated from the slope of the line linking the logs.
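
Here is a rough sketch of that idea in Python: simulate data from a power function (with an invented constant A = 50 and power b = -1.5, plus a little multiplicative noise) and recover b as the slope of the regression of log Q on log P.

```python
import numpy as np

# Simulate Q = A * P**b with multiplicative noise, then regress log(Q) on
# log(P): the fitted slope estimates the power b and the intercept gives log(A).
rng = np.random.default_rng(42)
A, b = 50.0, -1.5
P = rng.uniform(1.0, 10.0, size=200)
Q = A * P**b * np.exp(rng.normal(0.0, 0.1, size=200))

slope, intercept = np.polyfit(np.log(P), np.log(Q), 1)
print("estimated b:", slope)              # should be close to -1.5
print("estimated A:", np.exp(intercept))  # should be close to 50
```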

References
[1] Laura Smoller. John Napier and Logarithms. UALR, March 2001.
[2] Peter Alfeld. What on earth is a logarithm? University of Utah, 1997.

Wednesday, March 07, 2007

Dummies for dummies

“Let us remember the unfortunate econometrician who, in one of the major functions of his system, had to use a proxy for risk and a dummy for sex.” Fritz Machlup (1974)

As a student I was really fascinated when I first came across dummy variables. Here was a way of incorporating qualitative effects into regression equations. For example, in examining the factors affecting the hourly earnings of individuals, as well as specifying potential influences that can be quantified (e.g. age, number of years of education, number of years' experience, etc.) you could include dummy variables to distinguish between dichotomous categories (that is, situations that fall into two groups). So for example you could have a gender dummy variable to look for differences between the earnings of male and female workers, all other factors having been accounted for. All you have to do is assign the value 0 to the dummy for female workers and the value 1 for male workers and then include the dummy variable in the regression along with all the other potential influences on earnings. The estimated coefficient of the dummy variable will measure the differential in earnings purely due to the worker being male. Of course as well as measuring the difference you can also use a standard t-test on the estimated coefficient to see if the measured differential is actually significant.

We can see how this works algebraically and graphically if we focus on just one other regressor – years of experience.

If the relationship is assumed to be a linear one then we can write the model algebraically as Yi = b1 + b2Xi + b3Di + ui

The dummy variable Di is assigned the value 0 for all women in the sample and 1 for all men. The effect of this is to make the equation showing how women's earnings are believed to be generated just Yi = b1 + b2Xi + ui, while for men the equation becomes Yi = (b1 + b3) + b2Xi + ui.

When you put in the value 0 for D the b3Di term disappears, but when you put in the value 1 for D the term is just b3, which can be grouped together with the b1 term. Effectively the intercept becomes b1 + b3 for men – or to put it another way there is a parallel upward shift in the regression line of b3 for men as compared with the base line that defines the relationship between earnings and experience for women.

Figure 1. The effect of a dummy variable shown graphically
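
To make this concrete, here is a minimal sketch using simulated earnings data and the statsmodels package (the variable names and the "true" parameter values are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: the dummy D (here 'male') = 1 for men, 0 for women. The
# "true" differential b3 is set to 2.0 purely for illustration.
rng = np.random.default_rng(0)
n = 500
experience = rng.uniform(0, 30, n)
male = rng.integers(0, 2, n)
earnings = 5.0 + 0.3 * experience + 2.0 * male + rng.normal(0, 1.5, n)
df = pd.DataFrame({"earnings": earnings, "experience": experience, "male": male})

# The coefficient on 'male' estimates the intercept shift (b3); its t-statistic
# tests whether the differential is significant.
fit = smf.ols("earnings ~ experience + male", data=df).fit()
print(fit.summary())
```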

As well as incorporating dummy variables to look for shifts in a regression line we can also use interactive dummies to test for differences in the slope parameters.

Figure 2. Testing for differences in intercepts and slopes with interactive dummies

The expanded model includes a regressor that is the product of the dummy variable and the X variable. So its value will be zero when D = 0 and X when D = 1. Effectively its parameter b4 will pick up the differential in slope between the regression lines for men and women. Figure 2 corresponds to a situation where initially men earn less per hour than women (the graph implies b3 is negative) but the additional payments received by men for every extra year of experience exceed those given to women.
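
A sketch of the interactive-dummy version, again with simulated data. In the statsmodels formula the interaction is written as male:experience, and the invented parameter values mimic the Figure 2 story – a negative intercept shift and a positive slope differential:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative "truth": men start lower (b3 < 0) but gain more per year of
# experience (b4 > 0).
rng = np.random.default_rng(1)
n = 500
experience = rng.uniform(0, 30, n)
male = rng.integers(0, 2, n)
earnings = (6.0 + 0.25 * experience - 1.5 * male
            + 0.15 * male * experience + rng.normal(0, 1.5, n))
df = pd.DataFrame({"earnings": earnings, "experience": experience, "male": male})

# male:experience is zero for women and equal to experience for men, so its
# coefficient picks up the slope differential b4.
fit = smf.ols("earnings ~ experience + male + male:experience", data=df).fit()
print(fit.params)
```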

Dummies in models based on time series data
Dummy variables can be very useful to pick up the effects of circumstances that only apply to some time periods and not others when working with time series data. For example, suppose we are looking at the relationship between the sales of ice cream in a particular supermarket and the average temperature over a succession of days. Here we have a t subscript for our observations, Yt = b1 + b2Xt + ut, and we would want to estimate the value of b2 so that, based on weather forecasts for the days ahead, we could predict the extra demand that might be expected as temperatures rise.

However it may be worth testing to see if demand is higher at weekends or on public holidays. If for the moment we just treat all such days as the same, we could define just one dummy variable that takes the value 0 on ordinary weekdays but is assigned the value 1 for days that fall at the weekend or correspond to public holidays. So the extended model becomes
Yt = b1 + b2Xt + b3Dt + ut and b3 would be used to measure the additional demand to be expected on high days and holidays.

If there are enough data points, separate dummy variables could be defined for Saturdays, Sundays and holidays. Within this less restrictive model each dummy would have a separately estimated parameter. But it would also be possible to test whether the different day effects are actually needed by seeing if the restriction implicit in the model given above could be accepted against the alternative of the more general model, as sketched below.
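
One possible sketch of that test, with invented daily data: fit the unrestricted model with separate Saturday, Sunday and holiday dummies and then use an F-test of the restriction that the three coefficients are equal (which is what the single weekend/holiday dummy imposes).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented daily data: sales depend on temperature plus separate Saturday,
# Sunday and holiday effects (the holidays chosen here fall on weekdays).
rng = np.random.default_rng(2)
n = 365
temp = rng.normal(15, 6, n)
day = np.arange(n) % 7
sat = (day == 5).astype(int)
sun = (day == 6).astype(int)
hol = np.zeros(n, dtype=int)
hol[[0, 99, 358]] = 1
sales = 20 + 1.5 * temp + 8 * sat + 10 * sun + 12 * hol + rng.normal(0, 3, n)
df = pd.DataFrame({"sales": sales, "temp": temp, "sat": sat, "sun": sun, "hol": hol})

fit = smf.ols("sales ~ temp + sat + sun + hol", data=df).fit()
# F-test of the restriction that the three day effects are the same.
print(fit.f_test("sat = sun, sun = hol"))
```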

Seasonal dummies
One common use of dummy variables with time series data is to allow for seasonal shifts in a relationship.

Suppose for example that you have quarterly data on energy consumption, the price of energy and consumers’ income for a number of years. You might specify a simple model, perhaps in log-linear rather than linear form, relating energy consumption to the price of energy and consumers’ income. (Perhaps both the price and income variables would be better expressed in real terms i.e. after adjustment for inflation rather than in current nominal prices – but we shall ignore that point here).

For a variety of reasons energy consumption may be higher in some quarters than others even after we have taken note of the values of price and consumers’ income (for example it is colder and darker in the winter leading to higher energy use). We could allow for these differences by incorporating three quarterly dummy variables, say for quarters 1,2 and 3, leaving quarter 4 as the base period.

Notice that we don't have a dummy for all four quarters. Including a dummy for every quarter alongside the intercept would lead to perfect multicollinearity – the mistake sometimes referred to as the dummy variable trap. One quarter has to be kept as the base, just as with the gender dummy we didn't have two dummy variables but chose one gender category – female – to be the base group. Incidentally, it doesn't really matter which category is chosen as the base group, although it may be convenient to specify the dummy as we did so that its coefficient takes a positive value.

So going back to our energy demand equation, with the dummies the model becomes

Log(Energy)t = b1 + b2 Log(Price)t + b3 Log(Income)t + d1 D1t + d2 D2t + d3 D3t + ut

Here D1 takes the value 1 for all first quarter observations, zero otherwise; D2 takes the value 1 for all second quarter observations, zero otherwise; D3 takes the value 1 for all third quarter observations, zero otherwise.

I have used the symbol d for the coefficients of the dummy variables to distinguish them from those of the measured variables.
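
Here is a sketch of how the seasonal dummies might be set up and estimated, with invented quarterly data (the parameter values and series are purely illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented quarterly data: log energy demand depends on log price, log income
# and quarterly shifts (quarter 4 is the base period).
rng = np.random.default_rng(3)
n = 60                                  # 15 years of quarterly observations
quarter = np.tile([1, 2, 3, 4], n // 4)
price = 100 * np.exp(np.cumsum(rng.normal(0.005, 0.02, n)))
income = 100 * np.exp(np.cumsum(rng.normal(0.006, 0.01, n)))
seasonal = np.select([quarter == 1, quarter == 2, quarter == 3],
                     [0.10, 0.02, -0.05], 0.0)
log_energy = (2.0 - 0.3 * np.log(price) + 0.8 * np.log(income)
              + seasonal + rng.normal(0, 0.02, n))

df = pd.DataFrame({"log_energy": log_energy,
                   "log_price": np.log(price),
                   "log_income": np.log(income),
                   "D1": (quarter == 1).astype(int),
                   "D2": (quarter == 2).astype(int),
                   "D3": (quarter == 3).astype(int)})

fit = smf.ols("log_energy ~ log_price + log_income + D1 + D2 + D3", data=df).fit()
print(fit.params)   # d1, d2, d3 measure the shifts relative to quarter 4
```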

Interpreting the coefficients of dummy variables in log-linear models
Some care is required in interpreting dummy variable coefficients in log-linear regression models. In the example we have just been looking at, suppose that the estimated value of d1 comes out as 0.01. This means that the intercept in the log-linear equation will be increased by 0.01 for all first quarter observations. But what does that mean for energy consumption itself? If Log(Energy) is up by 0.01 then Energy will be up by a factor of exp(0.01) = 1.01005 (assuming we have used logs to the base e – natural logarithms). Because log-linear models imply that the underlying variables interact with each other in a multiplicative way, the shift in the log-linear equation implies a multiplicative effect in the un-logged version of the equation.

Impulse and step dummies
Dummy variables can be included in regression models based on time series data to account for special circumstances or events that affect an individual observation. An example could be a model looking at quarterly sales of poultry products. In the light of the avian flu scare at the Bernard Matthews factory in Norfolk last month we might want to include an impulse dummy for the first quarter of 2007. The use of an impulse dummy would be based on the assumption that the (negative) effect would disappear and things would return to normal in the following quarter. If that isn't the case a step dummy might be more suitable. A step dummy takes the value 0 for all periods before a particular event and then the value 1 for periods after that time. Effectively the intercept steps up (or down) after the event. An example might be a dummy variable to measure the effect of a ban on smoking in pubs on their revenue.
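
Constructing the two kinds of dummy is straightforward. A sketch with pandas (the event dates used here are just illustrative placeholders):

```python
import pandas as pd

# An impulse dummy is 1 in a single period and 0 elsewhere; a step dummy is 0
# before an event and 1 from the event onwards.
quarters = pd.period_range("2005Q1", "2008Q4", freq="Q")
impulse = (quarters == pd.Period("2007Q1", freq="Q")).astype(int)  # one-off effect
step = (quarters >= pd.Period("2007Q3", freq="Q")).astype(int)     # permanent shift
print(pd.DataFrame({"impulse": impulse, "step": step}, index=quarters))
```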

Dummy dependent variables (i.e. dummies on the left-hand side of regression equations)
Dummy variables can also appear on the left-hand side of regression equations. Limited dependent variable models of this sort can help to explain things like why some households have Internet access at home while others don’t or why some students succeed in passing a course while others don’t. This is interesting stuff but a whole new topic that we will look at another time.

Why don't you look for interesting examples of the use of dummy variables in published work? Here are a few to be going on with: (1) the effect of computer ownership on college grade point average (Wooldridge, Introductory Econometrics, p. 235); (2) the effect of physical attractiveness on wages (Hamermesh and Biddle, 1994), reported in Wooldridge, p. 242; (3) the effect of satellite TV on football attendances (Allan, Applied Economics Letters, 2004).

Online material on dummy variables
[1] Introductory Econometrics, a textbook written by Humberto Barreto and Frank Howland, was published by Cambridge University Press in 2005. The authors have put the introductory section of each chapter online, together with Excel spreadsheets to illustrate their material. Take a look at their material for chapter 8 on dummy variables at http://caleb.wabash.edu/econometrics/EconometricsBook/chap8.htm
[2] Kelly Rathje and Christopher Bruce have a short contribution to the Expert Witness newsletter (Winter 2000) that illustrates the use of dummy variables. Go to http://www.economica.ca/ew54p3.htm.

Tuesday, March 06, 2007

Notation, notation, notation

If you are new to econometrics you may well get confused or even irritated by the variations in the notation used by different text book authors. Most (but not all) authors use the Greek letter beta to represent the unknown parameter that is the coefficient of an independent variable in a regression equation, with a subscript to indicate the variable that it is associated with. But different authors accommodate the constant intercept in such equations in different ways. Some give a subscript zero to this first beta, continuing with subscripts 1 to k for the betas linked to the X variables (which have matching subscripts 1 to k). Kennedy, Stock and Watson, and Wooldridge all adopt this form of notation. But an alternative approach is to give a subscript 1 to the constant intercept with the other betas then following on with subscripts 2 to k. In this approach the variable X1 just consists of a column of constant values (=1). This is the convention followed by Dougherty, Greene, Gujarati (Basic Econometrics), Hill, Griffiths and Judge, and Pindyck & Rubinfeld. Yet another variant is for the constant intercept to be labelled alpha, with the beta coefficients numbered from 1 to k (Maddala).

These differences are relatively minor, but when we move on to choice of symbols for the least squares estimates to go with these parameters there is a further lack of consistency. Kennedy, Maddala, Pindyck & Rubinfeld, Stock and Watson, Wooldridge and Gujarati (Basic Econometrics) put a "hat" over the Greek letter to show that we have an estimate (or estimator) of the parameter rather than its unknown value. But Dougherty, Greene, Hill, Griffiths and Judge and Gujarati in his other book (Essential Econometrics) instead use the equivalent Roman letter b for each of the betas.

And then we have the disturbances and their estimated equivalents, the residuals. Of the ten textbooks that I examined six use the letter u to denote the (unobservable) disturbance in the regression equation but three of the others use the Greek letter epsilon. One book (Hill, Griffiths and Judge) uses an ordinary e for the disturbance (or error) term. Now that can be confusing because two of the authors that use u as the disturbance have e to stand for the associated residual, as does one of the authors (Greene) who has epsilon for the disturbance. Hill, Griffiths and Judge put a hat over the e to denote the residual while Gujarati (Basic Econometrics), Maddala, Stock & Watson, and Wooldridge put a hat on the u to denote the residual. Pindyck & Rubinfeld put a hat on the epsilon that they use for the disturbance when they want to indicate the residual that goes with it. You can find a table showing the different symbols used by the various authors on my Introduction to Econometrics website at Portsmouth.

What is to be done about this? Why can't all these authors agree on a common system of notation? Taking the second question first, I guess that each would argue that there are advantages of working with the particular convention that they adopt. There is certainly a logic to each of the choices made by the different authors, but it does make it difficult for a student who consults more than one textbook as he tries to get to grips with the subject. You might advise him to stick to just one textbook until he is confident enough about the meaning of the various symbols to recognise a slightly different label being used elsewhere. But I have never wanted to recommend just a single textbook for the courses that I teach. Different types of exposition suit different students. Some want a formal presentation and can handle the proofs and derivations that go with it. Others need a more intuitive approach with lots of examples and illustrations. And I find that some authors give a better exposition on one topic (perhaps autocorrelation) but not such a good one on another (multicollinearity perhaps). So students can benefit by reading more than one account. In any case at some point they will have to get to grips with the different notational systems used in the journal articles that they must read, so it might be better to face up to this sooner rather than later.

So my point is...? Let's face up to the fact that there are different notational conventions, look at each of them and compare them explicitly, and thereby enable students to become flexible enough to switch from one to another as the situation requires. It may also help students gain a better understanding of the underlying concepts if they have to think more carefully about what they are reading or writing.

Friday, March 02, 2007

Degrees of freedom and cowboy econometrics

In the glossary at the end of his “A Guide to Econometrics” text (Fifth Edition, 2003, p. 545) Peter Kennedy defines degrees of freedom as “...the number of free or linearly independent sample observations used in the calculation of a statistic”. In regression models we often see the number of degrees of freedom defined as “the number of observations minus the number of parameters to be estimated” – so for simple bivariate regression models that means n - 2 (there are two parameters here: one fixing the slope of the line relating the variables and one fixing the intercept on the vertical axis).

Take an extreme case where you only have two observations on (X,Y). You have no freedom in fitting the line at all as there is only one line that can be chosen to connect the two points (see Figure 1).

Anybody undertaking any serious applied work would be advised to ensure that they have a great many more observations than 2. Not only do these extra observations provide some freedom in estimating the equation of the line, they also ensure that the 95% confidence intervals for the parameters are not too wide. The confidence interval will be the point estimate plus or minus the product of the standard error of the parameter estimate and the t-value leaving two and a half percent of the distribution in each tail (where we look up t with the appropriate number of degrees of freedom). A glance at the t-tables shows that the t-values fall as the number of degrees of freedom increases. For example t(0.025; 10) = 2.228 while t(0.025; 60) = 2.000.
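
You can see the critical values falling towards the normal value of about 1.96 with a couple of lines of Python (a small sketch using scipy):

```python
from scipy import stats

# Two-tailed 5% critical values of the t-distribution for various degrees of
# freedom: t(0.025; 10) = 2.228, t(0.025; 60) = 2.000, approaching 1.96.
for df in [5, 10, 30, 60, 120]:
    print(df, round(stats.t.ppf(0.975, df), 3))
```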

When assessing the statistical significance of a variable in a regression (or more correctly of its accompanying parameter) the calculated t value must be compared with the critical value from the tables. So for example if the calculated t-value was say 3.5 then we would be able to reject the null hypothesis that the parameter is zero and accept the alternative hypothesis. [If we have a strong a priori view about the sign of the parameter, as predicted by theory, we might use a one-tailed test which would put all the 5% significance level area into one tail and thus pick out a smaller critical value that has to be exceeded for the decision to be taken to reject the null.] Your computer software might also produce a figure for the P-value or probability value linked to the calculated statistic. This measures the area beyond the calculated value, in the tail(s). This provides an alternative way for you to decide whether to reject the null or not. You simply compare the P-value with 0.05 (i.e. 5%). If the P-value is < 0.05 then you can reject the null.
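
If your software doesn't report it, the P-value is easy to compute yourself. A sketch for the two-tailed case, using the calculated t-value of 3.5 mentioned above and an arbitrary 25 degrees of freedom:

```python
from scipy import stats

# Two-tailed P-value for a calculated t-statistic of 3.5 with 25 degrees of
# freedom; reject the null at the 5% level if the P-value is below 0.05.
t_calc, df = 3.5, 25
p_value = 2 * stats.t.sf(abs(t_calc), df)
print(p_value, p_value < 0.05)
```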

When I was a student the computer software was not that sophisticated and you definitely had to use the t-tables. A friend of mine on the same course never had his tables with him when he was doing the practical exercises set by the lecturer and was famous for saying “Just check if it's bigger than 2”. His reasoning was that whatever degrees of freedom might be appropriate, the t-value you got from the table was always approximately 2. See the examples I mentioned above and also Figure 2, which shows the tabulated t-values at 5% and 2½% for various degrees of freedom.

Of course if you do this you will be slightly misrepresenting the actual significance level of the test. I called my friend's approach to the subject “cowboy econometrics” (an analogy with “cowboy builders”, like the ones who worked on my house and didn't properly measure the doors they were fitting – they shut OK but they don't fit snugly, so I get a draught under the gap at the bottom).

These days there really is no excuse not to do the tests properly. And there are also some very nice online Java applets that will calculate either the probability value to go with any t-value (for a given number of degrees of freedom) or the t-value to go with a specified P-value. See for example the one produced by R. Webster West of the Department of Statistics at Texas A&M University, from which I have taken the following screen grab.

Figure 3. Screen grab of the t-distribution applet graphic

So it turns out that they are not all cowboys in Texas!