## Wednesday, November 6, 2019

### Dealing with the log of zero in regression models

Author and guest post by Eren Ocakverdi

The title of this blog piece is a verbatim excerpt from the Bellego and Pape (2019) paper suggested by Professor David E. Giles in his October reading list. (Editor's note: Professor Giles has recently announced the end of his blog - it is a fantastic resource and will be missed!) The topic is immediately familiar to practitioners, who occasionally run into this difficulty in applied work. In this regard, it is reassuring that the frustration is being addressed and that there is indeed an ongoing quest for a silver bullet.

### Introduction

Consider the following data generating process, where the dependent variable may contain zeros: $$\log(y_i) = \alpha + x_i^\prime \beta + \epsilon_i \quad \text{with} \quad E(\epsilon_i)=0$$ The most common remedy for the log-of-zero problem among practitioners is to add a common (observation-independent) positive constant to every observation. In other words, to work with the model: $$\log(y_i + \Delta) = \alpha + x_i^\prime \beta + \omega_i$$ where $\Delta$ is the corrective constant.

In the aforementioned paper, the authors use Monte Carlo simulations to demonstrate that the bias incurred by this correction is not necessarily negligible for small values of $\Delta$, and in fact, may be substantial.

Figure 1: Estimation bias as a function of $\Delta$
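The original exercise is run in EViews; the sensitivity to $\Delta$ is also easy to reproduce in a few lines of Python. Here is a minimal sketch, assuming a Poisson outcome with conditional mean $\exp(\alpha + \beta x)$ so that zeros arise naturally (the data generating process below is illustrative, not the paper's exact design):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
alpha, beta = 0.0, 1.0

x = rng.normal(size=n)
# Poisson outcome with conditional mean exp(alpha + beta*x): zeros occur naturally
y = rng.poisson(np.exp(alpha + beta * x))

def ols_slope(delta):
    """Slope from regressing log(y + delta) on x (with an intercept)."""
    z = np.log(y + delta)
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    return coef[1]

for delta in (0.001, 0.1, 1.0):
    print(f"delta = {delta}: slope = {ols_slope(delta):.3f} (true beta = {beta})")
```

The estimated slope drifts with the choice of $\Delta$ even though the underlying $\beta$ never changes, which is precisely the arbitrariness the paper objects to.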

In order to handle the zeros in model variables, the paper offers a new (complementary) solution that:
1. Does not generate computational bias through arbitrary normalization.
2. Does not generate correlation between the error term and regressors.
3. Does not require the deletion of observations.
4. Does not require the estimation of a supplementary parameter.
5. Does not require the addition of a discretionary constant.

### A Novel Approach

Bellego and Pape (2019) suggest that instead of adding a common positive constant $\Delta$, one ought to add an optimal, observation-dependent positive value $\Delta_{i}$. This strategy yields the following model, which is estimated via GMM: $$\log(y_i + \Delta_{i}) = x_i^\prime \beta + \eta_{i}$$ where $\Delta_i = \exp(x_i^\prime \beta)$ and $\eta_i = \log(1 + \exp(\alpha + \epsilon_i))$. Note that the intercept $\alpha$ is absorbed into the transformed error term $\eta_i$.
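To see why this particular $\Delta_i$ works, note that adding $\exp(x_i^\prime \beta)$ lets the linear index factor out of the logarithm: $y_i + \Delta_i = \exp(x_i^\prime \beta)\,(\exp(\alpha + \epsilon_i) + 1)$ whenever $y_i = \exp(\alpha + x_i^\prime \beta + \epsilon_i)$. A quick numerical check of this identity in Python (parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
alpha, beta = 0.5, 1.0

x = rng.normal(size=n)
eps = rng.normal(size=n)
y = np.exp(alpha + beta * x + eps)        # strictly positive draws from the DGP

delta_i = np.exp(beta * x)                # observation-dependent shift Delta_i
lhs = np.log(y + delta_i) - beta * x      # log(y_i + Delta_i) minus the linear index
rhs = np.log(1.0 + np.exp(alpha + eps))   # eta_i = log(1 + exp(alpha + eps_i))
print(np.allclose(lhs, rhs))              # the shift folds alpha and eps into eta_i
```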

Since the details can be found in the original paper, here I'd like to replicate the simulation exercise in which the authors illustrate their method and compare it with other approaches. (The tables below can be reproduced in EViews by running the program file loglinear.prg.)

Figure 2: Output of OLS estimation (with $\Delta = 1$)

Figure 3: Output of Poisson pseudo-maximum likelihood (PPML) estimation

Figure 4: Output of proposed solution (GMM estimation)

Simulation results show that both the PPML and the GMM solutions provide correct estimates (i.e. $\alpha = 0$, $\beta_{1} = \beta_{2} = 1$), whereas the OLS results are biased by the addition of a common constant to all data points. Although $\alpha$ is not identified in the proposed solution, the authors suggest recovering it from the transformed residuals via OLS:

Figure 5: OLS estimation of the $\alpha$ parameter: $\log(\exp(\eta_i)-1)=\alpha+\epsilon_i$
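Because $E(\epsilon_i)=0$, this auxiliary regression on a constant is just the sample mean of $\log(\exp(\eta_i)-1)$. A small Python sketch of the recovery step, with an assumed true value of $\alpha$ (the value and error scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
alpha = 0.7                               # assumed true intercept (illustrative)

eps = rng.normal(scale=0.3, size=n)
eta = np.log(1.0 + np.exp(alpha + eps))   # residuals from the transformed regression

# Invert the transformation: log(exp(eta) - 1) = alpha + eps,
# so regressing on a constant amounts to taking the sample mean
alpha_hat = np.log(np.exp(eta) - 1.0).mean()
print(alpha_hat)
```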

When zeros are observed in both the dependent and independent variables, the authors suggest a functional coefficient model of the form: $$\log(y_i) = \alpha + \mathbb{1}_{x_i > 0}\times\log(x_i)\beta_{x_i>0}+\mathbb{1}_{x_i=0}\times\beta_{x_i=0}+\epsilon_i$$ Again, a simulation exercise is carried out to compare the estimated coefficients with different methods. (The tables below can be reproduced in EViews by running the program loglog.prg.)
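As a rough Python illustration (parameter values and the distribution of $x$ are made up for the sketch), the flexible specification amounts to building two regressors, $\mathbb{1}_{x_i>0}\log(x_i)$ and $\mathbb{1}_{x_i=0}$, and running a single regression:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
alpha, beta_pos, beta_zero = 0.0, 1.5, 0.5    # illustrative true values

# Regressor with a mass point at zero
x = rng.exponential(size=n)
x[rng.random(n) < 0.3] = 0.0
pos = x > 0

# 1{x>0} * log(x), with a safe log that leaves the zeros untouched
logx = np.where(pos, np.log(np.where(pos, x, 1.0)), 0.0)
eps = rng.normal(scale=0.1, size=n)
log_y = alpha + beta_pos * logx + beta_zero * (~pos) + eps

# Design matrix: intercept, 1{x>0}*log(x), 1{x=0}
X = np.column_stack([np.ones(n), logx, (~pos).astype(float)])
coef, *_ = np.linalg.lstsq(X, log_y, rcond=None)
print(coef.round(2))                          # approx [alpha, beta_pos, beta_zero]
```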

Figure 6: OLS estimation

Figure 7: PPML estimation

Figure 8: GMM estimation

Simulation results show that the suggested (flexible) formulation of the $\beta$ coefficients works well for all estimation methods (true values: $\alpha=0$ and $\beta = 1.5$).

### References

1. Bellego, C. and L.-D. Pape (2019). "Dealing with the log of zero in regression models." CREST Working Paper No. 2019-13.