Gretl Manual: Gnu Regression, Econometrics and Time-series Library
Chapter 5. Sub-sampling a dataset
By "restricting" the sample we mean selecting observations on the basis of some Boolean (logical) criterion, or by means of a random number generator. This is likely to be most relevant for cross-sectional or panel data.
Suppose we have data on a cross-section of individuals, recording their gender, income and other characteristics. We wish to select for analysis only the women. If we have a gender dummy variable with value 1 for men and 0 for women we could do
smpl gender=0 --restrict
to this effect. Or suppose we want to restrict the sample to respondents with incomes over $50,000. Then we could use
smpl income>50000 --restrict
A question arises here. If we issue the two commands above in sequence, what do we end up with in our sub-sample: all cases with income over 50000, or just women with income over 50000? By default, in a gretl script, the answer is the latter: women with income over 50000. The second restriction augments the first. If you want a new restriction to be applied independently of any existing restrictions you should first recreate the full dataset using
smpl full
Alternatively, you can add the --replace option to the smpl command:
smpl income>50000 --restrict --replace
This option has the effect of automatically re-establishing the full dataset before applying the new restriction.
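The difference between cumulative and independent restrictions can be sketched as a short script. This is only an illustration: the file name survey.gdt is hypothetical, and the series gender and income are assumed to be coded as in the example above. The commands use the same syntax as those shown in this section.

```gretl
# hypothetical data file containing the series gender and income
open survey.gdt

# select the women only
smpl gender=0 --restrict

# a second restriction is cumulative by default:
# the sub-sample is now women with income over 50000
smpl income>50000 --restrict

# with --replace the earlier restriction is discarded:
# the sub-sample is now all cases with income over 50000
smpl income>50000 --restrict --replace

# restore the complete dataset
smpl full
```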
Unlike a simple "setting" of the sample, "restricting" the sample may result in selection of non-contiguous observations from the full data set. It may also change the structure of the data set.
This can be seen in the case of panel data. Say we have a panel of five firms (indexed by the variable firm) observed in each of several years (identified by the variable year). Then the restriction
smpl year=1995 --restrict
produces a dataset that is not a panel, but a cross-section for the year 1995. Similarly
smpl firm=3 --restrict
produces a time-series dataset for firm number 3.
For these reasons (possible non-contiguity in the observations, possible change in the structure of the data), gretl acts differently when you "restrict" the sample as opposed to simply "setting" it. In the case of setting, the program merely records the starting and ending observations and uses these as parameters to the various commands calling for the estimation of models, the computation of statistics, and so on. In the case of restriction, the program makes a reduced copy of the dataset and by default treats this reduced copy as a simple, undated cross-section. If you wish to re-impose a time-series or panel interpretation of the reduced dataset you can do so using setobs (and panel if appropriate).
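As a minimal sketch of re-imposing a time-series interpretation, suppose the firms panel described above is already open and its annual observations start in 1990. The start year is an assumption made purely for illustration; substitute the first year of your own data.

```gretl
# assumes the five-firm panel is already open, with annual
# observations starting in 1990 (hypothetical start year)

# keep only firm 3; the reduced copy is treated as an
# undated cross-section by default
smpl firm=3 --restrict

# declare the reduced dataset to be annual time series
# (periodicity 1) starting in 1990
setobs 1 1990 --time-series
```

The two arguments to setobs are the data frequency and the starting observation, so the values given must match the observations that actually remain after the restriction.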
The fact that "restricting" the sample results in the creation of a reduced copy of the original dataset may raise an issue when the dataset is very large (say, several thousands of observations). With such a dataset in memory, the creation of a copy may lead to a situation where the computer runs low on memory for calculating regression results. You can work around this as follows:
1. Open the full data set, and impose the sample restriction.
2. Save a copy of the reduced data set to disk.
3. Close the full dataset and open the reduced one.
4. Proceed with your analysis.
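In script form the steps above might look roughly as follows. The file names bigdata.gdt and reduced.gdt are hypothetical, and the restriction on income is borrowed from the earlier example. The sketch assumes that store writes out only the observations in the current sub-sample and that opening a second file replaces the dataset held in memory.

```gretl
# step 1: open the full dataset (hypothetical file name)
# and impose the sample restriction
open bigdata.gdt
smpl income>50000 --restrict

# step 2: save a copy of the reduced dataset to disk
store reduced.gdt

# step 3: open the reduced dataset in place of the full one
open reduced.gdt

# step 4: proceed with the analysis on the reduced dataset
summary
```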