PSID data set builder for R

Economists frequently use public datasets. One frequently used dataset is the Panel Study of Income Dynamics (PSID), maintained by the Institute for Social Research at the University of Michigan.

Here I'm introducing psidR, a small helper package for R that makes constructing panels from the PSID a bit easier.

One potential difficulty with the PSID is constructing a longitudinal dataset, i.e. one where individuals are followed over several survey waves. There are several solutions.

In the so-called data center, users can use drill-down menus to select relevant variables from each wave. If the user wants only recent waves, there is a subsetting mechanism (e.g. only household heads younger than 55). As the required dataset gets larger, this becomes unwieldy: the interface gets slower and slower, and the clicking procedure is rather error-prone. The main motivation for this package is that I've spent too many hours clicking on cryptic variable names, only to realize after I was done that I had forgotten a variable. Unacceptable.
Users may download the data and attempt to merge the annual interview files in order to obtain the desired panel. Though conceptually not very difficult (there is an individual index file, which provides a link for individuals across years), it is a cumbersome accounting exercise to find the right variable names from each year and do the right merges.
One can use psidR. The main function is inspired by the Stata add-on package psiduse. Here is the function's signature:


build.panel(datadir, fam.vars, ind.vars=NULL, fam.files=NULL, ind.file=NULL, heads.only=TRUE, core=TRUE, design="balanced", verbose=FALSE)

By default, the user only points the function to a data directory. Otherwise, one can specify custom locations for the family files and the individual index file.
You can supply the PSID data in Stata format or as csv files.
The user has to supply a data.frame "fam.vars" which lists the variable names for all required waves.
It's possible to tell the function that a certain variable is missing in a given year (without the variable getting dropped, so you can impute it later on).
One can subset the data to household heads only.
There is a switch to get only the core sample.
There are 3 different sample designs to choose from: balanced panel (all individuals must be present in all waves), k-period panel (individuals must be present for at least k periods) and unbalanced (all included).
With "verbose=TRUE", the function prints progress messages as it goes along.
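As a minimal sketch of the inputs described above, the snippet below builds a fam.vars data.frame for two waves and passes it to build.panel. The column layout (one row per wave, a year column, one column per variable), the variable codes, the use of NA to mark a variable that is missing in a wave, and the data directory are all assumptions for illustration. Check them against the package documentation and the PSID codebook. The call is guarded so that it only runs if psidR is installed and the directory exists.

```r
# Assumed layout: one row per wave, one column per variable of interest.
# The variable codes are illustrative; verify them in the PSID codebook.
fam.vars <- data.frame(
  year  = c(2001, 2003),
  age   = c("ER17013", "ER21017"),
  educ  = c("ER20457", "ER24148"),
  money = c("ER20456", NA),   # assumption: NA marks a variable missing in that wave
  stringsAsFactors = FALSE
)

# Hypothetical location of the downloaded PSID files.
datadir <- "~/datasets/PSID"

# Only attempt the build if the package and the data are actually available.
if (requireNamespace("psidR", quietly = TRUE) && dir.exists(datadir)) {
  panel <- psidR::build.panel(
    datadir    = datadir,
    fam.vars   = fam.vars,
    heads.only = TRUE,        # household heads only
    design     = "balanced"   # individuals present in all requested waves
  )
  print(head(panel))
}
```

Only arguments that appear in the signature above are used. Other versions of the package may name or default them differently.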

An issue could be memory. The dataset is quite big. I use data.table to keep things manageable, but it's hard to get around a data.table of 628MB, which is the size of the individual index file. The verbose option prints memory load at various points, so you may be able to intervene and throw out some things if you hit a limit.
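For checking memory by hand, a rough sketch in base R could look like the following. It is not what the package does internally. The object names and the helper size_mb are made up for illustration, and the data are random numbers standing in for a large intermediate table.

```r
# Toy stand-in for a large intermediate object (not PSID data).
big <- data.frame(id = seq_len(1e5), x = rnorm(1e5))

# Hypothetical helper: size of an object in megabytes.
size_mb <- function(obj) as.numeric(object.size(obj)) / 1024^2

cat(sprintf("big:   %.2f MB\n", size_mb(big)))

# Keep only the rows and columns that are still needed ...
small <- big[big$x > 0, "id", drop = FALSE]

# ... then drop the large object and ask R to release the memory.
rm(big)
invisible(gc())

cat(sprintf("small: %.2f MB\n", size_mb(small)))
```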

UPDATE: Following up on a comment below, I used another data source, ONS JOBS02, for the labor market statistics. I report the findings below.

I read a series of articles about the goings-on in the UK housing market, the likely effects of the new Help To Buy scheme, the 10% increase in the mean London house price over the last year, and employment statistics. I failed to reproduce some numbers cited in The Economist (below). This post is about that.

It all starts with this blog post on The Economist:

http://www.economist.com/blogs/buttonwood/2013/09/house-prices

It talks about many things, amongst which employment and housing completions, and how the UK seems likely to be embarking on another round of debt-fueled growth. I combined house completions and employment in construction and real estate industry into a single plot against time.

In the bottom panel, I find the increase in jobs in the real estate agents sector quite dazzling. In the top panel, you can see completed houses over time, split by who actually built them. I think HA (Housing Associations) and LA (local authorities) can be viewed as "public".

The Economist blog has a quote from a report by Fathom Consulting:

The real estate sector accounts for almost of quarter of all the jobs created in the UK over the year to June. The rise in real estate employment in the latest quarter is the strongest on record. Over the past year the number of real estate jobs has risen by 77,000, the number of construction jobs is 1,000 higher, manufacturing is 14,000 lower. The number of real estate jobs is now at a record high – 100,000 more than at the peak of the boom in the summer of 2008. The number of construction jobs is more than 300,000 lower than its peak.

And so, the UK is going from a nation of shopkeepers to one of real estate agents. I had a look at this report by Fathom on the likely impact of Help To Buy (very interesting, but no model description; see below), but couldn't find the one on the employment numbers cited above. The Economist says it's in a "splendid piece of research last week", but I couldn't find anything resembling that. Probably my fault. Or not.

By the way, the same citation appears in The Telegraph.

I was unable to reproduce those numbers. My numbers are from the ONS employment statistics, table JOBS03, UK total. Everything is linked and documented in my code.

My R code is in this GitHub repo, in a file called UKjobs.r.

It would be interesting to know why our numbers are so different. I find an increase of "only" 28,000 real estate agent employees in the last year, as opposed to Fathom's 77,000. The gap is even larger for the change in that sector relative to summer 2008, which Fathom puts at 100,000: I find a much more modest increase of 21,000 jobs in estate agents. This graph plots the difference between employment and its summer 2008 level for both estate agents and construction.

There is no doubt that regardless of the eventual magnitude, this trend is striking.

I have got to say, though, that the way The Economist and Fathom put those numbers out there is strange. No source, no code, nothing.

Notice that I'm not trying to say that Fathom Consulting juke their numbers or don't do proper work (impossible to tell). Most likely they just defined "Real Estate Sector" in a different way (I just took the column headed "Real Estate Sector"), or we used different data sets, or whatever.

But without knowing all of this, how am I to judge this information? Of course I couldn't read their report (it is only for paying customers, and presumably contains all of those details), but The Economist uses the numbers without any further qualification.

I think that in general it would be very useful to have this metadata in a section on any consultancy's website, particularly if they get cited in the media. I can see that they want to sell their work to customers, but if they want to participate in the public discussion, which is clearly in their interest, this stuff must be verifiable.

UPDATE:

I have been told that the difference may arise from using table JOBS02 instead of JOBS03. As far as I can tell, the tables differ in the number of categories they have (JOBS03 has much greater detail). Both follow the SIC 2007 industry classification. JOBS02 is called "Workforce Jobs by Industry (seasonally adjusted)", JOBS03 is called "Employee jobs by industry".

I used ONS table JOBS02 to see whether I could get any closer to the 77,000 increase in real estate agent jobs reported by Fathom over the last year. Here is the data (d.realest is the quarterly difference in real estate jobs, in thousands):

         date d.realest d.construct
1: 2012-03-01         2         -12
2: 2012-06-01         0         -42
3: 2012-09-01        13         -11
4: 2012-12-01         0         -11
5: 2013-03-01        13          11

This is a total of +28,000 over the last year. With regard to changes relative to summer 2008, the JOBS02 table produces the following graph, where we see that as of 2013-03-01 there are 39,000 more real estate agents than on 2008-06-01. I'm clueless as to how Fathom could get 100,000 instead of this number.
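The total for real estate can be checked directly from the quarterly differences in the table above. The sketch below re-enters those five rows by hand in a base R data.frame and sums them. It assumes the figures are in thousands of jobs, which is what the reported total implies. The names total.realest, total.construct and cum.realest are only for illustration.

```r
# Quarterly differences copied from the table above (assumed: thousands of jobs).
jobs <- data.frame(
  date        = as.Date(c("2012-03-01", "2012-06-01", "2012-09-01",
                          "2012-12-01", "2013-03-01")),
  d.realest   = c(2, 0, 13, 0, 13),
  d.construct = c(-12, -42, -11, -11, 11)
)

# Summing the quarterly changes gives the change over the whole period.
total.realest   <- sum(jobs$d.realest)
total.construct <- sum(jobs$d.construct)

# Running total, to see when the real estate increase happened.
jobs$cum.realest <- cumsum(jobs$d.realest)

print(jobs)
cat("real estate:", total.realest, "thousand; construction:",
    total.construct, "thousand\n")
```

The real estate column sums to 28, matching the figure in the text, while the construction column sums to -65 over the same rows.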
