  1. UPDATE: Following up on a comment below, I used another data source, ONS table JOBS02, for the labor market statistics. I report the findings below.


    I read a series of articles about the goings-on in the UK housing market, the likely effects of the new Help To Buy scheme, the 10% increase in the mean London house price over the last year, and employment statistics. I failed to reproduce some numbers cited in The Economist (below). This post talks about that.

    It all starts with this blog post at The Economist:

    http://www.economist.com/blogs/buttonwood/2013/09/house-prices

    It talks about many things, amongst them employment and housing completions, and how the UK seems likely to be embarking on another round of debt-fueled growth. I combined house completions and employment in the construction and real estate industries into a single plot against time.

    In the bottom panel, I find the increase in jobs in the real estate agent sector quite dazzling. In the top panel, you can see completed houses over time, split by who actually built them. I think HA (housing associations) and LA (local authorities) can be viewed as "public".




    The Economist blog has a quote from a report by Fathom Consulting:

    The real estate sector accounts for almost a quarter of all the jobs created in the UK over the year to June. The rise in real estate employment in the latest quarter is the strongest on record. Over the past year the number of real estate jobs has risen by 77,000, the number of construction jobs is 1,000 higher, manufacturing is 14,000 lower. The number of real estate jobs is now at a record high – 100,000 more than at the peak of the boom in the summer of 2008. The number of construction jobs is more than 300,000 lower than its peak.
    And so, the UK is going from a nation of shopkeepers to one of real estate agents. I had a look at this report by Fathom on the likely impact of Help To Buy (very interesting: no model description, see below), but couldn't find the one behind the employment numbers cited above. The Economist says they come from a "splendid piece of research last week", but I couldn't find anything resembling that. Probably my fault. Or not.

    By the way, the same citation appears in the Telegraph.

    I was unable to reproduce those numbers. My numbers come from ONS employment statistics, table JOBS03, UK total. Everything is linked and documented in my code.

    My R code is in this github repo, in a file called UKjobs.r.

    It would be interesting to know why our numbers are so different. I find an increase of "only" 28,000 real estate agent employees over the last year, as opposed to Fathom's 77,000. Even worse is my figure for the change in that sector relative to summer 2008, which Fathom puts at 100,000: I find a much more modest increase of 21,000 estate agent jobs. This graph plots the difference between current employment and employment in summer 2008, for both estate agents and construction.


    There is no doubt that, regardless of the eventual magnitude, this trend is striking.

    I have got to say, though, that the way The Economist and Fathom put those numbers out there is strange. No source, no code, nothing.
    Notice that I'm not trying to say that Fathom Consulting juke their numbers or don't do proper work (impossible to tell); most likely they just defined the "real estate sector" in a different way (I just took the column headed "Real Estate Sector"), or we used different datasets, or whatever.

    But without knowing all of this, how am I to judge this information? Of course I couldn't read their report (it is only for paying customers, and presumably contains all of those details), but The Economist uses the numbers without any further qualification.

    I think that, in general, it would be very useful to have this metadata in a section on any consultancy's website, particularly if they get cited in the media. I can see that they want to sell their work to customers, but if they want to participate in the public discussion, which is clearly in their interest, this stuff must be verifiable.

    UPDATE:
    I have been told that the difference may arise from using table JOBS02 instead of JOBS03. As far as I can tell, the tables differ in the number of categories they cover (JOBS03 has much greater detail). Both follow the SIC 2007 industry classification. JOBS02 is called "Workforce Jobs by Industry (seasonally adjusted)"; JOBS03 is called "Employee jobs by industry".

    I used ONS table JOBS02 to see whether I could get any closer to the 77,000 increase in real estate agent jobs reported by Fathom over the last year. Here is the data (d.realest is the quarterly change in real estate jobs, d.construct the same for construction; figures in thousands):

             date  d.realest  d.construct
    1: 2012-03-01          2          -12
    2: 2012-06-01          0          -42
    3: 2012-09-01         13          -11
    4: 2012-12-01          0          -11
    5: 2013-03-01         13           11

    This is a total of +28,000 over the last year. With regard to changes relative to summer 2008, the JOBS02 table produces the following graph, where we see that as of 2013-03-01 we have 39,000 more real estate agents than at 2008-06-01. I'm clueless as to how Fathom could get 100,000 instead of this number.
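    For reference, here is a minimal sketch of that computation in R with data.table. It assumes a CSV extract of JOBS02 with columns date, realest and construct (these names are my own placeholders, not the ONS headers), in thousands of jobs:

    library(data.table)
    jobs <- fread("jobs02.csv")                      # hypothetical extract of JOBS02
    jobs[, date := as.IDate(date)]
    jobs[, d.realest   := c(NA, diff(realest))]      # quarterly change, in thousands
    jobs[, d.construct := c(NA, diff(construct))]
    jobs[date >= as.IDate("2012-03-01"), sum(d.realest)]   # the +28 quoted above
    jobs[, rel2008 := realest - realest[date == as.IDate("2008-06-01")]]  # vs summer 2008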



  2. For lack of a better place, I'll store my recipe for homemade pizza dough here. This will make dough for 14 people.

    • 2kg strong white flour (no self-raising or other extras)
    • 4 sachets of yeast, 7g each (not the super fast bicarbonate stuff)
    • Salt
    • Sugar
    • Olive Oil
    • Water
    The main problem is getting the right consistency, i.e. how much water to add. You'll have to experiment here. Just start with obviously too little water and add more (be careful: you'll quickly have put in too much water, and then you need to top up with flour again; it goes round and round).

    I usually mix all the ingredients in a big bowl before adding the water, i.e. the flour, the yeast, 2 tablespoons each of salt and sugar, and approx. 5 tablespoons of oil. Only then do I add the water. An alternative would be to dissolve the yeast in a bit of water first and add it like that to the flour; I should try that out.

    Then you work the dough for at least 10 minutes. Nothing should stick to your hands (too wet), and if you rip the dough in two pieces you don't want to find any remaining parts that are not well mixed with the rest (dry flour, etc.). It shouldn't be too chewy, as you'll otherwise find it hard to spread later on.

    Switch on the oven for 3 minutes, just to make it a bit warm. Put the dough in a big bowl dusted with flour, put the bowl in the oven, and leave it to rest for 2 hours; the longer the better. The warm environment in the oven makes the yeast grow faster.


  3. I've got some questions regarding Austria's decision to pull its soldiers out of the UN mission on the Golan Heights; maybe someone out there has a clue.

    The Austrian contingent was 300 out of 1,000 soldiers.


    How come the other countries involved did not conclude that the UN mandate no longer justified the increased danger? According to Austrian chancellor Faymann, exactly that, the increased danger, is the main reason for the decision.


    It seems odd that one country pulls out of an international mission, leaving the mission endangered. It is also a bit surprising that a country would sign up to a mission to enforce a cease-fire only to pack up and leave once it comes under attack. Or is this not a good description of what happened?

    On Austrian TV it was alleged that the whole thing is due to the upcoming elections. If that's the main motivation, this is a poor show altogether.

    Who is taking over?

    It seems Fiji will step in.

    Fiji?

    Fiji has a GDP per capita of 4,193 USD, population 0.8 million.
    Austria's GDP per capita is 49,687 USD, population 8.4 million.


  4. Inspired by the Institute for Fiscal Studies' "Where do you fit in" application, where people can find out their position in the UK's income distribution, I wanted to find out what the picture looks like for London. Quite different. If you are in a very high percentile nationwide, the high incomes of mainly financial-sector employees in London mean that you find yourself a couple of ranks further down. That's my guess anyway, but I think there is little reason to doubt where the big salaries are earned.
    The data are not equivalized, i.e. they do not take into account the number of earners and dependents in a household.

    Here's the plot. The code and info on how to get the data are available on github. The red lines mark the median (0.5 on the y axis) and the corresponding income value, i.e. the median household income is somewhere in the £30-35K bracket (it's £33,430).
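    To make the median reading concrete, here is a minimal sketch of how one locates the median bracket from bracketed counts; the brackets and household counts below are made-up placeholders, not the actual data:

    library(data.table)
    inc <- data.table(bracket    = c("0-10K", "10-20K", "20-30K", "30-40K", "40K+"),
                      households = c(10, 25, 30, 20, 15))   # placeholder counts
    inc[, cdf := cumsum(households) / sum(households)]      # empirical CDF over brackets
    inc[cdf >= 0.5][1]   # the first bracket whose cumulative share passes 0.5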



  5. I just pushed the most recent version of the PSID panel data builder psidR, introduced a little while ago. I got some user feedback and made some improvements. The package is hosted on github.

    News:
    • I added a reproducible example using artificial data, which you can run by calling 'example(build.panel)'. This means you can try out the package before bothering to download anything, and it provides a simple test of the main function.
    • I've included a suggestion to use the R survey package to analyse this dataset and made it explicit in the examples how to obtain the desired weights for each wave. Note that in the majority of cases your results are invalid if you ignore the survey design (i.e. the weights); see the sketch after this list.
    • I got some useful comments from Anthony Damico (thanks!) and integrated his SAScii package (check out his tutorials at http://www.asdfree.com/). This allows one to download the data directly from the PSID server into R, removing any dependency on Stata or SAS to preprocess the raw data. (As is common with large datasets, the raw data come in ASCII format that needs to be parsed into rows and columns.) The downside is that downloading directly takes a rather long time: downloading FAM1985ER, FAM1986ER and the index IND2009ER took three and a half hours.
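    Here is a minimal sketch of the survey suggestion, assuming a built panel with hypothetical columns ind.weight and income (look up the correct weight variable for your waves):

    library(survey)
    w85 <- subset(panel, year == 1985)   # one wave of the psidR output
    # ids = ~1 ignores the PSID's clustering; a serious analysis would supply it
    des <- svydesign(ids = ~1, weights = ~ind.weight, data = w85)
    svymean(~income, des)                # design-weighted mean income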
    Hopefully I can get another round of feedback (particularly from a Windows user: working on a unix system, I could not test that all the paths are constructed correctly on Windows) before submitting to CRAN.



  6. I got intrigued by the numbers presented in this news article about the re-trial in the Amanda Knox case. The defendants, accused and initially convicted of murder, were acquitted on appeal when the judge ruled that the forensic evidence was insufficiently conclusive. The appeals judge ignored the forensic scientist's advice to retest a DNA sample, because
    "The sum of the two results, both unreliable… cannot give a reliable result," he wrote.
    Now the acquittal has been overturned by Italy's highest court and there is a re-trial. As the example illustrates, considerably more information can be gained from a second test.

    In this hands-on tutorial I reproduce the calculations behind the numbers in the news article's example. (Make sure to click "hide toolbar" to be able to see the last lines.)

    You can do it yourself with pen and paper. I used RStudio to write the example. Have a look at the source file (faircoins.Rmd) on my github.
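    The gist, in a few lines of R with illustrative numbers (not the ones from the article): suppose a test points to the truth with probability 0.6 and we start from a 50/50 prior.

    posterior <- function(prior, acc) {
      # Bayes' rule for a binary test that is correct with probability 'acc'
      prior * acc / (prior * acc + (1 - prior) * (1 - acc))
    }
    p1 <- posterior(0.5, 0.6)   # after one positive test: 0.60
    p2 <- posterior(p1, 0.6)    # after a second, independent, positive test: ~0.69

    Each additional independent test moves the posterior further from the prior, which is why a second test of an "unreliable" sample is far from worthless.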

  7. I just finished reading the extraordinary book Tomorrow's Table by Pamela Ronald and Raoul Adamchak (I linked to Ronald's blog). In this post I want to quickly redo a calculation Adamchak does on page 16, where he explains to his students how much energy is required to produce the fertilizer used to grow one acre of corn with conventional agriculture (as opposed to organic methods).

    Energy required to produce fertilizer for 1 acre of corn

    According to economists at the US Department of Agriculture, it takes about 133 pounds of nitrogen fertilizer to grow 1 acre of corn. It also takes a host of other fertilizers, but let's focus on nitrogen for now. They are all synthetically produced.
    1. To produce 1 pound of nitrogen takes an energy input of 24,500 British Thermal Units (BTU), which is just a measure of energy (like joules, or calories).
    2. So the energy required to provide fertilizer for one acre of corn is F = 133 x 24,500 BTU = 3,258,500 BTU per acre.
    Now, to make sense of that number, wikipedia tells us that standard gasoline has 114,000 BTU per gallon. To bring this to an end, divide the energy F by 114,000 BTU per gallon to get F in terms of gallons of gas:
    F / 114,000 = 3,258,500 / 114,000 = 28.5 gallons of gas
    for 1 acre of corn. Keep in mind that this is just to produce the fertilizer. (At 1 gallon = 4.54 litres, that is about 130 litres of gasoline.)
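    The same arithmetic in a couple of lines of R:

    lbs.nitrogen <- 133      # pounds of nitrogen fertilizer per acre of corn
    btu.per.lb   <- 24500    # BTU needed to produce one pound of nitrogen
    btu.per.gal  <- 114000   # BTU in a gallon of gasoline
    f <- lbs.nitrogen * btu.per.lb   # 3,258,500 BTU per acre
    f / btu.per.gal                  # about 28.6 gallons of gas
    f / btu.per.gal * 4.54           # about 130 litres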

    In General


    I think the debate about genetically modified food is fundamental, and this book brings a lot of knowledge to the table. I can highly recommend the book:

    • to anyone who cares about how we will go about nurturing 9 billion people in a sustainable way;
    • to anyone who has strong feelings about GM food but can't really give a good reason for their opposition (that's my category);
    • and finally, to everybody who thinks they have very good reasons to oppose GM. One has to periodically fact-check one's system of beliefs, or run the risk of ending up like Mark Lynas.

  8. Economists frequently use public datasets. A prominent one is the Panel Study of Income Dynamics, short PSID, maintained by the Institute for Social Research at the University of Michigan.

    Here I introduce psidR, a small helper package for R that makes constructing panels from the PSID a bit easier.

    One potential difficulty with the PSID is constructing a longitudinal dataset, i.e. one where individuals are followed over several survey waves. There are several solutions.

    1. In the so-called data center, users can use drill-down menus to select the relevant variables from each wave. There is a subsetting mechanism if only part of the sample is needed (e.g. only household heads younger than 55). As the required dataset gets larger, this becomes unwieldy: the interface gets slower and slower, and the clicking procedure is rather error-prone. The main motivation for this package is that I've spent too many hours clicking on cryptic variable names, only to realize after I was done that I had forgotten a variable. Unacceptable.
    2. Users may download the data and merge the annual interview files to obtain the desired panel. Though conceptually not very difficult (there is an individual index file, which links individuals across years), it is a cumbersome accounting exercise to find the right variable names in each year and do the right merges.
    3. One can use psidR. The main function is inspired by the Stata add-on package psiduse. Here is the function's signature.
    build.panel(datadir, fam.vars, ind.vars = NULL, fam.files = NULL, ind.file = NULL, heads.only = TRUE, core = TRUE, design = "balanced", verbose = FALSE)
    • There is a default behaviour where the user only points to a data directory; otherwise, one can specify custom locations for the family files and the individual index.
    • You can supply the PSID data as Stata files or CSV files.
    • The user has to supply a data.frame fam.vars which lists the variable names for all required waves.
    • It is possible to tell the function that a certain variable is missing in a given year (without the variable getting dropped, so you can impute it later on).
    • One can subset the data to household heads only.
    • There is a switch to get only the core sample.
    • There are 3 different sample designs to choose from: balanced panel (all individuals must be present in all waves), k-period panel (individuals must be present for at least k periods) and unbalanced (everyone included).
    • With verbose=TRUE, the function prints comments as it goes along.
    Memory could be an issue: the dataset is quite big. I use data.table to keep things manageable, but it's hard to get around a data.table of 628MB, which is the size of the individual index file. The verbose option prints the memory load at various points, so you may be able to intervene and throw out some things if you hit a limit.
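    To give an idea of the usage, here is a minimal sketch of a call; the variable names in fam.vars are hypothetical placeholders (look the real ones up in the PSID codebook):

    library(psidR)
    famvars <- data.frame(year   = c(1985, 1986),
                          income = c("Money85", "Money86"),
                          house  = c("ER10001", "ER20001"))
    d <- build.panel(datadir    = "~/datasets/psid",
                     fam.vars   = famvars,
                     heads.only = TRUE,
                     design     = "balanced",
                     verbose    = TRUE)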

  9. I was recently asked by a friend whether it's worth buying a house in the UK. That is, assuming they could put down the money, whether buying beats renting. Apart from obvious things like the expected length of stay in one place, the interest on mortgages, how prices might develop and so forth, they were particularly interested in the transaction costs they were likely to face: fees, taxes and so on. In case anyone is chasing the same information, here goes:

    Running Costs

    So, costs. I pay the following on an annual basis, given that I am a so-called leaseholder (there's a guy called the "freeholder", who kind of manages the building):

    • ground rent: £50
    • service charge: £50
    • buildings insurance: £350

    Plus, for anything that needs fixing, the freeholder takes a 15% cut (we had to repair the roof for £3,000; he took 0.15 x 3,000 = £450 for nothing). Notice that the same kind of arrangement does not exist everywhere in the UK, and that sometimes you can buy a share of the freehold (you would then decide together with the other parties in the house whether and how to fix that roof).

    My Transaction Costs
    Assume that my house cost something in the range of £150,000 to £250,000.
    • stamp duty: a percentage of the purchase price, e.g. you pay 1% on values between £125,000 and £250,000, 3% from £250,000 to £500,000, and so on. Here's the schedule.
    • HMLR fee: £280
    • landlord's fee: £115
    • insurance: £154
    • property lawyer: £980

    So the biggest chunk is the stamp duty, i.e. a tax. We didn't pay anything for estate agents, as that is covered by the seller.
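    For illustration, the slab schedule described above as a small R function (the 4% top rate is my assumption for prices above £500,000; check the current schedule):

    stamp.duty <- function(price) {
      # the "slab" system described above: the rate applies to the whole price
      rate <- if (price <= 125000) 0
              else if (price <= 250000) 0.01
              else if (price <= 500000) 0.03
              else 0.04   # assumed top rate, for illustration only
      price * rate
    }
    stamp.duty(200000)   # 2000: by far the biggest single item above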

  10. So Paul Krugman laments in his post that policy makers across Europe have blindly signed up to the "austerity only" ticket. He cites some evidence which I find fairly convincing. I just want to raise the point that what he says cannot be used as a critique of the Monti government.

    Basically, what he's saying is that Monti was installed as a puppet of European creditor nations to make sure that austerity would be imposed and that the country's government debt would continue to be serviced. They put in Monti to get their money back. The facts, however, Krugman says, are that austerity policies don't work. In fact, the linked vox.eu article shows research indicating that financial markets panicked, forcing southern Europeans into austerity. Had the ECB acted sooner, markets would have calmed earlier and austerity would not have been needed. I find this research quite convincing (that may be a result of the extreme scarcity of data-based research on this subject, combined with the extreme abundance of ideological brainwash).

    To come back to my point: you have to give Monti the benefit that he was called in to put out a fire, as illustrated in the bottom graph on this page:

    http://countryeconomy.com/risk-premium/italy

    It illustrates how much more the Italian government had to pay than Germany to finance its debt on the eve of Berlusconi's exit in November 2011. The country was actually on the verge of bankruptcy, as a result of a financial market panic, not of a change in fundamentals. The spread was widely cited as unsustainable. This is not the same as George Osborne talking about Britain being in danger of potential bankruptcy and the continued need for austerity, a belief which is not at all widely held. Krugman's critique applies to Britain, I would say.
    Had Monti not been brought in in November 2011, all sorts of hell would have broken loose: default and the dismemberment of the euro, for example. Big costs all around.

    The Krugman critique applies to policy makers in creditor countries. It applies to Monti only if you believe that he had any leverage at the ECB and could have forced them into guaranteeing Italian (and other southern European) bonds earlier. Which is unlikely at best.
